Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach +

Abdullah Al-Hamdani, Gultekin Ozsoyoglu
Electrical Engineering and Computer Science Dept, Case Western Reserve University, Cleveland, Ohio
(abd, tekin)@eecs.cwru.edu

Abstract. This paper discusses algorithms for topic selection queries, which are designed to query a database containing metadata about web information resources. The metadata database contains topics and relationships among topics, called metalinks. Topics in the database have associated importance scores. The topic selection operator TSelection selects, within time T, topics that satisfy a given selection formula and have output importance scores above a given threshold value or in the top-k. The selection formula contains expensive predicates in the form of user-defined functions. To minimize the number of expensive predicate evaluations (probes) in the TSelection algorithm, we introduce and evaluate three heuristics. Also, due to the time constraint T, the TSelection algorithm may terminate without locating all output tuples. To maximize the number of output tuples found, we introduce and evaluate three heuristics that locate the tuple to evaluate at a given time.

1 Introduction

Search engines such as Yahoo! use topics and topic hierarchies, extracted as metadata from the web, in order to allow for keyword-based searches over the entire web. We propose (i) restricting the scope of metadata extraction to specific web information resources such as the ACM Digital Library [1], and (ii) extending the metadata extraction process to include automatic extraction of topics and relationships among topics (called metalinks in this paper). Such metadata is stored in a database and employed for ad hoc querying of web information resources [4, 5].

Data Model.
To model the metadata extracted from a web information resource, we have recently used [3, 4, 5] a topic maps-based [9] data model with topic entities (a keyword or a phrase) and metalinks. Examples of topics for the DBLP Bibliography [6] and the ACM SIGMOD Anthology [7] are T1: query optimization (a phrase), T2: database dependency theory (a phrase), and T3: The Interaction between Functional Dependencies and Template Dependencies (the title of a paper by Sadri and Ullman [8] in the ACM SIGMOD Anthology). Topics and metalinks have associated importance scores, which may be obtained using data mining techniques (e.g., association rule-based mining), derived by information retrieval techniques (e.g., the vector space model [10] and cosine similarities) [14], or obtained by other means (e.g., [12]). Each topic has one or more topic sources; for example, the pdf file for the paper with title T3 in the ACM SIGMOD Anthology constitutes a topic source for both topics T2 and T3.

+ This research is supported by the National Science Foundation grants INT and DBI

Note that topics and metalinks are metadata (e.g., information about the web resource),

whereas topic sources constitute data. Maintaining topics and metalinks as metadata in a database allows for ad hoc queries to locate relevant topic sources. Normally, the number of topics satisfying a user request is quite large. Therefore, an algebra-based way of ranking topics and returning only a small number of highly important topics (and their topic sources) is needed. Towards this goal, we have proposed [4] a sideway value algebra (SVA) for object-relational databases. This paper investigates the evaluation of one SVA operator for web computing, namely, the SVA selection operator that implements topic selection queries.

Topic Selection (TS) Queries. Given a database of topics, metalinks, and topic source URLs, a TS query takes
(i) a single topic relation R, with each tuple t containing information about topics and having an importance score Imp(R(t)),
(ii) a (propositional calculus) selection formula C with predicates on topic similarity,
(iii) an output importance score computation function $f_{out}(t)$,
(iv) a query stopping condition β in terms of top-k importance-scored output tuples or output tuples whose importance scores are above a given threshold, and
(v) a query response time limit T.
The TS query returns, within time T, those tuples t of R that satisfy the selection formula C and whose output importance scores, computed as $f_{out}(t)$, satisfy β.

Example 1 (TS query). Consider the web resources DBLP Bibliography and the ACM SIGMOD Anthology, and the associated metadata database. Assume that the relation RelatedToPapers, extracted from DBLP and Anthology, has the schema RelatedToPapers(Pid1, Title1, Abstract1, Pid2, Title2, Abstract2, ...), where the importance score of each RelatedToPapers tuple t can be obtained through the function Imp(RelatedToPapers(t)). Topic, metalink, and topic source importance scores are reals in the range [0, 1].
The paper type is a specialization of the topic type, and a paper (instance) is an object/entity with an object-id (i.e., Pid). A user is interested in selecting, within 2 minutes, the top 10 papers in DBLP and Anthology that are related to the paper [8] by Sadri and Ullman (say, with Pid of p23) with a RelatedToPapers importance score of 0.8 or above, where the selected papers have either a title similarity to p23 of 0.9 or above, or an abstract similarity to p23 of 0.95 or above. Such a query can be expressed using an SQL-like syntax as:

    select *
    from RelatedToPapers RT
    where RT.Pid1 = p23 and Imp(RT) >= 0.8 and
          (Sim(RT.Title1, RT.Title2) >= 0.9 or Sim(RT.Abstract1, RT.Abstract2) >= 0.95)
    propagate importance within selection as min for conjunctions and avg for disjunctions
    stop after 10 most important and within time 2 min

where Sim() denotes a similarity function. The propagate importance clause defines the output tuple importance score computation function $f_{out}$ which, in this case, is defined as

$$f_{out}(t) = Imp(RT(t)) \cdot AVG[\,Sim(RT.Abstract1, RT.Abstract2),\ Sim(RT.Title1, RT.Title2)\,]$$

Issues addressed in this paper. We investigate two query processing issues for TS queries:

(a) Expensive metalink importance score computation. Some metalink types, such as RelatedTo, are expensive [13] in that their importance score computations are time-consuming. We call such functions expensive functions, and any predicate that contains an expensive function an expensive predicate. As an example, the importance score of the metalink type RelatedToPapers, given papers $pid_1$ and $pid_2$, can be computed [14] by the function

$$
\begin{aligned}
Imp(RelatedToPapers(pid_1, pid_2)) ={} & w_{Title} \cdot Sim_{Title}(pid_1.\text{title}, pid_2.\text{title}) \\
& + w_{Authors} \cdot Sim_{Authors}(pid_1.\text{authors}, pid_2.\text{authors}) \\
& + w_{Abstract} \cdot Sim_{Abstract}(pid_1.\text{abstract}, pid_2.\text{abstract}) \\
& + w_{IndexTerms} \cdot Sim_{IndexTerms}(pid_1.\text{index-terms}, pid_2.\text{index-terms}) \\
& + w_{Body} \cdot Sim_{Body}(pid_1.\text{body}, pid_2.\text{body}) \\
& + w_{References} \cdot Sim_{References}(pid_1.\text{references}, pid_2.\text{references})
\end{aligned}
\tag{1}
$$

where $Sim_{Title}$, $Sim_{Authors}$, $Sim_{Abstract}$, $Sim_{IndexTerms}$, $Sim_{Body}$, and $Sim_{References}$ denote similarity functions for pairs of paper titles, authors, abstracts, index terms, paper bodies, and references, respectively, and the w terms are weights with the constraint $w_{Title} + w_{IndexTerms} + w_{Authors} + w_{Body} + w_{References} + w_{Abstract} = 1$. We refer to a metalink importance score evaluation as a probe. Clearly, a RelatedToPapers probe (i.e., the computation of $Imp(RelatedToPapers(pid_1, pid_2))$ in equation (1)) is expensive, and the system must attempt to minimize the number of such probes. We assume that (i) all probes of a given metalink type have the same cost, and (ii) there is a total ordering of the probe costs of different metalink types. We also assume that topic importance scores are not expensive to compute: they are computed a priori (using pre-collected topic source data) and maintained in the metadata database.

(b) Time constraints. TS query computation times can be too high. Therefore, such queries may have a time constraint clause of the form, say, Time = 2 minutes. Time constraints can be transformed into constraints on the number of expensive predicate evaluations (i.e., probes). Time-constrained query evaluation algorithms must be correct; i.e., given a top-k query with a time constraint, all output tuples must be in the top-k; and, for a threshold query with a threshold τ and a time constraint, the importance scores of output tuples must be greater than τ.
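As an illustration of why a probe of equation (1) is costly, the weighted sum can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the computation of [14]: the weight values, the field names, and the token-overlap function `token_jaccard` are hypothetical stand-ins chosen only so the example runs; the paper fixes only the constraint that the weights sum to 1.

```python
from typing import Callable, Dict

# Hypothetical weights for the six fields of equation (1); they must sum to 1.
WEIGHTS: Dict[str, float] = {
    "title": 0.25, "authors": 0.10, "abstract": 0.25,
    "index_terms": 0.10, "body": 0.20, "references": 0.10,
}


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT one of the paper's Sim functions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def related_to_papers_imp(
    p1: Dict[str, str],
    p2: Dict[str, str],
    weights: Dict[str, float] = WEIGHTS,
    sim: Callable[[str, str], float] = token_jaccard,
) -> float:
    """One probe: the weighted sum of per-field similarities, as in equation (1)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Every field is compared, so the cost grows with the size of body/references.
    return sum(w * sim(p1[field], p2[field]) for field, w in weights.items())
```

Each call touches all six fields of both papers, including the full body text, which is the reason the algorithms that follow try to avoid probes whose result cannot affect the output.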
In the ideal case, a TS query evaluation performs only the probes needed for the output tuples, i.e., the positive probes. In general, some probes will not contribute towards an output tuple, resulting in wasted time. In such cases, there can be multiple goals, such as maximizing the number of highest importance-scored output tuples, or maximizing the number of output tuples satisfying an importance score threshold. We present different heuristics and their evaluations for different goals.

Contributions. We discuss TS algorithms for evaluating the SVA Topic Selection operator. The algorithms are pipelined: they continuously generate output, and attempt to maximize the number of positive probes they make. In section 2, we present the top-k and threshold-based TS algorithms. Section 3 briefly presents the experimental evaluations. Section 4 concludes.

2 Top-k and Threshold-based Topic Selection Algorithms

In this section, we use an operator-based specification of the topic selection query (instead of the SQL-like syntax) as $\sigma^{*}_{C,\, f_{out},\, \beta,\, T}(R)$, where each tuple r of the relation R has an input importance score $Imp_{in}(r)$, C is the selection condition with only expensive predicates, $f_{out}()$ is the output importance score function, β is the output threshold, which is either a positive integer k as the ranking threshold, a real-valued importance score threshold $V_t$ in the range [0, 1], or the two-tuple $(k, V_t)$, and T is the

time constraint. The operator $\sigma^{*}$ returns, in decreasing order of output importance scores and within time T, either
(i) the top k $f_{out}$-ranking output tuples that satisfy the selection condition C (when β is k), or
(ii) all tuples of R that have an $f_{out}$-importance score greater than $V_t$ and satisfy the selection condition C (when β is $V_t$), or
(iii) the top k $f_{out}$-ranking output tuples that satisfy the selection condition C and have an $f_{out}$-importance score greater than $V_t$ (when β is the two-tuple $(k, V_t)$).
When the time constraint T is not sufficient to get the answer, the selection query evaluation becomes a best-effort evaluation. We assume that the input relation R is sorted in decreasing order by the input importance scores $Imp_{in}$ of its tuples.

Example 2. In the selection query of example 1, after eliminating inexpensive predicates, we have C as

$$C = Imp(RT) \ge 0.8 \wedge [\,Sim(RT.Title1, RT.Title2) \ge 0.9 \vee Sim(RT.Abstract1, RT.Abstract2) \ge 0.95\,] = C_1 \wedge (C_2 \vee C_3) = (C_1 \wedge C_2) \vee (C_1 \wedge C_3) = term_1 \vee term_2.$$

Using the importance score functions Min and Avg as specified in the query, we have, for a given tuple r,

$$Imp(C, r) = AVG(Imp(term_1, r), Imp(term_2, r)) = AVG(MIN(Imp(C_1, r), Imp(C_2, r)),\ MIN(Imp(C_1, r), Imp(C_3, r))).$$

If a predicate $C_i$ in a given term t is false (i.e., $Imp(C_i, r) = 0$) then term t is false (i.e., $Imp(t, r) = 0$). Therefore, we do not need to evaluate the unevaluated predicates in term t. Note that, in our environment, the expensive predicates are either importance score computation functions (e.g., Imp(RelatedToPapers( ))) or similarity computation predicates (e.g., Sim( ) > 0.9). Without loss of generality, we assume that the selection condition C is in disjunctive normal form, $C = \bigvee_i term_i$, where $term_i = \bigwedge_j C_j$ and $C_j$ is an atomic expensive predicate. The output importance score $f_{out}(r)$ for a given tuple r is computed as $f_{out}(r) = Imp_{in}(r) \cdot Imp(C, r) = Imp_{in}(r) \cdot g_1(g_2(term_1, r), g_2(term_2, r), \ldots)$
where $g_1()$ and $g_2()$ are monotone functions that are used to incorporate the effects of the disjunctions and the conjunctions, respectively, on the output importance score computation. When $g_1()$ and $g_2()$ are monotone, the decision that a given tuple is not in the top-k or does not satisfy the threshold value $V_t$ can be derived without evaluating all of its atomic expensive predicates. The monotonicity of a given function is defined below.

Definition 1 (Monotone Function). A given function g( ) with n parameters is monotone iff it satisfies the following two conditions:
(a) if $a_i \le b_i$ for all $1 \le i \le n$ then $g(a_1, \ldots, a_n) \le g(b_1, \ldots, b_n)$, and
(b) $g(a_1, \ldots, a_n) = 1.0$ iff $a_i = 1.0$ for all $1 \le i \le n$.

Some examples of monotone functions are $PRODUCT(a_1, \ldots, a_n) = \prod_{i=1}^{n} a_i$, $MIN(a_1, \ldots, a_n) = a_i$ where $a_i \le a_j$ for all $1 \le j \le n$, and $AVG(a_1, \ldots, a_n) = (a_1 + a_2 + \cdots + a_n)/n$. In this paper, we assume that the $g_1()$ and $g_2()$ functions are monotone.

2.1 Fixed Order Probe-Optimal Topic Selection Algorithm

Chang and Hwang [2] have proposed the MPro algorithm to evaluate the top-k Selection operator with only conjunctive expensive predicates, using only one monotone function $F(x, p_1, p_2, \ldots, p_n)$, where x is a pre-computed inexpensive predicate (i.e., with zero cost) and $p_1, p_2, \ldots, p_n$ are expensive predicates. They have proven that the MPro algorithm is probe-optimal, assuming that there is a pre-defined and fixed evaluation ordering of the predicates. The problem with the MPro algorithm is that

usually there is no fixed optimal predicate evaluation order, and the best predicate evaluation order changes dynamically with respect to the tuple being evaluated.

In the rest of this section, assuming a fixed pre-defined predicate evaluation ordering, we adapt and extend the minimal probing algorithm MPro in order to evaluate the top-k or threshold-based topic selection operator with two evaluation functions $g_1$ and $g_2$. Then, in section 2.3, we eliminate the fixed pre-defined predicate ordering assumption, and revise the algorithms to choose dynamically the predicate to evaluate. First, we define the evaluation cost of a given expensive predicate.

Definition 2 (Expected Evaluation Cost). Let $Cost(C_i)$ be the expected evaluation (time) cost of $C_i$, where $C_i$, $1 \le i \le n$, is a conjunct in a selection formula C that is in disjunctive normal form. Then the expected evaluation cost of C is defined as

$$Cost(C) = \sum_{i=1}^{n} Cost(C_i).$$

Assume that, at a given time, the predicates $C_1$ to $C_{current}$, $1 \le current \le n$, have been evaluated using Imp( ), here referred to as EvaluatedImp( ).

Definition 3 (Unevaluated Predicate Cost). Let $C_1, C_2, C_3, \ldots, C_n$ be the pre-defined evaluation ordering for computing $f_{out}$ for a topic selection operator on relation R, and let $Cost(C_i)$ be the expected evaluation cost of a given predicate $C_i$. The unevaluated predicate cost UCost(r) for a given tuple r in relation R, after computing the predicate $C_{current}$ with tuple r, is defined as

$$UCost(r) = \sum_{i=current+1}^{n} Cost(C_i).$$

Definition 4 ($Imp(C_j, r)$). Let $C_1, C_2, C_3, \ldots, C_n$ be the pre-defined evaluation ordering for computing $f_{out}$ for a topic selection operator on relation R. The importance score for the j-th expensive predicate $C_j$, after computing the predicate $C_{current}$ with tuple r in relation R, is defined as

$$Imp(C_j, r) = \begin{cases} EvaluatedImp(C_j, r) & \text{if } j \le current \\ 1 & \text{otherwise} \end{cases}$$

When j > current, we refer to $C_j$ as an unevaluated expensive predicate; otherwise it is an evaluated expensive predicate.
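Definitions 3 and 4 can be illustrated with a short Python sketch. The function names `ucost` and `optimistic_scores`, and the use of `None` to mark an unevaluated predicate, are assumptions made for this example only; they do not appear in the paper.

```python
from typing import List, Optional


def ucost(costs: List[float], current: int) -> float:
    """UCost(r): summed cost of C_{current+1}..C_n under the fixed ordering.

    `costs[i]` holds Cost(C_{i+1}); `current` is the number of predicates
    already evaluated for the tuple (0 means none, len(costs) means all).
    """
    return sum(costs[current:])


def optimistic_scores(evaluated: List[Optional[float]]) -> List[float]:
    """Imp(C_j, r): the probed score if available, otherwise the upper bound 1.0."""
    return [1.0 if score is None else score for score in evaluated]
```

For instance, with the predicate costs used later in section 3 (0.5, 0.3, 1.0, 0.4, and 0.1 seconds) and two predicates already evaluated, `ucost` sums the last three costs. Substituting 1.0 for every unevaluated predicate is what makes the current score an upper bound on the final score when $g_1$ and $g_2$ are monotone.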
Definition 5 (Current Output Importance Score). Consider a topic selection operator with the selection condition $C = \bigvee_{i=1}^{m} term_i$ and $term_i = \bigwedge_j C_j$. The current output importance score of a tuple r on a relation R using the topic selection operator, after evaluating the expensive predicate $C_{current}$, is

$$Imp_{current}(r) = Imp_{in}(r) \cdot g_1(g_2(term_1, r), g_2(term_2, r), \ldots, g_2(term_m, r)),$$

where $g_1()$ and $g_2()$ are monotone functions, n is the number of atomic selection predicates, $1 \le current \le n$, m is the number of terms, $g_2(term_i, r) = g_2(Imp(C_j, r), Imp(C_{j+1}, r), \ldots)$, and $Imp(C_j, r)$ is as defined in definition 4.

Proofs of all lemmas and theorems are presented in [11].

Lemma 1. (a) For $1 \le current < n$, $f_{out}(r) \le Imp_{current}(r)$. (b) When current = n, $f_{out}(r) = Imp_{current}(r)$.

For the threshold-based TSelection algorithm, lemma 2 states the early termination criterion for a tuple to be dropped from the output.

Lemma 2. Assume that the threshold-based topic selection operator is to be applied to tuple r in relation R with atomic expensive predicates $C_1, C_2, \ldots, C_n$. During the

evaluation of expensive predicates, if $Imp_{current}(r)$ becomes less than the threshold value $V_t$ then the tuple r cannot be in the output.

Note that, in comparison, the top-k based TSelection algorithm does not have this early termination criterion. A tuple r can be dropped from the output only if $Imp_{current}(r)$ is less than the importance score $f_{out}$ of k fully evaluated tuples.

Fig. 1 illustrates the threshold-based topic selection algorithm, Threshold-TSelection. In each iteration, the algorithm finds a tuple r for evaluation using the LocateTuple( ) function, and finds an unevaluated predicate $C_j$ of tuple r using the LocatePredicate( ) function. It evaluates the predicate $C_j$ for tuple r and computes $Imp_{current}(r)$ using the $g_1()$ and $g_2()$ functions. If $Imp_{current}(r)$ is less than the threshold value $V_t$ then tuple r is discarded. If all predicates are evaluated for tuple r and $Imp_{current}(r) \ge V_t$ then tuple r is added to the output. The algorithm stops when all output tuples are found or when the time T runs out. We assume in this section that the LocatePredicate( ) function locates the next predicate to evaluate by using the same pre-defined predicate evaluation ordering for all tuples.

    Algorithm Threshold-TSelection(V_t, R, C, g_1, g_2, T)
      For each tuple r in R do {
        Imp_current(r) = Imp_in(r);
        Set Imp(C_i, r) = 1.0 for all expensive predicates C_i;
        if (Imp_in(r) >= V_t) then add r into PossibleOutput;
      }
      While (PossibleOutput is not empty) and (time T is sufficient) do {
        r = LocateTuple(PossibleOutput);
        C_j = LocatePredicate(r);   // C_j is an unevaluated expensive predicate
        Imp(C_j, r) = Probe(C_j);
        Compute Imp_current(r) using g_1 and g_2;
        if (Imp_current(r) < V_t) then remove r from PossibleOutput
        else if (all predicates of r are evaluated) then move r from PossibleOutput into Output;
      }

    Fig. 1: Threshold-TSelection Algorithm

Theorem 1. The Threshold-TSelection algorithm has no false drops; i.e., it does not output tuples that are not in the output of the TSelection operator.
And, when T is sufficiently large to evaluate all tuples in PossibleOutput, the algorithm has no false dismissals; i.e., it outputs all tuples that are in the output of the TSelection operator.

Fig. 2 illustrates the top-k topic selection algorithm, Top-k-TSelection.

    Algorithm Top-k-TSelection(k, R, C, g_1, g_2, T)
      For each tuple r in R do {
        Set Imp(C_i, r) = 1.0 for all expensive predicates C_i;
        Imp_current(r) = Imp_in(r);
        Add r into PossibleOutput;
      }
      While (|Output| < k) and (time T is sufficient) do {
        Let CurrentTopK be the (k - |Output|) tuples in PossibleOutput with the highest Imp_current;
        r = LocateTuple(CurrentTopK);
        if (there exists an unevaluated expensive predicate in tuple r) then {
          C_j = LocatePredicate(r);   // C_j is an unevaluated expensive predicate
          Imp(C_j, r) = Probe(C_j);
          Compute Imp_current(r) using g_1 and g_2;
        }
        if (all of r's predicates are evaluated) and (r is among the current top-k tuples)
          then move r from PossibleOutput into Output;
      }

    Fig. 2: Top-k-TSelection Algorithm
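The threshold-based loop of Fig. 1 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the time limit T is replaced by a probe budget `max_probes` (the paper notes that time constraints can be transformed into probe-count constraints), LocateTuple picks the tuple with the highest current score, LocatePredicate follows the fixed predicate order, and the `probe` callback and all other names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def avg(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def imp_current(imp_in: float, scores: List[Optional[float]],
                terms: Sequence[Sequence[int]], g1=avg, g2=min) -> float:
    """Imp_current(r) of definition 5; unevaluated predicates (None) count as 1.0."""
    filled = [1.0 if s is None else s for s in scores]
    return imp_in * g1([g2([filled[j] for j in term]) for term in terms])


def threshold_tselection(rows: Dict[str, float],
                         terms: Sequence[Sequence[int]],
                         probe: Callable[[str, int], float],
                         v_t: float, max_probes: int,
                         g1=avg, g2=min) -> List[Tuple[str, float]]:
    """rows maps a tuple id to Imp_in; terms lists predicate indices per DNF term."""
    n = 1 + max(j for term in terms for j in term)
    scores: Dict[str, List[Optional[float]]] = {r: [None] * n for r in rows}
    # PossibleOutput, keyed by tuple id, holding the current upper-bound score.
    possible = {r: imp for r, imp in rows.items() if imp >= v_t}
    output: List[Tuple[str, float]] = []
    probes = 0
    while possible and probes < max_probes:
        r = max(possible, key=possible.get)    # LocateTuple: highest Imp_current
        j = scores[r].index(None)              # LocatePredicate: fixed order
        scores[r][j] = probe(r, j)             # one expensive probe
        probes += 1
        possible[r] = imp_current(rows[r], scores[r], terms, g1, g2)
        if possible[r] < v_t:
            del possible[r]                    # lemma 2: r cannot be in the output
        elif None not in scores[r]:
            output.append((r, possible.pop(r)))
    return sorted(output, key=lambda pair: -pair[1])
```

Because unevaluated predicates are filled with 1.0, a tuple is discarded as soon as its upper bound falls below $V_t$, which is where the probe savings come from. The sketch does not skip the remaining predicates of a term that is already false; adding that check would save further probes.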

Theorem 2. The Top-k-TSelection algorithm has no false drops. And, when T is sufficiently large to evaluate all tuples in PossibleOutput, the algorithm has no false dismissals.

2.2 Time-Constrained Query Evaluation Heuristics

Due to the time constraint T, the TSelection algorithm is terminated at time T, possibly before locating all output tuples that satisfy the given threshold value $V_t$. There are multiple possible query evaluation goals:

(1) Maximize the number of highest importance-scored output tuples.
MaxImpLT (Locate Tuple) Heuristic: Locate the tuple r with the highest $Imp_{current}()$ from PossibleOutput.

(2) Maximize the number of higher (but not the highest) importance-scored output tuples with lower unevaluated predicate costs.
HigherImpLT Heuristic: Locate the tuple r with the highest $Imp_{current}() / UCost()$ from PossibleOutput.

(3) Maximize the size of the query output.
MaxSizeLT Heuristic: Locate the tuple r with the lowest UCost( ) from PossibleOutput. If two or more tuples have the lowest UCost( ) then locate among them the tuple with the highest $Imp_{current}()$.

In section 3.2, we present comparative evaluations of the three heuristics.

2.3 Dynamic Predicate Evaluation Order Heuristics

Next, we discuss heuristics for the LocatePredicate( ) function. The best predicate evaluation order may change with respect to the tuple chosen with LocateTuple( ) while evaluating the expensive predicates. For example, assume that the importance score of a given evaluated expensive predicate $C_i$ is very small (say, $Imp(C_i) = 0.05$) for a given tuple r. Then the remaining unevaluated predicates in the same conjunctive term as $C_i$ are likely to have a very small effect on the overall output importance score $Imp_{current}(r)$. Therefore, after computing $Imp(C_i) = 0.05$ for r, it is better to evaluate an expensive predicate in another term. Thus, the best order of the expensive predicates for a given tuple should be computed dynamically.
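The three LocateTuple heuristics of section 2.2 differ only in the key used to rank the candidates, which a short Python sketch makes explicit. The representation of PossibleOutput as a dictionary from a tuple id to a pair ($Imp_{current}$, UCost) is an assumption of this example, as is the precondition that every candidate still has at least one unevaluated predicate (so UCost > 0).

```python
from typing import Dict, Tuple

# Hypothetical candidate set: tuple id -> (Imp_current, UCost), with UCost > 0.
Candidates = Dict[str, Tuple[float, float]]


def max_imp_lt(c: Candidates) -> str:
    """MaxImpLT: the tuple with the highest current importance score."""
    return max(c, key=lambda r: c[r][0])


def higher_imp_lt(c: Candidates) -> str:
    """HigherImpLT: the tuple with the highest Imp_current / UCost ratio."""
    return max(c, key=lambda r: c[r][0] / c[r][1])


def max_size_lt(c: Candidates) -> str:
    """MaxSizeLT: lowest UCost, ties broken by the highest Imp_current."""
    return min(c, key=lambda r: (c[r][1], -c[r][0]))
```

On a candidate set with one high-scoring but costly tuple, one mid-scoring cheap tuple, and one low-scoring very cheap tuple, the three functions can each return a different tuple, which is the trade-off the three goals describe.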
Next, we define the heuristic (Predicate-Selection-by-)D(ynamic-)E(ffect-On-)C(ost)1, which locates the next expensive predicate to be evaluated in a dynamic manner for a given tuple r.

DEC1-LP (LocatePredicate) Heuristic: Assume that, at a given time, the predicates $C_1, C_2, \ldots, C_{current}$ in the selection condition C are evaluated for a given tuple r in a given relation R. We choose the unevaluated expensive predicate $C_i$ with the highest Max-Effect-On-Cost as the next predicate in the dynamic evaluation order. The Max-Effect-On-Cost($C_i$) for each unevaluated predicate $C_i$ is computed as follows. Let $Imp_{min}(C_i, r)$ be computed as $Imp_{current}(r)$ after assigning $Imp(C_i, r) = 0$ and $Imp(C_j, r) = 1$ for unevaluated predicates $C_j$, $j \ne i$. Then

$$\text{Max-Effect-On-Cost}(C_i, r) = [\,Imp_{current}(r) - Imp_{min}(C_i, r)\,] / Cost(C_i).$$

It is easy to prove that if the importance scores of the predicates of a given selection condition C are uniformly distributed and independent of each other then the DEC1 heuristic gives the optimum order of predicate evaluation for a given tuple r. If the importance scores of predicates are not uniformly distributed or their distribution is not known then we may take a small sample of the tuples from relation R.
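The DEC1-LP choice can be sketched in Python as follows, assuming AVG for $g_1$ and MIN for $g_2$ as in the running example. The names `dec1_locate_predicate` and `imp_current`, and the use of `None` for an unevaluated predicate, are hypothetical conventions of this sketch.

```python
from typing import List, Optional, Sequence


def avg(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def imp_current(imp_in: float, scores: List[Optional[float]],
                terms: Sequence[Sequence[int]], g1=avg, g2=min) -> float:
    """Imp_current(r); unevaluated predicates (None) count as 1.0."""
    filled = [1.0 if s is None else s for s in scores]
    return imp_in * g1([g2([filled[j] for j in term]) for term in terms])


def dec1_locate_predicate(imp_in: float, scores: List[Optional[float]],
                          costs: List[float],
                          terms: Sequence[Sequence[int]],
                          g1=avg, g2=min) -> Optional[int]:
    """Index of the unevaluated predicate with the highest Max-Effect-On-Cost."""
    current = imp_current(imp_in, scores, terms, g1, g2)
    best, best_effect = None, -1.0
    for i, score in enumerate(scores):
        if score is not None:
            continue                      # already probed
        trial = list(scores)
        trial[i] = 0.0                    # worst case for C_i, others stay at 1.0
        imp_min = imp_current(imp_in, trial, terms, g1, g2)
        effect = (current - imp_min) / costs[i]
        if effect > best_effect:
            best, best_effect = i, effect
    return best                           # None when every predicate is evaluated
```

With the condition $(C_1 \wedge C_2) \vee (C_3 \wedge C_4 \wedge C_5)$ and the costs 0.5, 0.3, 1.0, 0.4, and 0.1 used in section 3, the sketch first selects the cheapest predicate $C_5$. If $C_5$ then evaluates to 0.05, the remaining predicates of its term can lower the score only slightly, so the next choice moves to the other term and selects $C_2$, matching the intuition given at the end of section 2.3.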

DEC2-LP Heuristic. At a given time, let the predicates $C_1, C_2, \ldots, C_{current}$ be evaluated for a given tuple r in a given relation R. The heuristic chooses the unevaluated expensive predicate $C_i$ with the highest Max-Effect-On-Cost2 as the next predicate in the dynamic evaluation order. The Max-Effect-On-Cost2($C_i$) for each unevaluated predicate $C_i$ is computed as follows. Let $Imp_{min}(C_i, r)$ be computed as $Imp_{current}(r)$ after assigning $Imp(C_i, r) = 0$ and $Imp(C_j, r) = 1$ for unevaluated predicates $C_j$, $j \ne i$. Also, let SampleSel($C_i$) be the selectivity of the importance scores of predicate $C_i$, computed using a sample from relation R. Then

$$\text{Max-Effect-On-Cost2}(C_i, r) = [\,(Imp_{current}(r) - Imp_{min}(C_i, r)) + (1 - SampleSel(C_i))\,] / Cost(C_i).$$

The following heuristic, (Predicate-Selection-by-)D(ynamic-)A(vg-)E(ffect-On-)C(ost), uses the average expected Effect-On-Cost for each unevaluated predicate, and chooses the predicate with the highest Avg-Effect-On-Cost as the next predicate to evaluate.

DAEC-LP Heuristic. Assume that, at a given time, the predicates $C_1, C_2, \ldots, C_{current}$ are evaluated in the selection condition C for a given tuple r in a given relation R. The heuristic chooses the next predicate to evaluate as the predicate $C_i$ with the highest Avg-Effect-On-Cost. The Avg-Effect-On-Cost($C_i$, r) for each unevaluated predicate $C_i$ is computed as follows. Let $Imp_{avg}(C_i, r)$ be computed as $Imp_{current}(r)$ after assigning $Imp(C_i, r)$ = the expected average of $Imp(C_i)$, computed using sampling, and $Imp(C_j, r) = 1$ for unevaluated predicates $C_j$, $j \ne i$. Then

$$\text{Avg-Effect-On-Cost}(C_i, r) = [\,Imp_{current}(r) - Imp_{avg}(C_i, r)\,] / Cost(C_i).$$

3 Experimental Evaluations of TSelection Algorithms

We have implemented the Top-k-TSelection and Threshold-TSelection algorithms and evaluated them using synthetic data. The importance scores for the expensive predicates are generated using uniform and normal distributions. We compute the selectivity of predicates by randomly evaluating the predicates for 1% of the tuples from a relation R.
For each tuple t, we compute its derived importance score Imp(t) using the $g_1()$ and $g_2()$ functions. Let $Imp_{max}$ be the highest Imp value among the tuples of the sample S. The selectivity Sel(P) of a predicate P is established as the number of tuples in S with $Imp(t, P) > Imp_{max}$ divided by the total number of tuples in S.

3.1 Locate-Predicate Heuristics

We compare the performances of the MPro, DEC1, DEC2 and DAEC heuristics in terms of the differences between their evaluation times and the evaluation time of the Best heuristic. For the Best heuristic, we assume that at a given time we know the actual truth values and importance scores of all predicates for a given tuple t. As in [2], the fixed predicate ordering for the MPro heuristic is the descending ordering of the predicates by their ranks, where the rank of a predicate P is (1 - Sel(P)) / Cost(P). We use the selection condition $(C_1 \wedge C_2) \vee (C_3 \wedge C_4 \wedge C_5)$ with the expected evaluation costs $Cost(C_i)$ = {0.5, 0.3, 1.0, 0.4, 0.1} seconds, respectively. That is, $C_1$ takes 0.5 seconds to evaluate, $C_2$ takes 0.3 seconds to evaluate, and so on. The size of the input relation R is 1000 tuples. We compute the derived importance score Imp(t) of a given tuple t using the average function as $g_1$ and the minimum function as $g_2$.

a) Uniform distribution: The importance scores of all expensive predicates have been generated using the uniform distribution. We have observed that the dynamic

predicate evaluation order heuristics improve the performance of the top-k and threshold-based TSelection algorithms by 8% to 20%. The difference between the evaluation time using the MPro heuristic and that using the dynamic heuristics increases as the threshold $V_t$ decreases or k increases. As expected, the DEC1 heuristic has a lower total cost (i.e., is faster) than the other heuristics for both the top-k and threshold-based TSelection algorithms. The DEC1 heuristic is 14% to 20% better than the MPro heuristic. The evaluation time using the DAEC heuristic is very close to that using the DEC1 heuristic, whereas the DEC2 heuristic has a higher evaluation time. The increase in the evaluation time for threshold-based TSelection is almost linear with respect to the decrease in the threshold $V_t$.

b) Combinations of distributions: The distributions of the importance scores for the expensive predicates $C_1$ and $C_3$ are chosen as uniform, and those for $C_2$, $C_4$, and $C_5$ are chosen as normal with a mean of 0.7 and a standard deviation of 0.2. We have observed that the evaluation time differences between the MPro heuristic and the dynamic heuristics are small compared to those in other distributions. The difference decreases when k increases or the threshold $V_t$ decreases. The dynamic heuristics are 0% to 9% better than the MPro heuristic for the threshold-based and top-k algorithms. Also, the DEC2 heuristic has the best performance for both the top-k and threshold-based TSelection algorithms.

In conclusion, whichever distribution is used to generate the importance scores of the expensive predicates, the dynamic predicate evaluation order heuristics improve the performance of the top-k and threshold-based TSelection algorithms. If the importance scores for all expensive predicates have the same distribution then the best heuristic is the DEC1 heuristic. If the importance scores have different distributions then the DEC2 heuristic has the best performance.
3.2 Time Constraints

We have evaluated the top-k and threshold-based TSelection algorithms with a time constraint T, using different distributions to generate importance scores for expensive predicates. We used the MaxImpLT, HigherImpLT, and MaxSizeLT heuristics to locate the tuple from which an unevaluated expensive predicate is to be evaluated at a given time. For a given time constraint T, we compare the performances of the three heuristics in terms of precision and wasted time ratio.

Definition 6 (Wasted Time Ratio, Precision). Let T be a query evaluation time limit, and $t_{useful}$ be the time spent, out of time T, to completely evaluate those tuples that are verified to be in the output. Then,

    Wasted time ratio = 1 - ($t_{useful}$ / T), and
    Precision = (No. of output tuples found within time T) / (No. of tuples in the fully evaluated output).

We used an input relation R of size 500 tuples. We used the selection condition $(C_1 \wedge C_2) \vee (C_3 \wedge C_4 \wedge C_5)$ with the expected evaluation costs $Cost(C_i)$ = {0.5, 0.3, 1.0, 0.4, 0.1} seconds, respectively. The derived importance score Imp(t) of a given tuple t is computed using the average function as $g_1$ and the minimum function as $g_2$. For the threshold-based TSelection algorithm, we located the tuples that satisfy the threshold value $V_t$ of 0.5. For the top-k TSelection algorithm, we computed the top-50 tuples. Let $T_{Required}$ be the time required to fully evaluate a given query. We used 0.5, 0.6, 0.7, 0.8 and 0.9 of $T_{Required}$ as the time constraint T, and computed the precision and wasted time ratio.
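The two measures of definition 6 are simple ratios; a minimal Python sketch, with hypothetical function and parameter names, is given below to make the units explicit (both times in the same unit, both counts in tuples).

```python
def wasted_time_ratio(t_useful: float, t_limit: float) -> float:
    """1 - t_useful / T, where t_useful is time spent on verified output tuples."""
    if t_limit <= 0:
        raise ValueError("the time limit T must be positive")
    return 1.0 - t_useful / t_limit


def precision(found_within_t: int, fully_evaluated_output: int) -> float:
    """Output tuples found within T over the size of the fully evaluated output."""
    if fully_evaluated_output <= 0:
        raise ValueError("the fully evaluated output must be non-empty")
    return found_within_t / fully_evaluated_output
```

For example, if 45 seconds of a 60-second limit were spent on tuples that ended up in the output, the wasted time ratio is 0.25; if 30 of the 50 tuples of the complete answer were found in that time, the precision is 0.6.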

First, the importance scores for all expensive predicates are generated using the normal distribution with a mean of 0.7 and a standard deviation of 0.2. The MaxImpLT heuristic has the worst performance as compared to the other heuristics: it has 14% to 54% lower precision and a 20% to 63% higher wasted time ratio. The MaxSizeLT heuristic has the best performance for all values of the time constraint T: it has 0% to 23% higher precision and a 0% to 42% lower wasted time ratio as compared to the HigherImpLT heuristic.

As for the performance of the top-k TSelection algorithm using the normal distribution, the MaxImpLT heuristic has the worst performance, but there is only a small difference between its performance and that of the other heuristics: it has 0% to 20% lower precision and a 0% to 4% higher wasted time ratio. The HigherImpLT heuristic has the best performance: it has 0% to 6% higher precision and a 0% to 2% lower wasted time ratio as compared to the MaxSizeLT heuristic.

In conclusion, whichever distribution is used to generate the importance scores for expensive predicates, the MaxImpLT heuristic has the worst performance for the top-k and threshold-based TSelection algorithms. In the threshold-based algorithm, there is a large difference between the performance of the MaxImpLT heuristic and that of the other heuristics, and, at a given time T, MaxSizeLT has the best performance. In the top-k based algorithm, there is no large difference between the performance of the MaxImpLT heuristic and that of the other heuristics, and HigherImpLT has the best performance.

4 Conclusions

We have presented algorithms to evaluate the topic selection operator TSelection for information resource discovery. We have proposed and evaluated heuristics to locate tuples and to evaluate expensive predicates.

References

1. ACM Digital Library.
2. Chang, K. C.-C., Hwang, S.-W., "Minimal Probing: Supporting Expensive Predicates for Top-k Queries", ACM SIGMOD.
3. Altingovde, I.S., et al., "Topic-Centric Querying of Web Information Resources", Proc. DEXA.
4. Ozsoyoglu, G., Al-Hamdani, A., Altıngovde, I.S., Ozel, S.A., Ulusoy, O., Ozsoyoglu, Z.M., "Sideway Value Algebra for Object-Relational Databases", VLDB Conf.
5. Ozsoyoglu, G., Altingovde, I.S., Al-Hamdani, A., Ozel, S.A., Ulusoy, O., Ozsoyoglu, M., "Extending SQL for Metadata-based Querying", submitted for journal publication.
6. DBLP Bibliography, by Michael Ley.
7. ACM SIGMOD Anthology.
8. Sadri, F., Ullman, J., "The Interaction between Functional Dependencies and Template Dependencies", SIGMOD Conf.
9. Biezunski, M., Bryan, M., Newcomb, S., editors, ISO/IEC 13250, Topic Maps.
10. Salton, G., Automatic Text Processing, Addison-Wesley.
11. Al-Hamdani, A., Ozsoyoglu, G., "Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach", technical report, EECS, CWRU.
12. Agichtein, E., Gravano, L., "Snowball: Extracting Relations from Large Plain-Text Collections", Proc. of the 5th ACM International Conf. on Digital Libraries.
13. Hellerstein, J.M., Stonebraker, M., "Predicate Migration: Optimizing Queries with Expensive Predicates", ACM SIGMOD.
14. Li, Li, "Finding Related Papers in a Digital Library", MS Thesis, CWRU, June 2003.
