Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach +


Abdullah Al-Hamdani, Gultekin Ozsoyoglu
Electrical Engineering and Computer Science Dept, Case Western Reserve University, Cleveland, Ohio
(abd, tekin)@eecs.cwru.edu

Abstract. This paper discusses algorithms for topic selection queries, designed to query a database containing metadata about web information resources. The metadata database contains topics and relationships among topics, called metalinks. Topics in the database have associated importance scores. The topic selection operator TSelection selects, within time T, topics that satisfy a given selection formula and have output importance scores above a given threshold value or in the top-k. The selection formula contains expensive predicates, in the form of user-defined functions. To minimize the number of expensive predicate evaluations (probes) in the TSelection algorithm, we introduce and evaluate three heuristics. Also, due to the time constraint T, the TSelection algorithm may terminate without locating all output tuples. To maximize the number of output tuples found, we introduce and evaluate three heuristics for locating the tuple to evaluate at a given time.

1 Introduction

Search engines such as Yahoo! use topics and topic hierarchies, extracted as metadata from the web, in order to allow for keyword-based searches over the entire web. We propose (i) restricting the scope of metadata extraction to specific web information resources such as the ACM Digital Library [1], and (ii) extending the metadata extraction process to include automatic extraction of topics and relationships among topics (called metalinks in this paper). Such metadata is stored in a database, and employed for ad hoc querying of web information resources [4, 5].

Data Model.
To model the metadata extracted from a web information resource, we have recently used [3, 4, 5] a topic maps-based [9] data model with topic entities (a keyword or a phrase) and metalinks. Examples of topics for the DBLP Bibliography [6] and the ACM SIGMOD Anthology [7] are T1: query optimization (a phrase), T2: database dependency theory (a phrase), and T3: The Interaction between Functional Dependencies and Template Dependencies (the title of a paper by Sadri and Ullman [8] in the ACM SIGMOD Anthology). Topics and metalinks have associated importance scores, which may be obtained using data mining techniques (e.g., association rule-based mining), derived by information retrieval techniques (e.g., the vector space model [10] and cosine similarities) [14], or obtained by other means (e.g., [12]). Each topic has one or more topic sources; for example, the pdf file for the paper with title T3 in the ACM SIGMOD Anthology constitutes a topic source for both topics T2 and T3. Note that topics and metalinks are metadata (e.g., information about the web resource), whereas topic sources constitute data. Maintaining topics and metalinks as metadata in a database allows for ad hoc queries to locate relevant topic sources. Normally, the number of topics satisfying a user request is quite large. Therefore, an algebra-based way of ranking topics and returning only a small number of highly important topics (and their topic sources) is needed. Towards this goal, we have proposed [4] a sideway value algebra (SVA) for object-relational databases. This paper investigates the evaluation of one SVA operator for web computing, namely, the SVA selection operator that implements topic selection queries.

+ This research is supported by the National Science Foundation grants INT and DBI

Topic Selection (TS) Queries. Given a database of topics, metalinks, and topic source URLs, a TS query takes (i) a single topic relation R with each tuple t containing information about topics and having an importance score Imp(R(t)), (ii) a (propositional calculus) selection formula C with predicates on topic similarity, (iii) an output importance score computation function $f_{out}(t)$, (iv) a query stopping condition β in terms of top-k importance-scored output tuples or output tuples whose importance scores are above a given threshold, and (v) a query response time limit T. The TS query returns, within time T, those tuples t of R that satisfy the selection formula C and whose output importance scores, computed as $f_{out}(t)$, satisfy β.

Example 1. (TS query). Consider the web resources DBLP Bibliography and the ACM SIGMOD Anthology, and the associated metadata database at Assume that the relation RelatedToPapers, extracted from DBLP and Anthology, has the schema RelatedToPapers($Pid_1$, $Title_1$, $Abstract_1$, $Pid_2$, $Title_2$, $Abstract_2$, ...), where the importance score of each RelatedToPapers tuple t can be obtained through the function Imp(RelatedToPapers(t)). Topic, metalink, and topic source importance scores are reals in the range [0,1].
The paper type is a specialization of the topic type, and a paper (instance) is an object/entity with an object-id (i.e., Pid). A user is interested in selecting, within 2 minutes, the top 10 papers in DBLP and Anthology that are related to the paper [8] by Sadri and Ullman (say, with Pid of p23) with a RelatedToPapers importance score of 0.8 or above; the selected papers must have either a title-similarity of 0.9 or above to the paper p23, or an abstract-similarity to p23 of 0.95 or above. Such a query can be expressed using an SQL-like syntax as:

    select *
    from RelatedToPapers RT
    where RT.Pid1 = p23 and Imp(RT) ≥ 0.8 and
          (Sim(RT.Title1, RT.Title2) ≥ 0.9 or Sim(RT.Abstract1, RT.Abstract2) ≥ 0.95)
    propagate importance within selection as min for conjunctions and avg for disjunctions
    stop after 10 most important and within time 2 min

where Sim() denotes a similarity function. The propagate importance clause defines the output tuple importance score computation function $f_{out}$ which, in this case, is defined as

$$f_{out}(t) = Imp(RT(t)) \cdot AVG[Sim(RT.Abstract_1, RT.Abstract_2),\ Sim(RT.Title_1, RT.Title_2)]$$

Issues addressed in this paper. We investigate two query processing issues for TS queries:

(a) Expensive metalink importance score computation: Some metalink types, such as RelatedTo, are expensive [13] in that their importance score computations are time-consuming. We call such functions expensive functions, and any predicate that contains an expensive function an expensive predicate. As an example, the importance score of the metalink type RelatedToPapers, given papers $pid_1$ and $pid_2$, can be computed [14] by the function

$$
\begin{aligned}
Imp(RelatedToPapers(pid_1, pid_2)) ={} & w_{Title} \cdot Sim_{Title}(pid_1.\text{title},\ pid_2.\text{title}) \\
& + w_{Authors} \cdot Sim_{Authors}(pid_1.\text{authors},\ pid_2.\text{authors}) \\
& + w_{Abstract} \cdot Sim_{Abstract}(pid_1.\text{abstract},\ pid_2.\text{abstract}) \\
& + w_{IndexTerms} \cdot Sim_{IndexTerms}(pid_1.\text{index-terms},\ pid_2.\text{index-terms}) \\
& + w_{Body} \cdot Sim_{Body}(pid_1.\text{body},\ pid_2.\text{body}) \\
& + w_{References} \cdot Sim_{References}(pid_1.\text{references},\ pid_2.\text{references})
\end{aligned}
\tag{1}
$$

where $Sim_{Title}(\,)$, $Sim_{Authors}(\,)$, $Sim_{Abstract}(\,)$, $Sim_{IndexTerms}(\,)$, $Sim_{Body}(\,)$, and $Sim_{References}(\,)$ denote similarity functions for pairs of paper titles, authors, abstracts, index terms, paper bodies, and references, respectively, and the w terms are weights with the constraint $w_{Title} + w_{IndexTerms} + w_{Authors} + w_{Body} + w_{References} + w_{Abstract} = 1$. We refer to a metalink importance score evaluation as a probe. Clearly, a RelatedToPapers probe (i.e., the computation of $Imp(RelatedToPapers(pid_1, pid_2))$ in equation (1)) is expensive, and the system must attempt to minimize the number of such probes. We assume that (i) all probes of a given metalink type have the same cost, and (ii) there is a total ordering of the probe costs of different metalink types. We also assume that topic importance scores are not expensive to compute, and that they are computed a priori (using pre-collected topic source data) and maintained in the metadata database.

(b) Time Constraints. TS query computation times can be too high. Therefore, such queries may have a time constraint clause of the form, say, Time = 2 minutes. Time constraints can be transformed into constraints on the number of expensive predicate evaluations (i.e., probes). Time-constrained query evaluation algorithms must be correct; i.e., given a top-k query with a time constraint, all output tuples must be in the top-k; and, for a threshold query with a threshold τ and a time constraint, the importance scores of output tuples must be greater than τ.
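To make the cost of a single probe concrete, equation (1) can be sketched in Python as a weighted sum of per-field similarities. This is only an illustration under stated assumptions: the weight values, the dictionary-based paper representation, and the token-overlap `jaccard` similarity are hypothetical stand-ins, since the paper fixes neither the weights nor the individual similarity functions.

```python
from typing import Callable, Dict

# Hypothetical field weights; equation (1) only requires that they sum to 1.
WEIGHTS: Dict[str, float] = {
    "title": 0.25, "authors": 0.10, "abstract": 0.25,
    "index_terms": 0.10, "body": 0.20, "references": 0.10,
}

def jaccard(a: str, b: str) -> float:
    """Stand-in similarity (an assumption): token-set overlap in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def related_to_papers_imp(
    p1: Dict[str, str],
    p2: Dict[str, str],
    weights: Dict[str, float] = WEIGHTS,
    sim: Callable[[str, str], float] = jaccard,
) -> float:
    """One probe: the weighted sum of field similarities from equation (1)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * sim(p1.get(f, ""), p2.get(f, "")) for f, w in weights.items())
```

Because every field (including the full paper body) is compared on each call, a probe of this kind is costly, which is why the algorithms that follow try to avoid unnecessary probes.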
In the ideal case, a TS query evaluation performs only the probes needed for the output tuples, i.e., the positive probes. In general, some probes will not contribute towards an output tuple, resulting in wasted time. In such cases, there can be multiple goals, such as maximizing the number of highest importance-scored output tuples, or maximizing the number of output tuples satisfying an importance score threshold. We present different heuristics and their evaluations for different goals.

Contributions. We discuss TS algorithms for evaluating the SVA Topic Selection operator. The algorithms are pipelined: they continuously generate output, and attempt to maximize the number of positive probes they make. In section 2, we present the top-k and threshold-based TS algorithms. Section 3 briefly presents the experimental evaluations. Section 4 concludes.

2 Top-k and Threshold-based Topic Selection Algorithms

In this section, we use an operator-based specification of the topic selection query (instead of the SQL-like syntax) as $\sigma^{*}_{C, f_{out}, \beta, T}(R)$, where each tuple r of the relation R has an input importance score $Imp_{in}(r)$, C is the selection condition with only expensive predicates, $f_{out}(\,)$ is the output importance score function, β is the output threshold, which is either a positive integer k as the ranking threshold, a real-valued importance score threshold $V_t$ in the range [0, 1], or the two-tuple $(k, V_t)$, and T is the time constraint. The operator $\sigma^{*}$ returns, in decreasing order of output importance scores and within time T, either (i) the top k $f_{out}$-ranking output tuples that satisfy the selection condition C (when β is k), or (ii) all tuples of R that have an $f_{out}$-importance score greater than $V_t$ and satisfy the selection condition C (when β is $V_t$), or (iii) the top k $f_{out}$-ranking output tuples that satisfy the selection condition C and have an $f_{out}$-importance score greater than $V_t$ (when β is the two-tuple $(k, V_t)$). When the time constraint T is not sufficient to get the answer, the selection query evaluation becomes a best-effort evaluation. We assume that the input relation R is sorted in decreasing order by the input importance scores $Imp_{in}$ of its tuples.

Example 2. In the selection query of example 1, after eliminating inexpensive predicates, we have C as

$$C = Imp(RT) \ge 0.8 \text{ and } [Sim(RT.Title_1, RT.Title_2) \ge 0.9 \text{ or } Sim(RT.Abstract_1, RT.Abstract_2) \ge 0.95] = C_1 \wedge (C_2 \vee C_3) = (C_1 \wedge C_2) \vee (C_1 \wedge C_3) = term_1 \vee term_2.$$

Using the importance score functions Min and Avg as specified in the query, we have, for a given tuple r,

$$Imp(C, r) = AVG(Imp(term_1, r),\ Imp(term_2, r)) = AVG(MIN(Imp(C_1, r), Imp(C_2, r)),\ MIN(Imp(C_1, r), Imp(C_3, r))).$$

If a predicate $C_i$ in a given term t is false (i.e., $Imp(C_i, r) = 0$) then term t is false (i.e., $Imp(t, r) = 0$). Therefore, we do not need to evaluate the remaining unevaluated predicates in term t.

Note that, in our environment, the expensive predicates are either importance score computation functions (e.g., Imp(RelatedToPapers())) or similarity computation predicates (e.g., Sim() > 0.9). Without loss of generality, we assume that the selection condition C is in the disjunctive normal form $C = \bigvee_i term_i$, where $term_i = \bigwedge_j C_j$ and $C_j$ is an atomic expensive predicate. The output importance score $f_{out}(r)$ for a given tuple r is computed as

$$f_{out}(r) = Imp_{in}(r) \cdot Imp(C, r) = Imp_{in}(r) \cdot g_1(g_2(term_1, r),\ g_2(term_2, r),\ \ldots)$$
where $g_1(\,)$ and $g_2(\,)$ are monotone functions that are used to incorporate the effects of the disjunctions and the conjunctions, respectively, on the output importance score computation. When $g_1(\,)$ and $g_2(\,)$ are monotone, the decision that a given tuple is not in the top-k or does not satisfy the threshold value $V_t$ can be derived without evaluating all of its atomic expensive predicates. The monotonicity of a given function is defined below.

Definition 1 (Monotone Function). A given function $g(\,)$ with n parameters is monotone iff it satisfies the following two conditions: (a) if $a_i \le b_i$ for all $1 \le i \le n$ then $g(a_1, \ldots, a_n) \le g(b_1, \ldots, b_n)$, and (b) $g(a_1, \ldots, a_n) = 1.0$ iff $a_i = 1.0$ for all $1 \le i \le n$.

Some examples of monotone functions are $PRODUCT(a_1, \ldots, a_n) = \prod_{i=1}^{n} a_i$, $MIN(a_1, \ldots, a_n) = a_i$ where $a_i \le a_j$ for all $1 \le j \le n$, and $AVG(a_1, \ldots, a_n) = (a_1 + a_2 + \cdots + a_n)/n$. In this paper, we assume that the $g_1(\,)$ and $g_2(\,)$ functions are monotone.

2.1 Fixed Order Probe-Optimal Topic Selection Algorithm

Chang and Hwang [2] have proposed the MPro algorithm to evaluate the top-k Selection operator with only conjunctive expensive predicates using only one monotone function $F(x, p_1, p_2, \ldots, p_n)$, where x is a pre-computed inexpensive predicate (i.e., with zero cost) and $p_1, p_2, \ldots, p_n$ are expensive predicates. They have proven that the MPro algorithm is probe-optimal, assuming that there is a pre-defined and fixed evaluation ordering of the predicates. The problem with the MPro algorithm is that usually there is no fixed optimal predicate evaluation order, and the best predicate evaluation order changes dynamically with respect to the tuple being evaluated. In the rest of this section, assuming a fixed pre-defined predicate evaluation ordering, we adapt and extend the minimal probing algorithm MPro in order to evaluate the top-k or threshold-based topic selection operator with two evaluation functions $g_1$ and $g_2$. Then, in section 2.3, we eliminate the fixed pre-defined predicate ordering assumption, and revise the algorithms so that the predicates to evaluate are chosen dynamically. First, we define the evaluation cost of a given expensive predicate.

Definition 2 (Expected Evaluation Cost). Let $Cost(C_i)$ be the expected evaluation (time) cost of $C_i$, where $C_i$, $1 \le i \le n$, is a conjunct in a selection formula C that is in the disjunctive normal form. Then the expected evaluation cost of C is defined as

$$Cost(C) = \sum_{i=1}^{n} Cost(C_i).$$

Assume that, at a given time, the predicates $C_1$ to $C_{current}$, $1 \le current \le n$, have been evaluated using $Imp(\,)$, here referred to as $EvaluatedImp(\,)$.

Definition 3 (Unevaluated Predicate Cost). Let $C_1, C_2, C_3, \ldots, C_n$ be the pre-defined evaluation ordering for computing $f_{out}$ for a topic selection operator on relation R, and $Cost(C_i)$ be the expected evaluation cost of a given predicate $C_i$. The unevaluated predicate cost $UCost(r)$ for a given tuple r in relation R, after computing the predicate $C_{current}$ with tuple r, is defined as

$$UCost(r) = \sum_{i=current+1}^{n} Cost(C_i).$$

Definition 4 ($Imp(C_j, r)$). Let $C_1, C_2, C_3, \ldots, C_n$ be the pre-defined evaluation ordering for computing $f_{out}$ for a topic selection operator on relation R. The importance score for the j-th expensive predicate $C_j$ after computing the predicate $C_{current}$ with tuple r in relation R is defined as:

$$Imp(C_j, r) = \begin{cases} EvaluatedImp(C_j, r) & \text{if } j \le current \\ 1 & \text{otherwise} \end{cases}$$

When $j > current$, we refer to $C_j$ as an unevaluated expensive predicate; otherwise it is an evaluated expensive predicate.
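Definitions 2 to 4 translate almost directly into code. The following Python sketch assumes a fixed ordering in which `costs[0]` holds $Cost(C_1)$ and `current` counts the predicates already probed; the function names and the dictionary of probed scores are illustrative choices, not part of the paper.

```python
from typing import Dict, List

def total_cost(costs: List[float]) -> float:
    """Cost(C): sum of the expected costs of all n predicates (definition 2)."""
    return sum(costs)

def ucost(costs: List[float], current: int) -> float:
    """UCost(r): cost of the unevaluated predicates C_{current+1}..C_n (definition 3)."""
    return sum(costs[current:])

def imp(evaluated: Dict[int, float], j: int, current: int) -> float:
    """Imp(C_j, r): the probed score if j <= current, else the optimistic 1.0 (definition 4)."""
    return evaluated[j] if j <= current else 1.0
```

Treating unevaluated predicates as 1.0 is what later makes the current score of a tuple an upper bound on its final score when $g_1$ and $g_2$ are monotone.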
Definition 5 (Current Output Importance Score). Consider a topic selection operator with the selection condition $C = \bigvee_{i=1}^{m} term_i$ and $term_i = \bigwedge_j C_j$. The current output importance score of a tuple r on a relation R using the topic selection operator after evaluating the expensive predicate $C_{current}$ is

$$Imp_{current}(r) = Imp_{in}(r) \cdot g_1(g_2(term_1, r),\ g_2(term_2, r),\ \ldots,\ g_2(term_m, r)),$$

where $g_1(\,)$ and $g_2(\,)$ are monotone functions, n is the number of atomic selection predicates, $1 \le current \le n$, m is the number of terms, $g_2(term_i, r) = g_2(Imp(C_j, r), Imp(C_{j+1}, r), \ldots)$, and $Imp(C_j, r)$ is as defined in definition 4.

Proofs of all lemmas and theorems are presented in [11].

Lemma 1. (a) For $1 \le current < n$, $f_{out}(r) \le Imp_{current}(r)$. (b) When $current = n$, $f_{out}(r) = Imp_{current}(r)$.

For the threshold-based TSelection algorithm, lemma 2 states the early termination criterion for a tuple to be dropped from the output.

Lemma 2. Assume that the threshold-based topic selection operator is to be applied to tuple r in relation R with atomic expensive predicates $C_1, C_2, \ldots, C_n$. During the evaluation of expensive predicates, if $Imp_{current}(r)$ becomes less than the threshold value $V_t$ then the tuple r cannot be in the output.

Note that, in comparison, the top-k based TSelection algorithm does not have this early termination criterion. A tuple r can be dropped from the output only if $Imp_{current}(r)$ is less than the importance score $f_{out}$ of k fully evaluated tuples.

Fig. 1 illustrates the threshold-based topic selection algorithm, Threshold-TSelection. In each iteration, the algorithm finds a tuple r for evaluation using the LocateTuple( ) function, and finds an unevaluated predicate $C_j$ of tuple r using the LocatePredicate( ) function. It evaluates the predicate $C_j$ for tuple r and computes $Imp_{current}(r)$ using the $g_1(\,)$ and $g_2(\,)$ functions. If $Imp_{current}(r)$ is less than the threshold value $V_t$ then tuple r is discarded. If all predicates are evaluated for tuple r and $Imp_{current}(r) \ge V_t$ then tuple r is added to the output. The algorithm stops when all output tuples are found or when the time T runs out. We assume in this section that the LocatePredicate( ) function locates the next predicate to evaluate by using the same pre-defined predicate evaluation ordering for all tuples.

    Algorithm Threshold-TSelection(V_t, R, C, g_1, g_2, T)
      For each tuple r in R do {
        Imp_current(r) = Imp_in(r);
        Set Imp(C_i, r) = 1.0 for all expensive predicates C_i;
        if (Imp_in(r) ≥ V_t) then add r into PossibleOutput; }
      While (PossibleOutput is not empty) and (time T is sufficient) do {
        r = LocateTuple(PossibleOutput);
        C_j = LocatePredicate(r);  // C_j is an unevaluated expensive predicate
        Imp(C_j, r) = Probe(C_j);
        Compute Imp_current(r) using g_1 and g_2;
        if (Imp_current(r) < V_t) then remove r from PossibleOutput
        else if (all predicates of r are evaluated) then add r into Output; }

Fig. 1: Threshold-TSelection Algorithm

Theorem 1. The Threshold-TSelection algorithm has no false drops; i.e., it does not output tuples that are not in the output of the TSelection operator.
And, when T is sufficiently large to evaluate all tuples in PossibleOutput, the algorithm has no false dismissals; i.e., it outputs all tuples that are in the output of the TSelection operator.

Fig. 2 illustrates the top-k topic selection algorithm, Top-k-TSelection.

    Algorithm Top-k-TSelection(k, R, C, g_1, g_2, T)
      For each tuple r in R do {
        Set Imp(C_i, r) = 1.0 for all expensive predicates C_i;
        Imp_current(r) = Imp_in(r);
        Add r into PossibleOutput; }
      While (|Output| < k) and (time T is sufficient) do {
        Let CurrentTopK be the (k - |Output|) tuples in PossibleOutput with the highest Imp_current;
        r = LocateTuple(CurrentTopK);
        if (there exists an unevaluated expensive predicate in tuple r) then {
          C_j = LocatePredicate(r);  // C_j is an unevaluated expensive predicate
          Imp(C_j, r) = Probe(C_j);
          Compute Imp_current(r) using g_1 and g_2; }
        if (all of r's predicates are evaluated) and (r is in the current top-k tuples)
          then add r into Output; }

Fig. 2: Top-k-TSelection Algorithm
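The two figures can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation, and it makes several assumptions: the time limit T is modeled as a probe budget `max_probes`, LocateTuple picks the candidate with the highest current score, LocatePredicate follows a fixed order, $g_1$ is the average and $g_2$ the minimum, and a tuple is removed from the candidate pool once it is added to the output. The sample data (`ROWS`, `TERMS`, `TRUE_SCORES`) is made up.

```python
from statistics import mean

def imp_current(imp_in, scores, terms, g1=mean, g2=min):
    """Imp_in * g1(g2(term_1), ...); unevaluated predicates count as 1.0."""
    return imp_in * g1([g2([scores.get(j, 1.0) for j in t]) for t in terms])

def threshold_tselection(rows, terms, probe, v_t, max_probes):
    """rows maps tuple id -> Imp_in; probe(r, j) returns Imp(C_j, r)."""
    preds = sorted({j for t in terms for j in t})        # fixed predicate order
    scores = {r: {} for r in rows}
    possible = {r for r in rows if rows[r] >= v_t}
    output, probes = [], 0
    cur = lambda r: imp_current(rows[r], scores[r], terms)
    while possible and probes < max_probes:
        r = max(possible, key=cur)                       # LocateTuple
        j = next(p for p in preds if p not in scores[r])  # LocatePredicate
        scores[r][j] = probe(r, j)
        probes += 1
        if cur(r) < v_t:
            possible.discard(r)                          # early drop (lemma 2)
        elif len(scores[r]) == len(preds):
            output.append((r, cur(r)))
            possible.discard(r)      # assumption: finished tuples leave the pool
    return sorted(output, key=lambda x: -x[1])

def topk_tselection(rows, terms, probe, k, max_probes):
    preds = sorted({j for t in terms for j in t})
    scores = {r: {} for r in rows}
    possible, output, probes = set(rows), [], 0
    cur = lambda r: imp_current(rows[r], scores[r], terms)
    while len(output) < k and possible and probes < max_probes:
        r = max(possible, key=cur)                       # best of CurrentTopK
        if len(scores[r]) < len(preds):
            j = next(p for p in preds if p not in scores[r])
            scores[r][j] = probe(r, j)
            probes += 1
        top = sorted(possible, key=cur, reverse=True)[:k - len(output)]
        if len(scores[r]) == len(preds) and r in top:    # fully evaluated, in top-k
            output.append((r, cur(r)))
            possible.discard(r)
    return sorted(output, key=lambda x: -x[1])

# Made-up data: condition (C1 and C2) or C3 over three tuples.
ROWS = {"a": 0.9, "b": 0.8, "c": 0.5}
TERMS = [[1, 2], [3]]
TRUE_SCORES = {("a", 1): 0.9, ("a", 2): 0.8, ("a", 3): 0.6,
               ("b", 1): 0.2, ("b", 2): 0.9, ("b", 3): 0.4,
               ("c", 1): 1.0, ("c", 2): 1.0, ("c", 3): 1.0}
probe = lambda r, j: TRUE_SCORES[(r, j)]
```

With this data, tuple b is discarded by the threshold variant after a single probe, because its optimistic score already falls below 0.5. With too small a budget both functions may return fewer tuples than the full answer, but never a tuple outside it, which mirrors the no-false-drops property of theorems 1 and 2. The sketch does not implement the short-circuit rule of example 2 (skipping the rest of a term once one of its predicates scores 0).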

Theorem 2. The Top-k-TSelection algorithm has no false drops. And, when T is sufficiently large to evaluate all tuples in PossibleOutput, the algorithm has no false dismissals.

2.2 Time-Constrained Query Evaluation Heuristics

Due to the time constraint T, the TSelection algorithm is terminated at time T, possibly before locating all output tuples that satisfy the given threshold value $V_t$. There are multiple possible query evaluation goals:

(1) Maximize the number of highest importance-scored output tuples. MaxImpLT (Locate Tuple) Heuristic: Locate the tuple r with the highest $Imp_{current}(\,)$ from PossibleOutput.

(2) Maximize the number of higher (but not the highest) importance-scored output tuples with lower unevaluated predicate costs. HigherImpLT Heuristic: Locate the tuple r with the highest $Imp_{current}(\,) / UCost(\,)$ from PossibleOutput.

(3) Maximize the size of the query output. MaxSizeLT Heuristic: Locate the tuple r with the lowest $UCost(\,)$ from PossibleOutput. If two or more tuples have the lowest $UCost(\,)$ then locate among them the tuple with the highest $Imp_{current}(\,)$.

In section 3.2, we present comparative evaluations of the three heuristics.

2.3 Dynamic Predicate Evaluation Order Heuristics

Next, we discuss heuristics for the LocatePredicate( ) function. The best predicate evaluation order may change with respect to the tuple chosen with LocateTuple( ) while evaluating the expensive predicates. For example, assume that the importance score of a given evaluated expensive predicate $C_i$ is very small (say, $Imp(C_i) = 0.05$) for a given tuple r. Then the remaining unevaluated predicates in the same conjunctive term as $C_i$ are likely to have a very small effect on the overall output importance score $Imp_{current}(r)$. Therefore, after computing $Imp(C_i) = 0.05$ for r, it is better to evaluate an expensive predicate in another term. Thus, the best order of the expensive predicates for a given tuple should be computed dynamically.
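The three LocateTuple heuristics of section 2.2 differ only in the ranking key they apply to the candidate tuples, as the following Python sketch suggests. The representation of a candidate as an `(Imp_current, UCost)` pair is an assumption made for illustration, as is the choice to treat a candidate with zero remaining cost as infinitely attractive under HigherImpLT.

```python
from typing import Dict, Tuple

Candidates = Dict[str, Tuple[float, float]]  # tuple id -> (Imp_current, UCost)

def max_imp_lt(cands: Candidates) -> str:
    """MaxImpLT: the candidate with the highest Imp_current."""
    return max(cands, key=lambda r: cands[r][0])

def higher_imp_lt(cands: Candidates) -> str:
    """HigherImpLT: the candidate with the highest Imp_current / UCost."""
    def ratio(r: str) -> float:
        imp, cost = cands[r]
        return float("inf") if cost == 0 else imp / cost  # assumption for UCost = 0
    return max(cands, key=ratio)

def max_size_lt(cands: Candidates) -> str:
    """MaxSizeLT: lowest UCost, ties broken by the highest Imp_current."""
    return min(cands, key=lambda r: (cands[r][1], -cands[r][0]))
```

For the made-up pool `{"r1": (0.9, 2.0), "r2": (0.8, 0.5), "r3": (0.1, 0.1)}`, the three heuristics select r1, r2, and r3 respectively, which shows how the evaluation goal changes the tuple that is probed next.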
Next, we define the heuristic (Predicate-Selection-by-)D(ynamic-)E(ffect-On-)C(ost)1, which locates the next expensive predicate to be evaluated in a dynamic manner for a given tuple r.

DEC1-LP (LocatePredicate) Heuristic: Assume that, at a given time, the predicates $C_1, C_2, \ldots, C_{current}$ in the selection condition C are evaluated for a given tuple r in a given relation R. We choose the unevaluated expensive predicate $C_i$ with the highest Max-Effect-On-Cost as the next predicate in the dynamic evaluation order. The Max-Effect-On-Cost($C_i$) for each unevaluated predicate $C_i$ is computed as follows: Let $Imp_{min}(C_i, r)$ be computed as $Imp_{current}(r)$ after assigning $Imp(C_i, r) = 0$ and $Imp(C_j, r) = 1$ for unevaluated predicates $C_j$, $j \ne i$. Then

$$\text{Max-Effect-On-Cost}(C_i, r) = \frac{Imp_{current}(r) - Imp_{min}(C_i, r)}{Cost(C_i)}$$

It is easy to prove that if the importance scores of the predicates of a given selection condition C are uniformly distributed and independent of each other then the DEC1 heuristic gives the optimum order of predicate evaluation for a given tuple r. If the importance scores of predicates are not uniformly distributed or their distribution is not known then we may take a small sample of the tuples from relation R.
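A short Python sketch of the DEC1 choice is given below, assuming $g_1$ is the average and $g_2$ the minimum, and using the selection condition and costs of the experiments in section 3 as sample input. The function names and the dictionary-based state are illustrative assumptions.

```python
from statistics import mean

def imp_current(imp_in, scores, terms, g1=mean, g2=min):
    """Imp_in * g1(g2(term_1), ...); unevaluated predicates count as 1.0."""
    return imp_in * g1([g2([scores.get(j, 1.0) for j in t]) for t in terms])

def dec1_next_predicate(imp_in, scores, terms, costs):
    """Return the unevaluated predicate with the highest Max-Effect-On-Cost."""
    cur = imp_current(imp_in, scores, terms)
    best, best_effect = None, float("-inf")
    for j, cost in costs.items():
        if j in scores:
            continue                      # already probed for this tuple
        # Imp_min: current score if C_j turned out to be 0, others still 1.0.
        imp_min = imp_current(imp_in, {**scores, j: 0.0}, terms)
        effect = (cur - imp_min) / cost
        if effect > best_effect:
            best, best_effect = j, effect
    return best

TERMS = [[1, 2], [3, 4, 5]]               # (C1 and C2) or (C3 and C4 and C5)
COSTS = {1: 0.5, 2: 0.3, 3: 1.0, 4: 0.4, 5: 0.1}
```

Before any probe the cheapest predicate, C5, is chosen. If that probe returns 0.05, the next choice is C2 in the other term, which matches the motivating example at the start of this section: once a term is known to score low, its remaining predicates barely change the output score.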

DEC2-LP Heuristic. At a given time, let the predicates $C_1, C_2, \ldots, C_{current}$ be evaluated for a given tuple r in a given relation R. The heuristic chooses the unevaluated expensive predicate $C_i$ with the highest Max-Effect-On-Cost2 as the next predicate in the dynamic evaluation order. The Max-Effect-On-Cost2($C_i$) for each unevaluated predicate $C_i$ is computed as follows: Let $Imp_{min}(C_i, r)$ be computed as $Imp_{current}(r)$ after assigning $Imp(C_i, r) = 0$ and $Imp(C_j, r) = 1$ for unevaluated predicates $C_j$, $j \ne i$. Also, let $SampleSel(C_i)$ be the selectivity of the importance scores of predicate $C_i$ using a sample from relation R. Then

$$\text{Max-Effect-On-Cost2}(C_i, r) = \frac{(Imp_{current}(r) - Imp_{min}(C_i, r)) + (1 - SampleSel(C_i))}{Cost(C_i)}$$

The following heuristic, (Predicate-Selection-by-)D(ynamic-)A(vg-)E(ffect-On-)C(ost), uses the average expected Effect-On-Cost for each unevaluated predicate, and chooses the predicate with the highest Avg-Effect-On-Cost as the next predicate to evaluate.

DAEC-LP Heuristic. Assume that, at a given time, the predicates $C_1, C_2, \ldots, C_{current}$ are evaluated in the selection condition C for a given tuple r in a given relation R. The heuristic chooses the next predicate to evaluate as the predicate $C_i$ with the highest Avg-Effect-On-Cost. The Avg-Effect-On-Cost($C_i$, r) for each unevaluated predicate $C_i$ is computed as follows: Let $Imp_{avg}(C_i, r)$ be computed as $Imp_{current}(r)$ after assigning $Imp(C_i, r)$ = the expected average of $Imp(C_i)$, computed using sampling, and $Imp(C_j, r) = 1$ for unevaluated predicates $C_j$, $j \ne i$. Then

$$\text{Avg-Effect-On-Cost}(C_i, r) = \frac{Imp_{current}(r) - Imp_{avg}(C_i, r)}{Cost(C_i)}$$

3 Experimental Evaluations of TSelection Algorithms

We have implemented the Top-k-TSelection and Threshold-TSelection algorithms using synthetic data. The importance scores for the expensive predicates are generated using uniform and normal distributions. We compute the selectivity of predicates by randomly evaluating the predicates for 1% of the tuples from a relation R.
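DEC2 and DAEC share the same selection loop and differ only in the numerator of the effect-on-cost ratio, so both can be sketched in Python with one chooser and two small effect functions. As before, $g_1$ is assumed to be the average and $g_2$ the minimum; the sampled selectivities `SAMPLE_SEL` and sampled averages `SAMPLE_AVG` are made-up values standing in for statistics that would come from a sample of R.

```python
from statistics import mean

def imp_current(imp_in, scores, terms, g1=mean, g2=min):
    """Imp_in * g1(g2(term_1), ...); unevaluated predicates count as 1.0."""
    return imp_in * g1([g2([scores.get(j, 1.0) for j in t]) for t in terms])

def next_predicate(imp_in, scores, terms, costs, effect):
    """Return the unevaluated predicate with the highest effect / Cost(C_j)."""
    cur = imp_current(imp_in, scores, terms)
    pending = [j for j in costs if j not in scores]

    def what_if(j, value):
        # Current score if C_j took `value` while other unevaluated ones stay 1.0.
        return imp_current(imp_in, {**scores, j: value}, terms)

    return max(pending, key=lambda j: effect(j, cur, what_if) / costs[j], default=None)

def dec2_effect(sample_sel):
    """Numerator of Max-Effect-On-Cost2: (Imp_current - Imp_min) + (1 - SampleSel)."""
    return lambda j, cur, what_if: (cur - what_if(j, 0.0)) + (1.0 - sample_sel[j])

def daec_effect(sample_avg):
    """Numerator of Avg-Effect-On-Cost: Imp_current - Imp_avg."""
    return lambda j, cur, what_if: cur - what_if(j, sample_avg[j])

TERMS = [[1, 2], [3, 4, 5]]               # (C1 and C2) or (C3 and C4 and C5)
COSTS = {1: 0.5, 2: 0.3, 3: 1.0, 4: 0.4, 5: 0.1}
SAMPLE_SEL = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5}   # hypothetical
SAMPLE_AVG = {1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.9}   # hypothetical
```

With these sample statistics, DEC2 starts with the cheap predicate C5, whereas DAEC starts with C2, because C5 is expected to score high on average and is therefore unlikely to lower the tuple's current score much.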
For each tuple t, we compute its derived importance score Imp(t) using the $g_1(\,)$ and $g_2(\,)$ functions. Let $Imp_{max}$ be the highest Imp among the tuples of the sample S. The selectivity Sel(P) of a predicate P is established as the number of tuples in S with $Imp(t, P) > Imp_{max}$ divided by the total number of tuples in S.

3.1 Locate-Predicate Heuristics

We compare the performances of the MPro, DEC1, DEC2 and DAEC heuristics in terms of the differences between their evaluation times and the evaluation time of the Best heuristic. For the Best heuristic, we assume that at a given time we know the actual truth values and importance scores for all predicates for a given tuple t. As in [2], the fixed predicate ordering for the MPro heuristic is the descending ordering of the predicates by their ranks, where the rank of a predicate P is (1 - Sel(P)) / Cost(P). We use the selection condition $(C_1 \wedge C_2) \vee (C_3 \wedge C_4 \wedge C_5)$ with the expected evaluation costs $Cost(C_i)$ = {0.5, 0.3, 1.0, 0.4, 0.1} seconds, respectively. That is, $C_1$ takes 0.5 seconds to evaluate, $C_2$ takes 0.3 seconds to evaluate, and so on. The size of the input relation R is 1000 tuples. We compute the derived importance score Imp(t) of a given tuple t using the average function as $g_1$ and the minimum function as $g_2$.

a) Uniform distribution: The importance scores of all expensive predicates have been generated using the uniform distribution. We have observed that the dynamic predicate evaluation order heuristics improve the performances of the top-k and threshold-based TSelection algorithms by 8% to 20%. The difference between the evaluation time of the MPro heuristic and that of the dynamic heuristics increases as the threshold $V_t$ decreases or k increases. As expected, the DEC1 heuristic has a lower total cost (i.e., is faster) than the other heuristics for both top-k and threshold-based TSelection algorithms. The DEC1 heuristic is 14% to 20% better than the MPro heuristic. The evaluation time using the DAEC heuristic is very close to that using the DEC1 heuristic, whereas the DEC2 heuristic has a higher evaluation time. The increase in the evaluation time for threshold-based TSelection is almost linear with respect to the decrease in the threshold $V_t$.

b) Combinations of distributions: The distributions of the importance scores for the expensive predicates $C_1$ and $C_3$ are chosen as uniform, and those for $C_2$, $C_4$, and $C_5$ are chosen as normal with a mean of 0.7 and a standard deviation of 0.2. We have observed that the evaluation time differences between the MPro heuristic and the dynamic heuristics are small as compared to those in other distributions. The difference decreases when k increases or the threshold $V_t$ decreases. The dynamic heuristics are 0% to 9% better than the MPro heuristic for threshold-based and top-k algorithms. Also, the DEC2 heuristic has the best performance for both top-k and threshold-based TSelection algorithms.

In conclusion, whichever distribution is used to generate the importance scores of the expensive predicates, the dynamic predicate evaluation order heuristics improve the performances of the top-k and threshold-based TSelection algorithms. If the importance scores for all expensive predicates have the same distribution then the best heuristic is the DEC1 heuristic. If the importance scores have different distributions then the DEC2 heuristic has the best performance.
3.2 Time Constraints

We have evaluated the top-k and threshold-based TSelection algorithms with a time constraint T, using different distributions to generate importance scores for expensive predicates. We used the MaxImpLT, HigherImpLT, and MaxSizeLT heuristics to locate the tuple from which an unevaluated expensive predicate is to be evaluated at a given time. For a given time constraint T, we compare the performances of the three heuristics in terms of the precision and the wasted time ratio.

Definition 6 (Wasted Time Ratio, Precision): Let T be a query evaluation time limit, and $t_{useful}$ be the time spent, out of time T, to completely evaluate those tuples that are verified to be in the output. Then,

$$\text{Wasted time ratio} = 1 - \frac{t_{useful}}{T}, \qquad \text{Precision} = \frac{\text{No. of output tuples found within time } T}{\text{No. of tuples in the fully evaluated output}}.$$

We used an input relation R of size 500 tuples. We used the selection condition $(C_1 \wedge C_2) \vee (C_3 \wedge C_4 \wedge C_5)$ with the expected evaluation costs $Cost(C_i)$ = {0.5, 0.3, 1.0, 0.4, 0.1} seconds, respectively. The derived importance score Imp(t) of a given tuple t is computed using the average function as $g_1$ and the minimum function as $g_2$. For the threshold-based TSelection algorithm, we located the tuples that satisfy the threshold value $V_t$ of 0.5. For the top-k TSelection algorithm, we computed the top-50 tuples. Let $T_{Required}$ be the time required to fully evaluate a given query. We used 0.5, 0.6, 0.7, 0.8 and 0.9 of $T_{Required}$ as the time constraint T, and computed the precision and the wasted time ratio.
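The two measures of definition 6 are simple ratios, sketched below in Python. The function and parameter names are illustrative, and the input checks are an added assumption, since the definition does not say how degenerate inputs should be handled.

```python
def wasted_time_ratio(t_useful: float, t_limit: float) -> float:
    """1 - t_useful / T, where t_useful is the time spent on verified output tuples."""
    if t_limit <= 0 or not 0 <= t_useful <= t_limit:
        raise ValueError("expected 0 <= t_useful <= T and T > 0")
    return 1.0 - t_useful / t_limit

def precision(found_within_t: int, full_output_size: int) -> float:
    """Output tuples found within T, divided by the size of the fully evaluated output."""
    if full_output_size <= 0:
        raise ValueError("the fully evaluated output must be non-empty")
    return found_within_t / full_output_size
```

For instance, a run that spends 45 of its 60 seconds on tuples that end up in the output and finds 40 of the 50 tuples of the full answer has a wasted time ratio of 0.25 and a precision of 0.8.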

First, the importance scores for all expensive predicates are generated using the normal distribution with a mean of 0.7 and a standard deviation of 0.2. The MaxImpLT heuristic has the worst performance as compared to the other heuristics. It has 14% to 54% lower precision and a 20% to 63% higher wasted time ratio. The MaxSizeLT heuristic has the best performance for all values of the time constraint T: it has 0% to 23% higher precision and a 0% to 42% lower wasted time ratio as compared to the HigherImpLT heuristic. As for the performance of the top-k TSelection algorithm using the normal distribution, the MaxImpLT heuristic has the worst performance, and there is a small difference between its performance and that of the other heuristics: it has 0% to 20% lower precision and a 0% to 4% higher wasted time ratio. The HigherImpLT heuristic has the best performance: it has 0% to 6% higher precision and a 0% to 2% lower wasted time ratio as compared to the MaxSizeLT heuristic.

In conclusion, whichever distribution is used to generate the importance scores for expensive predicates, the MaxImpLT heuristic has the worst performance for the top-k and threshold-based TSelection algorithms. In the threshold-based algorithm, there is a large difference between the performance of the MaxImpLT heuristic and that of the other heuristics, and, at a given time T, MaxSizeLT has the best performance. In the top-k based algorithm, there is no large difference between the performance of the MaxImpLT heuristic and that of the other heuristics, and HigherImpLT has the best performance.

4 Conclusions

We have presented algorithms to evaluate the topic selection operator TSelection for information resource discovery. We have proposed and evaluated heuristics to locate tuples and to evaluate expensive predicates.

References

1. ACM Digital Library, at
2. Chang, K. C.-C., Hwang, S.-W., "Minimal Probing: Supporting Expensive Predicates for Top-k Queries", ACM SIGMOD.
3. Altingovde, I.S., et al., "Topic-Centric Querying of Web Information Resources", Proc. DEXA.
4. Ozsoyoglu, G., Al-Hamdani, A., Altıngovde, I.S., Ozel, S.A., Ulusoy, O., Ozsoyoglu, Z.M., "Sideway Value Algebra for Object-Relational Databases", VLDB Conf.
5. Ozsoyoglu, G., Altingovde, I.S., Al-Hamdani, A., Ozel, S.A., Ulusoy, O., Ozsoyoglu, M., "Extending SQL for Metadata-based Querying", submitted for journal publication.
6. DBLP Bibliography, by Michael Ley, at
7. ACM SIGMOD Anthology, at
8. Sadri, F., Ullman, J., "The Interaction between Functional Dependencies and Template Dependencies", SIGMOD Conf.
9. Biezunski, M., Bryan, M., Newcomb, S., editors, ISO/IEC 13250, Topic Maps, available at
10. Salton, G., Automatic Text Processing, Addison-Wesley.
11. Al-Hamdani, A., Ozsoyoglu, G., "Selecting Topics for Web Resource Discovery: Efficiency Issues in a Database Approach", technical report, EECS, CWRU.
12. Agichtein, E., Gravano, L., "Snowball: Extracting Relations from Large Plain-Text Collections", Proc. of the 5th ACM International Conf. on Digital Libraries.
13. Hellerstein, J.M., Stonebraker, M., "Predicate Migration: Optimizing Queries with Expensive Predicates", ACM SIGMOD.
14. Li, Li, "Finding Related Papers in a Digital Library", MS Thesis, CWRU, June 2003.

Topic Area: Infrastructure for information systems Category: Research

Topic Area: Infrastructure for information systems Category: Research Paper Title: Sideway Value Algebra for Object-Relational Databases Paper Authors: Ozsoyoglu #, G, Al-Hamdani, A, Altıngovde, I.S, Ozel, S.A, Ulusoy, O, Ozsoyoglu, Z.M Note: G. Ozsoyoglu is on the Core

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: 2.114

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor: 2.114 [Saranya, 4(3): March, 2015] ISSN: 2277-9655 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY A SURVEY ON KEYWORD QUERY ROUTING IN DATABASES N.Saranya*, R.Rajeshkumar, S.Saranya

More information

An Overview of various methodologies used in Data set Preparation for Data mining Analysis

An Overview of various methodologies used in Data set Preparation for Data mining Analysis An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods

A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods A Study on Reverse Top-K Queries Using Monochromatic and Bichromatic Methods S.Anusuya 1, M.Balaganesh 2 P.G. Student, Department of Computer Science and Engineering, Sembodai Rukmani Varatharajan Engineering

More information

Hash-Based Indexing 165

Hash-Based Indexing 165 Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19

More information

Efficient World-Wide-Web Information Gathering. Tian Fanjiang Wang Xidong Wang Dingxing

Efficient World-Wide-Web Information Gathering. Tian Fanjiang Wang Xidong Wang Dingxing Efficient World-Wide-Web Information Gathering Tian Fanjiang Wang Xidong Wang Dingxing (Department of Computer Science and Technology, Tsinghua University, Beijing 100084,tfj@www.cs.tsinghua.edu.cn) Abstract

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY

DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY Reham I. Abdel Monem 1, Ali H. El-Bastawissy 2 and Mohamed M. Elwakil 3 1 Information Systems Department, Faculty of computers and information,

More information

T h e incomplete database

T h e incomplete database T h e incomplete database Karen L. Kwast University of Amsterdam Departments of Mathematics and Computer Science, Plantage Muidergracht 24, 1018 TV, Amsterdam Abstract The introduction of nulls (unknown

More information

FUZZY SPECIFICATION IN SOFTWARE ENGINEERING

FUZZY SPECIFICATION IN SOFTWARE ENGINEERING 1 FUZZY SPECIFICATION IN SOFTWARE ENGINEERING V. LOPEZ Faculty of Informatics, Complutense University Madrid, Spain E-mail: ab vlopez@fdi.ucm.es www.fdi.ucm.es J. MONTERO Faculty of Mathematics, Complutense

More information

Optimization of Queries with User-Defined Predicates

Optimization of Queries with User-Defined Predicates Optimization of Queries with User-Defined Predicates SURAJIT CHAUDHURI Microsoft Research and KYUSEOK SHIM Bell Laboratories Relational databases provide the ability to store user-defined functions and

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Suruchi Chawla 1, Dr Punam Bedi 2 1 Department of Computer Science, University of Delhi, Delhi, INDIA

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Exploiting Index Pruning Methods for Clustering XML Collections

Exploiting Index Pruning Methods for Clustering XML Collections Exploiting Index Pruning Methods for Clustering XML Collections Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,

More information

NP-Completeness of 3SAT, 1-IN-3SAT and MAX 2SAT

NP-Completeness of 3SAT, 1-IN-3SAT and MAX 2SAT NP-Completeness of 3SAT, 1-IN-3SAT and MAX 2SAT 3SAT The 3SAT problem is the following. INSTANCE : Given a boolean expression E in conjunctive normal form (CNF) that is the conjunction of clauses, each

More information

CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Chapter p. 1/27

CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Chapter p. 1/27 CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer Science (Arkoudas and Musser) Chapter 2.1-2.7 p. 1/27 CSCI.6962/4962 Software Verification Fundamental Proof Methods in Computer

More information

Element Algebra. 1 Introduction. M. G. Manukyan

Element Algebra. 1 Introduction. M. G. Manukyan Element Algebra M. G. Manukyan Yerevan State University Yerevan, 0025 mgm@ysu.am Abstract. An element algebra supporting the element calculus is proposed. The input and output of our algebra are xdm-elements.

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Diversified Top-k Graph Pattern Matching

Diversified Top-k Graph Pattern Matching Diversified Top-k Graph Pattern Matching Wenfei Fan 1,2 Xin Wang 1 Yinghui Wu 3 1 University of Edinburgh 2 RCBD and SKLSDE Lab, Beihang University 3 UC Santa Barbara {wenfei@inf, x.wang-36@sms, y.wu-18@sms}.ed.ac.uk

More information

On Multiple Query Optimization in Data Mining

On Multiple Query Optimization in Data Mining On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

A List Heuristic for Vertex Cover

A List Heuristic for Vertex Cover A List Heuristic for Vertex Cover Happy Birthday Vasek! David Avis McGill University Tomokazu Imamura Kyoto University Operations Research Letters (to appear) Online: http://cgm.cs.mcgill.ca/ avis revised:

More information

Principles of AI Planning. Principles of AI Planning. 7.1 How to obtain a heuristic. 7.2 Relaxed planning tasks. 7.1 How to obtain a heuristic

Principles of AI Planning. Principles of AI Planning. 7.1 How to obtain a heuristic. 7.2 Relaxed planning tasks. 7.1 How to obtain a heuristic Principles of AI Planning June 8th, 2010 7. Planning as search: relaxed planning tasks Principles of AI Planning 7. Planning as search: relaxed planning tasks Malte Helmert and Bernhard Nebel 7.1 How to

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information

8. Relational Calculus (Part II)

8. Relational Calculus (Part II) 8. Relational Calculus (Part II) Relational Calculus, as defined in the previous chapter, provides the theoretical foundations for the design of practical data sub-languages (DSL). In this chapter, we

More information

On the Hardness of Counting the Solutions of SPARQL Queries

On the Hardness of Counting the Solutions of SPARQL Queries On the Hardness of Counting the Solutions of SPARQL Queries Reinhard Pichler and Sebastian Skritek Vienna University of Technology, Faculty of Informatics {pichler,skritek}@dbai.tuwien.ac.at 1 Introduction

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

An Optimization of Disjunctive Queries : Union-Pushdown *

An Optimization of Disjunctive Queries : Union-Pushdown * An Optimization of Disjunctive Queries : Union-Pushdown * Jae-young hang Sang-goo Lee Department of omputer Science Seoul National University Shilim-dong, San 56-1, Seoul, Korea 151-742 {jychang, sglee}@mercury.snu.ac.kr

More information

WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE

WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE WEB DATA EXTRACTION METHOD BASED ON FEATURED TERNARY TREE *Vidya.V.L, **Aarathy Gandhi *PG Scholar, Department of Computer Science, Mohandas College of Engineering and Technology, Anad **Assistant Professor,

More information

Delay-minimal Transmission for Energy Constrained Wireless Communications

Delay-minimal Transmission for Energy Constrained Wireless Communications Delay-minimal Transmission for Energy Constrained Wireless Communications Jing Yang Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland, College Park, M0742 yangjing@umd.edu

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Relational Databases

Relational Databases Relational Databases Jan Chomicki University at Buffalo Jan Chomicki () Relational databases 1 / 49 Plan of the course 1 Relational databases 2 Relational database design 3 Conceptual database design 4

More information

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati

More information

Analysis of Basic Data Reordering Techniques

Analysis of Basic Data Reordering Techniques Analysis of Basic Data Reordering Techniques Tan Apaydin 1, Ali Şaman Tosun 2, and Hakan Ferhatosmanoglu 1 1 The Ohio State University, Computer Science and Engineering apaydin,hakan@cse.ohio-state.edu

More information

Semantic Optimization of Preference Queries

Semantic Optimization of Preference Queries Semantic Optimization of Preference Queries Jan Chomicki University at Buffalo http://www.cse.buffalo.edu/ chomicki 1 Querying with Preferences Find the best answers to a query, instead of all the answers.

More information

Active Blocking Scheme Learning for Entity Resolution

Active Blocking Scheme Learning for Entity Resolution Active Blocking Scheme Learning for Entity Resolution Jingyu Shao and Qing Wang Research School of Computer Science, Australian National University {jingyu.shao,qing.wang}@anu.edu.au Abstract. Blocking

More information

A synthetic query-aware database generator

A synthetic query-aware database generator A synthetic query-aware database generator Anonymous Department of Computer Science Golisano College of Computing and Information Sciences Rochester, NY 14586 Abstract In database applications and DBMS

More information

CS 512, Spring 2017: Take-Home End-of-Term Examination

CS 512, Spring 2017: Take-Home End-of-Term Examination CS 512, Spring 2017: Take-Home End-of-Term Examination Out: Tuesday, 9 May 2017, 12:00 noon Due: Wednesday, 10 May 2017, by 11:59 am Turn in your solutions electronically, as a single PDF file, by placing

More information

Exploring a Few Good Tuples From Text Databases

Exploring a Few Good Tuples From Text Databases Exploring a Few Good Tuples From Text Databases Alpa Jain, Divesh Srivastava Columbia University, AT&T Labs-Research Abstract Information extraction from text databases is a useful paradigm to populate

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Situation Calculus and YAGI

Situation Calculus and YAGI Situation Calculus and YAGI Institute for Software Technology 1 Progression another solution to the projection problem does a sentence hold for a future situation used for automated reasoning and planning

More information

SPARK: Top-k Keyword Query in Relational Database

SPARK: Top-k Keyword Query in Relational Database SPARK: Top-k Keyword Query in Relational Database Wei Wang University of New South Wales Australia 20/03/2007 1 Outline Demo & Introduction Ranking Query Evaluation Conclusions 20/03/2007 2 Demo 20/03/2007

More information

Keyword Join: Realizing Keyword Search in P2P-based Database Systems

Keyword Join: Realizing Keyword Search in P2P-based Database Systems Keyword Join: Realizing Keyword Search in P2P-based Database Systems Bei Yu, Ling Liu 2, Beng Chin Ooi 3 and Kian-Lee Tan 3 Singapore-MIT Alliance 2 Georgia Institute of Technology, 3 National University

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

Parallel Query Processing and Edge Ranking of Graphs

Parallel Query Processing and Edge Ranking of Graphs Parallel Query Processing and Edge Ranking of Graphs Dariusz Dereniowski, Marek Kubale Department of Algorithms and System Modeling, Gdańsk University of Technology, Poland, {deren,kubale}@eti.pg.gda.pl

More information

Towards a Logical Reconstruction of Relational Database Theory

Towards a Logical Reconstruction of Relational Database Theory Towards a Logical Reconstruction of Relational Database Theory On Conceptual Modelling, Lecture Notes in Computer Science. 1984 Raymond Reiter Summary by C. Rey November 27, 2008-1 / 63 Foreword DB: 2

More information

Data Integration: Logic Query Languages

Data Integration: Logic Query Languages Data Integration: Logic Query Languages Jan Chomicki University at Buffalo Datalog Datalog A logic language Datalog programs consist of logical facts and rules Datalog is a subset of Prolog (no data structures)

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

Keyword search in relational databases. By SO Tsz Yan Amanda & HON Ka Lam Ethan

Keyword search in relational databases. By SO Tsz Yan Amanda & HON Ka Lam Ethan Keyword search in relational databases By SO Tsz Yan Amanda & HON Ka Lam Ethan 1 Introduction Ubiquitous relational databases Need to know SQL and database structure Hard to define an object 2 Query representation

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

Towards Incremental Grounding in Tuffy

Towards Incremental Grounding in Tuffy Towards Incremental Grounding in Tuffy Wentao Wu, Junming Sui, Ye Liu University of Wisconsin-Madison ABSTRACT Markov Logic Networks (MLN) have become a powerful framework in logical and statistical modeling.

More information

Exact and Approximate Generic Multi-criteria Top-k Query Processing

Exact and Approximate Generic Multi-criteria Top-k Query Processing Exact and Approximate Generic Multi-criteria Top-k Query Processing Mehdi Badr, Dan Vodislav To cite this version: Mehdi Badr, Dan Vodislav. Exact and Approximate Generic Multi-criteria Top-k Query Processing.

More information

Content Based Cross-Site Mining Web Data Records

Content Based Cross-Site Mining Web Data Records Content Based Cross-Site Mining Web Data Records Jebeh Kawah, Faisal Razzaq, Enzhou Wang Mentor: Shui-Lung Chuang Project #7 Data Record Extraction 1. Introduction Current web data record extraction methods

More information

Evaluating Top-k Queries Over Web-Accessible Databases

Evaluating Top-k Queries Over Web-Accessible Databases Evaluating Top-k Queries Over Web-Accessible Databases AMÉLIE MARIAN Columbia University, New York NICOLAS BRUNO Microsoft Research, Redmond, Washington and LUIS GRAVANO Columbia University, New York A

More information

Notes for Chapter 12 Logic Programming. The AI War Basic Concepts of Logic Programming Prolog Review questions

Notes for Chapter 12 Logic Programming. The AI War Basic Concepts of Logic Programming Prolog Review questions Notes for Chapter 12 Logic Programming The AI War Basic Concepts of Logic Programming Prolog Review questions The AI War How machines should learn: inductive or deductive? Deductive: Expert => rules =>

More information

Overview of Query Evaluation. Chapter 12

Overview of Query Evaluation. Chapter 12 Overview of Query Evaluation Chapter 12 1 Outline Query Optimization Overview Algorithm for Relational Operations 2 Overview of Query Evaluation DBMS keeps descriptive data in system catalogs. SQL queries

More information

Image retrieval based on bag of images

Image retrieval based on bag of images University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 Image retrieval based on bag of images Jun Zhang University of Wollongong

More information

Estimating the Quality of Databases

Estimating the Quality of Databases Estimating the Quality of Databases Ami Motro Igor Rakov George Mason University May 1998 1 Outline: 1. Introduction 2. Simple quality estimation 3. Refined quality estimation 4. Computing the quality

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY

ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING WITH UNCERTAINTY ALGEBRAIC METHODS IN LOGIC AND IN COMPUTER SCIENCE BANACH CENTER PUBLICATIONS, VOLUME 28 INSTITUTE OF MATHEMATICS POLISH ACADEMY OF SCIENCES WARSZAWA 1993 ROUGH MEMBERSHIP FUNCTIONS: A TOOL FOR REASONING

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

ProUD: Probabilistic Ranking in Uncertain Databases

ProUD: Probabilistic Ranking in Uncertain Databases Proc. 20th Int. Conf. on Scientific and Statistical Database Management (SSDBM'08), Hong Kong, China, 2008. ProUD: Probabilistic Ranking in Uncertain Databases Thomas Bernecker, Hans-Peter Kriegel, Matthias

More information

THE RELATIONAL MODEL. University of Waterloo

THE RELATIONAL MODEL. University of Waterloo THE RELATIONAL MODEL 1-1 List of Slides 1 2 The Relational Model 3 Relations and Databases 4 Example 5 Another Example 6 What does it mean? 7 Example Database 8 What can we do with it? 9 Variables and

More information

Discovering Periodic Patterns in Database Audit Trails

Discovering Periodic Patterns in Database Audit Trails Vol.29 (DTA 2013), pp.365-371 http://dx.doi.org/10.14257/astl.2013.29.76 Discovering Periodic Patterns in Database Audit Trails Marcin Zimniak 1, Janusz R. Getta 2, and Wolfgang Benn 1 1 Faculty of Computer

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Thursday 16th January 2014 Time: 09:45-11:45. Please answer BOTH Questions

Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE. Date: Thursday 16th January 2014 Time: 09:45-11:45. Please answer BOTH Questions Two hours UNIVERSITY OF MANCHESTER SCHOOL OF COMPUTER SCIENCE Advanced Database Management Systems Date: Thursday 16th January 2014 Time: 09:45-11:45 Please answer BOTH Questions This is a CLOSED book

More information

An Evolutionary Algorithm for the Multi-objective Shortest Path Problem

An Evolutionary Algorithm for the Multi-objective Shortest Path Problem An Evolutionary Algorithm for the Multi-objective Shortest Path Problem Fangguo He Huan Qi Qiong Fan Institute of Systems Engineering, Huazhong University of Science & Technology, Wuhan 430074, P. R. China

More information

Question Score Points Out Of 25

Question Score Points Out Of 25 University of Texas at Austin 6 May 2005 Department of Computer Science Theory in Programming Practice, Spring 2005 Test #3 Instructions. This is a 50-minute test. No electronic devices (including calculators)

More information

arxiv: v2 [cs.cc] 29 Mar 2010

arxiv: v2 [cs.cc] 29 Mar 2010 On a variant of Monotone NAE-3SAT and the Triangle-Free Cut problem. arxiv:1003.3704v2 [cs.cc] 29 Mar 2010 Peiyush Jain, Microsoft Corporation. June 28, 2018 Abstract In this paper we define a restricted

More information

Database Theory VU , SS Introduction: Relational Query Languages. Reinhard Pichler

Database Theory VU , SS Introduction: Relational Query Languages. Reinhard Pichler Database Theory Database Theory VU 181.140, SS 2018 1. Introduction: Relational Query Languages Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 6 March,

More information

Module 9: Selectivity Estimation

Module 9: Selectivity Estimation Module 9: Selectivity Estimation Module Outline 9.1 Query Cost and Selectivity Estimation 9.2 Database profiles 9.3 Sampling 9.4 Statistics maintained by commercial DBMS Web Forms Transaction Manager Lock

More information

An Attribute-Based Access Matrix Model

An Attribute-Based Access Matrix Model An Attribute-Based Access Matrix Model Xinwen Zhang Lab for Information Security Technology George Mason University xzhang6@gmu.edu Yingjiu Li School of Information Systems Singapore Management University

More information

Answering Aggregation Queries on Hierarchical Web Sites Using Adaptive Sampling (Technical Report, UCI ICS, August, 2005)

Answering Aggregation Queries on Hierarchical Web Sites Using Adaptive Sampling (Technical Report, UCI ICS, August, 2005) Answering Aggregation Queries on Hierarchical Web Sites Using Adaptive Sampling (Technical Report, UCI ICS, August, 2005) Foto N. Afrati Computer Science Division NTUA, Athens, Greece afrati@softlab.ece.ntua.gr

More information

Refinement Types as Proof Irrelevance. William Lovas with Frank Pfenning

Refinement Types as Proof Irrelevance. William Lovas with Frank Pfenning Refinement Types as Proof Irrelevance William Lovas with Frank Pfenning Overview Refinement types sharpen existing type systems without complicating their metatheory Subset interpretation soundly and completely

More information

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data Enhua Jiao, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore {jiaoenhu,lingtw,chancy}@comp.nus.edu.sg

More information

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1) Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two

More information

Fountain Codes Based on Zigzag Decodable Coding

Fountain Codes Based on Zigzag Decodable Coding Fountain Codes Based on Zigzag Decodable Coding Takayuki Nozaki Kanagawa University, JAPAN Email: nozaki@kanagawa-u.ac.jp Abstract Fountain codes based on non-binary low-density parity-check (LDPC) codes

More information

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts. Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Advanced Preferred

More information

Compression of the Stream Array Data Structure

Compression of the Stream Array Data Structure Compression of the Stream Array Data Structure Radim Bača and Martin Pawlas Department of Computer Science, Technical University of Ostrava Czech Republic {radim.baca,martin.pawlas}@vsb.cz Abstract. In

More information

Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-based Approach

Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-based Approach UIUC Technical Report UIUCDCS-R-03-2324, UILU-ENG-03-1711. March 03 (Revised March 04) Optimizing Access Cost for Top-k Queries over Web Sources A Unified Cost-based Approach Seung-won Hwang and Kevin

More information

Random Permutations, Random Sudoku Matrices and Randomized Algorithms

Random Permutations, Random Sudoku Matrices and Randomized Algorithms Random Permutations, Random Sudoku Matrices and Randomized Algorithms arxiv:1312.0192v1 [math.co] 1 Dec 2013 Krasimir Yordzhev Faculty of Mathematics and Natural Sciences South-West University, Blagoevgrad,

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

CSC 125 - Discrete Math I, Spring 2017: Sets

A set is a well-defined, unordered collection of objects. The objects in a set are called the elements, or members, of the set. A set is said to contain its

ALGORITHMIC DECIDABILITY OF COMPUTER PROGRAM-FUNCTIONS LANGUAGE PROPERTIES. Nikolay Kosovskiy

International Journal Information Theories and Applications, Vol. 20, Number 2, 2013. Abstract: A mathematical

Evaluation of relational operations

Iztok Savnik, FAMNIT. Slides & Textbook. Textbook: Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3rd ed., 2007. Slides: From Cow Book

Chapter 2 & 3: Representations & Reasoning Systems (2.2)

Chapter 2 & 3: A Representation & Reasoning System & Using Definite Knowledge. Representations & Reasoning Systems (RRS) (2.2), Simplifying Assumptions of the Initial RRS (2.3), Datalog (2.4), Semantics (2.5)

arXiv:1801.09039v3 [cs.DB] 20 Feb 2018

Variance-Optimal Offline and Streaming Stratified Random Sampling. Trong Duc Nguyen 1, Ming-Hung Shih 1, Divesh Srivastava 2, Srikanta Tirthapura 1, Bojian Xu 3. 1 Iowa

Tree Interpolation in Vampire

Régis Blanc 1, Ashutosh Gupta 2, Laura Kovács 3, and Bernhard Kragl 4. 1 EPFL, 2 IST Austria, 3 Chalmers, 4 TU Vienna. Abstract. We describe new extensions of the Vampire theorem

Typed Lambda Calculus

Typed Lambda Calculus Department of Linguistics Ohio State University Sept. 8, 2016 The Two Sides of A typed lambda calculus (TLC) can be viewed in two complementary ways: model-theoretically, as a system of notation for functions

Rank-aware XML Data Model and Algebra: Towards Unifying Exact Match and Similar Match in XML

Proceedings of the 7th WSEAS International Conference on Multimedia, Internet & Video Technologies, Beijing, China, September 15-17, 2007

Inducing Parameters of a Decision Tree for Expert System Shell McESE by Genetic Algorithm

I. Bruha and F. Franek, Dept of Computing & Software, McMaster University, Hamilton, Ont., Canada, L8S4K1. Email:
