Optimizing Access Cost for Top-k Queries over Web Sources: A Unified Cost-based Approach

UIUC Technical Report UIUCDCS-R , UILU-ENG March 03 (Revised March 04)

Seung-won Hwang and Kevin Chen-Chuan Chang
Computer Science Department, University of Illinois at Urbana-Champaign

ABSTRACT

This paper studies how to minimize access costs by cost-based optimization for top-k queries in middleware, and in particular over Web sources. By dynamically searching over a space of algorithms, cost-based optimization is general across a wide range of access capabilities, yet adaptive to the specific access costs at runtime. Such optimization is crucial, especially for querying Web sources, to handle their heterogeneous capabilities and dynamic costs. However, techniques for systematic optimization are clearly missing for top-k queries. To begin with, what is the algorithm space to optimize over? Our approach hinges on developing an abstract framework to induce this space. Analyzing the logical structure of top-k queries, we build Framework NC by focusing on necessary scoring tasks, which thus achieves both generality and specificity as an algorithm space. Further, how do we identify an effective algorithm in the space? We develop dynamic search schemes, adopting two scheduling heuristics for reducing the search space. Our experiments indicate that this cost-based approach indeed outperforms existing algorithms specifically designed for their scenarios.

1. INTRODUCTION

As the Web has rapidly evolved into an ultimate repository of extensive and up-to-date information, querying over Web sources is essential for searching and integrating online information. Such querying, given the overwhelming scale of data, naturally demands ranked answers, or "best first," to enable users to focus on a few top results. In particular, such ranking is prevalent across many Web search engines and searchable databases. We study the problem of supporting ranked queries over Web sources.
To motivate, consider a Web travel agent scenario for finding restaurants and hotels, as Examples 1 and 2 illustrate. (We use this real scenario as benchmark queries for experiments; see Section 9.) In particular, how should we access sources with different capabilities and costs to answer queries efficiently? As our Web middleware coordinates various sources, each source access incurs network communication and server computation. This paper aims at optimizing such access costs, which dominate the overall query processing (like I/O in a relational DBMS).

Example 1: To find the top-5 restaurants (say, in the Chicago area) that are highly rated and close to her place myaddr, a user may ask a ranked query (in SQL-like syntax):

    select name from restaurant r
    order by min(rating(r), close(r, myaddr))
    stop after 5                                (Query Q1)

For query answering, our middleware will access some Web sources to evaluate the predicates, e.g., rating and close, into scores in [0, 1], which are then aggregated by some scoring function F, e.g., F = min, to determine the 5 highest-scored restaurants.

Our middleware can use various sources in query answering; Figure 1(a) shows one possible scenario. For evaluating close, superpages.com is capable of 1) returning the close score for a specific restaurant ("random access") and 2) returning restaurants in descending order of their scores ("sorted access"). For rating, dineme.com similarly provides both sorted and random accesses. The middleware will coordinate these accesses to find the top results.

Such accesses are typically expensive (compared to local computations), with varying costs. To characterize them, Figure 1(a) shows the average access latency (thus including both network and server times) of both sorted and random access (denoted cs_i and cr_i respectively) for each predicate. In this scenario, random accesses are more expensive than sorted accesses in both sources, but with different absolute scales and different ratios between the two.
Example 2: Consider query Q2 for the top-5 hotels that are close, with a high star rating, yet within the budget:

    select name from hotel h
    order by min(close(h, myaddr), stars(h), cheap(h))
    stop after 5                                (Query Q2)

Figure 1(b) describes another scenario, with hotels.com providing sorted access to all the predicates. In this setting, since a sorted access (e.g., for close) also retrieves all the attributes of a hotel (e.g., stars and price), the subsequent random accesses (footnote 1) to the same hotel are essentially of zero access cost (cr_i of about 0 ms); e.g., using stars and price, the middleware can locally compute the stars and cheap predicates. This scenario thus contrasts significantly with the expensive random accesses of Example 1.

Footnote 1: In a middleware, random accesses to an object u can only occur after u is first seen from sorted accesses, i.e., no "wild guess" [9].

[Figure 1: Web query scenarios for (a) Q1 and (b) Q2.]

Our goal is to develop middleware algorithms, or query plans for coordinating sources, to minimize access costs. This task is challenging. First, sources are heterogeneous, with widely varying access capabilities and costs (e.g., as the real sources in Figure 1 show). Our algorithms must be general for various capability configurations. Second, the Web is dynamic, with cost scenarios changing over time (e.g., depending on source load and availability). Our algorithms must be adaptive to runtime factors.

While many middleware algorithms exist, they do not satisfy these Web querying requirements, as Section 2 will review. For generality, the existing algorithms have mostly been designed with specific cost scenarios in mind. (In fact, even together, they do not cover some scenarios, e.g., Example 2.) For adaptivity, they largely lack systematic runtime optimization, with at most only limited heuristics.

We take a cost-based optimization approach. By dynamically searching over some space of algorithms, cost-based optimization is general across virtually all cost scenarios, yet adaptive to the specific one at runtime. While such optimization has been taken for granted for relational queries from early on [15], it is clearly lacking for ranked queries.

However, such optimization is challenging. To begin with, is there a complete yet focused algorithm space to search over? Our approach hinges on developing an abstract framework to induce this space. Inspired by the relational algebraic framework with its logical operators, we analyze the logical structure of top-k queries and construct Framework NC. By focusing on necessary scoring tasks, it achieves both generality and specificity as an algorithm space. Second, with NC defined, we need to develop systematic optimization schemes to effectively identify, in principle, the optimal algorithm in this space. Such search must balance both the overhead and the quality of optimization.

While we study Web querying, our approach is applicable in any middleware environment (e.g., multimedia systems [16]) where access costs are significant. Our experiments thus evaluate both real-life Web querying (using our travel agent benchmark scenarios) and a wider range of synthesized middleware settings. The results are encouraging: our framework outperforms the existing algorithms specifically designed for their scenarios.

Overall, this paper develops cost-based optimization for top-k querying (over Web sources). To our knowledge, our framework is the first such optimization. In realizing this goal, our contributions are as follows. First, we define Framework NC as a complete yet focused algorithm space for top-k queries; identifying such a space is essential for systematic optimization. Second, we develop dynamic optimization schemes for searching over the space to find an effective algorithm. Third, we report experimental evaluation using both real-life and synthetic scenarios. Our study indicates the generality and adaptivity of a cost-based approach.

We discuss related work in Section 2 and start with preliminaries in Section 3. Sections 4-6 define Framework NC as a space for top-k algorithms. Section 7 then develops optimization schemes over this space. Section 8 discusses how our framework unifies and contrasts with existing algorithms. Section 9 reports our experiments.

2. RELATED WORK

Supporting top-k queries over Web sources has also been studied by [2, 5], in more limited scenarios where sources support only random accesses (or "probes"). In contrast, our work schedules arbitrary accesses (random, sorted, and potentially beyond). Such generality complicates optimization, because of the progressiveness and side-effects of sorted accesses (Section 3.2), but it enables general applicability to any top-k scenario. In fact, our main results (e.g., Theorems 1 and 2) make no assumptions on the access types.
In the broader context of middleware, many algorithms have been proposed for different cost scenarios. Figure 2 summarizes a matrix of access scenarios that have been studied, each characterized by how sources relatively support either type of access, e.g., cheap, expensive, or impossible. Fagin pioneered Algorithm FA [8, 16] for scenarios where random and sorted accesses are supported with uniform cost (the diagonal cells in Figure 2). [14, 9] then proposed TA (or equivalents) with a stronger sense of optimality. Meanwhile, some works [9, 11, 1] explored non-uniform scenarios, e.g., CA (when random access is expensive), NRA (when random access is impossible), and TAz, MPro, and Upper (when sorted access is impossible). Further, SR-Combine [1], Quick-Combine [10], and Stream-Combine [11] enhance the above base algorithms with some runtime optimization. However, their heuristics have limited applicability; e.g., they use the partial derivative of the scoring function as an indicator, which may not be applicable to all functions (e.g., min).

                              Random access
    Sorted access             cheap (cr_i = 1)          expensive (cr_i = h)      impossible (cr_i = ∞)
    cheap (cs_i = 1)          FA, TA, Quick-Combine     CA, SR-Combine            NRA, Stream-Combine
    expensive (cs_i = h)      ?                         FA, TA, Quick-Combine     NRA, Stream-Combine
    impossible (cs_i = ∞)     TAz, MPro, Upper          TAz, MPro, Upper          X

Figure 2: Access scenarios and their proposed algorithms.

In contrast to existing algorithms, our goal is to develop systematic cost-based optimization. 1) Our approach is rather general: it not only unifies the existing algorithms in Figure 2 but also extends to a larger space, for any scenario that our cost function (Section 3.2) can model. In particular, the scenario in which random access is cheaper, as in Example 2, has not been studied (marked with "?" in the matrix). 2) By dynamic optimization, our approach naturally adapts to a given query at runtime; such adaptation is largely lacking in existing algorithms.

Meanwhile, ranked queries have also been proposed for relational databases. Carey et al. [3, 4] presented optimization techniques for exploiting the limited cardinalities of ranked queries. References [7, 6] then proposed to exploit probabilistic distributions and histograms respectively, to process ranked queries as equivalent Boolean selections.

3. SEMANTICS AND MODELS

To establish the context of our discussion, this section describes the semantics and a cost model for top-k queries.

3.1 Query Semantics

A top-k query Q = (F(p_1, ..., p_m), k), with scoring function F and retrieval size k, selects the top k objects ranked by F from database D = {u_1, ..., u_n}. Each object u has a predicate score p_i[u] for every predicate p_i and an overall query score F[u] (footnote 2). Without loss of generality, we assume that all scores are in [0, 1].

Footnote 2: To be more rigorous, F[u] is in fact F(p_1[u], ..., p_m[u]), i.e., a composition of F and the predicates.

As a standard assumption, F is monotonic, i.e., F(x_1, ..., x_m) ≤ F(y_1, ..., y_m) when x_i ≤ y_i for every i. As output, a top-k query returns a sorted list K = (v_1, ..., v_k) of the top k objects, along with and ranked by their overall F scores, such that F[v_1] ≥ ... ≥ F[v_k] and F[v_k] ≥ F[u] for every u not in K. Note that, to give deterministic semantics, we assume that there are no ties; otherwise, a deterministic tie-breaker function can be used to determine an order, e.g., by unique object IDs (e.g., hotel names) (see footnote 3).

As our running example, we will consider Q1 (Example 1) for finding the top-1 restaurant, i.e., k = 1. (For notational brevity, we will write predicates rating and close as p_1 and p_2 respectively.) For our illustration, let us assume Dataset 1 (Figure 3) as our example restaurant objects (which can only be known by accessing the Web sources). Each object has a score on p_1 and on p_2, and its overall score is F = min of the two. Overall, as a top-1 query, Q1 will return the single top-ranked object, whose overall score is .7.

3.2 Cost Model for Middleware Accesses

For ranked querying over Web sources, a middleware algorithm will gather predicate scores by some supported accesses to sources. As Section 1 introduces, a source may support 1) sorted access on predicate p_i, denoted sa_i; or 2) random access on predicate p_i for object u_j, denoted ra_i(u_j).

To illustrate, consider our example Q1 over Dataset 1. Figure 3(b) illustrates the sorted accesses. For instance, dineme.com supports sa_1, sorted access on p_1 (note that p_1 is rating). Each sa_1 will return one next-ranked object in the order of p_1, i.e., the objects with p_1 scores .7, .65, and .6 in turn. Alternatively, random access will directly return an object's score on some predicate. For instance, superpages.com supports ra_2(u) by returning the p_2 score (note that p_2 is close) for a given object u.
A middleware algorithm is thus a query plan that uses (and schedules) such accesses for query answering. Different algorithms will perform different sets of accesses to gather the scores needed, as we illustrate below.

Example 3 (Performed Accesses): To illustrate, consider an algorithm M1 that performs six accesses: three sorted accesses on one predicate, each followed by a random access on the other predicate for the object just returned. We use P(M1) to denote the accesses performed by M1. With these accesses, M1 has gathered enough information to answer Q1. In particular, it simply gathers the exact scores of every object for every predicate. The top-k can then be identified by sorting objects by their F scores. Note that the same query can be answered by different algorithms with different sets of accesses, e.g., an algorithm M2 whose P(M2) consists of three sorted accesses on each of the two predicates.

As a remark, we note that the two types of accesses differ fundamentally in two aspects.

Side-effects: Sorted access sa_i has side-effects. To illustrate, in Figure 3(b), the first sa_1 not only evaluates the p_1 score of the object it returns (.7) but also bounds the maximal-possible p_1 score of every unseen object by this last-seen score, i.e., at most .7. In contrast, random access ra_i(u) has no effect on objects other than u itself.

Progressiveness: Sorted access sa_i is progressive in that repeated accesses give more information. For instance, repeated sa_1 evaluates the p_1 scores of successive objects in turn, by accessing deeper into the sorted list of p_1. In contrast, ra_i(u) will return the same score every time and thus should not be repeated.

Footnote 3: Such enforcement of a certain tie-breaker enables optimization to compare only truly comparable algorithms, i.e., those returning the same results.

[Figure 3: Dataset 1. (a) dataset; (b) sorted accesses on p_1 and p_2.]

Over Web sources, each access incurs some cost, e.g., network communication or server computation. As such costs often dominate, our goal is to minimize the total access cost, which represents the total resource usage.
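The two remarks above (side-effects and progressiveness) can be made concrete with a minimal sketch. The `Source` class, its method names, and the object names "a", "b", "c" are hypothetical and introduced only for illustration; the p_1 scores (.7, .65, .6) are the ones quoted in the running example.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """One predicate p_i held by a (hypothetical) Web source."""
    scores: dict                      # object id -> exact score in [0, 1]
    cursor: int = 0                   # how deep sorted access has gone
    last_seen: float = 1.0            # current bound for every unseen object
    seen: dict = field(default_factory=dict)

    def sorted_access(self):
        """Return the next-ranked (object, score); progressive, with side-effects."""
        ranked = sorted(self.scores.items(), key=lambda kv: -kv[1])
        obj, score = ranked[self.cursor]
        self.cursor += 1              # progressiveness: the next call goes deeper
        self.last_seen = score        # side-effect: bounds all unseen objects
        self.seen[obj] = score
        return obj, score

    def random_access(self, obj):
        """Return the exact score of one object; other objects are unaffected."""
        self.seen[obj] = self.scores[obj]
        return self.scores[obj]

    def upper_bound(self, obj):
        """Exact score if already gathered, else the last-seen score."""
        return self.seen.get(obj, self.last_seen)


p1 = Source({"a": 0.7, "b": 0.65, "c": 0.6})
first = p1.sorted_access()            # hits the object with p_1 = .7
bound_after_first = p1.upper_bound("c")
```

After the first sorted access, the unseen object "c" is bounded by .7 although it was never touched; a second sorted access tightens that bound to .65, whereas a random access on "c" would change only the entry for "c".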
To capture various scenarios, our cost model uses cs_i and cr_i to specify the unit cost of a sorted and a random access respectively, for predicate p_i. The total access cost then aggregates the costs of all accesses; i.e., letting ns_i and nr_i be the numbers of sorted and random accesses respectively, performed on p_i by some algorithm M, the total cost is

    C(M) = \sum_{i} ( ns_i \cdot cs_i + nr_i \cdot cr_i )        (1)

Example 4 (Cost Model): To illustrate how our cost model works, continue Example 3. In the access scenario illustrated in Figure 1(a), where random accesses cost more than sorted accesses, Algorithm M1, performing three sorted accesses and three random accesses, incurs a higher total cost than M2, which performs six sorted accesses and no random access. However, observe that optimization is specific to the given cost scenario at runtime. In another scenario like Figure 1(b), where random accesses are essentially free (cr_1 = cr_2 = 0), Algorithm M1 is more efficient than M2.

Note that the total access cost, as a standard cost model used in top-k works [9], reflects not only the total resource usage but also the elapsed time, when accesses are performed sequentially. Thus, in general, our access minimization framework will naturally optimize for both. However, the two optimization goals can conflict when sources can handle concurrent accesses (as Web sources typically do). While elapsed time benefits from high concurrency, unrestrained concurrent accesses will abuse resources (e.g., by congesting the server). To address the conflicting goals, we model concurrency as bounded and optimize within this concurrency limit. We will show that such parallelization can simply build upon our access minimization framework (Section 9.1.1).

4. MOTIVATION: ALGORITHM FRAMEWORK

To enable optimization, or search for an effective algorithm, we must first define a space of algorithms to search over. Put simply, the goal of cost-based optimization is, in principle, to find the optimal algorithm M_opt in that space Ω, with respect to the cost model C, i.e.,

    M_{opt} = \arg\min_{M \in \Omega} C(M)        (2)

While crucial, such a space has not been developed for top-k queries. Defining this space is challenging: the space must be both large, or general, to encompass all comparable algorithms, while still sufficiently small, or specific, to allow efficient search.

For relational queries, this space is induced by an algebraic framework. As a query is composed of relational operators (e.g., joins and selections), the space of algorithms consists of those query plans that are algebraically equivalent. Each query plan is thus simply a schedule of the operators (by their commutativity and associativity). The algebraic framework induces a space of query plans, each a different schedule. Optimization is to find a good schedule of operations, conforming to the framework.

Our approach builds on this insight of an algorithm framework, or an abstract algorithmic structure, to induce the space of query plans, or algorithms. As the basis, we focus on sequential frameworks that iteratively schedule accesses. For our objective of minimizing total access costs (Eq. 1), sequential query plans are sufficient, since parallel accesses do not reduce total costs. However, we stress that parallelism can be built upon an effective sequential plan; since parallel accesses are possible over Web sources, a later section will discuss parallelization.

To concretely motivate this notion of framework, we start in this section with a simple one, Framework TG, which captures all sequential algorithms. With TG, we will contrast the requirements of generality and specificity.
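As a small illustration of Eq. 2 over a toy space, the sketch below evaluates Eq. 1 for the two plans of Examples 3 and 4 and keeps the cheaper one. Everything specific here is an assumption made for illustration: the function names, the tuple encoding of accesses, the choice of p_1 for sorted and p_2 for random access in M1, the object names, and the unit costs (the actual figures of Figure 1 are not reproduced here).

```python
from collections import Counter


def total_cost(accesses, cs, cr):
    """Eq. 1: sum over predicates of ns_i * cs_i + nr_i * cr_i."""
    counts = Counter((kind, i) for kind, i, *_ in accesses)
    return sum(n * (cs[i] if kind == "sa" else cr[i])
               for (kind, i), n in counts.items())


def cheapest_plan(space, cs, cr):
    """Eq. 2 by brute force: the plan in `space` with minimal total cost."""
    return min(space, key=lambda name: total_cost(space[name], cs, cr))


# M1 alternates a sorted access on p_1 with a random access on p_2;
# M2 performs three sorted accesses on each predicate.
M1 = [("sa", 1), ("ra", 2, "a"), ("sa", 1), ("ra", 2, "b"),
      ("sa", 1), ("ra", 2, "c")]
M2 = [("sa", 1)] * 3 + [("sa", 2)] * 3
space = {"M1": M1, "M2": M2}

# Hypothetical unit costs: random access expensive vs. essentially free.
expensive_random = dict(cs={1: 100, 2: 100}, cr={1: 500, 2: 500})
free_random = dict(cs={1: 100, 2: 100}, cr={1: 0, 2: 0})

best_when_expensive = cheapest_plan(space, **expensive_random)
best_when_free = cheapest_plan(space, **free_random)
```

Under these assumed costs the preferred plan flips between the two scenarios, which is the point of Example 4: the optimum depends on the cost scenario. Enumerating a whole space this way is only feasible for toy cases, which is why the space must also be specific.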
To begin with, in the abstract, all sequential algorithms (or query plans) simply iterate accesses one by one. As Figure 4 shows, in this Framework TG, any sequential algorithm M will continue (in the while-loop) to select and perform an access until the top-k can be determined. In each iteration, let P (the "accesses-so-far") be the accesses that M has performed so far (initially empty). M will stop when P has gathered sufficient information for the query; otherwise, M will keep selecting some access from S (i.e., the pool of all supported accesses) to proceed.

    Framework TG(Q, D): Trivially General
    Input: query Q = (F(p_1, ..., p_m), k), database D = {u_1, ..., u_n}
    Output: K, the top-k objects from D w.r.t. F
    1) S ← {sa_i, ra_i(u_j) | for all p_i, u_j};   // all supported accesses
    2) P ← ∅;                                      // accesses-so-far
    3) while (P has not gathered
    4)        sufficient scoring information for determining K)
    5)     alternatives ← S;
    6)     Select access A from alternatives;      // access selection
    7)     perform A; update K; P ← P ∪ {A};
    8) return K;

Figure 4: Framework TG for top-k query processing.

Note that, as an abstract framework, TG generates a space of concrete algorithms. This space, denoted Ω(TG), consists of all sequential algorithms. Concrete algorithms, while sharing this framework, differ in their access schedules, i.e., in their Select strategies (line 6). TG is rather unrestrictive: since it allows any access in S as an alternative to select from, any algorithm, as a sequence of supported accesses, can fit into TG.

Example 5 (TG): To see how TG generates query plans, consider M1 and M2 in Example 3. Suppose M1 executes the accesses in P(M1) in the order listed, i.e., a sorted access followed by a random access, repeatedly. TG can generate M1 by choosing, at each Select, a sorted access and a random access alternately. Similarly, it generates M2 by alternating sorted accesses on the two predicates.

Is Framework TG general enough for optimization? That is, if we focus only on those algorithms in Ω(TG), will we miss the best algorithm overall (i.e., without the restriction of the framework)? More formally, a framework is general, with respect to a cost function C (e.g., Eq. 1), if it can generate the optimal algorithm under C. As just explained, we consider sequential algorithms for our optimization. Thus, TG is trivially general: by simply encompassing all sequential algorithms, it will not miss the optimal one.

Such generality allows us to focus on the framework in optimization, by simply searching over concrete query plans within Ω(TG). For TG, this search amounts to finding a good access scheduling strategy for Select. Different algorithms will have different schedules and thus different costs; e.g., while both are in Ω(TG) (Example 5), M1 and M2 cost differently (Example 4).

Further, to enable more focused search, a framework must also be specific. Unfortunately, though general, TG is extremely non-specific. It simply allows any supported access in S to be selected at each iteration, i.e., alternatives = S, which is often a very large set of choices. For instance, for D with n objects and Q with m predicates, alternatives contains m sorted accesses and n·m random accesses, i.e., m + n·m choices. As different choices generate different algorithms, such non-specificity renders an extremely large algorithm space. It is thus difficult to find an effective algorithm within TG.

In summary, as a motivating framework, TG is trivially general but extremely non-specific; it is thus not useful for optimization. Our goal is to develop, by refining TG, a framework that is both general and specific. To achieve specificity, we must make alternatives at each iteration as small as possible. While specializing these choices, can we still maintain the generality of the framework?

To construct an effective framework, it is critical to first analyze the logical structure of top-k queries, so as to understand the building blocks. Analogously, relational queries are composed of relational operators as the task units for query plans to schedule. However, it is not obvious how a top-k query, with an arbitrary scoring function F, can be decomposed into logical tasks.
Section 5 will thus develop task decomposition, as the basis for building Framework NC in Section 6.

5. THE BASIS: TASK DECOMPOSITION

While accesses are the physical means for gathering object scores, what are the logical tasks that a top-k query must fulfill? This section studies the task decomposition of a top-k query into a set of necessary tasks, to be the building blocks for constructing an effective framework (Section 6).

5.1 Defining Scoring Tasks

We take an information-theoretic view and ask: what is the required information for answering a top-k query? Given a database D, any algorithm M must gather certain score information for each object, to determine the top-k. We can thus decompose the work of M into a set of required scoring tasks, one per object. To define such tasks, let K = (v_1, ..., v_k) be the top-k answers (where each v_j represents some object from D). The task for object u is to gather the (exact or partial) scores of u, by using relevant accesses, in order to either (if u is in K) compute the overall score of u or (else) prove that u cannot score higher than v_k (the k-th answer).

Definition 1 (Scoring Tasks): Consider a top-k query Q = (F, k) over D, with top-k answers K = (v_1, ..., v_k). The scoring task T_u for object u is as follows:
1. for u in K: T_u must compute the exact F[u] score; or
2. otherwise: T_u must indicate (by some partial scores) the maximal-possible F score of u, tight enough to support that F[u] < F[v_k]. (Note that we remove potential equality by deterministic tie-breaking.)

As a remark, note that these tasks are specified with K (the top-k answers) and F[v_k] (the k-th score) as given. These values, unfortunately, remain undetermined until query processing is fully completed. For this task view to be useful, our challenge (as we will discuss) is thus to develop mechanisms for identifying unsatisfied tasks during query processing, before K and F[v_k] are known.

Example 6 (Scoring Tasks): Consider our running example Q1 over Dataset 1 (Figure 3). For k = 1, the answer is a single object u with F[u] = .7 (these values are not known until Q1 is processed). We can specify the scoring tasks for the three objects as follows.

Consider the task for u. Since u is in K, the task must gather all predicate scores p_1[u] and p_2[u] for computing F[u]. Note that it can do so in various ways, e.g., by one sorted access sa_1 into p_1 (which hits u and returns .7) and a random access ra_2(u) (returning .7).

In contrast, the task for either of the other two objects needs only to prove, by gathering some partial scores, that the object's F score is below .7. To do so for the object with the lowest p_1 score, it can use, say, two sorted accesses sa_1 into p_1, which return first .7 and then .65. Now, since this object is still unseen from the sorted list of p_1, its p_1 score is bounded by the last-seen score, i.e., it is at most .65. As F = min, its overall score cannot be higher than .65, and is thus below .7.

We stress that these scoring tasks are both necessary and atomic. First, each task is necessary: if the task for any object u is not satisfied, M cannot properly handle u. 1) If u is a top-k answer, M cannot return its final score; 2) otherwise, without proving F[u] < F[v_k], M cannot safely exclude u from the top-k. Second, each task, as a per-object task, is atomic: for arbitrary F, it cannot generally be decomposed into smaller required subtasks. For case (1) of Definition 1, when u is in K, obviously all predicate scores are required.
For case (2), no subset of the object's predicate scores is absolutely required, as long as the upper-bound inequality can be proved.

In summary, we now view query processing as equivalent to fulfilling a set of (necessary and atomic) tasks. Each task, for an object u, gathers the required per-object information. The query can be answered when, and only when, all the tasks are fulfilled.

5.2 Identifying Unsatisfied Tasks

To focus query processing, it is critical to identify unsatisfied tasks to concentrate on. However, during query processing, it is challenging to judge whether a task is satisfied, since K = (v_1, ..., v_k), which our task specification (Definition 1) requires, is not determined until the very end.

In fact, for our purpose, we can address a slightly different problem: given a set of accesses-so-far P that has been performed, can we find any unsatisfied task? Instead of identifying all of them, for query processing to move on, it is sufficient to find just one. (Note that any unsatisfied task must eventually be fulfilled.) Our insight is that, by comparing the score states of objects, we can always conclude that some tasks are clearly unsatisfied, regardless of the eventual result.

Example 7 (Unsatisfied Tasks): Consider Q1 over Dataset 1. Suppose that, at some point, we have performed four accesses P: two sorted accesses sa_1, one sorted access sa_2, and one random access on p_1. Referring to Figure 3, these accesses gather the following score information. The two sorted accesses sa_1 on p_1 hit the objects with scores .7 and .65. As a side-effect (Section 3.2), the p_1 score of the remaining unseen object is bounded by the last-seen score, i.e., .65. The one sorted access sa_2 on p_2 returns the object with the highest p_2 score, .9, and sets an upper bound of .9 on the p_2 scores of the other two objects. The random access on p_1 returns the exact p_1 score of the object it probes. Putting these together, Figure 5 summarizes the current score state: for each object, the exact scores gathered so far, the bounds on its unevaluated predicates, and the resulting bound on its overall F score.

[Figure 5: The score state of Example 7 (columns: OID, p_1, p_2, F).]
At this point, the object u with p_1[u] = .7 has its p_2 score bounded only by .9, so its maximal-possible F score is .7; the other two objects can score at most .65 and .6 respectively. While we do not yet know what K will be (as Definition 1 requires), we can identify at least the scoring task for u as unsatisfied, no matter what K is:

If u is in K (i.e., u will eventually be the top-1): the task needs to gather the exact p_2[u] to compute the F score.

If u is not in K: in this case, the top-1 is one of the other two objects, with F scores of at most .65 and .6 respectively (Figure 5). Thus, the top-1 score (i.e., F[v_k] in Definition 1) is at most .65. Clearly, the task has not proved that F[u] is below this score, since u can score as high as .7.

As Example 7 hints, the task for an object u is unsatisfied if u has the potential to be in the top-k results. For such u, regardless of what K will be, we must know more about its scores to declare it either top-k or not. We thus identify whether a task is unsatisfied as follows: we quantify the current potential of u (with respect to P), and determine whether this potential is high enough to make the top-k results.

To begin with, we measure the current potential of an object by its maximal-possible score. Define F_P[u] as the maximal score that u may possibly achieve, given the partial scores that the accesses-so-far P have gathered. As F is monotonic, we compute F_P[u] by substituting unevaluated predicates with their maximal-possible scores. Note that an unevaluated p_i[u] is bounded by the last-seen score from the sorted accesses on p_i, denoted l_i. (Section 3.2 discussed such side-effects of sorted accesses.) For instance, as Figure 5 shows, an object with p_1 = .65 and p_2 bounded by .9 has F_P = min(.65, .9) = .65. Thus, formally,

    F_P[u] = F(\bar{p}_1[u], ..., \bar{p}_m[u]), where
    \bar{p}_i[u] = p_i[u]   if P has determined p_i[u],
    \bar{p}_i[u] = l_i      otherwise.                        (3)

Further, we focus on the current top-k objects by their potentials. Let K_P = (v_1, ..., v_k) be these current top-k objects, ranked by their F_P scores. (To illustrate, in Example 7, K_P consists of the object u with F_P[u] = .7.) There are two situations, depending on whether any current top object is incomplete.

First, suppose K_P contains an incomplete object v_j, i.e., one that has not been fully evaluated (with only partial scores). As Example 7 argued for u (an incomplete top-1), such v_j needs further accesses either way, by Definition 1: 1) If v_j is indeed in the final top-k, it needs complete evaluation. 2) Else, it needs further accesses to lower its maximal-possible score, so that it can be safely excluded from the top-k. Thus, the task for such incomplete v_j is clearly unsatisfied.

Second, suppose all objects v_1, ..., v_k in K_P are complete. These current top-k objects with respect to P are now indeed the final top-k (i.e., K_P = K), and the query can halt with these answers. To see why, we make two observations: 1) Every v_j is complete and thus has its exact score, i.e., F_P[v_j] = F[v_j]. 2) Every object u not in K_P, by the current ranking, has its maximal-possible score no higher than the above exact scores, i.e., F_P[u] ≤ F[v_j]. It follows that those v_j are the top-k answers, fully evaluated. Meanwhile, with these two observations, Definition 1 will declare all scoring tasks (of either case) satisfied. That is, checking from the task perspective, it is consistent that all tasks are fulfilled and thus query processing can indeed halt.

Theorem 1 states our results on identifying unsatisfied tasks.

Theorem 1 (Unsatisfied Scoring Tasks): Consider a top-k query Q = (F, k) over D = {u_1, ..., u_n}. With respect to a set P of performed accesses, let K_P = (v_1, ..., v_k) be the current top-k objects ranked by F_P.
1. For any v_j in K_P such that v_j has not been completely evaluated, its scoring task is unsatisfied.
2. If all v_j are complete, then every scoring task T_u, for every u in D, is satisfied, and K_P is the top-k result.

Proof: (1) If v_j has not been completely evaluated, its scoring task is unsatisfied. No matter what K will eventually be, there are two possible situations.

If v_j is in K: as its scoring task must compute F[v_j], the task is not complete until we gather p_i[v_j] for every unevaluated predicate p_i of v_j. Since v_j has not been completely evaluated, such p_i must exist, and thus the task is still unsatisfied (by Definition 1, Case 1).

If v_j is not in K: suppose its scoring task is satisfied. It will indicate that there are at least k objects u (e.g., those in K) satisfying F[u] > F_P[v_j], which in turn satisfy F_P[u] > F_P[v_j], as F_P[u] ≥ F[u]. Meanwhile, as v_j is in K_P, there are at most k - 1 objects u with F_P[u] > F_P[v_j], a contradiction.

(2) If all v_j are complete, then F[v_j] = F_P[v_j] ≥ F_P[u] ≥ F[u] for every u not in K_P, and thus K_P = K.
With this, we can show that scoring task T_u is satisfied for every u in D. For u in K: as every such u has been completely evaluated, T_u is satisfied (by Definition 1, Case 1). For u not in K: as F_P[u] ≤ F[v_k] (as shown above, with equality removed by tie-breaking), T_u is satisfied (by Definition 1, Case 2).

We stress that Theorem 1 is generically useful. First, it is useful in that it guarantees to identify some unsatisfied tasks if any exist: Condition 2 gives a precise way to determine whether any unsatisfied tasks remain. If so, Condition 1 will identify at least some of them (i.e., those incomplete v_j). Second, it is rather generic: its treatment of logical tasks makes no assumptions on particular physical accesses. We can thus uniformly handle both random and sorted accesses (and beyond), despite the progressiveness and side-effects (Section 3.2). (As Section 2 discussed, some earlier works [5, 2] assume random access-only scenarios.) These results provide a basis for constructing a specific framework (Section 6), by focusing on unsatisfied tasks, without compromising generality.

6. FRAMEWORK NC

This section develops a framework that is both general and focused, by refining TG (Section 4). Built upon our task decomposition (Section 5), Framework NC concentrates, at each iteration, on a small set of necessary choices, as induced by an unsatisfied task. Section 6.1 will first present the framework, before Section 6.2 discusses its generality.

    Framework NC(Q, D): Necessary Choices
    Input: query Q = (F(p_1, ..., p_m), k), database D = {u_1, ..., u_n}
    Output: K, the top-k objects from D w.r.t. F
    1) P ← ∅;                                                  // accesses-so-far
    2) K_P ← {v_1, ..., v_k | top-k from D ranked by F_P[.]};
    3) while (U = {v_j | v_j in K_P; v_j is incomplete} ≠ ∅)
    4)     v_j ← any object in U;                              // e.g., the highest-ranked
    5)     N_j ← {sa_i, ra_i(v_j) | p_i[v_j] is undetermined by P}; alternatives ← N_j;
    6)     Select access A from alternatives;                  // access selection
    7)     perform A; update K_P; P ← P ∪ {A};
    8) return K = K_P;

Figure 6: Framework NC.
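The test behind Theorem 1 can be sketched in a few lines, assuming F = min as in the running example. The function names, the dictionary encoding of the score state, and the object names "a", "b", "c" are hypothetical; the numbers are chosen to match the score state described in Example 7 (`None` marks a predicate score not yet determined, and `bounds` holds the last-seen score of each predicate).

```python
def max_possible(state, bounds, F=min):
    """Eq. 3: exact scores where known, last-seen bounds elsewhere."""
    return F([s if s is not None else b for s, b in zip(state, bounds)])


def current_top_k(states, bounds, k, F=min):
    """Objects ranked by maximal-possible score; ties broken by object id."""
    ranked = sorted(states, key=lambda u: (-max_possible(states[u], bounds, F), u))
    return ranked[:k]


def unsatisfied(states, bounds, k, F=min):
    """Theorem 1: incomplete objects among the current top-k (empty => halt)."""
    return [u for u in current_top_k(states, bounds, k, F) if None in states[u]]


# Score state after two sorted accesses on p_1, one on p_2, and one
# random access on p_1 (object names are assumptions).
states = {"a": [0.7, None], "b": [0.65, None], "c": [0.6, 0.9]}
bounds = [0.65, 0.9]

pending = unsatisfied(states, bounds, k=1)

# After a random access on p_2 for "a" returns .7, nothing is pending.
states_after = {**states, "a": [0.7, 0.7]}
pending_after = unsatisfied(states_after, bounds, k=1)
```

The first call reports the incomplete current top-1 object as an unsatisfied task; once that object is complete, the list is empty and, by Condition 2, processing can halt with the current top-k.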
6.1 The Framework

This framework hinges on the insight that query processing can focus on only unsatisfied tasks while remaining general enough to preserve potential optimality. Our motivating framework, TG, is rather unfocused, since iterative accesses can be selected from the entire pool of supported ones. In contrast, Framework NC first identifies some unsatisfied task and then focuses selection on those accesses that help fulfill it.

This insight builds on task decomposition (Section 5): top-k query processing is equivalent to fulfilling a set of (necessary and atomic) tasks. With this task view, during processing, when a set of accesses P has been performed, we can identify unsatisfied tasks by Theorem 1. (When all tasks are satisfied, query processing can halt, as Theorem 1 also asserts.) For any unsatisfied task, we can construct a set of accesses specifically for satisfying it, by collecting all and only the accesses that can further process the task. These accesses constitute the necessary choices for fulfilling the task. More precisely, the necessary choices consist of any (random or sorted) accesses that can return (exact or bounding) scores for the unevaluated predicates of the task's object. (As Theorem 1 states, for such an unsatisfied task, its object must still be incomplete.)

Example 8 (Necessary Choices): Continue our running example. Example 7 identified an unsatisfied task, for an object u with a score state (.7, .9, .7), as Figure 5 shows. Note that the task is unsatisfied, since the accesses-so-far P have not gathered sufficient information for u (for either case of Definition 1). To satisfy the task, we must know more of the scores of u: in particular, for predicate p_2, whose exact score is unknown. Thus, the following accesses can contribute.

Sorted access on p_2: Performing sa_2 can lower the upper bound of the score of u on p_2. As P (Example 7) already contains one sa_2, the next sa_2 will return an object with score .8 (Figure 3). This new last-seen score by sa_2 gives a tighter bound for the score of u on p_2.

Random access on p_2: Performing ra_2(u) will return the exact score, .7, of u on p_2, thus turning u into a completely evaluated object. In fact, the task is then satisfied.

Thus, for satisfying the task, the set of possible choices is {sa_2, ra_2(u)}.
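The effect of the two kinds of choices in an example like this one can be traced with a short Python sketch. The state below is made up (an object whose first predicate is known and whose second is not), the names `necessary_choices` and `bounds` are hypothetical, and predicates are indexed from 0; it is meant only to show how a sorted access tightens a bound while a random access fixes an exact score.

```python
from typing import List, Optional, Sequence, Tuple

Access = Tuple[str, int, Optional[str]]


def necessary_choices(obj: str, known: Sequence[Optional[float]]) -> List[Access]:
    """Accesses that can add information about obj's unevaluated predicates."""
    choices: List[Access] = []
    for i, score in enumerate(known):
        if score is None:                      # predicate i is still undetermined
            choices.append(("sa", i, None))    # may tighten the bound on predicate i
            choices.append(("ra", i, obj))     # returns the exact score on predicate i
    return choices


def bounds(known: Sequence[Optional[float]], last_seen: Sequence[float]) -> List[float]:
    """Per-predicate upper bounds: the exact score if known, else the last-seen score."""
    return [s if s is not None else b for s, b in zip(known, last_seen)]


# Made-up state: predicate 0 is evaluated, predicate 1 is not.
known = [0.7, None]
choices = necessary_choices("u", known)        # only accesses on predicate 1

before = bounds(known, [0.7, 0.9])             # bound on predicate 1 is 0.9
after_sa = bounds(known, [0.7, 0.8])           # a further sorted access returned 0.8
after_ra = bounds([0.7, 0.7], [0.7, 0.8])      # random access gave the exact 0.7
```

Under F = min, the maximal-possible score of this object is 0.7 in all three states; what changes is how much is known about the second predicate, which is what decides whether the object is complete.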

step | K_P   | alternatives                               | Select
1    | {u_3} | N_3 = {sa_1, sa_2, ra_1(u_3), ra_2(u_3)}   | sa_1
2    | {u_3} | N_3 = {sa_2, ra_2(u_3)}                    | ra_2(u_3)
Figure 7: Illustration of NC.

Definition 2 (Necessary Choices): Given a set of performed accesses P, consider an unsatisfied scoring task, for object u_j. The necessary choices for this task with respect to P are N_j = {sa_i, ra_i(u_j) | p_i[u_j] is undetermined by P}.

As Figure 6 shows, Framework NC builds upon TG, with additional steps for identifying necessary choices. Theorem 1 guides this process: at any point, NC maintains K_P, the current top-k objects with respect to accesses-so-far P, ranked by maximal-possible scores. Some objects in K_P may still be incomplete, which variable U collects. As Theorem 1 specifies, there are two situations.
1. If U is empty: as all top-k objects are complete, Theorem 1 asserts that there are no more unsatisfied tasks, which is thus the termination condition. NC will break the while-loop and return K_P.
2. Otherwise: since U is not empty, there are incomplete top-k objects. Any such object v_j corresponds to an unsatisfied task, by Theorem 1. NC arbitrarily picks any such v_j (say, the highest-ranked one), and constructs the necessary choices N_j (by Definition 2) as alternatives for selecting a further access.

Note that NC essentially relies on Theorem 1 to isolate a set of necessary choices. Theorem 1 enables an effective way to search for necessary choices, by maintaining K_P, the current top-k objects. Thus, a search mechanism for finding unsatisfied tasks should return top-k objects when requested, e.g., a priority queue that orders objects by maximal-possible scores as priorities. Note that, initially, all objects have the same maximal-possible score (i.e., a perfect 1.0). This initial condition is simply a special case of ties: in principle, NC will initialize K_P (in Step 2) with some deterministic tie-breaking order (Section 3). In practice, any tie-breaker (e.g., a run-time order that does not require re-sorting) can be used; our optimization will hold for algorithms returning the same results.
However, for the sake of presentation, our examples will assume the OID as a tie-breaker: when two objects tie, the one with the higher OID is effectively ranked higher.

Observe that, at each iteration, there may be multiple incomplete objects in K_P. We stress that NC can simply choose any such v_j to proceed. Each v_j designates an unsatisfied task; any such task must be further carried out, and is thus equally necessary (Section 5.1). More precisely, an unsatisfied task induces a set of necessary choices N_j with a desired completeness property: as Section 6.2 will discuss, with this completeness, any N_j can guarantee the generality of NC. Example 9 illustrates how NC works.

Example 9 (Framework NC): Figure 7 shows the execution of an example algorithm (for the query of Figure 3) that NC can generate. Initially, at Step 1 (Figure 7), as all the maximal-possible scores tie at 1.0, K_P is set to {u_3} (by the highest OID, our tie-breaker), which induces alternatives N_3. According to NC, the algorithm then Selects an access, sa_1 in this case, among the alternatives, which returns .7 (see Figure 3) and lowers the maximal-possible score of u_3 to .7. At Step 2, as all the maximal-possible scores tie at .7, u_3 remains the top object in K_P. However, u_3 now induces a smaller N_3, with accesses only for its unevaluated predicate p_2. The algorithm then Selects ra_2(u_3), which returns .7 and completes u_3 with final score .7. Since K_P, with u_3 as the top-1, is now fully complete, according to NC the algorithm will halt, with total accesses (sa_1, ra_2(u_3)).

6.2 Generality and Specificity

Our objectives toward an effective framework, as Section 4 motivated, are both generality and specificity. We next show that, unlike our motivating framework TG, NC is not only far more specific but also sufficiently general. First, we note that, by focusing on only necessary choices, NC is clearly more specific than TG (in which access selection must consider any arbitrary accesses).
For instance, Section 4 motivated the non-specificity of TG with an example in which alternatives contains every supported access. For the same setup, NC will have a far smaller choice set, according to Definition 2 (i.e., one sa_i and one ra_i(v_j) for each unevaluated predicate p_i of the chosen object v_j).

Further, we stress that, although more specific, NC is still general enough for optimization. This generality results from the completeness property of necessary choices, which NC uses as alternatives. In particular, we define a set of alternatives as complete with respect to accesses-so-far P, if any algorithm that has performed P must also perform at least one access from alternatives. Thus, setting alternatives to all supported accesses (as in TG) is trivially complete: if P is not sufficient to determine the query answers, any algorithm having done P must continue with at least one more access, which by definition must be among the supported accesses. In fact, while NC focuses on a much smaller alternatives, it is still complete. To see why, note that NC identifies a set of necessary choices N_j which, by Definition 2, contains all accesses that can contribute to the unsatisfied task. Since the task is necessary (Section 5.1), at least one access in N_j must be further executed, or the task cannot be satisfied and thus the query cannot be answered. (For instance, for the task in Example 8, if neither sa_2 nor ra_2 on its object is executed after P, the task will remain unsatisfied.) Thus, N_j is complete with respect to accesses-so-far P. This completeness holds for the necessary choices of any unsatisfied task, since any such task must be fulfilled sooner or later.

This completeness property ensures that NC is sufficiently general for optimization. That is, in our optimization (Section 7), we only need to consider the space of algorithms generated by Framework NC, which we call the NC space. For this purpose, we deem a space sufficiently general if it contains a comparable counterpart algorithm for every possible algorithm. That is, any arbitrary algorithm will find some counterpart in the NC space with no more cost, as Theorem 2 below states.
With this guarantee, it is sufficient to search only within the NC space for an optimal algorithm.

Theorem 2 (NC Generality): For any algorithm M with access cost C(M) with respect to the cost model (Eq. 1), there exists an algorithm M' in the NC space with cost C(M'), such that C(M') <= C(M).

Proof: Consider any query processing by M (for some query Q over database D). We will show the generality of NC by constructing an algorithm M' in Framework NC for the same processing, such that M' costs no more than M. Let P be the total accesses that M has performed. Since M' follows the iterative framework (Figure 7), let P'_i be the accesses of M' before iteration i; initially, P'_1 is empty. Similarly, let alternatives_i be the alternatives of M' at iteration i. Our proof is based on the following two lemmas for every iteration i, which we show later:
(a) P'_i is contained in P.
(b) alternatives_i and P - P'_i have at least one access in common.

Note that, by (a), algorithm M' incurs no more accesses than M: when M' halts at some iteration, the accesses it has performed are contained in P. This immediately implies that C(M') <= C(M) as well, because our cost function (Eq. 1) is monotonic in the accesses performed: if M performs at least as many accesses of every kind as M', then M has an overall cost at least as high.

To complete the proof, we now show by induction that (a) and (b) hold; we also specify the behavior of M' at each iteration, to show how it can be constructed in the framework.

Base case (i = 1): (a) is trivial, since initially P'_1 is empty. Consider (b). We note that, by definition of Framework NC, alternatives_1 is complete, in that any algorithm (like M) that has performed P'_1 must have performed in addition some access among alternatives_1. Thus, as M has performed P'_1 (trivially, since P'_1 is empty), it must have performed some access a in alternatives_1 in addition. That is, a is in both alternatives_1 and P - P'_1, and thus (b) holds.

Inductive step: As the induction hypothesis, assume that the lemmas hold for iteration i. What should algorithm M' do in this iteration? We construct M' for iteration i as follows. If M' has exhausted P, which provides enough information to answer Q, M' halts right before this iteration. Otherwise, NC requires that M' select one access from alternatives_i to continue. We let M' choose an access a that is also in P - P'_i; such an a must exist by (b). Thus P'_{i+1} = P'_i + {a}.

First, (a) holds for iteration i+1: since P'_i is contained in P (by the induction hypothesis) and a is in P - P'_i (by the construction of M'), it follows that P'_{i+1} is contained in P. Second, (b) holds for iteration i+1: by (a) (just proven above), M has performed P'_{i+1}. By the completeness of alternatives_{i+1}, M must have performed, in addition to P'_{i+1}, some access a' in alternatives_{i+1}. That is, a' is in both alternatives_{i+1} and P - P'_{i+1}, and thus (b) holds.

In summary, we stress that, as an algorithm-generating framework, NC defines an optimization space that is general yet specific. This space consists of algorithms that conform to NC but implement Select differently. Our goal, in principle, is thus to instantiate an optimal algorithm
in the NC space, which depends on query- and data-specific factors. Section 7 will discuss optimization techniques for finding the optimal algorithm M_opt such that, refining Eq. 2,

M_opt = argmin of C(M) over all M in the NC space    (4)

7. SEARCH: DYNAMIC OPTIMIZATION

In this section, we discuss how to actually optimize top-k queries, using Framework NC of Section 6. As briefly discussed, with the optimization space defined, the query optimization problem is now to identify the cost-optimal algorithm of Eq. 4. For systematic optimization, we must address the following three tasks, each of which corresponds to its counterpart in Boolean query optimization.
1. Space reduction: While already much more focused than the space of arbitrary algorithms, the NC space is still too large for exhaustive search. We thus design a suite of systematic heuristics to reduce the space. Similarly, Boolean query optimization relies on systematic heuristics for effective search, such as focusing only on linear joins.
2. Search: Within the space identified, we design effective optimization schemes that focus the search on promising algorithms. Similarly, Boolean optimization focuses its search on plans enumerated in particular ways, e.g., by dynamic programming.
3. Cost estimation: As a ground for comparing algorithms in the space, the optimizer must be able to estimate the cost of each algorithm. Our cost estimation extends the insight of its Boolean counterpart, as we will discuss in Section 7.3.

Figure 8: Illustration of SR/G heuristics.

7.1 Space Reduction

While NC helps optimization by inducing a focused algorithm space, this space is still large for exhaustive search: at each iteration, NC may Select any type of access on any unevaluated predicate of the top-k objects. We thus need to further focus within the space, with some systematic heuristics.
These heuristics contribute in two ways. First, they reduce the space significantly, while still retaining the promising algorithms for consideration. Second, they give an order to the reduced space, so that algorithms can be systematically enumerated by varying a few configuration parameters. In particular, we use the following heuristics for optimization.

First, we choose to focus only on SR algorithms (for sorted-then-random), which perform all sorted accesses sa_i on a predicate p_i, if done at all, before any random access on that predicate. Lemma 1 states that, for any top-k algorithm, we have its SR-counterpart gathering the same score information, with no more cost.

Lemma 1 (SR-counterpart): For any algorithm M in the NC space, there exists its SR-counterpart M' with no more cost, i.e., C(M') <= C(M).

Lemma 1 allows us to reduce our plan space by focusing only on the subset of SR algorithms, i.e., the SR-subset. However, how good is this heuristic? Will we miss the actual optimal algorithm by such a reduction? By Lemma 1, we can conclude that the reduction has no loss of optimality as long as the SR-counterpart of the optimal algorithm is still in the NC space, a property we call SR-inclusion. We believe SR-subset reduction is at least a good heuristic with little loss of optimality, as SR-inclusion does hold in our empirical observations, though we do not have a formal proof.

Second, we assume that random accesses on every object follow the same global order of predicates. That is, when multiple random accesses exist in alternatives, we follow some particular order (given by the optimizer; see Section 7.2) to choose which to perform. To illustrate, suppose the necessary choices contain random accesses on two different predicates of the same object: given the global order, we first pick the random access on whichever of these predicates comes earlier in the order, i.e., the next unevaluated predicate of the object according to the order. This heuristic was first studied in [5] (which focuses only on random access probes, unlike our general optimization). As [5] reported, such global scheduling achieves comparable optimization results, while significantly reducing the complexity.
By focusing on the above two heuristics, we propose Framework NC with SR/G (SR-subset and Global scheduling) heuristics. These heuristics customize the Select routine of NC as Figure 9 shows. Now the selection is more focused, guided by two parameters, the depth configuration (one suggested sorted-access depth per predicate) and the global schedule of predicates, which will be determined by the optimizer (Section 7.2). In essence, Select chooses a sorted access whenever there exists some sa_i that has not reached its suggested depth, i.e., whose last-seen score is still above the depth suggested for p_i. Otherwise, it performs a random access in alternatives, by picking the next unevaluated predicate (according to the global schedule). Example 10 illustrates how these heuristics actually work with our running example. (For the sake of presentation, from here on NC refers to the framework with SR/G heuristics.)

Procedure Select(alternatives, depth configuration, global schedule)
  if there is an sa_i in alternatives whose last-seen score is above the suggested depth for p_i: select sa_i;
  else if there is an ra_i(u) in alternatives such that p_i is the next unevaluated predicate of u in the global schedule: select ra_i(u);
Figure 9: Select with SR/G heuristics.

Example 10 (SR/G heuristics): Consider our running example. Figure 8 illustrates how the SR/G heuristics guide the access selection of NC for a given depth configuration and global schedule. At step 1, among the necessary choices in alternatives, Select focuses on sa_1 and sa_2, as the suggested sorted-access depths have not been reached yet. (We arbitrarily pick one, e.g., sa_1.) Similarly, at steps 2 and 3, Select chooses sa_2, until the last-seen score on p_2 falls below the suggested depth after step 3. Then, at step 4, we perform a random access, which completes the evaluation of the top object. NC can thus return this object as the top-1 answer with four accesses (three sorted accesses followed by one random access), as its score is higher than the maximal-possible scores of the rest.

In addition to reducing the search space, the SR/G heuristics enable NC to enumerate algorithms by the two parameters, i.e., every SR algorithm can be identified by a (depth configuration, schedule) pair. Consequently, our optimization problem can now be restated as identifying the pair whose algorithm has the minimal cost.

7.2 Search

Toward identifying the optimal algorithm, we approximate the problem by identifying the two parameters in turn.
Depth optimization: We first identify the optimal depth configuration with respect to some initial schedule.
Schedule optimization: We then identify the optimal schedule with respect to the identified depth configuration.
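The Select routine with SR/G heuristics described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the name `select_srg`, the tuple encoding of accesses, 0-based predicate indexes, and the interpretation of a depth as a score threshold (sorted access continues while the last-seen score is above it) are choices made here, and all numbers in the usage lines are made up.

```python
from typing import Optional, Sequence, Tuple

# ("sa", i, None): sorted access on predicate i (0-based).
# ("ra", i, u):    random access on predicate i of object u.
Access = Tuple[str, int, Optional[str]]


def select_srg(alternatives: Sequence[Access],
               last_seen: Sequence[float],
               depths: Sequence[float],
               schedule: Sequence[int]) -> Access:
    """Pick one access from `alternatives` following the SR/G heuristics."""
    # SR: prefer a sorted access that has not yet reached its suggested depth.
    for access in alternatives:
        kind, i, _ = access
        if kind == "sa" and last_seen[i] > depths[i]:
            return access
    # G: otherwise take the random access on the earliest scheduled predicate.
    random_by_predicate = {a[1]: a for a in alternatives if a[0] == "ra"}
    for i in schedule:
        if i in random_by_predicate:
            return random_by_predicate[i]
    raise ValueError("no sorted access above its depth and no random access available")


# Made-up alternatives for an object "u3" with both predicates unevaluated.
alts = [("sa", 0, None), ("ra", 0, "u3"), ("sa", 1, None), ("ra", 1, "u3")]

# Nothing seen yet (last-seen scores 1.0): sorted access is still preferred.
first = select_srg(alts, last_seen=[1.0, 1.0], depths=[0.8, 0.8], schedule=[1, 0])

# Both lists are already below their depths: follow the global schedule.
later = select_srg(alts, last_seen=[0.7, 0.75], depths=[0.8, 0.8], schedule=[1, 0])
```

Because the two parameters fully determine the choice at every step, enumerating (depths, schedule) pairs enumerates the algorithms of the reduced space.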
For schedule optimization, we can adopt [5], which similarly determines a global predicate schedule, as explained in Section 7.1. Thus, in this section, we focus on depth optimization. As Example 11 will illustrate, depth optimization is specific to runtime factors, e.g., score functions, predicate score distributions, and cost scenarios.

Example 11 (Depth Optimization Possibilities): To illustrate, we continue Example 10 with a different depth configuration. In fact, this configuration generates the algorithm illustrated in Figure 7: it starts with sa_1, as the suggested depth on p_1 has not been reached, but chooses ra_2(u_3) next, as the suggested depth on p_2 has. Observe from this example that different configurations imply different access costs: while the parallel configuration of Example 10 required four accesses to answer the query (Figure 8), this focused configuration requires only two accesses (Figure 7). However, note that this finding is specific to the scoring function: for instance, when the scoring function F is avg (the average function) for the same query, the parallel configuration requires fewer accesses (4 accesses) than the focused one (6 accesses).

Consequently, we need search schemes that systematically adapt to the given query in exploring the space of depth configurations, an m-dimensional space with one dimension per predicate. We first discuss an exhaustive search scheme, Naive, which will be used as a baseline for comparison (Section 9). We then enhance the scheme with more informed (either query-driven or generic) search.

(Scheme Naive) Naive simply explores the whole space by meshing it into a finite set of grid points. Then, for every grid point, it estimates the cost (see Section 7.3) of the corresponding algorithm and identifies the minimal-cost configuration among them. Though simple, Naive obviously suffers from scalability and performance limitations, especially when the space explodes for large m. We thus enhance Naive to systematically focus on a promising subset of the space, as follows.

(Scheme Strategies) Strategies enhances the Naive approach by applying query-driven strategies in the search for depth configurations. As illustrated in Example 11, a particular scoring function often implies a particular best strategy to narrow down the search, e.g., parallel configurations for avg and focused configurations for min. Thus, Scheme Strategies focuses its search on the configurations corresponding to the given strategy.

(Scheme HClimb) As an alternative to the query-specific Strategies scheme, one can apply a generic informed search to enhance the Naive scheme. For instance, one can apply hill climbing: from a random point, HClimb simply searches toward a neighboring configuration with less estimated cost, until it reaches a minimum. The scheme is typically enhanced with multiple random starting points, to avoid being stuck at a local minimum. In particular, our experiments in Section 9 will adopt HClimb as the optimization scheme, as it was evaluated to be the most effective in our experiments in the Appendix.

7.3 Cost Estimation

Finally, we discuss how to estimate the cost of algorithms in the space. To motivate, recall the cost estimation for Boolean queries. First, the optimizer estimates the selectivity of each predicate using some statistical samples, e.g., histograms. Second, it then estimates their aggregate effect, from which the overall cost can be computed. The aggregate effect is computed analytically in Boolean queries, as predicates are composed by a known set of relational operators. For instance, in a simple conjunctive query, the aggregate selectivity is simply the product of selectivities, assuming predicate independence.

For top-k queries, we extend the same intuition in the following ways. First, we generalize Boolean selectivity into the selectivity of probabilistic score distributions, which can be similarly estimated from statistical samples.
Second, we estimate the aggregate selectivity of predicates, which is challenging for top-k queries: as predicates are aggregated by an arbitrary function F, the aggregate effect cannot be quantified by analytic composition as in Boolean optimization, but only by simulation runs. Simulation is essentially a mimic of the actual execution on sample objects. In particular, we perform a simulation run on the samples, transforming a top-k query on the database into a top-k' query on the samples. The retrieval size k' is determined in proportion to the sample size, i.e., k is scaled down by the ratio of the sample size to the database size.

In principle, samples can be obtained from online sampling, or built offline (e.g., based on a priori knowledge on predicate score distribution). However, when samples are unavailable or too costly to obtain online, one can generate dummy samples based on an assumed distribution (e.g., uniform). Though such samples cannot represent actual score distributions, they help optimize for other important runtime aspects. While our optimizer will certainly benefit from accurate samples, Section 9 will implement our optimization framework using dummy samples, to validate our framework in the worst-case scenario.

8. UNIFICATION AND CONTRAST

With general optimization, NC should in principle unify algorithms for specific scenarios. We thus study how NC (see footnote 4) in fact unifies specific algorithms, by generating similar behaviors, and further contrast them, by identifying those behaviors that do not generalize.

As middleware algorithms generally assume no-wild-guesses [9], we first describe how NC handles this restriction (while NC can generally work with or without it). In such settings, an algorithm cannot refer to an object (for random access) before knowing it from some sorted access. Thus NC must distinguish between seen and unseen objects: an object remains unseen until hit by some sorted access, when it becomes seen. We introduce a virtual object, unseen, to represent all unseen objects. Note that all such objects share the same maximal-possible score, namely F applied to the last-seen scores of the sorted accesses.

This virtual object needs special handling, as Figure 10 shows. First, initially all objects are unseen, so NC now initializes K_P with only unseen. Second, when unseen is at the top (e.g., at step 1), its induced choices will contain only sorted accesses, since random access is not allowed for an unseen object, by the no-wild-guesses assumption. Third, objects hit by some sorted access become seen (e.g., an object seen by sa_1 at step 1). They are then handled as usual and may surface to K_P (e.g., at step 2).

Figure 10: NC with no wild guesses.

8.1 Algorithm TA

We now observe how NC adapts to TA scenarios.
As Figure 2 summarized, TA aims at access scenarios where sorted and random access have uniform unit costs, i.e., cs = cr. In brief, TA works as follows.
Perform sorted accesses on predicates in parallel, or equal-depth (see footnote 5).
As an object u is seen from any sorted access, perform random accesses exhaustively for every unevaluated predicate of u to compute its final score F[u]. Add u to K, if it is one of the k highest so far.
Let the threshold T be F applied to the last-seen scores of the sorted accesses. As soon as K has k objects with scores no less than T, stop and output K.

Footnote 4: For notational simplicity, we use NC interchangeably as an abstract framework and as the optimal algorithm generated.
Footnote 5: Note that the depth of sorted access, in this context of TA, refers to the number of objects accessed, instead of the score reached.

In essence, TA can be characterized by three behaviors. (1) Equal-depth-sorted-access: at each iteration it performs sorted accesses on all predicates. (2) Exhaustive-random-access: it then does exhaustive random accesses on every seen object. (3) Early-stop: it terminates as soon as the stop condition (every object in K has a score no less than T) is satisfied. So, would NC adapt to uniform scenarios by dynamic optimization and generate similar behaviors?

Figure 11: Illustration of NC and TA in two scenarios, (a) and (b).

Unification: In symmetric cases (which will be clear later), which TA's behaviors are optimized for, NC will indeed generate behaviors similar to TA. We illustrate with the scenario of Figure 11(a), with scoring function F = avg(p_1, p_2), in which the scores of p_1 and p_2 are uniformly distributed and sorted and random accesses have the same unit cost. To observe how NC adapts to this scenario, Figure 11(a) shows a contour plot of the estimated cost with respect to the depth configuration. NC identifies the minimal-cost configuration, the darkest cell marked by a rectangle, at around (.85, .83). To compare, the figure also marks the depth TA reaches (by an oval) at (.84, .84) (see footnote 6). Observe that the two algorithms are indeed almost identical. (1) Both perform equal-depth-sorted-access up to similar depths. (2) By accessing the same depths, they will both see the same set of objects. Since NC does not use exhaustive random access, it can only perform fewer random accesses than TA, e.g., NC slightly outperforms TA (by 1%) in Figure 11(a). (3) The output of NC shares the same early-stop condition as TA: since the maximal-possible score of unseen equals T (by definition) and unseen is not among the returned top-k, it follows that every returned object has a score no less than T.

Contrast: However, NC contrasts with TA by being able to adapt. Even among uniform scenarios, in the asymmetric cases, TA's characteristic behaviors cannot adapt well.
1. Equal-depth-sorted-access is not desirable in scenarios where the optimal depth is not equal across predicates, e.g., for F = min, focused sorted access is more effective (Example 11).
2. Exhaustive-random-access is not desirable: as contrasted above, by scheduling both sorted and random accesses, NC performs fewer random accesses.
3. Early-stop is not desirable if performing deeper sorted accesses can save random accesses that would otherwise follow, and thus reduce the total cost, i.e., when a trade-off exists between deeper sorted accesses and more random accesses.

In fact, in such scenarios, NC will adapt beyond TA and thus generate a rather different algorithm. To contrast, Figure 11(b) shows a second scenario with F = min (and otherwise the same as the first). Observe that NC and TA differ significantly: NC focuses its sorted accesses, while TA performs equal-depth sorted access. Observe also that their cost difference is significant as well: NC saves 30% of the access cost of TA, by focusing sorted accesses.

Footnote 6: This figure can be viewed or printed in color, for better visibility.

For a closer observation, Figure 12 compares the relative access costs of NC and TA (normalized to the total cost of TA) in various scenarios. As symmetric cases, Figure 12(a) first considers the scenario of Figure 11(a), which is rather favorable to TA (as ex-
