
Query Classification in Multidatabase Systems

Banchong Harangsri    John Shepherd    Anne Ngu
School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, AUSTRALIA.
Proceedings of the 7th Australasian Database Conference, Melbourne, Australia.

Abstract

Query optimisation is a significant unsolved problem in the development of multidatabase systems. The main reason for this is that the query cost functions of the component database systems may not be known to the global query optimiser. In this paper, we describe a method, based on a classical clustering algorithm, for classifying queries, which allows us to derive accurate approximations of these query cost functions. The experimental results show that the cost functions derived by the clustering algorithm yield a lower average error than those produced by a manual classification.

Keywords: Cost function derivation, Classification, Query optimisation, Multidatabase systems

1 Introduction

Query optimisation in multidatabase systems is fundamentally different from distributed query optimisation, for three major reasons [5]: site autonomy, system heterogeneity and semantic heterogeneity. Site autonomy means that the essential information for optimisation, namely cost functions and database statistics, may not be available to the global query optimiser to assist in choosing query execution plans. Clearly, before effective query optimisation is possible in such a system, some means must be found of estimating query costs in the component (or local) database systems.

Du et al. [2] were the first to address this problem. They identified three types of component database systems: proprietary databases, for which cost functions and database statistics are known; conforming databases, which can provide database statistics but not cost functions; and non-conforming databases, for which neither cost functions nor database statistics are available. Du et al.'s approach to this problem was to derive the coefficients (parameters) of the cost functions by using a synthetic database. The main limitations of this approach were:
- the derivation can be done only with conforming databases;
- the derivation process requires us to know a priori the access methods employed by the component databases;
- the synthetic relations used by the calibrating process must always have a field (attribute) whose values are normally distributed.

Recently, Zhu and Larson [8] proposed the use of a query sampling method to derive the cost parameters of a local database. Their method has three steps:
1. develop a manual classification of queries based on their access semantics, and derive a cost function for each class;
2. sample queries from each class and run them on a real (not synthetic) local database to observe the running times of the queries;
3. use multiple linear regression to derive the parameters of the cost function for the local database.

The manual classification proposed by Zhu and Larson [8], which we call the ZL classification, basically gives three classes of select or join queries: clustered index, non-clustered index, and non-index. The main problem with this method is that when it is used for non-conforming databases, the manual classification cannot produce the three classes (since we know nothing about the underlying access methods). Thus all queries are placed in a single class, with a single cost function which has a relatively high average error of cost estimation.
While query classification is clearly important in deriving accurate cost functions, a significant question is: "How many classes do we need to get accurate cost functions?". The higher the number of query classes we have, the more likely it is that the average error over all classes will be reduced.

There are reasons, however, that can prevent us from having the maximum number of classes. First, the more classes we have, the more sampling we are required to do. In non-conforming databases, it is reasonable to classify queries into two main classes of select queries, i.e., the equi and non-equi select queries. The maximum number of query classes would then be 2a, where a is the number of all attributes in the local database (the value 2a will be clarified again in section 4.3). Suppose the database we are considering has 10 relations, each of which has 10 attributes, so that the total number of attributes is 100. Thus the maximum number of classes would be 2 × 100 = 200. Each class of select queries requires at least 40 queries [8] to make the cost function of the queries in the class accurate enough to use on-line. Let us relax the assumption of 2a classes to only half of that; that is, we ignore the a classes of equi queries. The total number of queries for which we need to perform sampling is then 100 × 40 = 4000 queries. According to the experiments we have done with database sizes of 10,000 up to 25,000 tuples and 5-10 attributes per relation, only a limited number of non-equi select queries can be run per hour, so the sampling for the local database would take days. The problem becomes considerably worse when we want to perform the sampling over non-equi join queries and the maximum number of classes of join queries is required! Thus the sampling process may form a significant part of the total expense of running the database system, if a large number of classes are involved. Second, for dynamic local databases, we periodically need to perform a new sampling to update the cost function coefficients. Last, the fact that, in general, several query classes will have similar characteristics in their running time¹ means that grouping them together into the same class would not reduce the accuracy of the cost functions.

In this paper, we suggest that the number of classes k can be based on the number of queries q that we are willing to run in the sample. Whenever the required number of query classes is less than the maximum, the query classification problem can be formulated as a clustering problem in a large search space. Here, we propose to use a hierarchical clustering algorithm (HCA) [4] to perform the classification. Note that our method does not require any a priori knowledge about the local database system, apart from knowing the relational schema, which makes it more widely applicable than the ZL method (which requires us to know the access methods used in the local database).

¹ In this paper, we use the elapsed running time of queries as a cost metric, the same as in [8, 2].

The rest of the paper is organised as follows. Section 2 describes the model of queries that we use in optimisation and the cost functions on which the global query optimiser is based. In section 3, we examine query classification (1) where we can use a priori knowledge of the local database to classify queries into top-level classes, and (2) where we have no such knowledge but can still classify the queries further. Section 4 describes how to perform query sampling. The algorithm HCA is explained in section 5 and its experimental results are shown in section 6. An example of how to apply HCA to a local database is given in the appendix. In the last section, we present our conclusions and give some issues for future research.
2 Query Optimisation and Cost Function

In a multidatabase environment, each local database system may use a different kind of data model but, in this paper, the relational data model is assumed at the global level. That is, each local database is connected to the global multidatabase agent via an interface that provides a relational appearance, even if the participating database is non-relational [8].

In this paper, we adopt the standard treatment of queries that is used in most of the query optimisation work in the literature. A query is regarded as a sequence of select (σ), project (π) and join (⋈) operations. The cost of a query is the sum of the costs of those component operations, sequenced in a particular order. The projection operation is usually grouped together with a select or join operation, so its cost is computed in conjunction with the cost of the select or join operation.

One of the aims of this work is to produce cost functions which can be exploited by a global query optimiser to answer select-project-join queries in conjunctive normal form (CNF). (CNF is the most commonly used query form in the optimisation literature, basically because its search space is smaller than that of its counterpart, disjunctive normal form.) The predicates of a CNF query are ANDed together to form the whole condition of the query, and each predicate is of the form R_i.a_j θ const or R_i.a_j θ R_s.b_t, where R_i.a_j is attribute j of relation i, const is a constant value in the domain of attribute j, and θ ∈ {=, ≠, >, ≥, <, ≤}. Fundamentally, a predicate is either a select or a join operation. Our proposed method may be used to derive query cost functions for either of the two main classes of queries (i.e. select and join queries). However, in this paper, we study the application of our method only to the classification of select queries (for the rest of the paper, we use the word "query" to refer to "select query"). Select queries are of the form π_L(σ_F(R_i)), as in [8], where L is a list of projected attributes of relation R_i and F is a predicate of the form R_i.a_j θ const. Although simple, select queries of this form are sufficient to compute the cost of any complex CNF select query on a single relation.

Under our scheme, select queries are classified into subclasses, where each subclass has its own cost function, which is found by a least squared error (LSE) method. Each select query has two independent variables which affect the running time of the query (the dependent variable): the number of tuples of the input relation, x1, and the number of tuples of the output relation, x2. Basically, x2 is unknown at runtime, but it can be estimated from the selectivity of the select query, i.e., x2 = selectivity × x1. Therefore, what we try to do is to derive one cost function t̂ = f(x1, x2) for the queries in each subclass, where t̂ is the estimate of the real running time t of the query. To account for variations in system load, we ran each query three times and used the average running time as the value for t.

3 Query Classification

We propose a query classification scheme that can be used with local databases which:
1. can provide some a priori knowledge;
2. cannot provide any knowledge.

For the former, we can use knowledge gained from the applications at hand, such as the database schema, key information, query type information (for instance, point query, multipoint query, range query, prefix match query [6]), etc., to roughly classify all given queries into top-level classes. For example, the ZL manual classification method shown in Figure 1 can be considered a knowledge-based approach: queries which use any kind of clustered index are in the first class, queries which use any kind of non-clustered index are in the second class, and the last class contains the queries which fall outside the first two.

Figure 1: ZL manual classification. The query set Q is split into Q1 (clustered index queries), Q2 (non-clustered index queries) and Q3 (non-index queries).

The second kind of classification assumes that no such knowledge is available to assist in classifying the queries of a higher-level class into subclasses. For example, starting from the top-level classes (Q1, Q2 and Q3), the queries in each class can be classified further into subclasses in order to obtain more accurate cost functions. Recall that the more classes we have, the more accurate the cost functions we obtain. The algorithm HCA described in section 5 is used to carry out this kind of classification. Basically, this classification can be useful in two situations:
1. it can be used to enhance the a priori knowledge classification;
2. it can be used to derive further subclasses from a higher-level single class when the local database system is non-conforming, i.e., when the system cannot reveal any useful information to help classify queries into the top-level classes.

For the former situation, consider the ZL manual classification, which is knowledge-based: for each query class at the top level, namely Q1, Q2 or Q3, all queries (for example, all clustered index queries) are kept in a single class whose one cost function yields a high average error compared with the lower average error produced by multiple classes over the same queries. The second situation, where classification without knowledge can help, is particularly useful for any non-conforming database system.
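To make the contrast between the knowledge-based and the knowledge-free starting points concrete, the following is a minimal sketch of how the top-level classes could be formed; the schema encoding, the function names and the example relations are our own illustrative assumptions, not part of the paper.

    # Illustrative sketch: forming top-level query classes with and without
    # a priori knowledge about the local database.  The schema encoding and
    # names below are hypothetical.

    def knowledge_based_classes(schema):
        """ZL-style top-level split: Q1 = clustered index, Q2 = non-clustered
        index, Q3 = non-index attributes (cf. Figure 1)."""
        q1, q2, q3 = [], [], []
        for rel, attrs in schema.items():
            for attr, index_kind in attrs.items():
                target = {"clustered": q1, "non-clustered": q2}.get(index_kind, q3)
                target.append((rel, attr))
        return q1, q2, q3

    def no_knowledge_class(schema):
        """Non-conforming database: no index information is available, so all
        attributes (and hence all queries on them) fall into a single class Q."""
        return [(rel, attr) for rel, attrs in schema.items() for attr in attrs]

    # Example schema: attribute -> kind of index ("clustered", "non-clustered", None).
    schema = {
        "R1": {"a1": "non-clustered", "a2": "non-clustered", "a3": "clustered", "a4": None},
        "R2": {"b1": "clustered", "b2": None},
    }
    print(knowledge_based_classes(schema))   # Q1, Q2, Q3 as in Figure 1
    print(no_knowledge_class(schema))        # the single class Q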
We start off from a single query class Q which contains all the queries from Q1, Q2 and Q3 (in contrast to the knowledge-based classification, which starts off from a certain number of classes, 3 in Figure 1). In both situations (starting either from Q or from Q1, Q2 and Q3), and based on the number of queries given, the HCA algorithm works out query subclasses whose cost functions have a low average error.

4 Query Sampling

The query sampling we use here is simple random sampling [7], the same as in [8]. For the purpose of describing a number of parameters, we explain the sampling method in terms of the ZL knowledge-based classification. Although the sampling method is presented here for the knowledge-based classification, it can be applied straightforwardly to non-conforming database systems, where no knowledge about the local database is available.

4.1 Sampling Method

Sampling is best explained by Figure 2. The local database in the figure consists of 7 relations. In query set Q1 (Figure 2(a)), the attributes R1.a3, R2.b1, and so on are all clustered index attributes.

Figure 2: Query sampling. (a) Sampling Q1 (clustered index queries): each clustered index attribute (R1.a3, R2.b1, R3.h1, R4.d1, R5.e4, R6.f1, R7.g3) gives one initial query class of equality queries, Q11, Q12, ..., Q17 (e.g. Q11 = {R1.a3 = c1}). (b) Sampling Q2 (non-clustered index queries): each non-clustered index attribute (R1.a1, R1.a2, R3.h2, R3.h3, R4.d2, R5.e3, R6.f4, R6.f5, R7.g2) gives one initial query class, Q21, Q22, ..., Q29. (c) Sampling Q3 (non-index queries): R1 is used as a representative relation; its index attributes contribute only the operators <, > and !=, while its non-index attributes also contribute =, giving the initial query classes Q31, Q32, ..., Q39.

In Q2 (Figure 2(b)), R1.a1, R1.a2, R3.h2 and so on are non-clustered index attributes. In Q3, shown in Figure 2(c), we use R1 as a representative of the remaining relations. Note that R1.a1, R1.a2 and R1.a3 are index attributes, and therefore when drawing up the queries in Q31, Q32, ..., Q39 we consider only the operators {<, >, !=}³, since queries that use the "=" operator on these attributes already appear in the sets Q11, Q21 and Q22.

³ ≤ and ≥ are treated similarly to < and >, respectively.

Given the number of queries to be sampled, we sample queries from the entire query population such that each query is chosen randomly with equal probability. To clarify the sampling method, consider query set Q1. The average number of queries q̄ in each set Q11, Q12, ..., Q17 is computed by:

    q̄ = q / K    (1)

where q is the total number of queries in Q1 to be sampled and K (= 7 in Figure 2(a)) is the maximum number of classes (see the next section). More details about the average number of queries q̄ when q < K, q = K and q > K are given in reference [8].

4.2 Maximum Number of Classes (K)

Recall that Q (= Q1 ∪ Q2 ∪ Q3) is the entire set of queries to be sampled. The maximum number of classes for query set Q is:

    K = 4a    (2)

where a is the total number of attributes over all relations in the local database and the constant factor 4 is due to the four different relational operators {=, <, >, !=}. Now let us consider the maximum number of classes for each individual set Q1, Q2 and Q3 (see Figure 2). The maximum number of classes for Q1 and Q2 is, respectively, the number of clustered and of non-clustered index attributes in the database. Suppose K1 is the number of clustered indexes and K2 is the number of non-clustered indexes. For Q3, the maximum number of classes K3 is 4a − (K1 + K2). Therefore, K1 + K2 + K3 = 4a.

4.3 Preliminary Clustering

Since K is a vital factor in controlling the time and the search space used by the algorithm HCA in searching for the best clustering of query classes, and K is generally large, we propose here a preliminary clustering method to reduce the value of K and thus reduce both time and search space. Figure 3 helps to clarify the method. The basic idea is to cluster "similar" relational operators together into the same query class. For example, in the figure, one may want to cluster the equi select queries together to form one class and the non-equi queries to form another, on the justification that equi select queries should have similar running-time characteristics, and so should the non-equi queries. Note that even though the time and search space can be reduced by clustering some relational operators, the maximum number of classes is still large; namely, the total number of classes for query set Q after the preliminary clustering is 2a. Recall that a is the total number of attributes in the local database; since the preliminary clustering of queries in Figure 3 is based on 2 groups of relational operators, namely the equi (=) and non-equi (<, >, !=) operators, the total number of classes is equal to 2a.

Figure 3: Preliminary clustering of "similar" relational operators. For a relation R with attributes R.m1, R.m2, ..., R.mn, the equality queries on each attribute (e.g. R.m1 = c1) form one class and the non-equality queries on that attribute (R.m1 {<, >, !=} c2) form another, giving the classes Q1, Q2, ..., Q8 shown in the figure.

4.4 Number of Classes Required (k)

The number of queries q that users wish to sample indicates how many classes k we need. Some classes of queries are expensive (the non-equi ones), and they make up around 3/4 of the maximum number of classes (4a). Therefore, we can afford only a certain number of classes, less than the maximum. To make each cost function accurate enough, we require at least w queries.
Here, w is greater than or equal to 40 queries for select queries, as proposed in [8]. Therefore, the number of classes we need is:

    k = ⌊q / w⌋    (3)

where ⌊·⌋ denotes the largest integer less than or equal to q/w.
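The counting arguments of this section (equations (1)-(3)) can be summarised in a small sketch; the schema encoding is a hypothetical convenience, and only the constant w = 40 [8] and the 10 × 10 example schema of the introduction come from the paper.

    # Illustrative sketch of the class counting and sampling arithmetic of
    # section 4 (equations (1)-(3)).

    def max_classes(schema, operators=4):
        # Equation (2): K = 4a, one class per attribute and relational
        # operator in {=, <, >, !=}.
        a = sum(len(attrs) for attrs in schema.values())
        return operators * a

    def preliminary_classes(schema):
        # Section 4.3: clustering equi vs. non-equi operators leaves 2a classes.
        return max_classes(schema, operators=2)

    def classes_required(q, w=40):
        # Equation (3): k = floor(q / w), with at least w = 40 sample queries
        # per class for select queries [8].
        return q // w

    def queries_per_class(q, K):
        # Equation (1): average number of sample queries per initial class.
        return q / K

    # Example: 10 relations with 10 attributes each, i.e. a = 100 attributes, as
    # in the introduction; sampling the a = 100 non-equi classes alone at 40
    # queries per class would already need 100 * 40 = 4000 queries.
    schema = {f"R{i}": [f"a{j}" for j in range(1, 11)] for i in range(1, 11)}
    print(max_classes(schema))            # 400 = 4a
    print(preliminary_classes(schema))    # 200 = 2a
    print(classes_required(q=80))         # 2 classes for a budget of 80 queries
    print(queries_per_class(q=80, K=8))   # 10.0 queries per class, as in the appendix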

5 Hierarchical Clustering Algorithm

Hierarchical clustering has been applied successfully in several applications. It may yield suboptimal solutions, but its great advantage is that it has polynomial running time. In any hierarchical clustering algorithm, one is required to define a matrix of similarity values [4, 1]. Based on these values, the two clusters of "entities" which have the highest similarity are grouped together into a new cluster. The semantics of similarity are problem-dependent: it could be Euclidean distance, correlation, and so on [1]. In our case, it is the average of the root mean squared errors (RMS) of the cost functions:

    average RMS = ( Σ_{i=1}^{c} RMS_i · n_i ) / ( Σ_{i=1}^{c} n_i )    (4)

where n_i is the number of queries in each class, c is the number of classes, and each RMS_i is defined as:

    RMS_i = sqrt( ( Σ_{j=1}^{n_i} (t_j − t̂_j)² ) / n_i )    (5)

where t_j is the real observed running time of query j and t̂_j is the running time predicted by the cost function.

The HCA algorithm is given in Figure 4. The initial query classes in O are all the query subclasses, such as Q11, Q12, ..., Q17 for query set Q1 (see Figure 2(a)) and Q21, Q22, ..., Q29 for query set Q2 (see Figure 2(b)). The algorithm starts with each initial query class placed in its own cluster and, in each iteration, combines two existing clusters into a new cluster. That is, the algorithm starts with |O| clusters, then |O| − 1, |O| − 2, ..., until the number of clusters is equal to the desired number of query classes k. Table 1 shows how the matrix M is updated from 5 × 5 to 4 × 4. Note that M is symmetric.

Figure 4: Algorithm HCA(k)
    let k be the number of query classes required
    let O be the set of initial query classes to be clustered
    let C_i, C_c, C_ij denote clusters of initial query classes
    let M(C_i, C_j) be the matrix of average RMS errors, of size n × n, 1 ≤ n ≤ |O|
    place each initial query class in O in its own cluster C_i, where i = 1..|O|
    num_clus ← |O|
    for each pair of clusters C_i, C_j do
        compute the average error RMS of M(C_i, C_j)
    endfor
    while num_clus > k do
        choose the two clusters C_i, C_j with the least average RMS in matrix M
        update matrix M by grouping C_i and C_j into a new cluster C_ij
        for each cluster C_c such that c ≠ ij do
            compute the average error RMS of M(C_c, C_ij)
        endfor
        num_clus ← num_clus − 1
    endwhile

Table 1: Grouping C2 and C5. (a) Before the grouping, M is a 5 × 5 symmetric matrix over the clusters C1, C2, C3, C4 and C5; (b) after grouping C2 and C5 into the new cluster C25, M is a 4 × 4 matrix over C1, C25, C3 and C4.

6 Experimental Results

The main aim of the experiments is to see how well the HCA algorithm performs in reducing the average error of the query cost functions. To do this, we compare the average error obtained with the manual query classification [8] against the average error obtained with the query classification produced by HCA. In each experiment, as the number of queries fed into algorithm HCA increases in multiples of 40 (namely 40, 80, 120, and so on), the number of classes increments by 1. Table 2 shows the three different database configurations. Each relation has 5-10 attributes; the numbers of tuples per relation are given in Table 2. The number of queries in each experiment varies from 1200 to 2000 for each individual class of clustered index, non-clustered index and non-index queries. We used around 30% of the queries as the sample set to derive the cost functions and the remainder as the test set for measuring the average error yielded by the cost functions. The results for the three database configurations are shown in Figures 5, 6 and 7, respectively. The results show a tendency for the average error to decrease as the number of query classes increases. To illustrate, let us describe the graphs in Figure 5 as an example.
In graph 5(a), the maximum number of classes is 7, as labelled at the rightmost point of the X-axis. This number stems from the total number of clustered index attributes, which is the sum of the values in the K1 row of Table 2(a). For graph 5(b), the maximum number of classes is 10, which is the total number of non-clustered indexes in the K2 row of Table 2(a). As for graph 5(c), we show only part of the maximum number of classes of non-index attributes, i.e., 10 out of 307 = 4 × 81 − (7 + 10). The solid line in, for example, graph 5(a) shows the average errors when there is only a single class, compared with the errors of the dashed line with multiple classes (2-7 in Figure 5(a)). In most of the cases with different numbers of query classes, the HCA algorithm clearly reduces the average error and thus provides better cost estimates than a single class does. Out of 81 cases, the classifications yielded by HCA give lower average errors in 71 cases and worse errors in only 7 cases. The reason for the 7 cases with worse errors could be that the number of queries is initially not sufficient; once it reaches a sufficient number, the average errors produced by multiple classes again become lower than those produced by a single class (see Figure 6(c), for example).

Table 2: Different database configurations. For each relation of Database 1 (R1-R10), Database 2 (R1-R11) and Database 3 (R1-R12), the table gives the number of tuples, K1 = the number of clustered index attributes, K2 = the number of non-clustered index attributes, and the number of other (non-index) attributes.

Figure 5: Database 1: average errors for (a) the clustered index class, (b) the non-clustered index class and (c) the non-index class; the "ZL" curves use a single class (manual classification) and the "HCA" curves use the multiple classes produced by HCA.

Figure 6: Database 2: average errors for (a) the clustered index class, (b) the non-clustered index class and (c) the non-index class.

Figure 7: Database 3: average errors for (a) the clustered index class, (b) the non-clustered index class and (c) the non-index class.

7 Conclusion and Future Research

This paper addressed the derivation of query cost functions for multidatabase systems. The contributions of the paper are the following:
- We propose the use of a hierarchical clustering algorithm to perform query classification, which achieves better performance in reducing the average error of query cost estimation than the manual classification.
- We propose a query classification which can be used with both conforming and non-conforming database systems. Especially for non-conforming systems, query classification has not been tackled successfully before.

There are several issues of interest that we plan to investigate further:
- Extend the current method to use non-linear regression techniques, to compare with the linear regression technique we currently use. The reason is that there are cases where the distribution of attribute values may not be uniform and therefore the running times of the queries in a class may not be linear in x1 and x2; non-linear regression techniques could then be better at finding best-fit cost functions.
- Compare the HCA algorithm with other classification algorithms, such as the partitioning algorithm in [3] or the algorithms used in machine learning.
- Investigate how many sampled queries are "sufficient" for each query class.

- Investigate how the cost functions of the multiple query classes derived by HCA affect the choice of query execution plans, as compared to the three cost functions of the individual classes (clustered index, non-clustered index and non-index) derived by the manual classification.
- Investigate the use of other cost metrics instead of just the elapsed running time.

The HCA algorithm runs in polynomial time to produce the cost functions of each query class, and this is an advantage when we want to combine the algorithm with a non-linear regression technique, which is likely to be slower than the linear regression technique in finding best-fit cost functions.

Acknowledgements

We would like to thank Christopher R. Birchenhall from the University of Manchester, UK, for his superb state-of-the-art C++ matclass package, which he has made publicly available together with his excellent manual. His maths class library contains several useful linear LSE functions which helped our project, such as SVD, QR and LU decompositions.

References

[1] M.R. Anderberg. Cluster Analysis for Applications. Academic Press.
[2] W. Du, R. Krishnamurthy and M.C. Shan. Query Optimization in Heterogeneous DBMS. In Proceedings of the 18th VLDB Conference, pages 277-291.
[3] J.A. Hartigan. Clustering Algorithms. John Wiley & Sons.
[4] S.C. Johnson. Hierarchical clustering schemes. Psychometrika, Volume 32, Number 3, pages 241-254, September.
[5] H. Lu, B.C. Ooi and C.H. Goh. Multidatabase Query Optimization: Issues and Solutions. In Proceedings of the Third International Workshop on Research Issues in Data Engineering: Interoperability in Multidatabase Systems, pages 137-143.
[6] D.E. Shasha. Database Tuning: A Principled Approach, Chapter 3, pages 53-88. Prentice Hall, Englewood Cliffs, New Jersey.
[7] S.K. Thompson. Sampling: Basic and Advanced Sampling Methods. John Wiley & Sons, Inc.
[8] Q. Zhu and P.A. Larson. A Query Sampling Method for Estimating Local Cost Parameters in a Multidatabase System. In Proceedings of the International Conference on Data Engineering, pages 144-153.

Table 3: Queries, the numbers of input and output tuples, and running times. For each class C1-C8 the table lists the sampled queries (column 2), the number of tuples of the input relation x1 (column 3), the number of tuples of the output relation x2 (column 4) and the running time t (column 5). For example, class C1 contains the queries "where r1.a5 = 39138", "where r1.a5 = 38464" and "where r1.a5 = 38828"; class C2 contains "where r5.e7 = 13006", "where r5.e7 = 26025" and "where r5.e7 = 32182"; and similarly for C3 (r7.g3), C4 (r8.h2), C5 (r9.i3), C6 (r10.j9), C7 (r11.k5) and C8 (r12.l3).

Appendix: Example

This appendix demonstrates how to apply the HCA algorithm to obtain a solution of query classes with a low average error. In database configuration 3 (see Table 2(c)), there are 8 clustered indexes, namely r1.a5, r5.e7, r7.g3, r8.h2, r9.i3, r10.j9, r11.k5 and r12.l3. Each clustered index forms one initial query class; that is, queries which use index r1.a5 are in query class C1, queries which use index r5.e7 are in C2, and so on. Table 3 shows the queries⁴, the number of tuples of the input relation x1, the number of tuples of the output relation x2 and their running times, in columns 2, 3, 4 and 5, respectively. Note that we show only 3 queries (out of 10) for each query class. The total number of queries we were willing to sample in this experiment is 80; divided by 8 (the number of clustered indexes), this gives 10 queries for each class. Initially, HCA placed the 8 initial query classes in their own clusters, as shown in the first column of Table 4.
The second, third and fourth columns in the table are the regression coefficients of the equation t̂ = f(x1, x2) = β0 + β1·x1 + β2·x2.

⁴ Recall that select queries are of the form π_L(σ_F(R_i)). To keep Table 3 concise, we omit the lists of projected attributes of the queries.
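As an aside, the following is a minimal sketch of how such coefficients, and the per-cluster RMS of equation (5), can be computed. The authors used the linear LSE routines of the matclass C++ library (see the acknowledgements); the numpy-based version and the numbers below are purely our own illustration.

    # Least-squares fit of a per-cluster cost function
    #     t_hat = f(x1, x2) = b0 + b1*x1 + b2*x2
    # and its RMS error (equation (5)).  Illustrative only: the paper used the
    # matclass C++ LSE routines, and the observations below are invented.
    import numpy as np

    def fit_cost_function(x1, x2, t):
        # Returns (b0, b1, b2) minimising the sum of squared errors.
        X = np.column_stack([np.ones_like(x1), x1, x2])
        coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
        return coeffs

    def rms_error(coeffs, x1, x2, t):
        # Equation (5): root mean squared error of the fitted cost function.
        t_hat = coeffs[0] + coeffs[1] * x1 + coeffs[2] * x2
        return float(np.sqrt(np.mean((t - t_hat) ** 2)))

    # Hypothetical observations for one cluster: input tuples, output tuples, seconds.
    x1 = np.array([38000.0, 39000.0, 38500.0, 40000.0, 37500.0])
    x2 = np.array([120.0, 140.0, 135.0, 150.0, 110.0])
    t = np.array([1.9, 2.1, 2.0, 2.2, 1.8])
    b = fit_cost_function(x1, x2, t)
    print(b, rms_error(b, x1, x2, t))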

Table 4: The initial clusters ({C1}, {C2}, ..., {C8}) and their regression coefficients β0, β1 and β2.

By "learning" from the numbers of input and output tuples, x1 and x2, of the queries and their running times t, regression formulas (cost functions) are found by the LSE method for each individual cluster. These formulas can then be employed on-line to estimate the running times of unseen queries.

In the first iteration of HCA, the query classes C3 and C8 were merged into the same cluster, as shown in Table 5. The reason C3 and C8 were merged is that, compared with the other possible mergings (such as C1 with C2, C1 with C3, and so on), the merging of C3 and C8 gave the least average error RMS. In addition, HCA recomputed the coefficients of the new cluster comprising C3 and C8.

Table 5: The clusters after the first iteration ({C1}, {C2}, {C4}, {C5}, {C6}, {C7}, {C3, C8}) and their regression coefficients.

Due to the limited length of the paper, we omit the outputs of the second to fifth iterations and show only the last (sixth) iteration in Table 6, which yields the final solution produced by HCA. Recall that HCA stops when the number of clusters (num_clus) is less than or equal to the desired number of clusters, here 2 (calculated by k = ⌊q/w⌋ = ⌊80/40⌋ = 2).

Table 6: The final clusters ({C1} and {C2, C3, C4, C5, C6, C7, C8}) and their regression coefficients.
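To round off the appendix, here is a compact, runnable sketch of the HCA procedure of Figure 4, using the weighted-average RMS of equations (4) and (5) as the merge criterion. It is our own illustration rather than the authors' implementation: the pairwise matrix M of Figure 4 is replaced by brute-force recomputation for brevity, the least-squares fit uses numpy, and all query data are invented.

    # Illustrative sketch of algorithm HCA (Figure 4): agglomerative clustering
    # of initial query classes, merging at each step the pair whose grouping
    # gives the lowest weighted-average RMS (equations (4) and (5)).
    import numpy as np
    from itertools import combinations

    def fit_and_rms(queries):
        # Fit t_hat = b0 + b1*x1 + b2*x2 by least squares; return RMS (eq. (5)).
        x1, x2, t = (np.array([q[k] for q in queries], dtype=float) for k in (0, 1, 2))
        X = np.column_stack([np.ones_like(x1), x1, x2])
        coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
        return float(np.sqrt(np.mean((t - X @ coeffs) ** 2)))

    def average_rms(clusters):
        # Equation (4): per-cluster RMS weighted by the number of queries.
        sizes = [len(c) for c in clusters]
        return sum(fit_and_rms(c) * n for c, n in zip(clusters, sizes)) / sum(sizes)

    def hca(initial_classes, k):
        # Merge clusters until only k remain, greedily minimising the average RMS.
        clusters = [list(c) for c in initial_classes]
        while len(clusters) > k:
            best = None
            for i, j in combinations(range(len(clusters)), 2):
                trial = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
                trial.append(clusters[i] + clusters[j])
                err = average_rms(trial)
                if best is None or err < best[0]:
                    best = (err, i, j)
            _, i, j = best
            clusters = ([c for idx, c in enumerate(clusters) if idx not in (i, j)]
                        + [clusters[i] + clusters[j]])
        return clusters

    # Each query is a tuple (x1, x2, t); the numbers are invented.  With k = 2,
    # the pair of clusters whose grouping gives the lowest average RMS is merged.
    C1 = [(1000, 10, 5.0), (1200, 15, 5.4), (900, 8, 4.8), (1100, 12, 5.2)]
    C2 = [(5000, 200, 2.2), (5200, 210, 2.3), (4800, 190, 2.1), (5100, 205, 2.25)]
    C3 = [(5050, 198, 2.18), (4950, 202, 2.22), (5150, 207, 2.3), (4900, 195, 2.15)]
    print(hca([C1, C2, C3], k=2))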


More information

Data integration supports seamless access to autonomous, heterogeneous information

Data integration supports seamless access to autonomous, heterogeneous information Using Constraints to Describe Source Contents in Data Integration Systems Chen Li, University of California, Irvine Data integration supports seamless access to autonomous, heterogeneous information sources

More information

Learning in Medical Image Databases. Cristian Sminchisescu. Department of Computer Science. Rutgers University, NJ

Learning in Medical Image Databases. Cristian Sminchisescu. Department of Computer Science. Rutgers University, NJ Learning in Medical Image Databases Cristian Sminchisescu Department of Computer Science Rutgers University, NJ 08854 email: crismin@paul.rutgers.edu December, 998 Abstract In this paper we present several

More information

Evaluation of Power Consumption of Modified Bubble, Quick and Radix Sort, Algorithm on the Dual Processor

Evaluation of Power Consumption of Modified Bubble, Quick and Radix Sort, Algorithm on the Dual Processor Evaluation of Power Consumption of Modified Bubble, Quick and, Algorithm on the Dual Processor Ahmed M. Aliyu *1 Dr. P. B. Zirra *2 1 Post Graduate Student *1,2, Computer Science Department, Adamawa State

More information

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE Head-Eye Coordination: A Closed-Form Solution M. Xie School of Mechanical & Production Engineering Nanyang Technological University, Singapore 639798 Email: mmxie@ntuix.ntu.ac.sg ABSTRACT In this paper,

More information

Lecture 22 - Oblivious Transfer (OT) and Private Information Retrieval (PIR)

Lecture 22 - Oblivious Transfer (OT) and Private Information Retrieval (PIR) Lecture 22 - Oblivious Transfer (OT) and Private Information Retrieval (PIR) Boaz Barak December 8, 2005 Oblivious Transfer We are thinking of the following situation: we have a server and a client (or

More information

Revisiting Pipelined Parallelism in Multi-Join Query Processing

Revisiting Pipelined Parallelism in Multi-Join Query Processing Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu Elke A. Rundensteiner Department of Computer Science, Worcester Polytechnic Institute Worcester, MA 01609-2280 (binliu rundenst)@cs.wpi.edu

More information

On Color Image Quantization by the K-Means Algorithm

On Color Image Quantization by the K-Means Algorithm On Color Image Quantization by the K-Means Algorithm Henryk Palus Institute of Automatic Control, Silesian University of Technology, Akademicka 16, 44-100 GLIWICE Poland, hpalus@polsl.gliwice.pl Abstract.

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Alternating Projections

Alternating Projections Alternating Projections Stephen Boyd and Jon Dattorro EE392o, Stanford University Autumn, 2003 1 Alternating projection algorithm Alternating projections is a very simple algorithm for computing a point

More information

Wireless Sensor Networks Localization Methods: Multidimensional Scaling vs. Semidefinite Programming Approach

Wireless Sensor Networks Localization Methods: Multidimensional Scaling vs. Semidefinite Programming Approach Wireless Sensor Networks Localization Methods: Multidimensional Scaling vs. Semidefinite Programming Approach Biljana Stojkoska, Ilinka Ivanoska, Danco Davcev, 1 Faculty of Electrical Engineering and Information

More information

Symbolic Evaluation of Sums for Parallelising Compilers

Symbolic Evaluation of Sums for Parallelising Compilers Symbolic Evaluation of Sums for Parallelising Compilers Rizos Sakellariou Department of Computer Science University of Manchester Oxford Road Manchester M13 9PL United Kingdom e-mail: rizos@csmanacuk Keywords:

More information

Online Feedback for Nested Aggregate Queries with. Dept. of Computer Science, National University of Singapore. important.

Online Feedback for Nested Aggregate Queries with. Dept. of Computer Science, National University of Singapore. important. Online Feedback for Nested Aggregate Queries with Multi-Threading y Kian-Lee Tan Cheng Hian Goh z Beng Chin Ooi Dept. of Computer Science, National University of Singapore Abstract In this paper, we study

More information

Skill. Robot/ Controller

Skill. Robot/ Controller Skill Acquisition from Human Demonstration Using a Hidden Markov Model G. E. Hovland, P. Sikka and B. J. McCarragher Department of Engineering Faculty of Engineering and Information Technology The Australian

More information

Object classes. recall (%)

Object classes. recall (%) Using Genetic Algorithms to Improve the Accuracy of Object Detection Victor Ciesielski and Mengjie Zhang Department of Computer Science, Royal Melbourne Institute of Technology GPO Box 2476V, Melbourne

More information

Finding a winning strategy in variations of Kayles

Finding a winning strategy in variations of Kayles Finding a winning strategy in variations of Kayles Simon Prins ICA-3582809 Utrecht University, The Netherlands July 15, 2015 Abstract Kayles is a two player game played on a graph. The game can be dened

More information

8. Relational Calculus (Part II)

8. Relational Calculus (Part II) 8. Relational Calculus (Part II) Relational Calculus, as defined in the previous chapter, provides the theoretical foundations for the design of practical data sub-languages (DSL). In this chapter, we

More information

Artificial Neural Networks Lecture Notes Part 5. Stephen Lucci, PhD. Part 5

Artificial Neural Networks Lecture Notes Part 5. Stephen Lucci, PhD. Part 5 Artificial Neural Networks Lecture Notes Part 5 About this file: If you have trouble reading the contents of this file, or in case of transcription errors, email gi0062@bcmail.brooklyn.cuny.edu Acknowledgments:

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Join algorithm costs revisited

Join algorithm costs revisited The VLDB Journal (1996) 5: 64 84 The VLDB Journal c Springer-Verlag 1996 Join algorithm costs revisited Evan P. Harris, Kotagiri Ramamohanarao Department of Computer Science, The University of Melbourne,

More information

The performance of xed block size fractal coding schemes for this model were investigated by calculating the distortion for each member of an ensemble

The performance of xed block size fractal coding schemes for this model were investigated by calculating the distortion for each member of an ensemble Fractal Coding Performance for First Order Gauss-Markov Models B E Wohlberg and G de Jager Digital Image Processing Laboratory, Electrical Engineering Department, University of Cape Town, Private Bag,

More information

An Introduction to Growth Curve Analysis using Structural Equation Modeling

An Introduction to Growth Curve Analysis using Structural Equation Modeling An Introduction to Growth Curve Analysis using Structural Equation Modeling James Jaccard New York University 1 Overview Will introduce the basics of growth curve analysis (GCA) and the fundamental questions

More information