
[12] P. Larson. Linear hashing with partial expansions. In Proceedings of the 6th International Conference on Very Large Databases, pages 224–232.
[13] W. Litwin. Linear hashing: a new tool for file and table addressing. In Proceedings of the 6th International Conference on Very Large Databases, pages 212–223.
[14] J. W. Lloyd. Optimal partial-match retrieval. BIT, 20:406–413.
[15] J. W. Lloyd and K. Ramamohanarao. Partial-match retrieval for dynamic files. BIT, 22:150–168.
[16] T. H. Merrett. Why sort-merge gives the best implementation of the natural join. SIGMOD Record, 13(2):39–51, January.
[17] S. Moran. On the complexity of designing optimal partial-match retrieval systems. ACM Transactions on Database Systems, 8(4):543–551, December.
[18] S. Nahar, S. Sahni, and E. Shargowitz. Experiments with simulated annealing. In Proceedings of the 22nd Design Automation Conference, pages 748–752.
[19] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38–71, March.
[20] K. Ramamohanarao and J. W. Lloyd. Dynamic hashing schemes. The Computer Journal, 25:478–485.
[21] K. Ramamohanarao, J. Shepherd, and R. Sacks-Davis. Multi-attribute hashing with multiple file copies for high performance partial-match retrieval. BIT, 30:404–423.
[22] R. L. Rivest. Partial-match retrieval algorithms. SIAM Journal on Computing, 5(1):19–50, March.
[23] T. J. Sager. A polynomial time generator for minimal perfect hash functions. Communications of the ACM, 28(5):523–532, May.
[24] J. A. Thom, K. Ramamohanarao, and L. Naish. A superjoin algorithm for deductive databases. In Proceedings of the 12th International Conference on Very Large Databases, pages 189–196, Kyoto, Japan, August.
[25] K.-Y. Whang and R. Krishnamurthy. The multilevel grid file: a dynamic hierarchical multidimensional file structure. In International Symposium on Database Systems for Advanced Applications, pages 449–459, Tokyo, Japan, April.

If we use the hash-join algorithm then the optimal bit allocation may be determined using heuristic techniques in a feasible amount of time. Additionally, we have demonstrated that the hash-join algorithm is likely to be faster than the sort-merge algorithm when the clustering of the data using multi-attribute hashing is optimal. In Table 10 we have shown that the optimal bit allocation using the hash-join algorithm will typically provide an order of magnitude improvement over the implementation of the algorithm not using an index, even when only two attributes are involved. When many attributes are involved the improvement can be several orders of magnitude when appropriate buffer sizes are used.

References

[1] A. V. Aho and J. D. Ullman. Optimal partial-match retrieval when fields are independently specified. ACM Transactions on Database Systems, 4(2):168–179, June.
[2] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, September.
[3] M. W. Blasgen and K. P. Eswaran. Storage and access in relational databases. IBM Systems Journal, 16(4).
[4] K. Bratbergsengen. Hashing methods and relational algebra operations. In Proceedings of the 10th International Conference on Very Large Databases, pages 323–333, Singapore, August.
[5] W. A. Burkhard. Interpolation-based index maintenance. BIT, 23:274–294.
[6] R. Cichelli. Minimal perfect hash functions made simple. Communications of the ACM, 23(1):17–19, January.
[7] C. Faloutsos. Multiattribute hashing using gray codes. In Proceedings of SIGMOD '86, pages 227–238.
[8] M. Freeston. The BANG file: a new kind of grid file. In U. Dayal and I. Traiger, editors, Proceedings of SIGMOD '87, pages 260–269, San Francisco, California, USA, May.
[9] E. P. Harris and K. Ramamohanarao. Optimal dynamic multi-attribute hashing for range queries. Technical Report 91/34, Department of Computer Science, The University of Melbourne, Parkville, Victoria 3052, Australia, December. Also published as CITRI Technical Report 91/7.
[10] Y. Hsiao and A. L. Tharp. Adaptive hashing. Information Systems, 13(1):111–127.
[11] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, USA.

in the tree for the expected file size generated using simulated annealing, and use the MMI algorithm when allocating attributes to additional levels and the maximal marginal decrease (MMD) algorithm when allocating attributes to the upper levels of the tree (and for smaller file sizes). Maximal marginal decrease works in a similar way to MMI, except that, instead of increasing the number of bits (or levels), the number of bits is decremented. It works by decrementing the number of bits allocated to each attribute in turn and calculating the new cost. The bit allocation which results in the best cost value is the one chosen. Thus, if the optimal allocation for n bits is known, the "optimal" allocation for n − 1 bits can be determined, then n − 2 bits, and so on down to the first bit to be allocated. This general technique can be used to provide better indexes for the grid file [19], the BANG file [8], and the multilevel grid file [25], in addition to the k-d tree and other similar indexing schemes in which a choice must be made in determining which attribute to split the data on.

10 Conclusion.

We have described a method by which the optimal clustering arrangement may be found for performing a join operation using either the sort-merge or hash-join algorithms on data files indexed using multi-attribute hashing. We have also described how our method may be applied to enhance the performance of other indexing schemes, such as the grid and BANG files.

First we consider the sort-merge join algorithm. In the case in which there are no maximally allocated attributes and the buffer size is small, the index consists entirely of bits from a single attribute. This attribute is determined by calculating the total probability of each attribute occurring first in a join operation. The attribute with the greatest probability is the one used to create the index. In this case we ignore the fact that the number of bits allocated to an attribute must be greater than d − log₂ B if it is the only attribute allocated any bits. In the case when the number of bits allocated to an attribute may be greater than d − log₂ B, then d − log₂ B bits should be allocated to the attribute and the remainder allocated to the next most probable attribute (assuming that number is not greater than d − log₂ B). When there are attributes which may be maximally allocated, a number of calculations must be performed to deal with these, because the bits will not necessarily be allocated to a single attribute.

An additional feature of this method is that if the hashing functions are order preserving, the index is also optimal for sorting the records based on certain attributes, assuming that the probability of being asked to sort the records has been taken into account when determining the optimal index. One of the primary advantages of this method is that, unlike some optimal indexes for partial-match queries [9, 21], it is extremely quick and easy to calculate. In addition, it is appropriate to use with dynamic files [20] when a single primary attribute is being used to index the file, enabling it to easily use the dynamic properties of linear hashing with its relatively low maintenance costs.
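In outline, the MMD procedure described above can be sketched as follows. The `cost` argument stands in for whichever average-cost function is being minimised (for example, Equation 7); the quadratic `toy_cost` in the example is purely illustrative and not part of the paper's cost model.

```python
# Sketch of maximal marginal decrease (MMD): starting from a known
# allocation of n bits, repeatedly remove the single bit whose removal
# yields the best (lowest) cost, giving "optimal" allocations for
# n, n - 1, ..., 0 bits in turn.

def mmd_schedule(alloc, cost):
    """Yield allocations for n, n - 1, ..., 0 allocated bits.

    alloc: dict mapping attribute name -> number of bits allocated.
    cost:  function taking such a dict and returning its cost.
    """
    alloc = dict(alloc)
    yield dict(alloc)
    while sum(alloc.values()) > 0:
        best = None
        for attr in alloc:
            if alloc[attr] == 0:
                continue
            alloc[attr] -= 1            # trial: remove one bit from attr
            c = cost(alloc)
            if best is None or c < best[0]:
                best = (c, attr)
            alloc[attr] += 1            # undo the trial
        alloc[best[1]] -= 1             # commit the best decrement
        yield dict(alloc)

# Illustrative only: a toy cost that favours balanced allocations.
toy_cost = lambda a: sum(v * v for v in a.values())
steps = list(mmd_schedule({"A1": 3, "A2": 2}, toy_cost))
```

Applied to the real cost function, the same loop walks the allocation down from the full d bits, which is how the tree levels above a known-good allocation can be filled in.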

Figure 2: A k-d tree split on attributes A2, then A1, then A1 (internal nodes show the attribute and value on which the data is split; the leaves contain the data points).

9 Other indexing schemes.

Other indexing schemes can benefit from using our approach to create indexes which will perform better for join queries with a given join probability distribution than indexes created without knowledge of the distribution. These other indexing schemes can be characterised by a tree structure. At each node of the tree, a choice must be made as to which attribute is to be chosen to perform the partitioning on to store the descendants of the node. This corresponds directly to choosing which attribute to allocate a bit from in the multi-attribute hashing scheme. For example, consider a derivative of the k-d tree [2] in Figure 2, in which the internal nodes represent the attribute and value on which the data is split, and the leaves contain the actual data. We assume that this partitioning gives the optimal performance for join queries for a given join query distribution. At each level of the tree a decision has been made as to the attribute to partition the data on. If the data were indexed using multi-attribute hashing, the hash index would be composed of the first bit of attribute two, then the first and second bits of attribute one. This would be the optimal multi-attribute hashing index for the same join query distribution.

The additional information which tree structure techniques require, in addition to how many times an attribute must be used to partition on, is the order in which the partitioning must be performed. This is also required of the multi-attribute hashing index when it dynamically changes size. There are two possible methods of determining the appropriate order. The first is to use the MMI algorithm to determine the optimal attribute to split on at each level of the tree. As was shown in Section 7.2, this does not always lead to the optimal results. The second is to start with the optimal allocation of attributes to levels

greatest probability for the join operation just containing that attribute. These results support our hypothesis.

We note that the improvements in the cost for the optimal bit allocation over the three naive methods, HIGH, PROB and EVEN, are often very large, particularly for larger numbers of attributes. While either PROB or EVEN is often close to optimal, when the number of attributes is greater than three one of them is typically significantly larger than the optimal bit allocation, depending on the probability distribution. Thus, we believe determining the optimal bit allocation is worthwhile, particularly as whether PROB or EVEN is near optimal depends on the nature of the probability distribution. The improvement of the optimal bit allocation over the unindexed cost is very large. For example, in Table 10 it is several orders of magnitude when seven attributes are involved and the buffer size is large.

It is interesting to compare the average cost of sorting in Table 4 and the average cost of partitioning in Table 9. Note that the cost when there is no index is the same. However, as k would not typically be the same for both methods, a direct comparison is not necessarily appropriate. The results show that the optimal bit allocation for the hash-join method involves significantly fewer disk accesses for the partition phase than the optimal bit allocation for the sorting phase of the sort-merge method, assuming the values of k do not differ greatly. Also, as the buffer size increases, the relative improvement over the unindexed cost increases for the hash-join algorithm whereas it decreases for the sort-merge algorithm. Additionally, as we noted in Section 5, if the cost of partitioning one relation is zero, then the second of the two relations involved in the join does not need to be partitioned to achieve an optimal join using the hash-join technique. The optimal indexes created using the above technique tend to maximise the number of queries which do not need to be partitioned. This will result in significantly higher join performance than may at first be indicated, as the second relation need not be partitioned for these queries. Given that the cost of reading and writing the two relations after the sort/partition phase is the same, the hash-join method should be much faster. Thus, we support the conclusion of Bratbergsengen [4] that hash-join is faster than sort-merge under equivalent circumstances. In the presence of optimal indexes the hash-join method of joining will have a lower average join cost than the sort-merge method.

Unlike the sort-merge method, which tends to allocate a large number of bits to only a few attributes and very few bits to the other attributes, the results obtained for the hash-join partitioning are much more like those obtained when generating optimal bit allocations for partial-match queries [15]. Thus, we would expect that if multiple copies of the data could be made with different indexes, the procedure described in [21], using Equation 9 as a basis, could be used to determine the bit allocations in each file.

Table 10: Cost of the hash-join partitioning for various buffer sizes, k = 2, d = 13 (columns: n, then Cost and Impr. for B = 32, 64, 128 and 256).

Table 11: Computational times in seconds of the hash-join partitioning for various algorithms, B = 32, k = 2, d = 13 (columns: n, then MMI and the three SANN variants).

join operations possible when five attributes are considered are of this type. If join operations involving all but two of the attributes are included, the number becomes 300 out of 325. The cost of partitioning in the join operations involving all the attributes will be zero, and the cost of partitioning in the join operations involving all but one of the attributes will be zero if the number of bits allocated to that one attribute is no greater than log₂ B. If the probabilities of the join operations are (approximately) uniformly distributed, then to enable the cost of each join operation in this set to be zero, no more than log₂ B bits should be allocated to any attribute, in which case the cost of the above partition phase will be zero.

The second of the distributions with five attributes was originally created to test this hypothesis. As described above in Section 7.2, it was generated using a skew distribution. This distribution was randomly generated; however, the probability of any given join operation was inversely proportional to the number of attributes in the join operation. The probabilities of the five join operations containing only one attribute outweighed any of the other probabilities, and greatly outweighed the sum of the probabilities of the 240 join operations containing all of the attributes or all but one of the attributes. Additionally, there were no constraints placed on the maximum number of bits any attribute may have. As expected, d − log₂ B bits were allocated to the attribute with the greatest probability for the join operation just containing that attribute, and the remaining log₂ B bits were allocated to the attribute with the second

Table 9: Cost of the hash-join partitioning for various attributes with constraints, k = 2, d = 13 (for B = 32 and B = 64: the bit allocation, cost and improvement found by MMI, the three SA variants, HIGH, PROB and EVEN, together with the constraints on each attribute and the unindexed cost, NONE).

therefore the index should always be used to partition the data to minimise the cost. It follows that the cost of the partitioning for a join operation on a set of attributes q is

    C(q) = k·2^d·log_B(∏_{i∉q} 2^{d_i})   if ∏_{i∉q} 2^{d_i} > B
         = 0                              if ∏_{i∉q} 2^{d_i} ≤ B

which, in the first case, simplifies to

    C(q) = α(d − Σ_{i∈q} d_i), where α = k·2^d / log₂ B.    (8)

Note that if the index-based partition size, m = ∏_{i∉q} 2^{d_i}, is less than the buffer size, then no partitioning needs to be performed. Applying constraints to the number of bits which may be allocated to any attribute has no effect on the method used to determine the optimal bit allocation. Combining Equations 1 and 8, the average cost of partitioning is

    C_P = Σ_{i∈P} p_i C(i)    (9)

where P is the set of all possible joins which may be performed on the relation, and p_i is the probability of the join i. Note that if we assume that ∏_{i∉q} 2^{d_i} > B for all joins, Equation 9 is similar to Equation 7. Therefore, we should be able to determine the optimal bit allocation using a similar approach, that is, heuristically using either the MMI or simulated annealing algorithms. Again we tested both methods to determine whether they are applicable.

8.2 Computational results of heuristic bit allocation.

We obtained a series of computational results using exactly the same join operation probability distributions as in Section 7.2. We used Equation 9 to determine the average cost of the partition phase, and again set d = 13, B = 32, 64, 128 and 256, and k = 2, and assumed that the page size is 4k. The same values for T, S and L shown in Table 3 were used by the simulated annealing algorithms. The results were again generated on a lightly loaded Silicon Graphics 4D/340 with 128Mb of main memory. The results are presented in Tables 9, 10 and 11. The times shown are the sum of the user and system time measured in seconds.
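Equations 8 and 9 can be evaluated directly. The sketch below assumes the bit allocation is given as a dictionary of per-attribute bit counts; the function and parameter names are ours, not the paper's.

```python
import math

# Sketch of Equation 8 (cost of the hash-join partition phase) and
# Equation 9 (its average over a join distribution).

def partition_cost(q, d_bits, d, B, k):
    """Equation 8: partition cost for a join on the attribute set q.

    q:      set of attribute names used in the join
    d_bits: dict mapping attribute name -> bits allocated to it
    d:      total number of index bits (the file occupies 2^d pages)
    B:      buffer size in pages
    k:      constant of the cost model
    """
    # Index-based partition size: the product of 2^d_i over i not in q.
    rest = sum(bits for attr, bits in d_bits.items() if attr not in q)
    if 2 ** rest <= B:
        return 0.0                  # partitions already fit in the buffer
    # k * 2^d * log_B(2^rest); this equals alpha * (d - sum_{i in q} d_i)
    # when all d bits are allocated, with alpha = k * 2^d / log2(B).
    return k * 2 ** d * rest / math.log2(B)

def average_partition_cost(joins, d_bits, d, B, k):
    """Equation 9: joins is a list of (probability, attribute-set) pairs."""
    return sum(p * partition_cost(q, d_bits, d, B, k) for p, q in joins)
```

With d = 13, B = 32 and k = 2 as in the experiments, an allocation of eight bits to one attribute and five to another gives zero partition cost for a join on the first attribute (2^5 = 32 ≤ B) and a nonzero cost for a join on the second.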
The results show that the number of bits allocated to each attribute appears to be based heavily on the constant log₂ B (which is either five or six for these results). If an attribute is not allocated this number of bits, the sum of the numbers of bits allocated to two attributes is often log₂ B. We conjecture that this is because the partition phase of the hash-join operation does not need to be performed if the sum of the numbers of bits allocated to the join attributes is at least d − log₂ B. A large number of the join operations involve either all of the attributes, or all but one of the attributes. For example, 240 of the

performed on the data to partition the data into blocks small enough to be merged in memory can be derived in a manner similar to that of determining the optimal number of merge phases in the merge sort [4].

The hash-join method partitions data using a hash function. Multi-attribute hashing also partitions data using a hash function. Therefore we should be able to create an index which will reduce the amount of partitioning which needs to be performed by the hash-join algorithm. Instead of commencing with one data file which needs to be partitioned, we can start with a number of smaller data files which are distinguished by having different indexes for the appropriate attributes. The superjoin method [24] does this; however, Thom et al. do not attempt to construct an optimal index for the data. We will shortly show how the optimal index may be determined.

8.1 Cost of partitioning in the hash-join method.

The cost of partitioning the data in the hash-join method was given by Bratbergsengen in [4] and can be expressed as

    T_t = 2·t_a·2^d·w

where T_t is the total time, t_a is the time to transport a page from disk to memory, 2^d is the size of the data to be partitioned in pages, and w is given by

    w = 0                              if N = 1
      = 1                              if 1 < N ≤ B
      = q − ⌊(B^q − N)/(B − 1)⌋/N      if N > B

where B is the size of the internal buffer in blocks, N = ⌊(2^d − 1)/B⌋ + 1 is the number of partitions to be created, and q = ⌊log_B(N − 1)⌋ + 1.

Bratbergsengen also gave an approximation of the cost, which can be expressed as

    T = k·2^d·log_B 2^d

where k is a constant. We use the approximation, which provides an upper bound on the number of disk accesses, since it is similar to the approximation we made on the number of disk accesses used by the merge sort in Section 6.

All of the indexing bits for each attribute in the join operation should be used to partition the data. For example, if a relation P(A, B) is involved in a join based on attribute A, then a separate partitioning phase should be performed for each distinct index value of A. If there are p distinct index values for attribute A, there will be p partitions of size m such that p·m = n, where n is the total size of the data; the cost of performing the partitioning on these p partitions is p·k·m·log_B m. This is equal to k·n·log_B(n/p). If there were no partitioning based on the index value the cost would be k·n·log_B n, and since k·n·log_B n > k·n·log_B(n/p),

Table 8: Probability of sorting on the first attribute for five attributes when d = 13, B = 64 (columns: Attribute, Constraints, Optimal allocation, Probability).

over the unindexed method decreases as the buffer size increases. Table 8 shows the probabilities of sorting on an attribute combination starting with each attribute, for the case with five attributes. This is the same attribute combination as the first of the two with five attributes in Table 4. It is interesting to note that while the most probable attribute to initially sort on is attribute three (by a comparatively large margin), it is not allocated the greatest number of bits; instead attribute five is maximally allocated. Also, attributes one, two and four are allocated a bit each, when allocating all three bits to attribute three would maximally allocate it. This shows that if the attributes may be maximally allocated, the optimal bit allocation must be determined heuristically using Equation 7, and not by the rules used when attributes cannot be maximally allocated.

8 Bit allocations using the hash-join algorithm.

An optimal index may be created for data files which use the hash-join method [4] to perform join operations, in a similar way to those we have created for join operations performed using the sort-merge technique. Bratbergsengen [4] showed that the nested loop method, although inefficient when the volume of data is large and spread over many pages of disk, is the fastest known method for joining two relations when one of the relations fits into the buffers in memory. For equally sized operands, the nested loop method is faster than the sort-merge method when the operand volume is less than four to twelve times the buffer size. In general, it is faster when the smallest operand is less than three to eight times the buffer size. The hash-join method works by partitioning the relations into a number of blocks which may be joined using the nested loop method in memory. The hash-join partitioning of the relation is accomplished by making a number of passes over the data. On each pass a hash function is used to split the input file up into multiple output files. If two values are to be joined they must be the same; therefore they will have the same hash value. If an output file is larger than the memory buffer in which the nested loop method will be performed, then a later pass will split it up. The optimal split pattern determining the number of passes over the files (and over which files) required to be

Table 5: Cost of sorting for various buffer sizes, k = 2, d = 13 (columns: n, then Sort cost and Impr. for B = 32, 64, 128 and 256).

Table 6: Computational times in seconds for various algorithms, B = 32, k = 2, d = 13 (columns: n, then MMI and the three SANN variants).

Table 7: MMI progress for seven attributes with constraints, k = 2, B = 32, d = 13 (columns: Attributes, Cost, Time, Constraints).

Table 4: Computational results for various attributes with constraints, k = 2, d = 13 (for B = 32 and B = 64: the bit allocation, sort cost and improvement found by MMI, the three SA variants, HIGH, PROB and EVEN, together with the constraints on each attribute and the unindexed cost, NONE).

number of bits. The remaining bits were then allocated to the next most probable attribute, up to a maximum of d − log₂ B bits, then the next attribute, and so on. EVEN simply allocates an equal number of bits to each attribute, with excess bits going to the first attributes occurring in the relation.

The three sets of results were all very similar in form, so for the sake of brevity we include only one of the three, plus one other with five attributes, for two of the buffer sizes, in Table 4. Table 5 shows the cost of the optimal bit allocation for the tests with five attributes and all of the buffer sizes. The improvement (Impr.) column in Tables 4 and 5 represents the ratio of the unindexed cost to the cost of the bit allocation of that method. Table 7 shows the progress of the MMI algorithm for the case of seven attributes. The results were generated on a lightly loaded Silicon Graphics 4D/340 with 128Mb of main memory. The times shown in Table 6 are the sums of the system and user times taken by the algorithms, measured in seconds, for the first set of results, and are representative of all of the times of the algorithms for the appropriate number of attributes.

The results show that the MMI algorithm, although the fastest, is only useful when there are only a few attributes to choose from. With larger numbers of attributes it is too easy to allocate a bit to the wrong attribute. Table 7 demonstrates another problem with the MMI algorithm. When the whole of the data to be sorted fits into the internal buffer, it does not matter which attribute the bits are allocated to: the cost will be the same. If it arbitrarily chooses an attribute (our version chose the first), this attribute may be allocated bits when it would otherwise not have been (for example, if the buffer size was 1). This can result in the wrong attribute being chosen to allocate bits to, after which the optimal bit allocation cannot be determined.

From the simulated annealing results it is clear that the final bit allocation (for all of the algorithms) will typically consist of a number of attributes allocated as many bits as possible. These results show that both the maximum number of bits permitted to be allocated to an attribute, and the fact that allocating more than d − log₂ B bits to an attribute does not decrease the cost, are important in determining how many bits should be allocated to an attribute. The time taken to determine the optimal bit allocation could be improved significantly by taking this information into account, by modifying the MMI and simulated annealing algorithms, perhaps to use this information to generate starting points when searching for the optimal allocation.

It is interesting to note that the cost improvement over the even distribution is not always very large, even when seven attributes are involved. The second of the distributions with five attributes was skewed towards higher probabilities for sort combinations with fewer attributes, and more towards the first attribute than the second, more towards the second than the third, and so on. With this distribution quite a large improvement was achieved, but still only on the order of 50%. The improvement gained over the allocation based on the probabilities of attributes occurring in a query, which took account of the buffer size, was very small. However, the probabilities would still be required to determine these bit allocations, and the time taken to determine the optimal bit allocation is not very large. Improvements over the unindexed cost range from around 20% to over 100%. Although the cost itself decreases (Table 5), the cost improvement

Table 3: Values of the simulated annealing constants T, S and L for the three SA variants.

exponential function with the number of the iteration of the trial as a parameter. The effect of the cooling function is that the simulated annealing algorithm is likely to accept bit allocations with worse values for the cost function in early iterations of a trial, but only accept improving values in later iterations. This is done so that better bit allocations may be found which may only be reached from the initial bit allocation by passing through worse bit allocations. While simulated annealing is not ideal for all optimisation applications [18], it has proved to be a worthwhile, if expensive, technique in applications of this type before [9, 21], and thus we considered it likely to be appropriate again.

7.2 Computational results.

We obtained a series of computational results in an attempt to determine whether it is possible to determine the optimal bit allocation in a feasible amount of time. To this end we generated three sets of random probability distributions for join operations, for a number of attributes between two and seven and for two different buffer sizes. We randomly generated a constraining number of bits for each attribute, such that this number was the maximum number of bits allowed to be allocated to that attribute. These constraints represent the domain size into which the attribute values are being hashed. A perfect hashing function may, but need not, be used to do this. According to [23], generating a perfect hash function is feasible for domain sizes up to around 2^9, that is, when the number of constraining bits is nine. We used Equation 7 to determine the cost of each bit allocation, and we set d = 13, B = 32, 64, 128 and 256, and k = 2 (therefore the results indicate the number of disk accesses). With these file and block sizes we are dealing with 32Mb data files and a 128k, 256k, 512k or 1024k buffer, assuming a 4k page size.

We obtained computational results for the minimal marginal increase algorithm and for three versions of the simulated annealing algorithm, each with different values for the constants T, S and L. The values of the constants used were the same as those used in [21] and later in [9], and are summarised in Table 3. In addition to the basic four algorithms we also tested three naive algorithms, HIGH, PROB and EVEN, and the cost when there is no index, NONE. HIGH and PROB are allocations based on the probability of an attribute appearing in a sort combination. The sum of the probabilities of the combinations in which each attribute appears was taken. For HIGH, all the bits were allocated to the most probable attribute. For PROB, the bits were allocated to the most probable attribute up to a maximum of d − log₂ B bits or its constraining

where C(s) is given by Equation 6, S is the set of all sort combinations, and the attributes are ordered by their position in the combination s. The optimal bit allocation will give the minimum value for Equation 7. The allocation of bits to attributes may change the attribute A_i in each combination of C(s). Therefore, we attempt to find the optimal allocation of bits using a heuristic approach which attempts to find the minimum value of Equation 7 given the constraints imposed by the value of d.

7.1 Heuristic approaches to bit allocation.

For a variety of similar bit allocation problems, when there is no known algorithmic method which obtains an optimal bit allocation in polynomial time given constraints imposed by the domain in which the work is involved, heuristic approaches to bit allocation have been used [9, 15, 21]. Two of the methods which have been used for this are minimal marginal increase (MMI) [15] and simulated annealing (SA) [9, 21]. We will consider both of them.

Minimal marginal increase. The method of minimal marginal increase starts with no bits allocated to any of the attributes. It works in d steps. At each step a single bit is added to a single attribute. To determine which attribute to add the single bit to, in each step, the bit is added to each attribute in turn and the value of the cost function we are trying to minimise, Equation 7, is calculated. The bit is then added to the attribute which results in the smallest value of the cost function.

Simulated annealing. The method of simulated annealing works by performing a number of trials, T, and returning the bit allocation with the best value for the cost function from amongst these trials. In each trial a random bit allocation is generated, using all d bits, and the value of the cost function, Equation 7, is determined for that bit allocation. A number of iterations of bit perturbing, S, is then performed, starting with this bit allocation. In each iteration the bit allocation is perturbed by decrementing the number of bits allocated to one randomly selected attribute by one and incrementing the number of bits allocated to another randomly selected attribute by one. The value of the cost function for this bit allocation is then determined. If this value is less than the best value of the cost function, or if the cooling function accepts it, this bit allocation is used as the basis for the next iteration; otherwise the previous bit allocation is used. The number of iterations performed is also constrained by a number, L, which specifies the maximum number of iterations permitted to be performed without finding a new bit allocation with a better value for the cost function. If this number is exceeded, the iterations for this trial are stopped instead of the total number of iterations, S, being completed. The cooling function determines whether or not a bit allocation which does not improve the value of the cost function is accepted. It tests whether a randomly generated number is less than the value of an inverse
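The two heuristics can be sketched as follows, with `cost` standing in for Equation 7. T, S and L play the roles described above; the particular cooling schedule (an inverse exponential in the iteration number, as the text describes) and the attribute names are illustrative assumptions, not the paper's exact implementation.

```python
import math
import random

def mmi(attrs, d, cost):
    """Minimal marginal increase: allocate d bits one at a time, each time
    to the attribute whose increment gives the smallest cost."""
    alloc = {a: 0 for a in attrs}
    for _ in range(d):
        best = None
        for a in attrs:
            alloc[a] += 1               # trial increment
            c = cost(alloc)
            alloc[a] -= 1               # undo
            if best is None or c < best[0]:
                best = (c, a)
        alloc[best[1]] += 1             # commit
    return alloc

def simulated_annealing(attrs, d, cost, T, S, L, seed=0):
    """T trials; each starts from a random allocation of all d bits and
    perturbs it up to S times, stopping after L non-improving iterations."""
    rng = random.Random(seed)
    best_alloc, best_c = None, float("inf")
    for _ in range(T):
        alloc = {a: 0 for a in attrs}
        for _ in range(d):              # random initial allocation
            alloc[rng.choice(attrs)] += 1
        cur_c, stale = cost(alloc), 0
        if cur_c < best_c:
            best_alloc, best_c = dict(alloc), cur_c
        for i in range(S):
            if stale >= L:
                break                   # too long without an improvement
            src = rng.choice([a for a in attrs if alloc[a] > 0])
            dst = rng.choice([a for a in attrs if a != src])
            alloc[src] -= 1             # perturb: move one bit src -> dst
            alloc[dst] += 1
            c = cost(alloc)
            if c < cur_c:
                cur_c, stale = c, 0
            elif rng.random() < math.exp(-i / float(S)):
                cur_c, stale = c, stale + 1   # accept a worse move early on
            else:
                alloc[src] += 1         # reject: undo the perturbation
                alloc[dst] -= 1
                stale += 1
            if cur_c < best_c:
                best_alloc, best_c = dict(alloc), cur_c
    return best_alloc, best_c
```

Both return allocations of exactly d bits; MMI evaluates the cost function d·|attrs| times, while the annealer's work is bounded by T·S evaluations.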

Thus, the nth copy of the data file should be indexed by the attribute with the nth highest probability of being sorted. Similarly, if $d - \log_2 B$ bits may be allocated to an attribute (in any file) then they should be allocated to the attribute with the highest probability first, then the second highest probability, and so on. In summary, assuming an unconstrained domain, the attribute with the highest probability of being sorted should be allocated as many bits as possible within a single data file.

7 Maximally allocated attributes.

A special situation arises when an attribute has the maximum possible number of bits allocated to it. For such attributes, allocating extra bits has no effect. For example, if an attribute denotes a person's sex it will have one of two values. Therefore, the hash function need only generate one bit for the attribute. To generate more bits would not provide greater discrimination between the values. We say that this attribute is maximally allocated at one bit. We define a maximally allocated attribute to be an attribute for which, for values a and b of the attribute,

$$\forall a, b : h(a) = h(b) \Rightarrow a = b$$

where h(x) is the hash function used to construct the hash value for the attribute. If all of the bits of a maximally allocated attribute are used in constructing the hash index for the data file then the time taken to join on this attribute alone is linear, because the data file is already sorted on this attribute using the sort key definition above.

Consider a sort combination $(A_1, A_2, A_3)$. If attribute $A_1$ is maximally allocated then the file is completely sorted on this attribute. Instead of using $A_1$ as the only attribute to allocate bits to, if we also allocate bits to $A_2$ the number of records which need to be sorted will decrease. This only occurs if $A_1$ is maximally allocated. Similarly, we may allocate bits to $A_3$ if both $A_1$ and $A_2$ are maximally allocated.
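The definition above says that an attribute is maximally allocated exactly when its hash function is injective on the attribute's domain. A minimal sketch of that check, with hypothetical hash functions of our own for illustration:

```python
def is_maximally_allocated(domain, h):
    """True iff h(a) == h(b) implies a == b over the whole domain,
    i.e. extra hash bits would add no discrimination."""
    seen = {}
    for v in domain:
        hv = h(v)
        if hv in seen and seen[hv] != v:
            return False                   # two distinct values collide
        seen[hv] = v
    return True

# A one-bit hash over a two-valued attribute (e.g. sex) is maximally allocated.
sex_hash = {"F": 0, "M": 1}.get
print(is_maximally_allocated(["F", "M"], sex_hash))        # True
# The same one-bit hash over a four-valued attribute is not.
quarter_hash = lambda q: q % 2
print(is_maximally_allocated([1, 2, 3, 4], quarter_hash))  # False
```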
Thus the cost of sorting for a single combination of attributes, a generalisation of Equations 2 and 3, becomes

$$C((A_1, \ldots, A_i, \ldots, A_m)) = \begin{cases} \alpha\left(d - \sum_{j=1}^{i} d_{A_j}\right) & \text{if } \sum_{j=1}^{i} d_{A_j} < d - \log_2 B \\ 2 \cdot 2^d & \text{if } \sum_{j=1}^{i} d_{A_j} \ge d - \log_2 B \end{cases} \quad (6)$$

where $A_i$ is now the first non-maximally allocated attribute in the combination. Note that if the maximum number of bits which may be allocated to the attribute $A_1$ is greater than or equal to d then i will be 1 and Equation 6 reduces to Equations 2 and 3. If we combine Equations 1 and 6, we find that the average cost of sorting on all combinations is

$$C_S = \sum_{s \in S} p_s C(s) \quad (7)$$
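Equations 6 and 7 can be written as executable cost functions. This is a sketch under our own naming assumptions: `alloc` maps attributes to allocated bits, and `max_bits` records the bit count at which each attribute becomes maximally allocated.

```python
import math

def sort_cost(combo, alloc, max_bits, d, B, k=1.0):
    """Equation 6: cost of sorting on one attribute combination.
    Bits of leading maximally allocated attributes, plus those of the
    first non-maximally allocated attribute, all count as pre-sorted."""
    alpha = k * 2**d / math.log2(B)
    sorted_bits = 0
    for a in combo:
        sorted_bits += alloc[a]
        if alloc[a] < max_bits[a]:
            break                          # first non-maximally allocated attribute
    if sorted_bits < d - math.log2(B):
        return alpha * (d - sorted_bits)   # merge sort of the remaining runs
    return 2 * 2**d                        # runs fit in memory: one read + write

def average_sort_cost(combos, alloc, max_bits, d, B, k=1.0):
    """Equation 7: probability-weighted cost over all sort combinations;
    combos is a list of (combination, probability) pairs."""
    return sum(p * sort_cost(s, alloc, max_bits, d, B, k) for s, p in combos)
```

With d = 10 and B = 4, an attribute maximally allocated at 3 bits followed by one holding 2 bits yields a cost of $\alpha(10 - 5)$, while pushing the pre-sorted bits past $d - \log_2 B$ switches to the in-memory cost $2 \cdot 2^d$.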

We wish to minimise Equation 4, the average sort cost. Since $\alpha$ and d are constant, it follows that we wish to maximise

$$\sum_{i=1}^{n} p_i d_i.$$

The maximum value of this is $p_s d$ where $p_s = \max_{i=1..n}(p_i)$. Thus, to form an index which minimises the average sorting cost we allocate all of the bits to the attribute with the highest probability of being required to be sorted. This is a surprising result. As a consequence, in the optimal bit allocation we allocate all the bits to the attribute with the highest probability if this number is less than, or equal to, $d - \log_2 B$. If it is greater than $d - \log_2 B$ then $d - \log_2 B$ bits should be allocated to it and the rest allocated to the attribute with the second highest probability, provided this number is less than, or equal to, $d - \log_2 B$. If it is greater than $d - \log_2 B$ then $d - \log_2 B$ bits should be allocated to it and the rest allocated to the attribute with the third highest probability, and so on. If there are constraints on the domain size of an attribute then fewer bits may also be allocated to it. For example, if the attribute denotes a person's sex there are only two possible values. This is fully discussed in Section 7; the current result requires an unconstrained domain, that is, a domain of size larger than $2^d / B$ from which hash values are generated.

6.1 Multiple copies of data.

If we can create multiple copies of the data file, each indexed using a different hash indexing scheme, our aim is still to minimise the average sort cost. If the number of copies is m, and the number of bits allocated to attribute i in copy j is $d_i^j$, the average cost of sorting is given by

$$C_S = \sum_{i=1}^{n} p_i \min_{j=1..m}\left(\alpha(d - d_i^j)\right) = \alpha\left(d - \sum_{i=1}^{n} p_i \max_{j=1..m}(d_i^j)\right) \quad (5)$$

We wish to minimise Equation 5. Since $\alpha$ and d are constant, it follows that we wish to maximise

$$\sum_{i=1}^{n} p_i \max_{j=1..m}(d_i^j).$$

As above, if there is one copy of the data file (m = 1), the maximum value is $p_{s_1} d$ where $p_{s_1}$ is the highest probability. For m copies of the data file, the maximum value is

$$\sum_{j=1}^{m} p_{s_j} d$$

where $p_{s_j}$ is the jth highest probability. To achieve this, in the jth file d bits should be allocated to the attribute with the jth highest probability.
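The allocation rule for a single file can be sketched as a short greedy procedure. The function name is ours, and we assume each attribute carries a cap on its usable bits (its maximal allocation, discussed in Section 7); an effectively unbounded cap models the unconstrained-domain case.

```python
def allocate_bits(probs, caps, d, b_log2):
    """Give d - log2(B) bits (or up to the attribute's cap) to the attribute
    with the highest sort probability, then the next, until d bits are placed."""
    limit = d - b_log2                     # bits past this do not reduce the cost
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    alloc = [0] * len(probs)
    remaining = d
    for i in order:
        if remaining <= 0:
            break
        give = min(limit, caps[i], remaining)
        alloc[i] = give
        remaining -= give
    return alloc
```

For example, with d = 12 and $\log_2 B = 2$, the highest-probability attribute receives 10 bits and the next receives the remaining 2; if the first attribute is capped at 1 bit (a two-valued domain), the spare bits flow to the next attributes in probability order.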

on $2^d$ blocks. Therefore, C is given by

$$C((A_1, A_2, \ldots, A_m)) = 2^{d_{A_1}} \cdot k 2^{d - d_{A_1}} \log_B\left(2^{d - d_{A_1}}\right) = k 2^d \log_B\left(2^{d - d_{A_1}}\right) = \alpha(d - d_{A_1}) \quad \text{if } d_{A_1} < d - \log_2 B \quad (2)$$

where $\alpha = k 2^d / \log_2 B$, $1 \le m \le n$, and n is the total number of attributes. k is a small constant associated with the disk-based merge sort which takes into account features of this method such as the decreasing total number of blocks as the sorting takes place, because hashing does not ensure 100% occupancy of the blocks. k, d and B are constants, therefore $\alpha$ is a constant. We have assumed that $d_{A_1} < d - \log_2 B$, that is, that the $2^{d - d_{A_1}}$ blocks of records to be sorted cannot all be contained within memory at once. If they can be contained within memory then the number of disk accesses (reads and writes) for the sort operation is given by

$$C((A_1, A_2, \ldots, A_m)) = 2 \cdot 2^{d_{A_1}} \cdot 2^{d - d_{A_1}} = 2 \cdot 2^d \quad \text{if } d_{A_1} \ge d - \log_2 B \quad (3)$$

as no merge phase need occur. We note that it is extremely desirable to allocate $d - \log_2 B$ bits to an attribute; however, the performance does not improve by allocating more bits than this. For the remainder of the calculations in this section we assume that $d_{A_1} < d - \log_2 B$.

Note that Equation 2 is independent of all of the attributes except the first attribute in each combination, and is independent of both m and n. As $\alpha$ and d are constants, the only variable in Equation 2 is the number of bits allocated to the first attribute, $d_{A_1}$. Therefore, the cost of sorting a combination of attributes is the same as the cost of sorting only on the first attribute of the combination. Thus, we define the cost of sorting on an attribute combination starting with $A_1$ to be

$$C(A_1) = C((A_1)) = C((A_1, A_2, A_3)) = C((A_1, A_3, A_2)) = C((A_1, \ldots)) = \alpha(d - d_{A_1})$$

Therefore the cost of any attribute combination is one of the n costs $C(A_1), \ldots, C(A_n)$. If we combine Equations 1 and 2, we find that the average sorting cost is

$$C_S = \sum_{i=1}^{n} p_i \alpha(d - d_i) = \alpha\left(d - \sum_{i=1}^{n} p_i d_i\right) \quad (4)$$

where $d_i$ is the number of bits allocated to the ith attribute and $p_i$ is the sum of the probabilities of the combinations in which the ith attribute is first.
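Equations 2-4 can be checked numerically. A small sketch under assumed parameter values (d = 20, B = 16, k = 1; the names are ours):

```python
import math

d, B, k = 20, 16, 1.0            # index bits, buffer size in blocks, merge-sort constant
alpha = k * 2**d / math.log2(B)  # the constant factor in Equation 2

def sort_cost(d_a1):
    """Equations 2 and 3: cost of sorting any combination whose first
    attribute has d_a1 index bits allocated to it."""
    if d_a1 < d - math.log2(B):
        return alpha * (d - d_a1)          # 2^d_a1 merge sorts of 2^(d-d_a1) blocks
    return 2 * 2**d                        # each run fits in memory: one read + write

def average_cost(p, alloc):
    """Equation 4: p[i] is the probability that attribute i is first
    in the sort combination, alloc[i] its allocated bits."""
    return sum(pi * sort_cost(di) for pi, di in zip(p, alloc))
```

Note how allocating bits to the first attribute lowers the cost linearly until $d_{A_1}$ reaches $d - \log_2 B$, after which the in-memory cost $2 \cdot 2^d$ applies and further bits are wasted.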

Table 2: Two sort combinations.

Figure 1: The arrangement of pages in a hash file.

subsequent bits of the hash value of $A_1$, and the value of $A_1$ itself, are more significant than any of those of $A_2$. Each of these hash value combinations represents a block in the file, and the blocks may be retrieved in any order at little additional cost; therefore the cost of sorting on the combination $(A_1, A_2)$ is effectively the same whether the bits of each attribute are grouped together or interleaved. We now wish to determine the cost in disk accesses, C, of sorting on any combination of attributes $(A_1, A_2, \ldots, A_m)$. The cost of a disk-based merge sort is $O(n \log_s n)$ [11], where n is the number of elements to be sorted and s the number of streams which are merged simultaneously. To sort the whole of the data file of $2^d$ blocks would take $k 2^d \log_B 2^d$ disk accesses, where B is the buffer size in blocks and k a small constant. However, the file is already partially sorted on the first attribute if we use the sort method described above. If the number of bits allocated to the first attribute is $d_{A_1}$ then we can perform $2^{d_{A_1}}$ merge sort operations on $2^{d - d_{A_1}}$ blocks each instead of a single merge sort operation

becomes feasible to devote one copy, or more, to the efficient answering of queries containing the join operator, especially if join is a frequent operation. We wish to create an index which minimises the cost of performing all join operations. If we use the sort-merge join algorithm, this problem reduces to minimising the cost of performing a sorting operation on the same relations. The average cost of sorting, $C_S$, is given by

$$C_S = \sum_{i=1}^{|S|} p_i C_i \quad (1)$$

where S is the set of all combinations of attributes on which a sort operation is performed, $p_i$ is the probability of the ith sort combination, and $C_i$ is the cost of the ith sort combination (a sort combination and its cost are defined below).

It is extremely difficult to maintain sorted files in a dynamic environment when the files are to be sorted based on multiple attributes. While data structures which have an implicit ordering within them, such as B-trees, perform the task of ordering based on a single attribute very well, retrieving a sorted file based on a different attribute is extremely expensive. Even if a separate index is used, the retrieved data will still not be clustered, making retrieval expensive. However, data structures such as the grid or BANG files can be used with the optimisations we suggest in this paper (this is discussed in Section 9).

Although the sort-merge algorithm requires the relations to be sorted, they do not have to be sorted solely on the value of the attributes. We define a new partial sort key to be the hashed value of the attribute concatenated with the value of the attribute. The presence of the value of the attribute deals with the case of different attribute values having the same hash value. The most significant part of the sort key is the hash value. A variation of this idea appears in both the hash-join and superjoin techniques.

One advantage of performing joins using this technique is that the indexes generated may easily be used to retrieve sorted relations directly if the hash functions are order preserving [5]. We now define a notation for sorting on combinations of attributes. Let $(A_1, \ldots, A_n)$ be the result of sorting the contents of a data file based upon $A_1$, then $A_2$, and so on, up to $A_n$. The result will be ordered upon increasing (or decreasing, if desired) values of $A_1$. Within each distinct value of $A_1$ the records will be ordered on increasing (or decreasing) values of $A_2$, and within each value of $A_{n-1}$ the records will be ordered on increasing (or decreasing) values of $A_n$. The sort types of the attributes may also be mixed, that is, some may be in increasing order and some may be in decreasing order. For example, in Table 2 the records in the relation on the left are ordered on the sort combination $(A_1, A_2)$ and the records in the relation on the right are ordered on the sort combination $(A_2, A_1)$.

The index of a file using multi-attribute hashing is already partially sorted using our sort keys. For example, consider a file with two attributes, $A_1$ and $A_2$, with three bits allocated from each to form the index. The file may be stored as in Figure 1. If the data is required to be sorted on the combination $(A_1, A_2)$, that is, sort on $A_1$ then on $A_2$, each of the blocks with the same value of $A_1$ would have to be sorted and merged together. This is because the fourth and
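The partial sort key, the hash value with the raw attribute value appended as a tie-breaker, can be sketched directly; the helper names and the three-bit hash are our own illustrative assumptions.

```python
def sort_key(hash_fn, value):
    """Partial sort key: the hash value is most significant; the raw value
    distinguishes different attribute values that share a hash value."""
    return (hash_fn(value), value)

# A hypothetical 3-bit hash for a single attribute.
h = lambda v: v % 8

records = [12, 3, 27, 8, 3, 19]
# Sorting on the combination (A1,) using the partial sort key:
ordered = sorted(records, key=lambda r: sort_key(h, r))
print(ordered)   # [8, 3, 3, 19, 27, 12]
```

Records 3, 19 and 27 all hash to 3, so they cluster together under the hash and are then ordered by their actual values, exactly the behaviour the merge phase of a sort-merge join requires.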

necessary to understand exactly what the stages are in each method and which stages we have measured. The sort-merge algorithm consists of a sorting phase and a merging phase. The cost of the join operation, in disk accesses, may be described as follows:

$$\mathrm{cost}(P \bowtie_{A=B} Q) = \mathrm{sort}(A)(P) + \mathrm{sort}(B)(Q) + \mathrm{merge}$$

where

$$\mathrm{merge} = \mathrm{read}(P) + \mathrm{read}(Q) + \mathrm{output}$$

and sort(A)(P) is the cost of sorting relation P based on attribute A, read(P) is the cost of reading relation P, and output is the cost of writing the result of the join. The value of sort(A)(P) will be zero if, and only if, all the attributes can be directly retrieved in sorted order using the indexes; that is, no sorting needs to be performed across data blocks.

The hash-join algorithm consists of a partitioning phase, described in Section 8, and a merging phase. The cost of the join operation, in disk accesses, may be described as follows:

$$\mathrm{cost}(P \bowtie_{A=B} Q) = \mathrm{partition}(A)(P) + \mathrm{partition}(B)(Q) + \mathrm{merge}$$

where partition(A)(P) is the total cost of partitioning the relation P based on the attribute A, and merge is a similar process to the merge in the sort-merge algorithm with the same number of disk accesses. Note that if partition(A)(P) is zero then the partitioning of relation Q is unnecessary, so partition(B)(Q) is zero. When the initial partition sizes, based on the index of the relation, are less than or equal to the size of the buffer in memory, the partitioning phase is unnecessary, therefore partition(A)(P) will be zero. If one relation, or an index-based partition thereof, will totally fit into memory then only a single read of the other relation will be necessary to perform the join. Thus the partitioning of the other relation is unnecessary.

The costs described in Sections 6 and 7 represent the cost of the sorting phase of the sort-merge algorithm in terms of the number of disk accesses (both block reads and writes). Similarly, the costs described in Section 8 represent the cost of the partitioning phase of the hash-join algorithm in terms of the number of disk accesses. We ignore the cost of the merge phase in each of these sections because we cannot minimise the number of reads and writes in this phase; they are constant. Thus we concentrate on optimising the sorting and partitioning phases.

6 General solution using the sort-merge algorithm.

As the cost of mass storage decreases it is becoming increasingly cost-effective to have multiple copies of data files, each indexed using a different scheme. This results in an increase in performance for retrieval of data at the expense of requiring additional storage space and increased insertion, deletion and update costs. If multi-attribute hashing is used to index the files, each index uses a different number of bits from each attribute. Under these circumstances it
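The two cost breakdowns above can be expressed as a small cost model; the function names and the convention that each argument is a count of disk accesses are our own assumptions.

```python
def sort_merge_cost(sort_p, sort_q, read_p, read_q, output):
    """cost(P join Q) = sort(A)(P) + sort(B)(Q) + merge, where
    merge = read(P) + read(Q) + output, all in disk accesses."""
    merge = read_p + read_q + output
    return sort_p + sort_q + merge

def hash_join_cost(partition_p, partition_q, read_p, read_q, output):
    """cost(P join Q) = partition(A)(P) + partition(B)(Q) + merge.
    If P needs no partitioning, Q need not be partitioned either."""
    if partition_p == 0:
        partition_q = 0
    merge = read_p + read_q + output
    return partition_p + partition_q + merge
```

Since the merge term is identical in both models, minimising either join cost reduces to minimising the sorting or partitioning terms, which is exactly the focus of the following sections.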

Join algorithm costs revisited. Evan P. Harris and Kotagiri Ramamohanarao, Department of Computer Science, The University of Melbourne. The VLDB Journal (1996) 5:64-84.
Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University Using the Holey Brick Tree for Spatial Data in General Purpose DBMSs Georgios Evangelidis Betty Salzberg College of Computer Science Northeastern University Boston, MA 02115-5096 1 Introduction There is

More information

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori Use of K-Near Optimal Solutions to Improve Data Association in Multi-frame Processing Aubrey B. Poore a and in Yan a a Department of Mathematics, Colorado State University, Fort Collins, CO, USA ABSTRACT

More information

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We

More information

Worst-case running time for RANDOMIZED-SELECT

Worst-case running time for RANDOMIZED-SELECT Worst-case running time for RANDOMIZED-SELECT is ), even to nd the minimum The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case

More information

A Hybrid Recursive Multi-Way Number Partitioning Algorithm

A Hybrid Recursive Multi-Way Number Partitioning Algorithm Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A Hybrid Recursive Multi-Way Number Partitioning Algorithm Richard E. Korf Computer Science Department University

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

As an additional safeguard on the total buer size required we might further

As an additional safeguard on the total buer size required we might further As an additional safeguard on the total buer size required we might further require that no superblock be larger than some certain size. Variable length superblocks would then require the reintroduction

More information

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1 CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query Sub-System Queries Select * From Blah B Where B.blah = blah Query Parser Query Optimizer Plan Generator Plan Cost

More information

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as An empirical investigation into the exceptionally hard problems Andrew Davenport and Edward Tsang Department of Computer Science, University of Essex, Colchester, Essex CO SQ, United Kingdom. fdaveat,edwardgessex.ac.uk

More information

Richard E. Korf. June 27, Abstract. divide them into two subsets, so that the sum of the numbers in

Richard E. Korf. June 27, Abstract. divide them into two subsets, so that the sum of the numbers in A Complete Anytime Algorithm for Number Partitioning Richard E. Korf Computer Science Department University of California, Los Angeles Los Angeles, Ca. 90095 korf@cs.ucla.edu June 27, 1997 Abstract Given

More information

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented which, for a large-dimensional exponential family G,

More information

An Improved Algebraic Attack on Hamsi-256

An Improved Algebraic Attack on Hamsi-256 An Improved Algebraic Attack on Hamsi-256 Itai Dinur and Adi Shamir Computer Science department The Weizmann Institute Rehovot 76100, Israel Abstract. Hamsi is one of the 14 second-stage candidates in

More information

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907 The Game of Clustering Rowena Cole and Luigi Barone Department of Computer Science, The University of Western Australia, Western Australia, 697 frowena, luigig@cs.uwa.edu.au Abstract Clustering is a technique

More information

Optimal Sequential Multi-Way Number Partitioning

Optimal Sequential Multi-Way Number Partitioning Optimal Sequential Multi-Way Number Partitioning Richard E. Korf, Ethan L. Schreiber, and Michael D. Moffitt Computer Science Department University of California, Los Angeles Los Angeles, CA 90095 IBM

More information

Heap-Filter Merge Join: A new algorithm for joining medium-size relations

Heap-Filter Merge Join: A new algorithm for joining medium-size relations Oregon Health & Science University OHSU Digital Commons CSETech January 1989 Heap-Filter Merge Join: A new algorithm for joining medium-size relations Goetz Graefe Follow this and additional works at:

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

A Recursive Coalescing Method for Bisecting Graphs

A Recursive Coalescing Method for Bisecting Graphs A Recursive Coalescing Method for Bisecting Graphs The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Accessed Citable

More information

A Population-Based Learning Algorithm Which Learns Both. Architectures and Weights of Neural Networks y. Yong Liu and Xin Yao

A Population-Based Learning Algorithm Which Learns Both. Architectures and Weights of Neural Networks y. Yong Liu and Xin Yao A Population-Based Learning Algorithm Which Learns Both Architectures and Weights of Neural Networks y Yong Liu and Xin Yao Computational Intelligence Group Department of Computer Science University College,

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

Implementation of Relational Operations

Implementation of Relational Operations Implementation of Relational Operations Module 4, Lecture 1 Database Management Systems, R. Ramakrishnan 1 Relational Operations We will consider how to implement: Selection ( ) Selects a subset of rows

More information

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department.

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department. PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu IBM T. J. Watson Research Center P.O.Box 704 Yorktown, NY 10598, USA email: fhhsiao, psyug@watson.ibm.com

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

9/24/ Hash functions

9/24/ Hash functions 11.3 Hash functions A good hash function satis es (approximately) the assumption of SUH: each key is equally likely to hash to any of the slots, independently of the other keys We typically have no way

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Chapter 14 Comp 521 Files and Databases Fall 2010 1 Relational Operations We will consider in more detail how to implement: Selection ( ) Selects a subset of rows from

More information

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence

More information

On the Complexity of Interval-Based Constraint. Networks. September 19, Abstract

On the Complexity of Interval-Based Constraint. Networks. September 19, Abstract On the Complexity of Interval-Based Constraint Networks Rony Shapiro 1, Yishai A. Feldman 2, and Rina Dechter 3 September 19, 1998 Abstract Acyclic constraint satisfaction problems with arithmetic constraints

More information

Evaluation of Relational Operations. Relational Operations

Evaluation of Relational Operations. Relational Operations Evaluation of Relational Operations Chapter 14, Part A (Joins) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Relational Operations v We will consider how to implement: Selection ( )

More information

Laboratoire de l Informatique du Parallélisme

Laboratoire de l Informatique du Parallélisme Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON n o 8512 SPI Multiplication by an Integer Constant Vincent Lefevre January 1999

More information

O(n): printing a list of n items to the screen, looking at each item once.

O(n): printing a list of n items to the screen, looking at each item once. UNIT IV Sorting: O notation efficiency of sorting bubble sort quick sort selection sort heap sort insertion sort shell sort merge sort radix sort. O NOTATION BIG OH (O) NOTATION Big oh : the function f(n)=o(g(n))

More information

Striped Grid Files: An Alternative for Highdimensional

Striped Grid Files: An Alternative for Highdimensional Striped Grid Files: An Alternative for Highdimensional Indexing Thanet Praneenararat 1, Vorapong Suppakitpaisarn 2, Sunchai Pitakchonlasap 1, and Jaruloj Chongstitvatana 1 Department of Mathematics 1,

More information

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst An Evaluation of Information Retrieval Accuracy with Simulated OCR Output W.B. Croft y, S.M. Harding y, K. Taghva z, and J. Borsack z y Computer Science Department University of Massachusetts, Amherst

More information

Sparse Hypercube 3-Spanners

Sparse Hypercube 3-Spanners Sparse Hypercube 3-Spanners W. Duckworth and M. Zito Department of Mathematics and Statistics, University of Melbourne, Parkville, Victoria 3052, Australia Department of Computer Science, University of

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Enumeration of Full Graphs: Onset of the Asymptotic Region. Department of Mathematics. Massachusetts Institute of Technology. Cambridge, MA 02139

Enumeration of Full Graphs: Onset of the Asymptotic Region. Department of Mathematics. Massachusetts Institute of Technology. Cambridge, MA 02139 Enumeration of Full Graphs: Onset of the Asymptotic Region L. J. Cowen D. J. Kleitman y F. Lasaga D. E. Sussman Department of Mathematics Massachusetts Institute of Technology Cambridge, MA 02139 Abstract

More information

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel

More information

Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube

Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube Kavish Gandhi April 4, 2015 Abstract A geodesic in the hypercube is the shortest possible path between two vertices. Leader and Long

More information

The Grid File: An Adaptable, Symmetric Multikey File Structure

The Grid File: An Adaptable, Symmetric Multikey File Structure The Grid File: An Adaptable, Symmetric Multikey File Structure Presentation: Saskia Nieckau Moderation: Hedi Buchner The Grid File: An Adaptable, Symmetric Multikey File Structure 1. Multikey Structures

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Reinforcement Control via Heuristic Dynamic Programming. K. Wendy Tang and Govardhan Srikant. and

Reinforcement Control via Heuristic Dynamic Programming. K. Wendy Tang and Govardhan Srikant. and Reinforcement Control via Heuristic Dynamic Programming K. Wendy Tang and Govardhan Srikant wtang@ee.sunysb.edu and gsrikant@ee.sunysb.edu Department of Electrical Engineering SUNY at Stony Brook, Stony

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

Visualizing Weighted Edges in Graphs

Visualizing Weighted Edges in Graphs Visualizing Weighted Edges in Graphs Peter Rodgers and Paul Mutton University of Kent, UK P.J.Rodgers@kent.ac.uk, pjm2@kent.ac.uk Abstract This paper introduces a new edge length heuristic that finds a

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

R has an ordered clustering index file on its tuples: read the index file to get the location of the tuple with the next smallest value

CS554, Homework 5. Question 1 (20 pts). Given: the content of a relation R is stored as four contiguous runs of duplicate values (d…d, a…a, c…c, b…b)…

Implementations of Dijkstra's Algorithm Based on Multi-Level Buckets

Andrew V. Goldberg, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 (avg@research.nj.nec.com); Craig Silverstein, Computer…

Joint Entity Resolution

Steven Euijong Whang and Hector Garcia-Molina, Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305, USA ({swhang, hector}@cs.stanford.edu)…

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing. Topics: Basic Concepts; Ordered Indices; B+-Tree Index Files; B-Tree Index Files; Static Hashing; Dynamic Hashing; Comparison of Ordered Indexing and Hashing; Index Definition in SQL…

Overview of Query Evaluation (9/26/17)

CompSci 516 Database Systems, Lecture 10: Query Evaluation and Join Algorithms. Announcement: project proposal PDF due on Sakai by 5 pm tomorrow, Thursday 09/27 (one per group, by any member). Instructor: Sudeepa…

GSAT and Local Consistency

Kalev Kask and Rina Dechter, Department of Information and Computer Science, University of California, Irvine, CA 92717-3425 ({kkask, dechter}@ics.uci.edu). Abstract: It has been…

CSCI 5454: Randomized Min Cut

Sean Wiese and Ramya Nair, April 8, 2013. A classic problem in computer science is finding the minimum cut of an undirected graph. If we are presented with…
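The randomized min-cut idea that this excerpt begins to describe can be sketched with Karger's contraction algorithm: repeatedly merge the endpoints of a uniformly random edge until only two super-vertices remain, then count the edges crossing between them, and repeat the whole experiment many times to make finding the true minimum likely. This is an illustrative sketch of the standard algorithm, not code from the listed course notes.

```python
import random

def karger_min_cut(edges, n, trials=200, seed=0):
    """Estimate the min cut of an undirected multigraph.

    edges: list of (u, v) pairs over vertices 0..n-1.
    Runs `trials` independent random-contraction experiments and
    returns the smallest cut size observed.
    """
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(trials):
        # Union-find structure representing contracted super-vertices.
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        remaining = n
        pool = edges[:]
        rng.shuffle(pool)  # random order = picking uniform random edges
        for u, v in pool:
            if remaining == 2:
                break
            ru, rv = find(u), find(v)
            if ru != rv:          # skip self-loops, contract this edge
                parent[ru] = rv
                remaining -= 1
        # Edges whose endpoints lie in different super-vertices form the cut.
        cut = sum(1 for u, v in edges if find(u) != find(v))
        best = min(best, cut)
    return best
```

Each single experiment finds a minimum cut with probability at least 2/(n(n-1)), which is why the outer `trials` loop is essential.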

An Overview of Cost-based Optimization of Queries with Aggregates

Surajit Chaudhuri, Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94304 (chaudhuri@hpl.hp.com); Kyuseok Shim, IBM Almaden Research…

Handout 3: Problem Set 1 Solutions

Massachusetts Institute of Technology, 6.854/18.415: Advanced Algorithms, September 14, 1999, David Karger. Problem 1: Suppose that we have a chain of n − 1 nodes in a Fibonacci… ("…when the mark counter reaches c, a cascading cut is performed and the mark counter is reset to 0. So the actual cost is 2…")

Evaluation of relational operations

Iztok Savnik, FAMNIT. Slides and textbook. Textbook: Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems, McGraw-Hill, 3rd ed., 2007. Slides: from the Cow Book…

Automatic Interpretation of Floor Plans Using Spatial Indexing

Hanan Samet and Aya Soffer, Computer Science Department… In Progress in Image Analysis and Processing III, pp. 233-240, World Scientific, Singapore, 1994.

Process Allocation for Load Distribution in Fault-Tolerant Multicomputers

Jong Kim and Heejo Lee, Dept. of Computer Science and Engineering; Sunggu Lee, Dept. of Electrical Engineering; Pohang University…

QUERY PROCESSING IN A RELATIONAL DATABASE MANAGEMENT SYSTEM

Gawande Balaji Ramrao, Research Scholar, Dept. of Computer Science, CMJ University, Shillong, Meghalaya. Abstract: Database management systems will…

Simple Evolutionary Heuristics for Global Optimization

Josef Tvrdík and Ivan Křivý, University of Ostrava, Bráfova 7, 701 03 Ostrava, Czech Republic (phone: +420.69.6160…). The Statistical Software Newsletter, pp. 335-336. "…where z is one (randomly taken) pole of the simplex S, g the centroid of the remaining d poles of the simplex…"

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing. Topics: Basic Concepts; Ordered Indices; B+-Tree Index Files; B-Tree Index Files; Static Hashing; Dynamic Hashing; Comparison of Ordered Indexing and Hashing; Index Definition in SQL…

Multi-Layer Incremental Induction

Xindong Wu and William H. W. Lo, School of Computer Science and Software Engineering, Monash University, 900 Dandenong Road, Melbourne, VIC 3145, Australia (xindong@computer.org). "…size, runs an existing induction algorithm on the first subset to obtain a first set of rules, and then processes each of the remaining data subsets at a…"

Parallel Randomized Algorithms Using Sampling

Introduction. A fundamental strategy used in designing efficient algorithms is divide-and-conquer, where the input data is partitioned into several subproblems… "…would be included in is small: 1/(n+1), to be exact. Thus with probability 1 − 1/(n+1), the same partition would be produced regardless of whether p is in the inp…"

Lecture 8 13 March, 2012

6.851: Advanced Data Structures, Spring 2012, Prof. Erik Demaine. From last lectures: in the previous lecture, we discussed the External Memory and Cache-Oblivious memory models.

Fractals for Secondary Key Retrieval

Christos Faloutsos, University of Maryland, College Park. Carnegie Mellon University Research Showcase @ CMU, Computer Science Department, School of Computer Science, 1989.

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), 2017, Volume 2, Issue 6, ISSN 2456-3307.

An On-line Variable Length Binary Encoding

Tinku Acharya and Joseph JáJá, Institute for Systems Research and Institute for Advanced Computer Studies, University of Maryland, College Park, MD…

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

Chapter 888. Introduction: This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,…

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12 Implementation of Relational Operations CS 186, Fall 2002, Lecture 19 R&G - Chapter 12 First comes thought; then organization of that thought, into ideas and plans; then transformation of those plans into

More information

CS 640: Query Processing

Olaf Hartig, David R. Cheriton School of Computer Science, University of Waterloo. CS 640: Principles of Database Management and Use, Winter 2013. Some of these slides are based on a slide set provided by Ulf Leser.

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

Comparing Implementations of Optimal Binary Search Trees

Corianna Jacoby and Alex King, Tufts University, May 2017. Introduction: In this paper we sought to put together a practical comparison of the optimality…

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Hyunghoon Cho and David Wu, December 10, 2010. Introduction: Given its performance in recent years' PASCAL Visual…

Lecture notes on Transportation and Assignment Problem (BBE (H) QTM paper of Delhi University)

Transportation and Assignment Problems. The transportation model is a special class of linear programs. It received this name because many of its applications involve determining how to optimally transport…
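For a feel of what the transportation model computes, a tiny balanced instance can be solved by brute force over integer shipping plans. The costs, supplies, and demands below are made up for illustration; real solvers attack the linear-programming formulation (one supply constraint per source, one demand constraint per destination) with the simplex method rather than enumeration.

```python
from itertools import product

def transport_brute_force(cost, supply, demand):
    """Exhaustively search integer shipping plans for a balanced
    two-source transportation instance (illustration only).

    cost[i][j]: per-unit cost from source i to destination j.
    Returns (minimum total cost, plan as two rows of shipments).
    """
    best, best_plan = float("inf"), None
    # Enumerate source-0 shipments; source-1 covers the remaining demand.
    for x in product(*(range(min(supply[0], d) + 1) for d in demand)):
        if sum(x) != supply[0]:
            continue  # must ship exactly source 0's supply
        row1 = [d - xi for d, xi in zip(demand, x)]  # balance forces row 1
        total = (sum(c * q for c, q in zip(cost[0], x)) +
                 sum(c * q for c, q in zip(cost[1], row1)))
        if total < best:
            best, best_plan = total, [list(x), row1]
    return best, best_plan

# Made-up balanced instance: supplies 20 + 30 match demands 10 + 25 + 15.
plan_cost, plan = transport_brute_force([[8, 6, 10], [9, 12, 13]],
                                        [20, 30], [10, 25, 15])
```

Because the instance is balanced, fixing source 0's shipments determines source 1's, which keeps the search space small enough to enumerate.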

Parallel Database Systems

CS5225, Parallel DB. Uniprocessor technology has reached its limit: it is difficult to build machines powerful enough to meet the CPU and I/O demands of DBMSs serving large… Topics: PDBS vs distributed DBS; types of parallelism; goals and metrics (speedup).

The Contribution of DNS Lookup Costs to Web Object Retrieval

Craig E. Wills and Hao Shang. WPI-CS-TR-00-12, July 2000. Computer Science Technical Report Series, Worcester Polytechnic Institute. (The original page shows a figure of a local browser resolving names through intermediate and authoritative DNS servers.)

16 Greedy Algorithms

Optimization algorithms typically go through a sequence of steps, with a set of choices at each step. For many optimization problems, using dynamic programming to determine the best choices…
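The greedy pattern sketched above (one locally optimal choice per step, never revisited) can be illustrated with activity selection, the classic example from this chapter family: always keep the compatible activity that finishes earliest. This is an illustrative sketch, not code from the excerpted text.

```python
def select_activities(activities):
    """Greedy activity selection.

    activities: list of (start, finish) pairs.
    Returns a maximum-size subset of pairwise non-overlapping
    activities, chosen by always taking the earliest finisher.
    """
    chosen = []
    last_finish = float("-inf")
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:      # compatible with everything chosen
            chosen.append((start, finish))
            last_finish = finish
    return chosen
```

For `[(1, 3), (2, 5), (4, 7), (6, 9), (8, 10)]` the greedy pass keeps `(1, 3)`, `(4, 7)`, and `(8, 10)`; the sort by finish time is what makes each local choice safe to commit to.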

Clustering Using Graph Connectivity

Patrick Williams, June 3, 2010. Introduction: It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the…

CS 245 Midterm Exam Solution Winter 2015

CS 245 Midterm Exam Solution Winter 2015 CS 245 Midterm Exam Solution Winter 2015 This exam is open book and notes. You can use a calculator and your laptop to access course notes and videos (but not to communicate with other people). You have

More information

Embedding Protocols for Scalable Replication Management

Henning Koch, Dept. of Computer Science, University of Darmstadt, Alexanderstr. 10, D-64283 Darmstadt, Germany (koch@isa.informatik.th-darmstadt.de). Keywords:…

TR-CS-96-05: The rsync algorithm

Andrew Tridgell and Paul Mackerras, June 1996. Joint Computer Science Technical Report Series, Department of Computer Science, Faculty of Engineering and Information Technology…

On Covering a Graph Optimally with Induced Subgraphs

Shripad Thite, April 1, 2006. Abstract: We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number…

Information Retrieval System Using Concept Projection Based on PDDP Algorithm

Minoru Sasaki and Kenji Kita, Department of Information Science & Intelligent Systems, Faculty of Engineering, Tokushima University…

A Performance Study of Hashing Functions for Hardware Applications

M. V. Ramakrishna, E. Fu and E. Bahcekapili, Department of Computer Science, Michigan State University, East Lansing, MI 48824 ({rama, fue,…

CS122 Lecture 10, Winter Term 2014-2015

Last time: plan costing. Introduced ways of approximating plan costs: the number of rows each plan node produces; the amount of disk I/O the plan must perform…

Introduction to Randomized Algorithms

Introduction to Randomized Algorithms Introduction to Randomized Algorithms Gopinath Mishra Advanced Computing and Microelectronics Unit Indian Statistical Institute Kolkata 700108, India. Organization 1 Introduction 2 Some basic ideas from

More information

We assume uniform hashing (UH):

the probe sequence of each key is equally likely to be any of the m! permutations of 0, 1, …, m − 1. UH generalizes the notion of SUH: it produces not just a single number, but a…
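Uniform hashing is an idealization that no practical scheme achieves exactly; a common approximation (an illustrative sketch with toy hash functions, not taken from the excerpt) is double hashing, where two hash values determine a whole probe sequence over the table's m slots:

```python
def probe_sequence(key, m):
    """Double-hashing probe sequence for an integer key.

    Approximates uniform hashing: the first hash picks the starting
    slot, the second picks a nonzero step size, and the i-th probe is
    (h1 + i*h2) mod m. With m prime the step is coprime to m, so the
    sequence visits every slot exactly once.
    """
    h1 = key % m                 # starting slot
    h2 = 1 + (key % (m - 1))     # step size in 1..m-1, never zero
    return [(h1 + i * h2) % m for i in range(m)]
```

Note the gap this illustrates: double hashing generates only about m^2 distinct probe sequences, far fewer than the m! that the uniform hashing assumption posits, yet it behaves close to the ideal in practice.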

A Discrete Lagrangian-Based Global-Search Method for Solving Satisfiability Problems

Journal of Global Optimization, 10, 1-40 (1997). © 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Maintaining α-balanced Trees by Partial Rebuilding

Arne Andersson, Department of Computer Science, Lund University, Box 8, S-22 00 Lund, Sweden. Abstract: The balance criterion defining the class of α-balanced trees… "…where α is a constant, 0 < α < 1. In other words, the ratio between the shortest and longest paths from a node to a leaf is at least α. A BB-tree allows efficie…"

Algorithms for Query Processing and Optimization

Chapter 19. 0. Introduction to Query Processing. Query optimization: the process of choosing a suitable execution strategy for processing a query. Two…

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Hashing for searching

Hashing for searching Hashing for searching Consider searching a database of records on a given key. There are three standard techniques: Searching sequentially start at the first record and look at each record in turn until

More information
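The three standard techniques that this excerpt begins to list (sequential search, binary search over sorted keys, and hashing) can be put side by side; this comparison is an illustrative sketch, not code from the excerpted notes:

```python
import bisect

def sequential_search(keys, target):
    # O(n): look at each key in turn until a match is found.
    for i, k in enumerate(keys):
        if k == target:
            return i
    return -1

def binary_search(sorted_keys, target):
    # O(log n): requires the keys to be kept in sorted order.
    i = bisect.bisect_left(sorted_keys, target)
    return i if i < len(sorted_keys) and sorted_keys[i] == target else -1

def hash_search(index, target):
    # O(1) expected: Python's dict is itself a hash table.
    return index.get(target, -1)
```

For keys `[3, 9, 14, 27, 41]` all three find key 14 at position 2 (with the hash index mapping each key to its position); the trade-off is per-lookup cost versus the work of keeping keys sorted or maintaining the hash index.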

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Computational Complexities of the External Sorting Algorithms with No Additional Disk Space

Computational Complexities of the External Sorting Algorithms with No Additional Disk Space Computational Complexities of the External Sorting Algorithms with o Additional Disk Space Md. Rafiqul Islam, S. M. Raquib Uddin and Chinmoy Roy Computer Science and Engineering Discipline, Khulna University,

More information

Module 9: Selectivity Estimation

Module outline: 9.1 Query Cost and Selectivity Estimation; 9.2 Database Profiles; 9.3 Sampling; 9.4 Statistics Maintained by Commercial DBMSs…

Chapter 17: Parallel Databases

Chapter 17: Parallel Databases Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems

More information

Lab 2: Support Vector Machines

Lab 2: Support Vector Machines Articial neural networks, advanced course, 2D1433 Lab 2: Support Vector Machines March 13, 2007 1 Background Support vector machines, when used for classication, nd a hyperplane w, x + b = 0 that separates

More information

Comparison of Priority Queue Algorithms for a Hierarchical Scheduling Framework

Mikael Åsberg (mag04002@student.mdh.se), August 28, 2008. The Time Event Queue (TEQ) is a data structure that is part of the implementation…

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Mining N-most Interesting Itemsets

Ada Wai-chee Fu, Renfrew Wang-wai Kwong and Jian Tang, Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong ({adafu, wwkwong}@cse.cuhk.edu.hk)

Restricted Delivery Problems on a Network

Esther M. Arkin, Refael Hassin and Limor Klein, December 17, 1996. Abstract: We consider a delivery problem on a network: one is given a network in which nodes…

Chapter 3. Algorithms for Query Processing and Optimization

Chapter 3. Algorithms for Query Processing and Optimization Chapter 3 Algorithms for Query Processing and Optimization Chapter Outline 1. Introduction to Query Processing 2. Translating SQL Queries into Relational Algebra 3. Algorithms for External Sorting 4. Algorithms

More information