
[12] P. Larson. Linear hashing with partial expansions. In Proceedings of the 6th International Conference on Very Large Databases, pages 224–232.
[13] W. Litwin. Linear hashing: a new tool for file and table addressing. In Proceedings of the 6th International Conference on Very Large Databases, pages 212–223.
[14] J. W. Lloyd. Optimal partial-match retrieval. BIT, 20:406–413.
[15] J. W. Lloyd and K. Ramamohanarao. Partial-match retrieval for dynamic files. BIT, 22:150–168.
[16] T. H. Merrett. Why sort-merge gives the best implementation of the natural join. SIGMOD Record, 13(2):39–51, January.
[17] S. Moran. On the complexity of designing optimal partial-match retrieval systems. ACM Transactions on Database Systems, 8(4):543–551, December.
[18] S. Nahar, S. Sahni, and E. Shargowitz. Experiments with simulated annealing. In Proceedings of the 22nd Design Automation Conference, pages 748–752.
[19] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38–71, March.
[20] K. Ramamohanarao and J. W. Lloyd. Dynamic hashing schemes. The Computer Journal, 25:478–485.
[21] K. Ramamohanarao, J. Shepherd, and R. Sacks-Davis. Multi-attribute hashing with multiple file copies for high performance partial-match retrieval. BIT, 30:404–423.
[22] R. L. Rivest. Partial-match retrieval algorithms. SIAM Journal on Computing, 5(1):19–50, March.
[23] T. J. Sager. A polynomial time generator for minimal perfect hash functions. Communications of the ACM, 28(5):523–532, May.
[24] J. A. Thom, K. Ramamohanarao, and L. Naish. A superjoin algorithm for deductive databases. In Proceedings of the 12th International Conference on Very Large Databases, pages 189–196, Kyoto, Japan, August.
[25] K.-Y. Whang and R. Krishnamurthy. The multilevel grid file: a dynamic hierarchical multidimensional file structure. In International Symposium on Database Systems for Advanced Applications, pages 449–459, Tokyo, Japan, April.

If we use the hash-join algorithm then the optimal bit allocation may be determined using heuristic techniques in a feasible amount of time. Additionally, we have demonstrated that the hash-join algorithm is likely to be faster than the sort-merge algorithm when the clustering of the data using multi-attribute hashing is optimal. In Table 10 we have shown that the optimal bit allocation using the hash-join algorithm will typically provide an order of magnitude improvement over the implementation of the algorithm not using an index, even when only two attributes are involved. When many attributes are involved the improvement can be several orders of magnitude when appropriate buffer sizes are used.

References

[1] A. V. Aho and J. D. Ullman. Optimal partial-match retrieval when fields are independently specified. ACM Transactions on Database Systems, 4(2):168–179, June.
[2] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, September.
[3] M. W. Blasgen and K. P. Eswaran. Storage and access in relational databases. IBM Systems Journal, 16(4).
[4] K. Bratbergsengen. Hashing methods and relational algebra operations. In Proceedings of the 10th International Conference on Very Large Databases, pages 323–333, Singapore, August.
[5] W. A. Burkhard. Interpolation-based index maintenance. BIT, 23:274–294.
[6] R. Cichelli. Minimal perfect hash functions made simple. Communications of the ACM, 23(1):17–19, January.
[7] C. Faloutsos. Multiattribute hashing using gray codes. In Proceedings of SIGMOD '86, pages 227–238.
[8] M. Freeston. The BANG file: a new kind of grid file. In U. Dayal and I. Traiger, editors, Proceedings of SIGMOD '87, pages 260–269, San Francisco, California, USA, May.
[9] E. P. Harris and K. Ramamohanarao. Optimal dynamic multi-attribute hashing for range queries. Technical Report 91/34, Department of Computer Science, The University of Melbourne, Parkville, Victoria 3052, Australia, December. Also published as CITRI Technical Report 91/7.
[10] Y. Hsiao and A. L. Tharp. Adaptive hashing. Information Systems, 13(1):111–127.
[11] D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, Massachusetts, USA.

in the tree for the expected file size generated using simulated annealing, and use the MMI algorithm when allocating attributes to additional levels and the maximal marginal decrease (MMD) algorithm when allocating attributes to the upper levels of the tree (and for smaller file sizes). Maximal marginal decrease works in a similar way to MMI, except that, instead of increasing the number of bits (or levels), the number of bits is decremented. It works by decrementing the number of bits allocated to each attribute in turn and calculating the new cost. The bit allocation which results in the best cost value is the one chosen. Thus, if the optimal allocation for n bits is known, the "optimal" allocation for n − 1 bits can be determined, then n − 2 bits, and so on down to the first bit to be allocated. This general technique can be used to provide better indexes for the grid file [19], the BANG file [8], and the multilevel grid file [25], in addition to the k-d tree and other similar indexing schemes in which a choice must be made in determining which attribute to split the data on.

10 Conclusion.

We have described a method by which the optimal clustering arrangement may be found for performing a join operation using either the sort-merge or hash-join algorithms on data files indexed using multi-attribute hashing. We have also described how our method may be applied to enhance the performance of other indexing schemes, such as the grid and BANG files.

First we consider the sort-merge join algorithm. In the case in which there are no maximally allocated attributes and the buffer size is small, the index consists entirely of bits from a single attribute. This attribute is determined by calculating the total probability of each attribute occurring first in a join operation. The attribute with the greatest probability is the one used to create the index. In this case we ignore the fact that the number of bits allocated to an attribute must be greater than d − log₂ B if it is the only attribute allocated any bits. In the case when the number of bits allocated to an attribute may be greater than d − log₂ B, then d − log₂ B bits should be allocated to the attribute and the remainder allocated to the next most probable attribute (assuming that number is not greater than d − log₂ B). When there are attributes which may be maximally allocated, a number of calculations must be performed to deal with these, because the bits will not necessarily be allocated to a single attribute.

An additional feature of this method is that if the hashing functions are order preserving, the index is also optimal for sorting the records based on certain attributes, assuming that the probability of being asked to sort the records has been taken into account when determining the optimal index. One of the primary advantages of this method is that, unlike some optimal indexes for partial-match queries [9, 21], it is extremely quick and easy to calculate. In addition, it is appropriate to use with dynamic files [20] when a single primary attribute is being used to index the file, enabling it to easily use the dynamic properties of linear hashing with its relatively low maintenance costs.
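In outline, the MMD procedure described above can be sketched as follows. The `cost` argument stands in for whichever average-cost function is being minimised (for example, Equation 7); the quadratic `toy_cost` in the example is purely illustrative and not part of the paper's cost model.

```python
# Sketch of maximal marginal decrease (MMD): starting from a known
# allocation of n bits, repeatedly remove the single bit whose removal
# yields the best (lowest) cost, giving "optimal" allocations for
# n, n - 1, ..., 0 bits in turn.

def mmd_schedule(alloc, cost):
    """Yield allocations for n, n - 1, ..., 0 allocated bits.

    alloc: dict mapping attribute name -> number of bits allocated.
    cost:  function taking such a dict and returning its cost.
    """
    alloc = dict(alloc)
    yield dict(alloc)
    while sum(alloc.values()) > 0:
        best = None
        for attr in alloc:
            if alloc[attr] == 0:
                continue
            alloc[attr] -= 1            # trial: remove one bit from attr
            c = cost(alloc)
            if best is None or c < best[0]:
                best = (c, attr)
            alloc[attr] += 1            # undo the trial
        alloc[best[1]] -= 1             # commit the best decrement
        yield dict(alloc)

# Illustrative only: a toy cost that favours balanced allocations.
toy_cost = lambda a: sum(v * v for v in a.values())
steps = list(mmd_schedule({"A1": 3, "A2": 2}, toy_cost))
```

Applied to the real cost function, the same loop walks the allocation down from the full d bits, which is how the tree levels above a known-good allocation can be filled in.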

Figure 2: A k-d tree split on attributes A2, then A1, then A1 (internal nodes show the attribute and value on which the data is split; the leaves contain the data points).

9 Other indexing schemes.

Other indexing schemes can benefit from using our approach to create indexes which will perform better for join queries with a given join probability distribution than indexes created without knowledge of the distribution. These other indexing schemes can be characterised by a tree structure. At each node of the tree, a choice must be made as to which attribute is to be chosen to perform the partitioning on to store the descendants of the node. This corresponds directly to choosing which attribute to allocate a bit from in the multi-attribute hashing scheme. For example, consider a derivative of the k-d tree [2] in Figure 2, in which the internal nodes represent the attribute and value on which the data is split, and the leaves contain the actual data. We assume that this partitioning gives the optimal performance for join queries for a given join query distribution. At each level of the tree a decision has been made as to the attribute to partition the data on. If the data were indexed using multi-attribute hashing, the hash index would be composed of the first bit of attribute two, then the first and second bits of attribute one. This would be the optimal multi-attribute hashing index for the same join query distribution.

The additional information which tree structure techniques require, in addition to how many times an attribute must be used to partition on, is the order in which the partitioning must be performed. This is also required of the multi-attribute hashing index when it dynamically changes size. There are two possible methods of determining the appropriate order. The first is to use the MMI algorithm to determine the optimal attribute to split on at each level of the tree. As was shown in Section 7.2, this does not always lead to the optimal results. The second is to start with the optimal allocation of attributes to levels

greatest probability for the join operation just containing that attribute. These results support our hypothesis.

We note that the improvements in the cost for the optimal bit allocation over the three naive methods, HIGH, PROB and EVEN, are often very large, particularly for larger numbers of attributes. While either PROB or EVEN is often close to optimal, when the number of attributes is greater than three one of them is typically significantly larger than the optimal bit allocation, depending on the probability distribution. Thus, we believe determining the optimal bit allocation is worthwhile, particularly as whether PROB or EVEN is near optimal depends on the nature of the probability distribution. The improvement of the optimal bit allocation over the unindexed cost is very large. For example, in Table 10 it is several orders of magnitude when seven attributes are involved and the buffer size is large.

It is interesting to compare the average cost of sorting in Table 4 and the average cost of partitioning in Table 9. Note that the cost when there is no index is the same. However, as k would not typically be the same for both methods, a direct comparison is not necessarily appropriate. The results show that the optimal bit allocation for the hash-join method involves significantly fewer disk accesses for the partition phase than the optimal bit allocation for the sorting phase of the sort-merge method, assuming the values of k do not differ greatly. Also, as the buffer size increases, the relative improvement over the unindexed cost increases for the hash-join algorithm whereas it decreases for the sort-merge algorithm. Additionally, as we noted in Section 5, if the cost of partitioning one relation is zero, then the second of the two relations involved in the join does not need to be partitioned to achieve an optimal join using the hash-join technique. The optimal indexes created using the above technique tend to maximise the number of queries which do not need to be partitioned. This will result in significantly higher join performance than may at first be indicated, as the second relation need not be partitioned for these queries. Given that the cost of reading and writing the two relations after the sort/partition phase is the same, the hash-join method should be much faster. Thus, we support the conclusion of Bratbergsengen [4] that hash-join is faster than sort-merge under equivalent circumstances. In the presence of optimal indexes the hash-join method of joining will have a lower average join cost than the sort-merge method.

Unlike the sort-merge method, which tends to allocate a large number of bits to only a few attributes and very few bits to the other attributes, the results obtained for the hash-join partitioning are much more like those obtained when generating optimal bit allocations for partial-match queries [15]. Thus, we would expect that if multiple copies of the data could be made with different indexes, the procedure described in [21], using Equation 9 as a basis, could be used to determine the bit allocations in each file.

Table 10: Cost of the hash-join partitioning for various buffer sizes, k = 2, d = 13 (columns: n, then Cost and Impr. for B = 32, 64, 128 and 256).

Table 11: Computational times in seconds of the hash-join partitioning for various algorithms, B = 32, k = 2, d = 13 (columns: n, then MMI and the three SANN variants).

join operations possible when five attributes are considered are of this type. If join operations involving all but two of the attributes are included, the number becomes 300 out of 325. The cost of partitioning in the join operations involving all the attributes will be zero, and the cost of partitioning in the join operations involving all but one of the attributes will be zero if the number of bits allocated to that one attribute is no greater than log₂ B. If the probabilities of the join operations are (approximately) uniformly distributed, then to enable the cost of each join operation in this set to be zero, no more than log₂ B bits should be allocated to any attribute, in which case the cost of the above partition phase will be zero.

The second of the distributions with five attributes was originally created to test this hypothesis. As described above in Section 7.2, it was generated using a skew distribution. This distribution was randomly generated; however, the probability of any given join operation was inversely proportional to the number of attributes in the join operation. The probabilities of the five join operations containing only one attribute outweighed any of the other probabilities, and greatly outweighed the sum of the probabilities of the 240 join operations containing all of the attributes or all but one of the attributes. Additionally, there were no constraints placed on the maximum number of bits any attribute may have. As expected, d − log₂ B bits were allocated to the attribute with the greatest probability for the join operation just containing that attribute, and the remaining log₂ B bits were allocated to the attribute with the second

Table 9: Cost of the hash-join partitioning for various attributes with constraints, k = 2, d = 13 (for B = 32 and B = 64: the bit allocation, cost and improvement found by MMI, the three SA variants, HIGH, PROB and EVEN, together with the constraints on each attribute and the unindexed cost, NONE).

therefore the index should always be used to partition the data to minimise the cost. It follows that the cost of the partitioning for a join operation on a set of attributes q is

    C(q) = k·2^d·log_B(∏_{i∉q} 2^{d_i})   if ∏_{i∉q} 2^{d_i} > B
         = 0                              if ∏_{i∉q} 2^{d_i} ≤ B

which, in the first case, simplifies to

    C(q) = α(d − Σ_{i∈q} d_i), where α = k·2^d / log₂ B.    (8)

Note that if the index-based partition size, m = ∏_{i∉q} 2^{d_i}, is less than the buffer size, then no partitioning needs to be performed. Applying constraints to the number of bits which may be allocated to any attribute has no effect on the method used to determine the optimal bit allocation. Combining Equations 1 and 8, the average cost of partitioning is

    C_P = Σ_{i∈P} p_i C(i)    (9)

where P is the set of all possible joins which may be performed on the relation, and p_i is the probability of the join i. Note that if we assume that ∏_{i∉q} 2^{d_i} > B for all joins, Equation 9 is similar to Equation 7. Therefore, we should be able to determine the optimal bit allocation using a similar approach, that is, heuristically using either the MMI or simulated annealing algorithms. Again we tested both methods to determine whether they are applicable.

8.2 Computational results of heuristic bit allocation.

We obtained a series of computational results using exactly the same join operation probability distributions as in Section 7.2. We used Equation 9 to determine the average cost of the partition phase, and again set d = 13, B = 32, 64, 128 and 256, and k = 2, and assumed that the page size is 4k. The same values for T, S and L shown in Table 3 were used by the simulated annealing algorithms. The results were again generated on a lightly loaded Silicon Graphics 4D/340 with 128Mb of main memory. The results are presented in Tables 9, 10 and 11. The times shown are the sum of the user and system time measured in seconds.
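Equations 8 and 9 can be evaluated directly. The sketch below assumes the bit allocation is given as a dictionary of per-attribute bit counts; the function and parameter names are ours, not the paper's.

```python
import math

# Sketch of Equation 8 (cost of the hash-join partition phase) and
# Equation 9 (its average over a join distribution).

def partition_cost(q, d_bits, d, B, k):
    """Equation 8: partition cost for a join on the attribute set q.

    q:      set of attribute names used in the join
    d_bits: dict mapping attribute name -> bits allocated to it
    d:      total number of index bits (the file occupies 2^d pages)
    B:      buffer size in pages
    k:      constant of the cost model
    """
    # Index-based partition size: the product of 2^d_i over i not in q.
    rest = sum(bits for attr, bits in d_bits.items() if attr not in q)
    if 2 ** rest <= B:
        return 0.0                  # partitions already fit in the buffer
    # k * 2^d * log_B(2^rest); this equals alpha * (d - sum_{i in q} d_i)
    # when all d bits are allocated, with alpha = k * 2^d / log2(B).
    return k * 2 ** d * rest / math.log2(B)

def average_partition_cost(joins, d_bits, d, B, k):
    """Equation 9: joins is a list of (probability, attribute-set) pairs."""
    return sum(p * partition_cost(q, d_bits, d, B, k) for p, q in joins)
```

With d = 13, B = 32 and k = 2 as in the experiments, an allocation of eight bits to one attribute and five to another gives zero partition cost for a join on the first attribute (2^5 = 32 ≤ B) and a nonzero cost for a join on the second.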
The results show that the number of bits allocated to each attribute appears to be based heavily on the constant log₂ B (which is either five or six for these results). If an attribute is not allocated this number of bits, the sum of the numbers of bits allocated to two attributes is often log₂ B. We conjecture that this is because the partition phase of the hash-join operation does not need to be performed if the sum of the numbers of bits allocated to the join attributes is at least d − log₂ B. A large number of the join operations involve either all of the attributes, or all but one of the attributes. For example, 240 of the

performed on the data to partition the data into blocks small enough to be merged in memory can be derived in a manner similar to that of determining the optimal number of merge phases in the merge sort [4].

The hash-join method partitions data using a hash function. Multi-attribute hashing also partitions data using a hash function. Therefore we should be able to create an index which will reduce the amount of partitioning which needs to be performed by the hash-join algorithm. Instead of commencing with one data file which needs to be partitioned, we can start with a number of smaller data files which are distinguished by having different indexes for the appropriate attributes. The superjoin method [24] does this; however, Thom et al. do not attempt to construct an optimal index for the data. We will shortly show how the optimal index may be determined.

8.1 Cost of partitioning in the hash-join method.

The cost of partitioning the data in the hash-join method was given by Bratbergsengen in [4] and can be expressed as

    T_t = 2·t_a·2^d·w

where T_t is the total time, t_a is the time to transport a page from disk to memory, 2^d is the size of the data to be partitioned in pages, and w is given by

    w = 0                              if N = 1
      = 1                              if 1 < N ≤ B
      = q − ⌊(B^q − N)/(B − 1)⌋/N      if N > B

where B is the size of the internal buffer in blocks, N = ⌊(2^d − 1)/B⌋ + 1 is the number of partitions to be created, and q = ⌊log_B(N − 1)⌋ + 1.

Bratbergsengen also gave an approximation of the cost, which can be expressed as

    T = k·2^d·log_B 2^d

where k is a constant. We use the approximation, which provides an upper bound on the number of disk accesses, since it is similar to the approximation we made on the number of disk accesses used by the merge sort in Section 6.

All of the indexing bits for each attribute in the join operation should be used to partition the data. For example, if a relation P(A, B) is involved in a join based on attribute A, then a separate partitioning phase should be performed for each distinct index value of A. If there are p distinct index values for attribute A, there will be p partitions of size m such that p·m = n, where n is the total size of the data; the cost of performing the partitioning on these p partitions is p·k·m·log_B m. This is equal to k·n·log_B(n/p). If there were no partitioning based on the index value the cost would be k·n·log_B n, and since k·n·log_B n > k·n·log_B(n/p),

Table 8: Probability of sorting on the first attribute for five attributes when d = 13, B = 64 (columns: Attribute, Constraints, Optimal allocation, Probability).

over the unindexed method decreases as the buffer size increases. Table 8 shows the probabilities of sorting on an attribute combination starting with each attribute, for the case with five attributes. This is the same attribute combination as the first of the two with five attributes in Table 4. It is interesting to note that while the most probable attribute to initially sort on is attribute three (by a comparatively large margin), it is not allocated the greatest number of bits; instead attribute five is maximally allocated. Also, attributes one, two and four are allocated a bit each, when allocating all three bits to attribute three would maximally allocate it. This shows that if the attributes may be maximally allocated, the optimal bit allocation must be determined heuristically using Equation 7, and not by the rules used when attributes cannot be maximally allocated.

8 Bit allocations using the hash-join algorithm.

An optimal index may be created for data files which use the hash-join method [4] to perform join operations, in a similar way to those we have created for join operations performed using the sort-merge technique. Bratbergsengen [4] showed that the nested loop method, although inefficient when the volume of data is large and spread over many pages of disk, is the fastest known method for joining two relations when one of the relations fits into the buffers in memory. For equally sized operands, the nested loop method is faster than the sort-merge method when the operand volume is less than four to twelve times the buffer size. In general, it is faster when the smallest operand is less than three to eight times the buffer size. The hash-join method works by partitioning the relations into a number of blocks which may be joined using the nested loop method in memory. The hash-join partitioning of the relation is accomplished by making a number of passes over the data. On each pass a hash function is used to split the input file up into multiple output files. If two values are to be joined they must be the same; therefore they will have the same hash value. If an output file is larger than the memory buffer in which the nested loop method will be performed, then a later pass will split it up. The optimal split pattern determining the number of passes over the files (and over which files) required to be

Table 5: Cost of sorting for various buffer sizes, k = 2, d = 13 (columns: n, then Sort cost and Impr. for B = 32, 64, 128 and 256).

Table 6: Computational times in seconds for various algorithms, B = 32, k = 2, d = 13 (columns: n, then MMI and the three SANN variants).

Table 7: MMI progress for seven attributes with constraints, k = 2, B = 32, d = 13 (columns: Attributes, Cost, Time, Constraints).

Table 4: Computational results for various attributes with constraints, k = 2, d = 13 (for B = 32 and B = 64: the bit allocation, sort cost and improvement found by MMI, the three SA variants, HIGH, PROB and EVEN, together with the constraints on each attribute and the unindexed cost, NONE).

number of bits. The remaining bits were then allocated to the next most probable attribute, up to a maximum of d − log₂ B bits, then the next attribute, and so on. EVEN simply allocates an equal number of bits to each attribute, with excess bits going to the first attributes occurring in the relation.

The three sets of results were all very similar in form, so for the sake of brevity we include only one of the three, plus one other with five attributes, for two of the buffer sizes, in Table 4. Table 5 shows the cost of the optimal bit allocation for the tests with five attributes and all of the buffer sizes. The improvement (Impr.) column in Tables 4 and 5 represents the ratio of the unindexed cost to the cost of the bit allocation of that method. Table 7 shows the progress of the MMI algorithm for the case of seven attributes. The results were generated on a lightly loaded Silicon Graphics 4D/340 with 128Mb of main memory. The times shown in Table 6 are the sums of the system and user times taken by the algorithms, measured in seconds, for the first set of results, and are representative of all of the times of the algorithms for the appropriate number of attributes.

The results show that the MMI algorithm, although the fastest, is only useful when there are only a few attributes to choose from. With larger numbers of attributes it is too easy to allocate a bit to the wrong attribute. Table 7 demonstrates another problem with the MMI algorithm. When the whole of the data to be sorted fits into the internal buffer, it does not matter which attribute the bits are allocated to: the cost will be the same. If it arbitrarily chooses an attribute (our version chose the first), this attribute may be allocated bits when it would otherwise not have been (for example, if the buffer size was 1). This can result in the wrong attribute being chosen to allocate bits to, after which the optimal bit allocation cannot be determined.

From the simulated annealing results it is clear that the final bit allocation (for all of the algorithms) will typically consist of a number of attributes allocated as many bits as possible. These results show that both the maximum number of bits permitted to be allocated to an attribute, and the fact that allocating more than d − log₂ B bits to an attribute does not decrease the cost, are important in determining how many bits should be allocated to an attribute. The time taken to determine the optimal bit allocation could be improved significantly by taking this information into account, by modifying the MMI and simulated annealing algorithms, perhaps to use this information to generate starting points when searching for the optimal allocation.

It is interesting to note that the cost improvement over the even distribution is not always very large, even when seven attributes are involved. The second of the distributions with five attributes was skewed towards higher probabilities for sort combinations with fewer attributes, and more towards the first attribute than the second, more towards the second than the third, and so on. With this distribution quite a large improvement was achieved, but still only on the order of 50%. The improvement gained over the allocation based on the probabilities of attributes occurring in a query, which took account of the buffer size, was very small. However, the probabilities would still be required to determine these bit allocations, and the time taken to determine the optimal bit allocation is not very large. Improvements over the unindexed cost range from around 20% to over 100%. Although the cost itself decreases (Table 5), the cost improvement

Table 3: Values of the simulated annealing constants T, S and L for the three SA variants.

exponential function with the number of the iteration of the trial as a parameter. The effect of the cooling function is that the simulated annealing algorithm is likely to accept bit allocations with worse values for the cost function in early iterations of a trial, but only accept improving values in later iterations. This is done so that better bit allocations may be found which may only be reached from the initial bit allocation by passing through worse bit allocations. While simulated annealing is not ideal for all optimisation applications [18], it has proved to be a worthwhile, if expensive, technique in applications of this type before [9, 21], and thus we considered it likely to be appropriate again.

7.2 Computational results.

We obtained a series of computational results in an attempt to determine whether it is possible to determine the optimal bit allocation in a feasible amount of time. To this end we generated three sets of random probability distributions for join operations, for a number of attributes between two and seven and for two different buffer sizes. We randomly generated a constraining number of bits for each attribute, such that this number was the maximum number of bits allowed to be allocated to that attribute. These constraints represent the domain size into which the attribute values are being hashed. A perfect hashing function may, but need not, be used to do this. According to [23], generating a perfect hash function is feasible for domain sizes up to around 2^9, that is, when the number of constraining bits is nine. We used Equation 7 to determine the cost of each bit allocation, and we set d = 13, B = 32, 64, 128 and 256, and k = 2 (therefore the results indicate the number of disk accesses). With these file and block sizes we are dealing with 32Mb data files and a 128k, 256k, 512k or 1024k buffer, assuming a 4k page size.

We obtained computational results for the minimal marginal increase algorithm and for three versions of the simulated annealing algorithm, each with different values for the constants T, S and L. The values of the constants used were the same as those used in [21] and later in [9], and are summarised in Table 3. In addition to the basic four algorithms we also tested three naive algorithms, HIGH, PROB and EVEN, and the cost when there is no index, NONE. HIGH and PROB are allocations based on the probability of an attribute appearing in a sort combination. The sum of the probabilities of the combinations in which each attribute appears was taken. For HIGH, all the bits were allocated to the most probable attribute. For PROB, the bits were allocated to the most probable attribute up to a maximum of d − log₂ B bits or its constraining

where C(s) is given by Equation 6, S is the set of all sort combinations, and the attributes are ordered by their position in the combination s. The optimal bit allocation will give the minimum value for Equation 7. The allocation of bits to attributes may change the attribute A_i in each combination of C(s). Therefore, we attempt to find the optimal allocation of bits using a heuristic approach which attempts to find the minimum value of Equation 7 given the constraints imposed by the value of d.

7.1 Heuristic approaches to bit allocation.

For a variety of similar bit allocation problems, when there is no known algorithmic method which obtains an optimal bit allocation in polynomial time given constraints imposed by the domain in which the work is involved, heuristic approaches to bit allocation have been used [9, 15, 21]. Two of the methods which have been used for this are minimal marginal increase (MMI) [15] and simulated annealing (SA) [9, 21]. We will consider both of them.

Minimal marginal increase. The method of minimal marginal increase starts with no bits allocated to any of the attributes. It works in d steps. At each step a single bit is added to a single attribute. To determine which attribute to add the single bit to, in each step, the bit is added to each attribute in turn and the value of the cost function we are trying to minimise, Equation 7, is calculated. The bit is then added to the attribute which results in the smallest value of the cost function.

Simulated annealing. The method of simulated annealing works by performing a number of trials, T, and returning the bit allocation with the best value for the cost function from amongst these trials. In each trial a random bit allocation is generated, using all d bits, and the value of the cost function, Equation 7, is determined for that bit allocation. A number of iterations of bit perturbing, S, is then performed, starting with this bit allocation. In each iteration the bit allocation is perturbed by decrementing the number of bits allocated to one randomly selected attribute by one and incrementing the number of bits allocated to another randomly selected attribute by one. The value of the cost function for this bit allocation is then determined. If this value is less than the best value of the cost function, or if the cooling function accepts it, this bit allocation is used as the basis for the next iteration; otherwise the previous bit allocation is used. The number of iterations performed is also constrained by a number, L, which specifies the maximum number of iterations permitted to be performed without finding a new bit allocation with a better value for the cost function. If this number is exceeded, the iterations for this trial are stopped instead of the total number of iterations, S, being completed. The cooling function determines whether or not a bit allocation which does not improve the value of the cost function is accepted. It tests whether a randomly generated number is less than the value of an inverse
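The two heuristics can be sketched as follows, with `cost` standing in for Equation 7. T, S and L play the roles described above; the particular cooling schedule (an inverse exponential in the iteration number, as the text describes) and the attribute names are illustrative assumptions, not the paper's exact implementation.

```python
import math
import random

def mmi(attrs, d, cost):
    """Minimal marginal increase: allocate d bits one at a time, each time
    to the attribute whose increment gives the smallest cost."""
    alloc = {a: 0 for a in attrs}
    for _ in range(d):
        best = None
        for a in attrs:
            alloc[a] += 1               # trial increment
            c = cost(alloc)
            alloc[a] -= 1               # undo
            if best is None or c < best[0]:
                best = (c, a)
        alloc[best[1]] += 1             # commit
    return alloc

def simulated_annealing(attrs, d, cost, T, S, L, seed=0):
    """T trials; each starts from a random allocation of all d bits and
    perturbs it up to S times, stopping after L non-improving iterations."""
    rng = random.Random(seed)
    best_alloc, best_c = None, float("inf")
    for _ in range(T):
        alloc = {a: 0 for a in attrs}
        for _ in range(d):              # random initial allocation
            alloc[rng.choice(attrs)] += 1
        cur_c, stale = cost(alloc), 0
        if cur_c < best_c:
            best_alloc, best_c = dict(alloc), cur_c
        for i in range(S):
            if stale >= L:
                break                   # too long without an improvement
            src = rng.choice([a for a in attrs if alloc[a] > 0])
            dst = rng.choice([a for a in attrs if a != src])
            alloc[src] -= 1             # perturb: move one bit src -> dst
            alloc[dst] += 1
            c = cost(alloc)
            if c < cur_c:
                cur_c, stale = c, 0
            elif rng.random() < math.exp(-i / float(S)):
                cur_c, stale = c, stale + 1   # accept a worse move early on
            else:
                alloc[src] += 1         # reject: undo the perturbation
                alloc[dst] -= 1
                stale += 1
            if cur_c < best_c:
                best_alloc, best_c = dict(alloc), cur_c
    return best_alloc, best_c
```

Both return allocations of exactly d bits; MMI evaluates the cost function d·|attrs| times, while the annealer's work is bounded by T·S evaluations.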

Thus, the nth copy of the data file should be indexed by the attribute with the nth highest probability of being sorted. Similarly, if $d - \log_2 B$ bits may be allocated to an attribute (in any file) then they should be allocated to the attribute with the highest probability first, then the second highest probability, and so on. In summary, assuming an unconstrained domain, the attribute with the highest probability of being sorted should be allocated as many bits as possible within a single data file.

7 Maximally allocated attributes.

A special situation arises when an attribute has the maximum possible number of bits allocated to it. For such attributes, allocating extra bits has no effect. For example, if an attribute denotes a person's sex it will have one of two values. Therefore, the hash function need only generate one bit for the attribute. To generate more bits would not provide greater discrimination between the values. We say that this attribute is maximally allocated at one bit. We define a maximally allocated attribute to be an attribute for which, for values a and b of the attribute,

$$\forall a, b : h(a) = h(b) \Rightarrow a = b$$

where h(x) is the hash function used to construct the hash value for the attribute. If all of the bits of a maximally allocated attribute are used in constructing the hash index for the data file then the time taken to join on this attribute alone is linear, because the data file is already sorted on this attribute using the sort key definition above.

Consider a sort combination $(A_1, A_2, A_3)$. If attribute $A_1$ is maximally allocated then the file is completely sorted on this attribute. Instead of using $A_1$ as the only attribute to allocate bits to, if we also allocate bits to $A_2$ the number of records which need to be sorted will decrease. This only occurs if $A_1$ is maximally allocated. Similarly, we may allocate bits to $A_3$ if both $A_1$ and $A_2$ are maximally allocated.
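The definition above says that an attribute is maximally allocated exactly when its hash function is injective on the attribute's domain. A minimal sketch of that check, with hypothetical hash functions of our own for illustration:

```python
def is_maximally_allocated(domain, h):
    """True iff h(a) == h(b) implies a == b over the whole domain,
    i.e. extra hash bits would add no discrimination."""
    seen = {}
    for v in domain:
        hv = h(v)
        if hv in seen and seen[hv] != v:
            return False                   # two distinct values collide
        seen[hv] = v
    return True

# A one-bit hash over a two-valued attribute (e.g. sex) is maximally allocated.
sex_hash = {"F": 0, "M": 1}.get
print(is_maximally_allocated(["F", "M"], sex_hash))        # True
# The same one-bit hash over a four-valued attribute is not.
quarter_hash = lambda q: q % 2
print(is_maximally_allocated([1, 2, 3, 4], quarter_hash))  # False
```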
Thus the cost of sorting for a single combination of attributes, a generalisation of Equations 2 and 3, becomes

$$C((A_1, \ldots, A_i, \ldots, A_m)) = \begin{cases} \alpha\left(d - \sum_{j=1}^{i} d_{A_j}\right) & \text{if } \sum_{j=1}^{i} d_{A_j} < d - \log_2 B \\ 2 \cdot 2^d & \text{if } \sum_{j=1}^{i} d_{A_j} \ge d - \log_2 B \end{cases} \quad (6)$$

where $A_i$ is now the first non-maximally allocated attribute in the combination. Note that if the maximum number of bits which may be allocated to the attribute $A_1$ is greater than or equal to d then i will be 1 and Equation 6 reduces to Equations 2 and 3. If we combine Equations 1 and 6, we find that the average cost of sorting on all combinations is

$$C_S = \sum_{s \in S} p_s C(s) \quad (7)$$
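Equations 6 and 7 can be written as executable cost functions. This is a sketch under our own naming assumptions: `alloc` maps attributes to allocated bits, and `max_bits` records the bit count at which each attribute becomes maximally allocated.

```python
import math

def sort_cost(combo, alloc, max_bits, d, B, k=1.0):
    """Equation 6: cost of sorting on one attribute combination.
    Bits of leading maximally allocated attributes, plus those of the
    first non-maximally allocated attribute, all count as pre-sorted."""
    alpha = k * 2**d / math.log2(B)
    sorted_bits = 0
    for a in combo:
        sorted_bits += alloc[a]
        if alloc[a] < max_bits[a]:
            break                          # first non-maximally allocated attribute
    if sorted_bits < d - math.log2(B):
        return alpha * (d - sorted_bits)   # merge sort of the remaining runs
    return 2 * 2**d                        # runs fit in memory: one read + write

def average_sort_cost(combos, alloc, max_bits, d, B, k=1.0):
    """Equation 7: probability-weighted cost over all sort combinations;
    combos is a list of (combination, probability) pairs."""
    return sum(p * sort_cost(s, alloc, max_bits, d, B, k) for s, p in combos)
```

With d = 10 and B = 4, an attribute maximally allocated at 3 bits followed by one holding 2 bits yields a cost of $\alpha(10 - 5)$, while pushing the pre-sorted bits past $d - \log_2 B$ switches to the in-memory cost $2 \cdot 2^d$.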

We wish to minimise Equation 4, the average sort cost. Since $\alpha$ and d are constant, it follows that we wish to maximise

$$\sum_{i=1}^{n} p_i d_i.$$

The maximum value of this is $p_s d$ where $p_s = \max_{i=1..n}(p_i)$. Thus, to form an index which minimises the average sorting cost we allocate all of the bits to the attribute with the highest probability of being required to be sorted. This is a surprising result. As a consequence, in the optimal bit allocation we allocate all the bits to the attribute with the highest probability if this number is less than, or equal to, $d - \log_2 B$. If it is greater than $d - \log_2 B$ then $d - \log_2 B$ bits should be allocated to it and the rest allocated to the attribute with the second highest probability, provided this number is less than, or equal to, $d - \log_2 B$. If it is greater than $d - \log_2 B$ then $d - \log_2 B$ bits should be allocated to it and the rest allocated to the attribute with the third highest probability, and so on. If there are constraints on the domain size of an attribute then fewer bits may also be allocated to it. For example, if the attribute denotes a person's sex there are only two possible values. This is fully discussed in Section 7; the current result requires an unconstrained domain, that is, a domain of size larger than $2^d / B$ from which hash values are generated.

6.1 Multiple copies of data.

If we can create multiple copies of the data file, each indexed using a different hash indexing scheme, our aim is still to minimise the average sort cost. If the number of copies is m, and the number of bits allocated to attribute i in copy j is $d_i^j$, the average cost of sorting is given by

$$C_S = \sum_{i=1}^{n} p_i \min_{j=1..m}\left(\alpha(d - d_i^j)\right) = \alpha\left(d - \sum_{i=1}^{n} p_i \max_{j=1..m}(d_i^j)\right) \quad (5)$$

We wish to minimise Equation 5. Since $\alpha$ and d are constant, it follows that we wish to maximise

$$\sum_{i=1}^{n} p_i \max_{j=1..m}(d_i^j).$$

As above, if there is one copy of the data file (m = 1), the maximum value is $p_{s_1} d$ where $p_{s_1}$ is the highest probability. For m copies of the data file, the maximum value is

$$\sum_{j=1}^{m} p_{s_j} d$$

where $p_{s_j}$ is the jth highest probability. To achieve this, in the jth file d bits should be allocated to the attribute with the jth highest probability.
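The allocation rule for a single file can be sketched as a short greedy procedure. The function name is ours, and we assume each attribute carries a cap on its usable bits (its maximal allocation, discussed in Section 7); an effectively unbounded cap models the unconstrained-domain case.

```python
def allocate_bits(probs, caps, d, b_log2):
    """Give d - log2(B) bits (or up to the attribute's cap) to the attribute
    with the highest sort probability, then the next, until d bits are placed."""
    limit = d - b_log2                     # bits past this do not reduce the cost
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    alloc = [0] * len(probs)
    remaining = d
    for i in order:
        if remaining <= 0:
            break
        give = min(limit, caps[i], remaining)
        alloc[i] = give
        remaining -= give
    return alloc
```

For example, with d = 12 and $\log_2 B = 2$, the highest-probability attribute receives 10 bits and the next receives the remaining 2; if the first attribute is capped at 1 bit (a two-valued domain), the spare bits flow to the next attributes in probability order.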

on $2^d$ blocks. Therefore, C is given by

$$C((A_1, A_2, \ldots, A_m)) = 2^{d_{A_1}} \cdot k 2^{d - d_{A_1}} \log_B\left(2^{d - d_{A_1}}\right) = k 2^d \log_B\left(2^{d - d_{A_1}}\right) = \alpha(d - d_{A_1}) \quad \text{if } d_{A_1} < d - \log_2 B \quad (2)$$

where $\alpha = k 2^d / \log_2 B$, $1 \le m \le n$, and n is the total number of attributes. k is a small constant associated with the disk-based merge sort which takes into account features of this method such as the decreasing total number of blocks as the sorting takes place, because hashing does not ensure 100% occupancy of the blocks. k, d and B are constants, therefore $\alpha$ is a constant. We have assumed that $d_{A_1} < d - \log_2 B$, that is, that the $2^{d - d_{A_1}}$ blocks of records to be sorted cannot all be contained within memory at once. If they can be contained within memory then the number of disk accesses (reads and writes) for the sort operation is given by

$$C((A_1, A_2, \ldots, A_m)) = 2 \cdot 2^{d_{A_1}} \cdot 2^{d - d_{A_1}} = 2 \cdot 2^d \quad \text{if } d_{A_1} \ge d - \log_2 B \quad (3)$$

as no merge phase need occur. We note that it is extremely desirable to allocate $d - \log_2 B$ bits to an attribute; however, the performance does not improve by allocating more bits than this. For the remainder of the calculations in this section we assume that $d_{A_1} < d - \log_2 B$.

Note that Equation 2 is independent of all of the attributes except the first attribute in each combination, and is independent of both m and n. As $\alpha$ and d are constants, the only variable in Equation 2 is the number of bits allocated to the first attribute, $d_{A_1}$. Therefore, the cost of sorting a combination of attributes is the same as the cost of sorting only on the first attribute of the combination. Thus, we define the cost of sorting on an attribute combination starting with $A_1$ to be

$$C(A_1) = C((A_1)) = C((A_1, A_2, A_3)) = C((A_1, A_3, A_2)) = C((A_1, \ldots)) = \alpha(d - d_{A_1})$$

Therefore the cost of any attribute combination is one of the n costs $C(A_1), \ldots, C(A_n)$. If we combine Equations 1 and 2, we find that the average sorting cost is

$$C_S = \sum_{i=1}^{n} p_i \alpha(d - d_i) = \alpha\left(d - \sum_{i=1}^{n} p_i d_i\right) \quad (4)$$

where $d_i$ is the number of bits allocated to the ith attribute and $p_i$ is the sum of the probabilities of the combinations in which the ith attribute is first.
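Equations 2-4 can be checked numerically. A small sketch under assumed parameter values (d = 20, B = 16, k = 1; the names are ours):

```python
import math

d, B, k = 20, 16, 1.0            # index bits, buffer size in blocks, merge-sort constant
alpha = k * 2**d / math.log2(B)  # the constant factor in Equation 2

def sort_cost(d_a1):
    """Equations 2 and 3: cost of sorting any combination whose first
    attribute has d_a1 index bits allocated to it."""
    if d_a1 < d - math.log2(B):
        return alpha * (d - d_a1)          # 2^d_a1 merge sorts of 2^(d-d_a1) blocks
    return 2 * 2**d                        # each run fits in memory: one read + write

def average_cost(p, alloc):
    """Equation 4: p[i] is the probability that attribute i is first
    in the sort combination, alloc[i] its allocated bits."""
    return sum(pi * sort_cost(di) for pi, di in zip(p, alloc))
```

Note how allocating bits to the first attribute lowers the cost linearly until $d_{A_1}$ reaches $d - \log_2 B$, after which the in-memory cost $2 \cdot 2^d$ applies and further bits are wasted.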

Table 2: Two sort combinations.

Figure 1: The arrangement of pages in a hash file.

subsequent bits of the hash value of $A_1$, and the value of $A_1$ itself, are more significant than any of those of $A_2$. Each of these hash value combinations represents a block in the file, and the blocks may be retrieved in any order at little additional cost; therefore the cost of sorting on the combination $(A_1, A_2)$ is effectively the same whether the bits of each attribute are grouped together or interleaved. We now wish to determine the cost in disk accesses, C, of sorting on any combination of attributes $(A_1, A_2, \ldots, A_m)$. The cost of a disk-based merge sort is $O(n \log_s n)$ [11], where n is the number of elements to be sorted and s the number of streams which are merged simultaneously. To sort the whole of the data file of $2^d$ blocks would take $k 2^d \log_B 2^d$ disk accesses, where B is the buffer size in blocks and k a small constant. However, the file is already partially sorted on the first attribute if we use the sort method described above. If the number of bits allocated to the first attribute is $d_{A_1}$ then we can perform $2^{d_{A_1}}$ merge sort operations on $2^{d - d_{A_1}}$ blocks each instead of a single merge sort operation

becomes feasible to devote one copy, or more, to the efficient answering of queries containing the join operator, especially if join is a frequent operation. We wish to create an index which minimises the cost of performing all join operations. If we use the sort-merge join algorithm, this problem reduces to minimising the cost of performing a sorting operation on the same relations. The average cost of sorting, $C_S$, is given by

$$C_S = \sum_{i=1}^{|S|} p_i C_i \quad (1)$$

where S is the set of all combinations of attributes on which a sort operation is performed, $p_i$ is the probability of the ith sort combination, and $C_i$ is the cost of the ith sort combination (a sort combination and its cost are defined below).

It is extremely difficult to maintain sorted files in a dynamic environment when the files are to be sorted based on multiple attributes. While data structures which have an implicit ordering within them, such as B-trees, perform the task of ordering based on a single attribute very well, retrieving a sorted file based on a different attribute is extremely expensive. Even if a separate index is used, the retrieved data will still not be clustered, making retrieval expensive. However, data structures such as the grid or BANG files can be used with the optimisations we suggest in this paper (this is discussed in Section 9).

Although the sort-merge algorithm requires the relations to be sorted, they do not have to be sorted solely on the value of the attributes. We define a new partial sort key to be the hashed value of the attribute concatenated with the value of the attribute. The presence of the value of the attribute deals with the case of different attribute values having the same hash value. The most significant part of the sort key is the hash value. A variation of this idea appears in both the hash-join and superjoin techniques.

One advantage of performing joins using this technique is that the indexes generated may easily be used to retrieve sorted relations directly if the hash functions are order preserving [5]. We now define a notation for sorting on combinations of attributes. Let $(A_1, \ldots, A_n)$ be the result of sorting the contents of a data file based upon $A_1$, then $A_2$, and so on, up to $A_n$. The result will be ordered upon increasing (or decreasing, if desired) values of $A_1$. Within each distinct value of $A_1$ the records will be ordered on increasing (or decreasing) values of $A_2$, and within each value of $A_{n-1}$ the records will be ordered on increasing (or decreasing) values of $A_n$. The sort types of the attributes may also be mixed, that is, some may be in increasing order and some may be in decreasing order. For example, in Table 2 the records in the relation on the left are ordered on the sort combination $(A_1, A_2)$ and the records in the relation on the right are ordered on the sort combination $(A_2, A_1)$.

The index of a file using multi-attribute hashing is already partially sorted using our sort keys. For example, consider a file with two attributes, $A_1$ and $A_2$, with three bits allocated from each to form the index. The file may be stored as in Figure 1. If the data is required to be sorted on the combination $(A_1, A_2)$, that is, sort on $A_1$ then on $A_2$, each of the blocks with the same value of $A_1$ would have to be sorted and merged together. This is because the fourth and
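The partial sort key, the hash value with the raw attribute value appended as a tie-breaker, can be sketched directly; the helper names and the three-bit hash are our own illustrative assumptions.

```python
def sort_key(hash_fn, value):
    """Partial sort key: the hash value is most significant; the raw value
    distinguishes different attribute values that share a hash value."""
    return (hash_fn(value), value)

# A hypothetical 3-bit hash for a single attribute.
h = lambda v: v % 8

records = [12, 3, 27, 8, 3, 19]
# Sorting on the combination (A1,) using the partial sort key:
ordered = sorted(records, key=lambda r: sort_key(h, r))
print(ordered)   # [8, 3, 3, 19, 27, 12]
```

Records 3, 19 and 27 all hash to 3, so they cluster together under the hash and are then ordered by their actual values, exactly the behaviour the merge phase of a sort-merge join requires.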

necessary to understand exactly what the stages are in each method and which stages we have measured. The sort-merge algorithm consists of a sorting phase and a merging phase. The cost of the join operation, in disk accesses, may be described as follows:

$$\mathrm{cost}(P \bowtie_{A=B} Q) = \mathrm{sort}(A)(P) + \mathrm{sort}(B)(Q) + \mathrm{merge}$$

where

$$\mathrm{merge} = \mathrm{read}(P) + \mathrm{read}(Q) + \mathrm{output}$$

and sort(A)(P) is the cost of sorting relation P based on attribute A, read(P) is the cost of reading relation P, and output is the cost of writing the result of the join. The value of sort(A)(P) will be zero if, and only if, all the attributes can be directly retrieved in sorted order using the indexes; that is, no sorting needs to be performed across data blocks.

The hash-join algorithm consists of a partitioning phase, described in Section 8, and a merging phase. The cost of the join operation, in disk accesses, may be described as follows:

$$\mathrm{cost}(P \bowtie_{A=B} Q) = \mathrm{partition}(A)(P) + \mathrm{partition}(B)(Q) + \mathrm{merge}$$

where partition(A)(P) is the total cost of partitioning the relation P based on the attribute A, and merge is a similar process to the merge in the sort-merge algorithm with the same number of disk accesses. Note that if partition(A)(P) is zero then the partitioning of relation Q is unnecessary, so partition(B)(Q) is zero. When the initial partition sizes, based on the index of the relation, are less than or equal to the size of the buffer in memory, the partitioning phase is unnecessary, therefore partition(A)(P) will be zero. If one relation, or an index-based partition thereof, will totally fit into memory then only a single read of the other relation will be necessary to perform the join. Thus the partitioning of the other relation is unnecessary.

The costs described in Sections 6 and 7 represent the cost of the sorting phase of the sort-merge algorithm in terms of the number of disk accesses (both block reads and writes). Similarly, the costs described in Section 8 represent the cost of the partitioning phase of the hash-join algorithm in terms of the number of disk accesses. We ignore the cost of the merge phase in each of these sections because we cannot minimise the number of reads and writes in this phase; they are constant. Thus we concentrate on optimising the sorting and partitioning phases.

6 General solution using the sort-merge algorithm.

As the cost of mass storage decreases it is becoming increasingly cost-effective to have multiple copies of data files, each indexed using a different scheme. This results in an increase in performance for retrieval of data at the expense of requiring additional storage space and increased insertion, deletion and update costs. If multi-attribute hashing is used to index the files, each index uses a different number of bits from each attribute. Under these circumstances it
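The two cost breakdowns above can be expressed as a small cost model; the function names and the convention that each argument is a count of disk accesses are our own assumptions.

```python
def sort_merge_cost(sort_p, sort_q, read_p, read_q, output):
    """cost(P join Q) = sort(A)(P) + sort(B)(Q) + merge, where
    merge = read(P) + read(Q) + output, all in disk accesses."""
    merge = read_p + read_q + output
    return sort_p + sort_q + merge

def hash_join_cost(partition_p, partition_q, read_p, read_q, output):
    """cost(P join Q) = partition(A)(P) + partition(B)(Q) + merge.
    If P needs no partitioning, Q need not be partitioned either."""
    if partition_p == 0:
        partition_q = 0
    merge = read_p + read_q + output
    return partition_p + partition_q + merge
```

Since the merge term is identical in both models, minimising either join cost reduces to minimising the sorting or partitioning terms, which is exactly the focus of the following sections.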

Join algorithm costs revisited. Evan P. Harris and Kotagiri Ramamohanarao, Department of Computer Science, The University of Melbourne. The VLDB Journal (1996) 5:64-84.
Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University

Using the Holey Brick Tree for Spatial Data. in General Purpose DBMSs. Northeastern University Using the Holey Brick Tree for Spatial Data in General Purpose DBMSs Georgios Evangelidis Betty Salzberg College of Computer Science Northeastern University Boston, MA 02115-5096 1 Introduction There is

More information

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori

The only known methods for solving this problem optimally are enumerative in nature, with branch-and-bound being the most ecient. However, such algori Use of K-Near Optimal Solutions to Improve Data Association in Multi-frame Processing Aubrey B. Poore a and in Yan a a Department of Mathematics, Colorado State University, Fort Collins, CO, USA ABSTRACT

More information

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C

Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of C Optimum Alphabetic Binary Trees T. C. Hu and J. D. Morgenthaler Department of Computer Science and Engineering, School of Engineering, University of California, San Diego CA 92093{0114, USA Abstract. We

More information

Worst-case running time for RANDOMIZED-SELECT

Worst-case running time for RANDOMIZED-SELECT Worst-case running time for RANDOMIZED-SELECT is ), even to nd the minimum The algorithm has a linear expected running time, though, and because it is randomized, no particular input elicits the worst-case

More information

A Hybrid Recursive Multi-Way Number Partitioning Algorithm

A Hybrid Recursive Multi-Way Number Partitioning Algorithm Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence A Hybrid Recursive Multi-Way Number Partitioning Algorithm Richard E. Korf Computer Science Department University

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

As an additional safeguard on the total buer size required we might further

As an additional safeguard on the total buer size required we might further As an additional safeguard on the total buer size required we might further require that no superblock be larger than some certain size. Variable length superblocks would then require the reintroduction

More information

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1 CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query Sub-System Queries Select * From Blah B Where B.blah = blah Query Parser Query Optimizer Plan Generator Plan Cost

More information

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as An empirical investigation into the exceptionally hard problems Andrew Davenport and Edward Tsang Department of Computer Science, University of Essex, Colchester, Essex CO SQ, United Kingdom. fdaveat,edwardgessex.ac.uk

More information

Richard E. Korf. June 27, Abstract. divide them into two subsets, so that the sum of the numbers in

Richard E. Korf. June 27, Abstract. divide them into two subsets, so that the sum of the numbers in A Complete Anytime Algorithm for Number Partitioning Richard E. Korf Computer Science Department University of California, Los Angeles Los Angeles, Ca. 90095 korf@cs.ucla.edu June 27, 1997 Abstract Given

More information

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented which, for a large-dimensional exponential family G,

More information

An Improved Algebraic Attack on Hamsi-256

An Improved Algebraic Attack on Hamsi-256 An Improved Algebraic Attack on Hamsi-256 Itai Dinur and Adi Shamir Computer Science department The Weizmann Institute Rehovot 76100, Israel Abstract. Hamsi is one of the 14 second-stage candidates in

More information

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907 The Game of Clustering Rowena Cole and Luigi Barone Department of Computer Science, The University of Western Australia, Western Australia, 697 frowena, luigig@cs.uwa.edu.au Abstract Clustering is a technique

More information

Optimal Sequential Multi-Way Number Partitioning

Optimal Sequential Multi-Way Number Partitioning Optimal Sequential Multi-Way Number Partitioning Richard E. Korf, Ethan L. Schreiber, and Michael D. Moffitt Computer Science Department University of California, Los Angeles Los Angeles, CA 90095 IBM

More information

Heap-Filter Merge Join: A new algorithm for joining medium-size relations

Heap-Filter Merge Join: A new algorithm for joining medium-size relations Oregon Health & Science University OHSU Digital Commons CSETech January 1989 Heap-Filter Merge Join: A new algorithm for joining medium-size relations Goetz Graefe Follow this and additional works at:

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

A Recursive Coalescing Method for Bisecting Graphs

A Recursive Coalescing Method for Bisecting Graphs A Recursive Coalescing Method for Bisecting Graphs The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Accessed Citable

More information

A Population-Based Learning Algorithm Which Learns Both. Architectures and Weights of Neural Networks y. Yong Liu and Xin Yao

A Population-Based Learning Algorithm Which Learns Both. Architectures and Weights of Neural Networks y. Yong Liu and Xin Yao A Population-Based Learning Algorithm Which Learns Both Architectures and Weights of Neural Networks y Yong Liu and Xin Yao Computational Intelligence Group Department of Computer Science University College,

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

Implementation of Relational Operations

Implementation of Relational Operations Implementation of Relational Operations Module 4, Lecture 1 Database Management Systems, R. Ramakrishnan 1 Relational Operations We will consider how to implement: Selection ( ) Selects a subset of rows

More information

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department.

PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES. Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu. Electrical Engineering Department. PARALLEL EXECUTION OF HASH JOINS IN PARALLEL DATABASES Hui-I Hsiao, Ming-Syan Chen and Philip S. Yu IBM T. J. Watson Research Center P.O.Box 704 Yorktown, NY 10598, USA email: fhhsiao, psyug@watson.ibm.com

More information

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER

THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER THE EFFECT OF JOIN SELECTIVITIES ON OPTIMAL NESTING ORDER Akhil Kumar and Michael Stonebraker EECS Department University of California Berkeley, Ca., 94720 Abstract A heuristic query optimizer must choose

More information

9/24/ Hash functions

9/24/ Hash functions 11.3 Hash functions A good hash function satis es (approximately) the assumption of SUH: each key is equally likely to hash to any of the slots, independently of the other keys We typically have no way

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Chapter 14 Comp 521 Files and Databases Fall 2010 1 Relational Operations We will consider in more detail how to implement: Selection ( ) Selects a subset of rows from

More information

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the

Heap-on-Top Priority Queues. March Abstract. We introduce the heap-on-top (hot) priority queue data structure that combines the Heap-on-Top Priority Queues Boris V. Cherkassky Central Economics and Mathematics Institute Krasikova St. 32 117418, Moscow, Russia cher@cemi.msk.su Andrew V. Goldberg NEC Research Institute 4 Independence

More information

On the Complexity of Interval-Based Constraint. Networks. September 19, Abstract

On the Complexity of Interval-Based Constraint. Networks. September 19, Abstract On the Complexity of Interval-Based Constraint Networks Rony Shapiro 1, Yishai A. Feldman 2, and Rina Dechter 3 September 19, 1998 Abstract Acyclic constraint satisfaction problems with arithmetic constraints

More information

Evaluation of Relational Operations. Relational Operations

Evaluation of Relational Operations. Relational Operations Evaluation of Relational Operations Chapter 14, Part A (Joins) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Relational Operations v We will consider how to implement: Selection ( )

More information

Laboratoire de l Informatique du Parallélisme

Laboratoire de l Informatique du Parallélisme Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON n o 8512 SPI Multiplication by an Integer Constant Vincent Lefevre January 1999

More information

O(n): printing a list of n items to the screen, looking at each item once.

O(n): printing a list of n items to the screen, looking at each item once. UNIT IV Sorting: O notation efficiency of sorting bubble sort quick sort selection sort heap sort insertion sort shell sort merge sort radix sort. O NOTATION BIG OH (O) NOTATION Big oh : the function f(n)=o(g(n))

More information

Striped Grid Files: An Alternative for Highdimensional

Striped Grid Files: An Alternative for Highdimensional Striped Grid Files: An Alternative for Highdimensional Indexing Thanet Praneenararat 1, Vorapong Suppakitpaisarn 2, Sunchai Pitakchonlasap 1, and Jaruloj Chongstitvatana 1 Department of Mathematics 1,

More information

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst

An Evaluation of Information Retrieval Accuracy. with Simulated OCR Output. K. Taghva z, and J. Borsack z. University of Massachusetts, Amherst An Evaluation of Information Retrieval Accuracy with Simulated OCR Output W.B. Croft y, S.M. Harding y, K. Taghva z, and J. Borsack z y Computer Science Department University of Massachusetts, Amherst

More information

Sparse Hypercube 3-Spanners

Sparse Hypercube 3-Spanners Sparse Hypercube 3-Spanners W. Duckworth and M. Zito Department of Mathematics and Statistics, University of Melbourne, Parkville, Victoria 3052, Australia Department of Computer Science, University of

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Enumeration of Full Graphs: Onset of the Asymptotic Region. Department of Mathematics. Massachusetts Institute of Technology. Cambridge, MA 02139

Enumeration of Full Graphs: Onset of the Asymptotic Region. Department of Mathematics. Massachusetts Institute of Technology. Cambridge, MA 02139 Enumeration of Full Graphs: Onset of the Asymptotic Region L. J. Cowen D. J. Kleitman y F. Lasaga D. E. Sussman Department of Mathematics Massachusetts Institute of Technology Cambridge, MA 02139 Abstract

More information

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD

Localization in Graphs. Richardson, TX Azriel Rosenfeld. Center for Automation Research. College Park, MD CAR-TR-728 CS-TR-3326 UMIACS-TR-94-92 Samir Khuller Department of Computer Science Institute for Advanced Computer Studies University of Maryland College Park, MD 20742-3255 Localization in Graphs Azriel

More information

Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube

Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube Kavish Gandhi April 4, 2015 Abstract A geodesic in the hypercube is the shortest possible path between two vertices. Leader and Long

More information

The Grid File: An Adaptable, Symmetric Multikey File Structure

The Grid File: An Adaptable, Symmetric Multikey File Structure The Grid File: An Adaptable, Symmetric Multikey File Structure Presentation: Saskia Nieckau Moderation: Hedi Buchner The Grid File: An Adaptable, Symmetric Multikey File Structure 1. Multikey Structures

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Reinforcement Control via Heuristic Dynamic Programming. K. Wendy Tang and Govardhan Srikant. and

Reinforcement Control via Heuristic Dynamic Programming. K. Wendy Tang and Govardhan Srikant. and Reinforcement Control via Heuristic Dynamic Programming K. Wendy Tang and Govardhan Srikant wtang@ee.sunysb.edu and gsrikant@ee.sunysb.edu Department of Electrical Engineering SUNY at Stony Brook, Stony

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

Visualizing Weighted Edges in Graphs

Visualizing Weighted Edges in Graphs Visualizing Weighted Edges in Graphs Peter Rodgers and Paul Mutton University of Kent, UK P.J.Rodgers@kent.ac.uk, pjm2@kent.ac.uk Abstract This paper introduces a new edge length heuristic that finds a

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

R has an ordered clustering index file on its tuples: read the index file to get the location of the tuple with the next smallest value

CS554, Homework 5. Question 1 (20 pts). Given: the content of a relation R is stored as four contiguous runs of duplicate values (d…d, a…a, c…c, b…b)…

Implementations of Dijkstra's Algorithm Based on Multi-Level Buckets

Andrew V. Goldberg, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 (avg@research.nj.nec.com); Craig Silverstein, Computer…

Joint Entity Resolution

Steven Euijong Whang and Hector Garcia-Molina, Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305, USA ({swhang, hector}@cs.stanford.edu)…

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing. Topics: Basic Concepts; Ordered Indices; B+-Tree Index Files; B-Tree Index Files; Static Hashing; Dynamic Hashing; Comparison of Ordered Indexing and Hashing; Index Definition in SQL…

Overview of Query Evaluation (9/26/17)

CompSci 516 Database Systems, Lecture 10: Query Evaluation and Join Algorithms. Announcement: project proposal PDF due on Sakai by 5 pm tomorrow, Thursday 09/27 (one per group, by any member). Instructor: Sudeepa…

GSAT and Local Consistency

Kalev Kask and Rina Dechter, Department of Information and Computer Science, University of California, Irvine, CA 92717-3425 ({kkask, dechter}@ics.uci.edu). Abstract: It has been…

CSCI 5454: Randomized Min Cut

Sean Wiese and Ramya Nair, April 8, 2013. A classic problem in computer science is finding the minimum cut of an undirected graph. If we are presented with…
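The randomized min-cut idea that this excerpt begins to describe can be sketched with Karger's contraction algorithm: repeatedly merge the endpoints of a uniformly random edge until only two super-vertices remain, then count the edges crossing between them, and repeat the whole experiment many times to make finding the true minimum likely. This is an illustrative sketch of the standard algorithm, not code from the listed course notes.

```python
import random

def karger_min_cut(edges, n, trials=200, seed=0):
    """Estimate the min cut of an undirected multigraph.

    edges: list of (u, v) pairs over vertices 0..n-1.
    Runs `trials` independent random-contraction experiments and
    returns the smallest cut size observed.
    """
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(trials):
        # Union-find structure representing contracted super-vertices.
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        remaining = n
        pool = edges[:]
        rng.shuffle(pool)  # random order = picking uniform random edges
        for u, v in pool:
            if remaining == 2:
                break
            ru, rv = find(u), find(v)
            if ru != rv:          # skip self-loops, contract this edge
                parent[ru] = rv
                remaining -= 1
        # Edges whose endpoints lie in different super-vertices form the cut.
        cut = sum(1 for u, v in edges if find(u) != find(v))
        best = min(best, cut)
    return best
```

Each single experiment finds a minimum cut with probability at least 2/(n(n-1)), which is why the outer `trials` loop is essential.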

An Overview of Cost-based Optimization of Queries with Aggregates

Surajit Chaudhuri, Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94304 (chaudhuri@hpl.hp.com); Kyuseok Shim, IBM Almaden Research…

Handout 3: Problem Set 1 Solutions

Massachusetts Institute of Technology, 6.854/18.415: Advanced Algorithms, September 14, 1999, David Karger. Problem 1: Suppose that we have a chain of n − 1 nodes in a Fibonacci… ("…when the mark counter reaches c, a cascading cut is performed and the mark counter is reset to 0. So the actual cost is 2…")

Evaluation of relational operations

Iztok Savnik, FAMNIT. Slides and textbook. Textbook: Raghu Ramakrishnan and Johannes Gehrke, Database Management Systems, McGraw-Hill, 3rd ed., 2007. Slides: from the Cow Book…

Automatic Interpretation of Floor Plans Using Spatial Indexing

Hanan Samet and Aya Soffer, Computer Science Department… In Progress in Image Analysis and Processing III, pp. 233-240, World Scientific, Singapore, 1994.

Process Allocation for Load Distribution in Fault-Tolerant Multicomputers

Jong Kim and Heejo Lee, Dept. of Computer Science and Engineering; Sunggu Lee, Dept. of Electrical Engineering; Pohang University…

QUERY PROCESSING IN A RELATIONAL DATABASE MANAGEMENT SYSTEM

Gawande Balaji Ramrao, Research Scholar, Dept. of Computer Science, CMJ University, Shillong, Meghalaya. Abstract: Database management systems will…

Simple Evolutionary Heuristics for Global Optimization

Josef Tvrdík and Ivan Křivý, University of Ostrava, Bráfova 7, 701 03 Ostrava, Czech Republic (phone: +420.69.6160…). The Statistical Software Newsletter, pp. 335-336. "…where z is one (randomly taken) pole of the simplex S, g the centroid of the remaining d poles of the simplex…"

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing. Topics: Basic Concepts; Ordered Indices; B+-Tree Index Files; B-Tree Index Files; Static Hashing; Dynamic Hashing; Comparison of Ordered Indexing and Hashing; Index Definition in SQL…

Multi-Layer Incremental Induction

Xindong Wu and William H. W. Lo, School of Computer Science and Software Engineering, Monash University, 900 Dandenong Road, Melbourne, VIC 3145, Australia (xindong@computer.org). "…size, runs an existing induction algorithm on the first subset to obtain a first set of rules, and then processes each of the remaining data subsets at a…"

Parallel Randomized Algorithms Using Sampling

Introduction. A fundamental strategy used in designing efficient algorithms is divide-and-conquer, where the input data is partitioned into several subproblems… "…would be included in is small: 1/(n+1), to be exact. Thus with probability 1 − 1/(n+1), the same partition would be produced regardless of whether p is in the inp…"

Lecture 8 13 March, 2012

6.851: Advanced Data Structures, Spring 2012, Prof. Erik Demaine. From last lectures: in the previous lecture, we discussed the External Memory and Cache-Oblivious memory models.

Fractals for Secondary Key Retrieval

Christos Faloutsos, University of Maryland, College Park. Carnegie Mellon University Research Showcase @ CMU, Computer Science Department, School of Computer Science, 1989.

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), 2017, Volume 2, Issue 6, ISSN 2456-3307.

An On-line Variable Length Binary Encoding

Tinku Acharya and Joseph JáJá, Institute for Systems Research and Institute for Advanced Computer Studies, University of Maryland, College Park, MD…

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

Chapter 888. Introduction: This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,…

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12 Implementation of Relational Operations CS 186, Fall 2002, Lecture 19 R&G - Chapter 12 First comes thought; then organization of that thought, into ideas and plans; then transformation of those plans into

More information

CS 640: Query Processing

Olaf Hartig, David R. Cheriton School of Computer Science, University of Waterloo. CS 640: Principles of Database Management and Use, Winter 2013. Some of these slides are based on a slide set provided by Ulf Leser.

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

Comparing Implementations of Optimal Binary Search Trees

Corianna Jacoby and Alex King, Tufts University, May 2017. Introduction: In this paper we sought to put together a practical comparison of the optimality…

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Hyunghoon Cho and David Wu, December 10, 2010. Introduction: Given its performance in recent years' PASCAL Visual…

Lecture notes on Transportation and Assignment Problem (BBE (H) QTM paper of Delhi University)

Transportation and Assignment Problems. The transportation model is a special class of linear programs. It received this name because many of its applications involve determining how to optimally transport…
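For a feel of what the transportation model computes, a tiny balanced instance can be solved by brute force over integer shipping plans. The costs, supplies, and demands below are made up for illustration; real solvers attack the linear-programming formulation (one supply constraint per source, one demand constraint per destination) with the simplex method rather than enumeration.

```python
from itertools import product

def transport_brute_force(cost, supply, demand):
    """Exhaustively search integer shipping plans for a balanced
    two-source transportation instance (illustration only).

    cost[i][j]: per-unit cost from source i to destination j.
    Returns (minimum total cost, plan as two rows of shipments).
    """
    best, best_plan = float("inf"), None
    # Enumerate source-0 shipments; source-1 covers the remaining demand.
    for x in product(*(range(min(supply[0], d) + 1) for d in demand)):
        if sum(x) != supply[0]:
            continue  # must ship exactly source 0's supply
        row1 = [d - xi for d, xi in zip(demand, x)]  # balance forces row 1
        total = (sum(c * q for c, q in zip(cost[0], x)) +
                 sum(c * q for c, q in zip(cost[1], row1)))
        if total < best:
            best, best_plan = total, [list(x), row1]
    return best, best_plan

# Made-up balanced instance: supplies 20 + 30 match demands 10 + 25 + 15.
plan_cost, plan = transport_brute_force([[8, 6, 10], [9, 12, 13]],
                                        [20, 30], [10, 25, 15])
```

Because the instance is balanced, fixing source 0's shipments determines source 1's, which keeps the search space small enough to enumerate.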

Parallel Database Systems

CS5225, Parallel DB. Uniprocessor technology has reached its limit: it is difficult to build machines powerful enough to meet the CPU and I/O demands of DBMSs serving large… Topics: PDBS vs distributed DBS; types of parallelism; goals and metrics (speedup).

The Contribution of DNS Lookup Costs to Web Object Retrieval

Craig E. Wills and Hao Shang. WPI-CS-TR-00-12, July 2000. Computer Science Technical Report Series, Worcester Polytechnic Institute. (The original page shows a figure of a local browser resolving names through intermediate and authoritative DNS servers.)

16 Greedy Algorithms

Optimization algorithms typically go through a sequence of steps, with a set of choices at each step. For many optimization problems, using dynamic programming to determine the best choices…
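The greedy pattern sketched above (one locally optimal choice per step, never revisited) can be illustrated with activity selection, the classic example from this chapter family: always keep the compatible activity that finishes earliest. This is an illustrative sketch, not code from the excerpted text.

```python
def select_activities(activities):
    """Greedy activity selection.

    activities: list of (start, finish) pairs.
    Returns a maximum-size subset of pairwise non-overlapping
    activities, chosen by always taking the earliest finisher.
    """
    chosen = []
    last_finish = float("-inf")
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:      # compatible with everything chosen
            chosen.append((start, finish))
            last_finish = finish
    return chosen
```

For `[(1, 3), (2, 5), (4, 7), (6, 9), (8, 10)]` the greedy pass keeps `(1, 3)`, `(4, 7)`, and `(8, 10)`; the sort by finish time is what makes each local choice safe to commit to.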

Clustering Using Graph Connectivity

Patrick Williams, June 3, 2010. Introduction: It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the…

CS 245 Midterm Exam Solution Winter 2015

CS 245 Midterm Exam Solution Winter 2015 CS 245 Midterm Exam Solution Winter 2015 This exam is open book and notes. You can use a calculator and your laptop to access course notes and videos (but not to communicate with other people). You have

More information

Embedding Protocols for Scalable Replication Management

Henning Koch, Dept. of Computer Science, University of Darmstadt, Alexanderstr. 10, D-64283 Darmstadt, Germany (koch@isa.informatik.th-darmstadt.de). Keywords:…

TR-CS-96-05: The rsync algorithm

Andrew Tridgell and Paul Mackerras, June 1996. Joint Computer Science Technical Report Series, Department of Computer Science, Faculty of Engineering and Information Technology…

On Covering a Graph Optimally with Induced Subgraphs

Shripad Thite, April 1, 2006. Abstract: We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number…

Information Retrieval System Using Concept Projection Based on PDDP Algorithm

Minoru Sasaki and Kenji Kita, Department of Information Science & Intelligent Systems, Faculty of Engineering, Tokushima University…

A Performance Study of Hashing Functions for Hardware Applications

M. V. Ramakrishna, E. Fu and E. Bahcekapili, Department of Computer Science, Michigan State University, East Lansing, MI 48824 ({rama, fue,…

CS122 Lecture 10, Winter Term 2014-2015

Last time: plan costing. Introduced ways of approximating plan costs: the number of rows each plan node produces; the amount of disk I/O the plan must perform…

Introduction to Randomized Algorithms

Introduction to Randomized Algorithms Introduction to Randomized Algorithms Gopinath Mishra Advanced Computing and Microelectronics Unit Indian Statistical Institute Kolkata 700108, India. Organization 1 Introduction 2 Some basic ideas from

More information

We assume uniform hashing (UH):

the probe sequence of each key is equally likely to be any of the m! permutations of 0, 1, …, m − 1. UH generalizes the notion of SUH: it produces not just a single number, but a…
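Uniform hashing is an idealization that no practical scheme achieves exactly; a common approximation (an illustrative sketch with toy hash functions, not taken from the excerpt) is double hashing, where two hash values determine a whole probe sequence over the table's m slots:

```python
def probe_sequence(key, m):
    """Double-hashing probe sequence for an integer key.

    Approximates uniform hashing: the first hash picks the starting
    slot, the second picks a nonzero step size, and the i-th probe is
    (h1 + i*h2) mod m. With m prime the step is coprime to m, so the
    sequence visits every slot exactly once.
    """
    h1 = key % m                 # starting slot
    h2 = 1 + (key % (m - 1))     # step size in 1..m-1, never zero
    return [(h1 + i * h2) % m for i in range(m)]
```

Note the gap this illustrates: double hashing generates only about m^2 distinct probe sequences, far fewer than the m! that the uniform hashing assumption posits, yet it behaves close to the ideal in practice.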

A Discrete Lagrangian-Based Global-Search Method for Solving Satisfiability Problems

Journal of Global Optimization, 10, 1-40 (1997). © 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Maintaining α-balanced Trees by Partial Rebuilding

Arne Andersson, Department of Computer Science, Lund University, Box 8, S-22 00 Lund, Sweden. Abstract: The balance criterion defining the class of α-balanced trees… "…where α is a constant, 0 < α < 1. In other words, the ratio between the shortest and longest paths from a node to a leaf is at least α. A BB-tree allows efficie…"

Algorithms for Query Processing and Optimization

Chapter 19. 0. Introduction to Query Processing. Query optimization: the process of choosing a suitable execution strategy for processing a query. Two…

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Hashing for searching

Hashing for searching Hashing for searching Consider searching a database of records on a given key. There are three standard techniques: Searching sequentially start at the first record and look at each record in turn until

More information
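The three standard techniques that this excerpt begins to list (sequential search, binary search over sorted keys, and hashing) can be put side by side; this comparison is an illustrative sketch, not code from the excerpted notes:

```python
import bisect

def sequential_search(keys, target):
    # O(n): look at each key in turn until a match is found.
    for i, k in enumerate(keys):
        if k == target:
            return i
    return -1

def binary_search(sorted_keys, target):
    # O(log n): requires the keys to be kept in sorted order.
    i = bisect.bisect_left(sorted_keys, target)
    return i if i < len(sorted_keys) and sorted_keys[i] == target else -1

def hash_search(index, target):
    # O(1) expected: Python's dict is itself a hash table.
    return index.get(target, -1)
```

For keys `[3, 9, 14, 27, 41]` all three find key 14 at position 2 (with the hash index mapping each key to its position); the trade-off is per-lookup cost versus the work of keeping keys sorted or maintaining the hash index.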

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Computational Complexities of the External Sorting Algorithms with No Additional Disk Space

Computational Complexities of the External Sorting Algorithms with No Additional Disk Space Computational Complexities of the External Sorting Algorithms with o Additional Disk Space Md. Rafiqul Islam, S. M. Raquib Uddin and Chinmoy Roy Computer Science and Engineering Discipline, Khulna University,

More information

Module 9: Selectivity Estimation

Module outline: 9.1 Query Cost and Selectivity Estimation; 9.2 Database Profiles; 9.3 Sampling; 9.4 Statistics Maintained by Commercial DBMSs…

Chapter 17: Parallel Databases

Chapter 17: Parallel Databases Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems

More information

Lab 2: Support Vector Machines

Lab 2: Support Vector Machines Articial neural networks, advanced course, 2D1433 Lab 2: Support Vector Machines March 13, 2007 1 Background Support vector machines, when used for classication, nd a hyperplane w, x + b = 0 that separates

More information

Comparison of Priority Queue Algorithms for a Hierarchical Scheduling Framework

Mikael Åsberg (mag04002@student.mdh.se), August 28, 2008. The Time Event Queue (TEQ) is a data structure that is part of the implementation…

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Mining N-most Interesting Itemsets

Ada Wai-chee Fu, Renfrew Wang-wai Kwong and Jian Tang, Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong ({adafu, wwkwong}@cse.cuhk.edu.hk)

Restricted Delivery Problems on a Network

Esther M. Arkin, Refael Hassin and Limor Klein, December 17, 1996. Abstract: We consider a delivery problem on a network: one is given a network in which nodes…

Chapter 3. Algorithms for Query Processing and Optimization

Chapter 3. Algorithms for Query Processing and Optimization Chapter 3 Algorithms for Query Processing and Optimization Chapter Outline 1. Introduction to Query Processing 2. Translating SQL Queries into Relational Algebra 3. Algorithms for External Sorting 4. Algorithms

More information