
Striped Grid Files: An Alternative for High-dimensional Indexing

Thanet Praneenararat 1, Vorapong Suppakitpaisarn 2, Sunchai Pitakchonlasap 1, and Jaruloj Chongstitvatana 1
Department of Mathematics 1, Department of Computer Engineering 2
Chulalongkorn University, Bangkok 133, THAILAND

Abstract - In this paper, we propose an index structure for high-dimensional data which applies the concept of striping to grid files. We call this structure a striped grid file. In a striped grid file, there are a number of grid files, each of which is indexed by a subset of attributes striped from the original set of attributes. Each entry in these grid files is the index of another grid file, which contains pointers to the actual disk pages storing data records. Experiments are performed to measure the performance of striped grid files in terms of the number of disk accesses and the storage utilization. Striped grid files give much better storage utilization than single grid files, at the cost of slightly more disk accesses. However, owing to the nature of grid files, the number of disk accesses for point queries in striped grid files is still constant. Furthermore, the lower the number of dimensions of the root grid, the better the storage utilization of the striped grid file.

I. INTRODUCTION

High-dimensional index structures are becoming increasingly necessary because of the use of multimedia databases, text databases, genomic databases, etc. Multidimensional index structures, e.g. kdb-trees [1], R-trees [2], and grid files [3], do not work well when the number of dimensions or attributes is high. This problem, called the curse of dimensionality [4], is caused by the exponential growth of the data space with respect to the number of dimensions.

Many index structures for high-dimensional databases have been proposed to alleviate this problem. The approaches can be classified into two types. One approach, e.g. TV-trees [5], X-trees [6] and NSP-trees [7], is to use heuristics to organize the index structures. This approach works well for specific applications and specific data types. The other approach, called tree striping [8], is to reduce the data dimension. The data space is divided into disjoint subspaces of lower dimensionality such that the cross-product of the subspaces is the original data space. The subspaces are organized using an arbitrary multidimensional index structure.

A grid file [3] is a multidimensional index structure which requires few disk accesses at the expense of storage utilization. A grid file is composed of data pages containing data records and directory pages containing a multidimensional array of pointers to data pages. It is not practical to apply grid files to high-dimensional data because the directory size grows exponentially with respect to the number of dimensions.

In this paper, we propose to apply striping to grid files in order to improve the storage utilization while maintaining the low disk access of grid files. A striped grid file is composed of several reduced-dimension grid files, called leaf grids, and another grid file, called the root grid, which is used to combine the results of queries from the leaf grids. The experiments show that striping can reduce the storage required for grid files while maintaining a reasonably low number of disk accesses. Furthermore, the lower the dimension of the root grid, the better the storage utilization.

This paper is organized as follows. The structure of striped grid files is defined in Section II. The algorithms for striped grid files are elaborated in Section III. Section IV presents the experimental set-up, the experimental results and a discussion. Conclusions are given in Section V.

II. STRUCTURE OF STRIPED GRID FILES

A grid file partitions a set of data into smaller sets according to the range of each attribute. The partition for one dimension is independent of that for other dimensions.
The partitioned range for each attribute is stored in a one-dimensional array called a linear scale. The index for n-dimensional data is stored in an n-dimensional array, called a grid directory, in which each entry is a pointer to a disk page, called a data page, containing data within the specified range. As a result, the size of the grid directory is on the order of c^n, where n is the dimension of the data. However, access to a multidimensional array is very efficient.

To apply the concept of striping to grid files, a reduced-dimension grid directory, called a leaf grid, is created to partition data into stripes according to a set of attributes. For example, given data records with key <a_1, a_2, a_3, a_4>, data can be striped according to <a_1, a_2> and <a_3, a_4>. According to attributes a_1 and a_2, data are partitioned using a traditional grid file, and the index is stored in a leaf grid g_1. Similarly, for attributes a_3 and a_4, the index is stored in a leaf grid g_2. However, a pointer in a leaf grid does not point directly to a data page, but to another grid directory, called the root grid. In the root grid, each dimension of the grid directory ranges over the entries of one leaf grid. Following the previous example, the two dimensions of the root grid correspond to the sets of elements in the leaf grids g_1 and g_2.

A striped grid file SG can be denoted by <GR, G_1, ..., G_k>, where GR is the root grid and G_1, G_2, ..., G_k are the leaf grids. For n-dimensional data, a striped grid file partitions data according to k sets of attributes, where k is neither 1 nor n (assuming n is divisible by k). The striped grid file illustrated in Figure 1 has 2 leaf grids of 2 dimensions each. Each leaf grid partitions data according to d of the n attributes, where d = n/k. Data records are stored in data pages, each of which can contain a fixed number of data records. Next, each structure is described in more detail.

Figure 1. A structure of striped grid files

A. Leaf Grids

Each leaf grid G_i (1 ≤ i ≤ k) is a single d-dimensional grid directory. Thus, each is composed of d linear scales LS_{i,j} (1 ≤ i ≤ k, 1 ≤ j ≤ d), one d-dimensional grid directory D_i, called a leaf grid directory, and one buddy tree T_i. Linear scales for leaf grids are exactly those of the original grid files. On the other hand, the grid directories in leaf grids differ from those in original grid files. For a leaf grid, each entry in the grid directory contains an integer which is the number of its corresponding leaf node in its buddy tree. That is, G_i(a_{i,1}, a_{i,2}, ..., a_{i,d}) contains a pointer bt to a leaf node L in the buddy tree T_i, where L represents a range R_L of values of the attributes <a_{i,1}, ..., a_{i,d}> specified in the linear scales.

A buddy tree is a binary tree whose nodes are associated with areas in its grid directory. The root node of a buddy tree is associated with the whole area of its corresponding grid directory. Furthermore, the union of the areas of two child nodes is the area associated with their parent. A number is assigned to each leaf node in the buddy tree T and is used as an index into the root grid. This number, denoted by Inorder(T, x), is the sequence number of the leaf node in an inorder traversal. In other words, Inorder(T, x) = y if x is the y-th leaf node visited in the inorder traversal of T.

B. Root Grid

The root grid GR is a k-dimensional grid whose k dimensions are indexed by the k values obtained from the k leaf grids. In the original grid files, two entries in the grid directory can point to the same data page only when the two entries are adjacent in the grid directory. Thus, the split algorithm and the merge algorithm must not violate this condition. This can reduce the number of disk accesses in range queries. However, it can lower the storage utilization.

In striped grid files, a page pointer list is included, as an intermediate structure, at the root grid to allow any entries in the root grid to be associated with the same data page. A page pointer list, denoted by PR, is a list of pointers to data pages. Figure 2 shows a data page p which is associated with three non-adjacent entries in the root grid. If a data page is associated with more than one entry in the grid directory, it is called a packed page; otherwise, it is called an unpacked page. When data pages are packed together, the storage utilization can be increased and the split and merge algorithms are simple. However, the number of disk accesses in range queries can increase because data records in a data page are not necessarily adjacent in key values.

Figure 2. The association between page pointer list and data pages

III. ALGORITHMS

The algorithms for point query, range query, insertion and deletion follow those of tree striping and traditional grid files, with some modifications as described below. The basic algorithms for grid files can be found in [3].

A. Point Queries

A point query is used to find a record with a specified search key.

Algorithm PointQuery
Given a striped grid file SG, find a record with a search key <a_1, ..., a_n>.
(1) [Retrieve a data page] Invoke PageQuery to retrieve a data page pp containing data with the key <a_1, ..., a_n>.
(2) [Find the data record in the data page] Search for a record containing the key <a_1, ..., a_n> in the data page pp. If found, return the data record. If not, return null.
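As a hedged illustration of the leaf-grid machinery from Section II.A that PointQuery relies on, the following Python sketch numbers the leaves of a buddy tree by inorder position and locates a value in a linear scale. The names `BuddyNode`, `inorder_number` and `scale_interval` are hypothetical, and the convention that a boundary value belongs to the upper interval is an assumption, not something the paper specifies.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BuddyNode:
    """Node of a binary buddy tree; a node without children is a leaf."""
    left: Optional["BuddyNode"] = None
    right: Optional["BuddyNode"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def inorder_number(root: BuddyNode, target: BuddyNode) -> int:
    """Return y such that target is the y-th leaf (1-based) visited inorder."""
    count = 0
    stack, node = [], root
    while stack or node is not None:
        while node is not None:      # descend to the leftmost node
            stack.append(node)
            node = node.left
        node = stack.pop()
        if node.is_leaf():
            count += 1
            if node is target:       # identity, not structural equality
                return count
        node = node.right
    raise ValueError("target is not a leaf of this tree")


def scale_interval(linear_scale: Sequence[float], value: float) -> int:
    """Index of the interval of a sorted linear scale that contains value.

    Assumption: a value equal to a boundary falls in the upper interval.
    """
    return bisect_right(linear_scale, value)


# A tree with three leaves: a, then b and c under a shared parent.
a, b, c = BuddyNode(), BuddyNode(), BuddyNode()
tree = BuddyNode(left=a, right=BuddyNode(left=b, right=c))
```

Here `inorder_number(tree, c)` yields the leaf number that would serve as one coordinate O_i of the root grid, and `scale_interval` stands in for the linear-scale lookup along one attribute.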

Algorithm PageQuery
Given a search key <a_1, ..., a_n>, find the data page that contains a record with the specified search key.
(1) [Divide key] Divide the key <a_1, ..., a_n> into k d-dimensional keys <a_{i,1}, ..., a_{i,d}> for 1 ≤ i ≤ k.
(2) [Query the leaf grids] For 1 ≤ i ≤ k, query leaf grid G_i with the d-dimensional key <a_{i,1}, ..., a_{i,d}>. The result is <NL_1, NL_2, ..., NL_k>, where NL_i is a pointer to a leaf node in the buddy tree T_i.
(3) [Inorder traversal in the buddy trees] For 1 ≤ i ≤ k, O_i = Inorder(T_i, NL_i).
(4) [Query the root grid] Query the root grid with the key <O_1, ..., O_k> obtained from (3). The result is a pointer pr in the page pointer list.
(5) [Look up the page pointer] Look up the pointer pr in the page pointer list. Return the pointer to a data page pp as a result.

B. Range Queries

A range query finds the records in a striped grid file whose search keys are in the specified range. The result is a set of data within the specified range.

Algorithm RangeQuery
Given a striped grid file SG, find records whose search key is in the range R = <[a_1, b_1], ..., [a_n, b_n]>.
(1) [Retrieve data pages] Invoke PageRangeQuery to retrieve a set pp of data pages, each of which may contain records in the range R.
(2) [Find the data records in the data pages] Search all data pages in pp for records whose key <a_1, ..., a_n> is in the range R. Return all records that are found.

Algorithm PageRangeQuery
Given an n-dimensional range R = <[a_1, b_1], ..., [a_n, b_n]>, find the data pages containing records whose search key is in the range R.
(1) [Divide range] Divide the range <[a_1, b_1], ..., [a_n, b_n]> into k d-dimensional ranges <[a_{i,1}, b_{i,1}], ..., [a_{i,d}, b_{i,d}]> for 1 ≤ i ≤ k.
(2) [Query the leaf grids] For 1 ≤ i ≤ k, query the leaf grid G_i with the d-dimensional range <[a_{i,1}, b_{i,1}], ..., [a_{i,d}, b_{i,d}]>. The result is a set of tuples <NL_1, NL_2, ..., NL_k>, where NL_i is a pointer to a leaf node in the buddy tree T_i.
(3) [Inorder traversal in the buddy trees] For each tuple <NL_1, NL_2, ..., NL_k> in the set, compute <O_1, ..., O_k>, where O_i = Inorder(T_i, NL_i) for 1 ≤ i ≤ k.
(4) [Query the root grid] Query the root grid with each key <O_1, ..., O_k> obtained from (3). The result is a set pr of pointers to elements in the page pointer list.
(5) [Look up the page pointers] Look up the pointers in pr in the page pointer list. Return the resulting set pp of pointers to data pages.

C. Insertion

The insertion algorithm inserts a data record into a striped grid file. If the data page overflows, the split algorithm is invoked.

Algorithm Insert
Insert a data record with a search key <a_1, ..., a_n> into a striped grid file SG.
(1) [Retrieve data page] Invoke PageQuery with the key <a_1, ..., a_n> to retrieve a data page pp.
(2) [Add record to data page] If the inserted record makes the data page pp overflow, invoke Split to split this data page into two new data pages pp and pr. Add the new record to the data page pp.

D. Splitting

When the number of records in a data page exceeds the maximum size of the data page, the page needs to be split into two new data pages so that new records can be inserted. Splitting occurs at two levels: at the page pointer list and at the leaf grid. If an overflow occurs in a packed page, the split occurs at the page pointer list. That is, a new data page is allocated and the pointer to the overflowing page in the page pointer list is redirected to the new data page, as shown in Figure 3. Finally, the data within the range of the entry that caused the split must be moved to the new data page. On the other hand, if an overflow occurs in an unpacked page, the split occurs at the leaf grid, and the root grid is split as a result. Splitting at this level is similar to splitting in the original grid file. The following is the splitting algorithm.

Algorithm Split
Given an entry q in the root grid such that q points to the element pr_i in the page pointer list and pr_i points to the data page pp, and the range R of data in q.
(1) [Determine the level of splitting] If pp is a packed page, go to (2) to split at the page pointer list; otherwise, go to (3) to split at the leaf grid.
(2) [Split at the page pointer list] Allocate a new data page pq, move data in the range R from pp to pq, and change the pointer pr_i in the page pointer list to point to pq. Then, return pq.
(3) [Split at the leaf grid]
(3.1) [Choose the attribute for splitting and the splitting point] Randomly choose the attribute S_n to be split, and choose the median, sp, of the data in pp with respect to the dimension S_n as the splitting point.
(3.2) [Split the leaf grid directory] Find the leaf grid G_si in which S_n is the index of one of its dimensions. Split G_si along the dimension S_n at the value sp.
(3.3) [Update the buddy tree] Find the leaf node v in the buddy tree corresponding to the leaf grid G_si such that v points to the entry q in the root grid. Create two children, v_1 and v_2, of v. Associate these two nodes with their two corresponding entries in G_si created in (3.2).
(4) [Split the root grid] Find the dimension corresponding to the node v in (3.3) and split q in the root grid in that dimension to make two rows, for v_1 and v_2, instead of v. Then, let the two split entries point to a new data page pr and the old data page pp. Finally, partition the data between pr and pp according to the splitting point sp.

Figure 3. The method to split a data page
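The indirection through the page pointer list, and step (2) of Split for a packed page, can be sketched as follows. This is a minimal Python sketch under stated assumptions rather than the authors' implementation: `RootGrid`, `DataPage`, `page_for`, `is_packed` and `split_packed` are hypothetical names, the root grid is modelled as a dictionary instead of a k-dimensional array, and each record is stored as a pair (root-grid key, payload) so that "data in the range R" reduces to a key comparison.

```python
from dataclasses import dataclass, field


@dataclass
class DataPage:
    capacity: int
    records: list = field(default_factory=list)


class RootGrid:
    """Toy root grid: each entry <O_1, ..., O_k> owns one slot of the page
    pointer list, and several slots may point to the same data page."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}        # (O_1, ..., O_k) -> slot index in page_pointers
        self.page_pointers = []  # page pointer list PR: slot index -> DataPage

    def page_for(self, key):
        """Steps (4)-(5) of PageQuery: root-grid entry -> slot -> data page."""
        return self.page_pointers[self.entries[key]]

    def is_packed(self, page) -> bool:
        """A page is packed when more than one root-grid entry reaches it."""
        return sum(1 for k in self.entries if self.page_for(k) is page) > 1

    def split_packed(self, key):
        """Step (2) of Split: give entry `key` a fresh page of its own."""
        old = self.page_for(key)
        new = DataPage(self.capacity)
        # Move only the records that belong to this root-grid entry.
        new.records = [r for r in old.records if r[0] == key]
        old.records = [r for r in old.records if r[0] != key]
        # Redirect this entry's slot; other slots still reach the old page.
        self.page_pointers[self.entries[key]] = new
        return new


# Two non-identical root-grid entries sharing one overflowing page.
grid = RootGrid(capacity=2)
shared = DataPage(capacity=2)
grid.page_pointers = [shared, shared]
grid.entries = {(1, 1): 0, (1, 2): 1}
shared.records = [((1, 1), "a"), ((1, 2), "b"), ((1, 1), "c")]
```

Calling `grid.split_packed((1, 1))` leaves the leaf grids and the root grid untouched, which is why a split at the page pointer list is cheaper than a split at the leaf grid.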

E. Deletion

The deletion algorithm deletes a specified record from a striped grid file. If the data page that contained the deleted record underflows, the merging algorithm is invoked to reorganize the structure.

Algorithm Delete
Delete a record with a search key <a_1, ..., a_n> from a striped grid file SG.
(1) [Retrieve data page] Invoke PageQuery with the key <a_1, ..., a_n> to retrieve a data page pp.
(2) [Delete record from data page] Delete the record with the key <a_1, ..., a_n> from the data page pp. If the deletion makes the data page pp underflow, invoke Merge to reorganize the structure.

F. Merging

When the number of records in some data pages is below the specified threshold, or the storage utilization becomes low because of the split algorithm, the underflowing data pages need to be merged to maintain the storage utilization. Merging occurs only at the level of the page pointer list, although splitting also occurs at the level of the grid. Merging at the level of the grid can reduce both the size of the grid directory and the number of data pages. However, because of the restrictions of grid directories, such merging can hardly ever occur. On the other hand, merging at the level of the page pointer list is simple and can occur whenever there is an underflowing page. To merge two data pages p_i and p_j, pointed to by PR_i and PR_j in the page pointer list, the data records in p_j are added to p_i, PR_j is set to PR_i, and p_j is freed. As a result, merging at this level is effective in maintaining the required storage utilization.

Algorithm Merge
Given a data page list P = <P_1, P_2, ..., P_m>.
(1) [Sort the data page list] Sort the data pages according to their sizes.
(2) [Find a pair of data pages] Find a data page with the minimum number of data records among all the data pages and a data page with the maximum number of data records such that the pair can be merged into one data page. If no pair of data pages satisfies this condition, terminate.
(3) [Merge the pair of data pages] Merge the pair of data pages selected in (2), store its data records in one of the data pages, and free the other data page.
(4) [Update the page pointer list] Update the page pointer to point to the merged data page.
(5) [Repeat until the condition cannot be satisfied] Go to (2).

IV. EXPERIMENTS AND RESULTS

To evaluate the performance of striped grid files, we conducted experiments on synthetic data with different dimensionalities. The simulation of the index structures was implemented in Java 5 with 2 GB memory, assuming a disk block of 2 KB, to measure the number of disk accesses and the storage utilization.

Synthetic data sets were generated with a uniform distribution. We experimented on 4-, 6-, and 8-dimensional data. For 4-dimensional data, a 4-dimensional grid file is compared to a 2x2 striped grid file (2 stripes of 2-dimensional grid files). For 6-dimensional data, a 6-dimensional grid file, a 3x2 striped grid file (3 stripes of 2-dimensional grid files), and a 2x3 striped grid file (2 stripes of 3-dimensional grid files) are compared. For 8-dimensional data, a 4x2 striped grid file (4 stripes of 2-dimensional grid files) and a 2x4 striped grid file (2 stripes of 4-dimensional grid files) are compared. An 8-dimensional grid file is not used in the experiment because of the massive storage it requires.

To create an index structure, data records are inserted and deleted alternately until the required number of data records is reached. The performance is measured when the number of data records in the index reaches 1K, 2K,, and 1K.

A. Storage Utilization

Figure 4 and Figure 5 show the number of disk pages used for 4-dimensional and 6-dimensional data in single grid files and striped grid files. From these figures, it is clear that the storage required for striped grid files is lower than the storage required for a single grid file. Furthermore, from Figure 5, the storage required for a 2x3 striped grid is lower than that for a 3x2 striped grid. The difference is caused mainly by the size of the directories, especially the root grids. For each split at the leaf-grid level, the root grid is always split and grows, while only one among all the leaf grids is split. As a result, a root grid grows faster than a leaf grid. Furthermore, when the number of dimensions of the root grid is larger, the root grid grows even faster.

As for the overall storage utilization shown in Figure 6, striped grid files yield 5-9%, whereas single grid files yield lower than 5% utilization. This shows that striped grid files use storage more efficiently than single grid files. Moreover, in striped grid files for 6-dimensional data, the storage utilization is lower when the number of data records increases.

Figure 4. Number of disk pages used for 4-dimensional grid files (x-axis: Number of Data; y-axis: Number of Disk Pages; series: directories, and directories and data pages, for single grid 4d and striped 2x2d)

B. Number of Disk Accesses

In this section, the numbers of disk accesses for point queries, range queries, insertions, and deletions are examined.

Point queries

As in traditional grid files, the number of disk accesses for each point query in a striped grid file is constant. Where k is the number of leaf grids, the number of disk accesses for each query is k+2 (k disk accesses to access the k leaf grids, one disk access for the root grid, and one for the data page containing the data record).
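The k+2 count, together with the directory growth discussed in Section II, can be worked through for the configurations used in the experiments. The Python sketch below is only a back-of-the-envelope aid: the function names are hypothetical, and the cell counts assume an idealized directory with exactly c intervals on every dimension, which real grid files do not guarantee. The root grid is deliberately left out of the cell counts, even though the results above indicate that it dominates the directory storage.

```python
def point_query_accesses(k: int) -> int:
    """k leaf-grid accesses + 1 root-grid access + 1 data-page access."""
    return k + 2


def single_grid_cells(c: int, n: int) -> int:
    """Directory cells of one n-dimensional grid with c intervals per axis."""
    return c ** n


def leaf_grid_cells(c: int, n: int, k: int) -> int:
    """Total cells of k leaf grids of d = n/k dimensions (root grid excluded)."""
    if n % k != 0:
        raise ValueError("n must be divisible by k")
    d = n // k
    return k * c ** d


# Configurations from the experiments, written as (k stripes, d dimensions).
for k, d in [(2, 2), (3, 2), (2, 3), (4, 2), (2, 4)]:
    n = k * d
    print(f"{k}x{d}: {point_query_accesses(k)} accesses per point query, "
          f"{leaf_grid_cells(10, n, k)} leaf cells vs "
          f"{single_grid_cells(10, n)} single-grid cells (c = 10)")
```

With c = 10 and n = 6, for instance, a single grid directory would have 10^6 cells, while two 3-dimensional leaf grids have 2,000 cells between them and three 2-dimensional leaf grids have 300.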

Figure 5. Number of disk pages used for 6-dimensional grid files (x-axis: Number of Data; y-axis: Number of Disk Pages; series: directories, and directories and data pages, for single grid 6d, striped 2x3d and striped 3x2d)

Figure 6. Storage utilization for 4- and 6-dimensional grid files (x-axis: Number of Data; y-axis: Storage Utilization (%); series: single grid 4d, striped 2x2d, single grid 6d, striped 2x3d, striped 3x2d)

Range queries

From Figure 7, the number of disk accesses for striped grid files is a little higher than that for single grid files. The reason is that, unlike in traditional grid files, the PageRangeQuery algorithm for striped grid files returns a set of data pages which may not be adjacent. As a result, several parts of the root grid might have to be accessed, and more disk pages are accessed. However, since the root grids are often much smaller than the traditional grid files, the difference is small.

Figure 7. Average disk accesses from range queries (x-axis: Query Area (x 1^12); y-axis: Average Disk Access; series: single grid 6d, striped grid 2x3d, striped grid 3x2d)

Insertion and Deletion

In our experiment, the average number of disk accesses used for splitting in striped 2x2d grid files is only about .2214 per insertion, computed from the insertion of 5 data records. Moreover, the average number of disk accesses for merging data pages is similarly small per deletion. This is because most insertions and deletions do not need to invoke the split and merge algorithms, respectively. Hence, the disk accesses for both operations depend mainly on the point query which is used to locate the page in which to insert or delete data.

V. CONCLUSIONS

We propose to apply the idea of striping to traditional grid files and call the resulting structure a striped grid file. This structure is composed of several reduced-dimension grid files, called leaf grids, and another grid file, called the root grid, which is used to combine the results of queries from the leaf grids. The experiments show that the storage utilization of striped grid files is better than that of traditional grid files, while the number of disk accesses in striped grid files is not much higher than that in traditional grid files. Also, striped grid files scale better than traditional grid files as the number of dimensions increases. As a benefit of the inherent characteristics of grid files, the number of disk accesses for point queries in striped grid files is always a constant that depends on the structure of the striped grid file. Furthermore, we found that if the number of dimensions of the root grid in a striped grid file is low, the striped grid file yields better storage utilization.

ACKNOWLEDGMENT

We would like to thank the Scientific Parallel Computer Engineering Lab, Department of Computer Engineering, Chulalongkorn University, for granting access to a computer cluster for our experiments.

REFERENCES

[1] J.T. Robinson, The K-D-B-tree: a search structure for large multidimensional dynamic indexes, Proc. ACM SIGMOD Int. Conf. on Management of Data, Ann Arbor, MI, 1981.
[2] A. Guttman, R-trees: a dynamic index structure for spatial searching, Proc. ACM SIGMOD Int. Conf. on Management of Data, Boston, MA, 1984.
[3] J. Nievergelt, H. Hinterberger, and K.C. Sevcik, The grid file: an adaptable, symmetric multikey file structure, ACM Transactions on Database Systems (TODS), 9(1), 1984.
[4] C. Böhm, S. Berchtold, and D.A. Keim, Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases, ACM Computing Surveys, 33(3), 2001.
[5] K. Lin, H.V. Jagadish, and C. Faloutsos, The TV-tree: an index structure for high-dimensional data, Very Large Databases Journal (VLDB), 3, 1995.
[6] S. Berchtold, D. Keim, and H.-P. Kriegel, The X-tree: an index structure for high-dimensional data, 22nd Conf. on Very Large Data Bases, Bombay, India.
[7] G. Qian, Q. Zhu, Q. Xue, and S. Pramanik, A space-partitioning-based indexing method for multidimensional non-ordered discrete data spaces, ACM Transactions on Information Systems (TOIS), 24(1), 2006.
[8] S. Berchtold, C. Böhm, D.A. Keim, H.-P. Kriegel and X. Xu, Optimal multidimensional query processing using tree striping, Proc. 2nd Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK), Greenwich, U.K., 2000.
