Spatial Processing using Oracle Table Functions



Ravi Kanth V Kothuri, Siva Ravada and Weisheng Xu
Spatial Technologies, NEDC, Oracle Corporation, Nashua NH 03062
Ravi.Kothuri, Siva.Ravada

Abstract

Spatial joins and spatial index creation are two of the most expensive operations in Oracle Spatial. Since spatial indexing is implemented in the extensible indexing framework, where queries only return rows from a single table, spatial joins could not be implemented effectively and efficiently in Oracle8i and prior releases. On the other hand, spatial index creation involves much computation and I/O that could be easily parallelized. In this paper, we describe how Oracle Spatial applies parallel and pipelined table function technology to perform fast spatial joins and parallel index creation. This technology was introduced in Oracle9i and allows users to iteratively return subsets of result rows to be used in the FROM clause of a SQL query. We present our experiences with these implementations and examine the performance on real datasets.

1 Introduction

Spatial searching is a fundamental primitive in nontraditional databases such as GIS, CAD/CAM and multimedia applications. With the rapid proliferation of these databases in the past decade, extensive research has been conducted on the design of efficient data structures to enable fast spatial searching. Several data structures have been developed in this context. These include Quadtrees [23, 24, 29], R-trees [8, 25], hb-trees [15], TV-trees [14], SS-trees [31], and SR-trees [11]. Subsequent research has improved these basic structures further by proposing new techniques for query processing [3, 4, 6, 9, 12, 16, 18, 19, 28], faster and better index creation [7, 13, 27, 30], and better split strategies in dynamic updates [1, 2]. These techniques are especially effective for low-dimensional spatial data such as those in GIS and CAD/CAM applications.
Commercial database vendors such as IBM and Oracle have also started implementing these indexing techniques to cater to the large and diverse GIS and CAD/CAM application markets. Oracle Spatial supports two spatial indexes: Linear Quadtrees and R-trees [22]. The Linear Quadtree (or Quadtree for short) computes tile approximations for data geometries at index creation time and creates B-tree indexes on the encoded tile approximations. R-trees, on the other hand, construct a hierarchical structure using the minimum bounding rectangles (MBRs) of data geometries. A framework for optimizing most query operations in Quadtree and R-tree indexes has been developed in prior work [21, 22]. This work, however, does not address two areas that are still time-consuming: (1) index creation, and (2) R-tree spatial joins, which used a nested-loop join for lack of support for joins in Oracle extensible indexing.

In this paper, we describe how to improve the performance of these two operations using the parallel and pipelined table functions of Oracle9i [17]. For spatial joins, table functions can be used to pipeline the result rows after a join of both indexes. For index creation in Quadtrees, table functions divide the data into smaller subsets, and the subsets are tessellated in parallel. Likewise, in R-trees, subtrees are constructed on subsets of data in parallel and merged at the end. We present performance improvements for index creation and spatial joins using this approach. Spatial index-based joins on real GIS datasets are faster by a factor of 6 in comparison to a nested-loop join. Index creation improves by a factor of 2.6 on 4 processors. In summary, this work complements prior work on spatial query optimization and provides useful insight into the implementation of domain-specific indexes in commercial databases.

The rest of the paper is organized as follows. Section 2 gives a brief overview of table functions in Oracle9i. Section 3 describes Oracle Spatial functionality. Section 4 describes the implementation of R-tree spatial joins using table functions. We compare the performance of spatial joins using nested-loop and index-based scan methods and discuss some related issues. Section 5 describes parallel index creation using parallel table functions. We present some results from creating spatial indexes on real datasets. The final section summarizes the results.

2 Parallel and Pipelined Table Functions in Oracle

Many applications, such as data warehousing, require a transient table or collection that can be operated on like a regular database table. To support such processing, most commercial database vendors, including IBM and Oracle, have implemented table functions. Table functions return a collection-type instance that can be cast to a table of appropriate columns and queried using regular SQL queries. In Oracle9i [17], table functions allow for iteratively fetching result rows and for parallel processing of the computation and row fetching. Several applications such as Oracle Spatial, OLAP, and Oracle Data Mining have implemented their functionality using this support for table functions.

Table functions are functions that can produce a set of rows as output. In other words, table functions return a collection-type instance (nested table and VARRAY datatypes). Users can use a table function in place of a regular table in the FROM clause of a SQL statement, as in the following example:

    select * from TABLE(spatial_join(tab1, col1, tab2, col2, 'intersect'));

The spatial_join function is a table function that could return the rowids of the tables tab1 and tab2 whenever the geometries in columns col1 and col2 satisfy a specified relationship such as intersection. The function could be implemented in C/Java (or PL/SQL; see footnote 1) using a start-fetch-close methodology: perform the function (or part of it) in the start routine, iteratively return the result rows in the fetch routine, and release memory resources in the close routine. Note that such iterative fetching of result rows (referred to as pipelining here) is essential to supporting table functions that return a large set of rows that cannot fit in memory.
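The start-fetch-close methodology described above can be illustrated with a minimal Python sketch. This is not Oracle's interface: the class name `PipelinedJoin`, the helper `drain`, and the `nrows` parameter are hypothetical names chosen for illustration, and the sketch only assumes that results can be produced lazily from some iterable of rowid pairs.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Tuple

RowidPair = Tuple[str, str]


class PipelinedJoin:
    """Hypothetical start-fetch-close wrapper around a lazily produced result."""

    def __init__(self, source: Iterable[RowidPair]) -> None:
        self._source = source
        self._cursor: Optional[Iterator[RowidPair]] = None

    def start(self) -> None:
        # Set up state once; no result rows are materialized here.
        self._cursor = iter(self._source)

    def fetch(self, nrows: int) -> List[RowidPair]:
        # Return at most nrows rows; an empty list signals end of data.
        if self._cursor is None:
            raise RuntimeError("fetch() called before start()")
        return list(islice(self._cursor, nrows))

    def close(self) -> None:
        # Release whatever state was held between fetch calls.
        self._cursor = None


def drain(fn: PipelinedJoin, nrows: int) -> Iterator[RowidPair]:
    """Consume a pipelined function batch by batch, as a caller might."""
    fn.start()
    try:
        while True:
            batch = fn.fetch(nrows)
            if not batch:
                break
            yield from batch
    finally:
        fn.close()
```

Because each `fetch` call holds only one batch, the caller never needs the full result set in memory, which is the property the text identifies as essential for large results.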
In addition to pipelining of result rows in table functions, parallel execution of a function is supported by (1) allowing functions to directly accept a set of rows (a cursor) corresponding to a sub-query operand, and (2) providing a mechanism that allows a set of input rows to be partitioned across multiple instances of a parallel function. Given this model for pipelining results and parallelizing an operation, the following sections describe its application in supporting R-tree spatial joins and parallel index creation.

Footnote 1: A different approach is used in PL/SQL.

3 Oracle Spatial

Oracle Spatial models 2- to 4-dimensional spatial data using an sdo_geometry data type. For the 2-dimensional case, this data type models all the spatial data types defined by the Open GIS Consortium (OGC) and caters to most data occurring in GIS and CAD/CAM applications. Supported spatial data includes simple primitive elements such as points, lines, curves, and polygons (with and without holes), and complex elements that are made up of a combination of primitive elements. The sdo_geometry data type is implemented as an Oracle object datatype. This approach extends all the benefits of Oracle's object-relational database technology, including replication, to spatial data.

Quadtree and R-tree indexes on spatial data are implemented using the extensible indexing framework of Oracle [5, 20]. This framework allows for the creation of new domain-specific indexes and associated query operators, and provides for the integration of user-specified query, update and index creation routines inside the Oracle server. Oracle Spatial supports a spatial_index indextype for indexing spatial data. Quadtree and R-tree indexes are supported as part of this spatial_index indextype. Since these indexes are implemented as part of the extensible indexing framework, spatial indexes can be easily created on sdo_geometry columns of database tables using an extended SQL syntax.
As part of such index creation, the corresponding spatial index creation routines are executed and the constructed spatial index is stored in the database as a spatial index table. The index table stores index information such as R-tree nodes in the case of R-trees and Quadtree tiles in the case of Quadtrees. The metadata for the entire index is stored as a row in a separate metadata table. This metadata includes the name of the index table storing the index, the dimensionality, the root pointer and fanout parameters for an R-tree, and the tiling level parameter for a Quadtree index. In addition to SQL-level index creation, inserts and updates to database tables that have a spatial index automatically trigger an update of the corresponding spatial indexes. Extensible indexing also ensures statement- or session-level concurrency and table-level recovery.

To query the constructed spatial indexes, new predicates, referred to as operators, are defined. These operators can be included in the WHERE clause of a SQL statement to select data that satisfy a specified query criterion with respect to a specified query window. Such operators are executed using index-associated procedures for query processing and allow for incremental processing of queries (see [20, 5, 22] for more details). Queries have been optimized in prior work [21]. In the next sections, we examine how to improve the performance of R-tree joins and index creation using pipelined and parallel table functions.

4 R-tree Spatial Joins

Spatial joins select pairs of rows from two tables based on their spatial interaction. For example, a query could identify the number of pairs of geometries from the cities and rivers tables that intersect each other as follows:

    select count(*)
    from city_table a, river_table b
    where sdo_relate(a.city_geom, b.river_geom, 'intersect') = 'TRUE';

There are two ways to compute such joins. The first approach is to iterate over the first table (cities), performing a spatial query on the second table (rivers) using each geometry in the first table. This is the nested-loop approach. The second approach is to traverse the associated spatial indexes on both tables together [10, 26] and identify interacting geometries. This is referred to as the index-based spatial join approach. As with B-tree joins, the index-based spatial join approach is faster than the nested-loop join approach. However, until Oracle9i, there was no efficient mechanism to return pairs of rowids (rowids of the first and second table) in Oracle.

Currently, spatial joins can be rewritten and evaluated using the table functions of Oracle9i as follows. The table names and geometry column names, along with the interaction type, are passed in to a spatial_join function that returns the pairs of rowids of the interacting geometries from the indexes of the two tables.

    select count(*)
    from city_table a, river_table b
    where (a.rowid, b.rowid) in
      (select rid1, rid2
       from TABLE(spatial_join('city_table', 'city_geom',
                               'river_table', 'river_geom', 'intersect')));

4.1 Parallelizing Spatial Join

The drawback of the above approach is that it has only a single input stream and does not use the parallelism available through Oracle table function technology. For instance, if the two indexes are rooted at R1 and S1 as shown in Figure 1, the above approach will invoke one join operation of the trees rooted at R1 and S1.
To make better use of table-function-level parallelism, we modify our approach to perform a spatial join of subtrees of the R-tree indexes. To this end, we descend each index by a certain number of levels, identify the roots of the subtrees at that level, and join the subtrees. For instance, if we descend by one level in Figure 1, this results in 4 joins of the following subtree pairs: (R11, S11), (R11, S12), (R12, S11), and (R12, S12). In general, we descend both trees as far as needed to obtain an appropriate number of subtree joins. The spatial_join function is modified to include a cursor returning a set of R-tree subtree roots as follows:

    select count(*)
    from city_table a, river_table b
    where (a.rowid, b.rowid) in
      (select rid1, rid2
       from TABLE(spatial_join(
              CURSOR(select *
                     from table(subtree_root('city_table_index', level)),
                          table(subtree_root('river_table_index', level))),
              'city_table', 'city_geom',
              'river_table', 'river_geom', 'intersect')));

4.2 Evaluation using Pipelined Table Functions

The spatial join is evaluated using the start-fetch-close interface of Oracle pipelined table functions. In the start method, the metadata of the two R-tree indexes to be joined is loaded, and the subtree roots of the R-tree indexes (passed in as parameters to the spatial_join function) are pushed onto a stack. In each fetch call, the spatial join processing is resumed using the contents of the stack, and as many result rowid pairs are determined as specified in the fetch call by joining the two R-tree indexes. Once there are no more result rowids to be returned, the fetch call returns an empty collection, and the memory resources are cleaned up in the subsequent close call.

Next we describe the join processing in each fetch call in more detail. Since the data are arbitrarily complex geometries, the join has to be evaluated in two stages. First, the index-based MBRs are compared for intersection with each other. An array of candidate pairs of geometries is computed using the two indexes.
The size of this array is determined by the available memory resources. Once the candidate array is processed, the array is refilled by resuming the index-based join of the two R-trees. Each candidate pair of geometries in the array is processed by first fetching the exact geometries from the two tables and then comparing them using a secondary (geometry-geometry) filter. Shekhar et al. [26] note that the order of fetching the geometries is important for performance and that finding the best order is NP-complete. Instead of fetching the geometries in a random order, sorting the candidate pairs based on the first rowid is much better and is expected to be within 20% of the best approximate solutions. We adopt this approach in Oracle Spatial.
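The two-stage evaluation of Sections 4.1 and 4.2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Oracle Spatial implementation: the `Node` layout, the helper names (`parent`, `subtree_roots`, `candidate_pairs`, `spatial_join`), the `levels` and `batch` parameters, and the caller-supplied `exact_test` callback are all hypothetical. It assumes balanced in-memory trees, and it processes the subtree pairs sequentially where the paper hands them to parallel table function instances.

```python
from dataclasses import dataclass, field
from itertools import islice, product
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

MBR = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Node:
    """R-tree node; a data entry is a Node with `rowid` set and no children."""
    mbr: MBR
    children: List["Node"] = field(default_factory=list)
    rowid: Optional[str] = None


def parent(children: List[Node]) -> Node:
    """Build an inner node whose MBR encloses all of its children."""
    xs0, ys0, xs1, ys1 = zip(*(c.mbr for c in children))
    return Node((min(xs0), min(ys0), max(xs1), max(ys1)), list(children))


def mbrs_intersect(a: MBR, b: MBR) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def subtree_roots(root: Node, levels: int) -> List[Node]:
    """Descend `levels` levels and return the inner nodes found there."""
    roots = [root]
    for _ in range(levels):
        below = [c for n in roots for c in n.children if c.rowid is None]
        if not below:  # already just above the data entries
            break
        roots = below
    return roots


def candidate_pairs(pairs: Iterable[Tuple[Node, Node]]) -> Iterator[Tuple[str, str]]:
    """Primary filter: stack-driven traversal of both indexes on MBRs only."""
    stack = list(pairs)
    while stack:
        r, s = stack.pop()
        if not mbrs_intersect(r.mbr, s.mbr):
            continue
        if r.rowid is not None and s.rowid is not None:
            yield r.rowid, s.rowid
        elif r.rowid is not None:
            stack.extend((r, c) for c in s.children)
        elif s.rowid is not None:
            stack.extend((c, s) for c in r.children)
        else:
            stack.extend(product(r.children, s.children))


def spatial_join(r_root: Node, s_root: Node,
                 exact_test: Callable[[str, str], bool],
                 levels: int = 1, batch: int = 1000) -> Iterator[Tuple[str, str]]:
    """Join subtree pairs, then apply the secondary filter batch by batch."""
    pairs = product(subtree_roots(r_root, levels), subtree_roots(s_root, levels))
    candidates = candidate_pairs(pairs)
    while True:
        # Fill a bounded candidate array and sort it by the first rowid.
        chunk = sorted(islice(candidates, batch))
        if not chunk:
            return
        for rid1, rid2 in chunk:
            if exact_test(rid1, rid2):  # geometry-geometry comparison
                yield rid1, rid2
```

Here `batch` plays the role of the memory-bounded candidate array, and sorting each filled array by the first rowid follows the fetch-order heuristic described above. The generator can be resumed between batches, which matches the way each fetch call continues from the saved stack.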

[Figure 1. Joining Two Spatial Indexes. The index of the first table (cities) is rooted at R1 with subtrees R11 and R12; the index of the second table (rivers) is rooted at S1 with subtrees S11 and S12. Join pairs of subtrees for parallelism: (R11, S11), (R11, S12), (R12, S11), (R12, S12).]

4.3 Experiments

We examine the performance of the index-based spatial join on two real datasets: Counties and Star-clusters. We describe these datasets and compare the performance of the nested-loop and index-based spatial joins for each of them in turn. These experiments were conducted using an alpha version of Oracle10i on a Sun 400MHz 4-CPU machine with 1GB memory.

The first dataset contains the geometries for the 3230 counties in the United States. This data is joined with itself by specifying either intersection (distance of 0) or a distance. Table 1 reports the results.

    Distance   Result Size   Nested Loop   Spatial Index Join
    ...        ...           ...s          144.7s
    ...        ...           ...s          221.9s
    ...        ...           ...s          271.8s
    ...        ...           ...s          331.4s

Table 1. Comparison of join times using Nested-loop Join and Spatial-index Join for Counties data. Spatial-index Join is 33-55% faster.

Next we examined the join performance for different dataset sizes and with parallel processing, using the second dataset. The second dataset contains 250K records of star locations/clusters in a cross-section of the sky (customer data, publicly available). We varied the dataset size from 25 to 250K by choosing subsets of the original 250K data. We performed a self-join of each subset and examined the performance of the spatial join using (1) nested-loop-based evaluation, (2) index-based join on 1 processor (I1), and (3) index-based join on 2 processors (I2). Table 2 shows the join query response time for each dataset size. For small dataset sizes of 25 polygons, the nested-loop method performs the same as the index-based join on one processor. This is because of the relatively small size of the dataset and the result sets.
However, as we increase the dataset size, the result set size increases, and we observe that the nested-loop method is nearly 6 times slower in most cases compared to the index-based join (I1). The gains from parallel processing are nearly 50% for most dataset sizes.

    Data size   Result size   Nested loop   Index Join(1)   Index Join(2)
    ...         ...           ...s          6.2s            3.47s
    ...         ...           ...s          3.5s            2.23s
    ...         ...           ...s          10.3s           7.2s
    ...         ...           ...s          83s             70s
    ...         ...           ...s          864s            676s

Table 2. Comparison of join times using Nested-loop-based Join and Index-based Join on 1 and 2 processors for different dataset sizes. Index-based join using table functions is nearly 6 times faster.

5 Parallel Index Creation

In this section, we describe how Quadtree and R-tree index creation can be parallelized using table functions. Quadtree index creation consists of the following steps:

1. For each data geometry, tessellate the geometry into tiles and store these tiles in an index table.
2. Construct B-tree indexes on the codes for the tiles.

In order to parallelize the index creation operation, we need to parallelize the tessellation of geometries and create the B-tree indexes in parallel. The latter part is performed by specifying the parallel clause of a B-tree index creation statement in Oracle. In order to parallelize the tessellation, which is a substantial portion of the index creation time for large complex polygon geometries, we use a table function that takes as input a cursor for fetching the geometries and tessellates these geometries (and inserts the tiles in a specified table). This process is illustrated in Figure 2. Since parallel table functions partition the input cursor based on the specified operation-level parallelism, the tessellation process is performed in parallel on subsets of the input geometries from the table.

[Figure 2. Parallelizing Quadtree Index Creation. Rows of the geometry table are partitioned by the table function, each partition is tessellated in parallel, and the resulting tiles are stored in the index table.]

Analogous to Quadtree construction, R-tree creation is also parallelized by using parallel table functions (1) to load the geometry data and compute minimum bounding rectangles, and (2) to cluster subtrees in parallel.

5.1 Experiments

In this section, we describe some experimental results comparing the index creation performance on multiple processors. We used the US Block-groups data consisting of about 230K arbitrarily-shaped complex polygon geometries. Table 3 shows these results from 1 to 4 processors for both Quadtree and R-tree indexes. Since the geometries are large and complex, the Quadtree creation time is high compared to R-trees. We observe that index creation speeds up by a factor of 2.6 on 4 processors for Quadtree. In comparison, R-tree creation does not involve expensive tessellation, is faster even in the sequential case, and speeds up by a factor of 1.8.

    Number of Processors   Quadtree Creation time   R-tree Creation time
    ...                    ...s                     454s
    ...                    ...s                     296s
    ...                    ...s                     258s

Table 3. Parallel Quadtree and R-tree creation times using table functions: Speedup of up to 2.6 on 4 processors.

6 Conclusions

In this paper, we examined how to improve two expensive operations in Oracle Spatial: R-tree spatial joins and index creation. We described how parallel and pipelined table function technology can be used to perform spatial joins efficiently using the two associated R-tree indexes. We also examined the effect of using table functions to parallelize index creation. Both operations improved in performance by several factors compared to prior versions that did not support table functions. This demonstrates the effectiveness of parallel and pipelined table functions as a building block for efficiently supporting complex domain-specific operations such as spatial joins and index creation.

References

[1] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
[2] S. Berchtold, D. A. Keim, and H. P. Kriegel. The X-tree: An index structure for high dimensional data. Proc. of the Int. Conf. on Very Large Data Bases.
[3] S. Berchtold, D. A. Keim, H.-P. Kriegel, and T. Seidl. A new technique for nearest neighbor search in high-dimensional space. IEEE Trans. on Knowledge and Data Engineering, 12(1):45-57.

[4] T. Brinkhoff, H. Horn, H. P. Kriegel, and R. Schneider. A storage and access architecture for efficient query processing in spatial database systems. In Symposium on Large Spatial Databases (SSD 93), LNCS 692.
[5] S. Defazio, A. Daoud, L. A. Smith, and J. Srinivasan. Integrating IR and RDBMS using cooperative indexing. In Proc. of ACM SIGIR Conf. on Information Retrieval, pages 84-92.
[6] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. E. Abbadi. Approximate nearest neighbor searching in multimedia databases. In Proc. Int. Conf. on Data Engineering.
[7] Y. J. Garcia, S. T. Leutenegger, and M. A. Lopez. A greedy algorithm for bulk loading R-trees. In Proc. of ACM GIS.
[8] A. Guttman. R-trees: A dynamic index structure for spatial searching. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 47-57.
[9] G. Hjaltason and H. Samet. Ranking in spatial databases. In Symposium on Spatial Databases (SSD).
[10] Y.-W. Huang, N. Jing, and E. A. Rundensteiner. Spatial joins using R-trees: Breadth-first traversal with global optimizations. In Proc. of the Int. Conf. on Very Large Data Bases.
[11] N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest-neighbor queries. Proc. ACM SIGMOD Int. Conf. on Management of Data, May.
[12] M. Kornacker, C. Mohan, and J. Hellerstein. Concurrency and recovery in GiST. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 62-72, Tucson, Arizona, June.
[13] S. T. Leutenegger, M. A. Lopez, and J. M. Edgington. STR: A simple and efficient algorithm for R-tree packing. In Proc. Int. Conf. on Data Engineering.
[14] K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree: An index structure for high-dimensional data. VLDB Journal, 3.
[15] D. B. Lomet and B. Salzberg. The hB-tree: A multiattribute indexing method with good guaranteed performance. Proc. ACM Symp. on Transactions of Database Systems, 15(4), December.
[16] B. C. Ooi, C. Yu, K. L. Tan, and H. V. Jagadish. Indexing the distance: An efficient method to kNN processing. In Proc. of the Int. Conf. on Very Large Data Bases.
[17] Oracle Press. Parallel and Pipelined Table Functions. In Oracle9i SQL Reference Documentation.
[18] D. Papadias, T. Sellis, Y. Theodoridis, and M. Egenhofer. Topological relations in the world of minimum bounding rectangles: A study with R-trees. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
[19] K. V. Ravi Kanth, D. Agrawal, Amr El Abbadi, and Ambuj K. Singh. Dimensionality reduction for similarity searching in dynamic databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
[20] K. V. Ravi Kanth, Siva Ravada, J. Sharma, and J. Banerjee. Indexing medium-dimensionality data in Oracle. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
[21] Ravi Kanth V Kothuri and Siva Ravada. Efficient processing of large spatial queries using interior approximations. In Symposium on Spatial and Temporal Databases (SSTD).
[22] Ravi Kanth V Kothuri, Siva Ravada, and Daniel Abugov. Quadtree and R-trees in Oracle Spatial: A comparison using GIS data. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
[23] H. Samet. Recent developments in linear quadtree-based geographic information systems. Image and Vision Computing, 5(3), Aug.
[24] H. Samet. The design and analysis of spatial data structures. Addison-Wesley Publishing Co.
[25] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. Proc. of the Int. Conf. on Very Large Data Bases, 13.
[26] S. Shekhar, C. Lu, S. Chawla, and S. Ravada. Efficient join index based join processing: A clustering approach. IEEE Trans. on Knowledge and Data Engineering.
[27] Y. Theodoridis and T. K. Sellis. Optimization issues in R-tree construction. In Geographic Information Systems (IGIS).
[28] Y. Theodoridis and T. K. Sellis. A model for the prediction of R-tree performance. In Proc. ACM Symp. on Principles of Database Systems.
[29] F. Wang. Relational-linear quadtree approach for two-dimensional spatial representation and manipulation. IEEE Trans. on Knowledge and Data Engineering, 3(1), Mar.
[30] D. White and R. Jain. Algorithms and strategies for similarity retrieval. Proc. of the SPIE Conference.
[31] D. White and R. Jain. Similarity indexing with the SS-tree. Proc. Int. Conf. on Data Engineering.
