Situational Awareness over Large Spatio-Temporal Databases

Sharad Mehrotra +, Iosif Lazaridis +, Kriengkrai Porkaew *
+ University of California, Irvine, CA, USA
* King Mongkut's University of Technology, Thonburi, Thailand

Abstract. Providing database support for interactive 3D Visualization and Situational Awareness (SA) is one of the keys to making such systems scale to the complexity of real-world scenarios. It is desirable that SA systems be highly realistic and intuitive to use. From a database perspective these requirements translate into (i) support for large amounts of heterogeneous spatio-temporal data (to provide a convincing virtual environment, approximating the complexity of reality), (ii) support for highly dynamic situations in which the data may change frequently, (iii) support for mechanisms to deal with inherent imprecision in the underlying information, and (iv) a powerful query interface that allows users to pose varied and interesting queries that are answered with minimum time delay. In this paper, we outline emerging research results in the field, pointing to their direct applicability to the Situational Awareness task. We describe techniques for the representation of dynamic and mobile objects and their integration into a database system by reviewing appropriate indexing and query processing techniques. We also deal with spatial aggregate queries and present new techniques to compute their answers in a time-critical environment. We conclude by outlining research directions in the context of the QUASAR project which emerge from the anticipated proliferation of cheap, spatially-embedded electronic sensors and devices. Such devices generate information at high rates and create an opportunity for a richer interaction with the physical world for a variety of new applications (e.g., transportation management).

1. Introduction

Modern 3D Visualization systems, especially for the task of Situational Awareness (SA), pose novel challenges to their data management component. As the data incorporated in such systems increases both in volume and complexity, data management emerges as one of the most important enabling factors for their continued success. Typical Database Management Systems (DBMS) have been designed for business-oriented applications (e.g., banking), and are therefore not immediately suited to the quite different task of Situational Awareness.

The main challenges for database systems for SA arise from the nature of the data typically stored in such systems. These potentially include all the types of objects in the physical world, such as terrain, buildings, transportation, power and communication networks, moving vehicles and weather patterns; in short, everything having a physical presence. A broad classification of these data is into static data, which remains unchanged with the passage of time (e.g., terrain), and dynamic data, which conversely has a changing state that can be given as a function of time.

We might alternatively define static data as data that changes infrequently or very slowly and dynamic data as data that changes frequently or even continuously. We may further divide dynamic data into stationary and mobile. In the first case the location of the objects in question does not change with the passing of time, but other properties do. An example would be a temperature sensor deployed in the field; its position remains constant but the temperature reading that it transmits to the system changes from moment to moment. Mobile objects have the property that their spatial location changes with time. In a transportation management application, these would include users' vehicles following surface roads and freeways as well as flying units, etc.

Most data used in Situational Awareness applications differ from traditional data stored in databases in that they frequently possess an inherent notion of relative quality or precision. In a banking application that keeps track of account balances this issue does not arise; there is a single number that represents the precise balance of an account. In the SA domain things are not that clear. There exists a natural trade-off between the quality of the presentation (virtual environment) and the computational cost that has to be paid to deliver this level of quality. For instance, a geometric model of a mountain can be kept at various levels of resolution that increasingly approximate the real mountain. The visualization system can present a coarse approximation of the mountain at a minimal computational cost but can provide a better approximation for additional cost. A different example is the average speed of vehicles along a transportation route; this may be reported at various levels of accuracy.

The notion of quality arises both in static data (like mountains) and in dynamic data (e.g., aircraft). In static data it corresponds to the multiple resolution model outlined in the previous paragraph.
For dynamic data, traditional database systems follow the explicit update model. Under this model, whenever an object in the world being modeled in the database changes the value of one of its attributes, it issues a direct update of its representation in the database system. E.g., every time a customer makes a withdrawal from his bank account, the new balance is updated in the database system. For highly dynamic systems (including SA systems), this policy is not feasible. As an example, an aircraft moves continuously through the air at every instant of time. To precisely capture its motion we would have to issue updates at a very high rate. This is undesirable, because the cost of supporting the high update rate hinders performance and most of these updates are never used (e.g., there is no reason to update a temperature reading every second if the user asks for the temperature once a day). Much work still needs to be done to support dynamic data without an explicit update model. The work done so far has focused on storing an attribute value periodically and estimating it with a prediction model for query times in between updates.

We have outlined just a few of the challenges that database management for Situational Awareness has to face. The entire data management component of SA systems has to be designed with these considerations in mind. In this paper we focus on a variety of issues that arise in this domain. In particular we summarize the work done in the SATURN project on the representation, indexing and querying of mobile object databases and also on the processing of spatial aggregate queries for visualization. We also present future directions of research in the context of the QUASAR project at UC Irvine.

This paper, prepared for a committee of the Computer Science and Telecommunications Board, should not be cited.

2. Mobile Object Representation & Indexing

Mobile objects are objects in the virtual environment whose position (i) changes continuously with time and (ii) is not reflected in the database by an explicit update. Condition (ii) is important, as we have mentioned, because of the prohibitive cost of sustaining frequent updates for many objects. A mobile object traces a trajectory through space-time. If we think of space-time as having 3 spatial dimensions and 1 temporal dimension, then each 4D point (x, y, z, t) represents the position of an object at some particular time. Since the object is mobile, this position changes continuously with time. Thus, if we look at the object in space-time, we observe an arbitrary continuous line, each point of which represents the object's location at a particular instant of time.

Figure 1. Mobile Object Update Management Techniques

It is natural to index mobile objects using a multi-dimensional index structure, e.g., an R-Tree [5]. Such a data structure clusters objects that are close together in the indexed space (in our case 4D space-time). An R-Tree works by grouping objects with bounding rectangles in a hierarchical manner. Each node of the R-Tree contains all the bounding rectangles of its children. Queries usually need to look only at a small region of the indexed space. Using an R-Tree they can avoid looking at regions that are disjoint from the query. Thus, an R-Tree provides an efficient way to access multi-dimensional data.

A significant problem in dynamic environments is that of update management, i.e., how often and under which conditions the object's representation in the database should be changed to reflect its changing real-world parameters. In [13] the Adaptive Dead Reckoning technique (Figure 1.a) is proposed. Between updates, the object's location is estimated based on its previous behavior. An update is issued when the discrepancy between the object's actual location and its estimated location exceeds an uncertainty threshold. This threshold should be set so as to balance the update cost (computational and communication cost) against the uncertainty cost (imprecision in query answering). Another technique, Disconnection Detecting Dead Reckoning (Figure 1.b), decreases the uncertainty threshold as time passes. This addresses the problem of objects not sending in their updates because they have been disconnected from the network. As the threshold decreases with time, the probability that an update should be sent increases, so a prolonged silence becomes evidence of disconnection.
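The threshold-based update policy described above can be illustrated with a minimal sketch. It assumes one-dimensional motion and a linear prediction from the last reported position and velocity; the names `Report`, `predicted`, `should_update`, `decaying_threshold` and `simulate`, as well as the particular decay rule, are hypothetical choices made for illustration and are not taken from [13].

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Report:
    """Last update the database received from a mobile object."""
    t: float  # time of the update
    x: float  # reported position (1D for brevity)
    v: float  # reported velocity


def predicted(report: Report, t: float) -> float:
    """Position the database assumes at time t (linear dead reckoning)."""
    return report.x + report.v * (t - report.t)


def should_update(report: Report, t: float, actual_x: float,
                  threshold: float) -> bool:
    """Issue an update only when the deviation exceeds the threshold."""
    return abs(actual_x - predicted(report, t)) > threshold


def decaying_threshold(initial: float, elapsed: float, rate: float) -> float:
    """Assumed decay rule: the threshold shrinks as time since the last
    update grows, so a silent object becomes suspicious sooner."""
    return initial / (1.0 + rate * elapsed)


def simulate(samples: List[Tuple[float, float, float]],
             threshold: float) -> List[float]:
    """samples holds (t, x, v) ground truth; returns the update times."""
    t0, x0, v0 = samples[0]
    report = Report(t0, x0, v0)
    updates = [t0]
    for t, x, v in samples[1:]:
        if should_update(report, t, x, threshold):
            report = Report(t, x, v)
            updates.append(t)
    return updates
```

With a fixed threshold, an object that keeps its reported velocity never triggers an update; a larger threshold trades fewer updates for larger possible error in query answers.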

Figure 2. Parametric vs. Native Space Indexing

To effectively use an R-Tree to index mobile objects we need to define the bounding rectangle that will be used to represent the object. This is defined as $[x_{min}, x_{max}] \times [y_{min}, y_{max}] \times [z_{min}, z_{max}] \times [t_{start}, t_{end}]$, where $[t_{start}, t_{end}]$ is the interval of time in which the object's motion is valid and the other intervals bound the object's motion along the 3 spatial dimensions throughout that temporal interval. An interesting problem that arises is that of segmenting the motion of an object into more than one bounding rectangle. This increases the number of objects stored in the database but decreases the total volume occupied by these objects. The analysis presented in [11] indicates that this strategy is advantageous if the dimensionality of the indexed space increases or the average object size increases.

A different approach for indexing mobile objects has been proposed in the literature [7], [12], [11]. This parametric space indexing (PSI), as opposed to the previously described native space indexing (NSI), uses a different set of parameters to represent an object. These could be time, location and velocity. One of the problems of PSI is its sensitivity to the parameters chosen. Query processing becomes troublesome if the indexed motion contains acceleration. This is because in NSI both the query and the objects are rectangles in the native space, whereas in PSI a transformation between native and parametric space has to be performed. Figure 2 illustrates this problem by showing how spatio-temporal range queries are mapped onto a parametric space.

A significant contribution of the work in [11] is its classification of the queries that might be meaningful in a spatio-temporal application context.
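As a rough illustration of the native-space representation and of why segmenting a motion shrinks the indexed volume, the following sketch builds 4D bounding boxes for a straight-line motion. The class `Box4D` and the helpers `segment_box` and `split_motion` are hypothetical names introduced here; linear motion and equal-length sub-segments are simplifying assumptions, not the segmentation policy analyzed in [11].

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]
Point3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box4D:
    """Axis-aligned box [x] x [y] x [z] x [t] in native space-time."""
    x: Interval
    y: Interval
    z: Interval
    t: Interval

    def sides(self) -> List[Interval]:
        return [self.x, self.y, self.z, self.t]

    def volume(self) -> float:
        v = 1.0
        for lo, hi in self.sides():
            v *= hi - lo
        return v

    def intersects(self, other: "Box4D") -> bool:
        # Closed intervals overlap iff each starts before the other ends.
        return all(a[0] <= b[1] and b[0] <= a[1]
                   for a, b in zip(self.sides(), other.sides()))


def segment_box(p0: Point3, p1: Point3, t0: float, t1: float) -> Box4D:
    """Bounding box of a straight-line motion from p0 at t0 to p1 at t1."""
    spans = [(min(a, b), max(a, b)) for a, b in zip(p0, p1)]
    return Box4D(spans[0], spans[1], spans[2], (t0, t1))


def split_motion(p0: Point3, p1: Point3, t0: float, t1: float,
                 parts: int) -> List[Box4D]:
    """Cut the same motion into `parts` equal sub-segments, one box each."""
    def at(k: int):
        f = k / parts  # fraction of the motion completed after k steps
        pos = tuple(a + (b - a) * f for a, b in zip(p0, p1))
        return pos, t0 + (t1 - t0) * f

    boxes = []
    for k in range(parts):
        (q0, s0), (q1, s1) = at(k), at(k + 1)
        boxes.append(segment_box(q0, q1, s0, s1))
    return boxes
```

For a diagonal motion from (0, 0, 0) to (4, 4, 4) over the time interval [0, 4], the single box has volume 256 while two half-segments total 32, and a range query near the origin at late times intersects the single box (a false candidate) but neither of the two smaller boxes.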
The work in [11] also proposes specific techniques for the efficient processing of such queries, for both NSI and PSI, together with an empirical study establishing the efficiency of the proposed methods. Three main classes of queries were studied:

(i) Spatio-Temporal Range Queries. The user specifies a range (interval) along each spatial dimension and along the single temporal dimension. The query returns all objects that exist in the prescribed 3D spatial rectangle during the specified time. E.g., all vehicles in the Nevada desert from 10 a.m. to 2 p.m.

(ii) Spatial k-Nearest Neighbor (kNN) Queries. The user specifies a query point P in 3D space and a time interval t. The k objects in the database that are closest to P during t are returned. E.g., the five closest submarines to me in the next 10 minutes.

(iii) Temporal kNN Queries. The user specifies a time instant $t_q$ and a spatial rectangle R. The k objects that are in R closest temporally to $t_q$ are returned. E.g., the first four ambulances to reach the scene of an accident after it occurs.

3. Dynamic Query for High Performance Visualization

One of the novel concepts in a Situational Awareness system is that of a Dynamic Query. In traditional database systems the query is issued explicitly, the database evaluates it and returns the results. In an SA system in which the user moves through the virtual environment, it is often desirable to associate a query with the user and to evaluate it continuously. An example of this would be a flight simulation application. One of the requirements of such an application is to gather and render all visible elements for each frame of the simulation. The information that needs to be retrieved from the database depends on the user's position in the virtual world and needs to be refreshed as the position changes. This is the essence of a dynamic query: it is associated with a mobile object and is continuously evaluated as the motion parameters of that object change.

As introduced in [11], a dynamic query is a series of snapshot queries $Q_1, Q_2, \ldots, Q_n$ posed at successive times $t_1, t_2, \ldots, t_n$. The snapshot query could be any typical spatio-temporal query, e.g., a range query (retrieve all objects that exist in a given rectangle of space for a particular interval of time) or a kNN query (retrieve the k objects in the database that are closest to a query object). Techniques to evaluate snapshot range and kNN queries are well studied in the literature. Thus, a dynamic query can be evaluated by issuing a sequence of snapshot queries and using standard techniques to retrieve their answers. However, this approach is unnecessarily wasteful.
The fundamental observation is that in a typical dynamic query arising in a visualization scenario, the observer's motion is continuous, and thus for any two consecutive queries $Q_i$, $Q_{i+1}$ in the dynamic query sequence the overlap of the results is likely to be high. Some part of the effort expended toward answering $Q_i$ can therefore be re-used to answer $Q_{i+1}$ as well. The quality of the presentation can thus be preserved, since lower delays in query answering translate directly into less time required to build each individual frame to be displayed. Consequently, efficient processing at the database level of the application enables high frame-rate visualization with multiple visual elements.

Figure 3. I/O Performance of PDQ (left) and NPDQ (right)

We further differentiate between two types of dynamic queries: Predictive Dynamic Queries (PDQ) and Non-Predictive Dynamic Queries (NPDQ).

The first type corresponds to the case in which the sequence $Q_i$ is known beforehand. In such a case all the objects that need to be retrieved are known a priori and thus could in principle be pre-fetched into main memory. However, this would require a large memory buffer. Another problem is that, since the indexed objects are mobile (hence imprecise), they might send an update that makes them irrelevant to the query; they would then have been wastefully pre-fetched. In our approach, the index structure (R-Tree) is traversed using a priority-based scheme. Nodes are sorted by the time at which they will first cross the query trajectory. Since the query is known in advance, we can also maintain the time at which they will cross out of this trajectory. The application iteratively reads objects from the priority queue based on their time of appearance, making sure that all objects that should appear now are actually fetched. A further cache mechanism in main memory sorts them by their disappearance time; thus the application clears objects from the memory buffer as soon as they have crossed out of the query trajectory.

In the case of NPDQ the sequence $Q_i$ is not known exactly. However, at any given instant we still know the current query, Q, as well as the previous query in the sequence, P. These probably overlap to a great degree unless the observer moves very fast or in a jerky, non-continuous manner. Thus, the results of P can be utilized in answering Q. We end up doing significantly fewer disk accesses to answer Q if we take care not to traverse nodes of the index structure that are classified as discardable. If R is the bounding rectangle of a node, it is easy to reason that the node is discardable iff $(Q \cap R) \subseteq P$.
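The discardability condition for NPDQ can be checked with simple rectangle arithmetic. The sketch below assumes axis-aligned rectangles of any dimension given by their low and high corners; `Rect` and `discardable` are hypothetical names, and a node that does not intersect Q at all is treated as discardable because the empty set is contained in P.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rect:
    lo: Tuple[float, ...]  # lower corner, one coordinate per dimension
    hi: Tuple[float, ...]  # upper corner

    def intersection(self, other: "Rect") -> Optional["Rect"]:
        lo = tuple(max(a, b) for a, b in zip(self.lo, other.lo))
        hi = tuple(min(a, b) for a, b in zip(self.hi, other.hi))
        if any(l > h for l, h in zip(lo, hi)):
            return None  # the rectangles are disjoint
        return Rect(lo, hi)

    def contains(self, other: "Rect") -> bool:
        return all(a <= c and d <= b
                   for a, b, c, d in zip(self.lo, self.hi, other.lo, other.hi))


def discardable(node: Rect, q: Rect, p: Rect) -> bool:
    """True iff (q ∩ node) ⊆ p, i.e. the node adds nothing beyond what p saw."""
    part = q.intersection(node)
    return part is None or p.contains(part)
```

For example, with P = [0,10] x [0,10] and Q = [2,12] x [0,10], a node covering [3,5] x [3,5] is discardable, while a node covering [9,11] x [1,2] is not, because part of it lies in the strip that Q newly exposes.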
In other words, if the part of a node that is contained in the current query Q was also contained in P, then that node has already been fetched and we should avoid reading it again into main memory.

In Figure 3 we see the I/O performance of both PDQ and NPDQ as the percentage of overlap between successive snapshots increases. The experiments indicate a great increase in efficiency for PDQ even for a very small percentage of overlap. For NPDQ there is also a significant improvement, which becomes more pronounced as the overlap percentage increases.

4. Concurrency Control Mechanisms

In dynamic environments, the spatio-temporal information stored in the database might be frequently updated. Earlier, we discussed the issue of update management to reduce the overhead of updates. Another significant issue that arises in dynamic environments is that of concurrent operations over the database. Even though concurrency control in databases is a well-studied problem, in the application domains considered here data is accessed through specialized access methods (e.g., multidimensional data structures such as the R-tree). Concurrent access through such data structures has received surprisingly little attention in the literature and has been noted as one of the difficult open research problems in [4]. Concurrent access to data via an index structure introduces two independent concurrency control (CC) problems:

(i) Preserving consistency and integrity of the data structure in the presence of concurrent insertions, deletions and updates.
(ii) Protecting search regions from phantoms.

Below, we describe these problems and their solutions in more detail in the context of Generalized Search Trees (GiST) [6]. GiST is an index structure that is extensible both in the data types it can index and in the queries it can support. It acts as a template: the application developer can implement her own access method using GiST by simply registering a few extension methods with the DBMS. In particular, the GiST template can be instantiated as an R-tree and as many other multidimensional access methods. Developing CC techniques for GiST is particularly beneficial since the CC code needs to be written only once and then allows concurrent access to the database via any index structure implemented in the DBMS using GiST, thus avoiding the need to write the code for each index structure separately.

We first discuss the consistency problem. Consider a GiST (configured as, say, an R-tree) with a root node R and its two children nodes A and B. Concurrent operations over the R-tree could result in the tree losing its integrity. To see this, consider two operations: an insertion of a new key k1 into B and a deletion of a key k2 from B. Suppose the deletion operation examines R and discovers that k2, if present, must be in B. Before it can examine B, the insertion operation causes B to split into B and B', as a result of which k2 moves to B' (and R is subsequently updated). The delete operation now examines B and incorrectly concludes that k2 does not exist. Many similar situations that compromise data structure integrity can arise when operations execute concurrently. An approach that avoids such executions and preserves the consistency of multidimensional data structures is studied in [8].

We now move on to the problem of phantom protection. Consider a transaction T1 reading a set of data items from a GiST that satisfy some search predicate Q. Transaction T2 then inserts a data item that satisfies Q and commits. If T1 now repeats its scan with the same search predicate Q, it gets a set of data items different from the first read; the newly appearing items are known as phantoms. Phantoms must be prevented to guarantee serializable execution.
Note that object-level locking does not prevent phantoms: even if all objects currently in the database that satisfy the search predicate are locked, concurrent insertions (see footnote 1) into the search range cannot be prevented. There are two general strategies to solve the phantom problem, namely predicate locking and its engineering approximation, granular locking. In predicate locking, transactions acquire locks on predicates rather than on individual objects. Although predicate locking is a complete solution to the phantom problem, it is usually too costly. In contrast, in granular locking, the predicate space is divided into a set of lockable resource granules. Transactions acquire locks on granules instead of on predicates. The locking protocol guarantees that if two transactions request conflicting mode locks on predicates p and p' such that $p \wedge p'$ is satisfiable, then the two transactions will request conflicting locks on at least one granule in common. Granular locks can be set and released as efficiently as object locks.

An example of the granular locking approach is the multi-granularity locking (MGL) protocol. Application of MGL to the key space associated with a B-tree is referred to as key range locking (KRL). Unfortunately, KRL cannot be applied for phantom protection in multidimensional data structures since it relies on a total order of the key values of objects, which does not exist for multidimensional data. Imposing an artificial total order (say a Z-order) over multidimensional data to adapt KRL is not a viable technique either. Instead, in [2] we define the concept of lockable resource granules over the multidimensional key space. In order to define the coverage of the granules, a granule predicate is associated with every index node of a GiST. Let N be an index node and P be the parent of N. Let BP(N) denote the bounding predicate of N. The granule predicate of N, denoted by GP(N), is defined as BP(N) if N is the root and as $BP(N) \wedge GP(P)$ otherwise.
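The recursive definition of GP(N) can be made concrete with a small sketch. It assumes, purely for illustration, that every bounding predicate is an axis-aligned rectangle, so that the conjunction of two predicates is the rectangle intersection; general GiST predicates need not have this form. The names `Node`, `intersect` and `granule_predicates` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# One (lo, hi) pair per dimension; stands in for a bounding predicate BP(N).
Rect = Tuple[Tuple[float, float], ...]


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Conjunction of two rectangle predicates; None if unsatisfiable."""
    out = []
    for (alo, ahi), (blo, bhi) in zip(a, b):
        lo, hi = max(alo, blo), min(ahi, bhi)
        if lo > hi:
            return None
        out.append((lo, hi))
    return tuple(out)


@dataclass
class Node:
    name: str
    bp: Rect
    children: List["Node"] = field(default_factory=list)


def granule_predicates(root: Node) -> Dict[str, Optional[Rect]]:
    """GP(root) = BP(root); GP(N) = BP(N) AND GP(parent of N) otherwise."""
    gp: Dict[str, Optional[Rect]] = {root.name: root.bp}
    stack = [root]
    while stack:
        node = stack.pop()
        parent_gp = gp[node.name]
        for child in node.children:
            gp[child.name] = (None if parent_gp is None
                              else intersect(child.bp, parent_gp))
            stack.append(child)
    return gp
```

If a child's bounding rectangle extends beyond its parent's granule predicate, the child's granule covers only the part inside it; for instance a child with BP [5,12] x [5,10] under a root with BP [0,10] x [0,10] gets GP [5,10] x [5,10].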
There is a lockable resource granule TG(N) (TG stands for tree granule) associated with each index node N. The lock coverage of TG(N) is defined by GP(N). The granules associated with the leaf nodes are called leaf granules while those associated with the non-leaf nodes are called non-leaf granules. Since the granules are based on the predicate space partitioning generated by the GiST, they continuously adapt to the key distribution (by updating current granules, creating new granules and destroying old ones), which is a key to the effectiveness of this technique. Based on the above concept of lockable granules, lock protocols for the various operations on GiSTs are developed. The protocol developed is the first such granular locking solution to phantoms for multidimensional access methods and represents a scalable and effective approach to supporting concurrent operations over such data sets.

Footnote 1. These insertions may result from the insertion of new objects, from updates to existing objects, or from the rolling back of deletions made by other concurrent transactions.

5. Fast Computation of Spatial Aggregate Queries

An important class of queries in database systems is that of aggregate queries. Such queries look at a large amount of data and return an aggregate value over it. In spatial applications like SA, aggregate queries are especially important since it is natural to want summary information about data objects that are embedded in a particular region of space. Examples of this type of spatial aggregate query are: what is the number of cars within 20 miles of my position, what is the average traffic speed 50 miles ahead on the freeway, and what is the average concentration of vehicles per square mile at junction X.

Let us define a data space as $S \subseteq \mathbb{R}^n$. The data items stored in the database have the form (loc, val), where $loc \in S$ and $val \in D \subseteq \mathbb{R}$ is the attribute value for the object at location loc. For our purposes the data space is usually the physical 3D space, hence $S = \mathbb{R}^3$, and D is the domain of the attribute that we are measuring, e.g., $D = [0, velocity_{max}]$ is the domain of possible speeds of a vehicle. A spatial aggregate query can now be formally defined as a pair (R, agg_type), where $R \subseteq S$ is the query region and agg_type is the type of aggregate we are interested in.
Usually, but not necessarily, agg_type is one of the Structured Query Language (SQL) aggregates (MIN, MAX, COUNT, SUM, AVG). The answer to the query is derived by gathering all objects o such that $o.loc \in R$ and aggregating over their values. An example of such a query would be (California, MAX), which would return the top current speed among all vehicles located in California.

Figure 4. Example of an MRA-Quadtree
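As a baseline for what follows, the definition of a spatial aggregate query (R, agg_type) can be written as an exhaustive scan. This is a minimal sketch that assumes the query region is an axis-aligned box given as one (lo, hi) pair per dimension; the function names `in_region` and `spatial_aggregate` are hypothetical, and the scan touches every data item, which is exactly the cost a time-critical application wants to avoid.

```python
from typing import List, Optional, Sequence, Tuple

Location = Tuple[float, ...]
Region = Sequence[Tuple[float, float]]  # one (lo, hi) pair per dimension


def in_region(loc: Location, region: Region) -> bool:
    """True when loc lies inside the closed box `region`."""
    return all(lo <= c <= hi for c, (lo, hi) in zip(loc, region))


def spatial_aggregate(items: List[Tuple[Location, float]],
                      region: Region, agg_type: str) -> Optional[float]:
    """Exact answer to (region, agg_type) over (loc, val) items."""
    values = [val for loc, val in items if in_region(loc, region)]
    if agg_type == "COUNT":
        return len(values)
    if agg_type == "SUM":
        return sum(values)
    if not values:
        return None  # MIN, MAX and AVG are undefined on an empty set
    if agg_type == "MIN":
        return min(values)
    if agg_type == "MAX":
        return max(values)
    if agg_type == "AVG":
        return sum(values) / len(values)
    raise ValueError(f"unsupported aggregate: {agg_type}")
```

For three vehicles with speeds 60, 80 and 120 of which only the first two lie in the query box, COUNT is 2, MAX is 80 and AVG is 70.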

Since aggregate queries look at a large number of data items, it is difficult to compute them under the time-restrictive demands imposed by an interactive SA application. To address this problem, we have developed a technique [9] that provides approximate answers at increasing quality levels. The application can thus supply a desired quality level and/or a deadline together with the query. Our technique either gives the best possible answer before the specified deadline or stops when it is guaranteed that the quality requirement has been reached. We define quality as follows: the algorithm we propose returns an estimate â of the query answer and a range of values $I = [l, h]$. This interval of uncertainty is such that the true answer a to the query satisfies $a \in I$. Quality is quantified by the length of I, denoted $|I|$, which is reduced during the run of the algorithm. Perfect quality is reached when $I = [â, â]$.

The mechanism we have introduced, called a Multi-Resolution Aggregate Tree (MRA-Tree), is a hierarchical decomposition of the data space S. Each node of the MRA-Tree covers a region of S. The root covers the entire S. We store on each node, along with the spatial decomposition information, aggregate information (specifically MIN, MAX, SUM, COUNT values) about each of its children nodes. An example of an MRA-quadtree can be seen in Figure 4.

A query can always be mapped to two sets of nodes in the MRA-tree: a set of nodes that are completely contained in the query region (set NC) and a set of nodes that are partially contained in the query region or are a superset of the region (set NP). The contribution of nodes from set NC to the query is certain and we do not need to visit their children. However, the contribution of nodes from NP to the query is uncertain.
Our algorithm specifies: (a) a technique to estimate the value of the aggregate based on the sets NC and NP, (b) a technique to provide minimal 100% confidence intervals I, as described above, and (c) a method for choosing nodes from NP to explore by visiting their children. We briefly outline the above for COUNT queries; a description for all SQL-type aggregates can be found in [9].

Given a query region R and sets of nodes NC, NP, the count of data items in the query must lie in the interval:

$$I_{count} = \left[ \sum_{N \in NC} count_N,\ \sum_{N \in NC} count_N + \sum_{N \in NP} count_N \right]$$

To estimate the answer to the query given NC and NP, we assume that data items are uniformly distributed in partially overlapping nodes. If $p_N$ is the fraction of node N that lies in the query region, then our estimate is:

$$E(count_R) = \sum_{N \in NC} count_N + \sum_{N \in NP} p_N \cdot count_N$$

An optimal traversal policy would aim to maximally reduce the uncertainty interval for a given number of nodes visited. Such a policy cannot be derived since we cannot know a priori how much a particular node will decrease the uncertainty. A heuristic that can be used is based on each node's contribution to the uncertainty interval. For the COUNT example, the contribution of a node N is equal to its count, i.e., $count_N$. Thus a simple but effective traversal policy keeps partially overlapping nodes in a priority queue keyed on count and explores first the node with the largest count, and hence the largest contribution to the uncertainty.
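The interval, the estimate and the traversal heuristic for COUNT can be put together in a short sketch. This is an illustration under stated assumptions rather than the implementation of [9]: it uses a 2D point quadtree with half-open cells, stores each node's count on the node itself instead of on its parent, and explores the partially overlapping node with the largest count first. The names `Node`, `build`, `overlap_fraction` and `progressive_count` are hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), half-open cell


@dataclass
class Node:
    rect: Rect
    count: int
    children: List["Node"] = field(default_factory=list)
    points: List[Tuple[float, float]] = field(default_factory=list)


def in_rect(p, r: Rect) -> bool:
    return r[0] <= p[0] < r[2] and r[1] <= p[1] < r[3]


def contained(r: Rect, q: Rect) -> bool:
    return q[0] <= r[0] and q[1] <= r[1] and r[2] <= q[2] and r[3] <= q[3]


def overlap_fraction(r: Rect, q: Rect) -> float:
    w = max(0.0, min(r[2], q[2]) - max(r[0], q[0]))
    h = max(0.0, min(r[3], q[3]) - max(r[1], q[1]))
    return (w * h) / ((r[2] - r[0]) * (r[3] - r[1]))


def build(points, rect: Rect, capacity=1, depth=0, max_depth=8) -> Node:
    x0, y0, x1, y1 = rect
    inside = [p for p in points if in_rect(p, rect)]
    node = Node(rect, len(inside))
    if len(inside) > capacity and depth < max_depth:
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        for sub in ((x0, y0, mx, my), (mx, y0, x1, my),
                    (x0, my, mx, y1), (mx, my, x1, y1)):
            node.children.append(build(inside, sub, capacity, depth + 1, max_depth))
    else:
        node.points = inside  # leaves keep the raw data items
    return node


def progressive_count(root: Node, q: Rect, max_visits: int):
    """Return (low, estimate, high) after expanding at most max_visits nodes."""
    certain, tie, heap = 0, 0, []  # heap holds NP as a max-heap on count

    def classify(node: Node) -> None:
        nonlocal certain, tie
        p = overlap_fraction(node.rect, q)
        if p == 0.0 or node.count == 0:
            return  # disjoint or empty: no contribution
        if contained(node.rect, q):
            certain += node.count  # node joins NC
        else:
            heapq.heappush(heap, (-node.count, tie, p, node))  # node joins NP
            tie += 1

    classify(root)
    for _ in range(max_visits):
        if not heap:
            break
        _, _, _, node = heapq.heappop(heap)
        if node.children:
            for child in node.children:
                classify(child)
        else:  # leaf: resolve exactly from its stored points
            certain += sum(1 for pt in node.points if in_rect(pt, q))
    high = certain + sum(-c for c, _, _, _ in heap)
    estimate = certain + sum(p * -c for c, _, p, _ in heap)
    return certain, estimate, high
```

On seven points in the cell [0,8) x [0,8) with a query covering [2,8) x [0,4), the interval narrows from [0, 7] before any node is expanded, to [1, 5] after one expansion, to the exact answer 3 after two, which illustrates how an answer can be returned at whatever quality the deadline allows.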

Figure 5. Error Decrease as MRA-tree nodes are visited (left) and MRA-tree comparison with plain quadtree index scan (right)

The experimental results on both synthetic and real-life spatial data indicate that this technique can be used very effectively to provide approximate aggregate answers. Even for exact answering the results are quite good; this is because in the course of the algorithm we do not need to visit the subtrees rooted at nodes of set NC, saving a lot of computational cost. In Figure 5 we see the method's performance on an MRA-quadtree using synthetic data. Both in terms of nodes visited for an exact answer (as compared to a normal quadtree index scan) and in the decrease of the estimation error, the technique is shown to be highly effective.

6. Future Directions: the QUASAR Project

Our work in the SATURN project, as outlined in this paper, addresses some of the issues inherent in incorporating a notion of quality in large spatio-temporal databases, along with other related issues. Our main finding has been that future database systems over this type of data must: (i) allow the user to express complex and interesting queries, and (ii) evaluate queries by taking into consideration both time and resource consumption. We have proposed new ways to represent spatio-temporal information in the database system, new query types that can be supported at the database level itself, and efficient processing techniques to produce answers at a guaranteed level of quality.

Looking ahead, we have initiated the QUASAR project (see footnote 2) at UC Irvine, which is an effort towards what we call quality aware data management. The project's focus is on integrating dynamic data seamlessly in database engines, providing functionality to applications over such types of data. Related projects include the TRAPP project [10] at Stanford University and the COUGAR Project [1] at Cornell University.
A variety of issues are explored in the context of our project: how to specify quality and resource consumption constraints at the language level, possibly through extensions to SQL; how to manage the dissemination of information in a distributed architecture composed of sensors, servers and mobile user information devices (e.g., PDAs); how to execute complex relational queries involving all operators over imprecise representations of dynamic data; and finally, how to register and monitor triggers over multiple spatially-embedded sensors efficiently. Even though we anticipate that a system incorporating dynamic data and providing awareness of the changing data landscape will have multiple, potentially conflicting modalities, we nonetheless believe that an adaptable data architecture for this sort of data is both necessary and feasible.

Footnote 2. Quality Aware Sensor Architecture.

References

[1] Philippe Bonnet, Johannes Gehrke and Praveen Seshadri, Towards Sensor Database Systems, 2001, Mobile Data Management Conference.
[2] K. Chakrabarti and S. Mehrotra, Dynamic Granular Locking Approach to Phantom Protection in R-Trees, February 1998, IEEE ICDE Conference.
[3] K. Chakrabarti and S. Mehrotra, Efficient Concurrency Control in Multidimensional Access Methods, June 1999, ACM SIGMOD Conference.
[4] J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques, 1993, ISBN 1-55860-190-2.
[5] A. Guttman, R-Trees: A Dynamic Index Structure for Spatial Searching, June 1984, ACM SIGMOD Conference.
[6] J. Hellerstein, J. Naughton and A. Pfeffer, Generalized Search Trees for Database Systems, September 1995, VLDB Conference.
[7] G. Kollios, D. Gunopulos and V. Tsotras, On Indexing Mobile Objects, June 1999, ACM PODS Symposium.
[8] M. Kornacker, C. Mohan and J. Hellerstein, Concurrency and Recovery in Generalized Search Trees, June 1997, ACM SIGMOD Conference.
[9] I. Lazaridis and S. Mehrotra, Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure, May 2001, ACM SIGMOD Conference.
[10] Chris Olston and Jennifer Widom, Offering a Precision-Performance Tradeoff for Aggregation Queries over Replicated Data, September 2000, VLDB Conference.
[11] K. Porkaew, Database Support for Similarity Retrieval and Querying Mobile Objects, Ph.D. thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2000.
[12] S. Saltenis, C. Jensen, S. Leutenegger and M. Lopez, Indexing the Positions of Continuously Moving Objects, May 2000, ACM SIGMOD Conference.
[13] O. Wolfson, B. Xu, S. Chamberlain and L. Jiang, Moving Objects Databases: Issues and Solutions, July 1998, IEEE SSDBM Conference.