Data Storage in Sensor Networks for Multi-dimensional Range Queries

Ji Yeon Lee 1, Yong Hun Lim 2, Yon Dohn Chung 3,*, and Myoung Ho Kim 4

1 E-Government Team, National Computerization Agency, Seoul, Korea, jylee@nca.or.kr
2 Home Platform Group, Samsung Electronics, Seoul, Korea, yonghun.lim@samsung.com
3 Department of Computer Engineering, Dongguk University, Seoul, Korea, ydchung@dgu.edu
4 Department of Computer Science, KAIST, Daejon, Korea, mhkim@dbserver.kaist.ac.kr

Abstract. In data-centric sensor networks, various data items such as temperature, humidity, and pressure are sensed and stored in sensor nodes. As these attributes are mostly scalar values and are inter-related, multi-dimensional range queries are very useful. However, previous work on range query processing in sensor networks did not consider overall network lifetime. To prolong network lifetime and support multi-dimensional range queries, we propose a dynamic data placement method for multi-dimensional data, in which the data space is divided into equal-sized regions that are placed over sensor nodes dynamically. Through experiments, we show the efficiency of the proposed method compared with previous work.

Keywords: sensor network, data-centric storage, multi-dimensional range queries, data placement and distribution.

1 Introduction

A sensor network consists of widely distributed sensors, where each sensor node is a small device with limited computing, storage, and wireless communication capacity [1, 2, 4, 5, 8]. The applications of sensor networks have expanded into areas such as the military, the environment, and health. For example, in an environmental monitoring application, sensor nodes that are widely and randomly deployed over deserts or volcanic areas periodically sense environmental parameters such as temperature, humidity, and air pressure. The sensor data can be stored locally in the sensor nodes or delivered to other sensor nodes or outer gateways.
A sensor network in which measured data are stored within the sensor nodes is called data-centric; this is the target environment of this paper. The stored data are analyzed or processed through various queries, including point queries, range queries, and aggregation queries.

* Corresponding author.
L.T. Yang et al. (Eds.): ICESS 2005, LNCS 3820, pp. 420-429, 2005. Springer-Verlag Berlin Heidelberg 2005

A sensor network has some unique properties compared with a conventional wireless network [2, 4, 5, 7, 8]. First, the resources of a sensor, such as computing power, storage, and energy, are very restricted. In particular, since sensor nodes are powered by batteries that cannot be recharged or replaced, efficient control of energy consumption in sensor nodes is very important. Second, sensor nodes are deployed randomly (i.e., in an uncontrolled manner) over a target area, and hence they are not informed of the overall network configuration. (That is, a sensor node is aware only of the neighbor nodes within its radio range.) Third, sensor nodes are not collected and reused, so it is important to make full use of the deployed sensors by prolonging their lifetimes. Since the lifetime of a sensor network is determined by that of its shortest-lived sensor node, we must take care that energy consumption is not concentrated on a few hot-spot nodes.

In data-centric sensor networks [5, 7, 8], data are stored in sensor nodes based on their values, rather than being stored in a collector node or delivered to external storage (or a predefined processor). In this approach, each sensor node has a predefined region of the data domain that it will store. This prevents sensor nodes that are located near external storage, or that collect more data than the others, from becoming hot spots (i.e., consuming much more energy than the others).

In this paper, in order to balance the energy consumption of sensor nodes, we propose a dynamic data placement method in which we initially assign each sensor node a range of the data domain and then dynamically adjust the ranges of sensor nodes based on their workload. For region assignment, we linearize the multi-dimensional data space by using Hilbert space-filling curves, and we build a linear address space by traversing the sensor nodes in a zigzag manner.
The rest of the paper is organized as follows. In Section 2, we describe related work on data-centric sensor networks. In Section 3, we propose a dynamic data placement method for multi-dimensional queries on sensor networks. In Section 4, we show the performance of the proposed method through experimental results. In Section 5, we conclude the paper with some remarks on future work.

2 Related Work

In data-centric sensor networks, an addressing scheme is needed to determine which sensor node stores a given data item. The address (also called the index) denotes the logical position of data storage and is used for routing data or queries to the target sensors.

A popular addressing scheme in conventional data-centric sensor networks is GHT (Geographic Hash Table) [8]. In the GHT method, data are stored in sensor nodes that are determined based on their geographic locations. Although this method is effective for exact-match queries and prefix queries, it is not efficient for range queries. This is because data objects of similar values are stored in geographically dispersed sensor nodes, and hence partial queries must be transmitted to many sensor nodes in the network to process a range query.

In the DIM (Distributed Index for Multi-dimensional data) approach [5], the geographic area of the sensor network is iteratively divided along the X-dimension and the Y-dimension in alternation until only one sensor node remains in each region. Then, the data space is partitioned and assigned to the geographic regions of the sensor nodes. Unlike GHT, DIM assigns data objects of similar values to geographically near sensor nodes. This improves energy efficiency for range-query processing because the number of communications between sensor nodes is reduced compared with GHT. However, since the sensor nodes are deployed in an uncontrolled manner, the data region assignments of the sensor nodes cannot be guaranteed to be balanced. (In Figure 1(b), you can observe that the data allocation for each sensor node is not equal.) Because the energy consumption of a sensor node for query processing is proportional to the amount of data it stores, a non-uniform data assignment causes non-uniform energy consumption across sensor nodes, which shortens the lifetime of the sensor network.

Fig. 1. Data Distribution in GHT and DIM Methods: (a) GHT, (b) DIM

3 The Proposed Method

Our solution for dynamic distribution of sensor data consists of the following steps:

(1) We construct a one-dimensional address space of sensor nodes (i.e., linearization of sensor nodes) through zigzag traversing, such that geographically near nodes are located near each other in the linear address space.
(2) We transform the multi-dimensional data space into a one-dimensional data space (i.e., linearization of the data space) using Hilbert space-filling curves.
(3) Initially, the data regions of the one-dimensional data space are uniformly assigned to the one-dimensional address space of sensor nodes. Then, during the lifetime of the sensor network, parts of the data regions initially allocated to sensor nodes are dynamically migrated to nearby sensor nodes based on the workload of the sensor nodes.

3.1 Construction of One Dimensional Address Space of Sensor Nodes

If we assign data regions to sensor nodes based on geographic positions as in DIM, the amount of data allocated to sensor nodes may be unbalanced. This causes the

energy consumption of the nodes that cover relatively large data regions to be higher than that of the others, which shortens the lifetime of the entire sensor network. To address sensor nodes without geographic position information, we construct a linear address space of sensor nodes through zigzag traversing. Zigzag traversing allows sensor nodes that are deployed geographically near each other to be assigned nearby addresses in the linear address space.

The procedure for zigzag traversing is shown in Figure 2. All nodes are initially assigned level numbers by constructing a spanning tree via flooding (see Figure 3(a)). A level number denotes the number of hops needed to reach the node from an outer point. Once level numbering is complete, we traverse the sensor nodes in a zigzag manner. The traversal sequence becomes the one-dimensional address space of sensor nodes, where the start node is numbered 1. Figure 3(b) shows an example of zigzag traversing. In the figure, sensor node a has three same-level neighbors (nodes b, c, and d). According to Step 1.A of Figure 2, node b is selected.

1. Choose a sensor node at the lowest level among the neighboring sensor nodes.
   A. When there are multiple candidates (i.e., at the same level), choose the sensor that has the fewest neighbors. (This is a heuristic for selecting outlier nodes first.)
      i. When there are still multiple candidates (i.e., the same level and the same number of neighbors), choose the nearest one.
2. When no neighbor node remains, backtrack along the traversed path and check whether any neighboring sensor nodes have not yet been chosen. If so, proceed from Step 1 at that node.
3. When we backtrack to the start node (i.e., the node numbered 1), the traversal terminates.

Fig. 2. The Algorithm of Zigzag Traversing

Fig. 3. Generation of Linear Address Space by Zigzag Traversing of Sensor Networks: (a) Level Numbering, (b) Zigzag Traversing
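The steps of Figure 2 can be sketched in a few lines of Python. This is only an illustrative reading of the algorithm, not the authors' implementation. The names `assign_levels`, `zigzag_addresses`, `adj`, and `pos` are hypothetical, the root node is given level 1, and the "number of neighbors" tie-break counts all radio neighbors of a candidate; all three are assumptions that the paper does not state.

```python
import math
from collections import deque


def assign_levels(adj, root):
    """Level numbering by breadth-first flooding from the root.

    Assumption: the root gets level 1 and each further hop adds 1.
    """
    level = {root: 1}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def zigzag_addresses(adj, pos, root):
    """Return {node: address}, with addresses starting at 1 for the root.

    adj: dict mapping each node to the list of its radio neighbors.
    pos: dict mapping each node to (x, y); used only for the 'nearest' tie-break.
    """
    level = assign_levels(adj, root)
    address = {root: 1}
    path = [root]  # traversed path, used for backtracking (Step 2)
    while path:
        cur = path[-1]
        candidates = [v for v in adj[cur] if v not in address]
        if not candidates:
            path.pop()  # backtrack; an empty path means Step 3 is reached
            continue
        # Step 1: lowest level; 1.A: fewest neighbors; 1.A.i: nearest node.
        nxt = min(
            candidates,
            key=lambda v: (level[v], len(adj[v]), math.dist(pos[cur], pos[v])),
        )
        address[nxt] = len(address) + 1
        path.append(nxt)
    return address
```

On a connected network every node receives exactly one address, and nodes visited consecutively are always radio neighbors unless a backtrack occurred in between.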

3.2 Transforming Multi-dimensional Data Space into One Dimensional Data Space

In this paper, we consider multi-attribute sensor data (e.g., temperature, humidity, air pressure, and light) on which multi-dimensional range queries are processed. In order to store multi-dimensional data in sensor nodes, we transform the multi-dimensional data space into a one-dimensional one. For this purpose, we use a popular space-filling curve called the Hilbert curve. The Hilbert curve is known to have the best locality-preserving characteristic among space-filling curves such as Z-ordering and the Peano curve [3, 6]. Since the Hilbert curve method assumes a normalized data space, we have to normalize the sensor data. In this paper we assume that sensor nodes are aware of the domains of all attributes in the sensor data space and compute the normalized value as $a_N = (a - a_{MIN}) / (a_{MAX} - a_{MIN})$, where $a$ is the measured value, $a_N$ is the normalized value of $a$, $a_{MIN}$ is the minimum value of the attribute, and $a_{MAX}$ is its maximum value.

3.3 Data Allocation and Dynamic Adjustment

After generating the linear address space of sensor nodes via zigzag traversing and the linear data space via the Hilbert curve, we map the data regions evenly onto the sensor nodes, as in Figure 4. Since both zigzag traversing and the Hilbert space-filling curve tend to preserve locality, this data placement also has a good clustering effect for range queries. For example, in Figure 4, data regions 31~34, which are adjacent to each other in the original multi-dimensional data space, are allocated to the neighboring sensor nodes 4 and 5. This keeps the cost low when a range query that includes data regions 31~34 is processed, since partial queries need not be delivered to far-away nodes.

Fig. 4. Allocation of Data Regions to Sensor Nodes. Panels: Sensor Networks (node addresses 1 to 8 hold data regions 1~8, 9~16, 17~24, 25~32, 33~40, 41~48, 49~56, and 57~64, respectively) and Multi-dimensional Data Space (axes: Temperature, Humidity, Light).
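The normalization, linearization, and even allocation described in Sections 3.2 and 3.3 can be illustrated with a short Python sketch. It is a simplification under stated assumptions: it uses the standard two-dimensional Hilbert index computation rather than the general multi-dimensional curve used in the paper, the grid size `n` must be a power of two, and the function names (`normalize`, `hilbert_index`, `region_of`, `node_for_region`) are hypothetical. The orientation of the curve may differ from the region numbering drawn in Figure 4.

```python
def normalize(a, a_min, a_max):
    """a_N = (a - a_MIN) / (a_MAX - a_MIN), as in Section 3.2."""
    return (a - a_min) / (a_max - a_min)


def hilbert_index(n, x, y):
    """0-based position of cell (x, y) along a 2-D Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def region_of(values, domains, n):
    """1-based data region of a two-attribute reading.

    values: (a1, a2); domains: [(min1, max1), (min2, max2)].
    """
    cells = []
    for a, (a_min, a_max) in zip(values, domains):
        a_n = normalize(a, a_min, a_max)
        cells.append(min(int(a_n * n), n - 1))  # clamp a_N == 1.0 into the last cell
    return hilbert_index(n, cells[0], cells[1]) + 1


def node_for_region(region, num_regions, num_nodes):
    """Initial even allocation: consecutive regions go to consecutive node addresses."""
    per_node = num_regions // num_nodes
    return min((region - 1) // per_node + 1, num_nodes)
```

With 64 regions and 8 nodes, `node_for_region` places regions 25 to 32 on node 4 and regions 33 to 40 on node 5, which matches the allocation listed for Figure 4, so a query over regions 31~34 touches only those two nodes.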

The above allocation of data is fair when the workload on every data region is uniform, since the amount of data space allocated to each sensor node is equal. However, depending on the characteristics of the attributes, some data regions will be accessed more heavily than others. Query patterns also change dynamically during the lifetime of the sensor network. If the workload of the sensor nodes is not uniform, energy consumption can become skewed. To balance the workload of sensor nodes, which is the goal of our proposed method, we dynamically adjust the data regions allocated to sensor nodes based on their current workload.

Region Adjustment for Overloaded Sensor Nodes

To balance the workload, we have to measure the load of a sensor node quantitatively. Based on the assumption that the energy consumption of a sensor node depends primarily on the amount of data it has stored and the frequency of the queries it has processed, we define the load of a sensor node as follows:

Definition 1. The load $L_i$ of sensor node $i$ is defined as $L_i = \sum_j e_j q_j$, where $j$ ranges over the data regions allocated to sensor node $i$, $e_j$ is the amount of data in region $j$, and $q_j$ is the frequency of queries for region $j$.

In this paper, we use two terms: neighboring sensor and adjacent sensor. A neighboring sensor is a sensor that is connected directly, i.e., located within single-hop communication range. An adjacent node is a sensor node whose data region is adjacent. Usually, the adjacent nodes of a sensor node are chosen from among its neighboring nodes. When a node is overloaded, its two adjacent nodes take over parts of the data of the overloaded node for load balancing. Being overloaded means that the sensor has been consuming relatively more energy than the other sensors.
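As a small worked illustration of Definition 1, the following Python sketch computes the load of a node, reading the definition as the sum over its regions of the stored data amount times the query frequency. The function name `node_load` and the representation of a node as a list of `(e_j, q_j)` pairs are assumptions made for this example only.

```python
def node_load(regions):
    """Load L_i of a node, read as the sum of e_j * q_j over its regions.

    regions: iterable of (e_j, q_j) pairs, where e_j is the amount of data
    stored for region j and q_j is the query frequency observed for region j.
    """
    return sum(e * q for e, q in regions)


# A node holding three regions: 10 records queried 2 times, 4 records queried
# 5 times, and 7 records never queried.
example_load = node_load([(10, 2), (4, 5), (7, 0)])
```

Under this reading, a region that holds data but is never queried contributes nothing to the load, while a small but frequently queried region can dominate it.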
In the proposed method, dynamic adjustment of data regions between sensor nodes is triggered when an overloaded sensor node is detected. In this paper, we define the criterion for being overloaded as follows: a sensor node is overloaded when its remaining energy falls below half of the initial battery energy, or its available free storage falls below half of the initial storage space. (The criterion can be modified according to the target environment and application.)

When adjusting the data regions of sensor nodes, parts of the data regions of an overloaded sensor node are distributed to its adjacent nodes. The amount of data to migrate is determined by the relative loads of the overloaded node and its adjacent nodes.

Definition 2. The amount of data space $\Delta_{pq}$ to be transferred from sensor node $p$ to sensor node $q$ is as follows. (Here, $p_{max}$ and $p_{min}$ are the maximum and minimum addresses, on the one-dimensional address space, for sensor node $p$, and $L_p$ and $L_q$ are the loads of sensor nodes $p$ and $q$, respectively.)

$$\Delta_{pq} = \frac{p_{max} - p_{min}}{2} \times \frac{L_p - L_q}{L_p}$$
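The overload criterion and the transfer amount can be sketched as follows. This is a hedged illustration: it assumes that Definition 2 reads as half of node p's address range scaled by the relative load difference (L_p - L_q) / L_p, and the function names and the choice to return 0 when q is at least as loaded as p are assumptions of this sketch, not statements from the paper.

```python
def is_overloaded(energy_left, energy_init, storage_free, storage_init):
    """Criterion from the text: remaining energy or free storage below half of the initial value."""
    return energy_left < energy_init / 2 or storage_free < storage_init / 2


def transfer_amount(p_min, p_max, load_p, load_q):
    """Amount of data space to move from node p to its adjacent node q.

    Assumed reading of Definition 2:
        (p_max - p_min) / 2 * (load_p - load_q) / load_p
    Nothing is transferred when q is already at least as loaded as p
    (an assumption added here to keep the result non-negative).
    """
    if load_p <= 0 or load_q >= load_p:
        return 0.0
    return (p_max - p_min) / 2 * (load_p - load_q) / load_p
```

For example, a node holding addresses 25 to 32 with load 100 whose adjacent node has load 50 would hand over `transfer_amount(25, 32, 100, 50)`, i.e. 1.75 units of its address range. An idle adjacent node (load 0) would receive half of the range, which is the largest share this formula allows per side.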

When data are transferred to the two (i.e., left and right) adjacent nodes, these nodes might in turn become overloaded by the transferred data. In that case, those nodes can also transfer parts of their data to their own adjacent nodes progressively. For example, in Figure 5, some data are transferred from n3 (the initially overloaded sensor node) to nodes n2 and n4; then n2 and n4 can send parts of their data to n1 and n5, respectively.

Fig. 5. Data Transfer for Overloaded Sensor Nodes

4 Performance Experiments

We conducted simulation experiments to evaluate the performance of our approach compared with the previous one. We deployed 300 sensor nodes over an area of 800 m x 800 m, so that a sensor node has 14 neighbor nodes on average within its radio range of 100 m. A data record in a sensor consists of 5 attributes, and a sensor node contains 100 data records at the beginning. We ran each experiment until a sensor node failed. We generated multi-dimensional range queries that cover 5%, 10%, and 20% of the normalized data space and issued them at randomly chosen sensor nodes. The energy consumption is computed as the number of message hops multiplied by the number of bytes in each message. In this experiment, the energy consumed for data storage is ignored for convenience.

Energy Consumption

In DIM, when two sensor nodes are located very close to each other, one sensor is assigned a very large region of the data space while the other is assigned a small one. In our experiments, the largest assigned region was 5 times the size of the average one. This unbalanced region assignment increases the differences in energy consumption between sensor nodes, which shortens the lifetime of the overall network.
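The cost model stated in the experimental setup (message hops multiplied by message bytes, with storage cost ignored) is simple enough to write down directly. The sketch below is only an illustration of that accounting; the function name `energy_cost` and the representation of a message as a `(hops, size_bytes)` pair are assumptions.

```python
def energy_cost(messages):
    """Total energy units for a set of messages under the hops x bytes model.

    messages: iterable of (hops, size_bytes) pairs. Storage cost is ignored,
    as in the experimental setup described in the text.
    """
    return sum(hops * size_bytes for hops, size_bytes in messages)


# A 40-byte query forwarded over 3 hops and a 20-byte reply over 5 hops.
example_cost = energy_cost([(3, 40), (5, 20)])
```

Under this model, shortening the routes of partial queries (fewer hops) reduces cost exactly as much as shrinking the messages themselves.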
Figure 6 compares the energy consumption of our method and DIM when the data records are generated uniformly over the entire data space and the range queries access the data space uniformly. The results show the energy consumption ratios of the sensor nodes at the time when a sensor node has failed due to exhaustion of its energy. The DIM method has many more heavily drained nodes than our method.

We also tested a non-uniform setting, in which data generation follows a normal distribution with a mean of 0.5 and a standard deviation of 0.1, and query generation is likewise based on a normal distribution with parameters 0 and 0.1. Figure 7 shows the result of this experiment, in which we measured the energy consumption ratios of the sensor nodes at the time when 10% of the sensor nodes in our method had failed. (Here, we sort the ids of the sensor nodes for the reader's convenience.) The result shows that more than half of the sensor nodes in DIM have consumed most of their energy by the time of 10% node failure, whereas the sensor nodes in our method have consumed relatively little energy.

Fig. 6. Energy Consumption Ratios under Uniform Data and Queries

Fig. 7. Energy Consumption Ratios under Non-uniform Data and Queries (x-axis: sensor id; y-axis: node energy consumption (%); series: Ours, DIM)

Network Lifetime

Figure 8 compares the network lifetime (in unit time) of our method and DIM. As the figure shows, the lifetime of DIM until the first node failure is very short compared with ours. This indicates that DIM is not appropriate for mission-critical applications in which a single node failure cannot be tolerated. By the time of 15% node failure, our method survives longer (about 150%) than DIM.

Fig. 8. Network Lifetime according to Node Failure Ratios

5 Conclusion

In data-centric sensor networks, the lifetime depends significantly on data placement (or distribution). The use of hash functions is effective for load balancing since it distributes (i.e., de-clusters) data over the entire network. However, it is inefficient for range queries since many sensor nodes must be involved in query processing. On the other hand, the previous approach for range query processing, DIM, did not consider load balancing among sensor nodes, which results in differences in energy consumption between sensor nodes and thus a short network lifetime.

In this paper we have proposed a new data storage method that balances the workloads of sensor nodes and thus improves overall network lifetime. We constructed the address space of sensor nodes by using zigzag traversing and assigned the linearly transformed (via Hilbert curves) data space to it. Since this approach assigns adjacent addresses to neighboring sensor nodes, (multi-dimensional) range queries can be processed efficiently. In addition, the dynamic and progressive update of address assignments copes effectively with changes in workload and balances the energy consumption ratios of sensor nodes. Through simulation experiments, we have shown that the proposed approach efficiently balances the energy consumption of sensor nodes and improves the lifetime of sensor networks.

In this paper, we assumed that data are stored on sensor nodes without replication. For future work, we will investigate data replication in sensor networks and query processing over replicated data. If some replication of data between sensor nodes is allowable, the performance of query processing and the availability of the sensor database will be significantly improved.
Acknowledgement

This work was done as a part of the Information & Communication Fundamental Technology Research Program, supported by the Ministry of Information & Communication in the Republic of Korea.

References

[1] Bhardwaj, M. and Chandrakasan, A. P. Bounding the Lifetime of Sensor Networks Via Optimal Role Assignments. IEEE INFOCOM 2002.
[2] Greenstein, B. et al. DIFS: A Distributed Index for Features in Sensor Networks. Elsevier Journal of Ad Hoc Networks, 2003.
[3] Jagadish, H. V. Linear clustering of objects with multiple attributes. International Conference on Management of Data, Proceedings of the ACM SIGMOD 1990.
[4] Karp, B. and Kung, H. Greedy Perimeter Stateless Routing. In Proceedings of the Sixth Annual ACM/IEEE International Conference on Mobile Computing, pp. 243-254, 2000.
[5] Li, X. et al. Multi-dimensional Range Queries in Sensor Networks. Proceedings of the 1st international conference on Embedded networked sensor systems, pp. 63-75, 2003.
[6] Moon, B. et al. Analysis of the clustering properties of Hilbert space-filling curve. IEEE Trans. on Knowledge and Data Engineering, pp. 124-141, 1996.
[7] Newsome, J. and Song, D. GEM: Graph EMbedding for Routing and Data-Centric Storage in Sensor Networks without Geographic Information. SenSys 2003.
[8] Ratnasamy, S. et al. Data-Centric Storage in Sensornets with GHT, a Geographic Hash Table. Mobile Networks and Applications, 8, pp. 427-442, 2003.