A Distributed Indexing Scheme for Multi-dimensional Range Queries in Sensor Networks
Tingjian Ge

Outline
- Introduction and Overview
- Concepts and Technology
- Inserting Events into the Index
- Querying the Index
- Analysis and Comparison with Other Schemes

Motivation
- Events are data
  - Tuple of attribute values <A_1, A_2, ..., A_k>
  - Each A_i is a sensor reading
- Multi-dimensional range queries
  - <x_1-y_1, x_2-y_2, ..., x_k-y_k>, where x_i-y_i is the range requested for attribute A_i
  - Example: list all events that have temperatures between 50 °F and 60 °F, and light levels between 10 and 20
  - Point query (equality) is a special case
  - Essentially any query
- Correlate events and trigger actions
  - E.g., queries indicate an event, triggering cameras

Traditional Database Systems
- Not in this paper
- B-tree index
- Hash index
- Patricia tree index
- Bitmap index
- All of the above are centralized indices
- Data dependency (insertion order vs. index structure)
  - B-tree is data-dependent, Patricia tree is not
  - Our distributed index is not
- Sensitivity of index structure to insert/update/delete
- Rebalancing of index structure on skewed data

How About Queries in Sensor Networks?
- Flooding
  - Events are stored where they are generated, and queries are flooded through the network
- External storage
  - All events are stored centrally in a node outside the sensor network
- Geographic hash table (GHT)
  - Events are hashed on some attribute; range queries are sub-divided and hashed to the appropriate locations
- Distributed Index for Multi-dimensional data (DIMs)

Overview
- Events (data) are mapped onto zones of the network (multi-dimensional space to 2-d space)
- Data locality: events with close attribute values are stored in the same location in the network
  - Locality-preserving geographic hash
- Events are routed to and stored at that node
- Queries are routed to and resolved by the appropriate nodes

Using GPSR - Greedy Perimeter Stateless Routing
- Algorithm to route events to the appropriate nodes at a specified location
- Greedy-mode forwarding
  - A node receives a packet with destination X and forwards it to the neighbor closest to X
- Perimeter mode
  - Used if no neighbor exists that takes the packet closer to its destination (i.e., a void)
  - Right-hand rule to circumnavigate voids

Zones
- Rectangle R on the x-y plane (entire network)
- Subrectangle Z is a zone if Z is obtained by dividing R k times, satisfying the following property:
  - After the i-th division, 1 ≤ i ≤ k, R is partitioned into 2^i equal rectangles
  - If i is odd (even), the division is parallel to the y-axis (x-axis)
- k is the level of the zone: level(Z) = k

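The greedy-mode rule on the GPSR slide can be sketched in a few lines. This is only an illustration under stated assumptions: positions are (x, y) tuples, the node already knows its neighbors' positions, and the function name `greedy_next_hop` is hypothetical. Perimeter mode (the right-hand rule) is not implemented; the sketch just reports when it would be needed.

```python
import math

def greedy_next_hop(self_pos, neighbor_positions, dest):
    """Return the neighbor closest to dest, or None when no neighbor is
    closer to dest than this node (a void; GPSR would switch to perimeter mode).
    Hypothetical helper for illustration, not GPSR's actual packet handling."""
    best = min(neighbor_positions,
               key=lambda p: math.dist(p, dest),
               default=None)
    if best is None or math.dist(best, dest) >= math.dist(self_pos, dest):
        return None  # greedy forwarding fails here
    return best

# One neighbor makes progress toward the destination, the other does not.
print(greedy_next_hop((0.0, 0.0), [(1.0, 0.0), (0.0, 1.0)], (3.0, 0.0)))
# Both neighbors are farther from the destination than the node itself.
print(greedy_next_hop((1.0, 0.0), [(0.0, 0.0), (1.0, 2.0)], (3.0, 0.0)))
```

The first call picks the neighbor at (1.0, 0.0); the second returns None, which is the situation the perimeter-mode bullet describes.
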
Zones Example
(figure)

Zone Identification
- code(Z)
  - Bit string of length level(Z)
  - Starting from the left of the code string: if zone Z resides in the left half of R, the bit is 0, else 1
  - For the next bit: if zone Z resides in the bottom half of R, the bit is 0, else 1
- addr(Z)
  - Location of the centroid of the zone rectangle

Zone Terminology
- Sibling subtree of a zone
  - Left/right subtree rooted at the same parent zone
- Backup zone
  - If the sibling subtree of a zone is on the left (right), its backup zone is the rightmost (leftmost) zone in its sibling subtree
  - If code(Z) = p1, code(backup(Z)) = p01*
  - If code(Z) = p0, code(backup(Z)) = p10*

Associating Zones with Nodes
- The sensor field is divided into zones, which can be of different sizes (not a complete binary tree: zones sit at different levels)
- Zone ownership
  - A owns Z_A
  - Z_A is the largest zone that contains only node A
- Some zones may not have an owner node
  - backup(Z) is the owner

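A small sketch may help tie code(Z), addr(Z), and backup(Z) together. It assumes a unit-square field, zone codes represented as strings of '0' and '1', and a known set of existing (leaf) zone codes; the names `zone_rect`, `zone_addr`, and `backup_code` are hypothetical and not taken from the paper.

```python
def zone_rect(code, field=(0.0, 1.0, 0.0, 1.0)):
    """Rectangle (x_lo, x_hi, y_lo, y_hi) of the zone with the given code.
    Bits at even positions halve the x-range (0 = left), bits at odd
    positions halve the y-range (0 = bottom)."""
    x_lo, x_hi, y_lo, y_hi = field
    for i, bit in enumerate(code):
        if i % 2 == 0:
            mid = (x_lo + x_hi) / 2
            if bit == "0":
                x_hi = mid
            else:
                x_lo = mid
        else:
            mid = (y_lo + y_hi) / 2
            if bit == "0":
                y_hi = mid
            else:
                y_lo = mid
    return (x_lo, x_hi, y_lo, y_hi)

def zone_addr(code, field=(0.0, 1.0, 0.0, 1.0)):
    """addr(Z): centroid of the zone rectangle."""
    x_lo, x_hi, y_lo, y_hi = zone_rect(code, field)
    return ((x_lo + x_hi) / 2, (y_lo + y_hi) / 2)

def backup_code(code, leaf_codes):
    """code(backup(Z)): for code p1 search p0, p01, p011, ...; for code p0
    search p1, p10, p100, ... until an existing zone code is found.
    Returns None if the root has no sibling or no such zone exists."""
    if not code:
        return None
    last = code[-1]
    candidate = code[:-1] + ("0" if last == "1" else "1")
    max_len = max((len(c) for c in leaf_codes), default=0)
    while candidate not in leaf_codes and len(candidate) < max_len:
        candidate += last  # rightmost (1s) or leftmost (0s) descent
    return candidate if candidate in leaf_codes else None

leaves = {"00", "010", "011", "1"}   # an assumed, uneven zone tree
print(zone_rect("0110"))             # (0.25, 0.5, 0.5, 0.75)
print(zone_addr("0110"))             # (0.375, 0.625)
print(backup_code("1", leaves))      # 011
print(backup_code("00", leaves))     # 010
```

In the assumed tree, zone 1 has its sibling subtree on the left, so its backup is the rightmost zone there (011), matching the p1 to p01* pattern on the slide.
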
Algorithm for Zone Ownership
- Each node maintains its four boundaries
  - Initialized to the network boundary
- Send messages to learn the locations of neighbors
  - If a neighbor responds, the node adjusts its boundaries accordingly
  - Else the boundary is undecided
- Undecided boundaries are resolved during querying or event insertion
- Discussions (optional):
  - Efficiency
  - Alternative
  - Reality and conclusion (offload insert/query, one-time)

Pseudocode for Zone Ownership
  Build-Zone(a)
    while Contain(Z_A, a) do
      if length(code(Z_A)) mod 2 == 0 then
        new_bound = (bound[0] + bound[1]) / 2
        if A.x < new_bound then bound[1] = new_bound
        else bound[0] = new_bound
      else
        new_bound = (bound[2] + bound[3]) / 2
        if A.y < new_bound then bound[3] = new_bound
        else bound[2] = new_bound
      update code(Z_A) with the bit for the half that was kept

Inserting Events into the Index
- Hashing an event to a zone
- Routing an event to its owner
- Resolving undecided zone boundaries during insertion

Hashing an Event to a Zone
- There are m attributes A_1, A_2, ..., A_m, and attribute values have been normalized
- To assign a k-bit zone code to an event:
  - For i in [1, m]: if A_i < 0.5, the i-th bit of the zone code is 0, else 1
  - For i in [m+1, 2m]: if A_(i-m) < 0.25 or 0.5 ≤ A_(i-m) < 0.75, the i-th bit is 0, else 1
  - And so on, until all k bits are assigned

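The bit-assignment rule for hashing an event can be written as a repeated halving of each attribute's range, cycling through the attributes. The sketch below assumes attribute values are normalized to [0, 1) and uses a hypothetical function name, `event_code`; it is an illustration of the rule on the slide, not code from the paper.

```python
def event_code(values, k):
    """Return the k-bit zone code for an event with normalized attribute values.
    Bit i (0-based) refines attribute i mod m: 0 if the value lies in the
    lower half of that attribute's current range, 1 otherwise."""
    m = len(values)
    if m == 0:
        raise ValueError("an event needs at least one attribute")
    lo = [0.0] * m   # current lower bound per attribute
    hi = [1.0] * m   # current upper bound per attribute
    bits = []
    for i in range(k):
        j = i % m
        mid = (lo[j] + hi[j]) / 2
        if values[j] < mid:
            bits.append("0")
            hi[j] = mid
        else:
            bits.append("1")
            lo[j] = mid
    return "".join(bits)

print(event_code([0.3, 0.8], 5))       # 01110
print(event_code([0.4, 0.8, 0.9], 1))  # 0
```

With two attributes, the first pass tests each value against 0.5, the second pass against 0.25 or 0.75 depending on the half chosen earlier, which is the same test as "A < 0.25 or 0.5 ≤ A < 0.75" for a 0 bit.
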
Hashing an Event to a Zone
- Example: hash event <0.3, 0.8> to a 5-bit zone code
  - Zone code = 01110
- Discussions (optional):
  - Precision
  - What if k < m? Add dummy levels?
  - Ordering of attributes
  - Normalization, value bound tracking, dynamic updates
  - Can the actual code of an event be determined from only the max level of the network?

Routing an Event to its Owner
- The node generating the event calculates code(E) up to its own code length
- GPSR delivers the message to some intermediate node A
  - Message contains: event E, code(E), target location, owner, location of owner
- A encodes the event to code_new(E) (actually only if needed)
  - Updates the message if code_new(E) is longer than the code in the message
- A checks whether code(A) has a longer match with code(E) than the previous owner
  - If yes, A updates the message by setting itself as the owner
- If code(A) and code(E) are identical and A's boundaries are known, A is the owner of E and stores it
- Else A routes E to its owner by invoking GPSR

Resolving Undecided Boundaries
- Suppose node C receives event E
- If code(C) = code(E) and all of C's boundaries are known, C stores the event
- If C has undecided boundaries, there may be zone overlap with another node
  - C sets itself as owner and forwards the message using GPSR perimeter mode
- If the message is not changed, it will come back to C
  - C assumes it is the owner and stores the event

Resolving Undecided Boundaries
- An intermediate node X marks itself as the owner but code(E) is unchanged
  - X recognizes zone overlap with C, adjusts its boundaries, and sends a message to C to update its boundaries
- An intermediate node D refines code(E)
  - D tries to deliver the message to the new zone
  - Another node X may overlap with C
  - X shrinks its zone and sends C a message to do the same
  - C updates its undecided boundary

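The owner-update step from the routing slide ("A checks whether code(A) has a longer match with code(E) than the previous owner") can be sketched as a longest-common-prefix comparison. The message layout here is an assumption: a dict with hypothetical fields `code`, `owner`, and `owner_match` (the match length of the current owner), which the slides do not specify.

```python
def common_prefix_len(a, b):
    """Number of leading bits on which two code strings agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def consider_ownership(msg, node_id, node_code):
    """If this node's code matches code(E) on a longer prefix than the
    owner recorded in the message, record this node as the owner."""
    match = common_prefix_len(node_code, msg["code"])
    if match > msg["owner_match"]:
        msg["owner"] = node_id
        msg["owner_match"] = match
    return msg

# Assumed message state: node B currently recorded as owner with a 1-bit match.
msg = {"code": "0111", "owner": "B", "owner_match": 1}
consider_ownership(msg, "A", "0110")   # A matches 3 bits, so A takes over
consider_ownership(msg, "C", "1")      # C matches 0 bits, message unchanged
print(msg["owner"], msg["owner_match"])  # A 3
```

Whether a node finally stores the event still depends on the equality test and on its boundaries being known, as the slide states; this sketch covers only the comparison.
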
Example
- Nodes A and B have claimed the same zone 0
- Node A generates event E = <0.4, 0.8, 0.9>, code(E) = 0
- Perimeter-mode forwarding of the event to B
- B and A engage in a message exchange to shrink their zones
- Mistake in the paper: B shrinks its zone from 0 to 01 according to A's location (not needed, it knows)

Queries
- Routing
  - Routing for point queries is the same as event insertion
  - Range queries
    - The query is initially routed to the zone corresponding to the entire range
    - Comment: effectively, this means the initial destination of the query is the lowest-level node containing the query ranges
    - The query is progressively split into smaller sub-queries so that each sub-query can be resolved by a single node

Splitting Queries
- Node A splits a query if its zone overlaps, but does not entirely contain, the query range
- If the range of Q's first attribute contains the value 0.5, A divides Q into two sub-queries, one covering the part of the range in 0 to 0.5 and the other the part in 0.5 to 1
- If a sub-query no longer overlaps zone(A), A stops splitting it
- Else A continues splitting the query using successive attribute ranges and recomputing the overlap until it is small enough to fit entirely in zone(A)

Splitting Queries Example
- Suppose there is a node A with code(A) = 0110
- Split a query Q = <0.3-0.8, 0.6-0.9>

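The splitting rule can be traced on the example above with a short sketch. It assumes ranges are half-open intervals [low, high) over normalized attributes and that bit i of the node's code constrains attribute i mod m; the function name `split_query` and the return shape are hypothetical. It models the split at a single node only, not the forwarding of the remaining sub-queries.

```python
def split_query(node_code, query):
    """Split a range query at the node whose zone code is node_code.
    query is a list of (low, high) pairs, one per attribute.
    Returns (local, forwarded): the sub-query that fits in the node's zone
    (or None if nothing overlaps) and the sub-queries left for other nodes."""
    m = len(query)
    lo, hi = [0.0] * m, [1.0] * m      # attribute ranges covered by the zone
    current = list(query)
    forwarded = []
    for i, bit in enumerate(node_code):
        j = i % m
        mid = (lo[j] + hi[j]) / 2
        q_lo, q_hi = current[j]
        if bit == "0":                 # zone keeps the lower half [lo, mid)
            if q_lo >= mid:            # no overlap: stop splitting
                forwarded.append(tuple(current))
                return None, forwarded
            if q_hi > mid:             # range straddles mid: split it
                other = list(current)
                other[j] = (mid, q_hi)
                forwarded.append(tuple(other))
                current[j] = (q_lo, mid)
            hi[j] = mid
        else:                          # zone keeps the upper half [mid, hi)
            if q_hi <= mid:
                forwarded.append(tuple(current))
                return None, forwarded
            if q_lo < mid:
                other = list(current)
                other[j] = (q_lo, mid)
                forwarded.append(tuple(other))
                current[j] = (mid, q_hi)
            lo[j] = mid
    return tuple(current), forwarded

local, forwarded = split_query("0110", [(0.3, 0.8), (0.6, 0.9)])
print(local)      # ((0.3, 0.5), (0.6, 0.75))
print(forwarded)  # [((0.5, 0.8), (0.6, 0.9)), ((0.3, 0.5), (0.75, 0.9))]
```

Under these assumptions, node A with code 0110 keeps <0.3-0.5, 0.6-0.75> and hands off two sub-queries: one split off at 0.5 on the first attribute and one split off at 0.75 on the second.
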
Query Resolution
- Once a sub-query falls into a zone, the node owner resolves the query and sends the reply to the querier
- The other sub-queries are forwarded to other nodes

Robustness
- Maintaining zones: zone expansion (due to nodes turned off)
- Dealing with node failures:
  - Local replication (sibling zone)
  - Mirror replication (one's complement of zone code)
- Dealing with packet loss
  - ACK, negative ACK, timeout, selectively re-issue

Analysis on DIMs
- Metrics
  - Average insertion cost: average number of messages required to insert an event into the network
    - Discussion: why is the result as shown on the next slide?
  - Average query delivery cost: average number of messages required to route a query message to all the relevant nodes
- Compared against alternatives: GHT and flooding
  - Discussion: what happens with GHT?
  - Discussion: why the difference between bounded uniform and exponential query distributions?

Average Insertion Cost
(figure)

Average Query Cost: Bounded Uniform Distribution
(figure)

Average Query Cost: Exponential Query Distribution
(figure)

Conclusion
- Under reasonable assumptions about query distributions, DIMs scale quite well with network size (both insertion and query costs scale as O(√N))
- Work that still needs to be done
  - Skewed data distribution
  - Existential queries
  - Node heterogeneity

Final Thoughts & Discussions
- Comparison with B-tree index
  - B-tree index can only do prefix match; DIMs can match any attribute, with distributed and concurrent processing
  - B-tree can rebalance
  - DIMs is essentially a binary tree, but GPSR routing makes the cost more than log N, and N is the total number of network nodes, not the data size
- Locality
  - The more levels, the more divided the values, and the worse the locality
  - Even if there are few events (data), they are likely to be very distributed
  - Possible solution: change normalization, but it doesn't scale well
- Selectivity
  - Depends on network node structure
- Insertion cost vs. query cost
  - In sensor networks, insertion cost is a big deal
- Improvements?
  - Distributed caching of query results?