Approximate Aggregation Techniques for Sensor Databases

Approximate Aggregation Techniques for Sensor Databases. John Byers, Computer Science Department, Boston University. Joint work with Jeffrey Considine, George Kollios, and Feifei Li.

Sensor Network Model. Large set of sensors distributed in a sensor field. Communication via a wireless ad-hoc network. Nodes and links are failure-prone. Sensors are resource-constrained: limited memory, battery-powered, and messaging is costly.

Sensor Databases. Useful abstraction: treat the sensor field as a distributed database. But: data is gathered on demand, not stored or saved. Express queries in an SQL-like language: COUNT, SUM, AVG, MIN, GROUP-BY. The query processor distributes the query and aggregates the responses. Exemplified by systems like TAG (Berkeley/MIT) and Cougar (Cornell).

A Motivating Example. Each sensor has a single sensed value. The sink initiates one-shot queries such as: What is the maximum value? The mean value? Continuous queries are a natural extension. [Figure: example network of ten sensors, labeled A to J, with sensed values between 1 and 12.]

MAX Aggregation (no losses). Build a spanning tree and aggregate in-network: each node sends one summary packet, and the summary holds the MAX of its entire sub-tree. One loss could lose the MAX of many nodes; neighbors of the sink are particularly vulnerable. [Figure: spanning tree over the example network; MAX = 12.]

MAX Aggregation (with loss). Nodes send summaries over multiple paths (local broadcast is free) and always send the MAX value observed. MAX is infectious and harder to lose: we just need one viable path to the sink. This relies on the duplicate-insensitivity of MAX. [Figure: multipath forwarding over the example network; MAX = 12.]
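
The duplicate-insensitivity that this slide relies on is easy to see in code. A minimal Python sketch (the node labels and readings are hypothetical, chosen only for illustration, not the exact values from the figure):

```python
# Duplicate-insensitive MAX: max(a, a) == a, and max is associative and
# commutative, so overlapping multipath deliveries cannot skew the result.
def merge_max(*partials):
    """Combine partial MAX summaries arriving over any number of paths."""
    return max(partials)

# Hypothetical sensor readings (labels and values are illustrative only).
values = {"A": 2, "B": 6, "C": 9, "D": 3, "E": 1,
          "F": 12, "G": 7, "H": 4, "I": 10, "J": 6}

# Two overlapping paths both carry F's reading; the duplication is harmless.
path1 = merge_max(values["F"], values["I"], values["E"])
path2 = merge_max(values["F"], values["C"], values["H"])
sink = merge_max(path1, path2)
assert sink == max(values.values())  # still 12, despite the duplicate
```

Running the same overlapping-path merge for AVG (summing duplicated partial sums) would over-weight F's reading, which is exactly the failure mode the following slides examine.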

AVG Aggregation (no losses). Build a spanning tree and aggregate in-network: each node sends one summary packet, and the summary holds the SUM and COUNT of its sub-tree. Same reliability problem as before. [Figure: spanning tree with (SUM, COUNT) partial states; AVG = 70/10 = 7.]

AVG Aggregation (naive). What if redundant copies of data are sent? AVG is duplicate-sensitive: duplicating data changes the aggregate by increasing the weight of the duplicated data. [Figure: one duplicated partial state yields AVG = 82/11 ≠ 7.]

AVG Aggregation (TAG++). Can compensate for the increased weight [MFHH 02]: send halved SUM and COUNT to each of two parents instead. This does not change the expectation; it only reduces the variance. [Figure: halved (SUM, COUNT) states sent along two paths; AVG = 70/10 = 7.]
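
To see why halving the (SUM, COUNT) state keeps AVG unbiased under loss, here is a small Monte-Carlo sketch. This is my own illustration, not code from the talk; the readings and the loss rate are made up:

```python
import random

def simulate_halving(values, loss_rate, rng):
    """One round of the TAG++ idea: each node sends halved (SUM, COUNT)
    state to two parents; each copy survives independently."""
    tot_sum = tot_cnt = 0.0
    for v in values:
        for _parent in range(2):
            if rng.random() > loss_rate:
                tot_sum += v / 2.0
                tot_cnt += 0.5
    return tot_sum, tot_cnt

rng = random.Random(42)
values = [2, 6, 9, 7, 1, 3, 4, 6, 10, 12]  # hypothetical readings, mean 6.0
estimates = []
for _ in range(20_000):
    s, c = simulate_halving(values, loss_rate=0.3, rng=rng)
    if c > 0:
        estimates.append(s / c)
mean_est = sum(estimates) / len(estimates)
# mean_est stays close to the true mean 6.0 even though ~30% of the copies
# are lost: SUM and COUNT shrink together, so their ratio holds up.
```

The same experiment with full (unhalved) duplicates sent to both parents would inflate SUM relative to COUNT whenever both copies survive.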

AVG Aggregation (LIST). Can handle duplicates exactly with a list of <id, value> pairs, but transmitting this list is expensive. Lower bound: linear space is necessary if we demand exact results. [Figure: each node forwards an explicit list of <id, value> pairs; AVG = 70/10 = 7.]

Classification of Aggregates. TAG classifies aggregates according to: size of partial state; monotonicity; exemplary vs. summary; duplicate-sensitivity. MIN/MAX (cheap and easy): small state, monotone, exemplary, duplicate-insensitive. COUNT/SUM/AVG (considerably harder): small state and monotone, BUT duplicate-sensitive. Cheap if aggregating over a tree without losses; expensive with multiple paths and losses.

Design Objectives for Robust Aggregation. Admit in-network aggregation of partial values. Let the representation of aggregates be both order-insensitive and duplicate-insensitive. Be agnostic to the routing protocol: trust it to be best-effort. Routing and aggregation can be logically decoupled [NG 03], and some routing algorithms are better than others (multipath). Exact answers incur extremely high cost; we argue that it is reasonable to use aggregation methods that are themselves approximate.

Outline Introduction Sketch Theory and Practice COUNT sketches (old) SUM sketches (new) Practical realizations for sensor nets Experiments Conclusions

COUNT Sketches. Problem: estimate the number of distinct item IDs in a data set with only one pass. Constraints: small space relative to the stream size; small per-item processing overhead; a union operator on sketch results. Exact COUNT is impossible without linear space. The first approximate COUNT sketch appeared in [FM 85]: O(log N) space, O(1) processing time per item.

Counting Paintballs. Imagine the following scenario: a bag of n paintballs is emptied at the top of a long staircase. At each step, each paintball either bursts and marks the step, or bounces to the next step, with a 50/50 chance either way. Looking only at the pattern of marked steps, what was n?

Counting Paintballs (cont). What does the distribution of paintball bursts look like? The number of bursts at each step follows a binomial distribution: B(n, 1/2) at the 1st step, B(n, 1/4) at the 2nd, ..., B(n, 1/2^S) at the S-th. The expected number of bursts drops geometrically, with few bursts after log2(n) steps.
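
The geometric drop-off is easy to check empirically. A quick simulation (illustrative only; the parameters are arbitrary):

```python
import random

def paintball_pattern(n, steps=64, seed=0):
    """Drop n paintballs down a staircase: at each step a ball bursts
    with probability 1/2, else bounces on. Returns bursts per step."""
    rng = random.Random(seed)
    bursts = [0] * steps
    for _ in range(n):
        s = 0
        while s < steps - 1 and rng.random() < 0.5:  # bounce to next step
            s += 1
        bursts[s] += 1
    return bursts

bursts = paintball_pattern(10_000)
# Step s collects roughly n / 2**(s+1) bursts: ~5000, ~2500, ~1250, ...
# and essentially nothing lands beyond step log2(n), about 13 here.
```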

Counting Paintballs (cont). Many different estimator ideas [FM'85, AMS'96, GGR'03, DF'03, ...]. Example: let pos denote the position of the highest unmarked stair; then E(pos) ≈ log2(0.775351·n) and σ²(pos) ≈ 1.12127. Standard variance reduction methods apply, using either O(log n) or O(log log n) space.

Back to COUNT Sketches. The COUNT sketches of [FM'85] are equivalent to the paintball process. Start with a bit-vector of all zeros. For each item: use its ID and a hash function for the coin flips, pick a bit-vector entry, and set that bit to one. These sketches are duplicate-insensitive: for all A, B, Sketch(A) ∪ Sketch(B) = Sketch(A ∪ B). [Figure: bit-vectors for {x}, {y}, and {x, y}; the third is the OR of the first two.]
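
A compact Python rendering of such a sketch (a sketch of the idea under my own naming, not the authors' code: SHA-256 hash bits stand in for the coin flips, and the estimator constant is the one from the paintball slide):

```python
import hashlib

BITS = 32

def coin_flip_position(item_id, salt=0):
    """Hash the item ID; the position is the run of low-order 1-bits,
    so P(position = i) = 2**-(i+1), exactly the paintball process."""
    digest = hashlib.sha256(f"{salt}:{item_id}".encode()).digest()
    h = int.from_bytes(digest, "big")
    pos = 0
    while pos < BITS - 1 and (h >> pos) & 1:
        pos += 1
    return pos

def insert(sketch, item_id, salt=0):
    sketch[coin_flip_position(item_id, salt)] = 1

def union(a, b):
    """Duplicate-insensitive union: bitwise OR, entry by entry."""
    return [x | y for x, y in zip(a, b)]

def estimate(sketch):
    """FM-style estimate from the lowest unset position R:
    n ~ 2**R / 0.775351. A single sketch is crude; average k of them."""
    r = 0
    while r < len(sketch) and sketch[r]:
        r += 1
    return 2 ** r / 0.775351

a, b, c = [0] * BITS, [0] * BITS, [0] * BITS
for i in range(1000):
    insert(a, i)
for i in range(500, 1500):
    insert(b, i)             # overlaps a on 500..999
for i in range(1500):
    insert(c, i)             # the true union, inserted directly
assert union(a, b) == c      # Sketch(A) OR Sketch(B) == Sketch(A U B)
assert union(a, a) == a      # re-inserting duplicates changes nothing
```

Because the merge is idempotent, flooding these bit-vectors over arbitrary multipath topologies is safe, which is the property the rest of the talk builds on.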

Application to Sensornets. Each sensor computes k independent sketches of itself using its unique sensor ID. (Coming next: the sensor computes sketches of its value.) Use a robust routing algorithm to route sketches up to the sink, aggregating the k sketches via in-network bitwise OR; union via OR is duplicate-insensitive. The sink then estimates the count. Similar to gossip and epidemic protocols.

SUM Sketches. Problem: estimate the sum of the values of distinct <key, value> pairs in a data stream with repetitions (value ≥ 0, integral). Obvious start: emulate value insertions into a COUNT sketch and use the same estimators. For <k, v>, imagine inserting <k, v, 1>, <k, v, 2>, ..., <k, v, v>.

SUM Sketches (cont). But what if the value is 1,000,000? Main idea (details on the next slide): recall that all of the low-order bits will be set to 1 w.h.p. when inserting such a value. Just set those bits to one immediately, then set the high-order bits carefully.

Simulating a set of insertions. Set all the low-order bits in the safe region: the first S = log2(v) − 2·log2 log(v) bits are set to 1 w.h.p. Statistically estimate the number of trials going beyond the safe region: the probability of a trial doing so is simply 2^−S, so the number of such trials is ~ B(v, 2^−S), with mean O(log²(v)). For the trials and bits outside the safe region, set those bits manually; the running time is O(1) for each outlying trial. Expected running time: O(log v) + time to draw from B(v, 2^−S) + O(log²(v)).
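
The simulation can be sketched in a few lines of Python. This is my own toy rendering: it draws the B(v, 2^−S) count naively rather than with the table-driven sampler the talk develops next, and the safe-region formula is applied with floor rounding:

```python
import math
import random

def insert_value(sketch, v, rng):
    """Simulate v coin-flip insertions into a COUNT sketch without
    performing all v of them: the safe region's bits are set outright,
    and only the trials that escape it are simulated individually."""
    if v <= 0:
        return
    lg = math.log2(v)
    S = max(int(lg - 2 * math.log2(max(lg, 2))), 1)  # safe-region size
    for i in range(min(S, len(sketch))):
        sketch[i] = 1                                # set w.h.p. anyway
    # How many of the v trials get past the first S steps?  ~ B(v, 2^-S).
    # (Drawn naively here; the talk samples it in O(v * 2^-S) expected time.)
    escaped = sum(1 for _ in range(v) if rng.random() < 2.0 ** -S)
    for _ in range(escaped):
        pos = S
        while pos < len(sketch) - 1 and rng.random() < 0.5:
            pos += 1
        sketch[pos] = 1

rng = random.Random(7)
sk = [0] * 32
insert_value(sk, 1_000_000, rng)
R = next(i for i, bit in enumerate(sk) if bit == 0)
estimate = 2 ** R / 0.775351
# The estimate lands within a small constant factor of 1,000,000;
# averaging several independent sketches tightens it, as with COUNT.
```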

Sampling for Sensor Networks. We need to generate samples from B(n, p) with a slow CPU, very little RAM, and no floating-point hardware. General problem: sampling from a discrete pdf. Assume we can draw uniformly at random from [0, 1]. With an event space of size N, O(log N) lookups are immediate: represent the cdf in an array of size N, draw from [0, 1], and do binary search. There are cleverer methods taking O(log log N) or O(log* N) time. Amazingly, this can be done in constant time!

Walker's Alias Method. Theorem [Walker 77]: for any discrete pdf D over a sample space of size n, a table of size O(n) can be constructed in O(n) time that enables random variables to be drawn from D using at most two table lookups. [Figure: an example pdf over events A to E redistributed into n equal-weight columns, each holding at most two events.]
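
Walker's construction is short enough to give in full. A Python sketch (the example pdf below is made up, since the slide's own numbers are only partially legible in this transcription):

```python
import random

def build_alias_table(probs):
    """Walker '77: O(n) preprocessing; afterwards each draw costs one
    uniform variate plus at most two table lookups."""
    n = len(probs)
    scaled = [p * n for p in probs]            # rescale so the mean is 1
    accept, alias = [1.0] * n, list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        accept[s], alias[s] = scaled[s], l     # s keeps its mass; l tops up
        scaled[l] -= 1.0 - scaled[s]           # l donated that much mass
        (small if scaled[l] < 1.0 else large).append(l)
    return accept, alias                       # leftovers keep accept = 1.0

def alias_draw(accept, alias, rng):
    i = rng.randrange(len(accept))             # lookup 1: pick a column
    return i if rng.random() < accept[i] else alias[i]  # lookup 2: alias

probs = [0.10, 0.25, 0.05, 0.25, 0.35]         # hypothetical discrete pdf
accept, alias = build_alias_table(probs)
rng = random.Random(3)
counts = [0] * len(probs)
for _ in range(100_000):
    counts[alias_draw(accept, alias, rng)] += 1
# Empirical frequencies track the pdf, e.g. counts[4] / 100_000 is ~0.35.
```

Note that the draw needs no floating-point comparison beyond one threshold test per draw, which is what makes a fixed-point mote implementation plausible.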

Binomial Sampling for Sensors. Recall we want to sample from B(v, 2^−S) for various values of v and S. First, reduce to sampling from the geometric distribution G(1 − 2^−S). Truncate the distribution to make the range finite (with recursion to handle large values), and construct tables of size 2^S for each S of interest. We can then sample B(v, 2^−S) in O(v·2^−S) expected time.
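
For reference, here is the standard geometric-skip reduction the slide leans on, in floating-point Python. This is my own illustrative version: a mote would replace the log call with the table-driven geometric draw described above.

```python
import math
import random

def binomial_via_geometric(v, p, rng):
    """Draw from B(v, p) in O(v * p) expected time: jump from one
    success to the next with Geometric(p) gaps instead of flipping
    all v coins individually."""
    if p <= 0.0:
        return 0
    log_q = math.log(1.0 - p)
    successes, i = 0, 0
    while True:
        u = 1.0 - rng.random()              # uniform in (0, 1]
        gap = int(math.log(u) / log_q)      # failures before next success
        i += gap + 1
        if i > v:
            return successes
        successes += 1

rng = random.Random(11)
draws = [binomial_via_geometric(1_000_000, 2.0 ** -11, rng)
         for _ in range(200)]
mean = sum(draws) / len(draws)
# The mean of B(10^6, 2^-11) is ~488.3, and each draw touches only
# about 488 positions rather than a million.
```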

The Bottom Line. SUM inserts in: O(log²(v)) time with O(v / log²(v)) space; O(log(v)) time with O(v / log(v)) space; O(v) time with the naive method. Using the O(log²(v)) method, 16-bit values (S ≈ 8), and 64-bit probabilities, the resulting lookup tables are ~4.5KB. The recursive nature of G(1 − 2^−S) lets us tune the size further; we can achieve O(log v) time at the cost of bigger tables.

Outline Introduction Sketch Theory and Practice Experiments Conclusions

Experimental Results. Used the TAG simulator. Grid topology with the sink in the middle. Grid size [default: 30 by 30]. Transmission radius [default: 8 neighbors on the grid]. Node, packet, or link loss [default: 5% link loss rate]. Number of bit-vectors [default: 20 bit-vectors of 16 bits (compressed)].

Experimental Results. We consider four main methods. TAG1: transmit aggregates up a single spanning tree. TAG2: send a 1/k fraction of the aggregated values to each of k parents. SKETCH: broadcast an aggregated sketch to all neighbors at level i − 1. LIST: explicitly enumerate all <key, value> pairs and broadcast them to all neighbors at level i − 1. LIST vs. SKETCH measures the penalty associated with approximate values.

COUNT vs Link Loss (grid)

COUNT vs Link Loss (grid)

SUM vs Link Loss (grid)

Message Cost Comparison

Strategy   Total Data Bytes   Messages Sent   Messages Received
TAG1       1800               900             900
TAG2       1800               900             2468
SKETCH     10843              900             2468
LIST       170424             900             2468

Outline Introduction Sketch Theory and Practice Experiments Conclusions

Our Work in Context. In parallel with our efforts: Nath and Gibbons (Intel/CMU) ask what the essential properties of duplicate insensitivity are, and what other aggregates can be sketched. Bawa et al. (Stanford) ask what routing methods are necessary to guarantee the validity and semantics of aggregates.

Conclusions. Duplicate-insensitive sketches fundamentally change how aggregation works: routing becomes logically decoupled, and arbitrarily complex aggregation scenarios become allowable (cyclic topologies, multiple sinks, etc.). Extended the list of non-trivial aggregates: we added SUM, MEAN, VARIANCE, STDEV, ... The resulting system performs better: a moderate (tunable) cost for a large reliability boost.

Ongoing Work. What else can we sketch? There is a clear need to extend the expressiveness of sketches; also, what are the limits of duplicate-insensitive ones? Distributed streaming model: monitor and sketch streams of data, then collect the sketches and estimate global properties. Traffic monitoring: identifying large flows and flows with large changes. Both are already done with counting Bloom filters [KSGC 03, CM 04]; we can make those duplicate-insensitive! Aggregation via random sampling.

Future Directions (cont)

Thank you! More questions?

Multipath Routing Braided Paths: Two paths from the source to the sink that differ in at least two nodes

Design Objectives (cont). The final aggregate is exact if at least one representative from each leaf survives to reach the sink. This won't happen in practice in sensornets without extremely high cost, so it is reasonable to hope for approximate results. We argue that it is reasonable to use aggregation methods that are themselves approximate.

Goal of This Work. So far, we've seen the ideas of in-network aggregation (low traffic per link) and multi-path routing (reliability of individual items). These usually don't combine well: the combination only works for duplicate-insensitive aggregates such as MIN/MAX and AND/OR. What about all the other aggregates? We want them cheap, reliable, and correct.

Contributions of This Work. Propose duplicate-insensitive sketches to approximately aggregate data. The difficulty was noted in [MFHH 02]; approximation is necessary. With duplicate-insensitive sketches, any best-effort routing method can be employed. Design new duplicate-insensitive sketches: SUM => MEAN, VARIANCE, STDEV, ...

Routing Methodologies. Considerable work on reliable delivery via multipath routing: directed diffusion [IGE 00], braided diffusion [GGSE 01], GRAdient Broadcast [YZLZ 02]. GRAB broadcasts intermediate results along the gradient back to the source and can dynamically control the width of the broadcast, trading off fault tolerance against transmission cost. Our approach is similar to GRAB: broadcast; grab if upstream, ignore if downstream. The common goal: try to get at least one copy to the sink.

SUM Sketches (cont). Remaining questions: What should S be when inserting <k, v>? Using the analysis of [FM 85], S ≈ log2(v) − 2·log2 log(v) gives expected time O(log²(v)) + sample time. We can go farther while keeping high probability: S ≈ log2(v) − log2 log(v) gives expected time O(log(v)) + sample time. How do we sample the binomial distribution? Space requirements may affect the choice of S.

SUM Sketches (cont). Reduction to COUNT sketches: pick a prefix length S such that the first S bits would be set with high probability, and set those bits to one. Sample from B(v, 2^−S) to figure out how many items would pick bits after the first S bits, and simulate the insertion of those items. Expected time = O(S) + sample time + O(v·2^−S).

Sampling Constraints. Sensor motes have very limited resources: a slow CPU, very little RAM, and no floating-point hardware. Sampling from B(n, p) isn't easy normally: obviously O(log n) time and O(n) space, or O(np) expected time (given good FP hardware) with the standard reduction to the geometric distribution. How hard is this sampling problem anyway?

COUNT vs Diameter (grid)

COUNT vs Link Loss (random)