Rectangle-Efficient Aggregation in Spatial Data Streams

Rectangle-Efficient Aggregation in Spatial Data Streams
Srikanta Tirthapura (Iowa State), David Woodruff (IBM Almaden)

The Data Stream Model
- Stream $S$ of additive updates $(i, \Delta)$ to an underlying vector $v$: $v_i \leftarrow v_i + \Delta$
- Dimension of $v$ is $n$
- $v$ is initialized to $0^n$
- Number of updates is $m$
- $\Delta$ is an integer in $\{-M, -M+1, \ldots, M\}$
- Assume $M, m \le \mathrm{poly}(n)$

Applications
- Coordinates of $v$ are associated with items
- $v_i$ is the number of times (frequency) item $i$ occurs in $S$
- Number of non-zero entries of $v$ = number of distinct items in $S$, denoted $\|v\|_0$
- $\|v\|_1 = \sum_i |v_i|$ is the sum of frequencies in $S$
- $\|v\|_2^2 = \sum_i v_i^2$ is the self-join size
- $\|v\|_p = (\sum_i |v_i|^p)^{1/p}$ is the $p$-norm of $v$
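
As a point of reference, the statistics above can be computed exactly when the whole vector fits in memory. The sketch below is only an illustrative baseline (the function names `apply_stream` and `norm_stats` are made up here); a streaming algorithm would approximate the same quantities without storing $v$.

```python
from collections import defaultdict

def apply_stream(updates):
    """Apply additive updates (i, delta) to a vector v that starts at all zeros."""
    v = defaultdict(int)
    for i, delta in updates:
        v[i] += delta
    return v

def norm_stats(v):
    """Return (||v||_0, ||v||_1, ||v||_2^2) for a sparse vector v."""
    vals = [x for x in v.values() if x != 0]
    return len(vals), sum(abs(x) for x in vals), sum(x * x for x in vals)

# Item 7 is inserted and then deleted, so it does not count as a distinct item.
stream = [(3, 1), (7, 1), (3, 1), (5, 2), (7, -1)]
print(norm_stats(apply_stream(stream)))  # (2, 4, 8)
```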

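Lots of Known Results
- $(\varepsilon, \delta)$-approximation: output an estimator $E$ to $\|v\|_p$ with $\Pr[\|v\|_p \le E \le (1+\varepsilon)\|v\|_p] \ge 1 - \delta$
- Let $\tilde{O}(1)$ denote $\mathrm{poly}(1/\varepsilon, \log n/\delta)$
- Optimal estimators for $\|v\|_p$:
  - $0 \le p \le 2$: use $\tilde{O}(1)$ memory
  - $p > 2$: use $\tilde{O}(n^{1-2/p})$ memory
- Both results have $\tilde{O}(1)$ update time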

Range-Updates
- Sometimes it is more efficient to represent additive updates in the form $([i, j], \Delta)$: $\forall k \in \{i, i+1, i+2, \ldots, j\}$: $v_k \leftarrow v_k + \Delta$
- Useful for representing updates to time intervals
- Many reductions:
  - Triangle counting
  - Distinct Summation
  - Max Dominance Norm

A Trivial Solution?
- Given update $([i, j], \Delta)$, treat it as $j-i+1$ updates $(i, \Delta), (i+1, \Delta), (i+2, \Delta), \ldots, (j, \Delta)$
- Run a memory-optimal algorithm on the resulting $j-i+1$ updates
- Main problem: update time could be as large as $n$
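
The trivial reduction can be written in a few lines. This is a minimal sketch with a hypothetical helper name, `expand_range_update`; it shows that the work per range update grows with the length of the range.

```python
def expand_range_update(i, j, delta):
    """Yield the j - i + 1 point updates equivalent to the range update ([i, j], delta)."""
    for k in range(i, j + 1):
        yield (k, delta)

points = list(expand_range_update(2, 5, 3))
print(points)       # [(2, 3), (3, 3), (4, 3), (5, 3)]
print(len(points))  # 4 point updates for one range update
```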

Scattered Known Results
- An algorithm is range-efficient if its update time is $\tilde{O}(1)$ and its memory is optimal up to $\tilde{O}(1)$
- Range-efficient algorithms exist for positive-only updates to $\|v\|_0$, $\|v\|_2$
- For $\|v\|_p$, $p > 2$, can get $\tilde{O}(n^{1-1/p})$ time and memory
- Many questions open (we will resolve some)

Talk Outline
- General Framework: Rectangle Updates
- Problems
  - Frequent points ("heavy hitters")
  - Norm estimation
- Results
- Techniques
- Conclusion

From Ranges to Rectangles
- Vector $v$ has coordinates indexed by pairs $(i,j) \in \{1, 2, \ldots, n\} \times \{1, 2, \ldots, n\}$
- Rectangular updates $([i_1, j_1] \times [i_2, j_2], \Delta)$: $\forall (i,j) \in [i_1, j_1] \times [i_2, j_2]$: $v_{i,j} \leftarrow v_{i,j} + \Delta$

Spatial Datasets
- A natural extension of 1-dimensional ranges is to axis-aligned rectangles
- Can approximate polygons by rectangles
- Spatial databases such as OpenGIS
- No previous work on streams of rectangles
- What statistics do we care about?

Klee's Measure Problem
[Figure: three rectangles drawn on a grid, both axes labeled 0 to 4]
- What is the volume of the union of three rectangles? $[0, 2] \times [0, 4]$, $[1, 3] \times [1, 3]$, $[2, 4] \times [0, 2]$
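
For a small offline instance like this one, the question can be answered by coordinate compression. The sketch below assumes the continuous reading of the rectangles (area in the plane); in a discrete grid model one would count covered grid points instead, which may give a different number. The function name `union_area` is illustrative only.

```python
def union_area(rects):
    """Area of the union of axis-aligned rectangles given as (x1, x2, y1, y2)."""
    xs = sorted({x for x1, x2, _, _ in rects for x in (x1, x2)})
    ys = sorted({y for _, _, y1, y2 in rects for y in (y1, y2)})
    area = 0
    # Every cell of the compressed grid is either fully covered or not covered at all.
    for xa, xb in zip(xs, xs[1:]):
        for ya, yb in zip(ys, ys[1:]):
            if any(x1 <= xa and xb <= x2 and y1 <= ya and yb <= y2
                   for x1, x2, y1, y2 in rects):
                area += (xb - xa) * (yb - ya)
    return area

rects = [(0, 2, 0, 4), (1, 3, 1, 3), (2, 4, 0, 2)]
print(union_area(rects))  # 13
```

This offline computation stores every rectangle, which is exactly what a streaming algorithm cannot afford; it is meant only as a check on the example.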

Other Important Statistics
- $\max_{i,j} v_{i,j}$ is the depth
- Can approximate the depth by approximating $\|v\|_p$ for large $p > 0$ (sometimes easier than computing the depth)
- Also of interest: find heavy hitters, namely those $(i,j)$ with $v_{i,j} \ge \varepsilon \|v\|_1$ or $v_{i,j}^2 \ge \varepsilon \|v\|_2^2$
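
One way to see why a large $p$ approximates the depth is the standard comparison between the maximum and the $p$-norm. This derivation is not on the slides; it assumes non-negative entries and writes $N = n^2$ for the number of coordinates.

```latex
\max_{i,j} v_{i,j}
  \;\le\; \|v\|_p
  \;=\; \Bigl(\sum_{i,j} v_{i,j}^{\,p}\Bigr)^{1/p}
  \;\le\; N^{1/p} \max_{i,j} v_{i,j},
\qquad
N^{1/p} \le 2^{\varepsilon} \le 1 + \varepsilon
\quad\text{when } p \ge \frac{\log_2 N}{\varepsilon},\; 0 < \varepsilon \le 1 .
```

So $\|v\|_p$ with $p$ on the order of $\log(n)/\varepsilon$ is within a $(1+\varepsilon)$ factor of the depth.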

Notation
- An algorithm is rectangle-efficient if its update time is $\tilde{O}(1)$ and its memory is optimal up to $\tilde{O}(1)$
- For $(\varepsilon, \delta)$-approximating $\|v\|_p$ we want
  - for $0 \le p \le 2$: $\tilde{O}(1)$ time and memory
  - for $p > 2$: $\tilde{O}(1)$ time and $\tilde{O}((n^2)^{1-2/p})$ memory

Our Results
- For $\|v\|_p$ for $0 \le p \le 2$, we obtain the first rectangle-efficient algorithms
  - First solution for Klee's Measure Problem on streams
- For finding heavy hitters, we obtain the first rectangle-efficient algorithms
- For $p > 2$ we achieve:
  - $\tilde{O}((n^2)^{1-2/p})$ memory and time, OR
  - $\tilde{O}((n^2)^{1-1/p})$ memory and $\tilde{O}(1)$ time

Our Results
- For any number $d$ of dimensions:
  - For $\|v\|_p$ for $0 \le p \le 2$ and heavy hitters: $\tilde{O}(1)$ memory and $\tilde{O}(d)$ time
  - For $\|v\|_p$ for $p > 2$: $\tilde{O}((n^d)^{1-2/p})$ memory and time, OR $\tilde{O}((n^d)^{1-1/p})$ memory and $\tilde{O}(d)$ time
- Only a mild dependence on $d$
- Improves previous results even for $d = 1$

Our Techniques
- Main idea:
  - Leverage a technique in streaming algorithms for estimating $\|v\|_p$ for any $p \ge 0$
  - Replace the random hash functions in the technique with hash functions with very special properties

Indyk/W Methodology
- To estimate $\|v\|_p$ of a vector $v$ of dimension $n$:
  - Choose $O(\log n)$ random subsets of coordinates of $v$, denoted $S_0, S_1, \ldots, S_{\log n}$
  - $S_i$ is of size $n/2^i$
  - $\forall S_i$: find those coordinates $j \in S_i$ for which $v_j^2 \ge \gamma \|v_{S_i}\|_2^2$
- Use CountSketch:
  - Assign each coordinate $j$ a random sign $\sigma(j) \in \{-1, 1\}$
  - Randomly hash the coordinates into $1/\gamma$ buckets; maintain $\sum_{j : h(j) = k,\, j \in S_i} \sigma(j) v_j$ in the $k$-th bucket
  - For example, bucket 2 holds $\sum_{j : h(j) = 2,\, j \in S_i} \sigma(j) v_j$
- Our observation: can choose the $S_i$ and the hash functions in CountSketch to be pairwise-independent
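
The bucket rule above can be illustrated with a single row of a CountSketch-style table. This is a rough sketch under stated assumptions: the class name `CountSketchRow` is hypothetical, the subsampling into the sets $S_i$ is omitted, and the hashes use the common $(a j + b) \bmod p$ construction as a stand-in, which is not the hash family the talk goes on to describe.

```python
import random

class CountSketchRow:
    """One row of buckets: buckets[k] = sum of sign(j) * v_j over all j with h(j) = k."""

    def __init__(self, num_buckets, seed=0):
        rng = random.Random(seed)
        self.p = 2_147_483_647  # prime 2^31 - 1, assumed larger than any coordinate index
        self.num_buckets = num_buckets
        # Stand-in hash parameters (a, b) for the bucket hash and the sign hash.
        self.h = (rng.randrange(1, self.p), rng.randrange(self.p))
        self.s = (rng.randrange(1, self.p), rng.randrange(self.p))
        self.buckets = [0] * num_buckets

    def bucket(self, j):
        a, b = self.h
        return ((a * j + b) % self.p) % self.num_buckets

    def sign(self, j):
        a, b = self.s
        return 1 if ((a * j + b) % self.p) & 1 else -1

    def update(self, j, delta):
        """Process the point update (j, delta)."""
        self.buckets[self.bucket(j)] += self.sign(j) * delta

    def estimate(self, j):
        """Signed bucket content; equals v_j exactly when nothing else collides with j."""
        return self.sign(j) * self.buckets[self.bucket(j)]

row = CountSketchRow(num_buckets=8, seed=1)
row.update(42, 5)
row.update(42, -2)
print(row.estimate(42))  # 3, since 42 is the only item in the row
```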

Special Hash Functions
- Let $A$ be a random $k \times r$ binary matrix, and $b$ a random binary vector of length $k$
- Let $x \in GF(2^r)$
- $Ax + b$ is a pairwise-independent function: for $x \ne x' \in GF(2^r)$ and $y, y' \in GF(2^k)$: $\Pr[Ax + b = y \text{ and } Ax' + b = y'] = 1/2^{2k}$
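
The pairwise-independence claim can be checked by brute force for tiny parameters. The sketch below packs each row of $A$ into an integer bitmask (a representation chosen here for convenience, not taken from the slides) and enumerates every choice of $(A, b)$; the helper names are illustrative.

```python
from itertools import product

def parity(x):
    """Sum of the bits of x modulo 2."""
    return bin(x).count("1") & 1

def affine_hash(rows, b, x):
    """Evaluate h(x) = Ax + b over GF(2).

    rows[i] is row i of A packed into an r-bit integer, b is a k-bit integer,
    and bit i of the result is <rows[i], x> + b_i (mod 2).
    """
    y = 0
    for i, row in enumerate(rows):
        y |= parity(row & x) << i
    return y ^ b

def joint_counts(k, r, x1, x2):
    """Enumerate every (A, b) and tally how often each pair (h(x1), h(x2)) occurs."""
    counts = {}
    for rows in product(range(1 << r), repeat=k):
        for b in range(1 << k):
            key = (affine_hash(rows, b, x1), affine_hash(rows, b, x2))
            counts[key] = counts.get(key, 0) + 1
    return counts

# k = 2, r = 3: there are 8**2 * 4 = 256 choices of (A, b) and 16 output pairs,
# so pairwise independence predicts each pair occurs 256 / 16 = 16 times.
counts = joint_counts(2, 3, 0b001, 0b110)
print(len(counts), set(counts.values()))  # 16 {16}
```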

Special Hash Functions Cont'd
- Given a stream update $([i_1, j_1] \times [i_2, j_2], \Delta)$: can decompose $[i_1, j_1]$ and $[i_2, j_2]$ into $O(\log n)$ disjoint dyadic intervals:
  - $[i_1, j_1] = [a_1, b_1] \cup \cdots \cup [a_s, b_s]$
  - $[i_2, j_2] = [c_1, d_1] \cup \cdots \cup [c_{s'}, d_{s'}]$
- Dyadic interval: $[u 2^q, (u+1) 2^q)$ for integers $u, q$
- Then $[i_1, j_1] \times [i_2, j_2] = \bigcup_{r, r'} [a_r, b_r] \times [c_{r'}, d_{r'}]$
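
A minimal sketch of the decomposition, assuming 0-based integer coordinates so that dyadic blocks are aligned at multiples of $2^q$ (the slides index from 1). Each piece is returned as a pair `(start, q)` standing for $[\text{start}, \text{start} + 2^q)$; the function names are illustrative.

```python
def dyadic_decompose(lo, hi):
    """Split the integer interval [lo, hi] (inclusive, lo >= 0) into disjoint
    dyadic intervals, returned as (start, q) meaning [start, start + 2**q)."""
    pieces = []
    hi += 1  # work with the half-open interval [lo, hi)
    while lo < hi:
        # Largest block size allowed by the alignment of lo (unbounded when lo == 0) ...
        q = (lo & -lo).bit_length() - 1 if lo else hi.bit_length()
        # ... shrunk until the block fits inside [lo, hi).
        while lo + (1 << q) > hi:
            q -= 1
        pieces.append((lo, q))
        lo += 1 << q
    return pieces

def dyadic_rectangles(i1, j1, i2, j2):
    """All products of one dyadic piece of [i1, j1] with one dyadic piece of [i2, j2]."""
    return [(a, c) for a in dyadic_decompose(i1, j1) for c in dyadic_decompose(i2, j2)]

print(dyadic_decompose(3, 10))  # [(3, 0), (4, 2), (8, 1), (10, 0)]
```

Here $[3, 10]$ splits into $[3,4)$, $[4,8)$, $[8,10)$ and $[10,11)$, and a rectangle splits into the product of the two one-dimensional decompositions.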

Special Hash Functions Cont'd
- A property of the function $Ax + b$: can quickly compute the number of $x$ in the rectangle $[a_r, b_r] \times [c_{r'}, d_{r'}]$ for which $Ax + b = e$
- By the structure of dyadic intervals, this corresponds to fixing a subset of the bits of $x$ and letting the remaining variables be free: the number of $x \in [a_r, b_r] \times [c_{r'}, d_{r'}]$ for which $Ax + b = e$ is the number of $z$ for which $A'z = e'$, where $z$ is unconstrained
- Can now use Gaussian elimination to count the number of $z$
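
The counting step can be sketched as follows, under the assumption that $x$ is packed into an integer and the rows of $A$ are bitmasks. `fixed_value` holds the bits pinned by the dyadic rectangle and `free_mask` marks the free bit positions; both names, like the function names, are illustrative. A consistent system with rank $\rho$ over $f$ free bits has $2^{f-\rho}$ solutions.

```python
def parity(x):
    """Sum of the bits of x modulo 2."""
    return bin(x).count("1") & 1

def count_solutions(masks, rhs, nvars):
    """Number of z in GF(2)^nvars with <masks[t], z> = rhs[t] for every t."""
    basis = {}  # pivot bit position -> (row mask, right-hand side bit)
    for m, t in zip(masks, rhs):
        while m:
            top = m.bit_length() - 1
            if top not in basis:
                basis[top] = (m, t)
                break
            m, t = m ^ basis[top][0], t ^ basis[top][1]
        else:
            if t:  # the row reduced to 0 = 1, so the system is inconsistent
                return 0
    return 1 << (nvars - len(basis))

def count_hashed(rows, b, fixed_value, free_mask, e):
    """Count x that agree with fixed_value outside free_mask and satisfy Ax + b = e."""
    masks = [row & free_mask for row in rows]                # A': columns of the free bits
    rhs = [e_t ^ b_t ^ parity(row & fixed_value)             # e': fixed part moved to the right
           for row, b_t, e_t in zip(rows, b, e)]
    return count_solutions(masks, rhs, bin(free_mask).count("1"))

rows = [0b101101, 0b011010, 0b110011]  # a 3 x 6 matrix A, one bitmask per row
b = [1, 0, 1]
# Three free bits; the other three bits of x are pinned to the pattern 0b100110.
total = sum(count_hashed(rows, b, 0b100110, 0b011001, (e0, e1, e2))
            for e0 in (0, 1) for e1 in (0, 1) for e2 in (0, 1))
print(total)  # 8: every one of the 2**3 points lands on exactly one hash value
```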

Techniques WrapUp
- Step 1: Modify the Indyk/W analysis to use only pairwise independence
- Step 2: Given update $([i_1, j_1] \times [i_2, j_2], \Delta)$, decompose it into disjoint products of dyadic intervals $\bigcup_{r, r'} [a_r, b_r] \times [c_{r'}, d_{r'}]$
- Step 3: For each $[a_r, b_r] \times [c_{r'}, d_{r'}]$, find the number of items in each $S_i$, in each bucket of CountSketch, and with each sign, by solving a system of linear equations
- Step 4: Update all buckets
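
Steps 2 to 4 can be put together in a toy form. The sketch below rests on several assumptions that are not from the slides: coordinates are 0-based and fit in `bits` bits each, a point $(i, j)$ is packed as `(i << bits) | j`, a single hash $Ax + b$ supplies both the sign (output bit 0) and the bucket index (remaining output bits), and the subsets $S_i$ are left out. It illustrates the idea and is not the authors' implementation.

```python
def parity(x):
    return bin(x).count("1") & 1

def dyadic_decompose(lo, hi):
    """Split [lo, hi] (inclusive) into pieces (start, q) = [start, start + 2**q)."""
    pieces, hi = [], hi + 1
    while lo < hi:
        q = (lo & -lo).bit_length() - 1 if lo else hi.bit_length()
        while lo + (1 << q) > hi:
            q -= 1
        pieces.append((lo, q))
        lo += 1 << q
    return pieces

def count_solutions(masks, rhs, nvars):
    """Number of z in GF(2)^nvars with <masks[t], z> = rhs[t] for every t."""
    basis = {}
    for m, t in zip(masks, rhs):
        while m:
            top = m.bit_length() - 1
            if top not in basis:
                basis[top] = (m, t)
                break
            m, t = m ^ basis[top][0], t ^ basis[top][1]
        else:
            if t:
                return 0
    return 1 << (nvars - len(basis))

def rect_update(buckets, rows, b, bits, rect, delta):
    """Apply ([i1, j1] x [i2, j2], delta) to the buckets without visiting each point."""
    i1, j1, i2, j2 = rect
    k = len(rows)
    for a, qa in dyadic_decompose(i1, j1):          # Step 2: dyadic products
        for c, qc in dyadic_decompose(i2, j2):
            fixed = (a << bits) | c
            free = (((1 << qa) - 1) << bits) | ((1 << qc) - 1)
            masks = [row & free for row in rows]
            base = [parity(row & fixed) ^ ((b >> t) & 1) for t, row in enumerate(rows)]
            for e in range(1 << k):                  # Step 3: count points per hash value
                rhs = [((e >> t) & 1) ^ base[t] for t in range(k)]
                cnt = count_solutions(masks, rhs, qa + qc)
                sign = 1 if e & 1 else -1
                buckets[e >> 1] += sign * cnt * delta  # Step 4: update the bucket

rows, b, bits = [0b101101, 0b011010, 0b110011], 0b101, 3
buckets = [0] * 4  # 2**(k-1) buckets for k = 3 output bits
rect_update(buckets, rows, b, bits, (1, 6, 2, 5), 3)
print(buckets)
```

The work per update depends on the number of dyadic pieces and hash values, not on the area of the rectangle.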

Conclusion
- Contributions:
  - Initiated the study of rectangle-efficiency
  - Gave rectangle-efficient algorithms for estimating $\|v\|_p$, $0 \le p \le 2$, and heavy hitters
  - Tradeoffs for estimating $\|v\|_p$ for $p > 2$
  - Improve previous work even for $d = 1$
- Open Questions:
  - Get $\tilde{O}((n^d)^{1-2/p})$ memory and $\tilde{O}(1)$ time for $p > 2$
  - Other applications of the model and technique