arxiv: v2 [cs.db] 18 Sep 2017


Efficient Approximate Query Answering over Sensor Data with Deterministic Error Guarantees

Jaqueline Brito, UC San Diego
Yannis Katsis, UC San Diego
Korhan Demirkaya, UC San Diego, kdemirka@cs.ucsd.edu
Chunbin Lin, UC San Diego, chunbinlin@cs.ucsd.edu
Boursier Etienne, UC San Diego, eboursier@cs.ucsd.edu
Yannis Papakonstantinou, UC San Diego, yannis@cs.ucsd.edu

Supported by NSF BIGDATA

ABSTRACT

With the recent proliferation of sensor data, there is an increasing need for the efficient evaluation of analytical queries over multiple sensor datasets. The magnitude of such datasets makes exact query answering infeasible, leading researchers into the development of approximate query answering approaches. However, existing approximate query answering algorithms are not suited for the efficient processing of queries over sensor data, as they exhibit at least one of the following shortcomings: (a) they do not provide deterministic error guarantees, resorting to weaker probabilistic error guarantees that are in many cases not acceptable, (b) they allow queries only over a single dataset, thus not supporting the multitude of queries over multiple datasets that appear in practice, such as correlation or cross-correlation, and (c) they support relational data in general and thus miss speedup opportunities created by the special nature of sensor data, which are not random but typically follow a smooth underlying phenomenon. To address these problems, we propose PlatoDB, a system that exploits the nature of sensor data to compress them and provide efficient processing of queries over multiple sensor datasets, while providing deterministic error guarantees. PlatoDB achieves the above through a novel architecture that (a) at data import time pre-processes each dataset, creating for it an intermediate hierarchical data structure that provides a hierarchy of summarizations of the dataset together with appropriate error measures and (b) at query processing time leverages the pre-computed data structures to compute an approximate answer and deterministic error guarantees for ad hoc queries even when these combine multiple datasets. As a result of its novel architecture, PlatoDB exhibits speedups of 1-3 orders of magnitude compared to systems that use the entire sensor datasets to compute exact query answers during experiments performed on real sensor datasets.

1. INTRODUCTION

The increasing affordability of sensors and storage has recently led to the proliferation of sensor data in a variety of domains, including transportation, environmental protection, healthcare, fitness, etc. These data are typically of high granularity and as a result have substantial storage requirements, ranging from a few GB to many TB. For instance, a Formula 1 car produces 20GB of data during two 90-minute practice sessions, while a commercial aircraft may generate 2.5TB of data per day.

The magnitude of sensor datasets creates a significant challenge when it comes to query evaluation. Running analytical queries over the data (such as finding correlations between signals), which typically involve aggregates, can be very expensive, as the queries have to access significant amounts of data. This problem becomes worse when queries combine multiple sensor datasets in ad hoc ways. For instance, consider a data analytics scenario, where a user wants to combine (a) a location dataset providing the location of users for different points in time (as recorded by their smartphone's GPS) and (b) an air pollution dataset recording the air quality at different points in time and space (as recorded by air quality sensors) to compute the average quality of air inhaled by each user over a certain time period (see footnote 3). Answering this query requires accessing all location and air pollution measurements in the time period of interest, which can be substantial for long periods.
To solve this problem, researchers have proposed approximate query processing algorithms [17, 1, 37, 2, 26, 26, 31, 24] that approximate the query result by looking at a subset of the data. However, existing approaches have the following shortcomings when it comes to the query processing of multiple sensor datasets:

• Lack of deterministic error guarantees. Most query approximation algorithms provide probabilistic error guarantees. While this is sufficient for some use cases, it does not cover scenarios where the user needs deterministic guarantees ensuring that the returned answer is within the specified error bounds.

• Lack of support for queries over multiple datasets. Many techniques, such as wavelets, provide error guarantees only for queries over a single dataset. The errors can be arbitrarily large for queries ranging over multiple datasets, as these techniques are unaware of how multiple datasets interact with each other.

• Data agnosticism. The majority of existing techniques work for relational data in general and do not leverage compression opportunities that come from the fact that sensor data are not random in nature but typically follow smooth continuous phenomena.

To overcome these limitations, we design the PlatoDB system, which leverages the nature of sensor data to compress them and provide efficient processing of analytical queries over multiple sensor datasets, while providing deterministic error guarantees. In a nutshell, PlatoDB operates as follows: When initiated, it preprocesses each time series dataset and builds for it a binary tree structure, which provides a hierarchy of summarizations of segments of the original time series. A node in the tree structure summarizes a segment of the time series through two components: (i) a compression function estimating the data points in the segment, and (ii) error measures indicating the distance between the compressed segment and the original one. Lower-level nodes refer to finer-grained segments and smaller errors. During runtime, PlatoDB takes as input an aggregate query over potentially multiple sensor datasets together with an error or time budget and utilizes the tree structure for each of the datasets involved in the query to obtain an approximate answer together with a deterministic error guarantee that satisfies the time/error budget.

Footnote 3: This is a real example encountered during the DELPHI project conducted at UC San Diego, which studied how health-related data about individuals, including large amounts of sensor data, can be leveraged to discover the determinants of health conditions [18].

Contributions. In this work, we make the following contributions:

• We define a query language over sensor data, which is powerful enough to express most common statistics over both single and multiple time series, such as variance, correlation, and cross-correlation (Section 3).

• We propose a novel tree structure (structurally similar to hierarchical histograms) and a corresponding tree generation algorithm that provides a hierarchical summarization of each time series independently of the other time series. The summarization is based on the combination of arbitrary compression functions that can be reused from the literature together with three novel error measures that can be used to provide deterministic error guarantees, regardless of the employed compression function (Section 4).

• We design an efficient query processing algorithm operating on the pre-computed tree structures, which can provide deterministic error guarantees for queries ranging over multiple time series, even though each tree refers to one time series in isolation. The algorithm is based on a combination of error estimation formulas that leverage the error measures of individual time series segments to compute an error for an entire query (Section 5) together with a tree navigation algorithm that efficiently traverses the time series tree to quickly compute an approximate answer that satisfies the error guarantees (Section 6).

• We conduct experiments on two real-life datasets to evaluate our algorithms. The results show that our algorithm outperforms the baseline by 1-3 orders of magnitude (Section 7).

2. SYSTEM ARCHITECTURE

Figure 1 depicts PlatoDB's architecture. PlatoDB operates in two steps, performed at two different points in time. At data import time, PlatoDB pre-processes the incoming time series data, creating a segment tree structure for each time series. At query execution time, it leverages these segment trees to provide an approximate query answer together with deterministic error guarantees. We next describe these two steps in detail.

Off-line Pre-Processing. At data import time, PlatoDB takes as input a set of time series. The time series are created from the raw sensor data by the typical Extract-Transform-Load (ETL) scripts potentially combined with de-noising algorithms, which is outside the focus of this paper. For each such time series, PlatoDB's Segment Tree Generator creates a hierarchy of summarizations of the data in the form of a segment tree, i.e., a tree whose nodes summarize the data for segments of the original time series. Intuitively, the structure of the segment tree corresponds to a way of splitting the time series recursively into smaller segments: The root $S_1$ of the tree corresponds to the entire time series, which can be split into two subsegments (generally of different length), represented by the root's children $S_{1.1}$ and $S_{1.2}$. The segment corresponding to $S_{1.1}$ can in turn be split further into two smaller segments, represented by the children $S_{1.1.1}$ and $S_{1.1.2}$ of $S_{1.1}$, and so on.
Since each node provides a brief summarization of the corresponding segment, lower levels of the tree provide a more precise representation of the time series than upper levels. As we will see later, this hierarchical structure of segments is crucial for the query processor's ability to adapt to a wide variety of error/time budgets provided by the user. When the user is willing to accept a large error, the query processor will mostly use the top levels of the trees, providing a quick response. On the other hand, if the user demands a lower error, the algorithm will be able to satisfy the request by visiting lower levels of the segment trees (which exact nodes will be visited also depends on the query and the interplay of the time series in it). Leveraging the trees, PlatoDB can even provide users with continuously improving approximate answers and error guarantees, allowing them to stop the computation at any time, similar to works in online aggregation [15, 7, 26].

Each node of the tree summarizes the corresponding segment through two data items: (a) a compression function, which represents the data points in a segment in a compact way (e.g., through a constant [21] or a line [19]), and (b) a set of error measures, which are metrics of the distance between the data point values estimated by the compression function and the actual values of the data points. As we will see, the query processor uses the compression function and error measures of the segment tree nodes to produce an approximate answer of the query and the error guarantees, respectively.

Interestingly, PlatoDB's internals are agnostic of the compression function used. As we will discuss in Section 4, PlatoDB's query processor works independently of the employed compression functions, allowing the system to be combined with all popular compression techniques. For instance, in our example above we utilized the Piecewise Aggregate Approximation (PAA) [21], which returns the average of a set of values.
However, we could have used other compression techniques, such as the Adaptive Piecewise Constant Approximation (APCA) [20], the Piecewise Linear Representation (PLR) [19], or others.

Remark. It is important to note that the segment tree is not necessarily a balanced tree. PlatoDB decides whether a segment needs to be split based on how close the values derived from the compression function are to the actual values of the segment. PlatoDB splits the segment when the difference is large. Intuitively, this means that the segment tree contains more nodes for parts of the domain where the time series is irregular and/or rapidly changing, and fewer nodes for the smooth parts. PlatoDB treats the problem of finding the splitting positions as an optimization problem, splitting at positions that bring the largest error reduction. We will present the segment tree generator algorithms in Section 4.

[Figure 1: PlatoDB's architecture, including details on the segment tree generation and query processing. Panel (a): generating the segment tree of a time series $T$. Panel (b): evaluating a query involving $T_1$ and $T_2$.]

EXAMPLE 1. Figure 1(a) shows the segment tree for a time series $T$. The root node $S_1$ of the tree (corresponding to the segment covering the entire time series) summarizes this segment through two items: a set of parameters describing a compression function $f_1$ (in this case the function returns the average $v$ of the values of the time series and can therefore be described by the single value $v$) and a set of error measures $M_1$ (the details of error measures will be presented in Section 4). This entire segment is split into two subsegments $S_{1.1}$ and $S_{1.2}$, giving rise to the identically-named tree nodes. Note that the tree is not balanced. Segment $S_{1.2}$ is not split further as its function $f_{1.2}$ correctly predicts the values within the corresponding segment. In contrast, the segment $S_{1.1}$ displays great variability in the time series values and is thus split further into segments $S_{1.1.1}$ and $S_{1.1.2}$.

On-line Query Processing. At query evaluation time, PlatoDB's Query Processor receives a query and a time or error budget and leverages the pre-processed segment trees to produce an approximate query answer and a corresponding error guarantee satisfying the provided budget. To compute the answer and error guarantee, PlatoDB traverses the segment trees of all time series involved in the query in parallel, in a top-down fashion. At any step of this process, it uses the compression function and error measures in the currently accessed nodes to calculate an approximate query answer and the corresponding error. If it has not yet reached the time/error budget (i.e., if there is still time left or if the current error is still greater than the error budget), PlatoDB greedily chooses among all the currently accessed nodes the one whose children nodes would yield the greatest error reduction and uses them to replace their parent in the answer and error estimation. Otherwise, PlatoDB stops accessing further nodes of the segment trees and outputs the currently computed approximate answer and error.
Query processing is described in detail in Sections 5 and 6.

Remark. It is important to note that, in contrast to existing approximate query answering systems, PlatoDB can answer queries that span across different time series, even though the segment trees were pre-processed for each time series individually. As we will see, the fact that the segment trees were generated for each time series individually leads to interesting problems at query processing time, such as aligning the segments of different time series and reasoning about how these segments interact to produce the query answer and error guarantees. Finally, it is also important to note that PlatoDB adapts to the provided error budget by accessing a different number of nodes. Larger error budgets lead to fewer node accesses, while smaller error budgets require more node accesses.

EXAMPLE 2. Consider a query $Q$ involving two time series $T_1$ and $T_2$ and an error budget $\varepsilon_{max} = 10$. Figure 1(b) shows how the query processing algorithm uses the pre-computed segment trees of the two time series. PlatoDB first accesses the root nodes of both segment trees in parallel and computes the current approximate query answer $\hat{R}$ and error $\hat{\varepsilon}$, using the compression function and error measures in the root nodes. Let us assume that $\hat{\varepsilon} = 20$. Since $\hat{\varepsilon} > \varepsilon_{max}$, PlatoDB keeps traversing the trees by greedily choosing a node and replacing it by its children, so that the error reduction at each step is maximized. This process continues until the error budget is satisfied. For instance, assume that using the yellow shaded nodes in Figure 1(b) PlatoDB obtains an error $\hat{\varepsilon} = 6 < \varepsilon_{max}$. Then PlatoDB stops traversing the trees and outputs the approximate answer and the error $\hat{\varepsilon} = 6$. Note that none of the descendants of the shaded nodes is touched, resulting in big performance savings.
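The greedy refinement walked through in Example 2 can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not PlatoDB's actual algorithm (which Sections 5 and 6 describe): it handles only a Sum query over a single time series, so the query error is taken to be the sum of the L measures of the currently selected nodes. The names `Node`, `approx_sum`, and `greedy_sum` are hypothetical, and the numbers are chosen to mirror the errors 20 and 6 of Example 2.

```python
import heapq
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    approx_sum: float               # sum of the compression function over the segment
    L: float                        # L1 error of the segment (hypothetical values below)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def greedy_sum(root: Node, max_error: float) -> Tuple[float, float, int]:
    """Refine the selected nodes until the error budget is met or no node can be split.

    Returns (approximate answer, error bound, number of nodes used so far).
    """
    heap = []      # max-heap on error reduction, implemented by negating the gain
    counter = 0    # tie-breaker so that Node objects are never compared

    def push(n: Node) -> None:
        nonlocal counter
        if n.left is not None and n.right is not None:
            gain = n.L - (n.left.L + n.right.L)
            heapq.heappush(heap, (-gain, counter, n))
            counter += 1

    answer, error, used = root.approx_sum, root.L, 1
    push(root)
    while error > max_error and heap:
        _, _, n = heapq.heappop(heap)
        # Replace the parent by its two children in both answer and error.
        answer += n.left.approx_sum + n.right.approx_sum - n.approx_sum
        error += n.left.L + n.right.L - n.L
        used += 2
        push(n.left)
        push(n.right)
    return answer, error, used


# A small unbalanced tree: only the right child of the root is split further.
tree = Node(40.0, 20.0,
            left=Node(18.0, 4.0),
            right=Node(23.0, 8.0, left=Node(11.0, 1.0), right=Node(12.5, 1.0)))
result = greedy_sum(tree, max_error=10.0)
```

With a larger budget (for example `max_error=25.0`) the loop never runs and only the root is used, which is the behavior the Remark above describes for large error budgets.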
As a result of this architecture, PlatoDB achieves speedups of 1-3 orders of magnitude in query processing of sensor data compared to approaches that use the entire dataset to compute exact query answers (more details are included in PlatoDB's experimental evaluation in Section 7).

3. DATA AND QUERIES

Before describing the PlatoDB system, we first present its data model and query language.

Figure 2: Grammar of query expressions.

Query Expression (Q):
  $Q \rightarrow Ar$
Arithmetic Expression (Ar):
  $Ar \rightarrow number \mid Agg \mid Ar \otimes Ar$, where $\otimes \in \{+, -, \times, \div\}$
Aggregation Expression (Agg):
  $Agg \rightarrow Sum(T, l_s, l_e) = \sum_{i=l_s}^{l_e} d_i$
Time Series Expression (T):
  $T \rightarrow$ base time series
  $\mid SeriesGen(\upsilon, n) = (\upsilon, \upsilon, \ldots, \upsilon)$ ($n$ copies)
  $\mid Plus(T, T) = (d^{(1)}_1 + d^{(2)}_1, \ldots, d^{(1)}_n + d^{(2)}_n)$
  $\mid Minus(T, T) = (d^{(1)}_1 - d^{(2)}_1, \ldots, d^{(1)}_n - d^{(2)}_n)$
  $\mid Times(T, T) = (d^{(1)}_1 \times d^{(2)}_1, \ldots, d^{(1)}_n \times d^{(2)}_n)$

Data Model. For the purpose of this work, a time series $T = [(t_1, d_1), (t_2, d_2), \ldots, (t_n, d_n)]$ is a sequence of (time, data point) pairs $(t_i, d_i)$, such that the data point $d_i$ was observed at time $t_i$. We follow existing work [13] to normalize and standardize the time series so that all time series are in the same domain and have the same resolution. Since all time series are aligned, for ease of exposition we omit the exact time points and use instead the index of the data points whenever we need to define a time interval. For instance, we will denote the above time series simply as $T = (d_1, d_2, \ldots, d_n)$, and use $[i, j]$ to refer to the time interval $[t_i, t_j]$. A subsequence of a time series is called a time series segment. For example, $S = (5.01, 5.06)$ is a segment of the time series $T = (5.05, 5.01, 5.06, 5.06, 5.08)$.

Query Language. PlatoDB supports queries whose main building blocks are aggregation queries over time series. Figure 2 shows the formal definition of the query language and Table 1 lists several common statistics that can be expressed in this language. A query expression $Q$ is an arithmetic expression of the form $Ar_1 \otimes Ar_2 \otimes \ldots \otimes Ar_n$, where $\otimes$ are the standard arithmetic operators ($+$, $-$, $\times$, $\div$) and $Ar_i$ is either an arithmetic literal or an aggregation expression over a time series. An aggregation expression $Sum(T, l_s, l_e)$ over a time series $T$ computes the sum of all data points of $T$ in the time interval $[l_s, l_e]$. Note that the time series that is aggregated could either be a base time series or a derived time series that was computed from a set of base time series through a set of time series operators.
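The semantics in Figure 2 can be made concrete by evaluating the operators exactly on raw data. The sketch below is only a reference interpretation of the grammar, not part of PlatoDB: it represents a time series as a plain Python list, uses the paper's 1-based inclusive indexing for `Sum`, and the function names (`series_gen`, `plus`, `minus`, `times`, `agg_sum`) are hypothetical.

```python
from typing import List

Series = List[float]


def _check(a: Series, b: Series) -> None:
    # The data model assumes aligned series of equal length.
    if len(a) != len(b):
        raise ValueError("time series must have the same number of data points")


def series_gen(v: float, n: int) -> Series:
    """SeriesGen(v, n): a series with n data points, all equal to v."""
    return [v] * n


def plus(a: Series, b: Series) -> Series:
    _check(a, b)
    return [x + y for x, y in zip(a, b)]


def minus(a: Series, b: Series) -> Series:
    _check(a, b)
    return [x - y for x, y in zip(a, b)]


def times(a: Series, b: Series) -> Series:
    _check(a, b)
    return [x * y for x, y in zip(a, b)]


def agg_sum(t: Series, ls: int, le: int) -> float:
    """Sum(T, ls, le): sum of the data points d_ls, ..., d_le (1-based, inclusive)."""
    return sum(t[ls - 1:le])


# The mean from Table 1, written as the query Sum(T, 1, n) / n.
T = [1.0, 2.0, 3.0, 6.0]
mean = agg_sum(T, 1, len(T)) / len(T)

# A query over a derived series: Sum(Times(T, SeriesGen(2, n)), 2, 3).
scaled = agg_sum(times(T, series_gen(2.0, len(T))), 2, 3)
```

Because arithmetic expressions operate on plain numbers, they need no special support here; ordinary Python arithmetic on the results of `agg_sum` plays that role.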
PlatoDB allows a series of time series operators, including $Plus(T_1, T_2)$, $Minus(T_1, T_2)$, and $Times(T_1, T_2)$ (which return a time series whose data points are computed by adding, subtracting, and multiplying the respective data points of the original time series, respectively), as well as $SeriesGen(\upsilon, n)$, which takes as input a value $\upsilon$ and a counter $n$ and creates a new time series that contains $n$ data points with the value $\upsilon$.

Note that the query language can be used to express many common statistics over time series encountered in practice, including all the queries we encountered during the DELPHI project conducted at UC San Diego, which explored how health-related data about individuals, including large amounts of sensor data, can be leveraged to discover the determinants of health conditions and which served as the motivation for this work [18]. These include the mean and variance of a single time series, as well as the covariance, correlation, and cross-correlation between two time series. Table 1 shows how common statistics can be expressed in PlatoDB's query language.

4. SEGMENT TREE

As explained in Section 2, at data import time, PlatoDB creates for each time series a hierarchy of summarizations of the series in the form of the segment tree. In this section we first explain the structure of the tree and then describe the segment tree generation algorithm.

4.1 Segment Tree Structure

Let $T = (d_1, \ldots, d_n)$ be a time series. The segment tree of $T$ is a binary tree whose nodes summarize segments of the time series, with nodes higher up the tree summarizing large segments and nodes lower down the tree summarizing progressively smaller segments. In particular, the root node summarizes the entire time series $T$. Moreover, for each node $n$ of the tree summarizing a segment $S = (d_i, \ldots, d_j)$ of $T$, its left and right children nodes $n_l$ and $n_r$ summarize two subsegments $S_l = (d_i, \ldots, d_k)$ and $S_r = (d_{k+1}, \ldots, d_j)$, respectively, which form a partitioning of the original segment $S$.
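The structural invariant just stated (each pair of children partitions the parent's index range) can be written down directly. The following is a minimal sketch with hypothetical names (`SegmentNode`, `is_valid`); the `summary` field is a placeholder for whatever a node stores about its segment.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SegmentNode:
    start: int                      # index i of the first data point (1-based)
    end: int                        # index j of the last data point (inclusive)
    summary: Any = None             # placeholder for the per-node summary
    left: Optional["SegmentNode"] = None
    right: Optional["SegmentNode"] = None


def is_valid(node: SegmentNode) -> bool:
    """Check that every internal node's children partition its segment."""
    if node.start > node.end:
        return False
    if node.left is None and node.right is None:
        return True                 # a leaf is trivially valid
    if node.left is None or node.right is None:
        return False                # binary tree: children come in pairs
    l, r = node.left, node.right
    covers = l.start == node.start and r.end == node.end
    adjacent = l.end + 1 == r.start     # S_l = (d_i..d_k), S_r = (d_{k+1}..d_j)
    return covers and adjacent and is_valid(l) and is_valid(r)


# An unbalanced tree over 8 data points: only the left half is split again.
good = SegmentNode(1, 8,
                   left=SegmentNode(1, 5, left=SegmentNode(1, 2), right=SegmentNode(3, 5)),
                   right=SegmentNode(6, 8))
# Data point 5 is covered by neither child, so this is not a partitioning.
bad = SegmentNode(1, 8, left=SegmentNode(1, 4), right=SegmentNode(6, 8))
```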
As we will see in Section 6, this hierarchical structure allows PlatoDB to adapt to varying error/time budgets by only accessing the parts of the tree required to achieve the given error/time budget.

At each node $n$ corresponding to segment $S = (d_i, \ldots, d_j)$, PlatoDB summarizes the segment $S$ by keeping two types of measures: (a) a description of a compression function that is used to approximately represent the time series values in the segment and (b) a set of error measures describing how far the above approximate values are from the real values. As we will see in Sections 5 and 6, PlatoDB uses at query processing time the compression function and error measures stored in each node to compute an approximate answer of the query and deterministic error guarantees, respectively. We next describe the compression functions and error measures stored within each segment tree node in detail.

Segment Compression Function. Let $S = (d_1, \ldots, d_n)$ be a segment. PlatoDB summarizes its contents through a compression function $f$ chosen by the user. PlatoDB supports the use of any of the compression functions suggested in the literature [21, 20, 19, 11, 5, 4]. Examples include but are not limited to the Piecewise Aggregate Approximation (PAA) [21], the Adaptive Piecewise Constant Approximation (APCA) [20], the Piecewise Linear Representation (PLR) [19], the Discrete Fourier Transformation (DFT) [11], the Discrete Wavelet Transformation (DWT) [5], and the Chebyshev polynomials (CHEB) [4]. To describe the function, PlatoDB stores in the segment node the parameters describing the function. These parameters depend on the type of the function. For instance, if $f$ is a Piecewise Aggregate Approximation (PAA), estimating all values within a segment by a single value $b$, then the parameter is just the single value $b$. On the other hand, if $f$ is a Piecewise Linear Representation (PLR), estimating the values in the segment through a line $ax + b$, then the function parameters are the coefficients $a$ and $b$ of the polynomial used to describe the line.
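To illustrate how few parameters such functions need, the sketch below fits the two examples just mentioned. It is an illustration under stated assumptions rather than the method of the cited papers: PAA is taken to be the segment mean (as in the text), and the line for PLR is fitted by ordinary least squares over the positions $x = 1, \ldots, n$, which is one common choice among several. The function names are hypothetical.

```python
from typing import List, Tuple


def fit_paa(segment: List[float]) -> float:
    """PAA: the single parameter b is the average of the segment."""
    return sum(segment) / len(segment)


def fit_plr(segment: List[float]) -> Tuple[float, float]:
    """PLR (assumed here: least-squares line a*x + b over x = 1..n)."""
    n = len(segment)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(segment) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, segment))
    a = sxy / sxx if sxx else 0.0   # a single point has no slope information
    b = mean_y - a * mean_x
    return a, b


def evaluate_plr(params: Tuple[float, float], i: int) -> float:
    """Value f(i) estimated for the i-th element of the segment (1-based)."""
    a, b = params
    return a * i + b


b_paa = fit_paa([5.12, 5.09, 5.07, 5.04])     # segment later used in Example 3
line = fit_plr([1.0, 3.0, 5.0, 7.0])          # points lying exactly on y = 2x - 1
```

A node would store only `b_paa` (one number) or `line` (two numbers) instead of the raw data points.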
In the rest of the document, we will refer directly to the compression function $f$ (instead of the parameters that are used to describe it). Given a segment $(d_1, \ldots, d_n)$, we will use $f(i)$ to denote the value for element $d_i$ of the segment, as derived by $f$.

Table 1: Query expressions for common statistics.

Mean, $E(T)$
  Definition: $\frac{\sum_{i=1}^{n} d_i}{n}$
  Query expression: $\frac{Sum(T, 1, n)}{n}$
Variance, $Var(T)$
  Definition: based on $\sum_{i=1}^{n} (d_i - E(T))^2$
  Query expression: built from $Sum(Times(T, T), 1, n)$ and $Sum(T, 1, n) \times Sum(T, 1, n)$ (the normalizing denominators are not legible in this copy)
Covariance, $Cov(T_1, T_2)$
  Definition: $\frac{\sum_{i=1}^{n} ((d^{(1)}_i - E(T_1))(d^{(2)}_i - E(T_2)))}{n - 1}$
  Query expression: $\frac{Sum(Times(T_1, T_2), 1, n)}{n - 1} - \frac{Sum(T_1, 1, n) \times Sum(T_2, 1, n)}{n(n - 1)}$
Correlation, $Corr(T_1, T_2)$
  Definition: $\frac{\sum_{i=1}^{n} ((d^{(1)}_i - E(T_1))(d^{(2)}_i - E(T_2)))}{\sqrt{\sum_{i=1}^{n} (d^{(1)}_i - E(T_1))^2 \sum_{i=1}^{n} (d^{(2)}_i - E(T_2))^2}}$
  Query expression: $\frac{Sum(Times(T_1, T_2)) - \frac{1}{n} Sum(T_1, 1, n) \times Sum(T_2, 1, n)}{\sqrt{Var(T_1) Var(T_2)}}$
Cross-correlation, $Coss(T_1, T_2, l)$
  Definition: $\frac{\sum_{i=1}^{n} ((d^{(1)}_i - E(T_1))(d^{(2)}_{i+l} - E(T_2)))}{\sqrt{\sum_{i=1}^{n} (d^{(1)}_i - E(T_1))^2 \sum_{i=1}^{n} (d^{(2)}_{i+l} - E(T_2))^2}}$
  Query expression: $\frac{Sum(Times(T_1, T_2)) - \frac{1}{n} Sum(T_1, 1, n) \times Sum(T_2, 1 + l, n + l)}{\sqrt{Var(T_1) Var(T_2)}}$

Segment Error Measures. In addition to the compression function, PlatoDB also stores a set of error measures for each time series segment $S = (d_1, \ldots, d_n)$. PlatoDB stores the following three error measures:

• $L$: The sum of the absolute distances between the original and the compressed time series (also known as the Manhattan or $L_1$ distance), i.e., $L = \sum_{i=1}^{n} |d_i - f(i)|$.

• $d^*$: The maximum absolute value of the original time series, i.e., $d^* = \max\{|d_i| : 1 \le i \le n\}$.

• $f^*$: The maximum absolute value of the compressed time series, i.e., $f^* = \max\{|f(i)| : 1 \le i \le n\}$.

EXAMPLE 3. For instance, consider a segment $S = (5.12, 5.09, 5.07, 5.04)$ summarized through the PAA compression function $f = 5.08$ (i.e., $f(1) = f(2) = f(3) = f(4) = 5.08$). Then $L = |5.12 - 5.08| + |5.09 - 5.08| + |5.07 - 5.08| + |5.04 - 5.08| = 0.1$, $d^* = \max\{5.12, 5.09, 5.07, 5.04\} = 5.12$ and $f^* = \max\{5.08, 5.08, 5.08, 5.08\} = 5.08$.

As we will see in Section 5, the above three error measures are sufficient to compute deterministic error guarantees for any query supported by the system, regardless of the employed compression function $f$. This allows administrators to select the compression function best suited to each time series, without worrying about computing the error guarantees, which is automatically handled by PlatoDB.

4.2 Segment Tree Generation

We next describe the algorithm generating the segment tree.
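Since the generation algorithm is driven by the $L$ measure of Section 4.1, it may help to first see the three error measures computed on the data of Example 3. The sketch below assumes the compression function is passed in as a Python callable over 1-based positions; the name `error_measures` is hypothetical.

```python
from typing import Callable, List, Tuple


def error_measures(segment: List[float],
                   f: Callable[[int], float]) -> Tuple[float, float, float]:
    """Return (L, d_star, f_star) for a segment and its compression function f."""
    approx = [f(i) for i in range(1, len(segment) + 1)]
    L = sum(abs(d - a) for d, a in zip(segment, approx))   # Manhattan (L1) distance
    d_star = max(abs(d) for d in segment)                  # max |original value|
    f_star = max(abs(a) for a in approx)                   # max |compressed value|
    return L, d_star, f_star


# Example 3: PAA with the constant 5.08 on the segment (5.12, 5.09, 5.07, 5.04).
L, d_star, f_star = error_measures([5.12, 5.09, 5.07, 5.04], lambda i: 5.08)
```

Up to floating-point rounding, this yields the values of Example 3: `L` is 0.1, `d_star` is 5.12 and `f_star` is 5.08. Nothing in the function depends on the form of `f`, which is the property that lets the measures work with any compression function.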
To build the tree, the algorithm has to decide how to build the children nodes from a parent node, i.e., how to partition a segment into two non-overlapping subsegments. Each possible splitting point will lead to different children segments and as a result to different errors when PlatoDB uses the children segments to answer a query at query processing time. Ideally, the splitting point should be the one that minimizes the error among all possible splitting points. However, since PlatoDB supports ad hoc queries and since each query may benefit from a different splitting point, there is no way for PlatoDB to choose a splitting point that is optimal for all queries.

Segment Tree Generation Algorithm. Based on this observation, PlatoDB chooses the splitting point that minimizes the error for the basic query that simply computes the sum of all data points of the original segment. In particular, the segment tree generation algorithm starts from the root and, proceeding in a top-down fashion, given a segment $S = (d_1, \ldots, d_n)$, selects a splitting point $d_k$ that leads to two subsegments $S_l = (d_1, \ldots, d_k)$ and $S_r = (d_{k+1}, \ldots, d_n)$ so that the sum of the Manhattan distances of the new subsegments $L_{S_l} + L_{S_r}$ is minimized. The algorithm stops further splitting a segment $S$ when one of the following two conditions holds: (i) when the Manhattan distance $L_S$ of the segment is smaller than a threshold $\tau$ or (ii) when the size of the segment is below a threshold $\kappa$. The choice between conditions (i) and (ii) and the values of the corresponding thresholds $\tau$ and $\kappa$ are specified by the system administrator. Since the algorithm needs time proportional to the size of a segment to compute the splitting point of a single segment and it repeats this process for every non-leaf tree node, it exhibits a worst-case time complexity of $O(mn)$, where $n$ is the size of the original time series (i.e., the number of its data points) and $m$ is the number of nodes in the resulting segment tree.

Discussion.
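As a concrete reference point for this discussion, the greedy top-down generation described above can be sketched as follows. The sketch makes several assumptions that the text does not fix: it uses PAA (the segment mean) as the compression function, applies both stopping conditions at once rather than letting an administrator choose one, and finds the splitting point by brute force, which costs quadratic time per segment instead of the linear time the paper reports. The names `paa_l1`, `best_split`, and `build` are hypothetical.

```python
from typing import Any, Dict, List


def paa_l1(seg: List[float]) -> float:
    """Manhattan distance L between a segment and its PAA (mean) approximation."""
    mean = sum(seg) / len(seg)
    return sum(abs(d - mean) for d in seg)


def best_split(seg: List[float]) -> int:
    """Return k such that splitting into seg[:k] and seg[k:] minimizes L_Sl + L_Sr.

    Brute force: every candidate k recomputes both distances from scratch.
    """
    return min(range(1, len(seg)),
               key=lambda k: paa_l1(seg[:k]) + paa_l1(seg[k:]))


def build(seg: List[float], start: int = 1,
          tau: float = 0.0, kappa: int = 2) -> Dict[str, Any]:
    """Build a segment tree as nested dicts; `start` is the 1-based index of seg[0]."""
    L = paa_l1(seg)
    node = {"start": start, "end": start + len(seg) - 1,
            "mean": sum(seg) / len(seg), "L": L, "left": None, "right": None}
    # Stop when (i) L < tau, or (ii) the segment is smaller than kappa,
    # or when the segment cannot be split at all.
    if len(seg) < 2 or L < tau or len(seg) < kappa:
        return node
    k = best_split(seg)
    node["left"] = build(seg[:k], start, tau, kappa)
    node["right"] = build(seg[k:], start + k, tau, kappa)
    return node


series = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
tree = build(series, tau=0.5)
```

On this step-shaped series the root is split once, between positions 4 and 5, and both children are already exact (`L` equal to 0), so they become leaves. Each call decides its split in isolation, which is the greedy behavior examined next.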
Note that by deciding independently how to split each individual segment into two subsegments, the segment tree generation algorithm is a greedy algorithm, which, even though it makes optimal local decisions for the basic aggregation query, may not lead to optimal global decisions. For instance, there is no guarantee that the $k$ nodes that exist at a particular level of the segment tree correspond to the $k$ nodes that minimize the error of the basic aggregation query. The literature contains a multitude of algorithms that can provide such a guarantee for a given $k$, i.e., algorithms that can, given a time series $T$ and a number $k$, produce $k$ segments of $T$ that minimize some error metric. Examples include the optimal algorithm of [3], as well as approximation algorithms with formal guarantees presented in [34]. However, all these algorithms have very high worst-case time complexity, which makes them prohibitive for the large number of data points typically found in sensor datasets, and are therefore not considered in this work. Though several heuristic segmentation algorithms exist, such as the Sliding Windows [33], the Top-down [22] and the Bottom-Up [23] algorithm, similar to our greedy algorithm, they do not provide any formal guarantees.

Finally, note that the tree generated by the above algorithm will in general be unbalanced. Intuitively, the algorithm will create more nodes and corresponding tree levels to cover segments that contain data points that are more irregular and/or rapidly changing, utilizing fewer nodes for smooth segments.

5. COMPUTING APPROXIMATE QUERY ANSWERS AND ERROR GUARANTEES

Given pre-computed segment trees for time series $T_1, \ldots, T_n$, PlatoDB answers ad hoc queries over the time series by accessing their segment trees. In particular, to answer a given query $Q$ under an error/time budget, PlatoDB navigates the segment trees of the time series involved in $Q$, selects segment nodes (or simply segments) that satisfy the budget, and computes an approximate answer for $Q$ together with deterministic error guarantees. We will next present the query processing algorithm. For ease of exposition, we will start by describing how PlatoDB computes an approximate query answer and the associated error guarantees assuming that the segment nodes have already been chosen, and will explain in Section 6 how PlatoDB traverses the tree to choose the segment nodes.

Approximate query answering problem under given segments. Formally, let $T_1, \ldots, T_k$ be time series, such that time series $T_i$ is partitioned into segments $S_{i,1}, \ldots, S_{i,n}$. Given (a) these segments and the associated measures as described above and (b) a query $Q$ over the time series $T_1, \ldots, T_k$, we will show how PlatoDB computes an approximate query answer $\hat{R}$ and an estimated error $\hat{\varepsilon}$, such that the approximate query answer $\hat{R}$ is guaranteed to be within $\pm\hat{\varepsilon}$ of the accurate query answer $R$ (see footnote 4), i.e., $|R - \hat{R}| \le \hat{\varepsilon}$.

For ease of exposition, we first describe the simple case where each time series $T_i$ contains a single segment perfectly aligned with the single segment of the other series, before describing the general case, where each time series $T_i$ contains multiple segments, which may also not be perfectly aligned with the segments of the other time series.

5.1 Single Time Series Segment

Let $T_1, \ldots, T_k$ be $k$ time series with single aligned segments, i.e., $T_i$ is approximated by a single segment $S_i$. Also let $f_i$ be the compression function and $(L_i, d^*_i, f^*_i)$ the error measures of segment $S_i$, respectively.
To compute the approximate answer and error guarantees of a query $Q$ over $T_1, \dots, T_k$ using the single segments $S_1, \dots, S_k$, PlatoDB employs an algebraic approach: in a bottom-up fashion, it computes for each algebraic operator $op$ of $Q$ the approximate answer and error guarantees for the subquery corresponding to the subtree rooted at $op$. This approach is based on formulas that, for each algebraic query operator, take an approximate answer and error for the inputs of the operator and provide the corresponding answer and error for its output. Figure 3 shows the formulas employed by PlatoDB for each algebraic query operator supported by the system.

Note that the output signatures differ between operators. This is due to the different types of operators supported by PlatoDB, as explained next. Recall from Section 3 that PlatoDB's query language consists of three types of operators: (i) time series operators, (ii) the aggregation operator, and (iii) arithmetic operators. While time series operators output a time series, aggregation and arithmetic operators output a single number. As a result, the formulas used for answer and error estimation treat these two classes of operators differently. For time series operators, the formulas return, similarly to the input time series, the compression function and error measures of the output time series. For aggregation and arithmetic operators, on the other hand, which return a single number and not an entire time series, the formulas return simply a single approximate answer and estimated error. Figure 3 shows the resulting formulas (see footnote 5). Without going into the details of each of them, we next explain through an example how they can be used to compute the answer and corresponding error guarantees for an entire query.

Footnote 4: Accurate answer means running queries over raw data. Note, however, that in this work we can give estimated errors without computing the accurate answers.

Footnote 5: Out of the formulas, the most involved are the output measure estimation formulas of the Times operator. More details on how they were derived can be found in Appendix A.1.

Time series operators:

| Operator | Compression function (output) | $L$ | $d^*$ | $f^*$ |
|---|---|---|---|---|
| SeriesGen($\upsilon$, $n$) | $\upsilon$ | $0$ | $\upsilon$ | $\upsilon$ |
| Plus($T_1$, $T_2$) | $f_1 + f_2$ | $L_1 + L_2$ | $d^*_1 + d^*_2$ | $f^*_1 + f^*_2$ |
| Minus($T_1$, $T_2$) | $f_1 - f_2$ | $L_1 + L_2$ | $d^*_1 + d^*_2$ | $f^*_1 + f^*_2$ |
| Times($T_1$, $T_2$) | $f_1 \times f_2$ | $\min\{d^*_2 L_1 + f^*_1 L_2,\ f^*_2 L_1 + d^*_1 L_2\}$ | $d^*_1 \times d^*_2$ | $f^*_1 \times f^*_2$ |

Aggregation operator:

| Operator | Approximate output | Estimated error |
|---|---|---|
| Sum($T$, $l_s$, $l_e$) | $\sum_{i=l_s}^{l_e} f(i)$ | $L$ |

Arithmetic operators:

| Operator | Approximate output | Estimated error |
|---|---|---|
| $Agg + Number$ | $\hat{Agg} + Number$ | $\hat{\varepsilon}$ |
| $Agg - Number$ | $\hat{Agg} - Number$ | $\hat{\varepsilon}$ |
| $Agg \times Number$ | $\hat{Agg} \times Number$ | $\hat{\varepsilon} \times Number$ |
| $Agg \div Number$ | $\hat{Agg} \div Number$ | $\hat{\varepsilon} \div Number$ |
| $Agg_a + Agg_b$ | $\hat{Agg}_a + \hat{Agg}_b$ | $\hat{\varepsilon}_a + \hat{\varepsilon}_b$ |
| $Agg_a - Agg_b$ | $\hat{Agg}_a - \hat{Agg}_b$ | $\hat{\varepsilon}_a + \hat{\varepsilon}_b$ |
| $Agg_a \times Agg_b$ | $\hat{Agg}_a \times \hat{Agg}_b$ | $\hat{Agg}_a \hat{\varepsilon}_b + \hat{Agg}_b \hat{\varepsilon}_a + \hat{\varepsilon}_a \hat{\varepsilon}_b$ |
| $Agg_a \div Agg_b$ | $\hat{Agg}_a \div \hat{Agg}_b$ | $\frac{\hat{Agg}_a + \hat{\varepsilon}_a}{\hat{Agg}_b - \hat{\varepsilon}_b} - \frac{\hat{Agg}_a}{\hat{Agg}_b}$ |

Figure 3: Formulas for estimating answer and error for each algebraic operator (single segment).

Figure 4: Approximate query answer and associated error for query $Q$ = Sum(Times(Minus($T$, SeriesGen($\mu$, $n$)), Minus($T$, SeriesGen($\mu$, $n$))), 1, $n$). Compression functions and error measures are shown in blue and red, respectively.

EXAMPLE 4. This example shows how to use the formulas in Figure 3 to compute the approximate answer and associated error for a query computing the variance of a time series $T$ consisting of a single segment $S$. For simplicity of the query expression we assume that the mean $\mu$ of $T$ is known in advance (note that even if $\mu$ were not known, the query would still be expressible in PlatoDB's query language, albeit through a longer expression). Let $f$ be the compression function and $(L, d^*, f^*)$ the error measures of $S$. The query can be expressed as $Q$ = Sum(Times(Minus($T$, SeriesGen($\mu$, $n$)), Minus($T$, SeriesGen($\mu$, $n$))), 1, $n$). Figure 4 shows how PlatoDB evaluates this query in a bottom-up fashion. It first uses the formula of the SeriesGen operator to compute the compression function ($f = \mu$) and error measures ($L = 0$, $d^* = \mu$, $f^* = \mu$) for the output of the SeriesGen operator. It then computes the compression function ($f - \mu$) and error measures ($L$, $d^* + \mu$, $f^* + \mu$) for the output of the Minus operator. The computation continues in a bottom-up fashion, until PlatoDB computes the output of the Sum operator in the form of an approximate answer $\hat{R} = n(f - \mu)^2$, where $n$ is the number of data points in $T$, and an estimated error $\hat{\varepsilon} = ((d^* + \mu) + (f^* + \mu))L = (d^* + f^* + 2\mu)L$, obtained by applying the Times formula to the two identical Minus outputs.

Importantly, the formulas shown in Figure 3 are guaranteed to produce the best error estimation out of any formula that uses the three error measures employed by PlatoDB, as explained by the following theorem:

THEOREM 1. The estimated errors produced through the use of the formulas shown in Figure 3 are the lowest among all possible error estimations produced by using the error measures described in Section 4.

The proof can be found in Appendix A.

5.2 Multiple Segment Time Series

Let us now consider the general case, where each time series $T_i$ contains multiple segments of varying sizes. As a result of the varying sizes of the segments, segments of different time series may not fully align.

EXAMPLE 5. For instance, consider the top two time series $T_1 = (S_{1,1}, S_{1,2})$ and $T_2 = (S_{2,1}, S_{2,2})$ of Figure 5 (ignore the third time series for now). Segment $S_{1,1}$ overlaps with both $S_{2,1}$ and $S_{2,2}$. Similarly, segment $S_{2,2}$ overlaps with both $S_{1,1}$ and $S_{1,2}$.
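To make the misalignment of Example 5 concrete, the following minimal sketch lists which segments of two time series share data points. The function names and the segment lengths are hypothetical (Figure 5 does not give lengths in the text); only the overlap pattern is taken from the example.

```python
from itertools import accumulate

def boundaries(lengths):
    """Turn a list of segment lengths into half-open index ranges [start, end)."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))

def overlapping_pairs(lengths_a, lengths_b):
    """Return 1-based pairs (i, j) such that segment i of the first series
    and segment j of the second series share at least one data point."""
    pairs = []
    for i, (sa, ea) in enumerate(boundaries(lengths_a), start=1):
        for j, (sb, eb) in enumerate(boundaries(lengths_b), start=1):
            if max(sa, sb) < min(ea, eb):  # the two ranges intersect
                pairs.append((i, j))
    return pairs

# Hypothetical lengths chosen only to reproduce the pattern of Example 5.
t1 = [6, 4]  # S_{1,1} covers points 0..5, S_{1,2} covers points 6..9
t2 = [3, 7]  # S_{2,1} covers points 0..2, S_{2,2} covers points 3..9
pairs = overlapping_pairs(t1, t2)
print(pairs)
```

With these lengths, the first segment of `t1` overlaps both segments of `t2`, and the second segment of `t2` overlaps both segments of `t1`, which is the situation described above.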
One may think that this can be easily solved by creating subsegments that are perfectly aligned and then using for each of them the answer and error estimation formulas of Section 5.1.

EXAMPLE 6. Continuing our example, the two time series $T_1$ and $T_2$ can be split into the three aligned subsegments shown as the output time series $T_3$. Then for each of these output segments, we can compute the error based on the formulas of Section 5.1.

However, the problem with this approach is that the resulting error will be severely overestimated, as the error of a single segment of the original time series may be counted multiple times when it overlaps with multiple output segments.

EXAMPLE 7. For instance, for a query over the time series $T_1$ and $T_2$ of Figure 5, the error of $S_{2,2}$ will be double-counted, as it will be counted towards the error of the two output segments $S_{3,2}$ and $S_{3,3}$.

To avoid this pitfall, PlatoDB does not estimate the error for each segment individually but instead computes the error holistically for the entire time series. Figures 6 and 7 show the resulting answer and error estimation formulas for time series operators and the aggregation operator, respectively. The formulas of the arithmetic operators are omitted as they remain the same as in the single segment case: the arithmetic operators take as input single numbers instead of time series and are thus not affected by multiple segments.

Figure 5: Example of aligned time series segments. The newly generated time series $T_3$ is shown in red.

6. NAVIGATING THE SEGMENT TREE

So far we have seen how PlatoDB computes the approximate answer to a query and its associated error, assuming that the segments that are used for query processing have already been selected. In this section, we explain how this selection is performed. In particular, we show how PlatoDB navigates the segment trees of the time series involved in the query to continuously compute better estimations of the query answer until the given error or time budget is satisfied.

Query Processing Algorithm.
Let $T_1, \dots, T_m$ be a set of time series and $I_1, \dots, I_m$ the respective segment trees. Let also $Q$ be a query over $T_1, \dots, T_m$ and $\varepsilon_{max}$ / $t_{max}$ an error/time budget, respectively. To answer $Q$ under the given budget, PlatoDB starts from the roots of $I_1, \dots, I_m$ and uses them to compute the approximate query answer $\hat{R}$ and corresponding error $\hat{\varepsilon}$ using the formulas presented in Section 5. If the estimated error is greater than the error budget (i.e., if $\hat{\varepsilon} > \varepsilon_{max}$) or, when a time budget is given, if the elapsed time is smaller than the allowed time budget, PlatoDB chooses one of the tree nodes used above, replaces it with its children, and repeats the above procedure using the newly selected nodes until the given error/time budget is reached.

What is important is the criterion used to choose the node that is replaced at each step by its children. In general, PlatoDB will have to select between several nodes, as it will be exploring in which segment tree, and moreover in which part of the selected segment tree, it pays off to navigate further down. Since PlatoDB aims to reduce the estimated error as much as possible, at each step it greedily chooses the node whose replacement by its children leads to the biggest reduction in the estimated error. The resulting procedure is shown as Algorithm 1 (see footnote 6).

Footnote 6: Note that the algorithm is shown for both the error and the time budget case. When a time budget is provided, the algorithm has to always keep a computed estimated answer $\hat{R}$ so that it can return it when the time budget runs out. In the case of the error budget this is not required. Thus, in the latter case, it suffices to compute $\hat{R}$ only at the very last step of the algorithm, avoiding its iterative computation during the while loop.

Time series operators:

| Operator | Compression function (output) | $L$ | $d^*$ | $f^*$ |
|---|---|---|---|---|
| SeriesGen($\upsilon$, $n$) | $\upsilon$ | $0$ | $\upsilon$ | $\upsilon$ |
| Plus($T_a$, $T_b$) | $\{(f_{c,1}, \dots, f_{c,k}) \mid f_{c,i} = f_{a,u} + f_{b,v},\ i \in [1, k]\}$ | $\sum_{i=1}^{p} L_{a,i} + \sum_{j=1}^{q} L_{b,j}$ | $\max\{d^*_{c,i} \mid d^*_{c,i} = d^*_{a,u} + d^*_{b,v},\ i \in [1, k]\}$ | $\max\{f^*_{c,i} \mid f^*_{c,i} = f^*_{a,u} + f^*_{b,v},\ i \in [1, k]\}$ |
| Minus($T_a$, $T_b$) | $\{(f_{c,1}, \dots, f_{c,k}) \mid f_{c,i} = f_{a,u} - f_{b,v},\ i \in [1, k]\}$ | $\sum_{i=1}^{p} L_{a,i} + \sum_{j=1}^{q} L_{b,j}$ | $\max\{d^*_{c,i} \mid d^*_{c,i} = d^*_{a,u} + d^*_{b,v},\ i \in [1, k]\}$ | $\max\{f^*_{c,i} \mid f^*_{c,i} = f^*_{a,u} + f^*_{b,v},\ i \in [1, k]\}$ |
| Times($T_a$, $T_b$) | $\{(f_{c,1}, \dots, f_{c,k}) \mid f_{c,i} = f_{a,u} \times f_{b,v},\ i \in [1, k]\}$ | $L_{T_c}$ | $\max\{d^*_{c,i} \mid d^*_{c,i} = d^*_{a,u} \times d^*_{b,v},\ i \in [1, k]\}$ | $\max\{f^*_{c,i} \mid f^*_{c,i} = f^*_{a,u} \times f^*_{b,v},\ i \in [1, k]\}$ |

Figure 6: Formulas for estimating answer and error for time series operators (multiple segments). For each output time series segment $S_{c,i}$, let $S_{a,u}$ and $S_{b,v}$ be the input segments that overlap with $S_{c,i}$.

Aggregation operator:

| Operator | Approximate output | Estimated error |
|---|---|---|
| Sum($T$, $l_s$, $l_e$) | $\sum_{i=u}^{v} \sum_{j=1}^{|S_i|} f_i(j)$ | $\sum_{i=u}^{v} L_i$ |

Figure 7: Formulas for estimating answer and error for the aggregation operator (multiple segments).

Algorithm 1: PlatoDB Query Processing
Input: Segment trees $I_1, \dots, I_m$, query $Q$, error budget $\varepsilon_{max}$ or time budget $t_{max}$
Output: Approximate answer $\hat{R}$ and error $\hat{\varepsilon}$
1 Access the roots of $I_1, \dots, I_m$;
2 Compute $\hat{R}$ and $\hat{\varepsilon}$ by using the compression functions and error measures of the currently accessed nodes (see Section 5 for details);
3 while $\hat{\varepsilon} > \varepsilon_{max}$ or elapsed time $< t_{max}$ do
4     Choose a node maximizing the error reduction and replace it with its children;
5     Update the current answer $\hat{R}$ and error $\hat{\varepsilon}$ using the compression functions and error measures of the currently accessed nodes;
6 Return ($\hat{R}$, $\hat{\varepsilon}$);

Algorithm Optimality. Given its greedy nature, one may wonder whether the query processing algorithm is optimal. To answer this question, we first have to define optimality. Since the aim of the query processing algorithm is to produce the lowest possible error in the fastest possible time (which can be approximated by the number of nodes that are accessed), we say that an algorithm is optimal if, for every possible query, set of segment trees, and error budget $\varepsilon_{max}$, it answers the query under the given budget while accessing no more nodes than any other possible algorithm. Since a comparison against every possible algorithm is hard, we restrict our attention to deterministic algorithms that access the segment trees in a top-down fashion (i.e., to access a node $N$, all its ancestor nodes must also be accessed). We denote this class of algorithms as $\mathcal{A}$. It turns out that no algorithm in $\mathcal{A}$ can be optimal, as the following theorem states:

THEOREM 2. There is no optimal algorithm in $\mathcal{A}$.

PROOF. Consider the following segment trees of two time series $T_1$ and $T_2$. The segment tree of $T_1$ is shown in Figure 8 and the segment tree of $T_2$ is a tree containing a single node. Now consider a query $Q$ over these two time series and an error budget $\varepsilon_{max} = h - 1$, where $h > 1$ is the height of $T_1$'s tree. Assume that the query error using the tree roots is $\varepsilon_{root} = 2h$. Also assume that whenever the query processing algorithm replaces a node by its children, the error for the query is reduced by $\frac{1}{2^h}$, with the exception of the shaded node, which, when replaced by its children, leads to an error reduction of $h + 1$. This means that the query processing algorithm can only terminate after accessing the children of the shaded node, as the query error in that case will be at most $2h - (h + 1) = h - 1$. Otherwise, the error estimated by the algorithm will be at least $2h - 2^h \left(\frac{1}{2^h}\right) = 2h - 1 > h - 1$, which exceeds the error budget and thus does not allow the algorithm to terminate. Since the shaded node can be placed at an arbitrary position in the tree, for every given deterministic algorithm we can place the shaded node in the tree so that the algorithm accesses the children of the shaded node only after it has accessed all the other nodes in the tree. However, this is suboptimal, as there is a way to access the children of the shaded node with fewer node accesses (i.e., by following the path from the root to the shaded node). Therefore, no algorithm in $\mathcal{A}$ is optimal.

Figure 8: Segment tree for Theorem 2.

As a result of the above theorem, PlatoDB's query processing algorithm cannot be optimal in general. However, we can show that it is optimal for segment trees that exhibit the following property: for every pair of nodes $N$ and $N'$ of the segment tree such that $N'$ is a descendant of $N$, the error reduction $\Delta\varepsilon(N)$ achieved by replacing $N$ with its children is greater than or equal to the error reduction $\Delta\varepsilon(N')$ achieved by replacing $N'$ with its children.
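To make the greedy step of Algorithm 1 and the error-reduction quantity used in this property concrete, here is a minimal sketch for the simplest setting: a Sum query over a single time series under an error budget. The names `Node`, `approx_sum`, and `greedy_sum` are hypothetical and not part of PlatoDB; the sketch assumes a binary segment tree in which every internal node has two children, takes the estimated error of Sum to be the sum of the $L$ measures of the currently used nodes, and leaves out the time-budget case.

```python
import heapq
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    approx_sum: float                 # sum of the compression function over the segment
    L: float                          # error measure L of the segment
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def reduction(self) -> float:
        """Error reduction achieved by replacing this node with its children."""
        if self.left is None or self.right is None:
            return 0.0
        return self.L - (self.left.L + self.right.L)

def greedy_sum(root: Node, eps_max: float):
    """Greedily refine the frontier until the estimated error meets eps_max.

    Returns (approximate answer, estimated error, number of nodes accessed).
    """
    answer, err, accessed = root.approx_sum, root.L, 1
    tie = 0                           # unique tie-breaker so nodes are never compared
    heap = []                         # max-heap on reduction, via negated keys
    if root.left is not None:
        heap.append((-root.reduction(), tie, root))
    while err > eps_max and heap:
        _, _, node = heapq.heappop(heap)
        # Replace the node by its two children in the current frontier.
        answer += node.left.approx_sum + node.right.approx_sum - node.approx_sum
        err -= node.reduction()
        accessed += 2
        for child in (node.left, node.right):
            if child.left is not None:
                tie += 1
                heapq.heappush(heap, (-child.reduction(), tie, child))
    return answer, err, accessed

# Hypothetical three-level tree; all numbers are made up for illustration.
right = Node(61.0, 5.0, Node(30.0, 1.0), Node(30.5, 1.0))
root = Node(100.0, 8.0, Node(40.0, 1.0), right)
result = greedy_sum(root, eps_max=3.0)
print(result)
```

In this toy tree the root is expanded first (error 8 drops to 6), then its right child (error drops to 3), at which point the budget is met after five node accesses. Keeping candidate nodes in a heap keyed by their reduction is one simple way to find the maximizing node at each step; it is a design choice of this sketch, not something the text prescribes.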
Such a tree is called a fine-error-reduction tree, and intuitively it guarantees that replacing any node leads to a greater or equal error reduction than replacing any of its descendants. If all trees satisfy the above property, PlatoDB's query processing algorithm is optimal:

THEOREM 3. In the presence of segment trees that are fine-error-reduction trees, PlatoDB's query processing algorithm is optimal.

| Operator | Incremental error update |
|---|---|
| Plus($T_a$, $T_b$) | $\hat{\varepsilon}' = \hat{\varepsilon} - (L_a - (L_{a.1} + L_{a.2}))$ |
| Minus($T_a$, $T_b$) | $\hat{\varepsilon}' = \hat{\varepsilon} - (L_a - (L_{a.1} + L_{a.2}))$ |
| Times($T_a$, $T_b$) | $\hat{\varepsilon}' = \hat{\varepsilon} - (\max(p_{b,1}, \dots, p_{b,k}) L_a - (\max(p_{b,1}, \dots, p_{b,i}) L_{a.1} + \max(p_{b,i}, \dots, p_{b,k}) L_{a.2}))$ |

Table 2: Incremental update of estimated errors for time series operators. $p_{b,i} \in \{d^*_{b,i}, f^*_{b,i}\}$.

Incremental Error Update. Having proven the optimality of the algorithm for fine-error-reduction trees, we next discuss an optimization that can be employed to speed up the algorithm. By studying the algorithm, it is easy to observe that as the algorithm moves from a set $\mathcal{N} = \{N_1, \dots, N_n\}$ of nodes to a set $\mathcal{N}' = \{N_1, \dots, N_{a-1}, N_{a.1}, N_{a.2}, N_{a+1}, \dots, N_n\}$ of nodes (by replacing node $N_a$ by its children $N_{a.1}$ and $N_{a.2}$), it recomputes the error using all nodes in $\mathcal{N}'$, although only the two nodes $N_{a.1}$ and $N_{a.2}$ have changed from the previous node set $\mathcal{N}$. This observation led to the incremental error update optimization of PlatoDB's query processing algorithm described next. Instead of recomputing from scratch the error of $\mathcal{N}'$ using all nodes, PlatoDB incrementally updates the error of $\mathcal{N}$ by using only the error measures of the newly replaced node $N_a$ and the newly inserted nodes $N_{a.1}$ and $N_{a.2}$. Let $(L_a, d^*_a, f^*_a)$, $(L_{a.1}, d^*_{a.1}, f^*_{a.1})$, and $(L_{a.2}, d^*_{a.2}, f^*_{a.2})$ be the error measures of nodes $N_a$, $N_{a.1}$, and $N_{a.2}$, respectively. Assume that the segments $S_{b,1}, \dots, S_{b,k}$ overlap with the segment of node $N_a$, the segments $S_{b,1}, \dots, S_{b,i}$ ($i \le k$) overlap with the segment of node $N_{a.1}$, and the segments $S_{b,i}, \dots, S_{b,k}$ overlap with the segment of node $N_{a.2}$. Then the estimated error $\hat{\varepsilon}'$ using nodes $N_{a.1}$ and $N_{a.2}$ can be incrementally computed from the error $\hat{\varepsilon}$ using node $N_a$ through the incremental error update formulas shown in Table 2 (see footnote 7).

Probabilistic Extension. While PlatoDB provides deterministic error guarantees, which as we discussed above are in many cases required, it is interesting to note that it can be easily extended to provide probabilistic error guarantees if needed. Most importantly, this can be done simply by changing the error measures computed for each segment from $(L, d^*, f^*)$ to $(\sigma_\varepsilon, \varepsilon^*, f^*)$, where $\sigma_\varepsilon$ is the variance of $d_i - f(i)$, and $\varepsilon^*$ is the maximal absolute value of $d_i - f(i)$. Then we can employ the Central Limit Theorem (CLT) [10] to bound the accurate error $\varepsilon$ by $Pr(\varepsilon \le \hat{\varepsilon}) \ge 1 - \alpha$, where $\alpha$ can be adjusted by the users to get different confidence levels.
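The change of per-segment error measures described here can be illustrated with a short sketch. The definitions of the deterministic measures belong to Section 4, which is not reproduced in this excerpt, so the sketch assumes that $L$ is the sum of absolute differences between the data points and the compression function, $d^*$ the largest absolute data value, and $f^*$ the largest absolute value of the compression function. The function names are hypothetical, indices are 0-based, and the use of the population variance (rather than the sample variance) is also an assumption.

```python
from statistics import pvariance

def deterministic_measures(data, f):
    """(L, d*, f*) for one segment, under the definitions assumed above."""
    approx = [f(i) for i in range(len(data))]
    L = sum(abs(d - a) for d, a in zip(data, approx))
    d_star = max(abs(d) for d in data)
    f_star = max(abs(a) for a in approx)
    return L, d_star, f_star

def probabilistic_measures(data, f):
    """(sigma_eps, eps*, f*): variance and largest absolute value of the
    residuals d_i - f(i), plus f* as in the deterministic case."""
    approx = [f(i) for i in range(len(data))]
    residuals = [d - a for d, a in zip(data, approx)]
    sigma_eps = pvariance(residuals)          # assumed: population variance
    eps_star = max(abs(r) for r in residuals)
    f_star = max(abs(a) for a in approx)
    return sigma_eps, eps_star, f_star

# One made-up segment compressed by a 0-degree polynomial (its average, 3.0).
segment = [1.0, 2.0, 3.0, 6.0]
average = lambda i: 3.0
det = deterministic_measures(segment, average)
prob = probabilistic_measures(segment, average)
print(det, prob)
```

Both functions read the same data and compression function; only the summary statistics kept per segment differ, which is why the change stays local to how segments are summarized.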
It is interesting that the rest of the system, including the hierarchical structure of the segment tree and the tree navigation algorithm employed at query processing time, does not need to be modified. In our future work we plan to further explore this probabilistic extension and compare it to existing approximate query answering techniques with probabilistic guarantees.

Footnote 7: The SeriesGen operator is omitted, since its input is not a time series and as a result there is no segment tree associated with its input.

7. EXPERIMENTAL EVALUATION

To evaluate PlatoDB's performance and verify our hypothesis that PlatoDB is able to provide significant savings in the query processing of sensor data, we are conducting experiments on real sensor data. We present here early data points that we have discovered.

Datasets. For our preliminary experiments, we used two real sensor datasets:

1. Intel Lab Data (ILD) (footnote 8). Smart home data (humidity and temperature) collected at 31-second intervals from 54 sensors deployed at the Intel Berkeley Research Lab between February 28th and April 5th. The dataset contains about 2.3 million tuples (i.e., 4.6 million sensor readings in total).
2. EPA Air Quality Data (AIR) (footnote 9). Air quality data collected at hourly intervals from about 1000 sensors from January 1st 2000 to April 1st. The dataset contains about 133 million tuples (i.e., 266 million sensor readings in total).

From each dataset we extracted multiple time series, each corresponding to a single attribute of the dataset: Humidity and Temperature for ILD, and Ozone and SO2 for AIR. We then used PlatoDB to create the corresponding segment tree for each time series and to answer queries over them.

Experimental platform. All experiments were performed on a computer with a 4th generation Intel processor (4 × 32 KB L1 data cache, KB L2 cache, 8 MB shared L3 cache, 4 physical cores, 3.6 GHz) and 16 GB RAM, running Ubuntu. All the algorithms were implemented in C++ and compiled with g++ using -O3 optimization. All data was stored in main memory.
7.1 Experimental Results

In our preliminary evaluation, we measured two quantities: first, the size of the segment tree created by PlatoDB, since this segment tree is stored in main memory, and second, the query processing performance of PlatoDB compared to a system that answers queries using the entirety of the raw sensor data. In our future work, we will be conducting a more thorough evaluation of the system. We next present our preliminary results.

| Dataset | # Tuples | Raw data | Segment tree (0-degree) | Segment tree (1-degree) |
|---|---|---|---|---|
| ILD | 2,313, | MB | 0.14 MB | 0.67 MB |
| AIR | 133,075, | GB | 4.37 MB | 8.11 MB |

Table 3: Raw data and segment tree sizes.

Segment tree size. Table 3 shows the size of the raw data and the combined size of the segment trees built for all the time series extracted from the ILD and AIR datasets (see footnote 10). We experimented with two different compression functions, resulting in different segment tree sizes: a 0-degree polynomial (corresponding to the Piecewise Aggregate Approximation [21], where each value within a segment is approximated through the average of the values in the segment) and a 1-degree polynomial (corresponding to the Piecewise Linear Approximation [19], where each segment is approximated through a line). As shown, the segment trees are significantly smaller than the raw sensor data (about 0.40% to 1.90% and 0.22% to 0.40% of the raw data size for the ILD and AIR datasets, respectively). As a result, the segment trees of the time series can be easily kept in main memory, even when the system stores a large number of time series.

Footnote 10: To make a fair comparison, the raw data size refers only to the combined size of the attributes used in the time series and does not include other attributes that exist in the original dataset (such as location codes etc.).

Figure 9: Query processing performance for correlation query (time shown in ms). Panels: (a) ILD, (b) AIR. Axes: Error (%) and Time Cost (ms). Series: Exact, ApproPlato-0, ApproPlato-1.

Query processing performance. We next compared the query processing performance of PlatoDB against a baseline, which is a custom in-memory algorithm that computes the exact answer of the queries using the raw data. To compare the systems, we measured the time required to process a correlation query between two time series (i.e., correlation(Humidity, Temperature) in ILD and correlation(Ozone, SO2) in AIR) with a varying error budget (ranging from 5% to 25%). Figure 9 shows the resulting times for each of the two datasets. Each graph depicts the performance of three systems: Exact, which is the baseline method of answering queries over the raw data, and PlatoDB-0 and PlatoDB-1, which are instances of PlatoDB using the 0-degree and 1-degree polynomial compression functions, as explained above.

By studying Figure 9, we can make the following observations. Both instances of PlatoDB outperform Exact by one to three orders of magnitude, depending on the provided error budget. In contrast to Exact, which always uses the entire raw dataset to compute exact query answers, PlatoDB allows the user to select the appropriate tradeoff between time spent in query processing and resulting error by specifying the desired error budget. The system adapts to the budget by providing faster responses as the allowed error budget increases. Notably, PlatoDB remains significantly faster than Exact even for small error budgets. In particular, PlatoDB is over 9× and 37× faster than Exact when the error is 5% in ILD and AIR, respectively.

In summary, our preliminary results show that PlatoDB has significant potential for speeding up the processing of ad hoc queries over large amounts of sensor data, as it outperforms exact query processing algorithms in many cases by several orders of magnitude. Moreover, it provides such speedups together with deterministic error guarantees, in contrast to existing sampling-based approximate query answering approaches that provide only probabilistic guarantees, which may not hold in practice.
Despite the difference in guarantees, in our future work we will be conducting a more thorough evaluation of the system, comparing it also against sampling-based systems.

8. RELATED WORK

Approximate query answering has been the focus of an extensive body of work, which we summarize next. However, to the best of our knowledge, this is the first work that provides deterministic guarantees for aggregation queries over multiple time series.

Approximate query answering with probabilistic error guarantees. Most of the existing work on approximate query processing has focused on using sampling to compute approximate query answers by appropriately evaluating the queries on small samples of the data [17, 1, 37, 2, 26]. Such approaches typically leverage statistical inequalities and the central limit theorem to compute the confidence interval or variance of the computed approximate answer. As a result, their error guarantees are probabilistic. While probabilistic guarantees are often sufficient, they are not suitable for scenarios where one wants to be certain that the answer will fall within a certain interval. Note that, as discussed in Section 6, PlatoDB can also be extended to provide probabilistic guarantees when deterministic guarantees are not required, simply by modifying the error measures computed for each segment.

A special form of sampling-based methods are online aggregation approaches, which provide a continuously improving query answer, allowing users to stop the query evaluation when they are satisfied with the resulting error [15, 7, 26]. With its hierarchical segment tree, PlatoDB can support the online aggregation paradigm while providing deterministic error guarantees.

Approximate query answering with deterministic error guarantees. Approximately answering queries while providing deterministic error guarantees has so far received only very limited attention [31, 24, 30]. Existing work in the area has focused on simple aggregation queries that involve a single relational table.
In contrast, PlatoDB provides deterministic error guarantees on queries that may involve multiple time series (each of which can be thought of as a single relational table), enabling the evaluation of many common statistics that span tables, such as correlation, cross-correlation, and others.

Approximate query answering over sensor data. Moreover, PlatoDB is one of the first approximate query answering systems that leverage the fact that sensor data are not random but follow a usually smooth underlying phenomenon. The majority of existing works on approximate query answering looked at general relational data. The ones that did study approximate query processing for sensor data focused on the networking aspect of the problem, studying how aggregate queries can be efficiently evaluated in a distributed sensor network [25, 8, 9]. While these works focused on the networking aspect of sensor data, our work focuses on the continuous nature of the sensor data, which it leverages to accelerate query processing even in a single machine scenario, where historical sensor data already accumulated on the machine have to be analyzed.

Data summarizations. Last but not least, there has been extensive work on creating summarizations of sensor data. Work in this area has come mostly from two different communities: the database community [16, 30, 27, 35] and the signal processing community [21, 20, 19, 5, 11]. The database community has mostly focused on creating summarizations (also referred to as synopses or sketches) that can be used to answer specific queries. These include, among others, histograms [16, 30, 12, 29] (e.g., EquiWidth and EquiDepth histograms [28], V-Optimal histograms [16], Hierarchical Model Fitting (HMF) histograms [36], and Compact Hierarchical Histograms (CHH) [32]), as well as sampling methods [14, 6], used among others for cardinality estimation [16] and selectivity estimation [30]. In contrast to such special-purpose approaches, PlatoDB supports a large class of queries over arbitrary sensor data.
The signal processing community, on the other hand, produced a variety of methods that can be used to compress time series data. These include, among others, the Piecewise Aggregate Approximation (PAA) [21], the Adaptive Piecewise Constant Approximation (APCA) [20], the Piecewise Linear Representation (PLR) [19], the Discrete Wavelet Transform (DWT) [5], and the Discrete Fourier Transform (DFT) [11]. However, this community has not been concerned with how such compression techniques can be used to answer general queries. PlatoDB's modular architecture allows the easy incorporation of such techniques as compression functions, which are then automatically leveraged by the system to enable approximate answering of a large number of queries with deterministic error guarantees.

9. CONCLUSION
