SAO: A Stream Index for Answering Linear Optimization Queries

Gang Luo, Kun-Lung Wu, Philip S. Yu
IBM T.J. Watson Research Center
{luog, klwu, psyu}@us.ibm.com

Abstract

Linear optimization queries retrieve the top-K tuples in a sliding window of a data stream that maximize/minimize the linearly weighted sums of certain attribute values. To efficiently answer such queries against a large relation, an onion index was previously proposed to properly organize all the tuples in the relation. However, such an onion index does not work in a streaming environment due to fast tuple arrival rate and limited memory. In this paper, we propose a SAO index to approximately answer arbitrary linear optimization queries against a data stream. It uses a small amount of memory to efficiently keep track of the most important tuples in a sliding window of a data stream. The index maintenance cost is small because the great majority of the incoming tuples do not cause any changes to the index and are quickly discarded. At any time, for any linear optimization query, we can retrieve from the SAO index the approximate top-K tuples in the sliding window almost instantly. The larger the amount of available memory, the better the quality of the answers. More importantly, for a given amount of memory, the quality of the answers can be further improved by dynamically allocating a larger portion of the memory to the outer layers of the SAO index. We evaluate the effectiveness of this SAO index through a prototype implementation.

1. Introduction

Data stream applications are becoming popular. Many such applications use various linear optimization queries [2, 3, 4] to retrieve the (approximate) top-K tuples that maximize or minimize the linearly weighted sums of certain attribute values. For example, in environmental epidemiological applications, various linear models that incorporate remotely sensed images, weather information, and demographic information are used to predict the outbreak of certain environmental epidemic diseases, like Hantavirus Pulmonary Syndrome [3].
In oil/gas exploration applications, linear models that incorporate drill sensor measurements and seismic information are used to guide the drilling direction [4]. In financial applications, linear models that incorporate personal credit history, income level, and employment history are used to evaluate credit risks for loan approvals [3].

In all the above applications, data continuously stream in (say, from satellites and sensors) at a rapid rate. Users frequently pose linear optimization queries and want answers back as soon as possible. Moreover, different individuals may pose queries that have divergent weights and K's. This is because the optimal weights may vary from one location to another (in oil/gas exploration), the weights may be adjusted as the model is continually trained with historical data collected more recently (in environmental epidemiology and finance), and different users may have differing preferences.

In a read-mostly environment, Chang et al. [2] first proposed an onion index to speed up the evaluation of linear optimization queries against a large database relation. An onion index organizes all the tuples in the database relation into one or more convex layers, where each convex layer is a convex hull. For each $i \ge 1$, the $(i+1)$th convex layer is contained within the $i$th convex layer. For any linear optimization query, to find the top-K tuples, we need to search no more than all the vertices of the first K outer convex layers in the onion index.

However, due to the extremely high cost of computing precise convex hulls, both the creation and the maintenance of the onion index are rather expensive. Moreover, an onion index requires lots of storage because it keeps track of all the tuples in a relation. In a streaming environment, tuples keep arriving rapidly while available memory is limited. Hence, it is impossible to maintain a precise onion index for a data stream, let alone use it to provide exact answers to linear optimization queries.

To address these problems, we propose a SAO (Stream Approximate Onion-like structure) index for a data stream.
The index provides high-quality, approximate answers to arbitrary linear optimization queries almost instantly. Our key observation is that the precise onion index typically contains a large number of convex layers, but most inner layers are not needed for answering linear optimization queries. Hence, the SAO index maintains only the first few outer convex layers. Moreover, each layer in the SAO index keeps only some of the most important vertices rather than all the vertices. As a result, the amortized maintenance cost of a SAO index is rather small because the great majority of the incoming tuples, more than 95% in most cases, do not cause any changes to the index and are quickly discarded, even though individual inserts or deletes might have non-trivial costs.

A key challenge in designing a SAO index is: for a given amount of memory, how do we properly allocate it among the layers so that the quality of the answers can be maximized? To do that, a dynamic, error-minimizing storage allocation strategy is used so that a larger portion of the available memory tends to be allocated to the outer layers than to the inner layers. In this way, both storage and maintenance overheads of the SAO index are greatly reduced. More importantly, the errors introduced into the approximate answers are also minimized.

With limited memory and continually arriving tuples, there are intrinsic errors in any stream application. It is difficult to provide an upper bound on these errors for linear optimization queries because the amount of inaccuracy depends on the specific sequence of tuples in a stream. However, in practice, the exact errors can be measured based on stream traces. As shown in the experiments conducted in this paper, the actual errors are relatively minor (often less than 1%) even if the SAO index holds only a tiny fraction (less than 0.1%) of the tuples in the sliding window. This is because, statistically, only a few tuples cause errors. Moreover, the impact of any error, no matter how large it may be, disappears immediately once the tuple causing the error has moved out of the sliding window.

For some stream applications, the linear optimization queries are known in advance and the entire history, not just a sliding window, of the stream is considered. In this case, for each query, an in-memory materialized view can be maintained to continuously keep track of the top-K tuples. However, if there are many such queries, it may not be feasible to keep all these materialized views in memory and/or to maintain them in real time. As a consequence, the SAO index method is still needed under such circumstances.

We implemented the SAO index by modifying the widely used Qhull package [1]. Our experimental results show that the SAO index can handle high tuple arrival rates, be maintained efficiently in real time, and provide high-quality answers to linear optimization queries almost instantly.

2. Review of the Traditional Onion Index

We briefly review the earlier onion index [2] for linear optimization queries against a large database relation. Suppose each tuple contains $n$ numerical feature attributes and $m \ge 0$ other non-feature attributes. A top-K linear optimization query asks for the top-K tuples that maximize the following linear equation:

$\max_{\text{top } K} \left\{ \sum_{i=1}^{n} w_i a_i^j \right\}$,

where $(a_1^j, a_2^j, \ldots, a_n^j)$ is the feature attribute vector of the $j$th tuple and $(w_1, w_2, \ldots, w_n)$ is the weighting vector of the query. Some $w_i$'s may be zero.
Here, $v_j = \sum_{i=1}^{n} w_i a_i^j$ is called the linear combination value of the $j$th tuple. A set of tuples $S$ can be mapped to a set of points in an $n$-dimensional space according to their feature attribute vectors. For a top-K linear optimization query, the top-K tuples are those K tuples with the largest projection values along the query direction.

The onion index in [2] organizes all the tuples into one or more convex layers. The first convex layer $L_1$ is the convex hull of all the tuples in $S$. The vertices of $L_1$ form a set $S_1 \subseteq S$. For each $i > 1$, the $i$th convex layer $L_i$ is the convex hull of all the tuples in $S - \bigcup_{j=1}^{i-1} S_j$. The vertices of $L_i$ form a set $S_i \subseteq S - \bigcup_{j=1}^{i-1} S_j$. It is easy to see that for each $i \ge 1$, $L_{i+1}$ is contained within $L_i$. According to linear programming theory, we have the following property [2]:

Property 1: For any linear optimization query, suppose all the tuples are sorted in descending order of their linear combination values ($v_j$). The tuple that is ranked $k$th in the sorted list is called the $k$th largest tuple. Then the largest tuple is on $L_1$. The second largest tuple is on either $L_1$ or $L_2$. In general, for any $i \ge 1$, the $i$th largest tuple is on one of the first $i$ outer convex layers.

Given a top-K linear optimization query, the search procedure of the onion index starts from $L_1$ and searches the convex layers one by one. On each convex layer, all its vertices are checked. Based on Property 1, the search procedure can find the top-K tuples by searching no more than the first K outer convex layers.

3. SAO Index

The original onion index [2] keeps track of all the tuples, requiring lots of storage. Maintaining the original onion index is also computationally costly, making it difficult to meet the real-time requirement of data streams. To address these problems, we propose a SAO index for linear optimization queries against a data stream. Our key idea is to reduce both the index storage and maintenance overheads by keeping only a subset of the tuples in a data stream in the SAO index.

We focus on the count-based sliding window model for data streams, with $W$ denoting the sliding window size.
That is, the tuples under consideration are the last $W$ tuples that we have seen. Our techniques can be easily extended to the case of time-based sliding windows or the case where the entire history of the stream is considered.

Suppose the available memory can hold $M+1$ tuples. In the steady state, no more than $M$ tuples are kept in the SAO index. That is, the storage budget is $M$ tuples. In a transition period, $M+1$ tuples can be kept in the SAO index temporarily. Our techniques can be extended to the case where memory is measured in bytes.

In general, a tuple contains both feature and non-feature attributes. We are interested in finding all the attributes of the top-K tuples. Hence, all the attributes of those tuples in the SAO index are kept in memory. Even if the convex hulls for the feature attributes occupy only a small amount of space, the non-feature attributes may still dominate the storage requirement. For example, in the earlier-mentioned environmental epidemiology application, each tuple has a large non-feature image attribute, which is also kept in memory. Note that the image cannot be stored on disk, even if we would like to do so, because the tuple arrival rate can be too high for even the fastest disk to keep up with the rapidly arriving tuples.

Our design principle is as follows. To provide high-quality answers to linear optimization queries, the SAO index carefully controls the number of tuples on each layer. It dynamically allocates a proper amount of storage to individual layers so that a larger portion of the available memory tends to be allocated to the outer layers. As such, the quality of the answers can be maximized without increasing the storage requirement. In case of overflow, the SAO index keeps the most important tuples and throws away the less important ones. Moreover, to minimize the computation overhead, the creation and maintenance algorithms of the SAO index are optimized.

3.1 Index Organization

The SAO index is based on a key observation: an onion index typically contains a large number of convex layers, but most inner layers are not needed for answering the majority of linear optimization queries. The SAO index keeps only the first $L$ outer convex layers, where $L$ is specified by the user creating the index. Since $M$ is limited, the SAO index cannot always keep the precise first $L$ outer convex layers. Therefore, for each of the first $L$ outer convex layers, the SAO index may only keep some of the most important tuples rather than all the tuples belonging to that layer. In other words, each layer in the SAO index is an approximate convex layer (ACL) in the sense that it is an approximation to the corresponding precise convex layer in the onion index. For each $i$ ($1 \le i \le L$), $L_i$ is used to denote the $i$th ACL.

The SAO index maintains the following properties. Each ACL is the convex hull of all the tuples on that layer. For each $i$ ($1 \le i \le L-1$), $L_{i+1}$ is contained within $L_i$. Also, the total number of tuples on all ACLs is no more than $M$.

All the tuples in the SAO index are kept as a sorted, doubly-linked list dl. The sorting criterion is a tuple's remaining lifetime. The first tuple in dl is going to expire the soonest. In this way, we can quickly check whether any tuple in the SAO index expires, which is needed at Step 2 of Section 3.5 below.
Also, we can easily delete tuples that are in the middle of dl, which is necessary when the available memory is exhausted and a tuple needs to be deleted from the SAO index (see Section 3.3 below).

For each ACL, a standard convex hull data structure is maintained. The vertices of the convex hull point to tuples in dl. Also, each tuple $t$ in dl has a label indicating the ACL to which tuple $t$ belongs. This label is used when a tuple expires and needs to be removed from the corresponding ACL (see Section 3.5 below).

3.2 Allocating Proper Memory to Each Layer

A key challenge in designing a SAO index is: for a given amount of memory, how do we properly allocate the memory to each layer so that the quality of the answers can be maximized?

A Simple, Uniform Storage Allocation Strategy

A simple storage allocation strategy is to divide the storage budget $M$ evenly among all $L$ ACLs. Each ACL cannot keep more than $M/L$ tuples. However, this simple, uniform method is far from optimal. In order to provide high-quality answers to linear optimization queries, the SAO index should allocate more tuples to the outer ACLs than to the inner ACLs, as outer ACLs contain the largest tuples.

Static, Error-Minimizing Storage Allocation

Now we describe a static, error-minimizing storage allocation strategy for when the resource is limited. We determine the optimal numbers of tuples the SAO index should allocate to the $L$ ACLs. By the resource being limited, we mean that each ACL needs more tuples than can actually be allocated to it. For each $i$ ($1 \le i \le L$), let $N_i$ denote the optimal number of tuples that should be allocated to $L_i$. Then

$\sum_{i=1}^{L} N_i = M$. (1)

Consider a top-$L$ linear optimization query. For each $i$ ($1 \le i \le L$), let $t_i$ represent the exact $i$th largest tuple, and $t_i'$ represent the $i$th largest tuple that is found in the SAO index. Here, $v_i$ is the linear combination value of $t_i$, and $v_i'$ is the linear combination value of $t_i'$. The relative error of $t_i'$ is defined as

$e_i = (v_i - v_i') / v_i$. (2)

For the top-$L$ tuples ($t_i'$) that are returned by the SAO index, a weighted mean of their relative errors is used as the performance metric $e$:

$e = \sum_{i=1}^{L} u_i e_i \Big/ \sum_{i=1}^{L} u_i$, (3)

where $u_i$ is the weight of $e_i$. Intuitively, the higher the rank of a tuple $t_i$, the more important its relative error. Hence, $u_i$ should be a non-increasing function of $i$.

We would like to minimize the mean of $e$ over all top-$L$ linear optimization queries, which is how the $N_i$'s are derived. Our idea is to represent the mean of $e$ as a function of the $N_i$'s and find its minimal value under condition (1). According to our derivation, whose details are in [5], we can show that

$N_i \propto C_i$, where $C_i = \sum_{j=i}^{L} 1/j^2$. (4)

3.3 Dynamic, Error-Minimizing Storage Allocation

The real world is not static. At any time, some ACLs may need more than $N_i$ tuples while other ACLs may need fewer than $N_i$ tuples. As tuples keep entering and leaving the sliding window, the storage requirements of different ACLs change continuously. To ensure the best quality of the answers, the SAO index needs to utilize the storage budget $M$ as fully as possible by doing dynamic storage allocation.

Our design principle is: whenever possible, the storage budget $M$ is used up. At the same time, the SAO index tries its best to maintain the condition that the number of tuples on $L_i$ is proportional to $C_i$.
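As a rough illustration of the static allocation in (4), the sketch below turns the proportionality $N_i \propto C_i$ into integer quotas that sum to $M$. It assumes that $C_i = \sum_{j=i}^{L} 1/j^2$ is the intended reading of (4). The function names and the largest-remainder rounding are assumptions of this sketch; the paper does not say how fractional quotas are rounded.

```python
from fractions import Fraction


def layer_weights(num_layers):
    """Return [C_1, ..., C_L] with C_i = sum_{j=i}^{L} 1/j^2 (assumed form of (4))."""
    weights = []
    tail = Fraction(0)
    # Accumulate the tail sum from the innermost layer outward.
    for j in range(num_layers, 0, -1):
        tail += Fraction(1, j * j)
        weights.append(tail)
    weights.reverse()
    return weights


def allocate(budget, num_layers):
    """Split a budget of M tuples into integer quotas N_i proportional to C_i.

    Rounding uses the largest-remainder rule, which is a choice made for
    this sketch only.
    """
    weights = layer_weights(num_layers)
    total = sum(weights)
    exact = [budget * w / total for w in weights]
    quotas = [int(x) for x in exact]  # floor, since all values are non-negative
    leftover = budget - sum(quotas)
    # Give the remaining tuples to the layers with the largest fractional parts.
    order = sorted(range(num_layers),
                   key=lambda i: exact[i] - quotas[i],
                   reverse=True)
    for i in order[:leftover]:
        quotas[i] += 1
    return quotas


if __name__ == "__main__":
    print(allocate(500, 4))
```

Under these assumptions, the outermost layer receives by far the largest share, which matches the stated goal of favoring outer ACLs over inner ones.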

The concrete method is as follows. For each $i$ ($1 \le i \le L$), let $M_i$ denote the number of tuples on $L_i$. The SAO index continuously monitors these $M_i$'s. At any time, there are two possible cases. In the first case, $\sum_{i=1}^{L} M_i \le M$. This is a normal case and nothing needs to be done, as the storage budget $M$ has not been exceeded. In the second case, $\sum_{i=1}^{L} M_i = M + 1$. (According to Section 3.5, $\sum_{i=1}^{L} M_i$ can never be larger than $M+1$.) This is an overflow case, as the storage budget $M$ is exceeded by one. We need to pick a candidate ACL and delete one tuple from it.

Choosing a Candidate Approximate Convex Layer

We first discuss how to choose the candidate ACL. For each $i$ ($1 \le i \le L$), let $r_i = M_i / N_i$. We pick $j$ such that $r_j = \max\{ r_i \mid r_i > 1, 1 \le i \le L \}$. This $j$ must exist. Otherwise, for all $i$ ($1 \le i \le L$), $r_i \le 1$. This leads to $\sum_{i=1}^{L} M_i \le \sum_{i=1}^{L} N_i = M$, which conflicts with the condition that $\sum_{i=1}^{L} M_i = M + 1$. $L_j$ is chosen as the candidate ACL.

Choosing a Candidate Tuple

Now one candidate tuple needs to be deleted from the candidate ACL $L_j$. Intuitively, this candidate tuple $t$ should have a close neighbor so that deleting $t$ will have little impact on the shape of $L_j$. Two tuples on an ACL are neighbors if they are connected by an edge. For any tuple $t$ on $L_j$, let $R_t$ denote the Euclidean distance between tuple $t$ and its nearest neighbor on $L_j$. The candidate tuple is chosen to be the tuple that has the smallest $R_t$ (usually there is a pair of such tuples, and the older one, i.e., the sooner-to-expire one, is picked).

Deleting the Candidate Tuple

Finally, after choosing the candidate tuple $t$, we use the method that is described in Step 2 of Section 3.5 below to delete $t$ from $L_j$ and then adjust the affected ACLs.

3.4 Index Creation

At the beginning, the SAO index is empty. We keep receiving new tuples until there are $M$ tuples. Then a standard convex hull construction algorithm is used to create the $L$ ACLs in batch. From then on, each time a new tuple arrives, we use the method in Section 3.5 to incrementally maintain the SAO index.

3.5 Index Maintenance

In a typical data streaming environment, we expect that $W \gg M$, i.e., only a small fraction of all $W$ tuples in the sliding window are stored in the SAO index.
Intuitively, this means that tuples on the ACLs can be regarded as anomalies. The smaller the $i$ ($1 \le i \le L$) is, the more anomalous the tuples on $L_i$ are. As a result, we have the following heuristic (not exact) property:

Property 2: Most new tuples are normal tuples and thus inside $L_L$. Moreover, a new tuple $t$ is most likely to be inside $L_L$. Less likely is tuple $t$ between $L_{L-1}$ and $L_L$, and even less likely is tuple $t$ between $L_{L-2}$ and $L_{L-1}$, etc.

According to our storage allocation strategy described in Sections 3.2 and 3.3, the inner ACLs tend to have fewer tuples than the outer ACLs. From the computational geometry literature, it is known that given a point $p$, the complexity of checking whether $p$ is inside a convex polytope $P$ increases with the number of vertices of $P$. Therefore, we have the following property:

Property 3: For a tuple $t$, it is typically faster to check whether $t$ is inside an inner ACL than to check whether $t$ is inside an outer ACL.

Upon the arrival of a new tuple $t$, Properties 2 and 3 are used to reduce the SAO index maintenance overhead. We proceed in three steps.

Step 1: Tuple Insertion

All ACLs are checked one by one, starting from $L_L$. From Properties 2 and 3 together with the procedure described below, it can be seen that this checking direction is the most efficient one. There are two possible cases.

In the first case, tuple $t$ is inside $L_L$. According to Property 2, this is the most likely case. Also, according to Property 3, it can be discovered quickly whether tuple $t$ is inside $L_L$. In this first case, tuple $t$ will not change any of the ACLs and thus can be thrown away immediately. Since no new tuple is introduced into the SAO index, there will be no memory overflow. Hence, Step 3 can be skipped, although Step 2 still needs to be performed. Note: if $L_L$ is empty, we consider tuple $t$ to be outside of $L_L$.

In the second case, a number $k$ ($1 \le k \le L$) can be located such that tuple $t$ is inside $L_{k-1}$ but outside of $L_k$. (If $k = 1$, tuple $t$ is outside of all ACLs.) In this case, tuple $t$ needs to be inserted into the SAO index in a way similar to how the onion index is maintained [2].
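The inner-to-outer checking order of Step 1 can be sketched as follows. This is a simplified two-dimensional illustration, not the Qhull-based implementation described in the paper. It assumes each ACL is given as a list of vertices in counter-clockwise order, with `layers[0]` as the outermost layer; the names `inside_convex_polygon` and `locate_layer` are hypothetical. Treating a layer with fewer than three vertices as "outside" extends the paper's rule for an empty layer and is an assumption of this sketch.

```python
def inside_convex_polygon(vertices, p):
    """Return True if point p lies inside or on a convex polygon.

    vertices must be in counter-clockwise order. A layer with fewer than
    three vertices is treated as empty, so p counts as outside (assumption).
    """
    n = len(vertices)
    if n < 3:
        return False
    for idx in range(n):
        ax, ay = vertices[idx]
        bx, by = vertices[(idx + 1) % n]
        # Negative cross product: p is strictly to the right of edge a->b.
        cross = (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)
        if cross < 0:
            return False
    return True


def locate_layer(layers, p):
    """Check layers from the innermost (L_L) outward.

    Returns None if p is inside the innermost layer (the tuple is discarded).
    Otherwise returns the 1-based k such that p is outside L_k but inside
    L_{k-1}; k == 1 means p is outside all layers.
    """
    k = None
    for i in range(len(layers), 0, -1):  # i = L, L-1, ..., 1
        if inside_convex_polygon(layers[i - 1], p):
            break
        k = i
    return k


def square(h):
    """Axis-aligned square of half-width h, counter-clockwise (test helper)."""
    return [(-h, -h), (h, -h), (h, h), (-h, h)]


if __name__ == "__main__":
    layers = [square(5), square(3), square(1)]  # L_1, L_2, L_3
    for point in [(0, 0), (2, 0), (4, 0), (20, 0)]:
        print(point, locate_layer(layers, point))
```

Because most arriving tuples fall inside the innermost layer, the loop usually stops after a single containment test against the layer with the fewest vertices.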
Step 2: Tuple Expiration

The arrival of tuple $t$ will cause at most one tuple in the SAO index to expire from the sliding window. Let $t'$ denote the first tuple in the doubly-linked list dl, which is the only tuple in the SAO index that may expire from the sliding window. There are two possible cases. In the first case, tuple $t'$ has not expired. We proceed to Step 3 directly. In the second case, tuple $t'$ has expired and thus needs to be deleted from the SAO index in a way similar to how the onion index is maintained [2].

Step 3: Handling Memory Overflow

In the above two steps, at most one new tuple is introduced into the SAO index while one or more tuples may be deleted (e.g., tuples may get expelled from $L_L$ in Step 1). Now we check whether the condition $\sum_{i=1}^{L} M_i \le M$ still holds. If not, $\sum_{i=1}^{L} M_i = M + 1$ must be true. In this case, we use the procedure that is described in Section 3.3 to delete one tuple from the SAO index.

3.6 Query Evaluation

To provide an approximate answer to a top-K linear optimization query (K can be larger than $L$), we use the onion index search procedure that is described in [2].

4. Performance Evaluation

We implemented our techniques by modifying the widely used Qhull (version 2003.1) software package [1], which implements efficient constructs for the creation of convex hulls. Our measurements were performed on a computer with one 1.6GHz processor, 1GB main memory, and one 75GB disk, running Microsoft Windows XP.

Our evaluation used the real data set in the 2005 UC data mining competition [6]. Among all the attributes, we used the three ($n = 3$) attributes that carried the most information (i.e., had the largest number of distinct values) as the feature attributes. In each experiment, after the system had run long enough to reach a steady state, we posed 100 top-K linear optimization queries, whose weights were uniformly distributed between -1 and 1. As in Section 3.2, for each top-K linear optimization query, the weighted relative error $e$ is defined as $e = \sum_{i=1}^{K} u_i e_i \Big/ \sum_{i=1}^{K} u_i$, where $u_i = 1/i$ ($1 \le i \le K$). The following three performance metrics were used:

(1) Max error: The maximum $e$ observed for the 100 top-K linear optimization queries.
(2) Avg error: The average $e$ observed for the 100 top-K linear optimization queries.
(3) Throughput: In the steady state, the average number of tuples that can be processed per second.

In the rest of this section, by default the sliding window size is $W = 1{,}000{,}000$. The SAO index contains $L = 4$ ACLs. The number of top tuples is $K = 10$. The storage budget $M$ is 500 tuples. The dimensionality $n$ (i.e., the number of feature attributes) of the data set is 3.

We varied the storage budget $M$ from 200 tuples to 500 tuples.
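The weighted relative error used as the evaluation metric can be computed as in the sketch below. It assumes the weights $u_i = 1/i$ and the per-rank relative error $e_i = (v_i - v_i')/v_i$, that both input lists are already sorted by rank, and that every exact value $v_i$ is non-zero. The function name is hypothetical.

```python
def weighted_relative_error(exact_values, approx_values):
    """Weighted mean of per-rank relative errors with weights u_i = 1/i.

    exact_values[i-1] is v_i, the linear combination value of the exact
    i-th largest tuple; approx_values[i-1] is v_i', the value of the i-th
    largest tuple found in the index. Assumes every v_i is non-zero.
    """
    if len(exact_values) != len(approx_values):
        raise ValueError("both lists must hold K values")
    k = len(exact_values)
    weights = [1.0 / i for i in range(1, k + 1)]
    errors = [(v - va) / v for v, va in zip(exact_values, approx_values)]
    return sum(u * e for u, e in zip(weights, errors)) / sum(weights)


if __name__ == "__main__":
    # Rank 1 is exact; rank 2 is off by 25%.
    print(weighted_relative_error([10.0, 8.0], [10.0, 6.0]))
```

With these weights, an error at rank 1 counts twice as much as the same error at rank 2, reflecting the requirement that $u_i$ be non-increasing in $i$.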
We used the dynamic storage allocation strategy described in Section 3.3 and compared the following two methods of computing the $N_i$'s:

(1) Error-minimizing method: This is the method described in Section 3.2. The $N_i$'s are computed according to (4).
(2) Uniform method: $N_i = M/L$.

Figure 1 shows the impact of the M/W ratio on the weighted relative error. For any M/W ratio, both the max error and the avg error are much smaller for the error-minimizing method. Namely, the error-minimizing method works better than the uniform one.

[Figure 1. Weighted relative error vs. storage budget. x-axis: M/W (0.02% to 0.05%); y-axis: weighted relative error (0% to 25%); series: avg error and max error for the error-minimizing and uniform methods.]

[Figure 2. Throughput vs. storage budget. x-axis: M/W (0.02% to 0.05%); y-axis: throughput (tuples/second), logarithmic scale.]

Both the max error and the avg error decrease as the M/W ratio increases. When M/W=0.05%, the avg error is 0.5% while the max error is 3%, both fairly small. In other words, even with a storage budget that is only a very small fraction of the sliding window size, the SAO index can provide fairly accurate answers.

Figure 2 shows the impact of the M/W ratio on the throughput. The y-axis uses a logarithmic scale. When M/W=0.05%, the throughput is over 100,000 tuples/second.

The full version of this paper [5] includes additional results and algorithm details. In summary, in all our experiments, using the SAO index to answer a top-K linear optimization query always takes less than 0.00002 second. Using a fairly small amount of memory space, the SAO index provides good approximate answers to top-K linear optimization queries, and the quality of approximate answers improves with the amount of available memory. Even under a rapid tuple arrival rate, the SAO index can still be maintained efficiently in real time. Hence, the SAO index can provide good support for answering linear optimization queries against a data stream.

References

[1] C.B. Barber, D.P. Dobkin, and H. Huhdanpaa. The Quickhull Algorithm for Convex Hulls. ACM Trans. Math. Softw. 22(4): 469-483, 1996.
[2] Y.C. Chang, L.D. Bergman, and V. Castelli et al. The Onion Technique: Indexing for Linear Optimization Queries. SIGMOD Conf. 2000: 391-402.
[3] C.S. Li, Y.C. Chang, and L.D. Bergman et al. Model-Based Multi-modal Information Retrieval from Large Archives. ICDCS International Workshop of Knowledge Discovery and Data Mining in the World-Wide Web, 2000.
[4] C.S. Li, Y.C. Chang, and J.R. Smith et al. SPIRE/EPI-SPIRE Model-Based Multi-modal Information Retrieval from Large Archives. MMCBIR 2001.
[5] G. Luo, K. Wu, and P.S. Yu. SAO: A Stream Index for Answering Linear Optimization Queries, full version. Available at http://www.cs.wisc.edu/~gangluo/onion_full.pdf, 2006.
[6] 2005 UC Data Mining Competition Homepage. http://mill.ucsd.edu, 2005.