Run-Time Operator State Spilling for Memory Intensive Long-Running Queries


Bin Liu, Yali Zhu, and Elke A. Rundensteiner
Department of Computer Science, Worcester Polytechnic Institute
Worcester, Massachusetts, USA
{binliu, yaliz, rundenst}@cs.wpi.edu

ABSTRACT

Main memory is a critical resource when processing long-running queries over data streams with state intensive operators. In this work, we investigate state spill strategies that handle run-time memory shortage when processing such complex queries by selectively pushing operator states into disks. Unlike previous solutions which all focus on one single operator only, we instead target queries with multiple state intensive operators. We observe an interdependency among multiple operators in the query plan when spilling operator states. We illustrate that existing strategies, which do not take account of this interdependency, become largely ineffective in this query context. Clearly, a consolidated plan-level spill strategy must be devised to address this problem. Several data spill strategies are proposed in this paper to maximize the run-time query throughput in memory constrained environments. The bottom-up state spill strategy is an operator-level strategy that treats all data in one operator state equally. More sophisticated partition-level data spill strategies are then proposed to take different characteristics of the input data into account, including the local output, the global output and the global output with penalty strategies. All proposed state spill strategies have been implemented in the D-CAPE query system. The experimental results confirm the effectiveness of our proposed strategies. In particular, the global output strategy and the global output with penalty strategy have shown favorable results as compared to the other two more localized strategies.

1. INTRODUCTION

Processing long-running queries over real-time data has gained great attention in recent years [, 3, 6, 4].
(This work was partly supported by the National Science Foundation under grant IIS-044567. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2006, June 27-29, 2006, Chicago, Illinois, USA. Copyright 2006 ACM 1-59593-256-9/06/0006 ...$5.00.)

Unlike static queries in a traditional database system, such a query evaluates streaming data that is continuously arriving and produces query results in a real-time fashion. The stringent requirement of generating real-time results demands efficient main-memory based query processing. Therefore long-running queries, especially complex queries with multiple potentially very large operator states such as multi-joins [], can be extremely memory intensive during their execution. Memory intensive queries with multiple stateful operators are for instance common in data integration or in data warehousing environments. For example, a real-time data integration system helps financial analysts in making timely decisions. At run time, stock prices, volumes and external reviews are continuously sent to the integration server. The integration server must join these input streams as fast as possible to produce early output results to the decision support system. This ensures that financial analysts can analyze and make instantaneous decisions based on the most up-to-date information. When a query system does not have enough resources to keep up with the query workload at runtime, techniques such as load shedding [9] can be applied to discard some workload from the system. However, in many cases, long-running queries may need to produce complete result sets, even though the query system may not have sufficient resources for the query workload at runtime.
As an example, decision support applications rely on complete results to eventually apply complex and long-ranging historic data analysis, i.e., quantitative analysis. Thus, techniques such as load shedding [9] are not applicable for such applications. One viable solution to address the problem of run-time main memory shortage while satisfying the need for complete query results is to push memory-resident states temporarily into disks when memory overflow occurs. Such solutions have been discussed in XJoin [0], Hash-Merge Join [5] and MJoin []. These solutions aim to ensure a high runtime output rate as well as the completeness of query results for a query that contains a single operator. The processing of the disk-resident states, referred to as state cleanup, is delayed until a later time when more resources become available. We refer to this pushing and cleaning process as state spill adaptation. However, the state spill strategies in the current literature are all designed for queries with one single stateful operator only [5, 0, ]. We now point out that for a query with multiple state intensive operators, data spilling from one operator can affect other operators in the same pipeline. Such interdependency among operators in the same dataflow pipeline must be considered if the goal of the runtime data spilling is to ensure a high output rate of the whole query plan. This poses new challenges on the state spill techniques,

which the existing strategies, such as XJoin [0] and Hash-Merge Join [5], cannot cope with. As an example of the problem considered, Figure 1 shows two stateful operators OP_i and OP_j with the output of OP_i directly feeding into OP_j. If we apply the existing state spill strategies on both operators separately, the interdependency between the two operators can cause problems not solved by these strategies. First, the data spill strategies would aim to maximize the output rate of OP_i when spilling states from OP_i. However, this could in fact backfire since it would in turn increase the main memory consumption of OP_j. Secondly, the states spilled in OP_i may have the potential to have made a high contribution to the output of OP_j. However, since they are spilled in OP_i, this may produce the opposite of the intended effect, that is, it may reduce instead of increase the output rate of OP_j. This contradicts the goal of the data spill strategies applied on OP_j.

[Figure 1: A Chain of Stateful Operators. OP_i feeds directly into OP_j; should we maximize the output of OP_i?]

In this work, we propose effective runtime data spill strategies for queries with multiple inter-dependent state intensive operators. The main research question addressed in this work is how to choose which part of the operator states of a query to spill at run-time to avoid memory overflow while maximizing the overall query throughput. Another important question addressed is how to efficiently clean up disk-resident data to guarantee completeness of query results. We focus on applications that need accurate query results. Thus, all input tuples have to be processed either in real time during the execution stage or later during the state clean-up phase. Several data spill strategies are proposed in this paper. We first discuss the bottom-up state spill strategy, which is an operator-level strategy that treats all data in one operator state equally.
We then propose more sophisticated partition-level data spill strategies that take different characteristics of the input data into account, including a localized strategy called local output, and two global throughput-oriented state spilling strategies, named global output and global output with penalty. All proposed data spill strategies aim to select appropriate portions of the operator states to spill in order to maximize the run-time query throughput. We also propose efficient clean-up algorithms to generate the complete query results from the disk-resident data. Furthermore, we show how to extend the proposed data spill strategies to apply them in a parallel processing environment. For long-running queries with high stream input rates and thus a monotonic increase of operator states, the state cleanup process may be performed only after the run-time execution phase finishes. In this paper we focus on this case. For queries with window constraints and bursty input streams, the in-memory execution and the disk clean-up may need to be interleaved at runtime. New issues in this scenario include timing of spill, timing of clean-up, and selection of data to clean up. We plan to address these issues in our future work. The proposed state spill strategies and clean-up algorithms have all been implemented in the D-CAPE continuous query system [3]. The experimental results confirm the effectiveness of our proposed strategies. In particular, the global output strategy and the global output with penalty strategy have shown more favorable results as compared to the other two more localized strategies. The remainder of the paper is organized as follows. Section 2 discusses basic concepts that are necessary for later sections. Section 3 defines the problem of throughput-oriented data spilling we are addressing in this paper. The global throughput-oriented state spilling strategies are presented and analyzed in Section 4. Section 5 discusses the clean-up algorithms. In Section 6, we show how to apply the data spilling strategies in a parallel processing environment.
Performance evaluations are presented in Section 7. Section 8 discusses related work and we conclude in Section 9.

2. PRELIMINARIES

2.1 State Partitions and Partition Groups

Operators in continuous long-running queries are required to be non-blocking. Thus many operators need states. For example, a join operator needs states to store tuples that have been processed so far, so as to join them with future incoming tuples from the other streams. In case of high stream arrival rates and long running time, the states in an operator can become huge. Spilling one of these large states in its entirety to disk at times of memory overflow can be rather inefficient, and possibly even not necessary. In many cases, we need the flexibility to choose to spill part of a state, or to spill data from several states to disk, to temporarily reduce the query workload in terms of memory. To facilitate this flexibility in run-time adaptation, we can divide each input stream into a large number of partitions. This enables us to effectively spill some partitions in a state without affecting other partitions in the same state or partitions in other operator states. This method has first been found to be effective in the early data skew handling literature, such as [9], as well as in recent work on partitioned continuous query processing, such as Flux [8]. By using the above stream partitioning method, we can organize operator states based on the input partitions. Each input partition is identified by a unique partition ID. Thus each tuple within an operator state belongs to exactly one of these input partitions and would be associated with that particular partition ID. For simplicity, we also use the term partition to refer to the corresponding operator state partition. The input streams should be partitioned such that each query result can be generated from tuples within the same partition, i.e., with the same partition ID. In this way, we can simply choose appropriate partitions to spill at run time, while avoiding repartitioning during this adaptation process.
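This partition-group organization can be sketched in a few lines of code. This is a toy illustration rather than the paper's implementation: the hash-based split function, the partition count, and all names below are assumptions.

```python
# Sketch of hash-based stream partitioning (Section 2.1): every tuple is
# routed to a partition ID derived from its join column, so tuples that can
# join with each other always share a partition ID. Illustrative names only.

NUM_PARTITIONS = 8  # assumption: a fixed number of partition groups

def partition_id(join_value, num_partitions=NUM_PARTITIONS):
    """Split function: map a join-column value to a partition ID."""
    return hash(join_value) % num_partitions

class OperatorState:
    """Operator state organized by partition ID."""
    def __init__(self):
        self.partitions = {}  # partition ID -> list of stored tuples

    def insert(self, tup, join_value):
        pid = partition_id(join_value)
        self.partitions.setdefault(pid, []).append(tup)
        return pid

state = OperatorState()
pid1 = state.insert(("a", 42), 42)
pid2 = state.insert(("b", 42), 42)
# Tuples with the same join value always land in the same partition group,
# so that partition group can be spilled to disk as one self-contained unit.
assert pid1 == pid2
```

Because results can only form within a partition group, a group can be moved to disk (and later cleaned up) without repartitioning anything else.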
Figure 2 depicts the stream partitioning for a join query (A ⋈ B ⋈ C). The join is defined as A.A1 = B.B1 = C.C1, where A, B, and C denote input streams (join relations) and A1, B1, and C1 are the corresponding join columns. (For m-way joins (m > 2) [] with join conditions defined on different columns, more data structures are required to support this partitioned m-way join processing. The discussion of this is out of the scope of the paper since we focus on the aspect of run-time state adaptation in this work.) Here, the Split A operator partitions the stream A based on the

value of column A1, while the Split B operator partitions the stream B based on B1, and so on. As we can see, in order to generate a final query result, tuples from stream A with a given partition ID only need to join with tuples with the same partition ID from streams B and C.

[Figure 2: Example of Partitioned Inputs. Split A, Split B, and Split C divide streams A, B, and C into partitions P_A, P_B, and P_C by partition ID.]

When spilling operator states, we could choose partitions from each input separately, as shown in Figure 3(a). Using this strategy requires us to keep track of the timestamps of when each of these partitions was spilled to disk, and the timestamps of each tuple, in order to avoid duplicates or missing results in the cleanup process. For example, suppose partition A_1 has been spilled to the disk at time t_1; we use A_1' to denote this spilled part of the partition A_1. All the tuples from B_1 and C_1 with a timestamp greater than t_1 have to eventually join with A_1' in the cleanup process. Since A_1, B_1, and C_1 could each be spilled more than once, the cleanup needs to be carefully synchronized with the timestamps of the input tuples and the timestamps of the partitions being spilled. An alternative strategy is to use a partition group as the smallest unit of adaptation. As illustrated in Figure 3(b), a partition group contains partitions with the same partition ID from all inputs. During our research, we found that using the granularity of a partition group can simplify the cleanup process (described in Section 4). Therefore, in our work we choose to use the notion of a partition group as the smallest unit to spill to disk. From now on, we use the term partition to refer to a partition group if the context is clear. Since a query plan can contain multiple joins, the partition groups here are defined for each individual operator in the plan. Different operators may generate a tuple's partition ID based on different columns of that tuple. This arises when the join predicates are non-transitive. Therefore a tuple may hold different partition IDs in different operators.

[Figure 3(a): Selecting partitions from one individual input.]
[Figure 3(b): Selecting partitions from all inputs with the same partition ID. Figure 3: Composing Partition Groups.]

As an additional bonus, the approach of partitioning input streams (operator states) naturally facilitates efficient partitioned parallel query processing [0, 8]. That is, we can send non-overlapping partitions to multiple machines and have the query processed in parallel. The query processing can then proceed respectively on each machine. This will be further discussed in Section 5.

2.2 Calculating State Size

Serving as the basis for the following sections, we now describe how to calculate the operator state size and the state size of the query tree. The operator state size can be estimated based on the average size of each tuple and the total number of tuples in the operator. The total state size of the query tree is equal to the sum of all the operator state sizes. For example, the state size of Join1 (see Figure 4) can be estimated by S_1 = u_a·s_a + u_b·s_b + u_c·s_c. Here, s_a, s_b, and s_c denote the number of tuples in Join1 from input streams A, B and C respectively, and u_a, u_b, and u_c represent the average sizes of input tuples from the corresponding input streams. In Figure 4, I_1 and I_2 denote the intermediate results from Join1 and Join2 respectively. Note that the average tuple size of I_1 can be represented by u_a + u_b + u_c, while the average tuple size of I_2 can be denoted by u_a + u_b + u_c + u_d if no projection is applied in the query plan. This simple model can be naturally extended to situations where projections do exist. The size of operator states to be spilled during the spill process can be computed in a similar manner. For example, assume d_a tuples from A, d_b tuples from B, and d_c tuples from C are to be spilled. Then, the spilled state size can be represented by D_1 = u_a·d_a + u_b·d_b + u_c·d_c.
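To make this arithmetic concrete, the total and spilled state sizes of one join can be computed directly. This is a toy sketch; every tuple size and count below is invented for illustration.

```python
# Sketch of the state-size model of Section 2.2: Join1 stores s_a, s_b, s_c
# tuples from inputs A, B, C with average tuple sizes u_a, u_b, u_c; a spill
# would remove d_a, d_b, d_c of those tuples. All numbers are made up.

def state_size(avg_sizes, tuple_counts):
    """S = sum over inputs of (average tuple size * number of stored tuples)."""
    return sum(u * n for u, n in zip(avg_sizes, tuple_counts))

u = [16, 24, 8]           # u_a, u_b, u_c: average tuple sizes in bytes
s = [1000, 500, 2000]     # s_a, s_b, s_c: tuples currently held in Join1
d = [100, 50, 200]        # d_a, d_b, d_c: tuples selected for spilling

S1 = state_size(u, s)     # total state size of Join1
D1 = state_size(u, d)     # size of the states spilled from Join1
print(S1, D1)             # 44000 4400
```

With several joins, the plan-wide spill fraction then follows as the sum of the spilled sizes D_i divided by the sum of the total sizes S_i.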
[Figure 4: Unit Size of Each Stateful Operator. Join1 has overall state size S_1 = u_a·s_a + u_b·s_b + u_c·s_c and spilled state size D_1 = u_a·d_a + u_b·d_b + u_c·d_c; the intermediate result I_1 has average tuple size u_a + u_b + u_c, and I_2 has u_a + u_b + u_c + u_d.]

Thus, the total percentage of states spilled for the query tree can be computed as the sum of the state sizes being spilled divided by the total state size. For the query tree depicted in Figure 4, it is denoted by (D_1 + D_2 + D_3)/(S_1 + S_2 + S_3). Here S_i represents the total state size of operator Join_i, while D_i denotes the operator states being spilled from Join_i (1 <= i <= 3).

3. THROUGHPUT-ORIENTED STATE SPILL STRATEGIES

As discussed in Section 1, our goal is to keep the runtime throughput of the query plan as high as possible while at the same time preventing the system from memory overflow by applying runtime data spilling when necessary. Given multiple stateful operators in a query tree, partitions from

all operators can be considered as potential candidates to be pushed when main memory overflows. We now discuss various strategies to choose partition groups to spill from multiple stateful operators. State spill strategies have been investigated in the literature [5, 0, ] to choose partitions from one single stateful operator to spill to disk with the least effect on the overall throughput. However, as discussed in Section 1, the existing strategies are not sufficient to apply to a query tree with multiple stateful operators, because they do not consider the interdependencies among a chain of stateful operators in a dataflow pipeline. As we will illustrate below, a direct extension of the existing strategies for one single operator does not perform well when applied to multiple stateful operators. The decision of finding partitions to spill can be made at the operator level or at the partition level. Selecting partitions at the operator level means that we first choose which operators to spill partitions from and then start to spill partitions from this operator until the desired amount of data is pushed to disk. If the size of the chosen operator state is smaller than the desired spill amount, we choose the next operator to spill partitions from. In other words, by using operator-level state spill, all partitions inside one operator state are treated uniformly and have equal chances of being spilled to disk. The state spilling can also be done at the partition level, which treats each partition as an individual unit and globally chooses which partitions to spill without considering which operators these partitions belong to. In this section, we present various state spill strategies at both the operator level and the partition level.

We first investigate the impact of pushing operator states to disk in a chain of operators. Figure 5 illustrates an example of an operator chain. Each operator in the chain represents a state intensive operator in a query tree. Note that it does not have to be a single input operator as depicted in the figure. s_i represents the corresponding selectivity of operator OP_i (1 <= i <= n). The total number of tuples that will be stored within this chain due to t input tuples to OP_1 is

    I = Σ_{i=1}^{n} [ (∏_{j=1}^{i-1} s_j) · t ]    (2)

More precisely, OP_1 stores t tuples, OP_2 stores t·s_1 tuples, OP_3 stores t·s_1·s_2 tuples, and so on. Thus, if we spill t tuples at OP_1, then all the corresponding intermediate results generated due to the existence of these t tuples, which would have been stored in OP_2, OP_3, ..., OP_n, now would not exist any more. Note that spilling any of these intermediate results would have the same overall effect on the final output, i.e., spilling the t·s_1 tuples at OP_2 would decrease the final output by the same amount as spilling t tuples at operator OP_1, as estimated by Equation 1.

3.1 Operator-Level State Spill

3.1.1 Bottom-up Pushing Strategy

Inspired by the above analysis, we now propose a naive strategy, referred to as bottom-up pushing, to spill operator states of a query tree with multiple stateful operators at the operator level. This strategy always chooses operator states from the bottom operator(s) in the query tree until enough space has been saved in memory. For example, in Figure 5, the bottom operator is OP_1. Partition groups from bottom operators are chosen randomly and have equal chances of being chosen. Intuitively, if partition groups from the bottom operator are chosen to be pushed to disk, fewer intermediate results will be stored in the query tree, compared to pushing states in the other operators. Thus, the bottom-up pushing strategy has the potential to lead to a smaller number of state spill processes, because fewer states (intermediate results) are expected to accumulate in the query tree. However, having a smaller number of state spill processes does not naturally result in a high overall throughput. This is because (1) the states being pushed in the bottom operator may contribute to a high output rate in its downstream operators, and (2) the cost of each state spill process may not be high, thus having a large number of state spill processes may not incur significant overhead on the query processing.
[Figure 5: An Operator Chain. Operators OP_1, ..., OP_n with selectivities s_1, ..., s_n; intermediate states accumulate between the operators on the way to the final output.]

For such an operator chain, Equation 1 estimates the possible number of output tuples from OP_n given a set of input tuples t to OP_1:

    u = (∏_{i=1}^{n} s_i) · t    (1)

The total number of tuples that will be stored somewhere within this chain due to these t input tuples, which also corresponds to the increase in the operator state size, is given by Equation 2. (We assume that all input tuples to stateful join operators have to be stored in operator states. In principle, other stateful operators can be addressed in a similar manner.)

[Figure 6: A Chain of Partitioned Operators. Each operator's state is divided into partition groups.]

Moreover, the output of a particular partition of the bottom operator is likely to be sent into multiple different partitions of the downstream operator(s). For example, as illustrated in Figure 6, assume the t input tuples to OP_1 are partitioned into partition group P^1_1. Here the superscript represents the operator ID, while the subscript denotes the partition ID. After the processing in OP_1, t_1 result tuples are output and partitioned into P^2_1 of OP_2, while t_2 tuples are partitioned into P^2_2 of OP_2. The partitions P^2_1 and P^2_2 of OP_2 may have very different selectivities. For example, the output of P^2_1 may be much larger than that of P^2_2 while the

size of these two partitions may be similar. Thus, it may be worthwhile to keep P^1_1 in OP_1 even though certain states (in P^2_1 of OP_2) will be accumulated at the same time.

3.1.2 Discussions on Operator-Level State Spill

As we can see, the relationship between partitions among adjacent operators is a many-to-many relationship. Pushing partition groups at any operator other than the root operators may affect multiple partition groups at its downstream operators. However, an operator-level strategy, such as the presented bottom-up strategy, does not have a clear connection between the partition pushing and its effects on the overall throughput. Another general drawback of the operator-level spilling is that it treats all partitions in the same state as having the same characteristics and the same effects on query performance when considering data spilling. However, different partitions may have different effects on the memory consumption and the query throughput after the data spilling. For example, some tuples have data values that appear more often in the stream, so they may have higher chances to join with other tuples and produce more results. Thus we may need to make decisions on where to spill data at a finer granularity.

3.2 Partition-Level State Spill

To design a better state spilling strategy, we propose to globally select partition groups in the query tree as candidates to push. Figure 7 illustrates the basic idea of this approach. Instead of pushing partitions from particular operator(s) only, we conceptually view partitions from different operators at the same level. That is, we choose partitions globally at the query level based on certain cost statistics collected about each partition. The basic statistics we collect for each partition group are P_output and P_size. P_output indicates the total number of tuples that have been output from the partition group, and P_size refers to the operator state size of the partition group. These two values together can be utilized to identify the productivity of the partition group.
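A minimal sketch of selecting a spill victim from these two statistics, gathered across all operators of the plan. The partition data below is invented; in the paper the bookkeeping lives inside the operators themselves.

```python
# Sketch of productivity-based, partition-level victim selection (Section
# 3.2): every partition group in the plan carries P_output and P_size, and
# the group with the smallest P_output / P_size ratio is spilled first.
# All statistics below are made up for illustration.

# (operator, partition ID) -> [P_output, P_size]
stats = {
    ("Join1", 0): [90, 30],   # productivity 3.0
    ("Join1", 1): [10, 40],   # productivity 0.25
    ("Join2", 0): [50, 25],   # productivity 2.0
}

def productivity(entry):
    p_output, p_size = entry
    return p_output / p_size

def choose_spill_victim(stats):
    """Pick the partition group with the lowest productivity value,
    regardless of which operator it belongs to."""
    return min(stats, key=lambda key: productivity(stats[key]))

victim = choose_spill_victim(stats)
print(victim)   # the least productive group: partition 1 of Join1
```

Whether P_output counts the operator's own outputs (local output) or the final query outputs traced back to the group (global output) is exactly what distinguishes the strategies described next.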
We now describe three different strategies for how to collect the P_output and P_size values of each partition group, and how partition groups can be chosen based on these values with the most positive impact on the run-time throughput.

[Figure 7: Globally Choose Partition Groups. Partition groups from Join1, Join2, and Join3 are all candidates for the disk-bound state spill.]

3.2.1 Local Output Strategy

The first proposed partition-level state spill strategy, referred to as local output, updates the P_output and P_size values of each partition group locally at each operator. The P_size of each partition group is updated whenever input tuples are inserted into the partition group, while the P_output value is updated whenever output tuples are generated from the operator. Figure 8 illustrates this localized approach. When t tuples are input to Join1, we update the P_size of the corresponding partition groups in Join1. When t_1 tuples are generated from Join1, then the P_output values of the corresponding partition groups in Join1 and the P_size values of the related partition groups in Join2 are updated. Similarly, if we get t_2 tuples from Join2, then the P_output values of the corresponding partition groups in Join2 and the P_size values in Join3 are updated.

[Figure 8: A Localized Statistics Approach. P_output and P_size are maintained per operator along the chain Join1 → Join2 → Join3.]

Different from the previous operator-level state spill, when selecting partitions to spill, this strategy chooses from the set of all partitions across all operators in the query plan based on their productivity values (P_output/P_size). Hence this is a partition-level state spill strategy. We push the partition group with the smallest productivity value among all partition groups in the query plan. However, this approach does not provide a global productivity view of the partition groups. For example, if we keep partition groups of Join1 with high productivity values in main memory, this in turn would contribute to generating more output tuples to be input to Join2. All these tuples will be stored in Join2 and hence will increase the main memory consumption of Join2. This may cause the main memory to be filled up quickly.
However, these intermediate results may not necessarily help the overall throughput since these results may be dropped by its downstream operators.

3.2.2 Global Output Strategy

In order to maximize the run-time throughput after pushing states onto disks, we need to have a global view of partition groups that reflects how each partition group contributes to the final output. That is, the productivity value of each partition group needs to be defined in terms of the whole query tree. This requires the P_output value of each partition group to represent the number of final output tuples generated from the query. The productivity value, P_output/P_size, now indicates how good the partition group is in terms of contributing to the final output of the query. Thus, if we keep the partition groups with high global productivity values in main

memory, the overall throughput of the query tree is likely to be high compared with the previously described pushing strategies. Note that the key difference of this global output approach from the local output approach is its new way of computing the Poutput value.

We have designed a tracing algorithm that computes the Poutput value of each partition group. The basic idea is that whenever output tuples are generated from the query tree, we figure out the lineage of each output tuple. That is, we trace back to the corresponding partition groups from the different operators that have contributed to this output. The tracing of the partition groups that contribute to an output tuple can be computed by applying the corresponding split operators. This is feasible since we can apply the split functions on the output tuple along the query tree to identify all the partition groups that the output tuple belongs to. Such tracing requires that the output tuple contain at least all join columns of the join operators in the query tree.

The main idea of the tracing algorithm is depicted in Figure 9. When k tuples are generated from Join3, we directly update the Poutput values of the partition groups in Join3 that produce these outputs. To find out the partition groups in Join2 that contribute to the outputs, we apply the partition function of Split2 on each output tuple. Since multiple partition groups in Join2 may contribute to one partition group in Join3, we need to trace each partition group that is found in Join2. Similarly, we apply the partition function of Split1 to find the corresponding partition groups in operator Join1. Note that we do not have to trace and update Poutput for every output tuple. We only update the value with a random sample of the output tuples.

The pseudocode for the tracing algorithm for a chain of operators is given in Algorithm 1. Here, we assume that each stateful operator in the query tree keeps a reference to its immediate upstream stateful operator and a reference to its immediate upstream split operator.
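Since each split function is deterministic on its join column, the lineage of an output tuple can be recovered simply by re-applying those functions. A minimal sketch (the column names and the modulo-hash split functions are assumptions for illustration):

```python
def trace_lineage(output_tuple, split_fns):
    """Recover, per upstream operator, the partition ID this output tuple
    traces back to, by re-applying that operator's split function to the
    join column carried in the output tuple."""
    return [split(output_tuple) for split in split_fns]

NUM_PARTITIONS = 4
# Illustrative split functions: each hashes the join column its operator
# partitions on (column names "a1"/"d1" are invented for this sketch).
split_join2 = lambda t: t["d1"] % NUM_PARTITIONS
split_join1 = lambda t: t["a1"] % NUM_PARTITIONS

out = {"a1": 7, "d1": 10}                              # an output tuple
pids = trace_lineage(out, [split_join2, split_join1])  # IDs in Join2, Join1
```

This is why the tracing requires the output tuple to carry all join columns: without them, the upstream split functions could not be evaluated.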
The upstream operators of an operator op here are defined as the operators that feed their output tuples as inputs to the operator op. Note that for a query tree, multiple immediate upstream stateful operators may exist for one operator. We can then similarly update the tracing algorithm to use a breadth-first or depth-first traversal of the query plan tree to update the Poutput values of the corresponding partitions.

Algorithm 1 updateStatistics(tpset)
/* Tracing and updating the Poutput values for a given set of output tuples tpset. */
1: op ← root operator of the query tree;
2: prv_op_ref ← op.getUpstreamOperatorReference();
3: prv_split_ref ← op.getUpstreamSplitReferences();
4: while ((prv_op_ref != null) && (prv_split_ref != null)) do
5:   for each tuple tp ∈ tpset do
6:     cpID ← Compute partitionID of tp in prv_op_ref;
7:     Update Poutput of partition group with ID cpID;
8:   end for
9:   prv_op_ref ← prv_op_ref.getUpstreamOperatorReference();
10:  prv_split_ref ← prv_split_ref.getUpstreamSplitReference();
11: end while

Given the above tracing, the Poutput value of each partition group indicates the total number of outputs generated so far that this partition group has been involved in. The update of the Psize value is the same as discussed in the local output approach. Thus, Poutput/Psize indicates the global productivity of the partition group. By pushing partition groups with a lower global productivity, the overall run-time throughput is expected to be better than with the localized approach as well as the bottom-up approach.

Figure 9: Tracing the Output Tuples

3.2.3 Global Output with Penalty Strategy

In the above approaches, the size of the partition group Psize reflects the main memory usage of the current partition group. However, as previously pointed out, the operators in a query tree are not independent. That is, output tuples of an upstream operator have to be stored in the downstream stateful operators.
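Algorithm 1's upstream walk could be sketched in runnable form as follows (the Operator class and its fields are hypothetical stand-ins for the operator references the text assumes; the sampling of output tuples is controlled by sample_rate):

```python
import random

class Operator:
    """Hypothetical stateful operator: holds its split function and a link
    to its immediate upstream stateful operator."""
    def __init__(self, split_fn, upstream=None):
        self.split_fn = split_fn   # maps a tuple to a partition ID
        self.upstream = upstream
        self.p_output = {}         # partition ID -> traced output count

def update_statistics(tpset, root, sample_rate=1.0):
    """Walk from the root operator up the chain of stateful operators,
    crediting the partition group each (sampled) output tuple traces to."""
    op = root
    while op is not None:
        for tp in tpset:
            if random.random() <= sample_rate:  # trace only a sample
                pid = op.split_fn(tp)
                op.p_output[pid] = op.p_output.get(pid, 0) + 1
        op = op.upstream

join1 = Operator(lambda t: t["a"] % 2)
join2 = Operator(lambda t: t["d"] % 2, upstream=join1)
join3 = Operator(lambda t: t["e"] % 2, upstream=join2)
update_statistics([{"a": 1, "d": 2, "e": 3},
                   {"a": 3, "d": 4, "e": 6}], join3)
```

With sample_rate below 1.0 the counters become approximate, which matches the remark above that only a random sample of the output tuples needs to be traced.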
This indirectly affects the Psize of the corresponding partition groups in the downstream operator.

Figure 10: Impact of the Intermediate Results

For example, as shown in Figure 10, both partition groups P1 and P2 of OP1 have the same Psize and Poutput values. Thus, these two partitions have the same productivity value in the global output approach. However, P1 produces 2 tuples on average that are output to OP2 given one input tuple, while P2 generates 10 tuples on average given one input tuple. All intermediate results have to be stored in the downstream stateful operators. Thus, pushing P2 instead of P1 can help to reduce the memory that will be needed to store possible intermediate results in downstream operators.

To capture this effect, we define an intermediate result factor in each partition group, denoted by Pinter. This factor indicates the possible intermediate results that will be stored in its downstream operators in the query tree. In this strategy, the productivity value of each partition group is defined as Poutput/(Psize + Pinter). This intermediate result factor can be computed similarly to the tracing of the final output. That is, whenever an intermediate result is generated, we update the Pinter values of the corresponding partition groups in all the upstream operators. Figure 11 illustrates an example of how the tracing algorithm can be utilized to update Pinter. In this example, one input tuple to OP1 eventually generates output tuples from OP4. The number in the square box represents the number of intermediate results being generated.

Figure 11: Tracing and Updating Pinter Values

4. CLEAN UP DISK RESIDENT PARTITIONS

4.1 Clean Up of One Stateful Operator

When memory becomes available, disk resident states have to be brought back to main memory to produce missing results. This state cleanup process can be performed at any time when memory becomes available during the execution. If no new resources are being devoted to the computation, then this cleanup process will likely occur at the end of the run-time phase. In the cleanup, we must produce all missing results due to spilling data to disk while preventing duplicates. Note that multiple partition groups may exist on disk for one partition ID. This is because once a partition group has been pushed to disk, new tuples with the same partition ID may again accumulate, and thus a new partition group forms in main memory. Later, as needed, this partition group could be pushed to disk again.
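The penalty-adjusted score is then a one-line change over the plain global productivity. A small illustrative sketch (all numbers invented) showing how the Figure 10 situation is disambiguated:

```python
def productivity(p_output, p_size, p_inter=0):
    """Global output with penalty: output per unit of memory, where memory
    includes the intermediate results this group forces downstream
    operators to hold."""
    return p_output / (p_size + p_inter)

# Two groups that tie under the plain global-output metric ...
tie = productivity(20, 10) == productivity(20, 10)
# ... are separated once the intermediate-result factor is charged:
# the group feeding more tuples downstream scores lower, so it is spilled.
score_p1 = productivity(20, 10, p_inter=2)   # few intermediate results
score_p2 = productivity(20, 10, p_inter=10)  # many intermediate results
```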
The tasks that need to be performed in the cleanup can be described as follows: (1) Organize the disk resident partition groups based on their partition ID. (2) Merge partition groups with the same partition ID and generate the missing results. (3) If a main memory resident partition group with the same ID exists, then merge this memory resident part with the disk resident ones. Figure 12 illustrates an example of the partition groups before and after the cleanup process. Here, the example query is defined as A ⋈ B ⋈ C. We use a subscript to indicate the partition ID, while we use a superscript to distinguish between the partition groups with the same partition ID that have been pushed at different times. A collection of superscripts such as 1~r represents the merge of the partition groups that respectively had been pushed at times 1, 2, ..., r.

Figure 12: Example of Cleanup Process

The merge of partition groups with the same ID can be described as follows. For example, assume that a partition group with partition ID i has been pushed k times to disk, represented as (A_i^1, B_i^1, C_i^1), (A_i^2, B_i^2, C_i^2), ..., (A_i^k, B_i^k, C_i^k) respectively. Here (A_i^j, B_i^j, C_i^j), 1 ≤ j ≤ k, denotes the j-th time that the partition group with ID i has been pushed to disk. For ease of description, we denote these partition groups by P1, P2, ..., Pk respectively. Due to our usage of the idea of spilling at the granularity of complete partition groups (see Section 2), the results generated between all the members of each partition group have already been produced during the previous run-time execution phase. In other words, all the results such as A_i^1 ⋈ B_i^1 ⋈ C_i^1, A_i^2 ⋈ B_i^2 ⋈ C_i^2, ..., A_i^k ⋈ B_i^k ⋈ C_i^k are guaranteed to have been previously generated. For simplicity, we denote these results as V1, V2, ..., Vk. These partition groups can thus be considered self-contained partition groups, given the fact that all results have been generated from the operator states that are included in the partition group. Merging two partition groups with the same partition ID results in a combined partition group that then contains the union of the operator states from both partition groups.
For example, the merge of P1 and P2 results in a new partition group P1,2, now containing the operator states A_i^1 ∪ A_i^2, B_i^1 ∪ B_i^2, C_i^1 ∪ C_i^2. Note that the output V1,2 from partition group P1,2 should be (A_i^1 ∪ A_i^2) ⋈ (B_i^1 ∪ B_i^2) ⋈ (C_i^1 ∪ C_i^2). Clearly, a subset of these output tuples has already been generated, namely V1 and V2. Thus we must generate the missing part in the merging process for these two partition groups in order to make the resulting partition group P1,2 self-contained. This missing part is ΔV1,2 = V1,2 − V1 − V2.

Here, we observe that the problem of merging partition groups and producing missing results is similar to the problem of incremental batch view maintenance [, 6]. We thus now describe the algorithm for incremental batch view maintenance and then show how to map our problem to the view maintenance problem, so as to apply existing solutions from the literature [, 6]. Assume a materialized view V is defined as an n-way join over n distributed data sources, denoted by R1 ⋈ R2 ⋈ ... ⋈ Rn. There are n source deltas (ΔRi, 1 ≤ i ≤ n)
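The batch maintenance computation of Equation 3 can be exercised on a toy model in which every source state is a set of join-key values, so the n-way join degenerates to set intersection. This is only a sketch of the term structure, not the system's actual bag-of-tuples semantics:

```python
def delta_view(states, deltas):
    """Batch maintenance terms of Equation 3: term i joins the updated
    states before position i, the delta at i, and the old states after i."""
    n = len(states)
    updated = [s | d for s, d in zip(states, deltas)]  # Ri' = Ri + dRi
    dv = set()
    for i in range(n):
        term = set(deltas[i])
        for j in range(i):
            term &= updated[j]   # R1', ..., R(i-1)'
        for j in range(i + 1, n):
            term &= states[j]    # R(i+1), ..., Rn
        dv |= term
    return dv

R1, R2, R3 = {1, 2}, {1, 2, 3}, {2, 3}
d1, d2, d3 = {3}, set(), {1}
old_v = R1 & R2 & R3                           # view before the deltas
new_v = (R1 | d1) & (R2 | d2) & (R3 | d3)      # view after the deltas
dv = delta_view([R1, R2, R3], [d1, d2, d3])    # old_v | dv equals new_v
```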

that need to be maintained. As mentioned earlier, each ΔRi denotes the changes (the collection of insert and delete tuples) on Ri at a logical level. An actual maintenance query will be issued separately, that is, one for insert tuples and one for delete tuples. Given the above notation, the batch view maintenance process is depicted in Equation 3.

ΔV = ΔR1 ⋈ R2 ⋈ R3 ⋈ ... ⋈ Rn
   + R1' ⋈ ΔR2 ⋈ R3 ⋈ ... ⋈ Rn
   + ...
   + R1' ⋈ R2' ⋈ R3' ⋈ ... ⋈ ΔRn        (3)

Here Ri refers to the original data source state without any changes from ΔRi incorporated in it yet, while Ri' represents the state after ΔRi has been incorporated, i.e., it reflects Ri + ΔRi (+ denotes the union operation). The discussion of the correctness of this batch view maintenance itself can be found in [, 6].

Intuitively, we can treat one partition group as the base state and the other as the incremental changes. Thus, the maintenance equation described in Equation 3 can be naturally applied to merge partitions and recompute missing results.

Lemma 4.1. A combined partition group P_{r,s} generated by merging partition groups P_r and P_s using the incremental batch view maintenance algorithm as listed in Equation 3 is self-contained if P_r and P_s were both self-contained before the merge.

Proof. Without loss of generality, we treat partition group P_r as the base state and P_s as the incremental change to P_r. The incremental batch view maintenance equation described in Equation 3 produces the following two results: (1) the partition group P_{r,s} holding the states of both P_r and P_s, and (2) the incremental changes to the base result V_r, namely ΔV_{r,s} = V_{r,s} − V_r. Since the two partition groups P_r and P_s have already generated the results V_r and V_s, the missing result of combining P_r and P_s can be generated by ΔV_{r,s} − V_s. As can be seen, P_{r,s} is self-contained since it has generated exactly the output results V_{r,s} = (ΔV_{r,s} − V_s) + (V_r + V_s).

As an example, let us assume A1, B1 and C1 are the base states, while A2, B2 and C2 are the incremental changes. Then, by evaluating the view maintenance equation as instantiated in Equation 4, we get the combined partition group P_{1,2} and the delta change ΔV_{1,2} = V_{1,2} − V1.

ΔV_{1,2} = A2 ⋈ B1 ⋈ C1 + (A1 ∪ A2) ⋈ B2 ⋈ C1 + (A1 ∪ A2) ⋈ (B1 ∪ B2) ⋈ C2        (4)

By further removing V2 from ΔV_{1,2}, we generate exactly the missing results of combining P1 and P2.

Lemma 4.2. Given a collection of self-contained partition groups {P1, P2, ..., Pm}, a self-contained partition group P_{1...m} can be constructed by applying the above incremental view maintenance algorithm repeatedly in m−1 steps.

Proof. A straightforward iterative process can be applied to combine such a collection of m partition groups. The first combination merges two partition groups, while the remaining m−2 partition groups are combined one at a time. Thus the combination ends after m−1 steps. Given that each combination results in a self-contained partition group by Lemma 4.1, the final partition group is self-contained.

Based on Lemmas 4.1 and 4.2, we can see that the cleanup process (merging partition groups with the same partition ID) produces exactly all missing results and no duplicates. Note that memory resident partition groups can be combined with the disk resident parts in exactly the same manner as discussed above. As can be seen, the cleanup process does not rely on any timestamps. We thus do not have to keep track of any timestamps during the state spill process.

4.2 Clean Up of Multiple Stateful Operators

Given a query tree with multiple stateful operators, when operator states from any of the stateful operators have been pushed to disk during run-time, the final cleanup stage that completely removes all persistent data should not be performed in a random order. This is because an operator has to incorporate the missing results generated by the cleanup processes of any of its upstream operators. That is, the cleanup process of the join operators has to conform to the partial order defined in the query tree.

Figure 13 illustrates a five-join query tree (((A ⋈ B ⋈ C) ⋈ D) ⋈ E) with three join operators Join1, Join2, and Join3. Assume we have operator states pushed to disk from all three operators. The corresponding join results from these disk resident states are denoted by ΔI1, ΔI2, and ΔI3. From Figure 13, we can see that the cleanup results of Join1 (ΔI1) have to be joined with the complete operator states related to stream D to produce the complete cleanup results for Join2. Here, the complete stream D state includes states from the disk resident part ΔI2 and the corresponding main memory operator state. The cleanup result of Join2, (ΔI1 + ΔI2), has to join with the complete stream E state in Join3 to produce the missing results.

Figure 13: Clean Up the Operator Tree

Given this constraint, we design a synchronized cleanup process to combine disk resident states and produce all missing results. We start the cleanup from the bottom operator(s), which are the furthest from the root operator, i.e., from all the leaves. The cleanup processes for operators with the same distance from the root can run concurrently. Once an upstream operator completes its cleanup
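Under the same toy set-intersection model of an n-way join, the Lemma 4.1 merge of two self-contained partition groups might look like this (the names and data are illustrative; the delta terms mirror Equation 3 with the first group as the base state):

```python
def nway_join(states):
    """Toy n-way join: each state is a set of join-key values."""
    out = set(states[0])
    for s in states[1:]:
        out &= s
    return out

def merge_groups(base, delta):
    """Merge two self-contained partition groups (Lemma 4.1 sketch):
    compute the Equation-3 delta with `base` as the old state, then
    subtract V_delta, which was already produced before `delta` spilled."""
    n = len(base)
    updated = [b | d for b, d in zip(base, delta)]
    dv = set()
    for i in range(n):
        term = set(delta[i])
        for j in range(i):
            term &= updated[j]
        for j in range(i + 1, n):
            term &= base[j]
        dv |= term
    missing = dv - nway_join(delta)  # results V_delta must not be duplicated
    return updated, missing

P1 = [{1, 2}, {1, 2}, {2}]   # states A1, B1, C1 (illustrative keys)
P2 = [{3}, {3}, {1, 3}]      # states A2, B2, C2, pushed later
combined, missing = merge_groups(P1, P2)
```

The self-containment claim corresponds to the check that the join over the combined states equals the union of both groups' old results plus the newly produced missing results.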

process, it notifies its downstream operator using a control message interleaved in the data stream, signaling that no more intermediate tuples will be sent to its downstream operators hereafter. This message then triggers the cleanup process of the downstream operator. Once the cleanup process of an operator is completed, the operator will no longer be scheduled by the query engine until the full cleanup is accomplished.

This synchronized cleanup process is illustrated in Figure 13. The cleanup process starts from Join1. The generated missing results ΔI1 are sent to the downstream operators. Join1 then generates a special control tuple End-of-Cleanup to indicate the end of its cleanup. The downstream stateful operator Join2 starts its cleanup after receiving the control tuple. All other non-stateful operators, such as split operators, simply pass the End-of-Cleanup tuple through to their downstream operator(s). This process continues until all cleanup processes have completed.

Note that in principle it is possible to start the cleanup processes of all stateful operators at the same time. However, this may require a large amount of main memory space, since each cleanup process brings disk resident states into memory. Moreover, the operator states of the downstream operators cannot be released in any case until their upstream operators finish their cleanup and compute the missing results. With the synchronized method, we instead bring these disk resident states into memory sequentially, one operator at a time. Furthermore, we can safely discard them once the cleanup process of that operator completes.

5. APPLYING TO PARTITIONED PARALLEL QUERY PROCESSING

A query system that processes long-running queries over data streams can easily run out of resources when processing large volumes of input stream data. Parallel query processing over a shared-nothing architecture, i.e., a cluster of machines, has been recognized as a scalable method to address this problem [, 8, 8].
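The control-message protocol just described can be sketched as a chain of operators that clean up in turn when the End-of-Cleanup token arrives (the class and token names are hypothetical):

```python
END_OF_CLEANUP = object()  # hypothetical control token in the data stream

class JoinOp:
    def __init__(self, name, downstream=None):
        self.name, self.downstream = name, downstream
        self.cleaned = False

    def cleanup(self, log):
        """Bring this operator's disk-resident partitions back, emit the
        missing results downstream, then signal End-of-Cleanup."""
        log.append(self.name)
        self.cleaned = True
        if self.downstream:
            self.downstream.on_control(END_OF_CLEANUP, log)

    def on_control(self, token, log):
        if token is END_OF_CLEANUP and not self.cleaned:
            self.cleanup(log)  # upstream finished: now safe to clean up here

join3 = JoinOp("Join3")
join2 = JoinOp("Join2", downstream=join3)
join1 = JoinOp("Join1", downstream=join2)
order = []
join1.cleanup(order)  # start at the operator furthest from the root
```

The token enforces exactly the partial order of the query tree: a downstream operator cannot begin until every intermediate tuple from upstream cleanups has been absorbed.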
Parallel query processing can be especially useful for queries with multiple state-intensive operators that are resource demanding in nature. This is exactly the type of queries we focus on in this work. However, the overall resources of even a distributed system may still be limited. A parallel processing system may still need to temporarily spill state partitions to disk to react to an overall resource shortage immediately. In this section, we illustrate that our proposed state spill strategies naturally extend to such a partitioned parallel query processing environment. This observation broadens the applicability of our proposed spill techniques.

The approach of partitioning input streams (operator states) discussed in Section 2 is still applicable in the context of parallel query processing. In fact, it helps to achieve partitioned parallel query processing [7, , 7]. We can simply spread the stream partitions across different machines, with each machine processing only a portion of all inputs. Figure 14 depicts an example of processing a query plan with two joins in a parallel processing environment. First, stateful operators must be distributed across the available machines. In this work, we choose to allocate all stateful operators in the query tree to all the machines in the cluster, as shown in Figure 14(b). Thus, each machine will have exactly the same number of stateful operators defined in the query tree activated. Each machine processes a portion of all input streams of the stateful operators. The partitioned stateful operators are connected by split operators, as shown in Figure 14(c). One split operator is inserted after each instance of a stateful operator. The output of the operator instance is directly partitioned by the split operator and then shipped to the appropriate downstream operators. Note that other approaches exist for both allocating stateful operators across multiple machines and connecting such partitioned query plans.
However, the main focus of this work is to adapt operator states to address the problem of run-time main memory shortage. The exploration of other partitioned parallel processing approaches as well as their performance is beyond the scope of this paper.

Figure 14: Partitioned Parallel Processing — (a) Original Query; (b) Allocating Multiple Stateful Operators; (c) Composing Partitioned Query Plan

The throughput-oriented state spill strategies discussed in Section 3 naturally apply to partitioned parallel processing environments. This is because the statistics we collect are based on main memory usage and operator states only. However, given partitioned parallel processing, when applying the global output or the global output with penalty state spill strategy, the Poutput value must be traced and then correctly updated across multiple machines. For example, as shown in Figure 15, the query plan is deployed on two machines. If k tuples are generated by Join3, we directly update the Poutput values of the partition groups in Join3 that have produced these outputs. To find out the partition groups in Join2 that contribute to the outputs, we then apply the partition function of Split2 on each output tuple. Note that given partitioned parallel processing, partition groups from different machines may contribute to the same partition group of the downstream operator. Thus, the tracing and updating of Poutput values may involve multiple machines. In this work, we design an UpdatePartitionStatistics message to notify other machines of updates of the Pinter and Poutput values. Since each split operator knows exactly the mappings between partition groups and machines, it is feasible to send the message only to the machines that hold the partition groups to be updated. The revised updateStatistics algorithm is sketched in Algorithm 2. We classify partition group IDs, by applying the current split function, into localIDs and remoteIDs, depending on whether the ID is mapped to the current machine.
Then for the partition groups with localIDs, we update either Pinter or Poutput based on whether the current tpset is a set of intermediate results. For the remoteIDs, we compose UpdatePartitionStatistics messages with the appropriate information and then send the messages to the machines that hold the partition groups with their IDs in remoteIDs.
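The localIDs/remoteIDs split at the heart of Algorithm 2 could be sketched as follows (the partition-to-machine map, the message batching, and all names are assumptions for illustration):

```python
def classify_partition_ids(tpset, split_fn, partition_to_machine, here):
    """Partition IDs mapped to this machine are updated locally; the rest
    are batched into per-machine UpdatePartitionStatistics messages."""
    local_ids, remote_msgs = [], {}
    for tp in tpset:
        pid = split_fn(tp)
        machine = partition_to_machine[pid]
        if machine == here:
            local_ids.append(pid)
        else:
            remote_msgs.setdefault(machine, []).append(pid)
    return local_ids, remote_msgs

p2m = {0: "m1", 1: "m2", 2: "m1", 3: "m2"}      # split operator's mapping
tps = [{"k": 0}, {"k": 1}, {"k": 3}, {"k": 2}]
local, remote = classify_partition_ids(tps, lambda t: t["k"] % 4, p2m, "m1")
```

Batching the remote IDs per machine is what lets the split operator send one message per destination rather than one per tuple.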

Figure 15: Tracing the Number of Outputs

Algorithm 2 updateStatisticsRev(tpset, intermediate)
/* Tracing and updating the Poutput/Pinter values for a given set of output tuples tpset. intermediate is a boolean indicating whether tpset contains intermediate results of the query tree. */
1: op ← root operator of the query plan;
2: prv_op_ref ← op.getUpstreamOperatorReference();
3: prv_split_ref ← op.getUpstreamSplitReference();
4: while ((prv_op_ref != null) && (prv_split_ref != null)) do
5:   for each tp ∈ tpset do
6:     cpID ← Compute partitionID of tp in prv_op_ref;
7:     Classify cpID into localIDs/remoteIDs;
8:   end for
9:   if (intermediate) then
10:    Update Pinter of localIDs;
11:  else
12:    Update Poutput of localIDs;
13:  end if
14:  Compose & send UpdatePartitionStatistics msg(s) for remoteIDs;
15:  prv_split_ref ← prv_split_ref.getUpstreamSplitReference();
16:  prv_op_ref ← prv_op_ref.getUpstreamOperatorReference();
17: end while

6. PERFORMANCE STUDIES

6.1 Experimental Setup

All state spilling strategies discussed in this paper have been implemented in the D-CAPE system, a prototype continuous query system [3]. We use the five-join query tree illustrated in Figure 15 to report our experimental results. The query is defined on 5 input streams denoted as A, B, C, D, and E, with each input stream having two columns. Here Join1 is defined on the first column of each input stream A, B, and C. Join2 is defined on the first join column of input D and the second join column of input C, while Join3 is defined on the first column of input E and the second column of input D. The average tuple inter-arrival time is set to 50 ms for each input stream. All joins utilize the symmetric hash-join algorithm []. We deploy the query on two machines, with each processing about half of all input partitions. Each machine has dual 2.4GHz Xeon CPUs with 2G main memory. All input streams are partitioned into 300 partitions. We set the memory threshold (θm) for state spilling to 60 MB for each machine. This means the system starts spilling states to disk when the memory usage of the system exceeds 60 MB.

We vary two factors, namely the tuple range and the range join ratio, when generating input streams. We specify that a data value V appears R times for every K input tuples. Here K is defined as the tuple range and R as the range join ratio for V. Different values (partitions) in each join operator can have different range join ratios. The average of these ratios is defined as the average join ratio for that operator.

6.2 Experimental Evaluation

Figure 16 compares the run-time phase throughput of the different state spilling strategies. Here we set the average join ratio of Join1 to 3, while the average join ratio of Join2 and Join3 is 2. In Figure 16, the X-axis represents time, while the Y-axis denotes the overall run-time throughput. From Figure 16, we can see that both the local output approach and the bottom-up approach perform much worse than the global output and the global output with penalty approaches. This is as expected, because the local output and the bottom-up approaches do not consider the productivity of partition groups at the global level. From Figure 16, we also see that the global output with penalty approach performs even better than the global output approach. This is because the global output with penalty approach is able to efficiently use the main memory resource by considering both the partition group size as well as the possible intermediate results that have to be stored in the query tree.

Figure 16: Comparing Run-time Throughput with Join Ratios 3, 2, 2

Figures 17 and 18 show the corresponding memory usage when applying the different spilling strategies. Figure 17 shows the memory usage of the global output approach and the global output with penalty approach. Note that each zig in the lines indicates one state spill process.
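The tuple range / range join ratio workload described above might be generated along these lines (the function name and the choice of fresh non-joining values are assumptions of this sketch):

```python
import random

def generate_stream(n, tuple_range, join_ratio, value, seed=0):
    """Emit n tuples in which `value` appears `join_ratio` (R) times in
    every window of `tuple_range` (K) tuples; the remaining slots get
    fresh values that never join."""
    rng = random.Random(seed)
    out, fresh = [], 10_000  # fresh values start outside the joining range
    for start in range(0, n, tuple_range):
        window = min(tuple_range, n - start)
        slots = rng.sample(range(window), min(join_ratio, window))
        for i in range(window):
            if i in slots:
                out.append(value)
            else:
                fresh += 1
                out.append(fresh)
    return out

s = generate_stream(1000, tuple_range=100, join_ratio=3, value=42)
```

Averaging the per-value ratios over all values of a join column yields the average join ratio used to parameterize each operator in the experiments.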
From Figure 17, we can see that the global output approach triggers noticeably more state spill processes over the 50 minutes of running than the global output with penalty approach. Again, this is expected, since the global output with penalty approach considers both the size of the partition group and the overall memory impact on the query tree. As discussed in Section 3, having a smaller number of state spill processes does not imply a high overall run-time throughput. In Figure 18, the bottom-up approach only has 7 adaptations. However, the run-time throughput of the bottom-up approach is much less than the global output