A Comparison of Top-k Temporal Keyword Querying over Versioned Text Collections

Wenyu Huo and Vassilis J. Tsotras
Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA

Abstract. As the web evolves over time, the amount of versioned text collections increases rapidly. Most web search engines will answer a query by ranking all known documents at the (current) time the query is posed. There are applications, however (for example customer behavior analysis, crime investigation, etc.), that would need to efficiently query these sources as of some past time, that is, retrieve the results as if the user was posing the query in a past time instant, thus accessing data known as of that time. Ranking and searching over versioned documents considers not only keyword constraints but also the time dimension, most commonly a time point or time range of interest. In this paper, we deal with top-k query evaluations with both keyword and temporal constraints over versioned textual documents. In addition to considering previous solutions, we propose novel data organization and indexing solutions: the first one partitions data along ranking positions, while the other maintains the full ranking order through the use of a multiversion ordered list. We present an experimental comparison for both time point and time interval constraints. For time-interval constraints, different querying definitions, such as aggregation functions and consistent top-k queries, are evaluated. Experimental evaluations on large real world datasets demonstrate the advantages of the newly proposed data organization and indexing approaches.

1 Introduction

Versioned text collections are textual documents that retain multiple versions as time evolves. Numerous such collections are available today; a well-known example is the collaborative authoring environment, such as Wikipedia [1], where textual content is explicitly version-controlled. Similarly, web archiving applications such as the Internet Archive [2] and the European Archive [3] store regular crawls (over time) of web pages on a large scale.
Other time-stamped textual information, such as weblogs, micro-blogs, and even feeds and tags, also creates versioned text collections. If a text collection does not retain past documents, then a search query ranks only the documents as of the most current time. If the collection contains versioned documents, a search typically considers each version of a document as a separate document, and the ranking is taken over all documents independently of each document's version (creation time).

S.W. Liddle et al. (Eds.): DEXA 2012, Part II, LNCS 7447, pp , Springer-Verlag Berlin Heidelberg 2012

There are applications, however, where this approach is not adequate. Consider the following example: in order for a company to analyze consumer comments on a specific product before some event occurred (new product, advertisement campaign, etc.), a temporal constraint may be very useful. For example, to view opinions on iPhone4, a time-window between 06/07/2010 (its announce date) and 10/04/2011 (the announce date of iPhone4s) could be a fair choice. Many investigation scenarios also require combining the keyword search with a time-window of interest. For example, while considering a financial crime, an investigator may need to identify what information was available to the accused as of a specific time instant in the past.

Providing as-of queries is a challenging problem. The first challenge is the data volume. Document collections like Wikipedia and the Internet Archive are already huge even if only their most recent snapshot is considered. When searching in their evolutionary history, we are faced with even larger data volumes. The second challenge is how to quickly return the top-k temporally ranked candidates. Note that returning all qualified results without temporal constraints would not be efficient, since two extra steps are required: (i) filtering out results later than the query-specified time constraint, and (ii) ranking the remaining results so as to provide the top-k answers.

In this paper we present an experimental evaluation of the top-k query over versioned text collections, comparing previously proposed as well as novel approaches. In particular, the key contributions can be summarized as:

1. Previous methods related to versioned text keyword search are suitably extended for top-k temporal queries.
2. Novel approaches are proposed in order to accelerate top-k temporal queries. The first approach partitions the temporal data based on their ranking positions, while the other maintains the full rank order using a multiversion ordered list.
3. In addition to top-k time-point keyword-based search, we also consider two time-interval (or time-range) variants, namely aggregation ranking and consistent top-k querying.
4. Experimental evaluations with large-scale real-world datasets are performed on both the previous and the newly proposed methods.

The rest of the paper is organized as follows. Preliminaries and related work are introduced in Section 2. Our novel approaches appear in Section 3. Different query definitions of time-interval top-k queries are presented in Section 4. All techniques are comprehensively evaluated and compared in a series of experiments in Section 5, while the conclusions appear in Section 6.

2 Preliminaries and Related Work

2.1 Definitions

The data model for versioned document collections was formally introduced in [10], and used by later works [9, 5, 6]. Let D be a set of n documents $d_1, d_2, \ldots, d_n$, where each document $d_i$ is a sequence of $m_i$ versions: $d_i = \{d_i^1, d_i^2, \ldots, d_i^{m_i}\}$. Each version has a semi-closed validity time-interval (or lifespan) $life(d_i^j) = [t_s, t_e)$. Moreover, it is assumed that different versions of the same document have disjoint life spans. An example of five documents and their versions appears in Fig. 1; each document corresponds to a colored line, while segments represent different versions of a document.

Fig. 1. Example of versioned documents with scores for one term (docid d1 to d5 versus time t0 to t8)
score-order: (d4, 1, t0, t1), (d2, 0.95, t1, t4), (d2, 0.9, t4, t8), (d2, 0.7, t0, t1), (d4, 0.7, t1, t3), (d4, 0.4, t3, t6), (d4, 0.25, t6, t8)
ts-order: (d2, 0.7, t0, t1), (d4, 1, t0, t1), (d2, 0.95, t1, t4), (d4, 0.7, t1, t3), (d4, 0.4, t3, t6), (d2, 0.9, t4, t8), (d4, 0.25, t6, t8)

The inverted file index is the standard technique of text indexing for keyword queries, deployed in many search engines. Assuming a vocabulary V, for each term v in V, the index contains an inverted list $L_v$ consisting of postings of the form (d, s), where d is a document identifier and s is the so-called payload score. There are numerous existing relevance scoring functions, such as tf-idf [7], language models [13] and Okapi BM25 [14]. The actual scoring function is not important for our purposes; for simplicity we assume that the payload score contains the term frequency of v in d.

In order to support temporal queries, the inverted file index must also contain temporal information. Thus [10] proposed adding the temporal lifespan explicitly in the index postings. Each posting includes the validity time-interval of the corresponding document version: $(d, s, t_s, t_e)$, where the document d had payload score s during the time interval $[t_s, t_e)$. If the document evolution contains few changes over time, the associated score of most terms is unchanged between adjacent versions. In order to reduce the number of postings in an index list, [10] coalesces temporally adjacent postings belonging to the same document that have identical (or approximately identical) scores.

A general keyword search query Q consists of a set of x terms $q = (v_1, v_2, \ldots, v_x)$ and a temporal interval [lb, rb].
Without loss of generality, we assume that the aggregated score of a document version for keyword query q is the sum of its scores for each term v in q. The time-interval [lb, rb] restricts the candidate document versions to a subset of the original collection: $D^{[lb,rb]} = \{ d_i^j \in D \mid [lb, rb] \cap life(d_i^j) \neq \emptyset \}$. When lb = rb holds, the query time interval collapses into a single time point t. For simplicity we first concentrate on the time-point query; the more complex time-interval queries and their variations are discussed in Section 4.

The answer R to a Top-K Time-Point keyword query TKTP = (q, t, k) over collection D is a set of k document versions satisfying:

$\{ d_i^j \in R \mid (\forall v \in q: v \in d_i^j) \wedge (d_i^j \in D^t) \wedge (\forall d' \in (D^t - R): s(d_i^j) \ge s(d')) \}$, where $D^t = \{ d_i^j \in D \mid t \in life(d_i^j) \}$.

The first condition presents the keyword constraint, the second condition the temporal constraint, while the third implies that the top-k scored document versions are returned. Now we present how to answer query TKTP using previous methods based on temporal inverted indexes.

2.2 Previous Methods

The straightforward way (referred to as basic) to solve query TKTP uses exactly one inverted list for each vocabulary term v, with postings of the form $(d, s, t_s, t_e)$. To answer a top-k query, the corresponding inverted lists are traversed and postings are fetched. When a posting is scanned, it is also checked against the time point specified in TKTP.

The sort-order of the index lists is also important. One natural choice is to sort each list in score order. This method (score-order) enables the classical top-k algorithms [11] to stop early, after having identified the k highest scores with qualified lifespan. Another suitable sorting choice is to order the lists first by the start time $t_s$ and then by score ($t_s$-order), which is beneficial for checking the temporal constraint. However, this approach is not efficient for top-k querying, especially when the query includes multiple terms. Fig. 1 shows the score-order and $t_s$-order lists for a specific term.

Note that the efficiency of processing a top-k temporal query is adversely affected by the wasted I/O due to postings that are read but skipped. We proceed with various materialization ideas that slice the whole list of a term into several sub-lists or partitions, thus improving processing costs.

Interval Based Slicing. This method splits each term list along the time-axis into several sub-lists, each of which corresponds to a contiguous sub-interval of the time spanned by the full list. Each of these sub-lists contains all coalesced postings that overlap with the corresponding time interval. Note that index entries whose validity time-interval spans across the slicing boundaries are replicated in each of the spanned sub-lists. The selection of the time-intervals where the slices are created is vital, as discussed in [9, 5].
One obvious strategy is to eagerly slice sub-lists for all possible time instants (adjacent identical lists can be merged). This creates one sub-list per time instant and provides ideal query performance for a TKTP query, since only the postings in the sub-list for the query time point will be accessed. We refer to this method as elementary. Note that the basic and elementary methods are two extremes: the former requires minimal space but more processing at query time, since many entries irrelevant to the temporal constraint are accessed; the latter provides the best possible performance (for a time-point query) but is not space-efficient (due to the copying of entries among sub-lists).

To explore the trade-off between space and performance, [5] employs a simple but practical approach (referred to as Fix) in which a partition boundary is placed after a fixed time window. The window size can be a week, a month, a year, or another flexible choice. Fig. 2 shows the Fix-2 and Fix-4 sub-lists of our running example from Fig. 1, with partition time windows of 2 and 4 time instants respectively. Nevertheless, all variations of interval based slicing suffer from an index-size blowup, since entries whose valid-time interval spans across the slicing boundaries are replicated.
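The Fix slicing described above, together with a score-ordered scan for a single-term TKTP query, can be sketched as follows. This is a minimal in-memory illustration, not the implementation evaluated in the paper: the names `Posting`, `fix_slices` and `tktp_single_term` are hypothetical, time instants t0..t8 are mapped to the integers 0..8, and only the d2 and d4 postings listed in Fig. 1 are used.

```python
from collections import namedtuple

# Posting layout (d, s, ts, te) as in Section 2.1; names are illustrative only.
Posting = namedtuple("Posting", "doc score ts te")

# Postings of d2 and d4 from the running example (t0..t8 mapped to 0..8).
POSTINGS = [
    Posting("d4", 1.0, 0, 1), Posting("d2", 0.95, 1, 4), Posting("d2", 0.9, 4, 8),
    Posting("d2", 0.7, 0, 1), Posting("d4", 0.7, 1, 3), Posting("d4", 0.4, 3, 6),
    Posting("d4", 0.25, 6, 8),
]

def fix_slices(postings, window, t_min, t_max):
    """Split a term list into score-ordered sub-lists, one per fixed time window."""
    slices = {}
    for lo in range(t_min, t_max, window):
        hi = min(lo + window, t_max)
        # A posting is replicated into every window that its lifespan [ts, te) overlaps.
        members = [p for p in postings if p.ts < hi and p.te > lo]
        slices[(lo, hi)] = sorted(members, key=lambda p: -p.score)
    return slices

def tktp_single_term(slices, t, k):
    """Top-k postings alive at time t, reading only the slice that covers t."""
    for (lo, hi), sublist in slices.items():
        if lo <= t < hi:
            result = []
            for p in sublist:               # score order allows an early stop
                if p.ts <= t < p.te:        # temporal check on each scanned posting
                    result.append(p)
                    if len(result) == k:
                        break
            return result
    return []

fix2 = fix_slices(POSTINGS, 2, 0, 8)
fix4 = fix_slices(POSTINGS, 4, 0, 8)
top1_at_t5 = tktp_single_term(fix2, 5, 1)
```

Under these assumptions the seven original postings grow to 11 entries for Fix-2 and 8 entries for Fix-4, which is the replication overhead the text refers to; smaller windows mean fewer skipped postings per query but more copies.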

Fix-2:
[t0, t2): (d4, 1, t0, t1), (d2, 0.95, t1, t4), (d2, 0.7, t0, t1), (d4, 0.7, t1, t3)
[t2, t4): (d2, 0.95, t1, t4), (d4, 0.7, t1, t3), (d4, 0.4, t3, t6)
[t4, t6): (d2, 0.9, t4, t8), (d4, 0.4, t3, t6)
[t6, t8): (d2, 0.9, t4, t8), (d4, 0.25, t6, t8)
Fix-4:
[t0, t4): (d4, 1, t0, t1), (d2, 0.95, t1, t4), (d2, 0.7, t0, t1), (d4, 0.7, t1, t3), (d4, 0.4, t3, t6)
[t4, t8): (d2, 0.9, t4, t8), (d4, 0.4, t3, t6), (d4, 0.25, t6, t8)
Fig. 2. Time Interval Based Slicing sub-list examples

Stencil Based Partitioning. Another index partitioning method along the time-axis was proposed in [12]. It is distinguished from interval based slicing by its use of a multi-level hierarchical (vertical) partitioning of the lifespan. For the inverted list of term v, level $L_0$ contains the entire lifespan of the list, while level $L_{i+1}$ is obtained from $L_i$ by partitioning each interval in $L_i$ into b sub-intervals. Such a partitioning is called a stencil; each index posting is placed into the deepest interval in the multi-level partitioning that fits its range. A stencil-based partition of three levels with b = 2 for the running example (from Fig. 1) is shown in Fig. 3. Compared to time interval based slicing, stencil based partitioning has a significant advantage in space, because each posting falls into a single list: the deepest sub-interval that it fits. Nevertheless, for a time-point query stencil based partitioning has to fetch multiple sub-lists, one from each level.

Fig. 3. Stencil-based partitioning with 3 levels and b = 2 (levels L0, L1 and L2 over the time axis t0 to t8; postings shown: d1 0.6, d2 0.95, d4 0.7, d3 0.5, d4 0.4, d5 0.75, d2 0.9, d2 0.7, d4 1, d4 0.25)

The sort-order of each sub-list is again important. Since the temporal partitioning already shreds one full list into several sub-lists along the time-axis, a more appropriate choice for top-k queries is score-ordering.

Temporal Sharding. The approach proposed in [6] is to shard (or horizontally partition) each term list along the document identifiers instead of time. Entries in a term list are thus distributed over disjoint sub-lists called shards, and entries in a shard are ordered according to their start times $t_s$.
So as to eliminate wasteful reads, entries within a shard g satisfy a staircase property: $\forall p, q \in g: t_s(p) \le t_s(q) \Rightarrow t_e(p) \le t_e(q)$. An optimal greedy algorithm for creating this partitioning is given in [6]; an example of temporal sharding for the term list from Fig. 1 is shown in Fig. 4.
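To make the staircase property concrete, the sketch below builds shards with a simple first-fit greedy pass and then reads one shard at a time point. It is an illustration under stated assumptions and should not be read as the exact algorithm of [6]: the function names are hypothetical, time instants are integers, and the lifespans of d1, d3 and d5 are assumed for the example because the source lists only the d2 and d4 postings.

```python
from bisect import bisect_right
from collections import namedtuple

Posting = namedtuple("Posting", "doc score ts te")

# Running example, t0..t8 mapped to 0..8.
# ASSUMPTION: the lifespans of d1, d3 and d5 are chosen for illustration.
POSTINGS = [
    Posting("d2", 0.7, 0, 1), Posting("d4", 1.0, 0, 1), Posting("d1", 0.6, 0, 6),
    Posting("d3", 0.5, 0, 8), Posting("d4", 0.7, 1, 3), Posting("d2", 0.95, 1, 4),
    Posting("d4", 0.4, 3, 6), Posting("d5", 0.75, 3, 8), Posting("d2", 0.9, 4, 8),
    Posting("d4", 0.25, 6, 8),
]

def shard_first_fit(postings):
    """Place each posting, in (ts, te) order, into the first shard whose
    end times stay non-decreasing; open a new shard if none qualifies."""
    shards = []
    for p in sorted(postings, key=lambda p: (p.ts, p.te)):
        for shard in shards:
            if shard[-1].te <= p.te:
                shard.append(p)
                break
        else:
            shards.append([p])
    return shards

def read_shard(shard, t):
    """Entries of one shard alive at time t.

    Start and end times are both non-decreasing inside a shard, so the alive
    entries form one contiguous run: skip entries that ended by t, then read
    until the first entry that starts after t."""
    ends = [p.te for p in shard]
    i = bisect_right(ends, t)               # first entry with te > t
    alive = []
    while i < len(shard) and shard[i].ts <= t:
        alive.append(shard[i])
        i += 1
    return alive

shards = shard_first_fit(POSTINGS)
alive_at_t5 = [p for shard in shards for p in read_shard(shard, 5)]
```

With the assumed lifespans this yields two shards of 7 and 3 entries. Every shard still has to be read for a query, and the entries come back in time order rather than score order, which is the limitation discussed next.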

Shard_1: e1: (d2, 0.7, t0, t1); e2: (d4, 1, t0, t1); e3: ; e4: ; e5: ; e6: (d2, 0.9, t4, t8); e7: (d4, 0.25, t6, t8)
Shard_2: e1: (d4, 0.7, t1, t3); e2: (d2, 0.95, t1, t4); e3: (d4, 0.4, t3, t6)
Fig. 4. Temporal sharding example (docid d1 to d5 versus time t0 to t8)

As with the stencil based approach, the space usage for temporal sharding is optimal, since there are no replications of index entries. However, for query processing, all shards for each term need to be accessed, resulting in multiple sub-list readings. Moreover, the entries in each shard can only be time-ordered (based on start time $t_s$). Thus the benefit of score-ordering for ranked queries cannot be achieved, because all temporally valid entries have to be fetched.

3 Novel Approaches

A common characteristic of existing works is that they only consider the versioned documents on the time and docid axes, and try to partition the data along either direction. Instead, we view the index entries from a new angle, namely their score over time, and create index organizations to improve the performance of top-k querying. The score-time view of the example from Fig. 1 is shown in Fig. 5.

Fig. 5. The Score-Time view of the versioned documents (docid versus time and score versus time, over t0 to t8)

Recall that to answer a TKTP query we should be able to quickly find the top-k scores of a term at a given time instant. The main idea behind the score-time view is to maintain an index that will provide the top scores per term at each time instant. For example, at time $t_0$, the term depicted in Fig. 5 had scores 1 (from $d_4$), 0.7 (from $d_2$), 0.6 (from $d_1$) and 0.5 (from $d_3$). These orderings change as time proceeds; for example at time $t_2$, the top score is 0.95 from $d_2$, etc.

In rank-based partitioning (Section 3.1), we first discuss a simplistic approach (SPR) where an index is created for each rank position of a term.
For example, there is an index that maintains the top score over time, then one for the second top score, etc. More practical is the group ranking approach (GR), where an index is created to maintain the group of the top-g scores (g is a constant), then the next top-g, etc. We also consider temporal indexing methods (Section 3.2). One solution is to use the Multiversion B-tree and maintain the whole ranked list in order over time. We observe, however, that these ranked lists are always accessed in order, so a better solution is provided (the multiversion list) that appropriately links the data pages of the temporal index, without the overhead of the index nodes.

3.1 Rank Based Partitioning

The Single Position Ranking (SPR) approach creates a separate temporal index for each ranking position of a term. Thus, for the i-th ranking position (i = 1, 2, ...), a sub-list is maintained that contains all the entries that ever existed on position i over time. Together with each entry we maintain the time interval during which this entry occupied that position. All sub-list entries are ordered based on their recorded starting time; a B+tree built on the start times can easily locate the appropriate entry at a given time. The SPR of our running example (from Fig. 1) is shown in Fig. 6(a). Space can be saved by using only the start time of each entry, but for simplicity we show the end times as well (the end time is needed only if there is no entry in a particular position, and this can happen only at the last position).

Using the SPR approach, to process a TKTP query about time t, the first k sub-lists have to be accessed for each relevant term; from each sub-list the B+tree will provide the appropriate score (and document id) of this term at time t. If each sub-list has m items on average, the estimated time complexity is $O(k \log_B m)$ (here B corresponds to the page size in records). Many sub-list accesses can degrade querying performance; moreover, in this simple SPR method the same posting can be duplicated in multiple ranking position sub-lists. This unavoidable replication may result in storage overhead.
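The SPR organization just described can be sketched in memory as follows. This is only an illustration under assumptions: `build_spr` and `tktp_spr` are hypothetical names, a binary search over start times stands in for the B+tree, and the lifespans of d1, d3 and d5 are assumed because the source lists only the d2 and d4 postings.

```python
from bisect import bisect_right
from collections import namedtuple

Posting = namedtuple("Posting", "doc score ts te")

# Running example, t0..t8 mapped to 0..8.
# ASSUMPTION: the lifespans of d1, d3 and d5 are chosen for illustration.
POSTINGS = [
    Posting("d2", 0.7, 0, 1), Posting("d4", 1.0, 0, 1), Posting("d1", 0.6, 0, 6),
    Posting("d3", 0.5, 0, 8), Posting("d4", 0.7, 1, 3), Posting("d2", 0.95, 1, 4),
    Posting("d4", 0.4, 3, 6), Posting("d5", 0.75, 3, 8), Posting("d2", 0.9, 4, 8),
    Posting("d4", 0.25, 6, 8),
]

def build_spr(postings):
    """Return {position: [(doc, score, start, end), ...]} ordered by start time."""
    bounds = sorted({p.ts for p in postings} | {p.te for p in postings})
    spr = {}
    for lo, hi in zip(bounds, bounds[1:]):
        # Rank the postings alive during the elementary interval [lo, hi).
        alive = sorted((p for p in postings if p.ts <= lo < p.te),
                       key=lambda p: -p.score)
        for pos, p in enumerate(alive, start=1):
            sub = spr.setdefault(pos, [])
            if sub and sub[-1][:2] == (p.doc, p.score) and sub[-1][3] == lo:
                # Same posting keeps this position: extend its interval.
                sub[-1] = (p.doc, p.score, sub[-1][2], hi)
            else:
                sub.append((p.doc, p.score, lo, hi))
    return spr

def tktp_spr(spr, t, k):
    """One lookup per ranking position 1..k (binary search replaces the B+tree)."""
    result = []
    for pos in range(1, k + 1):
        sub = spr.get(pos, [])
        i = bisect_right([e[2] for e in sub], t) - 1
        if i >= 0 and sub[i][3] > t:
            result.append(sub[i][:2])
    return result

spr = build_spr(POSTINGS)
top3_at_t2 = tktp_spr(spr, 2, 3)
```

In this sketch d3 ends up in the sub-lists of both position 3 and position 4, a small instance of the replication mentioned above.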
(a) SPR, by ranking position:
1: (d4, 1, t0, t1), (d2, 0.95, t1, t4), (d2, 0.9, t4, t8)
2: (d2, 0.7, t0, t1), (d4, 0.7, t1, t3)
3:
4: (d4, 0.25, t6, t8)
5: (d4, 0.4, t3, t6)
(b) GR (g = 2):
gr1: (d4, 1, t0, t1), (d2, 0.7, t0, t1), (d2, 0.95, t1, t4), (d4, 0.7, t1, t3), (d2, 0.9, t4, t8)
gr2: (d4, 0.25, t6, t8)
gr3: (d4, 0.4, t3, t6)
Fig. 6. Ranking position based partitioning

Group Ranking (GR). In order to save space and improve querying performance, GR maintains an index not for a single ranking position, but for a group of positions. Let the group size be g. The first g ranked elements are in group $gr_1$, the next g ranked elements are in group $gr_2$, etc. Thus, compared to the n sub-lists maintained in SPR for n ranking positions, GR uses n/g sub-lists instead. With respect to the I/O of top-k querying, we only need k/g random accesses (each of them still logarithmic).

As with SPR, each member within a group also records the time interval during which the member was in the group. For example, assume that group $gr_i$ maintains ranking positions (i-1)g+1 through ig. If at time $t_s$ the score of a particular term falls within these positions, this score is added to the group, with an interval starting at $t_s$. As long as this score falls within the ranking positions of this group, it is considered part of the group; if at time $t_e$ it falls out of the group, the end time of its interval is updated to $t_e$.

To save on update time, within each group we do not maintain the rank order. That is, each group is treated as an unordered set of scores that evolves over time. To answer a TKTP query that involves a particular group $gr_i$ at time t, we need to identify what members group $gr_i$ had at time t. Since the size of the group is fixed, we can easily sort these member scores and provide them to the TKTP result in rank order. Across groups, however, the order is guaranteed: at any given time t, the members in $gr_i$ have no lower scores than those in group $gr_j$, where $i < j \le \lceil n/g \rceil$. The GR approach for the above example (from Fig. 1) with g = 2 is shown in Fig. 6(b).

An interesting question is what index to employ for maintaining each group over time. Unlike SPR, each group at a given time may contain multiple entries; thus a B+tree index on the temporal start times is not enough. Instead, temporal index structures that efficiently maintain and reconstruct an evolving set over time, like the snapshot index [15], can be used to accelerate temporal querying.

Note that when implementing GR in practice, each group may have a different size g. It is preferable to use a smaller g for the top groups and a larger g for the lower groups (since the focus is on top-k, the few top groups will be accessed more frequently and thus we prefer to give them faster access). For simplicity, however, we use the same g for all groups.
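A minimal in-memory sketch of group ranking follows, showing how membership intervals are recorded and how a time-point query sorts each small group on the fly. The names `build_gr` and `tktp_gr` are hypothetical, a linear scan over group members stands in for the snapshot index [15], and the lifespans of d1, d3 and d5 are assumed for the example.

```python
from collections import namedtuple

Posting = namedtuple("Posting", "doc score ts te")

# Running example, t0..t8 mapped to 0..8.
# ASSUMPTION: the lifespans of d1, d3 and d5 are chosen for illustration.
POSTINGS = [
    Posting("d2", 0.7, 0, 1), Posting("d4", 1.0, 0, 1), Posting("d1", 0.6, 0, 6),
    Posting("d3", 0.5, 0, 8), Posting("d4", 0.7, 1, 3), Posting("d2", 0.95, 1, 4),
    Posting("d4", 0.4, 3, 6), Posting("d5", 0.75, 3, 8), Posting("d2", 0.9, 4, 8),
    Posting("d4", 0.25, 6, 8),
]

def build_gr(postings, g):
    """Return {group number: [(doc, score, start, end), ...]}.
    Each member records the interval during which it belonged to the group;
    no rank order is kept inside a group."""
    bounds = sorted({p.ts for p in postings} | {p.te for p in postings})
    groups = {}
    for lo, hi in zip(bounds, bounds[1:]):
        alive = sorted((p for p in postings if p.ts <= lo < p.te),
                       key=lambda p: -p.score)
        for rank0, p in enumerate(alive):
            members = groups.setdefault(rank0 // g + 1, [])
            for i, m in enumerate(members):
                if m[:2] == (p.doc, p.score) and m[3] == lo:
                    members[i] = (p.doc, p.score, m[2], hi)   # still in the group
                    break
            else:
                members.append((p.doc, p.score, lo, hi))      # entered the group at lo
    return groups

def tktp_gr(groups, g, t, k):
    """Visit the first ceil(k / g) groups and sort each group's members at time t."""
    result = []
    for gi in range(1, -(-k // g) + 1):
        alive = [m[:2] for m in groups.get(gi, []) if m[2] <= t < m[3]]
        result.extend(sorted(alive, key=lambda m: -m[1]))
    return result[:k]

groups = build_gr(POSTINGS, 2)
top3_at_t5 = tktp_gr(groups, 2, 5, 3)
```

Under these assumptions GR with g = 2 stores 10 entries in 3 groups, and a top-3 query touches only the first two groups.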
3.2 Using a Multiversion List

Consider the ordered list of scores that a term has over all documents at time t; as time evolves, this list changes (new scores are added, scores are promoted, demoted or even removed, etc.). Temporal indexing methods have addressed a more general problem: how to maintain an evolving set of keys over time. This set is allowed to change by adding, deleting or updating keys; the main temporal query supported is the so-called temporal range query: given t, provide the keys that were in the set at time t and are within key range r.

The Multiversion B-tree (MVBT) proposed in [8] is an asymptotically optimal (in terms of I/O accesses under linear space) solution to the temporal range query. Assuming that a total of n changes occurred in the set evolution, the MVBT uses linear space, O(n/B). Consider a temporal range query that specifies range r and time t, and let $a_t$ denote the number of keys that were within range r at time t (i.e., the number of keys that satisfy the query); the MVBT answers this query using $O(\log_B n + a_t/B)$ page I/Os, which is optimal in linear space [8].

In order for the MVBT to maintain order among the keys, it uses a B+-tree to index the set. As the set evolves, so does the B+-tree. Conceptually the MVBT contains all B+-trees over time; for a given query time t the MVBT provides access to the root of the appropriate B+-tree, etc. Of course, the MVBT does not copy all B+-trees (as this would result in quadratic space). Instead it uses clever page update policies.

In particular, when a key k is added to the evolving set at time t, a record is inserted in the (leaf) data page whose range contains k; this record stores key k and a time interval of the form [t, *). The * denotes that key k has not been updated yet. If later, at time t', this key is removed from the set, its record is not physically deleted. Instead, this change is represented by changing the * to t' in this record's interval. A record is called alive for all time instants in its interval. Given a query about time t, the MVBT identifies all data pages that contain alive records for that time t.

In contrast to a regular B+-tree, which deals with pages that get underutilized due to record deletions, the MVBT pages cannot get underutilized because no record is ever deleted. As in the B+-tree, pages can get full of records and need to be split (page overflow). However, the MVBT also needs to guarantee that the number of alive records in a page does not fall below a lower threshold l (weak version underflow) and does not go over an upper threshold u (strong version overflow); note that l and u are O(B). If a page overflows, a time-split occurs, which copies the alive records of the overflown page (at the time of the overflow) to a new page. If there are too few alive records, the page is merged with a sibling page that is also first time-split. If there are too many alive records, a key split is first applied (among the alive records) [8].

Using the MVBT for our purposes means that the scores play the role of keys. That is, the MVBT will maintain the order of scores over time. Since, however, term records are accessed by the docid they belong to, a hashing index is also needed that, for a given docid, provides the leaf page that holds the record with this term's current score. This hashing scheme need only maintain the most current scores (i.e., it does not need to maintain past positions).

Nevertheless, the above MVBT approach has a significant overhead. In particular, it is built to answer queries about any range of scores.
This is achieved by starting from an appropriate root of the MVBT and following index nodes until the leaf data pages in the query range are accessed. For top-k processing, however, we only access scores in decreasing order, starting with the largest score at a particular time instant. Thus, what we actually need is a way to access the leaf page that has the highest scores at a particular time, and then follow to its sibling leaf page (with the next lower scores) at that time, etc. We still maintain the split policies among the leaf pages, but we do not use the MVBT's index nodes. Effectively we maintain a multiversion list (MList), i.e., a list of the leaf data pages over time.

To access the leaf data page that has the highest scores at a given time, we maintain an array A with records of the form (t, p), where t is a time instant and p is a pointer to the leaf page with the highest scores at time t. If later, at time t', another page p' becomes the leaf page with the highest scores, array A is updated with a record (t', p'). If this array becomes too large for main memory, it can easily be indexed by a B+-tree on the (ordered) time attribute.

For the above list-of-leaf-pages idea to work, each leaf page needs to remember its next sibling leaf page (with lower scores) at each time. (Note: the MVBT does not require the sibling pointers, since access to siblings is done through the parent index nodes.) One could still use the array approach (one array responsible for access to the second leaf page, one for the third, etc.), but this would require many array lookups at query time, each such lookup taking $O(\log_B n)$ page I/Os. Instead, we propose to embed these arrays within the page structure. That is, within each leaf page, we allocate a space of c records (where c is a constant) for the sibling page pointer records (also of the form (t, p)). As a result, each leaf page now has space for B-c score records. Since, however, the sibling page can change over time, it is possible that for a leaf page p the sibling will change more than c times. If this happens at time t, page p is time-split, that is, a new leaf page p' is created containing only the currently alive records of page p and with an empty array for sibling pointers. Moreover, p' replaces p in the list. If before t the list of leaf pages contained pages m, p, v (in that order), a new record (t, p') is added to the array of page m, and the array of page p' is initialized with a record (t, v). If p was the first page, the record (t, p') is added to array A.

The advantage of the MList approach is apparent at query processing time. A search is first performed within array A for time t (in $O(\log_B n)$ page I/Os). This provides access to the page with the highest scores at time t. The next sibling page at time t is then found by looking among the c pointer records of this page, etc. That is, the top-k scores at time t will be accessed in $O(\log_B n + k/B)$ page I/Os. The justification is that, after the access to array A, each leaf page (except possibly the last one) will provide O(B) of the top-k scores: since we are using the MVBT splitting policies within the B-c space of each leaf page and c is a constant, each page is guaranteed to provide at least l = O(B) scores that were valid at the query time t.

4 Top-k Time Interval Queries

Until now we focused on top-k time-point (TKTP) querying, and analyzed different index structures for solving it. We proceed with the time-interval top-k query.
The main difference is that in TKTP each document has at most one valid version at the given time point t, while for interval querying each document may have multiple versions valid during the given time interval [lb, rb]. As a result, there are different variations, depending on how the top-k is defined (which of the valid scores per document participate in the top-k computation). Here, we summarize the different definitions of top-k time-interval queries and discuss how to process them efficiently within the proposed index structures.

4.1 Classic Top-k Time-Interval Query

This query definition is a straightforward extension of the top-k time-point query. For a Top-K Time-Interval keyword query TKTI = (q, lb, rb, k) over collection D, we require the answer R to be a set of k document versions satisfying:

$\{ d_i^j \in R \mid (\forall v \in q: v \in d_i^j) \wedge (d_i^j \in D^{[lb,rb]}) \wedge (\forall d' \in (D^{[lb,rb]} - R): s(d_i^j) \ge s(d')) \}$, where $D^{[lb,rb]} = \{ d_i^j \in D \mid [lb, rb] \cap life(d_i^j) \neq \emptyset \}$.

This definition only changes the time constraint from a time point t to a time range [lb, rb]. The returned top-k answers are different versions, which may be from the same document; that is, we consider each document version as an independent object. Processing a TKTI query is similar to processing a TKTP query. For some of the described index methods, multiple sub-lists have to be accessed instead of one.

For example, in time interval based slicing and stencil based partitioning, all the sub-lists (or stencils) overlapping with the query time-interval should be checked in order to find the correct top-k results. The multiple parallel sub-lists can be accessed in a round-robin fashion, which is compatible with top-k algorithms.

4.2 Document Aggregated Top-k Time-Interval Query

Another possibility is to treat each document as one object, that is, a document appears at most once in the result. There are various approaches to aggregating the relevance scores of the document versions that existed at any point in the temporal constraint [lb, rb], so as to obtain a document relevance score $drs(d_i, lb, rb)$. Three aggregation relevance models are mentioned in [10]:

MIN. This model judges the relevance of a document based on the minimum score. It is formally defined as: $drs(d_i, lb, rb) = \min\{ s(d_i^j) \mid [lb, rb] \cap life(d_i^j) \neq \emptyset \}$. The MIN scores of our five-document example for interval $[t_0, t_8)$ are $d_2$ = 0.7, $d_3$ = 0.5, $d_4$ = 0.25, $d_1$ = 0, $d_5$ = 0.

MAX. In contrast, this model takes the maximum score as an indicator. It is formally defined as: $drs(d_i, lb, rb) = \max\{ s(d_i^j) \mid [lb, rb] \cap life(d_i^j) \neq \emptyset \}$. The MAX scores of our five-document example for interval $[t_0, t_8)$ are $d_4$ = 1, $d_2$ = 0.95, $d_5$ = 0.75, $d_1$ = 0.6, $d_3$ = 0.5.

TAVG. Finally, the TAVG model assigns the score to each document using a temporal average among all its valid versions. Since the score $s(d_i^j)$ is piecewise-constant in time, $drs(d_i, lb, rb)$ can be efficiently computed as a weighted summation of these segments. The TAVG scores of our five-document example for interval $[t_0, t_8)$ are $d_2$ = 0.89, $d_4$ = 0.51, $d_3$ = 0.5, $d_5$ = 0.47, $d_1$ = 0.45.

After the aggregation mechanism has been defined, one can consider the Aggregated Top-K Time-Interval keyword query $TKTI^A$ = (q, lb, rb, k) over collection D, which finds the top k documents with aggregated scores over all their valid document versions.
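The three aggregation models can be checked against the example values with a short sketch. Two assumptions are made and should be read as such: the lifespans of d1, d3 and d5 are chosen for illustration, and time instants inside [lb, rb) where a document has no valid version are treated as score 0, which is the reading under which MIN gives 0 for d1 and d5 as reported above. The function name `drs` mirrors the notation in the text; everything else is hypothetical.

```python
from collections import namedtuple

Posting = namedtuple("Posting", "doc score ts te")

# Running example, t0..t8 mapped to 0..8.
# ASSUMPTION: the lifespans of d1, d3 and d5 are chosen for illustration.
POSTINGS = [
    Posting("d2", 0.7, 0, 1), Posting("d4", 1.0, 0, 1), Posting("d1", 0.6, 0, 6),
    Posting("d3", 0.5, 0, 8), Posting("d4", 0.7, 1, 3), Posting("d2", 0.95, 1, 4),
    Posting("d4", 0.4, 3, 6), Posting("d5", 0.75, 3, 8), Posting("d2", 0.9, 4, 8),
    Posting("d4", 0.25, 6, 8),
]

def drs(postings, doc, lb, rb, model):
    """Aggregate one document's scores over [lb, rb).
    ASSUMPTION: uncovered time inside the interval counts as score 0."""
    # Clip every overlapping version to the query interval.
    segs = [(max(p.ts, lb), min(p.te, rb), p.score)
            for p in postings if p.doc == doc and p.ts < rb and p.te > lb]
    covered = sum(hi - lo for lo, hi, _ in segs)
    scores = [s for _, _, s in segs] + ([0.0] if covered < rb - lb else [])
    if model == "MIN":
        return min(scores)
    if model == "MAX":
        return max(scores)
    if model == "TAVG":
        # Score is piecewise constant, so the average is a weighted sum of segments.
        return sum((hi - lo) * s for lo, hi, s in segs) / (rb - lb)
    raise ValueError(model)

DOCS = ["d1", "d2", "d3", "d4", "d5"]
min_scores = {d: drs(POSTINGS, d, 0, 8, "MIN") for d in DOCS}
max_scores = {d: drs(POSTINGS, d, 0, 8, "MAX") for d in DOCS}
tavg_scores = {d: round(drs(POSTINGS, d, 0, 8, "TAVG"), 2) for d in DOCS}
```

As a worked instance of TAVG, d2 has scores 0.7, 0.95 and 0.9 over 1, 3 and 4 time units, so its average over the 8 units is (0.7 + 2.85 + 3.6) / 8, which rounds to 0.89.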
To process the aggregated top-k time-interval query, we need to extend the traditional top-k algorithms (such as TA and NRA) by recording bookkeeping information and computing the scores and thresholds of candidates at the document level. The relevance score of a document in the query's temporal context depends on the scores of its versions that are valid during this period.

4.3 Consistent Top-k Time Range Query

The consistent top-k search finds a set of documents that are consistently in the top-k results of a query throughout a given time interval. The result of this query has size 0 to k; queries can have empty results if k is small or the rankings change drastically. A relaxed consistent top-k query utilizes a relax factor r, 0 < r <= 1, and seeks documents that are in the top-k for at least $r \cdot (rb - lb)$ time in the [lb, rb] interval. For a Consistent Top-K Time-Interval keyword query $TKTI^C$ = (q, lb, rb, k) over collection D, the documents in the answer R are in the top-k for at least $r \cdot (rb - lb)$ time in the [lb, rb] interval. The consistent top-3 query of our five-document example for time-interval $[t_0, t_8)$ has only one result, $d_2$, if r = 1, and three results, $d_1$, $d_2$ and $d_5$, if r = 0.6.
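The relaxed consistent top-k definition can be illustrated for a single term with a brute-force sweep over the interval. This is a sketch of the definition only, not one of the algorithms of [16]; the name `consistent_top_k` is hypothetical, and the lifespans of d1, d3 and d5 are assumed for the example.

```python
from collections import namedtuple

Posting = namedtuple("Posting", "doc score ts te")

# Running example, t0..t8 mapped to 0..8.
# ASSUMPTION: the lifespans of d1, d3 and d5 are chosen for illustration.
POSTINGS = [
    Posting("d2", 0.7, 0, 1), Posting("d4", 1.0, 0, 1), Posting("d1", 0.6, 0, 6),
    Posting("d3", 0.5, 0, 8), Posting("d4", 0.7, 1, 3), Posting("d2", 0.95, 1, 4),
    Posting("d4", 0.4, 3, 6), Posting("d5", 0.75, 3, 8), Posting("d2", 0.9, 4, 8),
    Posting("d4", 0.25, 6, 8),
]

def consistent_top_k(postings, lb, rb, k, r=1.0):
    """Documents that are in the top-k for at least r * (rb - lb) time in [lb, rb)."""
    # The ranking can only change at posting boundaries inside the interval.
    bounds = sorted({lb, rb} | {x for p in postings
                                for x in (p.ts, p.te) if lb < x < rb})
    time_in_top = {}
    for lo, hi in zip(bounds, bounds[1:]):
        alive = sorted((p for p in postings if p.ts <= lo < p.te),
                       key=lambda p: -p.score)
        for p in alive[:k]:
            time_in_top[p.doc] = time_in_top.get(p.doc, 0) + (hi - lo)
    need = r * (rb - lb)
    # A small tolerance guards against floating-point rounding in `need`.
    return sorted(d for d, t in time_in_top.items() if t >= need - 1e-9)

strict = consistent_top_k(POSTINGS, 0, 8, 3, r=1.0)
relaxed = consistent_top_k(POSTINGS, 0, 8, 3, r=0.6)
```

With the assumed lifespans, d2 is in the top-3 for all 8 time units, d1 for 6 and d5 for 5, while d4 and d3 stay below the 4.8 units required by r = 0.6.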

In [16] several algorithms were introduced to answer the consistent top-k query; the most efficient ones are based on the assumption that there is a list containing all versions satisfying the keyword and time interval constraints, and that the list is ordered by score. This assumption coincides with the purpose of our proposed index structures; thus we can access the qualified entries and execute the consistent top-k time-interval query using the approaches proposed in [16].

5 Experimental Evaluations

Dataset Description and Methods Implemented: We used news-like articles as our primary versioned document collection. We collected US and world-wide English newspaper websites and treated each URL as a single document. Their historical homepage versions were then retrieved by crawling the Internet Archive [2] from until . We created two different datasets with daily unit time granularity. The US based news had many frequent updates. The size of its raw data is about 0.2 TB, with 12,649 documents and 1,542,893 versions; thus on average there are 122 versions per document in the US dataset. For the world-wide news websites, the size of raw data is about 50 GB, with 5,046 documents and 275,981 versions, so on average there are 55 versions per document.

Previous related works create query workloads by extracting frequent queries from the AOL query logs. In addition to this traditional query workload, we use popular keywords (such as twitter, iphone, lady gaga, etc.) from the Google Zeitgeist [4] annual reports from 2001 until . Overall, we formed 200 queries with 265 terms for both classic and popular keywords.

We organize the data into term inverted list(s) using the previous and novel approaches. In the basic method with score-ordering (referred to as Basic-s) we create one inverted list per term. The second method is elementary time-interval slicing with a merging of adjacent identical sub-lists (Ele). For the Fix approach we used a time-window length of 30 days (Fix-30).
The stencl based parttonng was mplemented wth 3 levels and b = 4 (Stencl). Temporal shardng s referred as Shard, whle the sngle poston rankng model appears as SPR. Two group rankng methods were mplemented wth group szes of 25 and 50 (GR-25 and GR-50). For comparson purposes we also ncluded the MVBT ndex (wth the approprate hashng secondary ndex).the multverson lst approach (MLst) uses a factor a = c / B to present the rato of the number of ponter records to the number of all records n a page. Comparson Results: Frst, the space usage for all mplemented methods on both the US-news and World-news datasets s presented n Table 1. The page sze s 4 Kbytes whle B = 100 records. The table presents the space consumed (n GB) to mplement the ndex methods for the 256 terms used n our experments. Clearly, the elementary tme-nterval slcng has a huge space overhead whle the Stencl and Shard methods present substantal space savngs. As expected, the Basc-s approach has the mnmal space requrements. Fx-30 uses more space snce a record may appear n more parttons whle n Stencl and Shard, each record appears once. The addtonal space that Stencl and Shard use wrt Basc-s s due to the addtonal structures they utlze. Among the rank-based parttonng methods, SPR uses more space than the GR approaches; ths s because the SPR approach has to mantan one ndex per ranked poston. GR-25 uses more space than GR-50 snce t uses more ndex structures

13 372 W. Huo and V.J. Tsotras (one per group). For the MLst method, we show the results of a = 7% and a = 10% (referred to MLst-7 and MLst-10). The MLst approaches also use lnear space (but due to the copyng of records at page splts, the space s more than the Stencl and Shard approaches). MLst uses slghtly more space than the MVBT because of the use of sblng ponters and the splts they create. Table 1. The space usage (n GB) for the 256 terms used n the queres Methods Basc-s Ele Fx-30 Stencl Shard SPR GR-25 GR-50 MVBT MLst-7 MLst-10 US World The top-k temporal queres nclude both tme-pont (n our dataset ths corresponds to one day) and tme-nterval queres. For each temporal keyword query, we randomly choose 50 tme constrants from the 15-year lfespan from 1997 to 2011, and record the average performance. For TKTI A, we use TAVG scorng; for TKTI C, we use r = 1. The page I/O costs for top-20 queres usng the US-news dataset are shown n Table 2 (the best performance for each query s shown n bold). For tme nterval queres, the tme-nterval lengths used were 15 days, 30 days, and 60 days. We also present the I/O costs for top-100 queres on both US-news and World-news datasets n Table 3 for both tme-pont query and 30-day tme-nterval queres. Table 2. The page I/O cost of top-20 temporal keyword queres for US news Methods Basc-s Ele Fx-30 Stencl Shard SPR GR-25 GR-50 MVBT MLst-7 MLst-10 TKTP TKTI TKTI TKTI TKTIA TKTIA TKTIA TKTIC TKTIC TKTIC Table 3. The page I/O cost of top-100 temporal keyword queres for US and World news US Basc-s Ele Fx-30 Stencl Shard SPR GR-25 GR-50 MVBT MLst-7 MLst-10 TKTP TKTI TKTIA TKTIC World Basc-s Ele Fx-30 Stencl Shard SPR GR-25 GR-50 MVBT MLst-7 MLst-10 TKTP TKTI TKTIA TKTIC The elementary tme-nterval slcng has the best snapshot queryng performance for both top-20 and top-100 queres. Ths s to be expected snce the answer s bascally prepared for each tme nstant (at the cost of huge storage requrements). 
Among the other methods, the newly proposed approaches (GR, MList) outperform the previous methods (Stencil and Shard). The best performance is provided by the MList-10 method. It has better performance than the MVBT given that it accesses the answer faster (by avoiding the MVBT index traversal). Considering its low space requirements, this approach provides the overall best performance for TKTP queries.

For time-interval queries, the Ele method's performance degrades drastically, especially for longer time intervals. The group ranking method's performance is related to its group size g as it relates to k. For top-20 querying, a group size of 25 works better than a group size of 50 (the answer can be found by accessing the first group only), while for top-100 querying, GR-50 is a better choice (only two groups need to be accessed instead of four for GR-25, thus fewer index accesses). For top-20 interval queries, MList-10 had consistently the best performance for each query.

Fig. 7. The multiversion list method for different ratio a using the US and World news datasets (panels: US/TKTP, US/TKTI, World/TKTP, World/TKTI; x-axis: a, from 4% to 16%; y-axis: page I/O)

Interestingly, for the top-100 interval queries MList-7 shows better performance for the World-news dataset. The reason is that this dataset has fewer updates. As a result, there will be fewer pointer changes in the ordered list, thus a smaller a will provide enough space to hold the pointer structure. This can also be seen in the space requirements for this dataset: the fewer pointer splits mean that MList-7 uses less space than MList-10 (and thus the lists are shorter and the query performance better).

The above observation implies that the performance of the multiversion list method is related to the value of a. There are two opposing factors affecting the query performance with respect to a. For a given page size, a small a implies that the area allocated to sibling pointers is small; thus a few sibling page changes can cause the page to split. More splits use more space and the query time increases. On the other hand, a large a implies that the space allocated for the regular records in a page is small, thus the page can split faster due to the record updates. This also increases space and query time. The optimized value of a depends on the dataset characteristics. Figure 7 depicts the page I/O for the top-100 results returned by point (TKTP) and interval (TKTI-30) queries for the US and World-news datasets. For the US-news dataset, a = 10% has the best average performance for both time-point querying (TKTP) and 30-day time-interval querying (TKTI-30), while for the World-news dataset (which has a lower update frequency), the performance is optimized for a = 7%.

6 Conclusion

We presented an experimental comparison of indexing methods over versioned text collections for top-k temporal keyword queries. In addition to previous methods, we proposed novel solutions that partition the data along the score-time axes. Among all methods, the multiversion list provided the most robust performance considering space usage and query time efficiency for both time-point and time-interval queries. We examined variations of the time-interval queries, including the document-level aggregated top-k queries and consistent top-k queries. The performance of the multiversion list is affected by the value of a, the percentage of a data page allocated to hold sibling pointers. As future work, we plan to devise a model that can optimize the value of a based on the frequency of updates, the size of the page and other factors.

Acknowledgements. This work is partially supported by NSF grant IIS

References

[1] Wikipedia,
[2] Internet Archive,
[3] European Archive,
[4] Google Zeitgeist,
[5] Anand, A., Bedathur, S., Berberich, K., Schenkel, R.: Efficient Temporal Keyword Queries over Versioned Text. In: CIKM (2010)
[6] Anand, A., Bedathur, S., Berberich, K., Schenkel, R.: Temporal Index Sharding for Space-Time Efficiency in Archive Search. In: SIGIR (2011)
[7] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley (1999)
[8] Becker, B., Gschwind, S., Ohler, T., Seeger, B., Widmayer, P.: An asymptotically optimal multiversion B-tree. VLDB Journal (1996)
[9] Berberich, K., Bedathur, S., Neumann, T., Weikum, G.: A Time Machine for Text Search. In: SIGIR (2007)
[10] Berberich, K., Bedathur, S., Weikum, G.: Efficient Time-Travel on Versioned Text Collections. In: BTW (2007)
[11] Fagin, R., Lotem, A., Naor, M.: Optimal Aggregation Algorithms for Middleware. J. Comput. Syst. Sci. 66(4), (2003)
[12] He, J., Suel, T.: Faster Temporal Range Queries over Versioned Text. In: SIGIR (2011)
[13] Ponte, J.M., Croft, W.B.: A Language Modeling Approach to Information Retrieval. In: SIGIR (1998)
[14] Robertson, S.E., Walker, S.: Okapi/Keenbow at TREC-8. In: TREC (1999)
[15] Tsotras, V.J., Kangelaris, N.: The Snapshot Index: an I/O Optimal Access Method for Snapshot Queries. Information Systems 20(3), (1995)
[16] U, L.H., Mamoulis, N., Berberich, K., Bedathur, S.: Durable Top-k Search in Document Archives. In: SIGMOD (2010)
