Partition-based document identifier assignment (PBDIA) algorithm. (long queries)



Compressing an inverted file can greatly improve query performance of an information retrieval system (IRS) by reducing disk I/Os. We observe that a good document identifier assignment (DIA) can make the document identifiers in the posting lists more clustered, and result in better compression as well as shorter query processing time. In this paper, we tackle the NP-complete problem of finding an optimal DIA to minimize the average query processing time in an IRS when the probability distribution of query terms is given. We indicate that the greedy nearest neighbor (Greedy-NN) algorithm can provide excellent performance for this problem. However, the Greedy-NN algorithm is inappropriate if used in large-scale IRSes, due to its high complexity $O(N^2 \times n)$, where N denotes the number of documents and n denotes the number of distinct terms. In real-world IRSes, the distribution of query terms is skewed. Based on this fact, we propose a fast $O(N \times n)$ heuristic, called partition-based document identifier assignment (PBDIA) algorithm, which can efficiently assign consecutive document identifiers to those documents containing frequently used query terms, and improve compression efficiency of the posting lists for those terms. This can result in reduced query processing time. The experimental results show that the PBDIA algorithm can yield a competitive performance versus the Greedy-NN for the DIA problem, and that this optimization problem has significant advantages for both long queries and parallel information retrieval (IR).

Keywords: inverted index, inverted file compression, query evaluation, document identifier assignment, d-gap technique

(Accepted by Information Processing & Management)

1. Introduction

Information retrieval systems (IRSes) that are widely used in many applications, such as search engines, digital libraries, genomic sequence analyses, etc. (Kobayashi & Takeda 2000; Williams & Zobel 2002), are overwhelmed by the explosion of data. To efficiently search vast amounts of data, an inverted file is used to evaluate queries for modern large-scale IRSes due to its quick response time, high compression efficiency, scalability, and support for various search techniques (Witten et al. 1999; Zobel et al. 1998). An inverted file contains, for each distinct term in the collection, a list (called a posting list or synonymously an inverted list) of the identifiers of the documents containing that term. A query consists of keyword terms. To retrieve information, the query evaluation engine reads and decompresses the posting lists for the terms involved in the query, and then merges (intersection, union, or difference) corresponding posting lists to obtain a candidate set of relevant documents.

Compressing an inverted file can greatly increase query throughput (Zobel & Moffat 1995; Williams & Zobel 1999). This is because the total time of transferring a compressed posting list and subsequently decompressing it is potentially much less than that of transferring an uncompressed posting list. The document identifiers in a posting list are usually stored in ascending order. By using the popular d-gap compression approach (Witten et al. 1999; Moffat & Zobel 1992), efficient compression of an inverted file can be achieved. In addition, we observe that the d-gap compression approach can result in good compression if the document identifiers in the posting lists are clustered.

The query processing time in a large-scale IRS is dominated by the time needed to read and decompress the posting lists for the terms involved in the query (Moffat & Zobel 1996), and we observe that the query processing time grows with the total encoded size of the corresponding posting lists. This is because the disk transfer rate is nearly constant, and the decoding processes of most encoding methods used in the d-gap compression approach operate on a bit-by-bit basis. If we can reduce the total encoded size of the corresponding posting lists without increasing decompression times, a shorter query processing time can be obtained.

A document identifier assignment (DIA) can make the document identifiers in the posting lists evenly distributed, or clustered. Clustered document identifiers generally improve the compression efficiency of the d-gap compression approach without increasing the complexity of the decoding process, and hence reduce the query processing time. In this paper, we consider the problem of finding an optimal DIA to minimize the average query processing time in an IRS when the probability distribution of query terms is given. The DIA problem, which is known to be NP-complete via a reduction to the rectilinear traveling salesman problem (TSP), is a generalization of the problems solved by Olken & Rotem (1986), Shieh et al. (2003), and Gelbukh et al. (2003). Their research results showed that this kind of optimization problem can be effectively solved by the well-known TSP heuristic algorithms. The greedy nearest neighbor (Greedy-NN) algorithm performs the best on average, but its high complexity discourages its use in modern large-scale IRSes.

In this paper, we propose a fast heuristic, called partition-based document identifier assignment (PBDIA) algorithm, to find a good DIA that can make the document identifiers in the posting lists for frequently used query terms more clustered. This can greatly improve the compression efficiency of the posting lists for frequently used query terms. Where the probability distribution of query terms is skewed, as is the typical case in a real-world IRS, the experimental results show that the PBDIA algorithm can yield a competitive performance versus the Greedy-NN for the DIA problem. The experimental results also show that the DIA problem has significant advantages for both long queries and parallel information retrieval (IR).

The remainder of this paper is organized as follows. Section 2 describes the inverted index and explains why a DIA can affect the storage space required and change query performance. Section 3 derives a cost model for the DIA problem, and presents how to use the well-known TSP heuristic algorithms to solve this optimization problem. In Section 4, we propose a fast PBDIA algorithm. We show the experimental results in Section 5. Finally, Section 6 presents our conclusion.

2. General Framework

An inverted index consists of an index file and an inverted file. An index file is a set of records, each containing a keyword term t and a pointer to the posting list for term t. An inverted file contains, for each distinct term t in the collection, a posting list of the form $IL_t = \langle id_1, id_2, \ldots, id_{f_t} \rangle$, where $id_i$ is the identifier of a document that contains t, and the frequency $f_t$ is the number of documents in which t appears. The document identifiers are within the range 1...N, where N is the number of documents in the indexed collection. In a large document collection, posting lists are usually compressed, and decompression of posting lists is hence required during query processing.
Zipf (1949) observed that the set of frequently used terms is small. According to Zipf's law, 95% of words in all documents fall in a vocabulary with no more than 8000 distinct terms. This suggests that it is advisable to store the index records of frequently used terms in RAM to greatly reduce index search time. Hence, the significant portion of query processing time is spent reading and decompressing the compressed posting list for each query term. This paper restricts attention to the inverted file side only and investigates the DIA problem to improve the efficiency of an inverted file and the overall IR performance.

The d-gap compression approach (Witten et al. 1999; Moffat & Zobel 1992), the most popular approach for inverted file compression, consists of two steps. It first sorts the document identifiers of each posting list in increasing order, and then replaces each document identifier (except the first one) with the distance between itself and its predecessor. For example, the posting list <3, 8, 12, 15, 32> can be represented in d-gaps as <3, 5, 4, 3, 17>. The second step is to encode (compress) these d-gaps using an appropriate coding method. Many coding methods, such as γ coding (Elias 1975), Golomb coding (Golomb 1966; Witten et al. 1999), skewed Golomb coding (Teuhola 1978), and batched LLRUN coding (Fraenkel & Klein 1985), have been proposed to compress posting lists through estimates of d-gap probability distributions. The more accurate the estimate, the greater the compression that can be achieved.

(a) Example documents
    d1: term 1, term 2
    d2: term 2
    d3: term 2, term 4
    d4: term 1, term 2, term 3, term 4
    d5: term 1, term 4
    d6: term 1, term 2, term 3

(b) DIA I result
    DIA I: {d1→1, d2→2, d3→3, d4→4, d5→5, d6→6}
    Posting list of term 1: <1, 4, 5, 6>       d-gap list of term 1: <1, 3, 1, 1>
    Posting list of term 2: <1, 2, 3, 4, 6>    d-gap list of term 2: <1, 1, 1, 1, 2>
    Posting list of term 3: <4, 6>             d-gap list of term 3: <4, 2>
    Posting list of term 4: <3, 4, 5>          d-gap list of term 4: <3, 1, 1>
    Total bits required to encode d-gaps with γ code = 26 bits

(c) DIA II result
    DIA II: {d1→3, d2→5, d3→4, d4→1, d5→6, d6→2}
    Posting list of term 1: <1, 2, 3, 6>       d-gap list of term 1: <1, 1, 1, 3>
    Posting list of term 2: <1, 2, 3, 4, 5>    d-gap list of term 2: <1, 1, 1, 1, 1>
    Posting list of term 3: <1, 2>             d-gap list of term 3: <1, 1>
    Posting list of term 4: <1, 4, 6>          d-gap list of term 4: <1, 3, 2>
    Total bits required to encode d-gaps with γ code = 20 bits

Figure 1. An example to show different DIAs result in different compression results

[Table 1. Some example codes for γ coding. Columns: d-gap value x; γ code.]
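The two steps of the d-gap approach can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments; the function names `d_gaps` and `gamma_encode` are hypothetical, and the unary prefix convention (a run of 1s closed by a 0) is an assumption, since other presentations of γ coding invert the bits. The code length, $2\lfloor \log_2 x \rfloor + 1$ bits, does not depend on that choice.

```python
def d_gaps(posting_list):
    """Step 1: sort the identifiers and replace each one (except the first)
    by its distance to the predecessor."""
    ids = sorted(posting_list)
    return [cur - prev for prev, cur in zip([0] + ids[:-1], ids)]


def gamma_encode(x):
    """Step 2 (one possible coder): Elias gamma code of a positive integer,
    returned as a string of '0'/'1' characters."""
    if x < 1:
        raise ValueError("gamma coding is defined for positive integers only")
    n = x.bit_length() - 1                 # floor(log2 x)
    prefix = "1" * n + "0"                 # unary part (assumed convention)
    suffix = format(x - (1 << n), "b").zfill(n) if n else ""
    return prefix + suffix                 # 2n + 1 bits in total


gaps = d_gaps([3, 8, 12, 15, 32])          # gaps -> [3, 5, 4, 3, 17]
codes = [gamma_encode(g) for g in gaps]
total_bits = sum(len(c) for c in codes)    # total_bits -> 25
print(gaps, codes, total_bits)
```

Small gaps receive short codes (a gap of 1 costs a single bit, while a gap of 17 costs nine), which is why clustering identifiers pays off.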

One common characteristic of coding methods used in the d-gap compression approach is that small d-gap values can be coded more economically than large ones. If we can shrink the d-gap values, the compression ratio and query performance can be improved. Consider a document collection of 6 documents shown in Figure 1(a). Each document contains one or more terms. For example, the document d1 contains term 1 and term 2, document d2 contains term 2, etc. In Figures 1(b) and 1(c), the notation $d_i \to j$ in DIAs I and II denotes that the document identifier j is assigned to the document $d_i$. According to the documents in Figure 1(a) and the DIAs I and II, the obtained posting lists and d-gap lists are shown in Figures 1(b) and 1(c). For DIA I, the d-gap values have nine 1s, two 2s, two 3s and one 4; whereas for DIA II, the d-gap values have eleven 1s, one 2 and two 3s. With γ coding in Table 1, the compressed inverted file requires 26 bits for DIA I, whereas it requires 20 bits for DIA II. If every term is queried with equal probability, the query processing costs for DIA II will be much lower than those of DIA I. This is because DIA II results in better compression for the given coding method without increasing the complexity of the decoding process, and hence improves query throughput by reducing both the retrieval and decompression times of posting lists.

This example shows that different DIAs can result in different compression results and different query throughputs for a given coding method. In the next section, we will introduce a query cost function for the DIA problem, and then derive a method to find a good DIA to shorten average query processing time when the probability distribution of query terms is given.

3. Document identifier assignment problem and its algorithm

The DIA problem is the problem of assigning document identifiers to a set of documents in an inverted file-based IRS in order to minimize the average query processing time when the probability distribution of query terms is given. In this section, we first formalize the problem, and then show how to use the well-known greedy nearest neighbor (Greedy-NN) algorithm to solve this problem.

3.1 Problem mathematical formulation

Let $D = \{d_1, d_2, \ldots, d_N\}$ be a collection of N documents to be indexed, and $\pi : \{d_1, d_2, \ldots, d_N\} \to \{1, 2, \ldots, N\}$ be a DIA that assigns a unique identifier within the range 1...N to each document in D. Let $f_t$ be the total number of documents in which term t appears and $d_{(1)}, d_{(2)}, \ldots, d_{(f_t)}$ be the documents containing term t; then the posting list of the term t can be represented as $IL_t = \langle \pi(d_{(1)}), \pi(d_{(2)}), \ldots, \pi(d_{(f_t)}) \rangle$. Without loss of generality, we assume that $\pi(d_{(1)}) < \pi(d_{(2)}) < \cdots < \pi(d_{(f_t)})$. Assume a coding method C which requires C(x) bits to encode a d-gap x. The size of a posting list $IL_t$ for term t can then be expressed as

$$\sum_{i=1}^{f_t} C\big(\pi(d_{(i)}) - \pi(d_{(i-1)})\big) \tag{1}$$

where we let $d_{(0)} = 0$ and $\pi(d_{(0)}) = 0$ to simplify the expression of Eq. (1).

Assume that the probability of a term t appearing in a query is $p_t$. Let $X_t$ be a random Boolean variable representing whether term t appears in a query: $X_t = 1$ if term t appears in a query and $X_t = 0$ otherwise. The query processing time $Time_{QP}$ of posting list processing includes (1) the retrieval time $Time_R$ of posting list $IL_t$ for each query term t, (2) the decompression time $Time_D$ of posting list $IL_t$ for each query term t, and (3) the document identifier comparison time $Time_{Comp}$. Since the document identifier comparison time is relatively small (about 10% of query processing time) and does not change with different DIAs, the query processing time in this paper is defined only as

$$Time_{QP} = \sum_{t} X_t \big( Time_R(IL_t) + Time_D(IL_t) \big) \tag{2}$$

where the sum runs over all distinct terms t. The average query processing time $AvgTime_{QP}$ is the expected value of $Time_{QP}$. That is,

$$AvgTime_{QP} = \sum_{t} p_t \big( Time_R(IL_t) + Time_D(IL_t) \big) \tag{3}$$

Since the disk transfer rate is nearly constant and the decoding processes of most coding methods used in the d-gap compression approach operate on a bit-by-bit basis, the retrieval and decompression times of a posting list $IL_t$ for the term t appearing in a query grow with the size of the posting list $IL_t$. So

$$Time_R(IL_t) + Time_D(IL_t) = \text{constant} \times \sum_{i=1}^{f_t} C\big(\pi(d_{(i)}) - \pi(d_{(i-1)})\big) \tag{4}$$

Substituting Eq. (4) into Eq. (3), we obtain

$$AvgTime_{QP} = \text{constant} \times \sum_{t} p_t \sum_{i=1}^{f_t} C\big(\pi(d_{(i)}) - \pi(d_{(i-1)})\big) \tag{5}$$

We thus define the objective function Cost(π) to reflect the average query processing time $AvgTime_{QP}$:

$$Cost(\pi) = \sum_{t} p_t \sum_{i=1}^{f_t} C\big(\pi(d_{(i)}) - \pi(d_{(i-1)})\big) \tag{6}$$

The objective of this research is to find a DIA $\pi : D \to \{1, 2, 3, \ldots, N\}$ such that Cost(π) is minimal. This optimization problem is called the DIA problem, and it is reduced to the simple DIA (SDIA) problem if the value of $p_t$ for each term t is set to 1. The SDIA problem is the problem of finding a DIA to minimize the size of the inverted file, and it is known to be NP-complete via a reduction to the rectilinear traveling salesman problem (Olken & Rotem 1986). Since the DIA problem is a generalization of the SDIA problem, the DIA problem is also an NP-complete problem.

3.2 Solving DIA problem via the well-known Greedy-NN algorithm

Shieh et al. (2003) showed that the SDIA problem can be solved by using TSP heuristic algorithms. Given a collection of N documents, a document similarity graph (DSG) can be constructed.
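As a concrete check of the objective function in Eq. (6), the following Python sketch evaluates Cost(π) for the six example documents of Figure 1 under DIA I and DIA II. It is only an illustration: the names `cost`, `gamma_bits`, `DOCS`, `DIA_I`, `DIA_II`, and `UNIFORM` are hypothetical, γ coding is assumed as the coding method C, and every $p_t$ is set to 1, which corresponds to the SDIA special case.

```python
def gamma_bits(x):
    """Number of bits C(x) that gamma coding needs for a d-gap x >= 1."""
    return 2 * (x.bit_length() - 1) + 1


def cost(pi, doc_terms, prob, code_bits=gamma_bits):
    """Eq. (6): sum over terms t of p_t times the encoded size of IL_t."""
    postings = {}
    for doc, terms in doc_terms.items():
        for t in terms:
            postings.setdefault(t, []).append(pi[doc])
    total = 0.0
    for t, ids in postings.items():
        ids.sort()
        # d-gaps use pi(d_(0)) = 0, as in Eq. (1)
        bits = sum(code_bits(cur - prev) for prev, cur in zip([0] + ids[:-1], ids))
        total += prob.get(t, 0.0) * bits
    return total


# Example documents of Figure 1(a)
DOCS = {
    "d1": {"t1", "t2"},
    "d2": {"t2"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t2", "t3", "t4"},
    "d5": {"t1", "t4"},
    "d6": {"t1", "t2", "t3"},
}
DIA_I = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5, "d6": 6}
DIA_II = {"d1": 3, "d2": 5, "d3": 4, "d4": 1, "d5": 6, "d6": 2}
UNIFORM = {t: 1.0 for t in ("t1", "t2", "t3", "t4")}   # SDIA case: p_t = 1

print(cost(DIA_I, DOCS, UNIFORM), cost(DIA_II, DOCS, UNIFORM))  # prints 26.0 20.0
```

With unit probabilities the cost equals the total size of the compressed inverted file, so the two values match the bit totals reported in Figure 1. Passing a skewed `prob` dictionary instead weights each posting list by how often its term is queried.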
In a DSG, each vertex represents a document, and the weight on an edge between two vertices represents the similarity of the two corresponding documents. The similarity $Sim(d_i, d_j)$ between two documents $d_i$ and $d_j$ is defined as:

$$Sim(d_i, d_j) = \sum_{t \in (T(d_i) \cap T(d_j))} 1 \tag{7}$$

where $T(d_i)$ and $T(d_j)$ denote the sets of terms appearing in $d_i$ and $d_j$, respectively, and ∩ denotes the intersection operator. Hence, the similarity between two documents is the number of common terms appearing in both documents. The DSG for the example documents in Figure 1(a) is shown in Figure 2. A TSP heuristic algorithm can then be used to find a path of the DSG visiting each vertex exactly once with maximal sum of similarities. If we follow the visiting order of vertices on the path to assign document identifiers, the sum of d-gap values for an inverted file can be decreased, and the size of the inverted file compressed via the d-gap compression approach can be reduced. Shieh et al. (2003) showed that the Greedy-NN algorithm (Figure 3) can provide excellent performance for the SDIA problem.

[Figure 2. DSG for the example documents in Figure 1(a).]

Algorithm Greedy_nearest_neighbor
Input: D = {d_1, d_2, ..., d_N}: a collection of N documents to be indexed.
Output: A TSP path: the visiting order of vertices is {v_1, v_2, ..., v_N}.
Method:
1. Construct the DSG(V, E), where V is a set of vertices (in which each vertex represents a document) and E is a set of edges (in which each edge has a similarity value associated with it);
2. Pick a vertex v ∈ V as v_1 such that the sum of similarity values associated with the adjacent edges of v is maximal;
3. V := V − {v_1}; i := 1;
4. Find v in V such that the similarity value of the edge (v, v_i) is maximal: if more than one such vertex exists, select one randomly;
5. i := i + 1; v_i := v; V := V − {v_i};
6. If i < N then go to step 4;
7. Output a TSP path with its visiting order of vertices being {v_1, v_2, ..., v_N}.

Figure 3. The Greedy-NN algorithm for the SDIA problem

Using the same approach, the DIA problem can be solved using the Greedy-NN algorithm described in Figure 3, if the similarity $Sim(d_i, d_j)$ between two documents $d_i$ and $d_j$ in a DSG is redefined as:

$$Sim(d_i, d_j) = \sum_{t \in (T(d_i) \cap T(d_j))} p_t \tag{8}$$

where the probability of a term t appearing in a query is known to be $p_t$.

Although the Greedy-NN algorithm is very simple to implement, it is not very applicable to large-scale IRSes due to its high complexity. Given a collection of N documents and n distinct terms, the number of comparisons for calculating $Sim(d_i, d_j)$ given fixed i and j is O(n), hence the total number of comparisons to construct a DSG for the Greedy-NN algorithm is $O(N^2 \times n)$. An algorithm with lower complexity that still generates satisfactory results should be developed.

4. Partition-based document identifier assignment algorithm

Since the DIA problem is an NP-complete problem, an effective low-complexity method is needed. Although the Greedy-NN algorithm can be used to solve the DIA problem, its complexity is too high. In this section, we first present an optimal DIA algorithm for a single query term, and then propose an efficient partition-based document identifier assignment (PBDIA) algorithm for the DIA problem.

4.1 Generating an optimal DIA for a single query term

Consider a posting list $IL_t$ for term t with $f_t$ document identifiers in a collection of N documents. Using the d-gap technique, we can obtain $f_t$ d-gap values: d-gap_1, d-gap_2, ..., d-gap_{f_t}. Assume a coding method C which requires C(x) bits to encode a d-gap x. We want to know which d-gap probability distribution can minimize the size of posting list $IL_t$ after compression using method C. That is, we want to know which d-gap probability distribution can minimize

$$\sum_{i=1}^{f_t} C(\text{d-gap}_i) \tag{9}$$

subject to

$$\sum_{i=1}^{f_t} \text{d-gap}_i = k \tag{10}$$

and

$$1 \le \text{d-gap}_i \le k \quad \text{for all } i,\ 1 \le i \le f_t \tag{11}$$

where k is the largest document identifier in the posting list $IL_t$. It is known that C(x) is approximately proportional to $\log_2(x)$ for many popular coding methods, such as γ coding, skewed Golomb coding, and batched LLRUN coding. For these coding methods, we can use the dynamic programming technique (Bellman and Dreyfus 1962) and find that minimizing Eq. (9) should meet two requirements: (1) maximize the number of d-gap values of 1; and (2) minimize the largest document identifier, i.e., k, in the posting list $IL_t$. If a DIA for term t can satisfy the above two requirements, the best compression and the fastest query speed for the posting list $IL_t$ can be achieved.
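For reference when comparing against the partition-based approach developed in this section, the Greedy-NN procedure of Figure 3 with the weighted similarity of Eq. (8) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the names `similarity` and `greedy_nn` are hypothetical, similarities are computed on demand rather than stored in an explicit DSG, and ties are broken by input order instead of randomly so that the result is reproducible.

```python
def similarity(terms_a, terms_b, prob):
    """Eq. (8): sum of p_t over the terms shared by two documents.
    With every p_t = 1 this reduces to Eq. (7), the number of common terms."""
    return sum(prob[t] for t in terms_a & terms_b)


def greedy_nn(doc_terms, prob):
    """Return a DIA (document -> identifier) that follows a Greedy-NN path."""
    names = list(doc_terms)
    # Step 2: start at the vertex whose adjacent edges have maximal total similarity.
    start = max(
        names,
        key=lambda v: sum(
            similarity(doc_terms[v], doc_terms[u], prob) for u in names if u != v
        ),
    )
    path = [start]
    remaining = [v for v in names if v != start]
    # Steps 4 to 6: repeatedly move to the most similar unvisited vertex.
    while remaining:
        last = path[-1]
        nxt = max(remaining, key=lambda v: similarity(doc_terms[last], doc_terms[v], prob))
        path.append(nxt)
        remaining.remove(nxt)
    # Identifiers follow the visiting order of the path.
    return {doc: i for i, doc in enumerate(path, start=1)}


DOCS = {
    "d1": {"t1", "t2"},
    "d2": {"t2"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t2", "t3", "t4"},
    "d5": {"t1", "t4"},
    "d6": {"t1", "t2", "t3"},
}
UNIFORM = {t: 1.0 for t in ("t1", "t2", "t3", "t4")}
print(greedy_nn(DOCS, UNIFORM))
```

On the Figure 1 documents with unit probabilities the path starts at d4 and then visits d6 and d1, as in DIA II; the remaining order depends on how ties are resolved. Every step scans all unvisited documents and each similarity costs up to O(n) term comparisons, which is where the $O(N^2 \times n)$ bound comes from.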
According to the above observation, we propose the simple partition-based document identifier assignment (SPBDIA) algorithm to generate optimal DIAs for a given query term. The SPBDIA algorithm consists of a partitioning procedure, an ordering procedure, and a document identifier assignment procedure. The partitioning procedure divides the given documents into two partitions in terms of query term t: one partition P(t) consists of documents containing query term t; the other partition P(t') is made up of the documents without t. Then, the ordering procedure sets the order of partitions as P(t) followed by P(t'). Finally, the document identifier assignment procedure generates an appropriate DIA for the ordered partitions according to query term t: the documents in partition P(t) are assigned smaller consecutive document identifiers, while the documents in partition P(t') are assigned larger consecutive document identifiers. The SPBDIA algorithm is illustrated in the following Example 1.

Example 1. There is a collection of 500 documents, among which 300 documents contain query term t. After partitioning, P(t) has 300 documents and P(t') has 200 documents. Then, the ordering procedure sets the order of partitions as P(t) followed by P(t'). Finally, the document identifier assignment procedure assigns the document identifiers 1~300 to the 300 documents in partition P(t) and assigns the document identifiers 301~500 to the 200 documents in partition P(t').

Documents in a partition can be arbitrarily assigned identifiers within the given range, hence the number of possible DIAs for the above Example 1 is 300! × 200!. Each of the 300! × 200! DIAs satisfies the two requirements for minimizing Eq. (9), and hence gives both the best posting list compression and the fastest query speed for query term t. The SPBDIA algorithm is simple, and its complexity is O(N).

4.2 Efficient partition-based document identifier assignment algorithm for DIA problem

In a real-world IRS, a few frequently used query terms constitute a large portion of all term occurrences in queries (Jansen et al. 1998). Based on this fact, we assess that a DIA algorithm that allows those frequently used query terms to have better posting list compression can result in reduced average query processing time. Based on the SPBDIA algorithm, an efficient partition-based document identifier assignment (PBDIA) algorithm for the DIA problem can be developed. Like the SPBDIA algorithm, the PBDIA algorithm also partitions the document set, orders these partitions, and then assigns document identifiers. The partitioning and ordering procedures of the PBDIA algorithm iterate n times given that there are n query terms. Then, the document identifier assignment procedure is performed as the last step of the PBDIA algorithm. Terms that are queried more frequently should take higher priority in document partitioning and partition ordering.
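The single-term SPBDIA procedure of Section 4.1, which PBDIA generalizes, can be sketched in Python as follows. The function name `spbdia` and the synthetic collection are hypothetical; the collection only mirrors the sizes used in Example 1 (500 documents, 300 of which contain the query term).

```python
def spbdia(doc_terms, term):
    """Assign identifiers 1..N so that documents containing `term` come first."""
    with_t = [d for d in doc_terms if term in doc_terms[d]]         # partition P(t)
    without_t = [d for d in doc_terms if term not in doc_terms[d]]  # partition P(t')
    # Ordering: P(t) followed by P(t'); identifiers are then handed out consecutively.
    return {doc: i for i, doc in enumerate(with_t + without_t, start=1)}


# Synthetic collection: documents 0..499, three out of every five contain "t".
collection = {
    f"doc{i}": ({"t"} if i % 5 < 3 else {"other"}) for i in range(500)
}
pi = spbdia(collection, "t")
posting_list = sorted(pi[d] for d, terms in collection.items() if "t" in terms)
print(posting_list[0], posting_list[-1], len(posting_list))  # prints 1 300 300
```

Every d-gap in the resulting posting list equals 1 and the largest identifier is $f_t$ = 300, so both requirements for minimizing Eq. (9) hold. The single pass over the documents reflects the O(N) complexity.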
The PBDIA algorithm is given in Figure 4. A doubly linked list is used to store the partitions, and the two links of a partition maintain the ordering among these partitions. Given a collection of N documents and n distinct query terms, the number of comparisons for assigning documents to partitions in each iteration is O(N). Since the PBDIA algorithm iterates n times, the total number of comparisons for the PBDIA algorithm is $O(N \times n)$. Compared with the Greedy-NN algorithm, the complexity of the PBDIA algorithm is distinctly low. This speed has a price: although the PBDIA algorithm targets improved compression efficiency for the frequently used query terms, it unavoidably decreases that for the other query terms. In reality, the popularities of query terms are often very unbalanced, and this imbalance is what lets the PBDIA algorithm achieve very good query performance. In Section 5, we compare the search performance of the Greedy-NN and PBDIA algorithms for real-life document collections.
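The partitioning, ordering, and assignment procedures described above can be sketched in Python as follows. This is an illustrative reading of the procedure in Figure 4, not the authors' code: the name `pbdia` is hypothetical, Python lists and a `deque` stand in for the doubly linked lists, and the query probabilities in `PROBS` are invented for the example.

```python
from collections import deque


def pbdia(doc_terms, prob):
    """Return a DIA (document -> identifier) by repeated partition refinement."""
    par_list = deque([list(doc_terms)])          # ordered partitions, initially all of D
    for term in sorted(prob, key=prob.get, reverse=True):
        # Partitioning procedure: split every partition on the current term.
        pairs = []
        while par_list:
            part = par_list.popleft()
            with_t = [d for d in part if term in doc_terms[d]]
            without_t = [d for d in part if term not in doc_terms[d]]
            pairs.append((with_t, without_t))
        # Ordering procedure: rebuild the list from the tail, placing the half
        # that agrees with the current head next to it.
        while pairs:
            with_t, without_t = pairs.pop()
            if not with_t:
                par_list.appendleft(without_t)
            elif not without_t:
                par_list.appendleft(with_t)
            elif not par_list or term not in doc_terms[par_list[0][0]]:
                par_list.appendleft(without_t)   # resulting order: P(t), P(t'), head
                par_list.appendleft(with_t)
            else:
                par_list.appendleft(with_t)      # resulting order: P(t'), P(t), head
                par_list.appendleft(without_t)
    # Document identifier assignment procedure.
    ordered = [d for part in par_list for d in part]
    return {doc: i for i, doc in enumerate(ordered, start=1)}


DOCS = {
    "d1": {"t1", "t2"},
    "d2": {"t2"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t2", "t3", "t4"},
    "d5": {"t1", "t4"},
    "d6": {"t1", "t2", "t3"},
}
PROBS = {"t1": 0.4, "t2": 0.3, "t3": 0.2, "t4": 0.1}   # hypothetical query probabilities
print(pbdia(DOCS, PROBS))
```

On the Figure 1 documents with these assumed probabilities, the documents containing the most frequently queried term receive identifiers 1 to 4, and those containing the second term receive identifiers 2 to 6, so both posting lists consist of consecutive identifiers. Each iteration touches every document once, which matches the $O(N \times n)$ bound when term membership tests take constant time.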

Algorithm Partition_based_document_identifier_assignment
Input: D = {d_1, d_2, ..., d_N}: a collection of N documents to be indexed.
       T = {t_1, t_2, ..., t_n}: a set of n distinct terms appearing in D.
       Prob = {p_1, p_2, ..., p_n}: p_i denotes the probability of the term t_i ∈ T appearing in a query.
Output: A document identifier assignment π: {d_1, d_2, ..., d_N} → {1, 2, ..., N} for the DIA.
Method:
1. Create an empty doubly linked list ParList; // to store partitions
2. Create an empty doubly linked list TempList; // to store partition pairs
3. Assign all documents in D to a new partition P, and add P to the ParList;
4. Sort the terms in T in descending order according to their probabilities. Let t_rank1, t_rank2, ..., t_rankn represent the sorted list.
5. for i := 1 to n do
   5.1 while ParList is not empty do /* partitioning procedure */
       5.1.1 Get a partition P from the head of ParList, and then remove P from ParList;
       5.1.2 // At least one of the partitions P(t_ranki) and P(t_ranki') should be nonempty
             Let P(t_ranki) be the partition containing the documents that are included in P and do contain the term t_ranki;
             let P(t_ranki') be the partition containing the documents that are included in P and do not contain the term t_ranki;
       5.1.3 Add the partition pair {P(t_ranki), P(t_ranki')} to the tail of TempList;
   5.2 while TempList is not empty do /* ordering procedure */
       5.2.1 Get a partition pair {P(t_ranki), P(t_ranki')} from the tail of TempList, and then remove {P(t_ranki), P(t_ranki')} from TempList;
             if P(t_ranki) is empty then add P(t_ranki') to the front of ParList and go to step 5.2;
             if P(t_ranki') is empty then add P(t_ranki) to the front of ParList and go to step 5.2;
             if ParList is empty then
                 Add P(t_ranki') to the ParList; add P(t_ranki) to the front of ParList;
             else // ParList is not empty
                 Get a partition P from the head of ParList, and get a document d ∈ P;
                 if the document d contains the term t_ranki then
                     Add P(t_ranki) to the front of ParList; add P(t_ranki') to the front of ParList;
                 else // the document d does not contain the term t_ranki
                     Add P(t_ranki') to the front of ParList; add P(t_ranki) to the front of ParList;
6. i := 1;
7. while ParList is not empty do /* document identifier assignment procedure */
   7.1 Get a partition P from the head of ParList, and then remove P from ParList;
   7.2 while P is not empty do
       7.2.1 Get a document d ∈ P, and remove d from P;
             Assign document identifier i to the document d, and then i := i + 1;

Figure 4. The PBDIA algorithm for the DIA problem

5. Experiments

This section describes our experiments for evaluating the different DIA algorithms. Experiments were conducted on real-life document collections, and the average query processing time and the storage requirement for each DIA algorithm were measured. We also investigated the DIA problem in parallel IR.

5.1 Document collections and queries

Three document collections were used in the experiments. Their statistics are listed in Table 2. In this table, N denotes the number of documents; n is the number of distinct terms; F is the total number of terms in the collection; and f indicates the number of document identifiers that appear in an inverted file. The collections FBIS (Foreign Broadcast Information Service) and LAT (LA Times) are disk 5 of the TREC-6 collection that is used internationally as a test bed for research in IR techniques (Voorhees and Harman 1997). The collection TREC includes the FBIS and LAT.

Table 2. Statistics of document collections
Collection                            FBIS          LAT           TREC
# of documents N                      130,471       131,896       262,367
# of terms F                          72,922,893    72,087,460    145,010,353
# of distinct terms n                 24,30         68,25         37,393
# of document identifier count f      28,628,698    32,483,656    61,112,354
Total size (Mbytes)

We followed the method of Moffat & Zobel (1996) to evaluate performance with random queries. For each document collection, 300 documents were randomly selected to generate a query set. A query was generated by selecting words from the word list of a specific document. To form the word list of a document, words in the document were folded to lower case, and stop words such as "the" and "this" were eliminated. The number of terms per query ranged from 1 to 65. For each query, there existed at least one document in the document collection that is relevant to the query. We also made the generated query set for each document collection have the following characteristics: (1) query repetition frequencies followed a Zipf distribution; (2) the terms-per-query distribution followed the shifted negative binomial distribution. This made the distribution of generated queries closely resemble the distribution of real queries (Xie & O'Hallaron 2002; Wolfram 1992).

5.2 Experimental results

In Section 5.2.1, we first present the actual times taken by the Greedy-NN and the PBDIA algorithms.
In Section 5.2.2, we then present the query performance of different DIA algorithms. In Section 5.2.3, we present the compression performance of different DIA algorithms. Finally, we study the DIA problem in parallel IR in Section 5.2.4.

The inverted files of the three test collections were constructed according to the DIAs generated by different DIA algorithms. We tested four different DIA algorithms: Random, Default, Greedy-NN, and PBDIA. The Random algorithm assigns document identifiers to the documents in a collection at random. The Default algorithm assigns document identifiers to the documents in a collection in chronological order. The Greedy-NN and PBDIA algorithms were described in Section 3.2 and Section 4.2, respectively. For each DIA algorithm, we also tested five coding methods: γ coding (Elias 1975), Golomb coding (Golomb 1966; Witten et al. 1999), skewed Golomb coding (Teuhola 1978), batched LLRUN coding (Fraenkel & Klein 1985), and the unique-order interpolative coding method (Cheng et al. 2004). For the following experiments, the parameter b for each posting list in Golomb coding was calculated using Witten's approximation (Witten et al. 1999), and the parameter g for unique-order interpolative coding was set to 4 (Cheng et al. 2004).

All experiments were run on an Intel P4 2.4GHz PC with 512MB DDR memory running the Linux operating system. The hard disk was 40GB, and the data transfer rate was 25MB/sec. Intervening processes and disk activities were minimized during experimentation.

5.2.1 Time taken by Greedy-NN and PBDIA algorithms

In Table 3, the performance in terms of completion time is shown. The times reported are the actual times taken by the algorithms to generate a DIA for the given document collection that has been inverted. Please note that the times presented in Table 3 consider neither the time spent in preliminary inversion of the document collection, nor the time needed to rebuild an inverted file with a new DIA. Table 3 shows that the PBDIA algorithm is much faster than the Greedy-NN algorithm. This fact makes the PBDIA algorithm viable for use in large-scale IRSes. Such a fast DIA algorithm can be very useful for situations such as:
1. Dynamically changing probability distribution of query terms, and
2. Dynamically changing document collection.

Table 3. Time consumed by the Greedy-NN and the PBDIA algorithms
DIA algorithm    FBIS              LAT               TREC
Greedy-NN        23 hrs 59 mins    24 hrs 37 mins    98 hrs 2 mins
PBDIA            9 secs            0 secs            8 secs

5.2.2 Query performance of different DIA algorithms

In Table 4, the average query processing time ($AvgTime_{QP}$) and the speedup relative to the Default algorithm (SP) were measured according to Eq. (3). In Table 5, the average number of bits required to retrieve and decode an identifier during query processing ($AvgBPI_{QP}$) and the improvement over the Default algorithm (Imp) were measured according to Eq. (6). For each document collection, the generated query set was divided into three subsets: the short query set, the medium-length query set, and the long query set. The numbers of terms per query for the short, medium-length, and long query sets range from 1 to 8, 9 to 20, and 21 to 65, respectively.

All decoding mechanisms were optimized, including:
1. Replaced subroutines with macros.
2. Replaced calls to the log function with fast bit shifts.
3. Careful choice of compiler optimization flags.
4. Implementation used 32-bit integers, as that is the internal register size of the Intel P4 CPU.
Furthermore, the Huffman code of batched LLRUN coding was implemented with canonical prefix codes that can be decoded via a fast table look-up (Turpin 1998). With these optimizations, decoding of a document identifier only required tens of ns.

The experimental results are shown in Tables 4 and 5. Key findings are:
1. Table 4 shows that the query performance of the Default algorithm can be 10% faster than the Random algorithm. This indicates that the Default algorithm already captures some clustering nature, and thus can serve as a strong baseline in comparison with other fine-tuned algorithms.
2. Comparing Tables 4 and 5, one should observe that $AvgTime_{QP}$ is proportional to $AvgBPI_{QP}$. This verifies Eq. (4) in Section 3.1, and explains why a good DIA can result in better compression and reduced query processing time.
3. From Table 5, one should observe that both the Greedy-NN and PBDIA algorithms can result in better compression of posting lists for all tested coding methods except Golomb coding. This indicates that the Greedy-NN and PBDIA algorithms can improve the cache efficiency if a posting list cache is implemented.
4. Table 4 shows that both the Greedy-NN and PBDIA algorithms can reduce average query processing time for all tested coding methods except Golomb coding. The query speedup differences between the Greedy-NN and PBDIA algorithms were only 3% on average. Considering the algorithm complexity, the PBDIA algorithm is a good choice for the DIA problem.
5. From Table 4, one should observe that Golomb coding cannot benefit much from the Greedy-NN and PBDIA algorithms in terms of query performance. This is because Golomb coding assumes that the d-gap values in a posting list follow a Bernoulli model (Witten et al. 1999), hence both the compression result and the query processing time of Golomb coding are independent of the d-gap distribution.
6. From Table 4, one should observe that the query speedup obtained by the PBDIA algorithm becomes higher as the query length increases. This is because, as the number of query terms increases, more frequently used query terms are likely to be included, resulting in more advantage due to the PBDIA algorithm.
7. Table 4 shows that both γ coding and unique-order interpolative coding are recommended for real-world IRSes due to their fast query throughputs. In addition, compared with the other tested coding methods, these two coding methods benefit more from the PBDIA algorithm.

We conclude that the PBDIA algorithm is viable for use in real-world IRSes.

Table 4. Query performance of different DIA algorithms (AvgTime_QP is the average query processing time, and SP is the speedup relative to the Default algorithm). Panels: (a) short queries, (b) medium-length queries, (c) long queries. Columns: AvgTime_QP and SP for each coding method (γ coding, Golomb coding, skewed Golomb coding, batched LLRUN coding, unique-order interpolative coding). Rows: the Random, Default, Greedy-NN, and PBDIA algorithms for each collection (FBIS, LAT, TREC). [Numeric entries not available in this copy.]

Table 5. AvgBPI_QP of different DIA algorithms (AvgBPI_QP is the average number of bits required to retrieve and decode an identifier during query processing, and Imp is the improvement over the Default algorithm). Panels: (a) short queries, (b) medium-length queries, (c) long queries. Columns: one group per coding method (γ coding, Golomb coding, skewed Golomb coding, batched LLRUN coding, unique-order interpolative coding). Rows: the Random, Default, Greedy-NN, and PBDIA algorithms for each collection (FBIS, LAT, TREC). [Numeric entries not available in this copy.]

5.2.3 Compression performance of different DIA algorithms

The compression results are shown in Table 6, and the metric used is the average number of bits per identifier (BPI), defined as follows:

$$\mathrm{BPI} = \frac{\text{size of the compressed inverted file}}{\text{number of document identifiers } f}$$

To reduce average query processing time, both the Greedy-NN and PBDIA algorithms aim to improve the compression efficiency for the frequently used query terms. However, this comes at the cost of sacrificing compression efficiency for the less frequently used query terms, so we need to know how much space overhead is traded for this speed advantage. Results in Table 6 show that the Greedy-NN and PBDIA algorithms can speed up query processing with very little or no storage overhead.

Table 6. Compression performance of different DIA algorithms (BPI is the average bits per identifier of the inverted file for the test collection, and Imp is the improvement over the Default algorithm). Columns: BPI and Imp for each coding method (γ coding, Golomb coding, skewed Golomb coding, batched LLRUN coding, unique-order interpolative coding). Rows: the Random, Default, Greedy-NN, and PBDIA algorithms for each collection (FBIS, LAT, TREC). [Numeric entries not available in this copy.]

DIA in parallel IR

This subsection investigates the DIA problem in an IRS that runs on a cluster of workstations. Assuming k workstations, the inverted file is generally partitioned into k disjoint sub-files, one for each workstation. When processing a query, each workstation needs to consult only its own sub-file, and all workstations do so in parallel, so the query processing time is shortened. Ma et al. (2002)
indicated that near-ideal speedup on query processing can be obtained if an inverted file is partitioned using the interleaving partitioning scheme. For such a partitioning, DIA plays a crucial role in load balancing. The PBDIA algorithm can be applied to the inverted file to enhance the clustering property of posting lists for frequently used query terms, and can help the interleaving partitioning scheme yield better load balancing.

Table 7 shows the performance of parallel query processing using the interleaving partitioning scheme with either the Default algorithm or the PBDIA algorithm. The metric is the speedup relative to sequential query processing with the Default algorithm. Experiments were conducted on the TREC collection. The sub-file on each workstation was compressed using the unique-order interpolative coding method. The parallel query processing time was defined as $\max[T_1, T_2, \ldots, T_k]$, where $T_i$ ($1 \le i \le k$) was the time needed to retrieve and decompress the (partial) posting lists for the query terms on the i-th workstation. Note that $T_i$ did not include the document identifier comparison time (for the same reason as described in Section 3.1). The experimental results show that the interleaving partitioning scheme can yield near-ideal speedups, as reported in Ma et al. (2002). In addition, when the PBDIA algorithm is used to enhance the clustering property of posting lists for frequently used query terms, the interleaving partitioning scheme yields super-linear speedups. Hence the DIA problem deserves much attention in parallel IR.

Table 7. Speedup of parallel query processing. Columns: the number of workstations. Rows: Default algorithm + interleaving partitioning; PBDIA algorithm + interleaving partitioning. (*: without interleaving partitioning.) [Numeric entries not available in this copy.]

6. Conclusion

In this paper, we study DIA-based query optimization techniques for an IRS in which an inverted file is used to evaluate queries. We first define a cost model for query evaluation. Based on this model, we propose an efficient heuristic, called the partition-based document identifier assignment (PBDIA) algorithm, for generating a good DIA to reduce average query processing time. The PBDIA algorithm can efficiently assign consecutive document identifiers to the documents containing frequently used query terms. This makes the d-gaps of posting lists for frequently used query terms very small, and results in better compression for popular coding methods without increasing the complexity of the decoding processes. This in turn reduces query processing time. Experimental results show that the PBDIA algorithm can reduce the average query processing time by up to 20%. We also point out that the DIA problem has vital effects on the performance of long queries and parallel IR. Compared with the well-known Greedy-NN algorithm, the PBDIA algorithm is much faster and yields very competitive performance for the DIA problem. This fact should make the PBDIA algorithm viable for use in
modern large-scale inverted file-based IRSes.

Acknowledgements

This work was supported by National Science Council, ROC: NSC E

References

Bellman, R.E. & Dreyfus, S.E. (1962). Applied Dynamic Programming. Princeton, NJ: Princeton University Press.
Cheng, C.S., Shann, J.J.J., and Chung, C.P. (2004). A unique-order interpolative code for fast querying and space-efficient indexing in information retrieval systems. In P. K. Srimani et al. (Eds.), Proceedings of ITCC 2004 International Conference on Information Technology: Coding and Communications, Volume 2, Las Vegas, Nevada, Apr. Los Alamitos, CA: IEEE Computer Society Press.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2).
Fraenkel, A.S. & Klein, S.T. (1985). Novel compression of sparse bit-string - preliminary report. In A. Apostolico & Z. Galil (Eds.), Combinatorial Algorithms on Words: Vol. 12, NATO ASI Series F. Berlin: Springer-Verlag.
Gelbukh, A., Han, S.Y., and Sidorov, G. (2003). Compression of boolean inverted files by document ordering. In Proceedings of 2003 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLPKE-2003), Beijing, China, Oct. Los Alamitos, CA: IEEE Computer Society Press.
Golomb, S.W. (1966). Run Length Encoding. IEEE Transactions on Information Theory, IT-12(3).
Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998). Real life information retrieval: a study of user queries on the Web. SIGIR Forum, 32(1), 5-17.
Kobayashi, M. & Takeda, K. (2000). Information retrieval on the web. ACM Computing Surveys, 32(2).
Ma, Y.C., Chen, T.F., and Chung, C.P. (2002). Posting file partitioning and parallel information retrieval. Journal of Systems and Software, 63(2).
Moffat, A. & Zobel, J. (1992). Parameterised compression for sparse bitmaps. In N. Belkin, P. Ingwersen, and A.M. Pejtersen (Eds.), Proceedings of 15th annual international ACM-SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Jun. New York: ACM Press.
Moffat, A., Zobel, J., and Klein, S.T. (1995). Improved inverted file processing for large text databases. In R. Sacks-Davis and J. Zobel (Eds.), Proceedings of 6th Australasian Database Conference, (pp. 162-171), Adelaide, Australia, Jan.
Moffat, A. & Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4).
Olken, F. & Rotem, D. (1986). Rearranging data to maximize the efficiency of compression. In Proceedings of the fifth ACM SIGACT-SIGMOD symposium on Principles of database systems, Cambridge, Massachusetts, United States, Mar. New York: ACM Press.
Shieh, W.Y., Chen, T.F., Shann, J.J., and Chung, C.P. (2003). Inverted file compression through document identifier reassignment. Information Processing and Management, 39(1), 117-131.
Teuhola, J. (1978). A compression method for clustered bit-vectors. Information Processing Letters, 7(6).
Turpin, A. (1998). Efficient prefix coding. (Ph.D. thesis). Melbourne: University of Melbourne.
Voorhees, E. & Harman, D. (1997). Overview of the sixth text retrieval conference (TREC-6). In E.M. Voorhees and D.K. Harman (Eds.), Proceedings of the Sixth Text REtrieval Conference (TREC-6), (pp. 1-24). Gaithersburg, MD: NIST.
Williams, H.E. & Zobel, J. (1999). Compressing integers for fast file access. The Computer Journal, 42(3).
Williams, H.E. & Zobel, J. (2002). Indexing and retrieval for genomic databases. IEEE Transactions on Knowledge and Data Engineering, 14(1).
Witten, I.H., Moffat, A., and Bell, T.C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. San Francisco, CA: Morgan Kaufmann Publishers.
Wolfram, D. (1992). Applying informetric characteristics of databases to IR system file design, Part I: informetric models. Information Processing and Management, 28(1).
Xie, Y. & O'Hallaron, D. (2002). Locality in search engine queries and its implications for caching. In P. Kermani, F. Bauer, and P. Morreale (Eds.), Proceedings of the 21st Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM'02), New York, Jun.
Zipf, G. (1949). Human Behavior and the Principle of Least Effort. New York: Addison-Wesley.
Zobel, J. & Moffat, A. (1995). Adding compression to a full-text retrieval system. Software Practice and Experience, 25(8).
Zobel, J., Moffat, A., and Ramamohanarao, K. (1998). Inverted files versus signature files for text indexing. ACM Transactions on Database Systems, 23(4).

