Querying and Ranking XML Documents Based on Data Synopses

Size: px

Start display at page:

Download "Querying and Ranking XML Documents Based on Data Synopses"

Sibyl Russell
5 years ago
Views:

1 Queryng and Rankng XML Documents Based on Data Synopses Wemn He 1, Teng Lv 2 1 Department of Computng and New Meda Technologes Unversty of Wsconsn-Stevens Pont Stevens Pont, Wsconsn 54481, USA whe@uwsp.edu 2 Teachng and Research Secton of Computer Artllery Academy, Hefe Anhu, Chna lt0410@163.com Journal of Dgtal Informaton Management ABSTRACT: There s an ncreasng nterest n recent years for queryng and rankng XML documents. In ths paper, we present a new framework for queryng and rankng schema-less XML documents based on concse summares of ther structural and textual content. We ntroduce a novel data synopss structure to summarze the textual content of an XML document for effcent ndexng. More mportantly, we extend the tradtonal vector space model to effectvely rank XML documents over the proposed data synopses. We conduct extensve experments over XML benchmark data to demonstrate the advantages of the ndexng scheme and the effectveness of our rankng scheme. We also compare our framework wth Lucene to demonstrate our extended TF*IDF scorng functon s effectve. Categores and Subject Descrptors D.3.2 [Language Classfcatons]: Data-flow languages; H.3.1 [Content Analyss and Indexng]: I.7 [Document and Text Processng]: Markup languages; H.3.3 [Informaton Search and Retreval]: Query Formulaton General Terms: XML, Informaton Retreval, Data Processng Keywords: XML, Query processng, Document rankng, Query synopses, Document rankng Receved: 11 June 2011, Revsed 12 August 2011, Accepted 19 August Introducton As XML has become the de facto form for representng and exchangng data on the web, there s an ncreasng nterest n queryng and rankng text-centrc XML documents. Although approxmate matchng wth relevance rankng for plan documents has long been the focus of Informaton Retreval (IR), the herarchcal nature of XML data has brought new challenges to both the IR and database communtes. Extensve work has been motvated on developng effcent ndexng and query evaluaton algorthms, and proposng effectve rankng schemes over XML data [1], [2], [3], [4], [15], [16], [17]. In ths paper, we present a new framework for effcent queryng and rankng schema-less XML documents based on condensed summares extracted from the structural and textual content of the documents. Instead of ndexng each sngle element or term n a document, we extract a structural summary and a small number of data synopses from the document, whch are ndexed n a way sutable for query evaluaton. The result of the query evaluaton s a ranked lst of XML documents that best match the query. In order to rank XML documents, we extend the tradtonal vector space model and develop an effectve rankng scheme based on TF*IDF scorng. We beleve that our framework can serve as an nfrastructure for a wde range of web applcatons, such as onlne nteractve exploraton of massve XML data stores and XML search engnes n peer-topeer envronment. In summary, we address the followng ssues n ths paper: We present a novel framework for ndexng, queryng, and rankng XML documents based on content and structure synopses. We propose an effcent XML meta-data ndexng scheme that s sutable for full-text XML query evaluaton. We extend the tradtonal vector space model to effectvely rank XML documents based on meta-data. We expermentally valdate the advantages of our ndexng scheme and the effectveness of our rankng scheme. 2. Meta-Data Indexng As our query language, we extend the XPath syntax wth a full-text search predcate e ~ S, where e s an arbtrary XPath expresson. Ths predcate returns true f at least one element from the sequence returned by e matches the search specfcaton, S. A search specfcaton s a smple IR-style boolean keyword search that takes the form term S 1 and S 2 S 1 or S 2 ( S ) where S, S 1, and S 2 are search specfcatons. A term s an ndexed term that must be present n the text of an element returned by the expresson e. As a runnng example used throughout the paper, the followng query, Q: //aucton//tem[locaton "Dallas"] [descrpton "mountan" and "bcycle"]/prce searches for the prces of all aucton tems located n Dallas that contan the words mountan and bcycle n ther descrpton. An example XML document s shown n Fgure 1. The document represents the aucton nformaton from an aucton web ste. When an XML document s ndexed, only a structural summary and a small number of concse data synopses are extracted and ndexed. Journal of Dgtal Informaton Management Volume 9 Number 5 October

2 <aucton> <sponsor> <name> ebay </name> <address> 1040 W. Abram Dr. San Jose, CA </address> </sponsor> <tem d="01"> <name> bcycle </name> <descrpton> a mountan bcycle used for 2 years </descrpton> <payment> Credt Card, Cash, Money Order </payment> <locaton> Dallas, TX </locaton> <prce> 30 </prce> </tem> <tem d="02"> <name> bcycle </name> <descrpton> a brand new 24-nch bcycle for grls </descrpton> <payment> Credt Card, Cash, Check </payment> <locaton> Stevens Pont, WI </locaton> <prce> 60 </prce> </tem> <tem d="03"> <name> car </name> <descrpton> 1999 Toyota Camery LE, mleage 80K </descrpton> <payment> Credt Card, Check </payment> <locaton> Arlngton, TX </locaton> <prce> 5000 </prce> </tem> </aucton> Fgure 1. Example XML Document 2.1 Structural Summary To match the structural components n the query durng the query evaluaton, we construct a type of data synopses, called Structural Summary(SS), whch s a structural markup that captures all the unque paths n the document. The structural summary of the example XML document s shown n Fgure 2. Each node n a structural summary has a tagname and a unque d. Fgure 2. Structural Summary of Example XML Document 2.2 Content Synopses To capture the textual content of a document, for each text node k n the structural summary S of the document D (an SS node that corresponds to at least one document element that contans text), we construct a content synopss (CS) H D to p summarze the textual data assocated wth k, where the path p s the unque smple path from the root of S to the node k n S. H D s a bt matrx of sze L W, where W s the number of p term buckets and L s the number of postonal ranges n the document. Here, W s proportonal to the number of elements n D reachable by the text label path p. L s proportonal to the number of begn/end tags n the document. The postonal nformaton s represented by the document order of the begn/ end tags of the elements. More specfcally, for each term t drectly nsde an element assocated wth the node k whose begn/end poston s b/e, we set all matrx values H D [, hash(t) p Fgure 3. Data Synopses Example modw] to one, for all b L/ D e L// D, where hash s a strng hashng functon and D s the document sze. That s, the [0, D ] range of tag postons n the document s compressed nto the range [0, L]. H D p s mplemented as a B+ -tree ndex wth ndex key p, because, durng the query processng, we need to retreve the content synopses of all documents for a gven path p. For example, the content synopss for the SS node descrpton (k = 3) s llustrated on the rght of Fgure Postonal Flters For each non-text node n n the structural summary of a document, we construct another type of data synopss, called Postonal Flter (PF), denoted by F D p. As we dd for H D p, F D p s also mplemented as a B-tree wth ndex key p. F D p s a bt matrx of sze L M, where L s the document postonal ranges of the elements assocated wth node n that s reachable by the label path p, and M s the number of bt vectors n F D p. The postonal flter for SS node tem s demonstrated on the left n Fgure 3. The 7 one-bt ranges ndcate there are 7 tem elements n the document. 2.4 Contanment Flterng Usng the above data synopses, we can enforce the element contanment constrants n the query usng an operaton called Contanment Flterng. Let F be a postonal flter of sze L M and V be a bt vector extracted from a content synopss whose sze s L W. The Contanment Flterng CF(F, V ) returns a new postonal flter F'. The bt F' [, m] s on ff: k [0, L] : V [k] = 1 j [, k] : F [j, m] = 1 Bascally, the Contanment Flterng copes a contnuous range of one-bts from F to F' f there s at least one poston wthn ths range n whch the correspondng bt n V s one. Fgure 4 shows how the data synopses are used to determne whether a document s lkely to satsfy the query Q (here M = 2). Frst, we do a contanment flterng between the ntal postonal flter F 2 and the bt vector H 4 [ Dallas ]. In the resultng postonal flter A, only 5 one-bt ranges out of 7 n F2 are left. Countng from bottom to top, the 2nd and 4th one-bt ranges n F2 are dscarded n A because there s no any onebt range n H 4 [ Dallas ] that ntersects wth the 2nd or 4th range, whch means that nether the 2nd nor 4th tem element contans a locaton element that contans the term Dallas. Smlarly, we can do contanment flterng between A and the resultng bt vector derved from the 200 Journal of Dgtal Informaton Management Volume 9 Number 5 October 2011

3 calculatng the weght of a term, we consder a path-term par as the unt for content scorng. 1) TF*IDF Calculaton: We frst gve the defntons of TF and IDF scores of a path-term par, and then use an example to llustrate ther calculatons. Defnton 1: TF Score of a Path-Term Par. Let D be an XML document assocated wth the par (p, t), where p s a full text label path from ts structural summary and t s a term. Let paths(d) be the set of tuples (t x, p x, b x, e x, x ) for all terms t x n D, where b x /e x s the begn/end poston of the element that drectly contans t x, p x s a full label path that reaches t x, and x s the document poston of t x n D. The TF score of (p, t) relevant to D s defned as: TF D (p, t) = { x (t x, p x, b x, e x, x ) paths(d) p = p x t = t x } (1) Fgure 4. Testng query Q Usng Data Synopses btwse andng between H 3 [ mountan ] and H 3 [ bcycle ]. The 3 one-bt ranges n B ndcate that 3 tems out of 7 n F 2 satsfy all element contanment constrants n the query. Thus, the document s consdered to satsfy the query. 3. Query Processng Due to the the space lmtaton, we only brefly descrbe the query processng n our framework n ths secton. As a XPath query s evaluated, a query footprnt s derved from the query. A query footprnt(qf) captures the essental structural components and all the entry ponts assocated wth the search predcates. For example, the query footprnt of Q s: //aucton//tem:1[locaton:2] [descrpton:3]/prce The numbers 1, 2, and 3 are the numbers of entry ponts n QF that ndcate the places where data synopses are needed for query evaluaton. Based on the query footprnt, all the matchng structural summares and dstnct label path combnatons correspondng to the query search predcates are retreved. Based on the retreved label path combnatons, documents that have data synopses assocated wth those label paths are found and qualfed documents are fltered out usng contanment flterng. The scores of documents are also calculated durng the query evaluaton and the ranked document lst s returned to the clent. 4. Extendng Vector Space Model For Rankng Snce a query footprnt may match a large number of structural summares and there may be a large number of documents that match each structural summary, t s desrable to rank all the qualfed documents usng a scorng functon. 4.1 Extended Vector Space Model In the tradtonal vector space model used n IR systems, the TF*IDF rankng for keyword queres s defned over a document collecton [19]. Ths rankng mechansm consders two factors: (1) TF, the term frequency, that measures the relatve mportance of a query keyword n the canddate document, and (2) IDF, the nverse document frequency, that measures the global mportance of a query keyword n the entre collecton of documents. We extend the vector space model to measure the content smlarty of an XML document D to a query Q. Due to the nherent herarchcal structure of XML data, nstead of Bascally, TF D (p, t) counts the number of (p, t) pars n the document D. Notce that, f t occurs n tmes n the same element reachable by p n D, then (p, t) wll be counted n tmes. Defnton 2: IDF Score of a Path-Term Par. Let N be the total number of documents n the corpus. Let D j 1 j N, be an XML document assocated wth the par (p, t), where p s a full text label path derved from structural summary matchng and t s the correspondng term n the query. The IDF score of (p, t) s defned as: N ( ) = p IDF p, t log Np,t ( ) (2) where N p and N(p, t) are calculated as follows: N { ( x, x, x, x, x) paths( j) x} (3) j1 Np j t p b e D p p N Npt tx, pxtx, px, bx, ex, x) pathsdj j1 p px t tx Bascally, N p counts the total number of documents that contan path p and N(p, t) counts the total number of documents that contan the path-term par (p, t) n the corpus. We now walk through an example to llustrate TF and IDF score calculaton. Suppose we have three documents n the corpus only, D 1, D 2, and D 3. The path-term par (p, t) s (/aucton/tem/ locaton, Dallas ). Suppose D 1 contans 30 paths /aucton/tem/ locaton that can reach the term Dallas and D 1 contans 100 paths /aucton/tem/descrpton that can reach the term bcycle. Suppose that all the three documents contan the path /aucton/ tem/locaton, but only the paths /aucton/tem/locaton n D 1 and D 2 can reach the term Dallas. Then the TF and IDF scores can be calculated as follows: TF D 1(/aucton/tem/locaton, Dallas )=30 IDF(/aucton/tem/locaton, Dallas )=log(3/2)=log Enhanced Scorng wth Postonal Weght A path-term par (p,t) corresponds to a postonal bt vector V n the content synopss assocated wth p. A one-bt range n V represents an element that contans t and s reachable by p. For nstance, n Fgure 4, the bt vector H 3 [ mountan ] corresponds to the par (/aucton/tem/descrpton, mountan ) and t contans 5 bt-on ranges. The number of one-bt ranges n the vector reflects the TF score of the par (/aucton/ tem/ descrpton, mountan ). However, after the btwse andng (4) Journal of Dgtal Informaton Management Volume 9 Number 5 October

operaton, only 3 one-bt ranges are left n the resultng bt vector, whch ndcates that among those 5 descrpton elements, only 3 of them contan both mountan and bcycle.

Thus, t s desrable to take nto account ths percentage of qualfed elements as we calculate the weght of a path-term par.

4 operaton, only 3 one-bt ranges are left n the resultng bt vector, whch ndcates that among those 5 descrpton elements, only 3 of them contan both mountan and bcycle. Smlarly, after the contanment flterng between the postonal flter of tem(f 2 ) and H 4 [ Dallas ], only 5 tem elements are left out of 7 n the resultng postonal flter(a). Thus, t s desrable to take nto account ths percentage of qualfed elements as we calculate the weght of a path-term par. To make the weght calculaton of a path-term par more accurate, we ntroduce the postonal weght, whch s the percentage of qualfed pathterm pars found durng the contanment flterng or btwse andng operaton, and s used to calculate the weght of the path-term par. Defnton 3: Postonal Weght of a Path-Term Par. Let D be an XML document, PF D (p, t) be the postonal flter assocated wth (p, t), and PF D (p, t) be the result from contanment 0 flterng or btwse andng operaton. In addton, let N PF D (p, t) 0 be the number of one-bt ranges n PF D (p, t) and PF D (p, t) be 0 the number of one-bt ranges n PF D (p, t). The postonal weght of (p, t) n D s defned as: N D D PF ( p, t) PW ( p, t) (5) N PF D O ( p, t) To llustrate the calculaton of the postonal weght of a pathterm par, we refer to our runnng example. In Fgure 4, before the contanment flterng, there are 7 one-bt ranges n F 2. After the contanment flterng, 5 one-bt ranges reman n A. Smlarly, before the btwse andng operaton, there are 5 and 4 one-bt ranges n H [ mountan ] and H 3 [ bcycle ] respectvely. After ths operaton, 3 one-bt ranges are left. Thus, the postonal weght of each path-term par s calculated as follows: PW D (/aucton/tem/locaton, Dallas )=5/7=0.71 PW D (/aucton/tem/descrpton, mountan )=3/5=0.6 PW D (/aucton/tem/descrpton, bcycle )=3/4=0.75 Combnng the TF score, IDF score, and the postonal weght, the defnton of the weght of (p,t) n D s determned by the followng equaton: D D D W ptpw pttf pt IDFpt (6) Fnally, we gve the defnton of the enhanced content score of D relevant to Q. Defnton 4: Enhanced Content Score of a Document. Let Q be the query and D be an XML document. Let W Q ( p Q, t Q ) be the weght of the path-term par (p Q, t Q ) n Q and W D (p D, t D ) be the weght of the correspondng pathterm par (p D, t D ) n the document D. The content score of D relevant to Q s defned as n Q Q Q 2 D D D 2 W ( p, t ) W ( p, t ) ECS( D, Q) 1 n n Q Q Q 2 D D D 2 W ( p, t ) W ( p, t ) 1 1 where n s the number of path-term pars. In order to calculate the TF * IDF score of a pathterm par, we construct two addtonal synopses named TF Synopss and IDF Synopss. When a document D s ndexed, for each text label path p n the structural summary of D, a TF synopss s constructed to summarze the frequences of all the terms assocated wth p n D. More specfcally, a TF synopss s a hash table that maps a term to a par consstng of DstnctTerms that ndcates the number of dstnct terms n D reachable by p, and Frequences that counts the total number of term occurrences reachable by (7) p n D. In our mplementaton, a TF synopss s attached to the content synopss assocated wth p. To obtan the TF score of the path-term par (p, t) durng query evaluaton, path p s used as the key to retreve the correspondng content synopss. Then term t s hashed nto some bucket over the attached TF synopss, usng the value of Frequences/DstnctTerms as the term frequency of (p, t). To acheve IDF score of a pathterm par, for each text label path p, we construct an IDF synopss, whch contans two members named Documents and Synopss, where Documents counts the total number of documents contanng the path p, and Synopss s a hash table that maps the term t to the total number of documents contanng (p, t). 5. Experments We have buld a prototype system named XQR usng Java (J2SE 6.0) and Berkeley DB Java Edton [13] was employed as the storage manager. We conducted extensve experments to evaluate the effcency of our ndexng scheme and the effectveness of our rankng scheme. We used two XML benchmark datasets XMark [22] and XBench [21] as our datasets. The man characterstcs of our datasets and data synopses sze are summarzed n Table 1. For each dataset, our query workload s 10 full-text XPath queres exhbtng dfferent szes, query structures and search predcates. Data Set Data Sze Fles Avg. Fle Avg. SS Avg. CS Avg. PF (MB) Sze (KB) Sze (KB) Sze (KB) Sze (KB) XMark XMark XBench XBench Table 1. Data set characterstcs and data synopses sze 5.1 Evaluaton of Indexng Scheme To demonstrate the effcency of our Data Synopses Indexng( DSI) scheme, we mplemented the tradtonal Inverted Lst Indexng(ILI) scheme n Berkeley DB and compared t wth our ndexng scheme. We chose XMark2 and XBench2 as datasets for ths experment because the dataset sze s the key factor for the ndexng scheme comparson experment. Fgure 5 shows that for XMark dataset, ILI consumes over 2 tmes dsk space and query response tme than DSI. For XBench dataset, DSI s much more effcent than ILI because XBench data s text-centrc data generated from Text-CentrcMultple Document(TC/MD) class, whch s more sutable for the ndex comparson experments. In fact, the ndex sze of DSI s about 3% of that of ILI. The query response tme of DSI s over 40 tmes faster than ILI. (a) Index Sze (b) Query Response Tme Fgure 5. Comparson between ILI and DSI 202 Journal of Dgtal Informaton Management Volume 9 Number 5 October 2011

5.2 Evaluaton of Rankng Scheme To examne the effectveness of our rankng functon, we used two wdely accepted metrcs, precson and recall.

To construct the accurate relevant set for each query, we exploted Qzx/open [20] to evaluate the query over each dataset to obtan the strct relevant set.

Note that the meanng of bestk value s dfferent from that n the classcal topk algorthms, such as Fagn algorthm [18]. The bestk value here s just the number of returned documents for relevance rankng.

the relevant documents on the top of the ranked lst so that the precson does not drop too much when the number of returned documents ncreases.

Fgure 6(b) shows the recall over XMark1 s greater than that over XMark2 and the recall over XBench1 s larger than that over XBench2.

Varyng BestK Value 2) Impact of Heght Factor: In the second set of experments, we fxed the bestk value and the wdth of data synopses to see the mpact of dfferent heght factors on precson and recall.

5 5.2 Evaluaton of Rankng Scheme To examne the effectveness of our rankng functon, we used two wdely accepted metrcs, precson and recall. We evaluated all the XMark(XBench) queres over each XMark(XBench) dataset to measure the average precson and recall. To construct the accurate relevant set for each query, we exploted Qzx/open [20] to evaluate the query over each dataset to obtan the strct relevant set. 1) Varyng bestk value: We frst fxed the heght and wdth of data synopses and vared the bestk value,.e., the number of documents returned, to measure the average precson and recall of a query. Note that the meanng of bestk value s dfferent from that n the classcal topk algorthms, such as Fagn algorthm [18]. The bestk value here s just the number of returned documents for relevance rankng. The results n Fgure 6(a) show that as the bestk value vares from 10 to 100, the average precson of a query over each dataset drops smoothly, whch ndcates that our scorng functon can effectvely rank the relevant documents on the top of the ranked lst so that the precson does not drop too much when the number of returned documents ncreases. Snce the average sze of the relevant set for a query over each XBench dataset s relatvely small, the average precson of a query over each XBench dataset s a lttle lower. Fgure 6(b) shows the recall over XMark1 s greater than that over XMark2 and the recall over XBench1 s larger than that over XBench2. The reason s that when the number of documents ncreases, the sze of relevant set ncreases, whch leads to a lower recall. (a) Average Precson (b) Average Recall Fgure 6. Varyng BestK Value 2) Impact of Heght Factor: In the second set of experments, we fxed the bestk value and the wdth of data synopses to see the mpact of dfferent heght factors on precson and recall. As we can see from Fgure 7(a) and Fgure 7(b), as the heght factor vares from 0.1 to 0.5, the precson and recall almost reman at the same value for each dataset. Ths was expected because when the heght of data synopses s reduced, a query may get more false postves, but our rankng functon can effectvely rank the most relevant documents close to the top, whle movng false postves near the bottom of the answer set so that the precson and recall almost do not change. (a) Impact of Wdth Factor on (b) Impact of Wdth Factor Precson on Recall Fgure 8. Impact of Wdth Factor 3) Impact of Wdth Factor: Fnally, we fxed the sze of the bestk value and the heght of data synopses to see the mpact of dfferent wdth factors on precson and recall. The results are shown n Fgure 8(a) and Fgure 8(b). As we can see, the precson and recall change a lttle more than those n Fgure 7(a) and Fgure 7(b). Ths result mples that f we want to decrease the sze of data synopses to reduce the data storage overhead but stll keep hgh precson, the heght factor should be adjusted frst. 5.3 Expermental Comparson wth Lucene We also compared our framework wth Lucene [14], whch s a full-featured text search engne lbrary wrtten entrely n Java. Lucene s the closest to our framework because t returns ranked XML document hts as query answers and t supports boolean queres that may contan feld-term pars such as ttle: XML AND author: Sucu. We conducted the comparatve experments over XMark2 and XBench2 datasets. Snce Lucene does not drectly support XPath queres, we converted the XPath queres over two data sets to the correspondng queres that are supported n Lucene. We frst compared our rankng scheme wth Lucene over XMark data. We vared the top K value and measured the average precson and recall. The result s shown n Fgure 9. Fgure 9(a) shows that for the same top K value, the average precson of XQR s much hgher than that of Lucene. In addton, the average precson of Lucene drops rapdly when the top K value s ncreased because more false postves are returned n the answer set. On the contrary, the average precson of XQR does not decrease sharply when the top K value s ncreased because our rankng scheme can effectvely rank the relevant documents on the top of the ranked lst. From Fgure 9(b), we can see that the average recall of Lucene s not only much lower than that of XQR, but also s lower than 2%. The reason s that the sze of a relevant set s relatvely larger for XMark data and there are more false postves n the answer set of Lucene, whch leads to a very low average recall. (a) Impact of Heght Factor on Precson Fgure 7. Impact of Heght Factor (b) Impact of Heght Factor on Recall (a) Average Precson (b) Average Recall Fgure 9. Rankng Comparson wth Lucene over XMark Data Journal of Dgtal Informaton Management Volume 9 Number 5 October

(a) Average Precson (b) Average Recall Fgure 10. Rankng Comparson wth Lucene over XBench Data We then compared our rankng scheme wth Lucene over XBench data.

From Fgure 10, we can see that both average precson and recall of Lucene are much lower than those of XQR and are close to 0.

6 (a) Average Precson (b) Average Recall Fgure 10. Rankng Comparson wth Lucene over XBench Data We then compared our rankng scheme wth Lucene over XBench data. We vared the top K value and measured the average precson and recall. The result s shown n Fgure 10. From Fgure 10, we can see that both average precson and recall of Lucene are much lower than those of XQR and are close to 0. The s because the queres over XBench data are more selectve and Lucene returns very few relevant document hts n the top K answer set for most queres, thus leads to close-to-zero average precson and recall. 6. Related Work Wth the popularty of XML as the data format for a wde varety of web data repostores, extensve work has been motvated on desgnng powerful query languages, developng effcent ndexng and query evaluaton algorthms, and proposng effectve rankng schemes over XML data [1], [2], [3], [15], [17]. The related work can be classfed nto three categores. Approaches n the frst category are focused on extendng exstng complex XML query languages, such as XQuery, wth IR search predcates and ncorporatng smple IR scorng methods nto the query evaluaton. Khalfa et al [1] proposes bulk algebra called TIX, whch ntegrates smple IR scorng schemes nto a tradtonal ppelned query evaluator for an XML database. More specfcally, new operators and effcent access methods are proposed to generate and manpulate scores of XML fragments durng query evaluaton. TeXQuery [2] supports a powerful set of fully composable full-text search prmtves, whch can be seamlessly ntegrated nto the XQuery language. It mostly focuses on the TeXQuery language desgn and the underlyng formal model, but also provdes a costng model [5]. Strateges n the second category are dedcated to extendng tradtonal IR models for scorng smple XPath-lke queres wth full-text extenson. In [6], the authors present a framework that relaxes a full-text XPath query by droppng some predcates from ts closure and scorng the approxmate answers usng predcate penaltes. They further propose novel scorng methods by extendng TF*IDF rankng to account for both structure and content whle consderng query relaxatons [3]. Sgurbjornsson et al, [7] propose a framework that decomposes the query nto several pathterm pars, evaluates these pars separately and produces a fnal rankng that takes nto account the scores of dfferent sources of evdence. In the thrd category, efforts are made to address the problem of effcently producng ranked results for keyword search queres over XML documents. XRank [16] extends Google-lke keyword search to XML. The authors propose an algorthm for scorng XML elements that takes nto account both hyperlnk and contanment edges. New nverted lst ndex structures based on Dewey IDs and assocated query processng algorthms are also presented. XSEarch [15] s a semantc search engne that extends smple keyword search by ncorporatng keyword context nformaton nto the query,.e., each query term s a keyword-label par nstead of a sngle keyword. XSEarch extends the TF*IDF rankng scheme to rank the results. A recent work, XKSearch [8], ntroduces the concept of smallest lowest common ancestors (SLCAs) and proposes two effcent algorthms, Indexed Lookup Eager and Scan Eager, for keyword search n XML documents accordng to the SLCA semantcs. Note that all the above proposals consder queryng and rankng orgnal XML documents, and returnng XML fragments as query answers. None of them evaluates a full-text query over XML meta-data to return a ranked lst of document locatons to the clent. Snce our queryng and rankng schemes are based on XML meta-data, our work s also related to XML summarzaton technques n the lterature. Polyzots et al [9] propose an XSKETCH synopss model that explots localzed stablty and value-dstrbuton summares to accurately capture the complex correlaton patterns that exst between and across path structure and element values n the XML data graph. In [10], the authors further propose a novel class of XML synopses, termed XCLUSTERs, that provdes a unfed summarzaton framework to address the key problem of XML summarzaton n the context of heterogeneous value content. Smlarly, [11] presents a novel data structure, the bloom hstogram, to approxmate XML path frequency dstrbuton wthn a small space budget and to accurately estmate the path selectvty. A dynamc summary layer s used to keep exact or more detaled XML path nformaton to accommodate frequently changed data. However, all these proposed XML synopses and summares are manly used for selectvty estmaton, rather than for locatng and rankng XML documents, as s n our framework. A recent work by Cho et al [12] addresses the meta-data ndexng problem of effcently dentfyng XML elements along each locaton step n an XPath query, that satsfy range constrants on the ordered meta-data. the authors develop a meta-data ndex structure named full meta-data ndex(fmi) by enhancng the R-tree ndex proposed for XPath locaton steps. The FMI can quckly dentfy XML elements that are reachable from a gven element usng a specfed XPath axs and satsfy the meta-data range constrant.to reduce the metadata update overhead, another R-tree-based ndex structure named nhertance meta-data ndex(imi) s also proposed to enable effcent lookup n the ndex structure, whle keepng the update cost manageable. 7. Concluson We presented a framework for ndexng and rankng XML documents based on concse data synopses extracted from orgnal XML documents. We frst proposed a novel meta-data ndexng scheme sutable for full-text XPath query evaluaton. We then extended vector space model to rank the query results over the ndexed meta-data. The experments demonstrate that our ndexng scheme s much effcent than tradtonal nvertedlst based ndexng scheme. Our rankng scheme s effectve n terms of precson and recall and superor than the exstng XML search engne Lucene. 8. Acknowledgment Ths work s supported by the Faculty & Program Development Grant from the Unversty of Wsconsn-Stevens Pont, USA, Journal of Dgtal Informaton Management Volume 9 Number 5 October 2011

7 References [1] Al-Khalfa, S., Yu, C., Jagadsh, H. V. (2003). Queryng Structured Text n an XML Database, In: SIGMOD 03. [2] Amer-Yaha, S., Botev, C., Shanmugasundaram, S. (2004). TeXQuery: A Full-Text Search Extenson to XQuery, In: WWW 04. [3] Amer-Yaha, S., Koudas, N., Maran, A., Srvastava, D., Toman, D. (2005). Structure and Content Scorng for XML, In: VLDB 05. [4] Amer-Yaha, S., Curtmola, E., Deutsch, A. (2006). Flexble and Effcent XML Search wth Complex Full-Text Predcates, In: Proceedngs of SIGMOD 06. [5] Maran, A., Amer-Yaha, S., Koudas, N., Srvastava, D. (2005). Adaptve Processng of Top-K Queres n XML, In: Proceedngs of ICDE 05. [6] Amer-Yaha, S., Lakshmanan, L. V.S., Pandt, S. (2004). FleXPath: Flexble Structure and Full-Text Queryng for XML, In: Proceedngs of SIGMOD 04. [7] Sgurbjornsson, B., Kamps, J., Rjke, M. D. (2004). Processng Content- Orented XPath Queres, In: Proceedngs of CIKM 04. [8] Xu, Y., Papakonstantnou, Y. (2005). Effcent Keyword Search for Smallest LCAs n XML Databases. SIGMOD [9] Polyzots, N., Garofalaks, M. (2002). Structure and Value Synopses for XML Data Graphs, In: Proceedngs of VLDB 02. [10] Polyzots, N., Garofalaks, M. (2006). XCluster Synopses for Structured XML Content, In: Proceedngs of ICDE 06. [11] Wang, W., Jang, H., Lu, H., Yu, J.H. (2004). Bloom Hstogram: Path Selectvty Estmaton for XML Data wth Updates, In: Proceedngs of VLDB 04. [12]. Cho, S., Koudas, N., Srvastava, D. (2006). Metadata Indexng for XPath Locaton Steps, In: Proceedngs of SIGMOD 06. [13] Berkeley DB. [14] Lucene. [15] Cohen, S., Mamou, J., Kanza, Y., Sagv, Y. (2003). XSEarch: A Semantc Search Engne for XML, In: VLDB 03. [16] Guo, L., Shao, F., Botev, C., Shanmugasundaram, J. (2003). XRANK: Ranked Keyword Search over XML Documents, In: SIGMOD 03. [17] Chen, L., Papakonstantnoul Y. (2010). Supportng Top-K Keyword Search n XML Databases, In: Proceedngs of ICDE 10. [18] Fagn, R., Lotem, A., Naor, M. (2003). Optmal aggregaton algorthms for mddleware, In: Journal of Computer Systems and Scence, 66 (4) [19] Yates, R. B., Neto, B.R. (1999). Modern Informaton Retreval, ACM Press. [20] Qzx/open. [21] XBench. [22] XMark. [23] XML Path Language (XPath) xpath20/ Journal of Dgtal Informaton Management Volume 9 Number 5 October

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence