Federated Search of Text-Based Digital Libraries in Hierarchical Peer-to-Peer Networks


Jie Lu
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
jielu@cs.cmu.edu

Jamie Callan
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
callan@cs.cmu.edu

ABSTRACT

Peer-to-peer architectures are a potentially powerful model for developing large-scale networks of text-based digital libraries, but peer-to-peer networks have so far provided very limited support for text-based federated search of digital libraries using relevance-based ranking. This paper addresses the problems of resource representation, resource ranking and selection, and result merging for federated search of text-based digital libraries in hierarchical peer-to-peer networks. Existing approaches to text-based federated search are adapted and two new methods are developed for resource representation and resource selection according to the unique characteristics of hierarchical peer-to-peer networks. Experimental results demonstrate that the proposed approaches are both more accurate and more efficient than more common alternatives for text-based federated search in peer-to-peer networks.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Retrieval models, Search process, Selection process

General Terms
Algorithms, Design, Experimentation, Performance

Keywords
Peer-to-peer, Hierarchical, Federated Search, Text-Based, Retrieval, Digital Library

1. INTRODUCTION

Peer-to-peer (P2P) networks are an appealing approach to federated search over large networks of digital libraries. The activities involved for search in peer-to-peer networks include issuing requests ("queries"), routing requests ("query routing"), and responding to requests ("retrieval"). The nodes in peer-to-peer networks can participate as clients and/or servers. Client nodes issue queries to initiate search in peer-to-peer networks; server nodes provide information contents, respond to queries with documents that are likely to satisfy the requests, and/or route queries to other servers.
The first peer-to-peer networks were based on sharing popular music, videos, and software. These types of digital objects have relatively obvious or well-known naming conventions and descriptions, making it possible to represent them with just a few words from a name, title, or manual annotation. From a Library Science or Information Retrieval perspective, these systems were designed for known-item searches, in which the goal is to find a single instance of a known object (e.g., a particular song by a particular artist). In a known-item search, the user is familiar with the object being requested, and any copy is as good as any other. Known-item search of popular music, video, and software file-sharing systems is a task for which simple solutions suffice. If P2P systems are to scale to more varied content and larger digital libraries, they must adopt more sophisticated solutions.

A very large number of text-based digital libraries were developed during the last decade. Nearly all of them use some form of relevance ranking, in which term frequency information is used to rank documents by how well they satisfy an unstructured text query. Many of them allow free search access to their contents via the Internet, but do not provide complete copies of their contents, or even complete title lists for their contents, upon request. Many do not allow their contents to be crawled by Web search engines. They do not cooperate by conforming to a single method of text representation, query processing, or document retrieval; they don't even provide information about how these operations are done. We would argue that most of the recent research on peer-to-peer networks offers little useful guidance for providing federated search of current text-based digital libraries.

This paper addresses the problem of using peer-to-peer networks as a federated search layer for text-based digital libraries.
We study federated search in two different types of environments: cooperative environments where each digital library provides an accurate resource description of its content upon request, and uncooperative environments where resource descriptions must be obtained indirectly. We start by assuming the current state of the art; that is, we assume that each digital library is a text database running a reasonably good conventional search engine, that it provides search access to its holdings, and that it provides individual documents in response to full-text queries. We present in this paper how resource descriptions of digital libraries are obtained and used for efficient query routing, and how results from different digital libraries are merged into a single, integrated ranked list in peer-to-peer networks.

In the following section we give an overview of the prior research on federated search of text-based digital libraries and peer-to-peer networks. Section 3 describes our approaches to federated search of text-based digital libraries in peer-to-peer networks. Sections 4 and 5 discuss our data resources and evaluation methodologies. Experimental settings and results are presented in Section 6. Section 7 concludes.

2. OVERVIEW

Accurate and efficient federated search in peer-to-peer networks of text-based digital libraries requires both the appropriate peer-to-peer architecture and the effective search methods developed for the chosen architecture. In this section we present an overview of the prior research on federated search of text-based digital libraries, peer-to-peer network architectures, and text-based search in peer-to-peer networks in order to set the stage for the descriptions of our approaches to text-based federated search in peer-to-peer networks.

2.1 Federated Search of Text-Based Digital Libraries

Prior research on federated search of text-based digital libraries (also called "distributed information retrieval" in the research literature) identifies three problems that must be addressed:

- Resource representation: Discovering the contents or content areas covered by each resource ("resource description");
- Resource ranking and selection: Deciding which resources are most appropriate for an information need based on their resource descriptions; and
- Result merging: Merging ranked retrieval results from a set of selected resources.

A directory service is responsible for acquiring resource descriptions of the digital libraries it serves, selecting the appropriate resources (digital libraries) given the query, and merging the retrieval results from selected resources into a single, integrated ranked list. Solutions to all three problems for the case of a single directory service have been developed in distributed information retrieval. We briefly review them below.

2.1.1 Resource Representation

Different techniques for acquiring resource descriptions require different degrees of cooperation from digital libraries. STARTS is a cooperative protocol that requires every digital library to provide an accurate resource description to the directory service upon request [6]. STARTS is a good solution in environments where cooperation can be guaranteed. However, in some environments where digital libraries may not cooperate or may have an incentive to cheat, STARTS cannot be used to acquire accurate resource descriptions. Query-based sampling is an alternative approach to acquiring resource descriptions without requiring explicit cooperation from digital libraries.
The resource description of a digital library is constructed by sampling its documents via the normal process of submitting queries and retrieving documents. Query-based sampling has been shown to acquire fairly accurate resource descriptions using a small number of queries and documents in distributed information retrieval environments [1].

The total number of documents of a digital library is one of the most important corpus statistics required by many resource selection algorithms. Capture-Recapture [12] and Sample-Resample [20] are two methods of estimating the total number of documents of an uncooperative digital library. Experimental results show that in most scenarios, Sample-Resample is more accurate and incurs lower communication costs than the Capture-Recapture method.

2.1.2 Resource Ranking and Selection

Resource selection aims at selecting a small set of resources that contain a lot of documents relevant to the information request. Resources are ranked by their likelihood to return relevant documents and top-ranked resources are selected to process the information request.

Resource selection algorithms such as CORI [1], gGlOSS [7], and Kullback-Leibler (K-L) divergence-based algorithms [24] use techniques adapted from document retrieval for resource ranking. The resource description of a digital library used by these algorithms includes a list of terms with corresponding collection term frequencies, and corpus statistics such as the total number of terms and documents in the collection. These algorithms have been shown to work well with resource descriptions provided by cooperative digital libraries or acquired using query-based sampling.

Other resource selection algorithms including ReDDE [20] and DTF (the decision-theoretic framework for resource selection) [16] rank resources by directly estimating the number of relevant documents from each resource for a given query. ReDDE relies on sampled documents obtained using query-based sampling for such estimation. DTF has three variants: DTF-rp, DTF-sample and DTF-normal.
DTF-rp estimates the number of relevant documents from a resource by assuming a linearly decreasing recall-precision function and calculating the expected precision and recall from the resource. DTF-sample uses sampled documents to estimate how relevant documents are distributed among the available resources. DTF-normal models the distribution of document scores from a resource with a normal distribution and maps document scores to probability of relevance using a function learned with user relevance feedback.

Deciding how many top-ranked resources to select ("thresholding") is a problem that is usually simplified. Most resource selection algorithms use heuristic values such as 10 and 20 for the number of selected resources.

2.1.3 Result Merging

Many result-merging algorithms have been proposed in distributed information retrieval. Various approaches can be divided into two categories: approaches based on normalizing resource-specific document scores into resource-independent document scores, and approaches based on recalculating document scores at the directory service.

The CORI merging algorithm uses a heuristic linear combination of digital library scores and document scores to normalize the scores of the documents from different digital libraries. The intuition is to favor documents from digital libraries with high scores and also to enable high-scoring documents from low-scoring digital libraries to be ranked highly. It is effective when used together with the CORI resource selection and INQUERY document retrieval algorithms in federated search using a single directory service [1]. There has been some work on using logistic regression to learn merging models to normalize document scores but relevance judgments are required for training [2]. The Semi-Supervised Learning result-merging algorithm uses the documents obtained by query-based sampling as training data to learn score normalizing functions on a query-by-query basis.
It is shown to work well with a variety of resource selection and document retrieval algorithms and is the current state of the art for result merging in distributed information retrieval [19].

Document scores can be recalculated at the directory service by downloading all the documents in the retrieval results from selected resources, indexing them, and re-ranking them using a document retrieval algorithm. Downloading documents is not necessary if all the statistics required for score recalculation can be obtained alternatively. Kirsch's algorithm [10] requires each resource to provide summary statistics for each of the retrieved documents. It allows very accurate normalized document scores to be determined without the high communication cost of downloading. The corpus statistics required for recalculating document scores could also be substituted by a reference statistics database containing all the relevant statistics for some set of documents. This method is explored in [3] for federated search using a single directory service and shown to be effective compared with using the corpus statistics provided by cooperative digital libraries.

2.2 P2P Network Architectures

As mentioned in Section 1, the activities involved for search in peer-to-peer networks include issuing queries, query routing, and retrieval. Query routing is essentially a problem of resource selection and location. Resource location in first-generation peer-to-peer networks is characterized by Napster, which used a single logical directory service, and Gnutella 0.4, which used undirected message flooding and a search horizon. The former proved easy to attack, and the latter didn't scale; both systems demonstrated the importance of robust and reliable methods of locating information in peer-to-peer networks. They also explored very different solutions: Napster was centralized and required cooperation (sharing of accurate information); Gnutella 0.4 was decentralized and required little cooperation.

Recent research provides a variety of solutions to the flaws of the Napster and Gnutella 0.4 architectures, but perhaps the most influential are hierarchical and structured P2P architectures. Structured P2P architecture associates each data item with a key and distributes keys among directory services using a Distributed Hash Table (DHT) [17, 18, 21, 22, 28].
Hierarchical P2P architecture [9, 11, 23] uses top-layer directory services to serve regions of bottom-layer digital libraries, and directory services work collectively to cover the whole network. The common characteristic of both approaches is the construction of an overlay network to organize the nodes that provide directory services (also called "lookup services" by DHT-based approaches) for efficient query routing. An important distinction is that structured P2P networks require the ability to map (via a distributed hash table) from an information need to the identity of the directory service that satisfies the need, whereas hierarchical P2P networks rely on message-passing to locate directory services. Structured P2P networks require digital libraries to cooperatively share descriptions of data items in order to generate keys and construct distributed hash tables. In contrast, hierarchical P2P networks enable directory services to automatically discover the contents of (possibly uncooperative) digital libraries, which is well-matched to networks that are dynamic, heterogeneous, or protective of intellectual property.

2.3 Text-Based Search in P2P Networks

Most of the prior research on search in peer-to-peer networks supports only simple keyword-based search. Matches between query terms and keywords of documents are used to determine how to route queries and which documents to retrieve. There has been some recent work on developing systems that adopt more sophisticated retrieval models to support text-based search (also called "content-based retrieval") in peer-to-peer networks. Examples are PlanetP using a completely decentralized P2P architecture [5], pSearch using a structured P2P architecture [22], and content-based retrieval in hierarchical P2P networks [13].

In PlanetP [5], a node uses a TF.IDF algorithm to decide which nodes to contact for information requests based on the compact summaries it collects about all other nodes' inverted indexes.
Because no special resources are dedicated to support directory services in completely decentralized P2P architectures, it is somewhat inefficient for each node to collect and store information about the contents of all other nodes, especially in dynamic P2P networks.

pSearch [22] uses the semantic vector (generated by Latent Semantic Indexing) of each document as the key to distribute the document index in a structured P2P network so that documents close in distance have similar contents. The relevance of a document to a query is determined by the similarity between their semantic vectors. To compute semantic vectors for documents and queries, global statistics such as the inverse document frequency and the basis of the semantic space need to be disseminated to each node in the network. Because global statistics can only be obtained in completely cooperative environments where each digital library shares its document and corpus statistics, this approach cannot be easily extended to uncooperative and heterogeneous environments.

There has been some prior research on content-based resource selection and document retrieval in hierarchical P2P networks of digital libraries [13]. Viewing peer-to-peer networks as a particular type of distributed information retrieval environment, content-based resource selection is extended to the case of multiple directory services in peer-to-peer environments where digital libraries cooperatively provide resource descriptions to connecting directory services. Experimental results demonstrate that content-based resource selection and document retrieval can provide more accurate and more efficient solutions to federated search in peer-to-peer networks of text-based digital libraries compared with the flooding and keyword-based approaches.

The problem of result merging in hierarchical P2P networks of uncooperative and barely-cooperative text-based digital libraries has also been studied in [15].
The Semi-Supervised Learning (SSL) result-merging algorithm is modified, and an algorithm called Score Estimation with Sample Statistics (SESS), which extends Kirsch's approach to result merging, is proposed. Experimental results show that modified SSL has satisfactory precision for top-ranked merged documents, and SESS is able to provide near-optimal performance with a small amount of cooperation from digital libraries.

3. TEXT-BASED FEDERATED SEARCH IN HIERARCHICAL P2P NETWORKS

The research described in this paper adopts a hierarchical P2P architecture because it provides a more flexible framework to incorporate various solutions to resource selection and result merging in both cooperative and uncooperative environments.

Following the terminology of prior research, we refer to text-based digital libraries as "leaf" nodes, and directory services as "hub" nodes. Each leaf node is a text database that provides functionality to process full-text queries by running a document retrieval algorithm over its index of the local document collection and generating responses. Each hub acquires and maintains necessary information about its neighboring hub and leaf nodes and uses it to provide resource selection and result merging services to peer-to-peer networks. In addition to leaf nodes and hubs, there are also nodes representing users with information requests in peer-to-peer networks. They are referred to as "client" nodes. In a hierarchical P2P network, leaf nodes and client nodes can only connect to hubs, and hubs connect with each other.

[Figure 3.1: Federated search in hierarchical P2P networks.]

Search in peer-to-peer networks relies on message-passing between nodes. A request message ("query") is generated by a client node and routed from a client node to a hub, from one hub to another, or from a hub to a leaf node. A response message ("queryhit") is generated by a leaf node and routed back along the query path in the reverse direction. Each message in the network has a time-to-live (TTL) field that determines the maximum number of times it can be relayed in the network. The TTL is decreased by 1 each time the message is routed to a node. When the TTL reaches 0, the message is no longer routed.

When a client node has an information request, it sends a query message to each of its connecting hubs. A hub that receives the query message uses its resource selection algorithm to rank and select one or more neighboring leaf nodes as well as hubs and routes the query to them if the message's TTL hasn't reached 0. A leaf node that receives the query message uses its document retrieval algorithm to generate a relevance ranking of its documents and responds with a queryhit message that includes a list of top-ranked documents. Each top-level hub (the hub that connects directly to the client node that issues the request) collects the queryhit messages and uses its result merging algorithm to merge the documents retrieved from multiple leaf nodes into a single, integrated ranked list and returns it to the client node.
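The TTL-bounded routing just described can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `Node`, `connect`, `forward_to_all`, and `route_query` are hypothetical, forwarding to every neighbor stands in for a real resource selection algorithm, and a leaf is assumed to answer a query even if the TTL reaches 0 on arrival. The return path of queryhit messages and result merging are omitted.

```python
class Node:
    """A hub or leaf node in a toy hierarchical P2P network."""

    def __init__(self, name, kind):
        self.name = name
        self.kind = kind            # "hub" or "leaf"
        self.neighbors = []


def connect(a, b):
    """Create an undirected connection between two nodes."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def forward_to_all(hub, sender):
    """Placeholder selection policy: every neighbor except the sender."""
    return [n for n in hub.neighbors if n is not sender]


def route_query(node, ttl, sender=None, select=forward_to_all, reached=None):
    """Deliver a query to `node`; return names of leaf nodes that answer."""
    if reached is None:
        reached = []
    ttl -= 1                        # TTL drops by 1 on every delivery
    if node.kind == "leaf":
        reached.append(node.name)   # the leaf would reply with a queryhit
    elif ttl > 0:                   # hubs relay only while TTL remains
        for nxt in select(node, sender):
            route_query(nxt, ttl, node, select, reached)
    return reached


# Toy topology: client -> H1, H1 -- D1, H1 -- H2, H2 -- D2
h1, h2 = Node("H1", "hub"), Node("H2", "hub")
d1, d2 = Node("D1", "leaf"), Node("D2", "leaf")
connect(h1, d1)
connect(h1, h2)
connect(h2, d2)

print(route_query(h1, ttl=2))       # the query dies at H2; only D1 answers
print(route_query(h1, ttl=3))       # H2 relays the query on to D2
```

Passing a different `select` function is where a hub's ranking and thresholding would plug in; in networks with cycles the TTL is what bounds the recursion here.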
If the client node issues the request to more than one hub, then it also needs to merge results returned by multiple top-level hubs.

Figure 3.1 illustrates federated search of text-based digital libraries in hierarchical P2P networks. The C (white) node is the client node that issues the information request, the H (black) nodes are hubs, and the D (gray) nodes are leaf nodes (digital libraries). The edges between nodes represent connections. The arrows with solid lines indicate the directions in which query messages are sent and the arrows with dashed lines indicate the directions in which queryhit messages are sent.

In the following subsections, we present in more detail the solutions to the problems of resource representation, resource ranking and selection, and result merging in both cooperative and uncooperative peer-to-peer environments.

3.1 Resource Representation

The description of a resource is a very compact summary of its content. Compared with a copy of the complete index of a collection of documents, a resource description requires much lower communication and storage costs but still provides useful information for resource selection algorithms to determine which resources are more likely to contain documents relevant to the query. As mentioned in Section 2.1.2, the resource description used by most resource selection algorithms includes a list of terms with corresponding term frequencies (collection language model), and corpus statistics such as the total number of terms and documents provided or covered by the resource. The resource here could be a single leaf node, a hub that covers multiple neighboring leaf nodes, or a "neighborhood" that includes all the nodes reachable from a hub. Although resource descriptions for different types of resources have the same format, different methods are required to acquire them, which we introduce below.

3.1.1 Resource Descriptions of Leaf Nodes

Resource descriptions of leaf nodes are used by hubs for query routing ("resource selection") among connecting leaf nodes.
In cooperative environments, each leaf node provides an accurate resource description to its connecting hubs upon request. In uncooperative environments, each hub conducts query-based sampling independently to obtain sampled documents from its connecting leaf nodes. Sampled documents from a leaf node are used to generate its collection language model. They are also used by the Sample-Resample method to estimate the total number of documents in this leaf node's collection.

3.1.2 Resource Descriptions of Hubs

The resource description of a hub is the aggregation of the resource descriptions of its connecting leaf nodes. Since hubs work collaboratively in hierarchical P2P networks, neighboring hubs can exchange their aggregate resource descriptions with each other. However, because the aggregate resource descriptions of hubs only have information for nodes within 1 hop, if they are directly used by a hub to decide which neighboring hubs to route query messages to, the routing would not be effective when the nodes with relevant documents sit beyond this horizon. Thus for effective hub selection, a hub must have information about what contents can be reached if the query message it routes to a neighboring hub may further travel multiple hops. This kind of information is referred to as the resource description of a neighborhood and is introduced in the following subsection.

3.1.3 Resource Descriptions of Neighborhoods

A neighborhood of a hub $H_i$ in the direction of its neighboring hub $H_j$ is a set of hubs that can be reached by following the path from $H_i$ to $H_j$. Figure 3.2 illustrates the concept of neighborhood. Hub $H_1$ has three neighboring hubs $H_2$, $H_3$ and $H_4$. Thus it has three neighborhoods marked by $N_{1,2}$, $N_{1,3}$ and $N_{1,4}$. The resource description of a neighborhood provides information about the contents covered by all the hubs in this neighborhood. A hub uses resource descriptions of neighborhoods to select and route queries to its neighboring hubs.

Resource descriptions of neighborhoods provide functionality similar to routing indices [4].
An entry in a routing index records the number of documents that may be found along a path for a set of topics. The key difference between resource descriptions of neighborhoods and routing indices is that resource descriptions of neighborhoods represent contents with unigram language models (terms with their frequencies). Thus by using resource descriptions of neighborhoods, there is no need for hubs and leaf nodes to cluster their documents into a set of topics and it is not necessary to restrict queries to topic keywords.

[Figure 3.2: Neighborhoods in hierarchical P2P networks.]

Similar to exponentially aggregated routing indices [4], a hub calculates the resource description of a neighborhood by aggregating the resource descriptions of all the hubs in the neighborhood decayed exponentially according to the number of hops. For example, in the resource description of a neighborhood $N_{i,j}$ (the neighborhood of $H_i$ in the direction of $H_j$), a term $t$'s exponentially aggregated term frequency is calculated as:

$$\sum_{H_k \in N_{i,j}} \frac{tf(t, H_k)}{F^{\,numhops(H_i, H_k) - 1}} \tag{1}$$

where $tf(t, H_k)$ is $t$'s term frequency in the resource description of hub $H_k$, and $F$ is the average number of hub neighbors each hub has in the network. The exponentially aggregated total number of documents in a neighborhood is calculated as:

$$\sum_{H_k \in N_{i,j}} \frac{numdocs(H_k)}{F^{\,numhops(H_i, H_k) - 1}} \tag{2}$$

The creation of resource descriptions of neighborhoods requires several iterations at each hub and different hubs can run the creation process asynchronously. A hub $H_i$ in each iteration calculates and sends to its hub neighbor $H_j$ the resource description of neighborhood $N_{j,i}$, denoted by $ND_{j,i}$, by aggregating its hub description $HD_i$ and the most recent resource descriptions of neighborhoods it receives from all of its neighboring hubs excluding $H_j$. $ND_{j,i}$ is calculated as:

$$ND_{j,i} = HD_i + \sum_{H_k \in directneighbors(H_i) \setminus H_j} \frac{ND_{i,k}}{F} \tag{3}$$

The stopping condition could be either the number of iterations reaching a predefined limit, or the difference in resource descriptions between adjacent iterations being small enough. The process of maintaining and updating resource descriptions of neighborhoods is identical to the process used for creating them.
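The iterative exchange of Eq. (3) can be sketched as follows. This is a minimal illustration, not the authors' code: it runs all hubs in lock-step rounds (the paper allows asynchronous updates), stores each description as a plain term-to-frequency dictionary, and uses a fixed iteration count as the stopping condition. The function name `neighborhood_descriptions`, the key convention `(a, b)` for the neighborhood of hub `a` in the direction of hub `b`, and the value `fanout=2.0` for $F$ are assumptions chosen for the example.

```python
def neighborhood_descriptions(hub_desc, hub_links, fanout, iterations):
    """Iterate Eq. (3) for every pair of neighboring hubs.

    hub_desc:  {hub: {term: tf}}   aggregate description HD of each hub
    hub_links: {hub: [hub, ...]}   hub-to-hub connections
    Returns {(j, i): {term: weight}}, the description of the neighborhood
    of hub j in the direction of hub i (computed by i, sent to j).
    """
    nd = {}
    for _ in range(iterations):
        new_nd = {}
        for i, neighbors in hub_links.items():
            for j in neighbors:
                desc = dict(hub_desc[i])          # start from HD_i
                for k in neighbors:
                    if k == j:
                        continue                  # exclude the recipient H_j
                    # add ND_{i,k} / F, as last received from H_k
                    for term, value in nd.get((i, k), {}).items():
                        desc[term] = desc.get(term, 0.0) + value / fanout
                new_nd[(j, i)] = desc
        nd = new_nd
    return nd


# Chain of hubs H1 -- H2 -- H3 (no cycles), with F assumed to be 2.
hub_desc = {"H1": {"a": 4}, "H2": {"a": 2}, "H3": {"a": 8, "b": 2}}
hub_links = {"H1": ["H2"], "H2": ["H1", "H3"], "H3": ["H2"]}
nd = neighborhood_descriptions(hub_desc, hub_links, fanout=2.0, iterations=2)

# Neighborhood of H1 towards H2 covers H2 (1 hop) and H3 (2 hops):
# by Eq. (1), tf("a") = 2 / F^0 + 8 / F^1
print(nd[("H1", "H2")])
```

Document counts (Eq. 2) can be aggregated by the same routine, for example by storing them under a reserved key in each description. In a network with cycles this loop overcounts, which matches the "no-op" treatment of cycles discussed in the text.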
The resource descriptions of neighborhoods could be updated when the difference between the old and the new value is significant, or periodically, or when a node disconnects from the network.

For networks that have cycles, frequencies of some terms and the number of documents may be overcounted, which will affect the accuracy of resource descriptions. How to deal with cycles in peer-to-peer networks using routing indices is discussed in detail in [4]. We could use the same solutions described in [4] for cycle avoidance or cycle detection and recovery. For simplicity, in this paper, we take the "no-op" solution, which completely ignores cycles. Experimental results show that resource selection using resource descriptions of neighborhoods generated in networks with cycles is still quite efficient and accurate.

3.2 Resource Ranking and Selection

The goal of query routing is to direct the information request to those nodes that are most likely to contain relevant documents with a minimum number of query messages. The flooding technique is guaranteed to reach nodes with relevant information contents but requires an exponential number of query messages. Randomly forwarding the request to a small subset of neighbors can significantly reduce the number of query messages but the reached nodes may not be relevant at all. To achieve both efficiency and accuracy, each hub needs to rank its neighboring leaf nodes by their likelihood to satisfy the information request and its neighboring hubs by their likelihood to reach nodes with relevant information contents, and forward the request only to top-ranked neighbors. Because the resource descriptions of leaf nodes and those of neighborhoods are not of the same magnitude, a hub handles the ranking and selection of its neighboring leaf nodes and hubs separately.

3.2.1 Leaf Node Ranking

Adapting language modeling approaches for ad-hoc information retrieval, we use the Kullback-Leibler (K-L) divergence-based method [24] for leaf node ranking.
In the language modeling framework, the K-L divergence resource selection algorithm calculates $P(L_i \mid Q)$, the conditional probability of predicting the collection of leaf node $L_i$ given the query $Q$, and uses it to rank different leaf nodes. $P(L_i \mid Q)$ is calculated as follows:

$$P(L_i \mid Q) = \frac{P(Q \mid L_i)\, P(L_i)}{P(Q)} \propto P(Q \mid L_i) \tag{4}$$

with uniform prior probability for leaf nodes;

$$P(Q \mid L_i) = \prod_{q \in Q} \frac{tf(q, L_i) + \mu\, P(q \mid G)}{numterms(L_i) + \mu} \tag{5}$$

where $tf(q, L_i)$ is the term frequency of query term $q$ in leaf node $L_i$'s resource description (collection language model), $P(q \mid G)$ is the background language model used for smoothing and $\mu$ is the smoothing parameter in Dirichlet smoothing.

3.2.2 Leaf Node Selection with Unsupervised Threshold Learning

After leaf nodes are ranked based on their $P(L_i \mid Q)$ values, the usual approach is to select the top-ranked leaf nodes up to a predetermined number. In hierarchical P2P networks, the number of leaf nodes served by individual hubs may be quite different, and different hubs may cover different content areas. In this case, it is not appropriate to use a static, query-independent and hub-independent number as the threshold for a hub to decide how many leaf nodes to select for a given query. It is desirable that hubs have the ability to learn hub-specific and query type-specific thresholds automatically.

The problem of learning a threshold to convert relevance ranking scores into a binary decision has mostly been studied in information filtering [25, 26, 27]. However, the user relevance feedback required as training data is not as easily available for federated search in peer-to-peer networks as for the task of information filtering. Our goal is to develop a technique for each hub to learn the selection threshold without supervision based on the information and functionality it already has.

Because each hub has the ability to merge the retrieval results from multiple leaf nodes into a single, integrated ranked list, as long as the result merging has reasonably good performance, we could assume that the top-ranked merged documents are relevant. If so, the distribution of the top-ranked merged documents over the leaf nodes should provide useful hints on the number of relevant documents each leaf node is likely to retrieve. This is analogous to query expansion with pseudo-relevance feedback, which treats the top-ranked documents retrieved initially as relevant documents and uses them to improve the quality of the query. The key differences are i) our approach uses the information about which top-ranked merged documents are from which leaf nodes and ignores the actual contents of these documents, and ii) the direct goal here is not to improve immediately the retrieval quality for the current query, but to learn resource selection thresholds that are specific to hubs and types of queries and improve the overall retrieval performance for a set of queries.

For leaf node selection, if a hub selects more leaf nodes than necessary, although the retrieval results will include a lot of irrelevant documents, as long as there are enough relevant documents, a reasonably good result merging algorithm can rank most relevant documents above irrelevant documents, yielding good precision at top-ranked documents. In this case, it seems that a loose threshold will almost always give good performance. However, a loose threshold leads to low efficiency and high communication costs.
Because accuracy and efficiency are equally important for search in peer-to-peer networks, the resource selection threshold must be not too loose, in order to guarantee efficiency, and not too tight, so that enough relevant documents are returned (high recall).

With the above criteria in mind, a hub uses the following procedure to decide the threshold of leaf node selection for a query:

1. Given a query, the hub uses the K-L divergence resource selection algorithm to calculate leaf node scores and sorts them in descending order;
2. The hub selects up to 100 top-ranked leaf nodes and normalizes their scores using the formula

   $$S' = \frac{S - S_{min}}{S_{max} - S_{min}} \tag{6}$$

   where $S_{max}$ is the maximum score and $S_{min}$ is the minimum score among these selected leaf nodes;
3. The hub forwards the query to the selected leaf nodes and merges the retrieval results returned by these leaf nodes;
4. The hub calculates for each selected leaf node the number of documents that are ranked among the top 50 in the merged result;
5. The hub goes down the list of leaf nodes sorted by their scores and stops at the leaf node which has the largest number of documents ranked among the top 50 in the merged results (highest recall using pseudo-relevance feedback);
6. The hub regards the normalized score of this leaf node as the threshold of its leaf node selection for the given query.

Learning thresholds for individual queries is not useful unless the same queries appear again. Thus queries need to be classified into different types and thresholds for individual queries are used to compute thresholds for different query types. Queries can be classified based on their contents or statistical properties. When the number of queries for training is small (which is desired due to its low communication cost), classifying queries by contents often leads to sparse and skewed training data for various query types. Hence in our experiments we focused on classifying queries by their statistical properties and found the average probability of the query terms in a hub's resource description to be a good feature for query classification.
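The per-query threshold procedure (steps 2 and 4 to 6) can be sketched as below, taking the leaf node scores of step 1 and the merged ranking of step 3 as inputs. This is an illustration under assumptions rather than the authors' implementation: the function names `normalize` and `learn_threshold` are hypothetical, the merged result is represented only by the originating leaf of each ranked document, ties in step 5 are broken in favor of the higher-scored leaf, and the case where all scores are equal (which Eq. (6) leaves undefined) maps every score to 1.0.

```python
from collections import Counter


def normalize(scores):
    """Min-max normalization of Eq. (6) over the selected leaf nodes."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:                         # assumption: Eq. (6) is undefined here
        return {leaf: 1.0 for leaf in scores}
    return {leaf: (s - lo) / (hi - lo) for leaf, s in scores.items()}


def learn_threshold(leaf_scores, merged_sources, max_leaves=100, top_k=50):
    """Return the normalized-score threshold learned for one query.

    leaf_scores:    {leaf: ranking score}, higher is better (step 1)
    merged_sources: leaf of origin for each merged document, best first (step 3)
    """
    # Step 2: keep up to `max_leaves` top-ranked leaves and normalize.
    ranked = sorted(leaf_scores, key=leaf_scores.get, reverse=True)[:max_leaves]
    norm = normalize({leaf: leaf_scores[leaf] for leaf in ranked})
    # Step 4: count each leaf's documents among the top `top_k` merged results.
    counts = Counter(merged_sources[:top_k])
    # Step 5: walk down the ranked list to the leaf with the largest count;
    # max() keeps the first (highest-scored) leaf when counts tie.
    stop_leaf = max(ranked, key=lambda leaf: counts[leaf])
    # Step 6: that leaf's normalized score is the threshold.
    return norm[stop_leaf]


# Toy example with log-likelihood-style scores and a top-5 cutoff.
scores = {"A": -10.0, "B": -12.0, "C": -20.0}
merged = ["B", "A", "B", "B", "C", "A"]
print(learn_threshold(scores, merged, top_k=5))
```

In the toy example leaf B contributes the most documents to the top of the merged list, so its normalized score becomes the threshold and leaf C would fall below it.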
Given a set of training queries whose average query-term probabilities fall in different ranges, the probability values from 0 to the maximum term probability in a hub's resource description are divided into 10 non-overlapping bins so that all bins contain roughly the same number of training queries. A query type is associated with each bin, so there are 10 query types in total. A query is classified into one of these 10 types based on the average probability of its terms in the hub's resource description. During the learning phase, each hub in the network learns the thresholds for a set of training queries, and the learned thresholds for queries of the same type are averaged to obtain the threshold for this query type at the hub. Given a new query, a hub determines the type of the query, ranks up to 100 leaf nodes, normalizes their scores, and uses the query type-specific threshold to select the leaf nodes whose normalized scores are no less than the threshold.

Hub Ranking and Selection

The K-L divergence resource selection algorithm used for leaf ranking is also used for hub ranking. The resource descriptions of neighborhoods are used to calculate the collection language models needed by the resource selection algorithm. For hub selection, because selecting a neighboring hub is essentially selecting a neighborhood, using a prior distribution that favors larger neighborhoods could lead to better search performance, which was indeed the case in our experiments. Thus the prior probability of a neighborhood is set to be proportional to the exponentially aggregated total number of documents in the neighborhood.
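As a rough illustration of hub ranking with a size-favoring prior, the sketch below scores each neighborhood by a Dirichlet-smoothed query likelihood multiplied by its number of documents, and keeps a fixed number of top-ranked hubs. The `Neighborhood` container, the function names, and the use of log space are assumptions made for this example. The smoothing parameter `mu` must be supplied by the caller, and the background model is assumed to assign a positive probability to every query term.

```python
import math
from typing import Dict, List, NamedTuple


class Neighborhood(NamedTuple):
    """Hypothetical container for a neighborhood's resource description."""
    tf: Dict[str, int]   # term frequencies in the neighborhood description
    num_terms: int       # total number of term occurrences
    num_docs: int        # (aggregated) number of documents, used as the prior


def log_query_likelihood(query: List[str], hood: Neighborhood,
                         background: Dict[str, float], mu: float) -> float:
    """Sum of log Dirichlet-smoothed term probabilities for the query."""
    total = 0.0
    for term in query:
        # Assumption: background[term] > 0, so the smoothed estimate is > 0.
        p = (hood.tf.get(term, 0) + mu * background[term]) / (hood.num_terms + mu)
        total += math.log(p)
    return total


def rank_hubs(query: List[str], hoods: Dict[str, Neighborhood],
              background: Dict[str, float], mu: float, top_n: int = 1) -> List[str]:
    """Rank neighboring hubs by log P(Q|N) + log numdocs(N); return the top_n."""
    scored = []
    for hub_id, hood in hoods.items():
        score = log_query_likelihood(query, hood, background, mu)
        score += math.log(hood.num_docs)  # prior proportional to neighborhood size
        scored.append((score, hub_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [hub_id for _, hub_id in scored[:top_n]]
```

Working in log space avoids numerical underflow for long queries and leaves the ranking unchanged, because the logarithm is monotonic.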
Given the query Q, the probability of predicting the neighborhood N that a neighboring hub node H represents is calculated as follows and used to rank neighboring hubs:

$$P(N \mid Q) = \frac{P(Q \mid N)\, P(N)}{P(Q)} \propto P(Q \mid N) \cdot \mathrm{numdocs}(N) \qquad (7)$$

$$P(Q \mid N) = \prod_{q \in Q} \frac{tf(q, N) + \mu\, P(q \mid G)}{\mathrm{numterms}(N) + \mu} \qquad (8)$$

where $tf(q, N)$ is the term frequency of query term q in the resource description of neighborhood N (the collection language model), $P(q \mid G)$ is the background language model used for smoothing, and $\mu$ is the smoothing parameter in Dirichlet smoothing. A fixed number of top-ranked neighboring hubs are selected. Applying unsupervised threshold learning to hub selection remains future work.

3.3 Result Merging

As described earlier, result merging takes place at each top-level hub. In cooperative environments, Kirsch's algorithm [10] is

extended for result merging in peer-to-peer networks. In addition to a list of retrieved documents, each resource is required to provide summary statistics for each of the retrieved documents, for example, document length and how often each query term matched. The corpus statistics come from the aggregation of the hub's resource description and the resource descriptions of the neighborhoods of all its neighboring hubs.

The modified Semi-Supervised Learning algorithm (modified SSL) [15] is used for result merging in uncooperative environments. Each hub along the query path contributes to result merging by providing document statistics for "overlap" documents, which are documents that appear both in the sampled documents maintained at the hub for its leaf node neighbors and in the retrieval results sent to the hub by these neighbors. Top-level hubs use the document statistics provided by collaborating hubs to recalculate document scores for the overlap documents, and pair them with the original scores returned in the retrieval results to use as training data for learning score-normalizing functions.

The main difference between result merging in cooperative and in uncooperative environments is that in cooperative environments leaf nodes provide document statistics for all the retrieved documents to top-level hubs, while in uncooperative environments hubs provide document statistics for only a subset of the retrieved documents (the "overlap" documents) to top-level hubs.

If the client node issues the request to more than one hub, then it also needs to merge the results returned by multiple top-level hubs. Because client nodes do not maintain information about the contents of other nodes and corpus statistics as hubs do in hierarchical P2P networks, they cannot use advanced result-merging algorithms. Thus only simple, but probably less effective, merging methods can be applied at client nodes. For example, results can be merged based on the document scores returned by top-level hubs ("raw score merge") or in a round-robin fashion.

4. TEST DATA

We used the P2P testbed [14], developed from the TREC WT10g web test collection [8], to evaluate the performance of federated search in hierarchical P2P networks of text-based digital libraries. The P2P testbed consists of 2,500 collections obtained by dividing the WT10g data into 11,485 collections based on document URLs and randomly selecting 2,500 of them. The total number of documents in these 2,500 collections is 1,421,088. Each collection defines a leaf node (digital library) in a hierarchical P2P network. There are 25 hubs in total in the P2P testbed, each of which covers a specific type of content. The connections between leaf nodes and hubs were determined by clustering leaf nodes into 25 clusters using a similarity-based soft clustering algorithm, associating each cluster with a hub, and connecting all the leaf nodes within a cluster to the associated hub. The connections between hubs were generated randomly. Each hub has no fewer than 1 and no more than 7 hub neighbors, and 4 hub neighbors on average. Table 4.1 summarizes some statistics for the testbed.

Table 4.1 Summary statistics for the testbed.
[Columns: min, avg, max. Rows: Number of documents for a leaf node (surviving fragment: ",505"); Number of leaf nodes for a hub (surviving fragment: ",008"); Number of hubs a leaf node connects to.]

Experiments were run on two sets of queries. The first set came from the title fields of the TREC topics used for the TREC-8 and TREC-9 Web Tracks. The standard TREC relevance assessments supplied by the U.S. National Institute of Standards and Technology were used. The second set was a set of 1,000 queries selected from the queries defined in the P2P testbed. Queries in the P2P testbed were automatically generated from WT10g data by extracting key terms from the documents in the collection. Table 4.2 shows the distribution of query lengths among the selected 1,000 queries. Table 4.3 shows the distribution of term frequencies in WT10g for all the query terms in these 1,000 queries.
Because it is expensive to obtain relevance judgments for these automatically generated queries, we used the ranked retrieval results from a single large collection as the baseline ("single collection" baseline), and measured how well federated search in the hierarchical P2P network could reproduce this baseline. The single large collection was the subset of WT10g used to define the contents of the 2,500 leaf nodes in the peer-to-peer network, and the 50 top-ranked documents retrieved using this single large collection (WT10g-subset) were treated as the relevant documents for each query. For each query, a leaf node was randomly chosen to act temporarily as a client node, issuing the query to the network and collecting the merged retrieval results for evaluation.

5. EVALUATION METHODOLOGY

A simulator was used to evaluate the performance of text-based federated search in hierarchical P2P networks. Both retrieval accuracy and query routing efficiency are used as performance measures.

5.1 Measuring Retrieval Accuracy

Retrieval accuracy was measured by both set-based and rank-based Recall and Precision. Set-based Recall and Precision are defined as follows:

$$\text{Recall} = \frac{|r|}{|A|} \qquad (9)$$

$$\text{Precision} = \frac{|r|}{|R|} \qquad (10)$$

where R is the set of documents returned by retrieval in the P2P network, A is the set of relevant documents for a query among the 100 TREC queries, or the set of (up to 50) top-ranked documents returned by retrieval using the single WT10g-subset collection for a query among the 1,000 WT10g queries, r is the intersection of R and A, and $|\cdot|$ denotes the size of a set.

Table 4.2 Distribution of query length for 1,000 queries.
[Columns: Length, Distribution.]

Table 4.3 Distribution of term frequency for 1,000 queries.
[Columns: Frequency Scale, Distribution.]

The quality of document rankings was measured using precisions

at document ranks 5, 10, 15, 20, 30, and 100. Set-based Recall and Precision focus attention on how well text-based federated search in hierarchical P2P networks returns the right documents for a query, while rank-based metrics measure directly the performance of document ranking and result merging.

5.2 Measuring Query Routing Efficiency

The efficiency of query routing was measured by the average number of query messages routed for each query in the network. The average number of query messages routed from hubs to leaf nodes ("Hub-Leaf Messages") for each query was also used to measure the efficiency of leaf node selection in some experiments.

Table 6.1 Choices of algorithms in the experiments.
Component | Algorithm
Leaf descriptions | Provided by leaf nodes in cooperative environments, OR generated by hubs using documents sampled from leaf nodes by query-based sampling in uncooperative environments
Hub descriptions | Generated by hubs by aggregating leaf descriptions
Neighborhood descriptions | Generated by hubs by aggregating hub descriptions and exponentially decayed neighborhood descriptions over several iterations
Leaf node ranking | K-L divergence resource selection algorithm using leaf descriptions
Leaf node selection | 1 of top-ranked leaf nodes, OR fixed number of top-ranked leaf nodes, OR top-ranked leaf nodes with normalized scores no less than the learned threshold (Section 3.2.2)
Hub ranking | K-L divergence resource selection algorithm using neighborhood descriptions
Hub selection | All neighboring hubs (flooding), OR 1 randomly selected neighboring hub, OR top-ranked neighboring hub
Document retrieval | K-L divergence document retrieval algorithm
Result merging at top-level hubs | Extended Kirsch's algorithm in cooperative environments, OR modified Semi-Supervised Learning in uncooperative environments (Section 3.3)
Result merging at client node | Raw score merge (Section 3.3)

6. EXPERIMENTS AND RESULTS

A series of experiments was conducted to study resource selection and result merging in both cooperative ("COOP") and uncooperative ("UNCOOP") P2P environments.
The choices of the algorithms used for resource representation, resource ranking and selection, document retrieval, and result merging are shown in Table 6.1. Table 6.2 shows the values of some parameters used in our experiments.

Unsupervised threshold learning required a set of queries for training. For each experiment that used leaf node selection with unsupervised threshold learning to run the 100 TREC queries, two runs were conducted. The first run used the first half of the 100 TREC queries for training and the second half for testing; the second run worked the other way around. The results from the two runs were averaged to get the final results. For the experiments that used leaf node selection with unsupervised threshold learning to run the 1,000 WT10g queries, the 100 TREC queries were used as training data. Unsupervised threshold learning used only queries and retrieved documents for training; the relevance judgments provided by NIST for the 100 TREC queries were not used to learn thresholds for leaf node selection.

Tables 6.3a and 6.3b show, respectively, the results of running the 100 TREC queries and the 1,000 WT10g queries for text-based federated search in a hierarchical P2P network using different methods. Both cooperative and uncooperative environments were studied. The single collection baseline, which returned the 50 top-ranked documents for each query by retrieval using the single WT10g-subset collection, is also shown in Table 6.3a for the 100 TREC queries. The following subsections present the analysis of the results from different perspectives.

6.1 Set-Based Recall/Precision vs. Precisions at Top Document Ranks

The set-based Precision figures (column 4) are much lower than one might expect because the number of relevant documents was very small (50 on average for the 100 TREC queries using relevance judgments and 50 maximum for the 1,000 WT10g queries using the single collection baseline), whereas the total number of retrieved documents was at least ten times larger for most queries in the hierarchical P2P network.
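The arithmetic behind these low set-based figures can be made concrete with a small sketch of the accuracy measures, taking set-based Recall as |r|/|A| and set-based Precision as |r|/|R|, and precision at rank k as the fraction of the top k returned documents that are relevant. The function names and the toy document identifiers are illustrative assumptions, not data from the experiments.

```python
from typing import List, Set, Tuple


def set_based_recall_precision(returned: List[str],
                               relevant: Set[str]) -> Tuple[float, float]:
    """Set-based Recall = |r|/|A| and Precision = |r|/|R|, with r = R ∩ A."""
    returned_set = set(returned)
    overlap = returned_set & relevant
    recall = len(overlap) / len(relevant) if relevant else 0.0
    precision = len(overlap) / len(returned_set) if returned_set else 0.0
    return recall, precision


def precision_at(returned: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    hits = sum(1 for doc in returned[:k] if doc in relevant)
    return hits / k


# Toy case: 50 relevant documents, 500 returned, all relevant ones ranked first.
relevant_docs = {f"d{i}" for i in range(50)}
returned_docs = [f"d{i}" for i in range(500)]

recall, precision = set_based_recall_precision(returned_docs, relevant_docs)
print(recall, precision)                               # 1.0 0.1
print(precision_at(returned_docs, relevant_docs, 10))  # 1.0
```

Even with a perfect ranking, set-based Precision cannot exceed 0.1 in this toy case, because ten times as many documents are returned as are relevant, while precision at rank 10 is 1.0.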
This demonstrates a limitation of set-based Recall and Precision metrics for this task, since users generally care only about the retrieval accuracy of top-ranked documents, but we include them as another way of comparing resource ranking and selection methods. Compared with set-based Precision, the differences between precisions at top document ranks for federated search in the hierarchical P2P network and for search using a centralized index are smaller. This implies that the result merging algorithms for both cooperative and uncooperative environments performed quite well, ranking most irrelevant documents lower than relevant documents in spite of low set-based Precision.

Table 6.2 Parameter values used in the experiments.
Parameters | Values
Initial TTL for messages | 6
Number of documents sampled from each leaf node | Up to 300
Number of resample queries used for Sample-Resample to estimate total number of documents | [missing]
Number of iterations to create neighborhood descriptions | 6
F (average number of hub neighbors each hub has) | 4
µ (Dirichlet smoothing parameter in K-L divergence resource selection) | [missing]
Number of documents retrieved from each leaf node | Up to 50

6.2 TREC Queries vs. WT10g Queries

In contrast to real queries and manual relevance judgments, the

1,000 WT10g queries were generated automatically by extracting key terms from documents, and the top-ranked documents retrieved using a single centralized index were used as relevance judgments. When this set of queries was used to evaluate the performance of text-based federated search in hierarchical P2P networks, it directly measured the ability of federated search in hierarchical P2P networks to match the results from search in a centralized environment. The strong performance indicated by high precisions at top document ranks in Table 6.3b demonstrates that federated search in the hierarchical P2P network mostly agreed with the centralized approach on which documents were most relevant. Additional evaluations on the 100 TREC queries that treated the documents in the single collection baseline as relevant documents (the same evaluation methodology as we used for the 1,000 WT10g queries) gave results very similar to those in Table 6.3b (not shown in this paper for reasons of space).

Table 6.3a Search performance evaluated on the 100 TREC queries using relevance judgments provided by NIST.
[Surviving column headers: Environment, Hub, Leaf, Set-based Recall/Precision, # Query. Numeric values are not recoverable.]
Environment | Hub | Leaf
Centralized | N/A | N/A
COOP | Flooding | Top
COOP | Random 1 | Top
COOP | Top 1 | Top
COOP | Flooding | Threshold
COOP | Random 1 | Threshold
COOP | Top 1 | Threshold
UNCOOP | Flooding | Top
UNCOOP | Random 1 | Top
UNCOOP | Top 1 | Top
UNCOOP | Flooding | Threshold
UNCOOP | Random 1 | Threshold
UNCOOP | Top 1 | Threshold

Table 6.3b Search performance evaluated on the 1,000 WT10g queries using the single collection baseline.
[Surviving column headers: Environment, Hub, Leaf, Set-based Recall/Precision, # Query. Numeric values are not recoverable.]
Environment | Hub | Leaf
COOP | Flooding | Top
COOP | Random 1 | Top
COOP | Top 1 | Top
COOP | Flooding | Threshold
COOP | Random 1 | Threshold
COOP | Top 1 | Threshold
UNCOOP | Flooding | Top
UNCOOP | Random 1 | Top
UNCOOP | Top 1 | Top
UNCOOP | Flooding | Threshold
UNCOOP | Random 1 | Threshold
UNCOOP | Top 1 | Threshold
This is an encouraging sign for federated search in peer-to-peer networks: although distributed retrieval systems are not yet better than the single collection baseline, our results show that their performance can be quite close at top-ranked documents. However, we note that Table 6.3b gives a slightly overly optimistic view of federated search quality, because in cases where federated search in the hierarchical P2P network disagreed with search using a centralized index, federated search was more likely to give a high rank to an irrelevant document that was ranked low by centralized search. Therefore, the performance difference between federated search in the hierarchical P2P network and search using a centralized index is expected to be slightly larger if we evaluate them using real relevance judgments, as shown in Table 6.3a.

In order to claim that a peer-to-peer system able to reproduce the single collection baseline quite well is an effective system for federated search, we need to rely on the assumption that search using a centralized index is effective in satisfying users' information needs, which is not necessarily the case. For this reason, we were concerned with whether automatically generated queries would behave similarly to real queries, and whether the conclusions drawn using the single collection baseline for evaluation would still be valid with real relevance judgments. If we compare the figures in Table 6.3a with those in Table 6.3b, we can see that although the absolute values were quite different, the relative performance difference of
