Relevance Feedback Document Retrieval using Non-Relevant Documents

TAKASHI ONODA, HIROSHI MURATA and SEIJI YAMADA

This paper reports a new document retrieval method using non-relevant documents. From a large set of documents, we need to find documents that match the user's interests in as few iterations of human checking as possible. In each iteration, a comparatively small batch of documents is evaluated for relevance to the user's interests. This method is called relevance feedback. Relevance feedback needs a set of relevant and non-relevant documents. However, the initially presented documents which are checked by a user do not always include relevant documents. Accordingly, we propose a feedback method that uses information from non-relevant documents only. We name this method non-relevance feedback. The non-relevance feedback selects a set of documents based on the learning result of a One-Class SVM. Results from experiments show that this method is able to retrieve a relevant document from a set of non-relevant documents effectively.

1. Introduction

As Internet technology progresses, the information accessible to end users is increasing explosively. In this situation, we can now easily access a huge document database through the web. However, it is hard for a user to retrieve relevant documents from which he/she can obtain useful information, and many studies have been done in information retrieval, especially document retrieval 1). Active work on such document retrieval has been reported in TREC (Text REtrieval Conference) 2) for English documents, and in IREX (Information Retrieval and Extraction Exercise) 3) and NTCIR (NII-NACSIS Test Collection for Information Retrieval Systems) 4) for Japanese documents. In general, since a user can hardly describe a precise query on the first trial, an interactive approach is taken that modifies the query vector based on the user's evaluation of documents in a list of retrieved documents. This method is called relevance feedback 6) and is widely used in information retrieval systems.
In this method, a user directly evaluates whether each document in a list of retrieved documents is relevant or non-relevant, and the system modifies the query vector using the user's evaluation. In another approach, relevant and irrelevant document vectors are considered as positive and negative examples, and relevance feedback is transposed into a binary classification problem 7). For the binary classification problem, Support Vector Machines (SVMs) have shown excellent ability, and some studies have applied SVMs to text classification problems 8) and information retrieval problems 9). Recently, we proposed a relevance feedback framework with SVM-based active learning and showed the usefulness of the proposed method experimentally 10).

(Affiliations: Central Research Institute of Electric Power Industry; National Institute of Informatics)

Fig. 1 Image of a problem in relevance feedback document retrieval (initial search with an input query; feedback search and ranking; the user checks relevant or non-relevant; the top N ranked documents are displayed): The gray arrow parts are performed iteratively to retrieve useful documents for the user. This iteration is called a feedback iteration in the information retrieval research area. But if the user's evaluation contains only non-relevant documents, ordinary relevance feedback methods cannot feed back information useful for retrieval.

The initially retrieved documents, which are displayed to a user, sometimes do not include relevant documents. In this case, almost all relevance feedback document retrieval systems hardly work, because the systems need both relevant and non-relevant documents to construct a binary classification problem (see Figure 1). Meanwhile, the machine learning research field has methods which can deal with one-class classification problems. In the above document retrieval case,
we can use only the information of non-relevant documents. Therefore, we consider this retrieval situation to be the same as a one-class classification problem. In this paper, we propose a framework for interactive document retrieval using only non-relevant documents' information. We call this interactive document retrieval non-relevance feedback document retrieval, because only non-relevant documents' information can be used. Our proposed non-relevance feedback document retrieval is based on the One-Class Support Vector Machine (One-Class SVM) 11). The One-Class SVM can generate a discriminant hyperplane that separates off the non-relevant documents which have been evaluated by a user. Our proposed method can then display documents which may be relevant for the user, using this discriminant hyperplane.

In the remaining parts of this paper, we briefly explain the One-Class SVM algorithm in the next section, and propose our document retrieval method based on the One-Class SVM in the third section. In the fourth section, in order to evaluate the effectiveness of our approach, we report experiments using the TREC data set of the Los Angeles Times and discuss the experimental results. Finally, we conclude our work and discuss future work in the fifth section.

2. One-Class SVM

Schölkopf et al. suggested a method of adapting the SVM methodology to the one-class classification problem. Essentially, after transforming the features via a kernel, they treat the origin as the only member of the second class. Then, using relaxation parameters, they separate the image of the one class from the origin, and the standard two-class SVM techniques are employed. The One-Class SVM 11) returns a function f that takes the value +1 in a small region capturing most of the training data points, and -1 elsewhere. The algorithm can be summarized as mapping the data into a feature space H using an appropriate kernel function, and then trying to separate the mapped vectors from the origin with maximum margin (see Figure 2).
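As a small numpy illustration (not from the paper) of why "separating from the origin" is meaningful: with a Gaussian kernel k(x, y) = exp(-||x - y||^2 / c), every mapped point has unit norm in feature space, so the data lie on the unit sphere and the origin always sits strictly away from them.

```python
import numpy as np

# Gaussian kernel k(x, y) = exp(-||x - y||^2 / c), with width parameter c.
def gaussian_kernel(x, y, c=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / c)

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))  # five arbitrary points in R^3

# ||Phi(x)||^2 = k(x, x) = exp(0) = 1 for every x, so all mapped points
# lie on the unit sphere in feature space; the origin is therefore never
# among the data, and separating the data from it is always possible.
norms_sq = [float(gaussian_kernel(x, x)) for x in points]
print(norms_sq)  # -> [1.0, 1.0, 1.0, 1.0, 1.0]
```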
Let the training data be

x_1, ..., x_l  (1)

belonging to one class X, where X is a compact subset of R^N and l is the number of observations. Let Φ : X → H be a kernel map which transforms the training examples into the feature space.

Fig. 2 One-Class SVM classifier: the origin is the only original member of the second class (regions labeled +1 and -1).

The dot product in the image of Φ can be computed by evaluating some simple kernel

k(x, y) = (Φ(x) · Φ(y)),  (2)

such as the Gaussian kernel

k(x, y) = exp(-||x - y||^2 / c).  (3)

The strategy is to map the data into the feature space corresponding to the kernel, and to separate them from the origin with maximum margin. To separate the data set from the origin, one needs to solve the following quadratic program:

min_{w ∈ H, ξ ∈ R^l, ρ ∈ R}  (1/2)||w||^2 + (1/(νl)) Σ_i ξ_i - ρ
subject to  (w · Φ(x_i)) ≥ ρ - ξ_i,  ξ_i ≥ 0.  (4)

Here, ν ∈ (0, 1) is an upper bound on the fraction of outliers and a lower bound on the fraction of support vectors. Since nonzero slack variables ξ_i are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

f(x) = sgn((w · Φ(x)) - ρ)  (5)

will be positive for most examples x_i contained in the training set, while the SV-type regularization term ||w|| will still be small. The actual trade-off between these two goals is controlled by ν. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on in feature space. Using multipliers α_i, β_i ≥ 0, we introduce a Lagrangian

L(w, ξ, ρ, α, β) = (1/2)||w||^2 + (1/(νl)) Σ_i ξ_i - ρ - Σ_i α_i ((w · Φ(x_i)) - ρ + ξ_i) - Σ_i β_i ξ_i  (6)

and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding
w = Σ_i α_i Φ(x_i),  (7)

α_i = 1/(νl) - β_i ≤ 1/(νl),  Σ_i α_i = 1.  (8)

In Eqn. (7), all patterns {x_i : i ∈ [l], α_i > 0} are called support vectors. Using Eqn. (2), the SV expansion transforms the decision function Eqn. (5) into

f(x) = sgn(Σ_i α_i k(x_i, x) - ρ).  (9)

Substituting Eqn. (7) and Eqn. (8) into Eqn. (6), we obtain the dual problem:

min_α  (1/2) Σ_{i,j} α_i α_j k(x_i, x_j)
subject to  0 ≤ α_i ≤ 1/(νl),  Σ_i α_i = 1.  (10)

One can show that at the optimum, the two inequality constraints of Eqn. (4) become equalities if α_i and β_i are nonzero, i.e. if 0 < α_i < 1/(νl). Therefore, we can recover ρ by exploiting the fact that for any such α_i, the corresponding pattern x_i satisfies

ρ = (w · Φ(x_i)) = Σ_j α_j k(x_j, x_i).  (11)

Note that if ν approaches 0, the upper bounds on the Lagrange multipliers tend to infinity, i.e. the upper-bound constraint α_i ≤ 1/(νl) in Eqn. (10) becomes void. The problem then resembles the corresponding hard-margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function Eqn. (4). It is still a feasible problem, since we have placed no restriction on ρ, so ρ can become a large negative number in order to satisfy Eqn. (4). If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σ_i α_i ≤ 1 instead of the corresponding equality constraint in Eqn. (10), and the multipliers α_i could have diverged.

In our research we used LIBSVM, an integrated tool for support vector classification and regression which can handle the One-Class SVM using Schölkopf et al.'s algorithm. LIBSVM is available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

3. Non-relevance Feedback Document Retrieval

In this section, we describe our proposed method of document retrieval based on non-relevant documents using the One-Class SVM. The initially retrieved documents, which are displayed to a user, sometimes do not include relevant documents. In this case, almost all relevance feedback document retrieval systems do not contribute to efficient document retrieval, because the systems need both relevant and non-relevant documents to construct a binary classification problem (see Figure 1).
The One-Class SVM can generate a discriminant hyperplane for one class using only that class's training data. Consequently, we propose to apply the One-Class SVM in a non-relevance feedback document retrieval method. The retrieval steps of the proposed method are as follows:

Step 1: Preparation of documents for the first feedback. A conventional information retrieval system based on the vector space model displays the top N ranked documents for a request query to the user. In our method, the top N ranked documents are selected using the cosine distance between the request query vector and each document vector for the first feedback iteration.

Step 2: Judgment of documents. The user then classifies these N documents into relevant or non-relevant. If the user labels all N documents non-relevant, the documents are labeled +1 and we go to the next step. If the user classifies the N documents into relevant and non-relevant documents, the relevant documents are labeled +1 and the non-relevant documents are labeled -1, and then our previously proposed relevance feedback method is adopted 10).

Step 3: Determination of the non-relevant documents area based on non-relevant documents. The discriminant hyperplane enclosing the non-relevant documents area is generated by the One-Class SVM. In order to generate the hyperplane, the One-Class SVM learns from the labeled non-relevant documents which were evaluated in the previous step (see Figure 3).

Step 4: Classification of all documents and selection of retrieved documents. The One-Class SVM learned in the previous step can classify all documents as non-relevant or "not non-relevant". The documents which are discriminated as not non-relevant
Fig. 3 Generation of a hyperplane to discriminate the non-relevant documents area: circles denote documents which were checked as non-relevant by a user. The solid line denotes the discriminant hyperplane.

Fig. 4 Non-checked documents mapped into the feature space: boxes denote non-checked documents mapped into the feature space, and circles denote checked documents mapped into the feature space. Gray boxes denote the documents displayed to the user in the next iteration; these documents are in the not non-relevant document area and near the discriminant hyperplane.

area are newly selected. From the selected documents, the top N ranked documents, ranked in order of their distance from the non-relevant documents area, are shown to the user as the document retrieval results of the system (see Figure 4). Then the process returns to Step 2.

The key feature of our One-Class SVM based non-relevance feedback document retrieval is the selection of the documents displayed to the user in Step 4. Our proposed method selects the documents which are discriminated as not non-relevant and which are near the discriminant hyperplane between the non-relevant and not non-relevant documents. Generally, if a system receives opposite (negative) information from a user, it should select information that is far from the opposite information's area for display to the user. However, in our case, the non-relevant documents classified by the user contain the user's request query vector. Therefore, if we select documents which are far from the non-relevant documents area, those documents may not match the request query of the user. For our selected documents (see Figure 4), the probability of being relevant for the user is expected to be high, because the documents are not non-relevant and may still match the query vector of the user.

4. Experiments

4.1 Experimental setting

We made experiments to evaluate the utility of our interactive document retrieval based on non-relevant documents using the One-Class SVM described in section 3.
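The Step 1-4 procedure of section 3 can be sketched as follows. This is a minimal illustration assuming scikit-learn's TfidfVectorizer and OneClassSVM; the corpus, query, N, and the user judgments are toy stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import OneClassSVM

# Toy corpus, query and judgments; all illustrative stand-ins.
docs = [
    "africa civilian death toll rises in conflict",
    "police report death of suspect in custody",
    "mental illness drug trial shows promise",
    "stock markets rally on earnings news",
    "new drug approved for mental health care",
    "football season opens with record crowds",
]
query = "mental ill drug"
N = 2

vec = TfidfVectorizer()
D = vec.fit_transform(docs)              # TFIDF document vectors
q = vec.transform([query])

# Step 1: show the top-N documents by cosine similarity to the query.
sims = cosine_similarity(q, D).ravel()
shown = np.argsort(-sims)[:N]

# Step 2: suppose the user judges all N shown documents non-relevant.
non_relevant = D[shown].toarray()

# Step 3: learn the non-relevant documents area with a One-Class SVM
# (linear kernel and small nu, i.e. hard margin, as in section 4).
ocsvm = OneClassSVM(kernel="linear", nu=0.01).fit(non_relevant)

# Step 4: among documents classified "not non-relevant" (f(x) = -1),
# display those closest to the hyperplane (least negative score).
scores = ocsvm.decision_function(D.toarray()).ravel()
candidates = [i for i in range(len(docs)) if i not in shown and scores[i] < 0]
next_shown = sorted(candidates, key=lambda i: -scores[i])[:N]
print(shown, next_shown)
```

Ranking by the least negative decision value implements the paper's choice of documents that are outside the non-relevant area yet near its boundary, rather than as far from it as possible.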
The document data set we used is the set of articles from the Los Angeles Times which is widely used in the document retrieval conference TREC 2). The data set has about 130 thousand articles, and the average number of words per article is 526. This data set includes not only queries but also the relevant documents for each query, so we used these queries in our experiments. Our experiments used three topics, shown in Table 1. These topics have no relevant documents in the top 20 documents ranked by cosine distance between the query vector and the document vectors.

Table 1 Topics, query words, and the number of relevant documents in the Los Angeles Times data set used for the experiments

topic | query words | # of relevant doc.
306 | Africa, civilian, death | 34
343 | police, death | 88
383 | mental, ill, drug | 55

In our experiments, the number N of displayed documents presented in Step 1 of section 3 was set to 10 or 20. We used TFIDF 1), one of the most popular methods in information retrieval, to generate document feature vectors, and we adopted the TFIDF equation of reference 12). We used the linear kernel for One-Class SVM learning, and found the discriminant function of the One-Class SVM classifier in the feature space. The vector space model of documents is a high-dimensional space, and the number of documents evaluated by a user is small. Therefore, we do not need to use the kernel trick, and the parameter ν (see section 2) is set to an adequately small value (ν = 0.01). A small ν means a hard margin in the One-Class SVM.

For comparison with our approach, two information retrieval methods were used. The first is an information retrieval method that does not use feedback, namely documents are retrieved using the ranking of the vector space model (VSM). The second
is an information retrieval method using conventional Rocchio-based relevance feedback 6), which is widely used in information retrieval research. The Rocchio-based relevance feedback modifies a query vector Q_i according to the user's evaluation using the following equation:

Q_{i+1} = Q_i + α Σ_{x ∈ R_r^i} x - β Σ_{x ∈ R_n^i} x,  (12)

where R_r^i is the set of documents which were evaluated as relevant by the user at the i-th feedback, and R_n^i is the set of documents which were evaluated as non-relevant at that feedback. α and β are weights for relevant and non-relevant documents, respectively. In this experiment, we set α = 1.0 and β = 0.5, values which are known experimentally to be adequate.

4.2 Experimental results

Here, we describe the relationship between the performance of the proposed method and the number of feedback iterations. Table 2 gives the number of retrieved relevant documents at each feedback iteration. At each feedback iteration, the system displays the ten highest ranked not non-relevant documents near the discriminant hyperplane for our proposed method. We also show the documents retrieved by the Rocchio-based method at each feedback iteration for comparison in Table 2.

Table 2 The number of retrieved relevant documents at each iteration (the number of displayed documents is 10 at each iteration)

topic 306
# of iterations | Proposed method | VSM | Rocchio
1 | 1 | 0 | 0
2 | 0 | 0 | -
3 | 1 | 0 | -
4 | 0 | - | -

topic 343
# of iterations | Proposed method | VSM | Rocchio
1 | 0 | 0 | 0
2 | 1 | 0 | 0
3 | 0 | 0 | -
4 | 0 | 0 | -
5 | 0 | 0 | -

topic 383
# of iterations | Proposed method | VSM | Rocchio
1 | 0 | 0 | 0
2 | 1 | 0 | 0
3 | 0 | 0 | -
4 | 1 | 0 | -

We can see from this table that our non-relevance feedback approach gives higher performance in terms of the number of iterations needed to retrieve a relevant document. On the other hand, the Rocchio-based feedback method cannot find a relevant document in any case, and the vector space model without feedback is better than Rocchio-based feedback. In short, we believe that the proposed method can perform effective document retrieval using only non-relevant documents, and that the Rocchio-based feedback method does not work well when the system receives only non-relevant documents' information.

Table 3 gives the number of retrieved relevant documents at each feedback iteration when the system displays the twenty highest ranked not non-relevant documents near the discriminant hyperplane for our proposed method. We also show the documents retrieved by the Rocchio-based method at each feedback iteration for comparison in Table 3.

Table 3 The number of retrieved relevant documents at each iteration (the number of displayed documents is 20 at each iteration)

topic 306
# of iterations | Proposed method | VSM | Rocchio
1 | 1 | 1 | 0
2 | 0 | - | -
3 | 0 | - | -
4 | 0 | - | -

topic 343
# of iterations | Proposed method | VSM | Rocchio
1 | 1 | 0 | 0
2 | 0 | 0 | -
3 | 0 | 0 | -
4 | 1 | 0 | -

topic 383
# of iterations | Proposed method | VSM | Rocchio
1 | 1 | 0 | 0
2 | 1 | 0 | -
3 | 0 | - | -
4 | 0 | - | -

We can observe from this table that our non-relevance feedback approach gives higher performance in terms of the number of iterations needed to retrieve relevant documents, with the same experimental tendencies as Table 2 for the Rocchio-based method and the VSM.

In Table 2, a user has already seen twenty documents by the first iteration: before the first iteration, the user had to see ten documents, which are the results retrieved using the cosine distance between the query vector and the document vectors in the VSM.
In Table 3, the user has already seen forty documents by the first iteration: before the first iteration, the user had to see twenty documents to evaluate, which are the results retrieved using the cosine distance between the query vector and the document vectors in the VSM. Comparing the experimental results of Table 2 with those of Table 3, we observe that a small number of displayed documents gives more effective document retrieval performance than a large number. In Table 2, the user had to see thirty documents before finding the first relevant document for topics 343 and 383. In Table 3, the user had to see forty documents before finding the first relevant document for topics 343 and 383. Therefore, we believe that early non-relevance feedback is useful for interactive document retrieval.

5. Conclusion

In this paper, we proposed non-relevance feedback document retrieval based on the One-Class SVM, using only documents that are non-relevant for a user. In our non-relevance feedback document retrieval, the system uses only non-relevant documents' information. The One-Class SVM can generate a discriminant hyperplane from observed one-class information, so our proposed method adopted the One-Class SVM for non-relevance feedback document retrieval. This paper compared our method with a conventional relevance feedback method and a vector space model without feedback. Experimental results on a set of articles from the Los Angeles Times showed that the proposed method gave consistently better performance than the compared methods. Therefore, we believe that our proposed One-Class SVM based approach is very useful for document retrieval when only non-relevant documents' information is available. If our proposed non-relevance feedback document retrieval method finds and displays relevant documents for a user, the system will immediately switch from our non-relevance feedback method to an ordinary relevance feedback document retrieval method.
Considering this procedure, one may think that there are few chances to use our non-relevance feedback method in interactive document retrieval. However, we expect that our non-relevance feedback method has many chances in interactive document retrieval, because it is very hard in interactive document retrieval to find relevant documents using a few keywords in a large database. For instance, consider a case where you want to search for patent information related to your research. First, you input a few keywords to find the patent information you are interested in, and then you see far too many patent documents. Can you easily find the information you are interested in among them? We think it is very difficult to find interesting information even in the top 100 documents. Therefore, we believe that our proposed non-relevance feedback method is useful for effective interactive document retrieval.

This paper proposed that the system should display, at each feedback iteration, the documents which are in the not non-relevant documents area and near the discriminant hyperplane of the One-Class SVM. However, we have not yet discussed theoretically how this selection of documents influences both effective learning and retrieval performance. This point is our future work.

References

1) Yates, R. B. and Neto, B. R.: Modern Information Retrieval, Addison Wesley (1999).
2) TREC Web page: http://trec.nist.gov/.
3) IREX: http://cs.nyu.edu/cs/projects/proteus/irex/.
4) NTCIR: http://www.rd.nacsis.ac.jp/~ntcadm/.
5) Salton, G. and McGill, J.: Introduction to Modern Information Retrieval, McGraw-Hill (1983).
6) Salton, G. (ed.): Relevance feedback in information retrieval, Englewood Cliffs, N.J.: Prentice Hall, pp. 313-323 (1971).
7) Okabe, M. and Yamada, S.: Interactive Document Retrieval with Relational Learning, Proceedings of the 16th ACM Symposium on Applied Computing, pp. 27-31 (2001).
8) Tong, S. and Koller, D.: Support Vector Machine Active Learning with Applications to Text Classification, Journal of Machine Learning Research, Vol. 2, pp. 45-66 (2001).
9) Drucker, H., Shahrary, B.
and Gibbon, D. C.: Relevance Feedback using Support Vector Machines, Proceedings of the Eighteenth International Conference on Machine Learning, pp. 122-129 (2001).
10) Onoda, T., Murata, H. and Yamada, S.: Relevance Feedback with Active Learning for Document Retrieval, Proc. of IJCNN2003, pp. 1757-1762 (2003).
11) Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. and Williamson, R.: Estimating the Support of a High-Dimensional Distribution, Technical Report MSR-TR-99-87, Microsoft Research, One Microsoft Way, Redmond, WA 98052 (1999).
12) Schapire, R., Singer, Y. and Singhal, A.: Boosting and Rocchio Applied to Text Filtering, Proceedings of the Twenty-First Annual International ACM SIGIR, pp. 215-223 (1998).