Document Filtering Method Using Non-Relevant Information Profile


Document Filtering Method Using Non-Relevant Information Profile

Keiichiro Hoashi, Kazunori Matsumoto, Naomi Inoue, Kazuo Hashimoto
KDD R&D Laboratories, Inc.
2-1-15 Ohara, Kamifukuoka, Saitama 356-8502, JAPAN
+81-492-78-7332
{hoashi, matsu, inoue, kh}@kddlabs.co.jp

Abstract

Document filtering is the task of retrieving documents relevant to a user's profile from a flow of documents. Generally, filtering systems calculate the similarity between the profile and each incoming document, and retrieve documents whose similarity exceeds a threshold. However, many systems set a relatively high threshold to reduce the retrieval of non-relevant documents, which causes many relevant documents to be ignored. In this paper, we propose the use of a non-relevant information profile to reduce the mistaken retrieval of non-relevant documents. Experimental results show that this filter successfully rejects a substantial number of non-relevant documents, resulting in an improvement of filtering performance.

1 Introduction

Document filtering is a task which monitors a flow of incoming documents and selects those which the system regards as relevant to the user's interest. Many document filtering systems use a similarity-based method to retrieve documents. The user's interest is expressed within the system as a profile. The similarity between the profile and each incoming document is calculated, and documents with similarities higher than a preset threshold are retrieved. Retrieved documents are sent to the user, who returns relevance feedback to the system. This feedback information is used to update the profile for the upcoming flow of new documents. Due to its similarity to the traditional information retrieval (IR) task, many techniques developed in IR are applied to document filtering systems.
[SIGIR 2000, 7/00, Athens, Greece. © 2000 ACM 1-58113-226-3/00/0007 $5.00]

For example, profiles and incoming documents are usually indexed by methods used in IR, such as the vector space model. Therefore, the profile-document similarity calculation method is virtually the same as the algorithms used for calculating query-document similarity in IR. Furthermore, query expansion (QE) is often applied to make use of relevance feedback information in profile updating.

Numerous document filtering systems have been reported in the Filtering Track [3] of recent TREC conferences. One of the three subtasks of the Filtering Track is the adaptive filtering task, in which systems start with only the original profile, which is used to build a text classification rule. The adaptive filtering task is considered to be the task which best reflects practical filtering situations, but it is also the most difficult task in the Filtering Track. Due to the difficulty of this task, many systems become "conservative" as the document flow proceeds, i.e., they tend to set a high threshold to avoid mistaken retrieval of non-relevant documents. In other words, a threshold which retrieves a sufficient number of relevant documents results in the excessive retrieval of non-relevant documents. This suggests that the simple method of comparing profile-document similarity to a threshold may not be effective, especially when the system is expected to retrieve as many relevant documents as possible. In this paper, a novel filtering method is proposed.
The proposed method uses a profile which expresses information about non-relevant documents retrieved during the filtering process. This non-relevant information profile can reduce the number of mistakenly retrieved documents, so that the system can retrieve more of the relevant documents which would otherwise be ignored due to a conservative similarity threshold. In Section 2, we describe existing filtering methods, focusing mainly on profile updating, and present problems of these methods through preliminary experiments. In Section 3, we explain the non-relevant information profile and evaluate its performance through experiments. In Section 4, we describe another new method which applies pseudo feedback to increase the feedback information available to the non-relevant information profile. We conclude this paper in Section 5.

2 Problems of existing filtering methods

As described in the previous section, the filtering method in which the system retrieves documents based on the similarity between the profile and each incoming document is suspected to have problems. However, most research has focused on aspects such as profile updating to improve filtering performance. In this section, we explain two existing profile updating methods. We also describe a preliminary experiment on these filtering methods, and analyze its results to clarify their problems.

2.1 Existing profile updating methods

2.1.1 Rocchio's algorithm

One of the most effective and widely applied algorithms for relevance feedback and query expansion is Rocchio's algorithm [5], which was developed in the mid-1960s. Developed for the vector space model, this algorithm is based on the idea that, if the relevance of documents for a query is known, an optimal query vector will maximize the average query-document similarity for relevant documents while simultaneously minimizing query-document similarity for non-relevant documents. Generally, the query expansion method based on Rocchio's algorithm is expressed by the following formula:

q_new = α·q + (β/R)·Σ_{d∈D_R} d − (γ/N)·Σ_{d∈D_N} d   (1)

where q is the original query vector, D_R is the relevant document set, D_N is the non-relevant document set, R is the number of documents in the relevant document set, N is the number of documents in the non-relevant document set, and α, β, γ are parameters. For the use of Rocchio's algorithm in profile updating, we referred to the method described in [8]. In this method, only positive documents (i.e., selected relevant documents) are used for profile updating. The coefficient for positive documents is fixed to 0.1 (meaning the parameters in Formula (1) are set as α = 1, β = 0.1, γ = 0). The profile is updated with these parameters after every n selected documents. In the experiment described below, we set n to 2.
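As a minimal sketch, the Rocchio update with the parameter settings above (α = 1, β = 0.1, γ = 0, so that only positive documents move the profile) might look as follows; the sparse dict representation and the function name are our assumptions, not from the paper:

```python
# Sketch of Rocchio-style profile updating (Formula (1)).
# Vectors are sparse {term: weight} dicts; with gamma = 0, non-relevant
# documents are ignored, as in the profile updating method described above.

def rocchio_update(profile, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.1, gamma=0.0):
    """Return a new profile combining the old profile with the centroids
    of the relevant and non-relevant document sets."""
    updated = {t: alpha * w for t, w in profile.items()}
    if relevant_docs:
        for doc in relevant_docs:
            for t, w in doc.items():
                updated[t] = updated.get(t, 0.0) + beta * w / len(relevant_docs)
    if gamma and nonrelevant_docs:
        for doc in nonrelevant_docs:
            for t, w in doc.items():
                updated[t] = updated.get(t, 0.0) - gamma * w / len(nonrelevant_docs)
    return updated
```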
2.1.2 Word contribution

We have also evaluated a profile updating method based on word contribution (WC) [2], a measure which expresses the influence of a word on query-document similarity. We describe the WC-based QE method and its application to profile updating in this section.

WC-based QE. Word contribution is defined by the following formula:

Cont(w, q, d) = Sim(q, d) − Sim(q'(w), d'(w))   (2)

where Cont(w, q, d) is the contribution of the word w to the similarity between query q and document d, Sim(q, d) is the similarity between q and d, q'(w) is query q excluding word w, and d'(w) is document d excluding word w. In other words, the contribution of word w is the difference between the similarity of q and d, and the similarity of q and d when word w is assumed to be nonexistent in both. Therefore, some words have positive contribution and some have negative contribution: words with positive contribution raise similarity, and words with negative contribution lower it. Analysis of WC [2] shows that few words have either highly positive or highly negative contribution, and that most words have contribution near zero. This means that most words do not have a significant influence on query-document similarity. As is obvious from the definition of word contribution, words with highly positive contribution are words which co-occur in the query and the document. Such words can be considered informative indicators of document relevance to the query. On the contrary, words with highly negative contribution can be considered words which discriminate relevant documents from the other, non-relevant documents contained in the data collection. Experiments reported in [2] show that using such words with highly negative contribution for query expansion achieved higher performance than the Rocchio-based query expansion method.
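Formula (2) can be sketched directly; here we assume cosine similarity over sparse term-weight dicts (the paper does not fix the similarity function in this formula), and the function names are ours:

```python
import math

# Sketch of word contribution (Formula (2)): drop a word from both the
# query and the document and measure the change in cosine similarity.

def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def contribution(word, q, d):
    """Cont(w, q, d): positive values mean the word raises similarity,
    negative values mean it lowers similarity."""
    q_without = {t: w for t, w in q.items() if t != word}
    d_without = {t: w for t, w in d.items() if t != word}
    return cosine(q, d) - cosine(q_without, d_without)
```

For example, a word occurring in both the query and the document gets a positive contribution, while a query-only word gets a negative one.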
WC-based profile updating. In the QE method described above, words used for query expansion were extracted only from relevant documents. In the profile updating method based on WC [1], information from all selected documents is used, regardless of their relevance to the profile. First, the word contribution of every word in the selected document is calculated. From each selected document d, the N words with the lowest contribution are extracted. Next, a score for each extracted word w is calculated by the following formula:

Score(w) = wgt × Cont(w, p, d)   (3)

where wgt is a parameter with a negative value (since the contribution of each extracted word is also negative), and Cont(w, p, d) is the WC of word w with respect to the similarity of profile p and document d. In this procedure, the calculated score is regarded as the TF (term frequency) element of the word. Finally, all extracted words and their weights are added to the profile, unless the calculated weight of the word is negative.
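The extraction and scoring step just described might be sketched as follows; the parameter defaults, helper names, and the choice of N are our assumptions for illustration:

```python
# Sketch of the WC-based update step (Formula (3)): from a selected
# document, extract the N words with the lowest contribution, score them
# with a negative parameter wgt (so score = wgt * Cont > 0), and add or
# subtract the score depending on the document's relevance.
# `contribution(w, p, d)` is assumed to implement Formula (2).

def wc_profile_update(profile, doc, relevant, contribution,
                      n_words=10, wgt_rel=-200.0, wgt_nrel=-800.0):
    wgt = wgt_rel if relevant else wgt_nrel
    # words with the lowest (most negative) contribution
    words = sorted(doc, key=lambda w: contribution(w, profile, doc))[:n_words]
    for w in words:
        score = wgt * contribution(w, profile, doc)
        if score < 0:
            continue  # negatively weighted words are not added
        if relevant:
            profile[w] = profile.get(w, 0.0) + score
        else:
            profile[w] = profile.get(w, 0.0) - score
    return profile
```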

A Rocchio-like algorithm is applied here to incorporate information from non-relevant documents into the profile. When the selected document d is relevant to the profile, the weight of word w is added to the element of the profile vector which represents w. When d is non-relevant, the weight is subtracted from that element. Separate parameters (wgt) are used for the calculation of Score(w) in Formula (3), depending on the relevance of d: wgt_relR is the parameter for words extracted from relevant documents, and wgt_nrelR is the parameter for words extracted from non-relevant documents. Elements of the profile vector with negative weights are not used for similarity calculation, but all weights are accumulated for profile updating on upcoming documents. Therefore, the weights of words which appear in both relevant and non-relevant documents are restrained, thus emphasizing words which appear only in relevant documents.

2.2 Evaluation of existing methods

We have conducted experiments to evaluate existing filtering methods. In this section, we briefly describe the data used for the experiments and our filtering system, and present the experimental results.

2.2.1 Experiment data

The TREC-8 Filtering Track data [7] was used for our experiments. This data set consists of articles from the Financial Times from 1992 to 1994, a total of approximately 200,000 articles. Each article is input into the filtering system in chronological order to create the document flow. Topics 351-400 are used as profiles, and the relevant document set for each topic is used to simulate relevance feedback. The vocabulary and IDF data are initially constructed from the data on TREC CD-ROMs Vol. 4 and 5, excluding the Financial Times and Congressional Record data. Both the vocabulary and IDF data are updated every 10,000 documents in the document flow.

2.2.2 System description

The filtering system used for our experiments is based on the vector space model. The weighting scheme is based on the TF*IDF weighting formulas used by the SMART system at TREC-7 [6], with minor customizations. The TF and IDF factors for our system are as follows:

weight = TF factor × IDF factor = log(1 + tf) × log(M / df)   (4)

where tf is the term's frequency in the document, df is the number of documents that contain the term, and M is the total number of documents in the data collection. We added 1 to the term frequency inside the logarithm of the TF factor because the tf value resulting from word contribution occasionally has values below 1, which would result in a negative weight.

2.2.3 Analysis

Figures 1 and 2 illustrate the similarity of documents selected by comparison to a profile, using the profile updating methods based on Rocchio and WC, respectively. The horizontal axis of each graph shows the number of documents selected during the filtering process, and the vertical axis shows the similarity of each selected document. Parameters wgt_relR and wgt_nrelR for the WC-based method were set to -200 and -800, respectively.

[Figure 1: Similarity of selected documents (Rocchio)]

[Figure 2: Similarity of selected documents (WC)]

It is clear from both Figures 1 and 2 that although relevant documents have relatively high similarity, many non-relevant documents have similarity close to that of relevant documents. The mixture of relevant and non-relevant documents can particularly be observed in low-similarity areas. Therefore, it is difficult to extract relevant documents from this area without retrieving a large number of non-relevant documents. The easy way to address this problem is to set a high similarity threshold to reject as many non-relevant documents as possible. However, it is obvious that a high threshold will also result in the rejection of a large number of relevant documents. Moreover, such a strict threshold will result in less feedback to the profile, which may affect filtering performance on upcoming documents.

3 Non-relevant information profile

In this section, we propose a filtering method using the non-relevant information profile, a profile built to reject non-relevant documents. After describing the method, we give a detailed explanation of the evaluation experiments on the proposed method, and analyze the results.

3.1 Method

To improve filtering performance without sacrificing retrieval of relevant documents, it is necessary to reduce the selection of non-relevant documents. However, the analysis of the experimental results described in the previous section shows that this is difficult when filtering is based only on the similarity between the profile and incoming documents. In order to reduce retrieval of non-relevant documents, we propose the use of a profile which expresses the features of non-relevant documents. By calculating the similarity between this non-relevant information profile and each incoming document which has passed the initial profile, and rejecting documents which have high similarity to the non-relevant information profile, it is possible to avoid selecting documents highly similar to previously retrieved non-relevant documents. By rejecting such documents, an improvement in filtering performance can be expected.
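The resulting two-threshold selection rule can be sketched as follows; `sim` stands for any profile-document similarity function, and the default thresholds reflect the settings used later in the experiments (the function and argument names are ours):

```python
# Sketch of the two-profile selection rule: a document is retrieved only
# if it is similar enough to the relevant profile p_r AND not too similar
# to the non-relevant information profile p_n.

def select(doc, p_r, p_n, sim, thres_r=0.1, thres_n=0.1):
    if sim(p_r, doc) < thres_r:
        return False   # fails the ordinary similarity threshold
    if sim(p_n, doc) >= thres_n:
        return False   # resembles past non-relevant documents: reject
    return True
```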
The process flow of filtering with the non-relevant information profile is illustrated in Figure 3, where d is the selected document, PR is the initial profile, PN is the non-relevant information profile, and Sim(p, d) is the similarity between profile p and document d. As illustrated in Figure 3, thresholds ThresR and ThresN are set for each profile. The similarity between PN and each document which has passed PR is calculated and compared to ThresN. If the similarity exceeds ThresN, the document is regarded as non-relevant and, as a result, is rejected by PN.

[Figure 3: Filtering process with non-relevant information profile]

The non-relevant information profile is built as follows. The initial values of all elements in the non-relevant information profile are set to 0. For each selected document, N words are extracted and their weights are calculated based on WC. As in the original WC-based profile updating method, the parameter wgt differs based on the relevance of the selected document. For the generation and updating of PN, wgt_relN is the parameter for words extracted from relevant documents, and wgt_nrelN is the parameter for words extracted from non-relevant documents. To update the non-relevant information profile, the weights of words extracted from non-relevant documents are added to, and the weights of words extracted from relevant documents are subtracted from, the corresponding element of the profile vector. This is the opposite of the updating of the initial profile, where the weights of words extracted from relevant documents are added to the corresponding element and the weights of words extracted from non-relevant documents are subtracted. In addition to the updating of the non-relevant information profile, the initial profile PR is also updated by the method described in Section 2.1.2.
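The mirror-image update of PN described above can be sketched as follows; `scored_words` stands for the WC-based (word, score) pairs from Formula (3), and the function name is our assumption:

```python
# Sketch of updating the non-relevant information profile PN: the opposite
# of the ordinary profile update. Weights of words extracted from
# non-relevant documents are added; weights of words extracted from
# relevant documents are subtracted.

def update_nonrel_profile(p_n, scored_words, doc_is_relevant):
    for word, score in scored_words:
        if doc_is_relevant:
            p_n[word] = p_n.get(word, 0.0) - score  # suppress relevant-doc words
        else:
            p_n[word] = p_n.get(word, 0.0) + score  # reinforce non-relevant-doc words
    return p_n
```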
3.2 Experiment

We have conducted experiments to evaluate the use of the non-relevant information profile. Details of these experiments are described in this section.

3.2.1 Evaluation measures

Since recall and precision are not suitable for the evaluation of document filtering, we calculated the scaled utility [3] of each profile, and averaged the scaled utility of all profiles for evaluation. We explain utility and scaled utility in this section.

Utility [3] assigns a value or a cost to each document, based on whether or not it is retrieved and whether or not it is relevant. The general formula for utility is:

Utility = A × R+ + B × N+ + C × R− + D × N−   (6)

where R+ is the number of relevant documents retrieved, R− is the number of relevant documents not retrieved, N+ is the number of non-relevant documents retrieved, and N− is the number of non-relevant documents not retrieved. The utility parameters (A, B, C, D) determine the relative value of each possible category. For the evaluation of the experiments in this paper, we used the LF1 utility used in TREC-8, with the parameters set as follows: A = 3, B = −2, C = D = 0. However, it is not appropriate to compare the value of LF1 across topics, due to the wide variation in the number of relevant documents per topic. Therefore, it is necessary to normalize LF1 for fair comparison. We used scaled utility for the normalization of LF1. The formula for scaled utility is:

u_s(S, T) = (max(u(S, T), U(s)) − U(s)) / (MaxU(T) − U(s))   (7)

where u(S, T) and u_s(S, T) are the original and scaled utility of system S for topic T, U(s) is the utility of retrieving s non-relevant documents, and MaxU(T) is the maximum possible utility score for topic T. All utility scores less than U(s) are set to U(s). Therefore, utility scores range between U(s) and MaxU(T), and are renormalized to range between 0 and 1.

3.2.2 Results

First, we conducted experiments using only the relevant information profile (PR) for filtering. Parameters wgt_relR and wgt_nrelR were set to values in {−200, −400, −800} and {−100, −200, −400, −800}, respectively. The similarity threshold (ThresR) was fixed at 0.1. The average scaled utility over all 50 topics for each parameter set is shown in Table 1.
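The LF1 and scaled utility computations (Formulas (6) and (7)) can be sketched as follows, with the TREC-8 parameters A = 3, B = −2, C = D = 0; the function names are ours:

```python
# Sketch of the LF1 utility and its scaled form. With C = D = 0, only
# retrieved documents contribute to the raw score; scaled utility maps
# the result into [0, 1] using the floor U(s) and the topic maximum.

def lf1(rel_ret, nonrel_ret, a=3.0, b=-2.0):
    return a * rel_ret + b * nonrel_ret

def scaled_utility(utility, total_relevant, s=200, a=3.0, b=-2.0):
    """Normalise a utility score so topics with different numbers of
    relevant documents can be compared."""
    u_floor = b * s              # utility of retrieving s non-relevant documents
    max_u = a * total_relevant   # utility of retrieving every relevant document
    clipped = max(utility, u_floor)
    return (clipped - u_floor) / (max_u - u_floor)
```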
The parameter s for the calculation of scaled utility is set to 200. The results in Table 1 show that the parameter set {wgt_relR, wgt_nrelR} = {−200, −800} achieved the best performance.

Table 1: Average scaled utility (PR only)

wgt_relR \ wgt_nrelR    -100     -200     -400     -800
-200                    0.4558   0.4840   0.5091   0.5257
-400                    0.4172   0.4777   0.5107   0.5184
-800                    0.3815   0.4349   0.4842   0.5100

Next, we evaluated the performance of filtering using the non-relevant information profile (PN). The parameters used for updating PR were fixed to {wgt_relR, wgt_nrelR} = {−200, −800}, based on the results in Table 1. ThresR was fixed at 0.1, as in the previous experiment. The parameters for updating PN, wgt_relN and wgt_nrelN, were set to values in {−200, −400, −800} and {−100, −200, −400, −800}, respectively. The similarity threshold ThresN was set to 0.1 and 0.25. The results for each ThresN are shown in Tables 2 (ThresN = 0.1) and 3 (ThresN = 0.25).

Table 2: Average scaled utility (ThresN = 0.1)

wgt_relN \ wgt_nrelN    -100     -200     -400     -800
-200                    0.5660   0.5755   0.5814   0.5852
-400                    0.5667   0.5702   0.5810   0.5858
-800                    0.5690   0.5728   0.5743   0.5863

Table 3: Average scaled utility (ThresN = 0.25)

wgt_relN \ wgt_nrelN    -100     -200     -400     -800
-200                    0.5448   0.5464   0.5448   0.5508
-400                    0.5448   0.5466   0.5491   0.5466
-800                    0.5408   0.5466   0.5484   0.5505

A consistent improvement in scaled utility compared to the original filtering method can be observed in Tables 2 and 3. This shows that the application of the non-relevant information profile has contributed to the improvement of filtering performance.

3.3 Analysis

For further analysis of the effects of ThresN, we examined the relation between the similarity of each document and the two profiles, PR and PN. We refer to the similarity to each of these profiles as SimR and SimN, respectively. In order to analyze the relation between SimR and SimN for relevant and non-relevant documents, we plotted all documents which passed PR on a two-dimensional graph.
The SimR-SimN graph for the experiment with ThresN = 0.25 is illustrated in Figure 4, and the graph for ThresN = 0.1 is shown in Figure 5.

[Figure 4: Relation of SimR and SimN (ThresN = 0.25)]

[Figure 5: Relation of SimR and SimN (ThresN = 0.1)]

It is clear from Figure 4 that SimN is relatively higher for non-relevant documents than for relevant documents. This suggests that it is possible to reject many non-relevant documents by setting ThresN to an appropriate value. In this case, however, ThresN is 0.25. As is apparent from Figure 4, there are not many documents whose SimN is higher than ThresN, meaning that such a threshold setting is too lenient. However, when ThresN is set to 0.1, as in Figure 5, the SimN values of relevant and non-relevant documents are mixed, compared to the plots illustrated in Figure 4. The difference between these two experiments is the strictness of ThresN. As a result of tightening the threshold of the non-relevant information profile, the number of selected documents decreases. This decrease directly reduces the amount of feedback information available to the profile updating process. The results illustrated in Figure 5 indicate that feedback information was insufficient for accurate discrimination of non-relevant documents. However, Figure 4 shows that the increase in feedback information obtained by loosening the threshold has little value, since fewer non-relevant documents are rejected by the non-relevant profile.

4 Non-relevant profile with pseudo feedback

4.1 Method

The results of the experiments described in the previous section show that there is a tradeoff between the strictness of ThresN and the performance of profile PN. To solve this problem, we propose the use of pseudo feedback [4] to increase feedback information. Pseudo feedback is often used for QE in the text retrieval task, when the relevance of retrieved documents is uncertain. Generally, documents which are highly ranked in the initial search are assumed to be relevant. This assumption is sent back to the system, which uses the information to expand the query. Our proposal is to assume that documents blocked by PN are non-relevant, and to send this information to the profile updating process. Documents regarded as non-relevant by pseudo feedback are handled in the same way as documents which were actually judged non-relevant through the original relevance feedback. This method allows ThresN to be strict without sacrificing feedback information.

4.2 Experiment

Experiments were conducted to evaluate pseudo feedback. Parameters for these experiments were set as follows: ThresR = ThresN = 0.1, wgt_relR = −200, wgt_nrelR = −800, wgt_relN ∈ {−200, −400, −800}, wgt_nrelN ∈ {−100, −200, −400, −800}. The average scaled utility for each set of wgt_relN and wgt_nrelN is shown in Table 4.

Table 4: Average scaled utility (pseudo feedback)

wgt_relN \ wgt_nrelN    -100     -200     -400     -800
-200                    0.5752   0.5799   0.5900   0.5927
-400                    0.5779   0.5803   0.5859   0.5954
-800                    0.5790   0.5813   0.5862   0.5896

The results in Table 4 show an overall improvement in filtering performance. This indicates that PN successfully rejects more non-relevant documents than the method described in the previous section. To confirm this result, we made a SimR-SimN graph for this experiment, as in Figures 4 and 5 for the previous experiments. The SimR-SimN graph for the pseudo feedback experiment is illustrated in Figure 6.

[Figure 6: Relation of SimR and SimN (pseudo feedback)]

As is clear from Figure 6, the SimN values of non-relevant documents are distributed higher than in the results illustrated in Figures 4 and 5. This graph and the scaled utility improvement shown in Table 4 prove that the non-relevant information profile successfully rejects a reasonable number of non-relevant documents, as expected. However, it is also clear from Figure 6 that the SimN of some relevant documents has also increased, causing a mixture of non-relevant and relevant documents in the area where SimN is relatively high. The cause of this is the inaccuracy of the pseudo feedback, whereby relevant documents may be mistakenly regarded as non-relevant. This shows that the decrease in non-relevant document selection was achieved at some sacrifice of relevant documents.

We suggest two solutions to this problem. One is the selection of pseudo feedback information. The inaccuracy of pseudo feedback can be reduced by simply not using "suspicious" information for feedback. In this case, such information may come from documents which were barely rejected by the non-relevant information profile. By ignoring such documents, and using only documents which have high similarity to the non-relevant information profile, the rate of erroneous feedback can be decreased. Another solution is to weight the pseudo feedback information based on the similarity between each document and the non-relevant information profile. This is a moderate version of the previous solution: instead of simply ignoring "suspicious" documents, a weight is applied to each document based on its similarity to the non-relevant information profile. An ideal weighting scheme will emphasize feedback information extracted from documents highly similar to the non-relevant information profile, which may lead to higher pseudo feedback quality.
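The second proposed solution could be sketched as a similarity-based weighting of pseudo feedback; the linear ramp and the cut-off value `full_weight_at` are our assumptions, since the paper only asks that documents highly similar to the non-relevant information profile be emphasized:

```python
# Sketch of similarity-weighted pseudo feedback: documents barely over
# ThresN contribute little, documents well over it contribute fully.

def pseudo_feedback_weight(sim_n, thres_n=0.1, full_weight_at=0.5):
    """Return a weight in [0, 1] for a document rejected with similarity
    sim_n (>= thres_n) to the non-relevant information profile."""
    if sim_n >= full_weight_at:
        return 1.0
    return (sim_n - thres_n) / (full_weight_at - thres_n)
```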
5 Conclusion

Many existing document filtering systems take a conservative approach to achieve high filtering performance: to avoid retrieval of non-relevant documents, such systems sacrifice the retrieval of relevant documents. In order to retrieve more relevant documents without excessive retrieval of non-relevant documents, we have proposed the use of a non-relevant information profile. The non-relevant information profile expresses the features of mistakenly retrieved non-relevant documents. The objective of this profile is to reject the retrieval of non-relevant documents which are similar to documents mistakenly retrieved in the past flow of documents. Along with the similarity calculation between each document and the original profile, the similarity to the non-relevant information profile is calculated, and documents with high similarity to this profile are rejected. Through experiments, we have shown that the non-relevant information profile successfully reduces the retrieval of non-relevant documents, resulting in an overall improvement of filtering performance. We have also conducted an experiment on the application of pseudo feedback for building the non-relevant information profile. Results from this experiment show that the increase of feedback information resulting from pseudo feedback has also improved filtering performance.

References

[1] K. Hoashi, K. Matsumoto, N. Inoue, K. Hashimoto: "Experiments on the TREC-8 Filtering Track", The 8th Text REtrieval Conference (to be published), 2000.
[2] K. Hoashi, K. Matsumoto, N. Inoue, K. Hashimoto: "Query Expansion Method Based on Word Contribution", Proceedings of SIGIR'99, pp. 303-304, 1999.
[3] D. Hull: "The TREC-7 Filtering Track: Description and Analysis", The 7th Text REtrieval Conference, NIST SP 500-242, pp. 33-56, 1999.
[4] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford: "Okapi at TREC-3", Overview of the Third Text REtrieval Conference, pp. 109-125, 1994.
[5] J. Rocchio: "Relevance Feedback in Information Retrieval", in "The SMART Retrieval System: Experiments in Automatic Document Processing", Prentice Hall Inc., pp. 313-323, 1971.
[6] A. Singhal, J. Choi, D. Hindle, D. Lewis, F. Pereira: "AT&T at TREC-7", The Seventh Text REtrieval Conference, NIST SP 500-242, pp. 239-251, 1999.
[7] E. Voorhees, D. Harman: "The 8th Text REtrieval Conference", (to be published), 2000.

[8] C. Zhai, P. Jansen, N. Roma, E. Stoica, D. Evans: "Notes on Optimization in CLARIT Adaptive Filtering", The 8th Text REtrieval Conference (to be published), 2000.