Document Filtering Method Using Non-Relevant Information Profile

Keiichiro Hoashi, Kazunori Matsumoto, Naomi Inoue, Kazuo Hashimoto
KDD R&D Laboratories, Inc., Ohara, Kamifukuoka, Saitama, Japan
{hoashi, matsu, inoue, kh}@kddlabs.co.jp

Abstract

Document filtering is a task to retrieve documents relevant to a user's profile from a flow of documents. Generally, filtering systems calculate the similarity between the profile and each incoming document, and retrieve documents with similarity higher than a threshold. However, many systems set a relatively high threshold to reduce the retrieval of non-relevant documents, which results in many relevant documents being ignored. In this paper, we propose the use of a non-relevant information profile to reduce the mistaken retrieval of non-relevant documents. Results from experiments show that this filter successfully rejects a sufficient number of non-relevant documents, resulting in an improvement of filtering performance.

1 Introduction

Document filtering is a task which monitors a flow of incoming documents and selects those which the system regards as relevant to the user's interest. Many document filtering systems use a similarity-based method to retrieve documents. The user's interest is expressed within the system as a profile. The similarity between the profile and each incoming document is calculated, and documents with similarities higher than a preset threshold are retrieved. Retrieved documents are sent to the user, who returns relevance feedback to the system. This feedback information is used to update the profile for the upcoming flow of new documents.

Due to its similarity to the traditional information retrieval (IR) task, many techniques developed in IR are applied to document filtering systems. For example, profiles and incoming documents are usually indexed by methods used in IR, such as the vector space model. Therefore, the profile-document similarity calculation method is virtually the same as the algorithms used for calculating query-document similarity in IR. Furthermore, query expansion (QE) is often applied to utilize relevance feedback information for profile updating.

Numerous document filtering systems have been reported in the Filtering Track [3] of recent TREC conferences. One of the three subtasks prepared in the Filtering Track is the adaptive filtering task, where systems start with only the original profile, which is used to build a text classification rule. The adaptive filtering task is considered to be the task which best reflects practical filtering situations, but it is also the most difficult task in the Filtering Track. Due to the difficulty of this task, many systems become "conservative" as the document flow proceeds, i.e., they tend to set a high threshold to avoid the mistaken retrieval of non-relevant documents. In other words, a threshold which would retrieve a sufficient number of relevant documents results in the excessive retrieval of non-relevant documents.
This suggests that the simple method of comparing profile-document similarity to a threshold may not be effective, especially when the system is expected to retrieve as many relevant documents as possible.

In this paper, a novel filtering method is proposed. The proposed method uses a profile which expresses information about the non-relevant documents retrieved during the filtering process. This non-relevant information profile can reduce the number of mistakenly retrieved documents, so that the system can retrieve more relevant documents which would otherwise be ignored due to a conservative similarity threshold.

In Section 2, we describe existing filtering methods, mainly focusing on profile updating, and present problems of these methods through preliminary experiments. In Section 3, we explain the non-relevant information profile and evaluate its performance through experiments. In Section 4, we describe another new method which applies pseudo feedback to increase feedback information to the non-relevant information profile. We conclude this paper in Section 5.

2 Problems of existing filtering methods

As described in the previous section, the filtering method in which the system retrieves documents based on the similarity between the profile and each incoming document is suspected to have problems. However, most research has focused on aspects such as profile updating to improve filtering performance. In this section, we explain two existing profile updating methods. We also describe a preliminary experiment on these filtering methods, and analyze its results to clarify the problems.

2.1 Existing profile updating methods

Rocchio's algorithm

One of the most effective and widely applied algorithms for relevance feedback and query expansion is Rocchio's algorithm [5], which was developed in the mid-1960s. Developed for the vector space model, this algorithm is based on the idea that if the relevance for a query is known, an optimal query vector will maximize the average query-document similarity for relevant documents, and will simultaneously minimize query-document similarity for non-relevant documents. Generally, the query expansion method based on Rocchio's algorithm is expressed by the following formula:

q_new = α q + (β / R) Σ_{d ∈ Rel} d - (γ / N) Σ_{d ∈ Nonrel} d    (1)

where R is the number of documents in the relevant document set, N is the number of documents in the non-relevant document set, and α, β, γ are parameters.

For the use of Rocchio's algorithm in profile updating, we referred to the method described in [8]. In this method, only positive documents (i.e., selected relevant documents) are used for profile updating. The coefficient for positive documents is fixed to 0.1, meaning the parameters in Formula (1) are set as α = 1, β = 0.1, γ = 0. The profile is updated with these parameters after every n selected documents; in the experiments described below, n was held fixed.
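As an illustration, a minimal sketch of this Rocchio-style profile update in Python is given below, assuming profiles and documents are represented as sparse term-weight dictionaries; the function name and data layout are ours, not the paper's.

from collections import defaultdict

def rocchio_update(profile, relevant_docs, alpha=1.0, beta=0.1):
    """Rocchio-style profile update using only positive (relevant)
    documents, i.e. Formula (1) with alpha = 1, beta = 0.1, gamma = 0.
    The profile and each document are dicts mapping term -> weight."""
    updated = defaultdict(float)
    for term, weight in profile.items():
        updated[term] += alpha * weight
    for doc in relevant_docs:
        for term, weight in doc.items():
            # (beta / R) times the sum of the relevant document vectors
            updated[term] += beta * weight / len(relevant_docs)
    return dict(updated)

# Example: update a profile after two selected relevant documents.
profile = {"filtering": 0.8, "profile": 0.5}
feedback = [{"filtering": 0.6, "threshold": 0.4}, {"profile": 0.3, "utility": 0.2}]
profile = rocchio_update(profile, feedback)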
Word contribution

We have also evaluated a profile updating method based on word contribution (WC) [2], which is a measure of the influence of a word on query-document similarity. We describe the WC-based QE method and its application to profile updating in this section.

WC-based QE

Word contribution is defined by the following formula:

Cont(w, q, d) = Sim(q, d) - Sim(q'(w), d'(w))    (2)

where Cont(w, q, d) is the contribution of word w to the similarity between query q and document d, Sim(q, d) is the similarity between q and d, q'(w) is query q excluding word w, and d'(w) is document d excluding word w. In other words, the contribution of word w is the difference between the similarity of q and d, and the similarity of q and d when word w is assumed to be nonexistent in both. Therefore, some words have positive contribution and others have negative contribution: words with positive contribution raise similarity, and words with negative contribution lower it. Analysis of WC [2] shows that words with either highly positive or highly negative contribution are few, and that most words have contribution near zero. This means that most words do not have a significant influence on query-document similarity. As is obvious from the definition of word contribution, words with highly positive contribution are words which co-occur in the query and the document. Such words can be considered informative indicators of a document's relevance to the query. On the contrary, words with highly negative contribution can be considered words which discriminate relevant documents from the other non-relevant documents contained in the data collection. Experiments reported in [2] show that using such words with highly negative contribution for query expansion achieved higher performance than the Rocchio-based query expansion method.

WC-based profile updating

In the previously described QE method, words used for query expansion were extracted only from relevant documents. In the profile updating method based on WC [1], information from all selected documents is used, regardless of their relevance to the profile. First, the word contribution of all words in the selected document is calculated. From each selected document d, the N words with the lowest contribution are extracted. Next, a score for each extracted word w is calculated by the following formula:

Score(w) = wgt x Cont(w, p, d)    (3)

where wgt is a parameter with a negative value (since the contribution of the extracted words is also negative), and Cont(w, p, d) is the WC of word w to the similarity of profile p and document d. In this procedure, the calculated score is regarded as the TF (term frequency) element of the word. Finally, all extracted words and their weights are added to the profile, unless the calculated weight of the word is negative.

A Rocchio-like algorithm is applied here to incorporate information from non-relevant documents into the profile. When the selected document d is relevant to the profile, the weight of word w is added to the element of the profile vector which expresses w; when d is non-relevant, the weight is subtracted from that element. Separate parameters (wgt) are used for the calculation of Score(w) in Formula (3), depending on the relevance of d: wgt_relR is the parameter for words extracted from relevant documents, and wgt_nrelR is the parameter for words extracted from non-relevant documents. Elements of the profile vector with negative weights are not used for similarity calculation, but all weights are accumulated for profile updating on upcoming documents. Therefore, the weights of words which appear in both relevant and non-relevant documents are restrained, thus emphasizing words which appear only in relevant documents.
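A minimal sketch of how word contribution (Formula (2)) and the score of Formula (3) might be computed for one selected document is shown below, assuming cosine similarity over sparse term-weight vectors; the helper names and default parameter values are illustrative, not taken from the paper.

import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def contribution(word, q, d):
    """Cont(w, q, d) = Sim(q, d) - Sim(q'(w), d'(w))  (Formula (2))."""
    q_wo = {t: w for t, w in q.items() if t != word}
    d_wo = {t: w for t, w in d.items() if t != word}
    return cosine(q, d) - cosine(q_wo, d_wo)

def extract_scored_words(profile, doc, n=10, wgt=-200.0):
    """Extract the n lowest-contribution words of doc and score them with
    Score(w) = wgt * Cont(w, p, d)  (Formula (3)); wgt < 0, so words with
    negative contribution receive positive scores."""
    contribs = {w: contribution(w, profile, doc) for w in doc}
    lowest = sorted(contribs, key=contribs.get)[:n]
    return {w: wgt * contribs[w] for w in lowest}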

2.2 Evaluation of existing methods

We have conducted experiments to evaluate existing filtering methods. In this section, we briefly describe the data used for the experiments and our filtering system, and present the experiment results.

Experiment data

The TREC-8 Filtering Track data [7] was used for our experiments. This data set consists of articles from the Financial Times from 1992 to 1994, a total of approximately 200,000 articles. Each article is input into the filtering system in chronological order to create the document flow. Topics are used as profiles, and the relevant document set for each topic is used to simulate relevance feedback. The vocabulary and IDF data are initially constructed from the data on TREC CD-ROMs Vol. 4 and 5, excluding the Financial Times and Congressional Record data. Both the vocabulary and IDF data are updated at regular intervals during the document flow.

System description

The filtering system used for our experiments is based on the vector space model. The weighting scheme is based on the TF*IDF weighting formulas used for the SMART system at TREC-7 [6], with minor customizations. The TF and IDF factors for our system are as follows:

TF factor = log(1 + tf)    (4)

IDF factor = log(M / df)    (5)

where tf is the term's frequency in the document, df is the number of documents that contain the term, and M is the total number of documents in the data collection. We added 1 to the term frequency inside the logarithm of the TF factor because the tf value resulting from word contribution occasionally has values below 1, which would result in a negative weight.
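For concreteness, a sketch of how the TF and IDF factors of Formulas (4) and (5) can be combined into term weights is shown below; the exact SMART-style normalization of the authors' system is not reproduced, so this is only an approximation.

import math

def term_weight(tf, df, m):
    """TF*IDF weight combining Formula (4), log(1 + tf),
    with Formula (5), log(M / df). Assumes df > 0."""
    return math.log(1.0 + tf) * math.log(m / df)

def weight_document(term_freqs, doc_freqs, m):
    """Turn raw term frequencies into a sparse TF*IDF vector.
    term_freqs: term -> tf in this document,
    doc_freqs:  term -> df in the collection, m: collection size."""
    return {t: term_weight(tf, doc_freqs.get(t, 1), m)
            for t, tf in term_freqs.items()}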
Analysis

Figures 1 and 2 illustrate the similarity of documents selected by comparison to a profile, using the profile updating methods based on Rocchio and WC, respectively. The horizontal axis of each graph expresses the number of documents selected during the filtering process, and the vertical axis is the similarity of each selected document. Parameters wgt_relR and wgt_nrelR for the WC-based method were set to -200 and -800, respectively.

Figure 1: Similarity of selected documents (Rocchio)

Figure 2: Similarity of selected documents (WC)

It is clear from both Figures 1 and 2 that although relevant documents have relatively high similarity, many non-relevant documents have similarity close to that of relevant documents. The mixture of relevant and non-relevant documents can particularly be observed in low similarity areas.

Therefore, it is difficult to extract relevant documents from this area without retrieving a large number of non-relevant documents. The easy way to solve this problem is to set a high similarity threshold to reject as many non-relevant documents as possible. However, it is obvious that a high threshold will also result in the rejection of a large number of relevant documents. Moreover, such a strict threshold will result in less feedback to the profile, which may affect filtering performance on upcoming documents.

3 Non-relevant information profile

In this section, we propose a filtering method using the non-relevant information profile, which is a profile built to reject non-relevant documents. After describing this method, we explain the evaluation experiments of the proposed method in detail and analyze the results.

3.1 Method

To improve filtering performance without sacrificing the retrieval of relevant documents, it is necessary to reduce the selection of non-relevant documents. However, the analysis of the experiment results described in the previous section shows that this is difficult when filtering is based only on the similarity between the profile and incoming documents. In order to reduce the retrieval of non-relevant documents, we propose the use of a profile which expresses the features of non-relevant documents. By calculating the similarity between this non-relevant information profile and incoming documents which have passed the initial profile, and rejecting documents which have high similarity to the non-relevant information profile, it is possible to avoid selecting documents that are highly similar to non-relevant documents retrieved in the past. By rejecting such documents, an improvement of filtering performance can be expected.

The process flow of filtering with the non-relevant information profile is illustrated in Figure 3, where d is the selected document, P_R is the initial profile, P_N is the non-relevant information profile, and Sim(p, d) is the similarity between profile p and document d. As illustrated in Figure 3, thresholds Thres_R and Thres_N are set for each profile. The similarity between P_N and documents which have passed P_R is calculated and compared to Thres_N. If the similarity exceeds Thres_N, the document is regarded as non-relevant and, as a result, is rejected by P_N.

Figure 3: Filtering process with non-relevant information profile

The method to build the non-relevant information profile is as follows. Initial values of all elements in the non-relevant information profile are set to 0. For each selected document, N words are extracted and their weights are calculated based on WC. As in the original WC-based profile updating method, the parameter wgt differs based on the relevance of the selected document: for the generation and updating of P_N, wgt_relN is the parameter for words extracted from relevant documents, and wgt_nrelN is the parameter for words extracted from non-relevant documents. To update the non-relevant information profile, the weights of words extracted from non-relevant documents are added to the corresponding element of the profile vector, and the weights of words extracted from relevant documents are subtracted. This is the opposite of the updating of the initial profile, where the weights of words extracted from relevant documents are added to the corresponding element of the profile vector, and the weights of words extracted from non-relevant documents are subtracted. In addition to the updating of the non-relevant information profile, the initial profile P_R is also updated, by the WC-based method described in Section 2.
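A minimal sketch of the decision flow of Figure 3, and of the P_N update rule described above, is given below; it assumes the dict-based vectors and cosine similarity of the earlier sketches, and the identifiers are ours.

def filter_document(doc, p_r, p_n, thres_r, thres_n, sim):
    """Return True if doc should be delivered to the user (Figure 3).
    A document must first pass the initial profile P_R, and is then
    checked against the non-relevant information profile P_N."""
    if sim(p_r, doc) < thres_r:
        return False          # not similar enough to the user's interest
    if sim(p_n, doc) >= thres_n:
        return False          # too similar to past non-relevant documents
    return True

def update_non_relevant_profile(p_n, scored_words, doc_is_relevant):
    """Update P_N with WC-scored words from a selected document: weights
    from non-relevant documents are added, weights from relevant
    documents are subtracted (the reverse of the P_R update)."""
    sign = -1.0 if doc_is_relevant else 1.0
    for word, weight in scored_words.items():
        p_n[word] = p_n.get(word, 0.0) + sign * weight
    return p_n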
3.2 Experiment

We have conducted experiments to evaluate the use of the non-relevant information profile. Details of these experiments are described in this section.

Evaluation measures

Since recall and precision are not suitable for the evaluation of document filtering, we calculated the scaled utility [3] of each profile, and averaged the scaled utility over all profiles for evaluation. We explain utility and scaled utility in this section.

Utility [3] assigns a value or a cost to each document, based on whether it is retrieved or not retrieved and whether it is relevant or not relevant. The general formula for utility is shown below:

Utility = A x R+ + B x N+ + C x R- + D x N-    (6)

where R+ is the number of relevant documents retrieved, R- is the number of relevant documents not retrieved, N+ is the number of non-relevant documents retrieved, and N- is the number of non-relevant documents not retrieved. The utility parameters (A, B, C, D) determine the relative value of each possible category. For the evaluation of the results of the experiments in this paper, we used the LF1 utility used in TREC-8, where the parameters were set as follows: A = 3, B = -2, C = D = 0.

However, it is not appropriate to compare the value of LF1 across topics, due to the wide variation in the number of relevant documents per topic. Therefore, it is necessary to normalize LF1 for fair comparison. We used scaled utility for the normalization of LF1. The formula of scaled utility is as follows:

u_s(S, T) = ( max(u(S, T), U(s)) - U(s) ) / ( MaxU(T) - U(s) )    (7)

where u(S, T) and u_s(S, T) are the original and scaled utility of system S for topic T, U(s) is the utility of retrieving s non-relevant documents, and MaxU(T) is the maximum possible utility score for topic T. All utility scores less than U(s) are set to U(s). Therefore, utility scores can range between U(s) and MaxU(T), and the scores are renormalized to range between 0 and 1.
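The computation of LF1 and scaled utility can be sketched directly from Formulas (6) and (7) under the TREC-8 parameter setting; the variable names below are ours.

def lf1_utility(rel_retrieved, nonrel_retrieved):
    """LF1 utility: Formula (6) with A = 3, B = -2, C = D = 0."""
    return 3 * rel_retrieved - 2 * nonrel_retrieved

def scaled_utility(u, max_u, s=200):
    """Scaled utility, Formula (7).
    u:     raw utility of the system for this topic
    max_u: maximum possible utility for the topic
    s:     U(s) is the utility of retrieving s non-relevant documents."""
    u_s_floor = lf1_utility(0, s)                 # U(s) = -2 * s
    return (max(u, u_s_floor) - u_s_floor) / (max_u - u_s_floor)

# Example: a topic with 30 relevant documents, of which 12 were retrieved
# along with 20 non-relevant documents.
raw = lf1_utility(12, 20)                         # 3*12 - 2*20 = -4
print(scaled_utility(raw, max_u=lf1_utility(30, 0)))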
Results

First, we conducted experiments using only the relevant information profile (P_R) for filtering. Parameters wgt_relR and wgt_nrelR were set to {-200, -400, -800} and {-100, -200, -400, -800}, respectively. The similarity threshold (Thres_R) was fixed to 0.1. The average scaled utility over all 50 topics for each parameter set is shown in Table 1. Parameter s for the calculation of scaled utility is set to 200.

Table 1: Average scaled utility (P_R only)

The results in Table 1 show that the parameter set {wgt_relR, wgt_nrelR} = {-200, -800} achieved the best performance. Next, we evaluated the performance of filtering using the non-relevant information profile (P_N). The parameters used for updating P_R were fixed to {wgt_relR, wgt_nrelR} = {-200, -800}, based on the results in Table 1. Thres_R was fixed to 0.1, as in the previous experiment. Parameters for updating P_N, wgt_relN and wgt_nrelN, were set to {-200, -400, -800} and {-100, -200, -400, -800}, respectively. The similarity threshold Thres_N was set to 0.1 and 0.25. Results for each Thres_N are shown in Table 2 (Thres_N = 0.1) and Table 3 (Thres_N = 0.25).

Table 2: Average scaled utility (Thres_N = 0.1)

Table 3: Average scaled utility (Thres_N = 0.25)

Consistent improvement in scaled utility compared to the original filtering method can be observed from the results in Tables 2 and 3. This shows that the application of the non-relevant information profile has contributed to the improvement of filtering performance.

3.3 Analysis

For further analysis of the effects of Thres_N, we examined the relation between the similarity of each document and the two profiles, P_R and P_N. We refer to the similarity to each of these profiles as Sim_R and Sim_N, respectively. In order to analyze the relation between Sim_R and Sim_N for relevant and non-relevant documents, we plotted all documents which have passed P_R on a two-dimensional graph. The Sim_R-Sim_N graph for the experiment with Thres_N = 0.25 is illustrated in Figure 4, and the graph for Thres_N = 0.1 is shown in Figure 5.

Figure 4: Relation of Sim_R and Sim_N (Thres_N = 0.25)

Figure 5: Relation of Sim_R and Sim_N (Thres_N = 0.1)

It is clear from Figure 4 that Sim_N is relatively higher for non-relevant documents than for relevant documents. This suggests that it is possible to reject many non-relevant documents by setting Thres_N to an appropriate value. In this case, however, Thres_N is 0.25: as apparent from Figure 4, there are not many documents whose Sim_N is higher than Thres_N, meaning that such a threshold setting is too moderate. However, when Thres_N is set to 0.1, as in Figure 5, the Sim_N values of relevant and non-relevant documents are mixed, compared to the plots illustrated in Figure 4. The difference between these two experiments is the strictness of Thres_N. As a result of strengthening the threshold of the non-relevant information profile, the number of selected documents decreases. This decrease is directly reflected in the amount of feedback information available to the profile updating process. The results illustrated in Figure 5 indicate that the feedback information was insufficient for accurate discrimination of non-relevant documents. On the other hand, Figure 4 shows that the increase of feedback information obtained by loosening the threshold has little benefit, since fewer non-relevant documents are rejected by the non-relevant profile.

4 Non-relevant profile with pseudo feedback

4.1 Method

The results of the experiments described in the previous section show that there is a tradeoff between the strictness of Thres_N and the performance of profile P_N. To solve this problem, we propose the use of pseudo feedback [4] to increase feedback information. Pseudo feedback is often used for QE in the text retrieval task, when the relevance of retrieved documents is uncertain. Generally, documents which are ranked high in the initial search are assumed to be relevant, and this assumption is sent back to the system, which utilizes the information to expand the query. Our proposal is to assume that documents blocked by P_N are non-relevant, and to send this information to the profile updating process. Documents regarded as non-relevant by pseudo feedback are handled in the same way as documents actually judged non-relevant through the original relevance feedback. This method allows Thres_N to be strict without sacrificing feedback information.
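A sketch of how this pseudo feedback could be wired into the filtering loop is shown below, reusing the hypothetical extract_scored_words() and update_non_relevant_profile() helpers from the earlier sketches; this is our illustration, not the authors' implementation, and the corresponding update of P_R is omitted for brevity.

def process_document(doc, p_r, p_n, thres_r, thres_n, sim, get_user_feedback):
    """Process one incoming document with pseudo feedback: documents
    blocked by P_N are assumed to be non-relevant and fed back to the
    profile exactly like documents the user judged non-relevant."""
    if sim(p_r, doc) < thres_r:
        return None                           # ignored: no feedback at all

    scored = extract_scored_words(p_r, doc)   # WC-based word scores
    if sim(p_n, doc) >= thres_n:
        # Pseudo feedback: treat the blocked document as non-relevant.
        update_non_relevant_profile(p_n, scored, doc_is_relevant=False)
        return None

    relevant = get_user_feedback(doc)         # actual relevance judgment
    update_non_relevant_profile(p_n, scored, doc_is_relevant=relevant)
    return doc                                # delivered to the user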

4.2 Experiment

Experiments were conducted to evaluate pseudo feedback. Parameters for these experiments were set as follows: Thres_R = Thres_N = 0.1, wgt_relR = -200, wgt_nrelR = -800, wgt_relN = {-200, -400, -800}, and wgt_nrelN = {-100, -200, -400, -800}. The average scaled utility for each set of wgt_relN and wgt_nrelN is shown in Table 4.

Table 4: Average scaled utility (pseudo feedback)

The results in Table 4 show an overall improvement in filtering performance. This indicates that P_N successfully rejects more non-relevant documents than the method described in the previous section. To confirm this result, we made a Sim_R-Sim_N graph for this experiment, as in Figures 4 and 5 for the previous experiments. The Sim_R-Sim_N graph for the pseudo feedback experiment is illustrated in Figure 6.

Figure 6: Relation of Sim_R and Sim_N (pseudo feedback)

As is clear from Figure 6, the Sim_N values of non-relevant documents are distributed at higher values compared to the results illustrated in Figures 4 and 5. This graph and the scaled utility improvement shown in Table 4 prove that the non-relevant information profile successfully rejects a reasonable number of non-relevant documents, as expected. However, it is also clear from Figure 6 that Sim_N of some relevant documents has also increased, causing a mixture of non-relevant and relevant documents in the area where Sim_N is relatively high. The cause of this is the inaccuracy of the pseudo feedback, in which relevant documents may be mistakenly regarded as non-relevant. This shows that the decrease in non-relevant document selection was achieved with some sacrifice of relevant documents.

We suggest two solutions to this problem. One is the selection of pseudo feedback information. The inaccuracy of pseudo feedback can be reduced by simply not using "suspicious" information for feedback. In this case, such information may be documents which were barely rejected by the non-relevant information profile. By ignoring such documents, and using only documents which have high similarity to the non-relevant information profile, the rate of erroneous feedback can be decreased. Another solution is to weight the pseudo feedback information based on the similarity between each document and the non-relevant information profile. This is a moderate version of the previous solution: instead of simply ignoring "suspicious" documents, it is possible to apply a weight to each document based on its similarity to the non-relevant information profile. An ideal weighting scheme will emphasize feedback information extracted from documents highly similar to the non-relevant information profile, which may lead to higher pseudo feedback quality.

5 Conclusion

Many existing document filtering systems take a conservative approach to achieve high filtering performance: to avoid the retrieval of non-relevant documents, such systems sacrifice the retrieval of relevant documents. In order to retrieve more relevant documents without excessive retrieval of non-relevant documents, we have proposed the use of a non-relevant information profile. The non-relevant information profile expresses the features of mistakenly retrieved non-relevant documents. The objective of this profile is to reject non-relevant documents which are similar to documents mistakenly retrieved in the past flow of documents. Along with the similarity calculation between each document and the original profile, the similarity to the non-relevant information profile is calculated, and documents with high similarity to this profile are rejected. Through experiments, we have shown that the non-relevant information profile successfully reduces the retrieval of non-relevant documents, resulting in an overall improvement of filtering performance. We have also conducted an experiment on the application of pseudo feedback for building the non-relevant information profile. The results of this experiment show that the increase of feedback information obtained from pseudo feedback further improved filtering performance.
References

[1] K. Hoashi, K. Matsumoto, N. Inoue, K. Hashimoto: "Experiments on the TREC-8 Filtering Track", to appear in Proceedings of the 8th Text REtrieval Conference (TREC-8).

[2] K. Hoashi, K. Matsumoto, N. Inoue, K. Hashimoto: "Query Expansion Method Based on Word Contribution", Proceedings of SIGIR'99.

[3] D. Hull: "The TREC-7 Filtering Track: Description and Analysis", Proceedings of the 7th Text REtrieval Conference (TREC-7), NIST Special Publication, pp. 33-56.

[4] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford: "Okapi at TREC-3", Overview of the Third Text REtrieval Conference (TREC-3).

[5] J. Rocchio: "Relevance Feedback in Information Retrieval", in "The SMART Retrieval System - Experiments in Automatic Document Processing", Prentice Hall Inc.

[6] A. Singhal, J. Choi, D. Hindle, D. Lewis, and F. Pereira: "AT&T at TREC-7", Proceedings of the 7th Text REtrieval Conference (TREC-7), NIST Special Publication.

[7] E. Voorhees, D. Harman: "The 8th Text REtrieval Conference (TREC-8)", to be published.

[8] C. Zhai, P. Jansen, N. Roma, E. Stoica, D. Evans: "Notes on Optimization in CLARIT Adaptive Filtering", to appear in Proceedings of the 8th Text REtrieval Conference (TREC-8).


More information

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz The Cranfield Paradigm Retrieval Performance Evaluation Evaluation Using

More information

DOCUMENT INDEXING USING INDEPENDENT TOPIC EXTRACTION. Yu-Hwan Kim and Byoung-Tak Zhang

DOCUMENT INDEXING USING INDEPENDENT TOPIC EXTRACTION. Yu-Hwan Kim and Byoung-Tak Zhang DOCUMENT INDEXING USING INDEPENDENT TOPIC EXTRACTION Yu-Hwan Kim and Byoung-Tak Zhang School of Computer Science and Engineering Seoul National University Seoul 5-7, Korea yhkim,btzhang bi.snu.ac.kr ABSTRACT

More information

Word Indexing Versus Conceptual Indexing in Medical Image Retrieval

Word Indexing Versus Conceptual Indexing in Medical Image Retrieval Word Indexing Versus Conceptual Indexing in Medical Image Retrieval (ReDCAD participation at ImageCLEF Medical Image Retrieval 2012) Karim Gasmi, Mouna Torjmen-Khemakhem, and Maher Ben Jemaa Research unit

More information

GlOSS: Text-Source Discovery over the Internet

GlOSS: Text-Source Discovery over the Internet GlOSS: Text-Source Discovery over the Internet LUIS GRAVANO Columbia University HÉCTOR GARCÍA-MOLINA Stanford University and ANTHONY TOMASIC INRIA Rocquencourt The dramatic growth of the Internet has created

More information

Navigation Retrieval with Site Anchor Text

Navigation Retrieval with Site Anchor Text Navigation Retrieval with Site Anchor Text Hideki Kawai Kenji Tateishi Toshikazu Fukushima NEC Internet Systems Research Labs. 8916-47, Takayama-cho, Ikoma-city, Nara, JAPAN {h-kawai@ab, k-tateishi@bq,

More information

X. A Relevance Feedback System Based on Document Transformations. S. R. Friedman, J. A. Maceyak, and S. F. Weiss

X. A Relevance Feedback System Based on Document Transformations. S. R. Friedman, J. A. Maceyak, and S. F. Weiss X-l X. A Relevance Feedback System Based on Document Transformations S. R. Friedman, J. A. Maceyak, and S. F. Weiss Abstract An information retrieval system using relevance feedback to modify the document

More information

A Practical Passage-based Approach for Chinese Document Retrieval

A Practical Passage-based Approach for Chinese Document Retrieval A Practical Passage-based Approach for Chinese Document Retrieval Szu-Yuan Chi 1, Chung-Li Hsiao 1, Lee-Feng Chien 1,2 1. Department of Information Management, National Taiwan University 2. Institute of

More information

Fondazione Ugo Bordoni at TREC 2003: robust and web track

Fondazione Ugo Bordoni at TREC 2003: robust and web track Fondazione Ugo Bordoni at TREC 2003: robust and web track Giambattista Amati, Claudio Carpineto, and Giovanni Romano Fondazione Ugo Bordoni Rome Italy Abstract Our participation in TREC 2003 aims to adapt

More information

Web Information Retrieval. Exercises Evaluation in information retrieval

Web Information Retrieval. Exercises Evaluation in information retrieval Web Information Retrieval Exercises Evaluation in information retrieval Evaluating an IR system Note: information need is translated into a query Relevance is assessed relative to the information need

More information

An Exploration of Query Term Deletion

An Exploration of Query Term Deletion An Exploration of Query Term Deletion Hao Wu and Hui Fang University of Delaware, Newark DE 19716, USA haowu@ece.udel.edu, hfang@ece.udel.edu Abstract. Many search users fail to formulate queries that

More information