Similar documents
Overview of Patent Retrieval Task at NTCIR-5

Document Structure Analysis in Associative Patent Retrieval

Study on Merging Multiple Results from Information Retrieval System

Analyzing Document Retrievability in Patent Retrieval Settings

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

Overview of the Patent Retrieval Task at the NTCIR-6 Workshop

Overview of Classification Subtask at NTCIR-6 Patent Retrieval Task

ABRIR at NTCIR-9 GeoTime Task: Usage of Wikipedia and GeoNames for Handling Named Entity Information

On a Combination of Probabilistic and Boolean IR Models for WWW Document Retrieval

Pseudo-Relevance Feedback and Title Re-Ranking for Chinese Information Retrieval

Query Likelihood with Negative Query Generation

Using the K-Nearest Neighbor Method and SMART Weighting in the Patent Document Categorization Subtask at NTCIR-6

Text Chapter 5: Background. W. Bruce Croft

R2D2 at NTCIR-4 Web Retrieval Task

Applying the KISS Principle for the CLEF-IP 2010 Prior Art Candidate Patent Search Task

Siemens TREC-4 Report: Further Experiments with Database Merging. Ellen M. Voorhees. Siemens Corporate Research, Inc.

Query Expansion from Wikipedia and Topic Web Crawler on CLIR

Query Expansion with the Minimum User Feedback by Transductive Learning

Overview of the Patent Mining Task at the NTCIR-8 Workshop

A Patent Retrieval Method Using a Hierarchy of Clusters at TUT

Information Retrieval. (M&S Ch 15)

RMIT University at TREC 2006: Terabyte Track

CMPSCI 646, Information Retrieval (Fall 2003)

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

Automatically Generating Queries for Prior Art Search

James Mayfield. The Johns Hopkins University Applied Physics Laboratory. The Human Language Technology Center of Excellence.

TREC-10 Web Track Experiments at MSRA

Term Frequency Normalisation Tuning for BM25 and DFR Models

A Practical Passage-based Approach for Chinese Document Retrieval

Navigation Retrieval with Site Anchor Text

Mercure at TREC-6. IRIT/SIG, Campus Univ. Toulouse III, Toulouse, France.

Probabilistic Learning Approaches for Indexing and Retrieval with the TREC-2 Collection

ResPubliQA 2010

Inference Networks for Document Retrieval. A Dissertation Presented by Howard Robert Turtle.

From CLIR to CLIE: Some Lessons in NTCIR Evaluation

Boolean Model. Hongning Wang

Performance Measures for Multi-Graded Relevance

NTCIR-5 Query Expansion Experiments using Term Dependence Models

Evaluation of the Document Categorization in Fixed-point Observatory

Evaluation of Information Access Technologies at NTCIR Workshop

Verbose Query Reduction by Learning to Rank for Social Book Search Track

Routing and Ad-hoc Retrieval. Nikolaus Walczuch, Norbert Fuhr, Michael Pollmann, Birgit Sievers. University of Dortmund, Germany.

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Lecture 5: Information Retrieval using the Vector Space Model

Using an Image-Text Parallel Corpus and the Web for Query Expansion in Cross-Language Image Retrieval

Cross-Lingual Information Access and Its Evaluation

Towards Privacy-Preserving Evaluation for Information Retrieval Models over Industry Data Sets

TREC-7 Experiments at the University of Maryland. Douglas W. Oard. Digital Library Research Group, College of Library and Information Services, University of Maryland.

Proceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan

Real-time Query Expansion in Relevance Models

FLL: Answering World History Exams by Utilizing Search Results and Virtual Examples

Introduction to Information Retrieval

IITH at CLEF 2017: Finding Relevant Tweets for Cultural Events

Experiments on Patent Retrieval at NTCIR-4 Workshop

Melbourne University at the 2006 Terabyte Track

Evaluating a Conceptual Indexing Method by Utilizing WordNet

Chapter 6: Information Retrieval and Web Search. An introduction

Relevance of a Document to a Query

An Evaluation Method of Web Search Engines Based on Users' Sense

Integrating Query Translation and Text Classification in a Cross-Language Patent Access System

IRCE at the NTCIR-12 IMine-2 Task

AT&T at TREC-6. Amit Singhal. AT&T Labs-Research.

Document indexing, similarities and retrieval in large scale text collections

IPL at ImageCLEF 2010

HARD Track Overview in TREC 2004 (Notebook) High Accuracy Retrieval from Documents

Retrieval and Feedback Models for Blog Distillation

Cross-Language Chinese Text Retrieval in NTCIR Workshop Towards Cross-Language Multilingual Text Retrieval

A Distributed Retrieval System for NTCIR-5 Patent Retrieval Task

Merging Classifiers for Improved Information Retrieval

An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments

An Evaluation of Information Retrieval Accuracy with Simulated OCR Output. K. Taghva and J. Borsack. University of Massachusetts, Amherst.

United we fall, Divided we stand: A Study of Query Segmentation and PRF for Patent Prior Art Search

Document Expansion for Text-based Image Retrieval at CLEF 2009

Inferring Variable Labels Considering Co-occurrence of Variable Labels in Data Jackets

Custom IDF weights for boosting the relevancy of retrieved documents in textual retrieval

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

Dynamic Programming Matching for Large Scale Information Retrieval

CSCI 599: Applications of Natural Language Processing. Information Retrieval: Retrieval Models (Part 3)

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

Building a Search Engine: Part 2

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

Component ranking and Automatic Query Refinement for XML Retrieval

A Document-centered Approach to a Natural Language Music Search Engine

Midterm Exam. Search Engines. October 20, 2015

Information Retrieval

Overview of the Web Retrieval Task at the Third NTCIR Workshop

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

Adding a Source Code Searching Capability to Yioop. CS297 Report.

From Passages into Elements in XML Retrieval

Okapi in TIPS: The Changing Context of Information Retrieval

CS473: Course Review. Luo Si. Department of Computer Science, Purdue University

Fondazione Ugo Bordoni at TREC 2004

Efficient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Term Distillation in Patent Retrieval

Hideo Itoh, Hiroko Mano, Yasushi Ogawa
Software R&D Group, RICOH Co., Ltd.
1-1-17 Koishikawa, Bunkyo-ku, Tokyo 112-0002, JAPAN
{hideo,mano,yogawa}@src.ricoh.co.jp

Abstract

In cross-database retrieval, the domain of the queries differs from that of the retrieval target in the distribution of term occurrences. This causes incorrect term weighting in a retrieval system that assigns each term a retrieval weight based on the distribution of term occurrences. To resolve the problem, we propose "term distillation", a framework for query term selection in cross-database retrieval. Experiments using the NTCIR-3 patent retrieval test collection demonstrate that term distillation is effective for cross-database retrieval.

1 Introduction

For the mandatory runs of the NTCIR-3 patent retrieval task (http://research.nii.ac.jp/ntcir/ntcir-ws3/patent/), participants are required to construct a search query from a news article and retrieve patents which might be relevant to the query. This is a kind of cross-database retrieval, in that the domain of the queries (news articles) differs from that of the retrieval target (patents) (Iwayama et al., 2001).

Because the query domain differs from the target domain in the distribution of term occurrences, some query terms are given very large weights (importance) by the retrieval system even when they are not appropriate for retrieval. For example, the query term "president" in a news article might not be effective for patent retrieval; the retrieval system nevertheless gives it a large weight, because the document frequency of the term in the patent genre is very low. We think such problematic terms are so numerous that they cannot be eliminated with a stop word dictionary.

To resolve this problem, we propose "term distillation", a framework for query term selection in cross-database retrieval. Experiments using the NTCIR-3 patent retrieval test collection demonstrate that term distillation is effective for cross-database retrieval.

2 System description

Before describing our approach, we give a short description of our retrieval system. For the NTCIR-3 experiments we revised query processing, although the framework is the same as at NTCIR-2 (Ogawa and Mano, 2001). The basic features of the system are as follows:

- Effective document ranking with pseudo-relevance feedback, based on Okapi's approach (Robertson and Walker, 1997) with some improvements.
- Scalable and efficient indexing and search based on the inverted file system (Ogawa and Matsuda, 1999).
- An originally developed Japanese morphological analyzer and normalizer for document indexing and query processing.

The inverted file was constructed for the retrieval target collection, which contains the full texts of two years' Japanese patents. We adopted character n-gram indexing because it might be difficult for a Japanese morphological analyzer to correctly recognize the technical terms that are crucial for patent retrieval.

In what follows, we describe the fully automatic process of document retrieval in the NTCIR-3 patent retrieval task.

1. Query term extraction. The input query string is transformed into a sequence of words by the Japanese morphological analyzer. Query terms are extracted by matching patterns against this sequence; term extraction can easily be specified with patterns described as regular expressions over each word form or tag assigned by the analyzer. Stop words are eliminated using a stop word dictionary. For initial retrieval, both "single terms" and "phrasal terms" are used, where a phrasal term consists of two adjacent words in the query string.

2. Initial retrieval. Each query term is submitted one by one to the ranking search module, which assigns a weight to the term and scores the documents containing it. Retrieved documents are merged and sorted on the score in descending order.

3. Seed document selection. The top-ranked documents of the initial retrieval are assumed to be pseudo-relevant to the query and are selected as the "seed" of query expansion. The maximum number of seed documents is ten.

4. Query expansion. Candidate expansion terms are extracted from the seed documents by pattern matching, as in query term extraction. Phrasal terms are not used for query expansion, because they may be less effective for improving recall and are risky under pseudo-relevance feedback. The weight of an initial query term is re-calculated with the Robertson/Sparck-Jones formula (Robertson and Sparck-Jones, 1976) if the term is found in the candidate pool. The candidates are ranked on Robertson's Selection Value (Robertson, 1990), and the top-ranked terms are selected as expansion terms (a sketch of this step follows the list).

5. Final retrieval. Each query term and expansion term is submitted one by one to the ranking search module, as in the initial retrieval.
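The sketch below is a minimal illustration of the expansion step (step 4): candidate terms from the seed documents are weighted with the Robertson/Sparck-Jones formula and ranked by Robertson's Selection Value (the "offer weight" r × w). It is not the authors' implementation; the function names, the 0.5 smoothing constants, and the cutoff k are our assumptions.

```python
import math

def rsj_weight(r: int, R: int, n: int, N: int) -> float:
    """Robertson/Sparck-Jones relevance weight with the usual 0.5 smoothing.

    r: number of seed (pseudo-relevant) documents containing the term
    R: number of seed documents (at most ten, per step 3)
    n: document frequency of the term in the target collection
    N: number of documents in the target collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def select_expansion_terms(candidates, seed_df, df, R, N, k=10):
    """Rank candidate terms by Robertson's Selection Value r * w; keep top k."""
    scored = sorted(((seed_df[t] * rsj_weight(seed_df[t], R, df[t], N), t)
                     for t in candidates), reverse=True)
    return [term for _, term in scored[:k]]
```

A term that occurs in many of the (at most ten) seed documents but is rare in the whole collection receives a high value, which is exactly the behaviour the expansion step relies on.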

3 Term distillation

In cross-database retrieval, the domain of the queries (news articles) differs from that of the retrieval target (patents) in the distribution of term occurrences. This causes incorrect term weighting in a retrieval system that assigns each term a retrieval weight based on the distribution of term occurrences. Moreover, the terms that might be given an incorrect weight are too numerous to be collected in a stop word dictionary. For these reasons, we find it necessary to have a query term selection stage specially designed for cross-database retrieval.

We define "term distillation" as a general framework for this query term selection. More specifically, term distillation consists of the following steps:

1. Extraction of query term candidates. Candidate query terms are extracted from the query string (the news article) and pooled.

2. Assignment of TDV (Term Distillation Value). Each candidate in the pool is given a TDV, which represents the "goodness" of the term for retrieving documents in the target domain.

3. Selection of query terms. The candidates are ranked on the TDV, and the top-ranked n terms are selected as query terms, where n is an unknown constant treated as a tuning parameter for fully automatic retrieval.

Term distillation also seems appropriate for avoiding the "curse of dimensionality" (Robertson, 1990) when a given query is very lengthy. In the remainder of this section, we explain a generic model that defines the TDV; thereafter some instances of the model which embody term distillation are introduced.

3.1 Generic Model

To define the TDV, we give a generic model with the following formula:

    TDV = QV × TV

where QV and TV represent the importance of the term in the query and in the target domain, respectively. QV seems to be commonly used for query term extraction in ordinary retrieval systems; TV, by contrast, is newly introduced for cross-database retrieval. A combination of QV and TV embodies a term distillation method. We instantiate the two factors separately below; a sketch of one concrete combination follows Section 3.3.

3.2 Instances of TV

We give some instances of TV using two probabilities p and q, where p is the probability that the term occurs in the target domain and q is the probability that the term occurs in the query domain. Because the method for estimating p and q is independent of the instances of TV, it is explained later (Section 4). Each instance of TV is shown with its id-tag as follows:

- TV0: Zero model. TV = constant = 1
- TV1: Swet model (Robertson, 1990). TV = p - q
- TV2: Naive Bayes model. TV = p / q
- TV3: Bayesian classification model. TV = αp / (αp + (1 - α)q + β), where α and β are unknown constants.
- TV4: Binary independence model (Robertson and Sparck-Jones, 1976). TV = log[ p(1 - q) / (q(1 - p)) ]
- TV5: Target domain model. TV = p
- TV6: Query domain model. TV = 1 - q
- TV7: Binary model. TV = 1 if p > 0, otherwise 0
- TV8: Joint probability model. TV = p(1 - q)
- TV9: Decision theoretic model (Robertson and Sparck-Jones, 1976). TV = log(p) - log(q)

3.3 Instances of QV

We show each instance of QV with its id-tag as follows:

- QV0: Zero model. QV = constant = 1
- QV1: Approximated 2-Poisson model (Robertson and Walker, 1994). QV = tf / (tf + α), where tf is the within-query term frequency and α is an unknown constant.
- QV2: Term frequency model. QV = tf
- QV3: Term weight model. QV = weight, where weight is the retrieval weight given by the retrieval system.
- QV4: Combination of QV1 and QV3. QV = (tf / (tf + α)) × weight
- QV5: Combination of QV2 and QV3. QV = tf × weight
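As a concrete illustration of the framework, here is a minimal sketch (our own, with assumed names) that scores a candidate pool with TDV = QV × TV, instantiating QV2 (raw term frequency) and TV4 (the binary independence model), the combination that performs best in the experiments of Section 4. The epsilon clamp is our addition to keep the logarithm finite for terms with zero or saturated counts.

```python
import math

def tv4_bim(p: float, q: float, eps: float = 1e-9) -> float:
    """TV4, binary independence model: log[p(1 - q) / (q(1 - p))].

    p and q are clamped away from 0 and 1 so the log stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))

def distill(tf, p, q, n_terms=8):
    """Rank candidates by TDV = QV * TV with QV2 (QV = tf) and TV4 as TV,
    keeping the top n_terms (eight, as in the formal runs of Section 4).

    tf, p, q: dicts mapping each candidate term to its within-query term
    frequency and its target-/query-domain occurrence probabilities."""
    scored = sorted(((tf[t] * tv4_bim(p[t], q[t]), t) for t in tf),
                    reverse=True)
    return [term for _, term in scored[:n_terms]]
```

Swapping in another TV or QV instance only changes the two scoring functions; the ranking-and-cutoff skeleton stays the same.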

4 Experiments on term distillation

Using the NTCIR-3 patent retrieval test collection, we conducted experiments to evaluate the effect of term distillation. For query construction, we used only the news-article fields in the 31 topics of the formal run. The number of query terms selected by term distillation was just eight for each topic. As described in Section 2, retrieval was executed fully automatically, with pseudo-relevance feedback.

The evaluation results for some combinations of QV and TV are summarized in Table 1, where the documents judged "A" were taken as relevant. The combinations were selected on the basis of the results of our preliminary experiments. Each of "t", "i", "a" and "w" in the "p" and "q" columns denotes a method for estimating the probability p or q, as follows (a small sketch of the shared ratio appears at the end of this section):

- t: estimate p by the probability that the term occurs in patent titles. More specifically, p = n_t / N, where n_t is the number of patent titles including the term and N is the number of patents in the NTCIR-3 collection.

- i: estimate q by the probability that the term occurs in news articles. More specifically, q = n_i / N_i, where n_i is the number of articles including the term and N_i is the number of news articles in the IREX collection ('98-'99 MAINICHI news articles).

- a: estimate p by the probability that the term occurs in patent abstracts. More specifically, p = n_a / N, where n_a is the number of patent abstracts in which the term occurs.

- w: estimate q by the probability that the term occurs in whole patents. More specifically, q = n_w / N, where n_w is the number of patents in which the term occurs.

Using the combination of "a" and "w" in term distillation, we tried to approximate the difference in term statistics between patents and news articles.

  QV    TV    p   q   AveP     P@10
  QV2   TV4   t   i   0.1953   0.2645
  QV2   TV9   t   i   0.1948   0.2677
  QV5   TV3   t   i   0.1844   0.2355
  QV2   TV3   t   i   0.1843   0.2645
  QV0   TV3   t   i   0.1816   0.2452
  QV2   TV6   t   i   0.1730   0.2258
  QV2   TV2   t   i   0.1701   0.2194
  QV2   TV3   a   w   0.1694   0.2355
  QV2   TV0   -   -   0.1645   0.2226
  QV2   TV7   t   i   0.1597   0.2065

  Table 1: Results using the article field

In Table 1, the combination of QV2 and TV0 corresponds to query term extraction without term distillation. Compared with this baseline, retrieval performance is improved by every instance of TV except TV7, which means that term distillation has a positive effect. The best performance in the table is produced by the combination of QV2 (raw term frequency) and TV4 (the binary independence model). While the combination of "a" and "w" for estimating the probabilities p and q has the virtue that estimation requires only the target document collection, its performance is poor in comparison with the combination of "t" and "i". Although the instances of QV can be compared with each other by focusing on TV3, it is unclear whether QV5 is superior to QV2. We think it is necessary to proceed to an evaluation that includes the other combinations of TV and QV.
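For concreteness, here is a minimal sketch of the document-frequency ratios behind the "t"/"i" and "a"/"w" estimators above; the helper name and all counts are invented for illustration.

```python
def df_ratio(n_with_term: int, n_total: int) -> float:
    """Document-frequency ratio shared by all four estimators:
    "t": p = n_t / N    (patent titles)
    "i": q = n_i / N_i  (news articles)
    "a": p = n_a / N    (patent abstracts)
    "w": q = n_w / N    (whole patent texts)
    """
    return n_with_term / n_total

# Toy example: a technical term frequent in patent titles but rare in news
# gets a large p and a small q, hence a high TV under most instances.
p = df_ratio(n_with_term=1_200, n_total=700_000)  # "t" estimate (toy counts)
q = df_ratio(n_with_term=15, n_total=220_000)     # "i" estimate (toy counts)
```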

5 Results in the NTCIR-3 patent task

We submitted four mandatory runs. The evaluation results of our submitted runs are summarized in Table 2, where the documents judged "A" were taken as relevant. These runs were produced automatically using both the article and supplement fields, where each supplement field includes a short description of the content of the news article. Term distillation using TV3 (the Bayesian classification model) and query expansion by pseudo-relevance feedback were applied to all runs.

  QV    TV    p   q   AveP     P@10
  QV2   TV3   t   i   0.2794   0.3903
  QV0   TV3   t   i   0.2701   0.3484
  QV2   TV3   a   w   0.2688   0.3645
  QV5   TV3   t   i   0.2637   0.3613

  Table 2: Results in the NTCIR-3 patent task

The retrieval performances are remarkable among all submitted runs. However, the effect of term distillation is somewhat unclear when compared with the run using only the supplement fields in Table 3 (average precision 0.2712). We think the supplement fields supply enough terms that it is difficult to evaluate the performance of cross-database retrieval in the mandatory runs.

6 Results on ad-hoc patent retrieval

In Table 3, we show evaluation results for various combinations of topic fields; again, the documents judged "A" were taken as relevant. In the table, the fields "t", "d", "c", "n" and "s" correspond to title, description, concept, narrative and supplement, respectively.

  fields     AveP     P@10     Rret
  t,d,c      0.3262   0.4323   1197
  t,d,c,n    0.3056   0.4258   1182
  d          0.3039   0.4032   1133
  t,d        0.2801   0.3581   1100
  t,d,n      0.2753   0.4000   1140
  d,n        0.2750   0.4323   1145
  s          0.2712   0.3806    991
  t          0.1283   0.1968    893

  Table 3: Results on ad-hoc patent retrieval

The combination of "t,d,c" produced the best retrieval performance on the formal-run topics. Pseudo-relevance feedback had a positive effect except in the case using the title field only.

7 Conclusions

We proposed term distillation for cross-database retrieval. Using the NTCIR-3 test collection, we evaluated this technique in patent retrieval and found a positive effect. We think cross-database retrieval can be applied to various settings, including personalized retrieval, similar document retrieval, and so on. In future work, we hope to apply term distillation to cope with vocabulary gap problems in these new settings. In addition, we think term distillation can be used to present query terms to users in a reasonable order in interactive retrieval systems.

References

M. Iwayama, A. Fujii, A. Takano, and N. Kando. 2001. Patent retrieval challenge in NTCIR-3. IPSJ SIG Notes, 2001-FI-63:49-56.

Y. Ogawa and H. Mano. 2001. RICOH at NTCIR-2. Proc. of NTCIR Workshop 2 Meeting, pages 121-123.

Y. Ogawa and T. Matsuda. 1999. An efficient document retrieval method using n-gram indexing. Trans. of IEICE, J82-D-I(1):121-129.

S. E. Robertson and K. Sparck-Jones. 1976. Relevance weighting of search terms. Journal of ASIS, 27:129-146.

S. E. Robertson and S. Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. Proc. of 17th ACM SIGIR Conf., pages 232-241.

S. E. Robertson and S. Walker. 1997. On relevance weights with little relevance information. Proc. of 20th ACM SIGIR Conf., pages 16-24.

S. E. Robertson. 1990. On term selection for query expansion. Journal of Documentation, 46(4):359-364.