Analyzing Document Retrievability in Patent Retrieval Settings


Shariq Bashir and Andreas Rauber
Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria
{bashir,rauber}@ifs.tuwien.ac.at
http://www.ifs.tuwien.ac.at

Abstract. Most information retrieval settings, such as web search, are typically precision-oriented, i.e. they focus on retrieving a small number of highly relevant documents. However, in specific domains, such as patent retrieval or law, recall becomes more relevant than precision: in these cases the goal is to find all relevant documents, requiring algorithms to be tuned more towards recall at the cost of precision. This raises important questions with respect to retrievability and search engine bias: depending on how the similarity between a query and documents is measured, certain documents may be more or less retrievable in certain systems, up to some documents not being retrievable at all within common threshold settings. Biases may be oriented towards popularity of documents (increasing weight of references) or towards length of documents, may favour the use of rare or common words, or may rely on structural information such as metadata or headings. Existing accessibility measurement techniques are limited in that they measure retrievability with respect to all possible queries. In this paper, we improve accessibility measurement by considering sets of relevant and irrelevant queries for each document. This simulates how recall-oriented users create their queries when searching for relevant information. We evaluate retrievability scores using a corpus of patents from the US Patent and Trademark Office.

1 Introduction

In several information retrieval applications, such as web search, e-commerce, scientific literature, and patent applications, growing emphasis is put on the measurement of accessibility and retrievability of documents given an underlying information retrieval system [1,2].
In recent years, measurement concepts like document retrievability, searchability and findability have emerged [1]. These concepts measure how retrievable each individual document is in the retrieval system, i.e. how likely it is that a document can be found at all given a specific set of queries. Any retrieval system is inherently biased towards certain document characteristics. This results in the risk that a certain number of documents cannot be found in the top-n ranked results via any query terms that they would actually be relevant for, which ultimately decreases the usability of the retrieval system [10]. This is specifically critical in recall-oriented application scenarios, such as patent retrieval or legal settings. In these cases, the focus of a system

S.S. Bhowmick, J. Küng, and R. Wagner (Eds.): DEXA 2009, LNCS 5690, pp. 753–760, 2009. © Springer-Verlag Berlin Heidelberg 2009

is not so much on providing the best document to answer a specific information need (as e.g. in web search settings), but on retrieving all documents that are relevant [9]. Thus, all documents should at least potentially be retrievable via correct query terms. In recent years, emphasis has been put on designing retrieval systems for recall-oriented tasks such as patent or legal document search [4,9]. Before designing a new retrieval system, or using an existing one, for recall-oriented applications, one needs to analyze the effects of the retrieval system's bias as well as the overall retrievability of all documents in the collection using the retrieval function at hand.

In this paper, we take a closer look at document retrievability measurements, particularly for patent retrieval applications. Sections 2.1 and 2.2 introduce both the standard way of measuring retrievability and three novel, more fine-grained measures for assessing retrievability. Section 2.3 explains how queries are constructed, forming the basis for the experiments reported in this paper. Section 3 presents the experiments performed on the dentistry category of the US Patent and Trademark Office database, with conclusions as well as an outlook on future work being provided in Section 4.

2 Measuring Retrievability

2.1 Standard Retrievability Measurement

Given a retrieval system RS and a collection of documents D, the concept of retrievability [1,2] is to measure how retrievable each and every document d ∈ D is in the top-n ranked results of all queries, if RS is presented with a large set of queries q ∈ Q. Defined in this way, the retrievability of a document is essentially a cumulative score that is proportional to the number of times the document can be retrieved within a cut-off c over the set Q. A retrieval system provides the best retrievability if each document d ∈ D has nearly the same retrievability score. More formally, retrievability r(d) of d ∈ D can be defined as follows.
r(d) = \sum_{q \in Q} f(k_{dq}, c)    (1)

Here, f(k_{dq}, c) is a generalized utility/cost function, where k_{dq} is the rank of d in the result set of query q, and c denotes the maximum rank that a user is willing to proceed down the ranked list. The function f(k_{dq}, c) returns a value of 1 if k_{dq} ≤ c, and 0 otherwise. The work of Azzopardi and Vinay [1] is pioneering in this regard. In their experiments using collections of news and government web documents, they analyze document retrievability, differentiating between highly retrievable and less retrievable documents.

2.2 Limitations of the Standard Retrievability Measure

In this paper we argue that analyzing document retrievability using a single retrievability measure [1,2] has several limitations in terms of interpretability. For
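Equation (1) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation; the `retrieve(q)` function, which returns a ranked list of document ids for a query, is a hypothetical stand-in for the underlying retrieval system:

```python
from collections import Counter

def retrievability(queries, retrieve, c=35):
    """Compute r(d) for every document: the number of queries for which
    d appears within the top-c ranked results (Equation 1)."""
    r = Counter()
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q), start=1):
            if rank > c:          # f(k_dq, c) = 0 beyond the cut-off
                break
            r[doc_id] += 1        # f(k_dq, c) = 1 within the cut-off
    return r

# Toy example with a stubbed-out retrieval function:
index = {"q1": ["d1", "d2"], "q2": ["d2", "d3"], "q3": ["d2"]}
scores = retrievability(index.keys(), lambda q: index[q], c=1)
# With c=1 only the top-ranked document of each query scores:
# d1 is first for q1, d2 is first for q2 and q3.
```

A uniform r(d) across all d would indicate an unbiased system; in practice the distribution is highly skewed, which is exactly what the measurements below examine.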

example, when using a single retrievability curve we cannot accurately analyze how large a gap exists between an optimally retrievable system and the current system, or what the effect of the query set used for retrievability measurement is. Other issues to be analyzed include whether highly retrievable documents are really highly retrievable, or whether they are simply more accessible from many irrelevant queries rather than from relevant queries. Motivated by these limitations of existing retrievability measurement, the focus of our paper lies in understanding the following aspects. We identify four retrievability measurements rather than using just a single descriptor:

1. How retrievable each document is using all queries, as done in [1];
2. From how many relevant queries out of all queries each document is retrievable;
3. From how many irrelevant queries out of all queries each document d ∈ D is retrievable; and
4. What the total number of relevant queries for each document is.

The last measure provides an upper bound for by how much we can increase the retrievability score of all documents. The second-to-last indicates where we can decrease the relevance score of highly retrievable documents and thus potentially increase the relevance score of the other documents. In [1], the queries used for retrievability measurements were selected using a sampling approach [3] without considering what type of queries are relevant and irrelevant to individual documents. From our experiments we learned that there is a significant difference in retrievability depending on whether queries are selected randomly or whether relevant and irrelevant queries are considered separately.
2.3 Query Generation Techniques

Clearly, it is impractical to calculate the absolute r(d) scores, because the set Q would be extremely large and would require a significant amount of computation time, as each query would have to be issued against the index for a given retrieval system. So, in order to perform the measurements in a practical way, a subset of all possible queries is commonly used that is sufficiently large and contains relatively probable queries. For generating reproducible and theoretically consistent queries, we try to reflect the way patent examiners generate query sets in patent invalidity search problems [5,8]. In invalidity search, the examiners have to find all existing patent specifications that describe the same invention, collecting claims to invalidate a particular patent. In this search process, the examiners extract relevant query terms from a new patent application, particularly from the Claim sections, for creating query sets [6,7]. We first extract all frequent terms that are present in the Claim sections of each patent document and have a support greater than a certain threshold. Then, we combine the single frequent terms of each individual patent document into two- and three-term combinations. After creating the query set, the individual queries of Q whose terms appear in the claim section of a patent document d ∈ D are separated out as its relevant queries ˆQ, and all queries whose terms do not exist in the claim section of d represent its irrelevant queries.
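The per-document query generation step described above can be sketched as follows. This is a simplified illustration under our own assumptions: claim sections are taken as already tokenized, and the function name, support threshold, and per-document cap are ours, not the paper's exact values:

```python
from collections import Counter
from itertools import combinations

def generate_queries(claim_tokens, min_support=2, max_queries=200):
    """Build 1-, 2- and 3-term queries from a document's Claim section:
    keep terms whose frequency exceeds a minimum support threshold, then
    expand them into two- and three-term combinations."""
    counts = Counter(claim_tokens)
    frequent = sorted(t for t, n in counts.items() if n > min_support)
    queries = list(frequent)                      # single-term queries
    for size in (2, 3):                           # two- and three-term expansion
        for combo in combinations(frequent, size):
            queries.append(" ".join(combo))
    return queries[:max_queries]                  # cap queries per document

claims = ["implant"] * 3 + ["screw"] * 3 + ["cap"]
print(generate_queries(claims))  # → ['implant', 'screw', 'implant screw']
```

Capping the number of queries per document, as the paper does, keeps claim-heavy patents from dominating the global query set.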

2.4 Retrievability Measurement Using Relevant Queries

For analyzing the above factors, in this paper we conduct our retrievability measurements considering relevant queries for each document. In our approach, the set of relevant queries q ∈ ˆQ for each document contains those terms, or combinations of terms, that are considered most important for an individual document's accessibility. In our measurements, as a first step, we extract all possible relevant queries from the Claim sections of every document. The number of queries in ˆQ can be considered an upper bound for the retrievability score; if a document exhibits a much lower retrievability value than this bound, it is called less retrievable. In step two, the relevant queries of all documents are used for constructing a single query set Q. In step three, using the query sets Q and ˆQ, retrievability measurements are computed for every document according to Equation (1). The difference between a document's retrievability in set Q and its retrievability in set ˆQ helps in determining the main cause behind low retrievability. It also identifies the list of queries where we may be able to decrease the relevance of those documents which are wrongly listed in the top-ranked result set, in order to increase the relevance of less retrievable documents. In short, rather than analyzing document retrievability from a single perspective, we analyze retrievability using four factors: (a) document retrievability r(d) in query set Q (Equation 1); (b) document retrievability ˆr(d) in the relevant queries ˆQ (Equation 2); (c) document retrievability in the irrelevant queries r̄(d) (Equation 3); and (d) the total number of relevant queries ˆQ for each document.

\hat{r}(d) = \sum_{q \in \hat{Q}} f(k_{dq}, c)    (2)

\bar{r}(d) = \sum_{q \in Q} f(k_{dq}, c) - \sum_{q \in \hat{Q}} f(k_{dq}, c)    (3)

3 Experiments

3.1 Experiment Set-Up

For our experiments we use a collection of patents freely available from the US Patent and Trademark Office, downloaded from http://www.uspto.gov/.
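The three retrievability variants (Equations 1–3) can be computed in a single pass once each document's relevant query set is known. A minimal sketch under our own assumptions: `retrieve(q)` returns a ranked list of document ids, and `relevant[d]` is the relevant query set ˆQ of document d (both names are ours):

```python
def retrievability_measures(queries, retrieve, relevant, c=35):
    """For each document, compute r(d) over all queries (Eq. 1),
    r_hat(d) over its relevant queries (Eq. 2), and
    r_bar(d) = r(d) - r_hat(d) over irrelevant queries (Eq. 3)."""
    r, r_hat = {}, {}
    for q in queries:
        for rank, d in enumerate(retrieve(q), start=1):
            if rank > c:                       # beyond the cut-off: f = 0
                break
            r[d] = r.get(d, 0) + 1             # counts toward Equation (1)
            if q in relevant.get(d, set()):
                r_hat[d] = r_hat.get(d, 0) + 1  # counts toward Equation (2)
    r_bar = {d: r[d] - r_hat.get(d, 0) for d in r}  # Equation (3)
    return r, r_hat, r_bar

# Toy example: d2 is retrieved twice overall but only once relevantly.
index = {"a": ["d1", "d2"], "b": ["d2"]}
r, r_hat, r_bar = retrievability_measures(
    ["a", "b"], lambda q: index[q], {"d1": {"a"}, "d2": {"b"}})
```

Computing r̄(d) as a difference rather than with a second retrieval pass mirrors Equation (3) and avoids issuing each query twice.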
We collected all patents that are listed under United States Patent Classification (USPC) class 433 (Dentistry). For query generation we consider only the Claim section of every document, as this is the section that most professional patent searchers use as their basis for query formulation. For retrieval, however, we index the full text of all documents (Title, Abstract, Claim, Description). This reflects the default setting in a standard full-text retrieval engine. Some basic statistical properties of the data collection used are listed in Table 1. Four standard IR models are used for evaluating the retrievability bias. These are tf-idf, the Okapi retrieval function (BM25) [11], the Okapi field retrieval

function (BM25F) [12], and the exact match model. Before indexing, we remove stop words and apply stemming. For indexing and querying we use the Apache Lucene IR toolkit (http://lucene.apache.org/java/docs/).

Table 1. Properties of the Patent Collection used for Retrievability Measurements

Total Documents | Unique Terms | Average Doc. Length (words) | Average Title Section Length (words) | Average Abstract Section Length (words) | Average Claim Section Length (words) | Average Description Section Length (words)
723 | 62343 | 2888 | 7.65 | 35.32 | 878.5 | 2234.5

Table 2. Query Collection Approaches

Approach | Total Queries | Average Retrievability Score | Average Relevant Queries/Document
Query Expansion (2-Terms) | 67735 | 37.6 | 35
Query Expansion (3-Terms) | 337200 | 248 | 50

Each measurement graph depicts the four document retrievability indicators (cf. Section 2.4): (1) document retrievability across all queries, (2) document retrievability via the relevant query set, (3) document retrievability in the irrelevant query set, and (4) the total number of relevant queries for each document in the collection, which correlates with the length of the respective Claim section.

In the query generation approach we select all single terms which are present in the Claim section of every document and have a term frequency greater than a minimum support threshold of 2. There are a total of 9,175 single-term queries extracted from all documents in the collection, with an average of 25 terms per patent document. For creating longer queries, we expand the single-term queries into two- and three-term combinations, again extracted from the Claim sections. For documents which contain a large number of single frequent terms, the number of different co-occurring term combinations of size two and three can become very large. Therefore, to generate a similar number of queries for every document, we put an upper bound of 200 on the queries generated per patent document.
On average there are 35 queries per document with two-term combinations, and 50 queries with three-term combinations. For generating the complete query set Q we remove all duplicate queries which are present in multiple documents. After generating these query sets for retrievability measurement, they were subdivided into relevant and irrelevant query sets for each document, depending on whether the query terms originated from the respective document. Table 2 shows the main properties of these query sets. For all experiments, the cut-off factor c is set to the top-35 documents in the ranked list, following the experiment set-up in [1].

3.2 Retrievability Results

Figures 1 and 2 show retrievability measurements on different types of query sets for the four different retrieval models. Following the presentation in [1], documents are sorted in ascending order of overall retrievability. From all
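The construction of the deduplicated global query set Q and the per-document relevant sets described above can be sketched as follows (a small illustration with hypothetical helper names; the irrelevant set of a document is then simply Q minus its relevant set):

```python
def build_query_sets(doc_queries):
    """Merge per-document query lists into one deduplicated global set Q,
    keeping each document's own queries as its relevant set Q_hat."""
    Q = set()
    relevant = {}
    for doc_id, queries in doc_queries.items():
        relevant[doc_id] = set(queries)   # queries originating from this document
        Q.update(queries)                 # duplicates across documents collapse
    return Q, relevant

Q, rel = build_query_sets({"d1": ["crown", "crown implant"],
                           "d2": ["crown", "screw"]})
# "crown" occurs in both documents but appears only once in Q;
# for d1, "screw" belongs to its irrelevant queries Q - rel["d1"].
```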

[Figure omitted: four log-scale panels (BM25, BM25F, TF-IDF, Exact Model), each plotting per document: retrievability in all queries, retrievability in relevant queries, retrievability in irrelevant queries, and total relevant queries.]
Fig. 1. Retrievability Measurement Results using the Two Pairs Query Expansion Approach

graphs it is clear that there is a large difference in overall document retrievability scores between less and highly retrievable documents when using all queries (blue line / square symbols). This effect increases as the size of the query set Q increases, specifically with the two- and three-term query expansion approaches. When using only the set of relevant queries per document (purple line / rhombus symbols), retrievability is almost constant, irrespective of document length or of the overall retrievability of documents across all queries. In most cases an overall high retrievability score is owed to high retrievability of documents via irrelevant queries (light blue line / triangular symbols). This means that most documents are frequently retrieved not because of strong matches in the Claim section, as would be desired by patent retrieval experts, but via matches in other sections of the patent, at the cost of missing relevant matches in the Claim section for other documents (Figures 1 and 2). Due to the bias of the given retrieval models, highly retrievable documents are not really highly accessible via their relevant queries, but instead decrease the accessibility of other documents.

The measurements depicted in Figures 1 and 2 show that there is sufficient room for improving the retrievability of less retrievable documents based on the number of potentially relevant queries (green line / circular symbols), which is

[Figure omitted: four log-scale panels (BM25, BM25F, TF-IDF, Exact Model), each plotting per document: retrievability in all queries, retrievability in relevant queries, retrievability in irrelevant queries, and total relevant queries.]
Fig. 2. Retrievability Measurement Results using the Three Pairs Query Expansion Approach

closer to the retrievability values on relevant queries for the highly retrievable documents on the right side of the graphs. When comparing different retrieval models, we see that the exact match model shows the worst performance across all query generation approaches. There are very few documents which are retrieved for almost all of their relevant queries (the optimal case). On the other hand, 22% of the documents cannot be retrieved by any of the queries they would be relevant for, the result sets there being dominated by documents that are irrelevant in terms of query term presence in the Claim section. There is little difference between the BM25 and BM25F (which considers individual sections) retrieval models; on almost all measurements they show comparable performance.

4 Conclusions

We use retrieval systems in order to access information. Therefore, it is important to measure how much different retrieval systems restrict us in accessing different information. Document retrievability is a measurement used for this purpose, in order to analyze how much a given retrieval system makes individual documents

in a collection easier to find, ranked within the top-n results. Existing document retrievability (findability) measurement techniques, which measure document accessibility with a single-factor analysis, are not suitable for understanding the complex aspects involved in document retrievability. In this paper, we evaluate document retrievability by considering retrieval both for relevant and for irrelevant query sets. Rather than taking random queries, we first model how expert users formulate their queries, identifying the Claim section as the relevant source of query terms. Extensive experiments reveal that 90% of the documents which are highly retrievable considering all types of queries are not highly retrievable on their relevant query sets. Furthermore, retrievability is rather constant across all documents when considering only relevant queries, as opposed to the rather large differences encountered when considering all potential queries. The number of relevant queries may also serve as a kind of upper bound on the retrievability performance of every document. Further analysis is required to understand the effect of different query selection approaches and query expansion techniques, as well as the characteristics that make documents more or less retrievable under certain systems.

References

1. Azzopardi, L., Vinay, V.: Retrievability: An evaluation measure for higher order information access tasks. In: Proc. of CIKM 2008, Napa Valley, CA, USA, pp. 561–570 (2008)
2. Azzopardi, L., Vinay, V.: Accessibility in Information Retrieval. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 482–489. Springer, Heidelberg (2008)
3. Callan, J., Connell, M.: Query-based sampling of text databases. ACM Transactions on Information Systems 19(2), 97–130 (2001)
4. Fujii, A., Iwayama, M., Kando, N.: Introduction to the special issue on patent processing.
Information Processing and Management: an International Journal 43(5), 1149–1153 (2007)
5. Fujita, S.: Technology survey and invalidity search: A comparative study of different tasks for Japanese patent document retrieval. Information Processing and Management: an International Journal 43(5), 1154–1172 (2007)
6. Itoh, H., Mano, H., Ogawa, Y.: Term distillation in patent retrieval. In: Proc. of ACL 2003, Sapporo, Japan, pp. 41–45 (2003)
7. Konishi, K., Kitauchi, A., Takaki, T.: Invalidity patent search system at NTT Data. In: Proceedings of the NTCIR-4 Workshop Meeting, Tokyo, Japan (2004)
8. Konishi, K.: Query terms extraction from patent document for invalidity search. In: Proceedings of the NTCIR-5 Workshop Meeting, Tokyo, Japan (2005)
9. Kontostathis, A., Kulp, S.: The effect of normalization when recall really matters. In: Proc. of IKE 2008, Las Vegas, Nevada, USA, pp. 96–101 (2008)
10. Baeza-Yates, R.: Applications of web query mining. In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005. LNCS, vol. 3408, pp. 7–22. Springer, Heidelberg (2005)
11. Robertson, S., Walker, S.: Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In: Proc. of SIGIR 1994, Dublin, Ireland, pp. 345–354 (1994)
12. Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 extension to multiple weighted fields. In: Proc. of CIKM 2004, Washington, D.C., USA, pp. 42–49 (2004)