A Patent Retrieval and Visualization Case Study

COMP8755 Individual Computing Project
Honggu Lin
May 24,

Abstract

Most Information Retrieval (IR) tasks, such as web search and news search, are dedicated to improving precision. However, some other IR tasks, such as patent search and legal search, value recall more than precision. The users of patent search are usually professional patent analysts who are willing to check hundreds or even thousands of patent documents to make sure there is no infringement of intellectual property rights. The main concerns of these recall-oriented tasks differ greatly from those of precision-oriented tasks, whose users tend to focus only on a limited number of top-ranked results. In this research project, we explore the distinguishing features of patent prior art retrieval compared to generic information retrieval. Then we build a baseline patent retrieval system on the CLEF-IP 2010 patent data collection using the vector space retrieval model. Moreover, we experiment with several query formulation methods with the aim of improving the baseline system. Finally, we build a website to visualize the retrieval results with the goal of improving access to patent prior art information.

Acknowledgment

First, I would like to express my sincere gratitude to my project supervisor Gabriela Ferraro for her continuous support during this project. She always gave me good advice whenever I got stuck and steered me in the right direction whenever she thought I needed it. Besides my supervisor, I would like to thank my formal supervisor and project examiner Hanna Suominen, who was involved in the examination of this research project, for her passionate participation and the time she devoted to it. I would also like to thank my course convenor Peter Strazdins for his great advice for this course and for his insightful comments and hard questions during the project presentation. Finally, I am grateful to my family and my friends for providing me with reliable support and continuous encouragement throughout my student career and through the process of researching and writing this thesis.

Contents

Abstract
Acknowledgement
1 Introduction
    Motivation
    Objectives of this Study
    Structure Of The Thesis
2 Background
    Patents
    Overview Of An Information Retrieval System
    Retrieval Model
    Query Formulation
    Query Reduction
    Query Expansion
    Prior work in Patent Retrieval and Visualization
3 Materials and Evaluation Metrics
    Data Collection
    Evaluation Metrics
    Precision and Recall
    Patent Retrieval Evaluation Score (PRES)
    Mean Average Precision (MAP)
4 Improving Patent Prior Art Search with Query Reduction and Query Structure Formulation
    Patent Retrieval System Framework
    Data Preprocessing
    Indexing
    Query Reduction Via Term Selection
    Global Frequency-Based Term Selection
    Local Frequency-Based Term Selection
    Query Key-Phrase Selection
    Query Structure Formulation
    Compound Query
    Common Term/Phrase Weight Assignment
5 Experiments
    Baseline definition
    Retrieval Experimental Results and Analysis
    Visualization Experiment
    Database Setup
    Web Framework
    Visualization Results
6 Conclusion
    What I learned during this project
    Future Work

1 Introduction

Figure 1: Comparison Between Web Search And Prior Art Search

1.1 Motivation

A patent is a legal right granted by a patent office to an inventor or assignee for a device, substance, method or process that is new, useful and inventive (IP Australia 2018). This right protects the commercial interest of the inventor or assignee during the life of the patent (WIPO 2018b). Patents, therefore, as a kind of intellectual property, have a great impact on an enterprise's market value (IP Australia 2018). With the number of patent applications rising every year, a system that can accurately and efficiently return all patents potentially relevant to a patent application is becoming increasingly necessary. With the help of such a system, patent analysts, whose duty is to prevent a new patent application from infringing existing intellectual property, can find all relevant patents more efficiently and more accurately. Other users, such as inventors and patent lawyers, can also use the system to check the novelty of an application during the patent application process.

There are great differences between prior art search and the standard web search that we are familiar with (Far et al. 2015). Figure 1 illustrates three main differences between them. The main objective of prior art search is to find all the relevant documents given a patent application as a query. Thus, its evaluation focuses more on recall. The returned result list can contain hundreds or even thousands of documents, because the users of prior art search are usually professional patent analysts who are willing to work through a long result list to find any possibly relevant document. In contrast, the major goal of standard web search is to find one or a few highly relevant documents to explore, and its main evaluation principle is precision. It is widely used by people who want to acquire the information they need as quickly as possible.

Another main difference between prior art search and web search is the content of the query. A web search query is usually very short and contains only a few keywords, while a prior art search query is a whole patent document, which makes query formulation far more difficult than in general web search.

1.2 Objectives of this Study

There are three objectives of this research project:

Objective 1: build a baseline patent prior art retrieval system
Objective 2: improve the baseline system by applying query formulation methods
Objective 3: experiment with visualization strategies with the aim of improving information access to patent prior art search content.

In order to achieve these three objectives, we implement several query formulation methods and analyze their retrieval results to see whether the methods work and how to combine them to formulate the query that gives the best results.

1.3 Structure Of The Thesis

This chapter introduces the research problem of the thesis and analyzes the differences between prior art search and general web search. Chapter 2 provides the background knowledge related to generic information retrieval and patent retrieval. We also review previous work on existing query formulation techniques in that chapter. In Chapter 3, we introduce the data and the evaluation metrics used in our experiments. Our main experiment is described in Chapters 4 and 5. Chapter 4 explains the query formulation techniques used in our improved patent retrieval system. The experimental setting and the baseline system are described in Chapter 5, which also covers the experimental results and their analysis. Chapter 6 concludes the thesis by summarizing the results and observations and by proposing possible directions for future work.

2 Background

This chapter first explains the structure of patents and then introduces information retrieval systems and query formulation. Moreover, we present and discuss previous work on patent prior art search and on how the patent structure has been exploited to improve retrieval.

2.1 Patents

A patent is a structured document that carries a legal right for a new invention. It contains specific sections, such as title, abstract, description, and claims, that define the protected invention.

Patent Classification

A patent must pass through several versions before it becomes a granted patent. The initial version that the inventor submits to the patent office for novelty checking is called a patent application. The two-character code at the end of the ID of a patent application starts with A. A patent application may be updated several times before it is finally accepted as a granted patent or rejected. The ID of a granted patent ends with a two-character code that starts with B.
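The distinction between application and granted versions can be read directly from the ID. The following is a minimal sketch, assuming the EPO ID layout used in this thesis ("EP", seven digits, then a two-character code starting with A or B). The function name `classify_patent_id`, the tolerance for hyphens, and the restriction of the second character to a digit are illustrative assumptions, not part of the CLEF-IP tooling.

```python
import re

# Assumed layout: "EP", seven digits, a two-character code (A or B plus a digit).
# Hyphens between the parts are tolerated as a convenience (an assumption).
PATENT_ID = re.compile(r"^EP-?(\d{7})-?([AB]\d)$")


def classify_patent_id(patent_id):
    """Split an ID into (patent number, version code, stage)."""
    match = PATENT_ID.match(patent_id.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised EP patent ID: {patent_id!r}")
    number, kind = match.groups()
    # Codes starting with A mark applications, codes starting with B granted patents.
    stage = "application" if kind.startswith("A") else "granted"
    return "EP" + number, kind, stage


if __name__ == "__main__":
    print(classify_patent_id("EP1234567A1"))
    print(classify_patent_id("EP1234567B1"))
```

The made-up ID `EP1234567` is used only for illustration; real IDs come from the patent documents themselves.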

Figure 2: A sample XML patent document from EPO

Patent Structure

Different patent offices have different patent structure requirements. There are many patent offices across the world, such as the United States Patent and Trademark Office (USPTO), the European Patent Office (EPO), and the Japan Patent Office (JPO). The dataset we use in this experiment is from the EPO, so we explain the EPO patent structure in more detail. Figure 2 is a sample XML patent document from the EPO. As Figure 2 shows, an EPO patent is a structured document consisting of several sections, such as ID, abstract, description, claims, and bibliography-data. The sections of EPO patents that are commonly used in a patent retrieval system are as follows:

ID: a unique identifier for EPO patents, formed by the string "EP" followed by 7 digits and a two-character version string. A version string starting with "A" stands for a patent application, and one starting with "B" stands for a granted patent.

Abstract: a short summary of the patent in three languages, English (EN), German (DE) and French (FR). This section does not always exist because it is optional in EPO patents.

Description: the description section is the core of the invention in an EPO patent (Walid Magdy 2012). All the technical details of the patent are contained in the description section. It consists of several paragraphs, each of which describes one aspect of the invention in detail. The description section can contain tables, experiments on the performance of the invention, and descriptions of figures relating to the invention. The first paragraph of the description usually contains information about the topical field of the invention. The description text also contains references to other patent documents, which are very important information that patent analysts examine to measure the contribution of the invention against prior art.
Claims: the claims section of the patent document lists the aspects of the invention that the patent is going to protect. A successful patent does not have to have all its claims accepted, but at least one of them must be (Walid Magdy 2012). The examination can lead to some of the claims being dropped by showing that they are not novel. This usually happens because patent applicants try to generalize their invention as much as possible, which can lead to some of the very general claims being found to lack novelty.

Invention-title: the title of the patent, presented in three languages, English (EN), German (DE) and French (FR), in the bibliography-data section.

Patent Classification Code: patent classification schemes are used to organize and index the technical content of patent specifications so that specifications on a specific topic can be identified easily and accurately (The British Library Board 2017). There are several patent classification schemes, including the International Patent Classification (IPC), ECLA, the US classification, and the British classification. The IPC is widely used around the world and is used by the European Patent Office (EPO), from which our experimental data was obtained.

Figure 3: Complete Classification Symbol

The International Patent Classification (IPC) provides a hierarchical system of language-independent symbols for the classification of patents and utility models according to the different areas of technology to which they pertain (WIPO 2018c). The IPC divides technology into eight main sections with approximately 70,000 subdivisions. Each subdivision has a symbol consisting of Arabic numerals and letters of the Latin alphabet (WIPO 2018a). The IPC is revised once a year to keep it up to date (WIPO 2018c). Figure 3 illustrates the components of an IPC classification symbol and gives two examples of IPC codes in patents.

2.2 Overview Of An Information Retrieval System

The overall process of an Information Retrieval (IR) system is illustrated in Figure 4. On the collection side, each document in the data collection is indexed before it can be searched. The user formulates a query from the available information and submits the formulated query to the IR system. In the matching process, the query and the document representations are compared using a retrieval model, and the result is a ranked list of documents. The returned ranked list can either be the final result list or be used as feedback and passed to the query formulation module to reformulate the query.

Retrieval Model

Vector Space Model

In a vector space model, documents and queries are represented by vectors of term weights, and the collection is represented by a matrix of term weights as follows:

A document term weight vector:

$$D_i = [d_{i1}, d_{i2}, d_{i3}, \ldots, d_{im}]$$

A query term weight vector:

$$Q_j = [q_{j1}, q_{j2}, q_{j3}, \ldots, q_{jm}]$$

Figure 4: The overall Process Of A Prior Art Information Retrieval System

A document collection term weight matrix:

$$D = \begin{bmatrix} d_{11} & d_{12} & d_{13} & \cdots & d_{1m} \\ d_{21} & d_{22} & d_{23} & \cdots & d_{2m} \\ d_{31} & d_{32} & d_{33} & \cdots & d_{3m} \\ \vdots & \vdots & \vdots & & \vdots \\ d_{N1} & d_{N2} & d_{N3} & \cdots & d_{Nm} \end{bmatrix}$$

where $D_i$ is a document in the collection $D$, $d_{ik}$ is the weight of term $t_k$ in the document $D_i$, and $q_{jk}$ is the entry for term $t_k$ in the query $Q_j$. The indexed collection is represented by the $N \times m$ matrix $D$, where $N$ is the number of documents in the indexed collection and $m$ is the number of unique terms in the collection. If a term does not appear in a document, the weight for that term is zero.

The TFIDF weight of a term in a document, shown in Equation 2, is calculated by multiplying the term frequency (tf) of the term in that document by the inverse document frequency (idf) of the term in the collection.

$$idf(t_k) = \log \frac{N + 1}{df(t_k)} \qquad (1)$$

$$TFIDF(t_k, D_i) = tf(t_k, D_i) \cdot idf(t_k) \qquad (2)$$

where $tf(t_k, D_i)$ is the number of occurrences of the term $t_k$ in the document $D_i$, and $df(t_k)$ is the number of documents in the collection $D$ that contain at least one occurrence of the term $t_k$. Given a query $Q$, documents are ranked by the overlap score measure given in Equation 3:

$$TFIDF(Q, D_i) = \sum_{q \in Q \cap D_i} \frac{TFIDF(q, D_i)}{(1 - b) + b \, \frac{|D_i|}{avdl}} \qquad (3)$$

where $|D_i|$ is the length of document $D_i$, $avdl$ is the average document length, and $b$ is a parameter that can be adjusted by the user and has a default value.

Query Formulation

Query formulation is the process during which the original keyword query issued by the user is transformed into a structured query representation that is used by the search engine (M.Malathi 2013). The main goal of query formulation is to improve the overall quality of the ranking presented to the user in response to their query (M.Malathi 2013). Query formulation can generally be divided into two main processing stages.

The first processing stage is query refinement or reformulation, which alters the query on the morphological level (M.Malathi 2013). Query term processing starts with tokenization, which splits a character sequence into word tokens. Normalization is then applied to the tokens to map document text and query terms to the same form. If different forms of the same root should match, stemming is added as a third subprocess. Stop word removal and spelling correction are also commonly used query term processing methods. After query term processing, we can apply query reduction and/or query expansion to the query terms for query refinement. These two seemingly incompatible approaches can both improve document retrieval performance. We discuss them in detail in the next two sections.

The second processing stage alters the query on the structural level and is performed after the query refinement stage is completed. The structural alterations may include, among other actions, segmenting the query into atomic concepts, assigning weights to these concepts, or expanding the query with related weighted concepts (M.Malathi 2013).

Query Reduction

Query reduction reduces the length of the query. It is widely used in patent prior art search because a patent retrieval query is as long as a whole patent document.

There are three main methods to reduce the query.

Query Summarization

(Mahdabi et al. 2011) utilize a known text summarization technique called TextTiling to summarize patent documents. The summary-based query is intended to capture the main topic of the document as well as its most important subtopics, and to discard subtopics that are only marginally discussed in the patent document.

Query Segmentation

(Hearst n.d.) introduces a technique called TextTiling for subdividing texts into multi-paragraph units that represent different subtopics. Multi-paragraph subtopic segmentation can be used for patent query segmentation: each patent query is segmented into several sub-queries that represent different subtopics, each sub-query is searched separately, and the resulting lists are finally merged into one result list that represents the result of the original query.

Query Term Selection

Query term selection removes noise terms that may harm retrieval performance from a query, or selects informative terms from the long query and uses only these selected terms as the query. Words with a high document frequency, that is, words that appear in many documents of the data collection, are considered stop words, such as "you", "and", and "this". There are several methods to identify stop words. Using a common-language stop word list compiled by experts is a frequently used method. However, such a list is not based on the data set in use. For example, "system" and "machine" can be considered stop words in a patent data collection, although they are not contained in common stop word lists. Thus, another approach, which yields stop words related to the data set in use, is to obtain the document frequency of each individual word in the data set and to treat the words whose document frequency falls within a specified top percentage as stop words (Corremans. and G. 2000).

In cross-database retrieval, query terms can be weighted based on their perceived significance in the target corpus, combined with their significance in the query (Hideo Itoh 2003). Since the domain of the queries differs from that of the retrieval target in the distribution of term occurrences, using the distribution of only one corpus can cause incorrect term weighting. Thus, in this experiment, the document frequency of a query term is obtained from the target data collection and the term frequency of a query term is obtained from the query document. The term frequency is multiplied by the inverse document frequency to produce the query term weight used for query term selection.

Query Expansion

In most collections, one concept can be represented by different words. This phenomenon, known as synonymy, has an impact on the recall of most information retrieval (IR) systems (Christopher D. Manning and Schütze 2008). The methods for tackling this problem fall into two major classes: global methods and local methods.

Global methods are techniques for expanding or reformulating query terms independently of the query and the results returned for it. They use prior knowledge to find terms that are semantically similar to the query words (Christopher D. Manning and Schütze 2008). Global methods include:

Query expansion/reformulation with a thesaurus or WordNet. We can obtain the synonyms of a term by using a controlled vocabulary that is maintained by human editors. One popular synonym resource is WordNet, which is available through NLTK.

Query expansion via automatic thesaurus generation. We can attempt to generate a thesaurus automatically by analyzing a collection of documents. There are two main approaches. One is simply to exploit word co-occurrence statistics. We believe that words that co-occur in a document or paragraph are likely to be semantically similar to each other (Christopher D. Manning and Schütze 2008), so counting co-occurrences in the text is a simple way to find the most similar words. The other approach is to use a shallow grammatical analysis of the text and to exploit grammatical relations or grammatical dependencies (Christopher D. Manning and Schütze 2008).
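To make the co-occurrence approach concrete, the sketch below counts how often two terms share a document and expands each query term with its most frequent companions. It is a toy illustration under stated assumptions: documents are already tokenized lists of strings, co-occurrence is counted at document level, and the names `build_cooccurrence`, `expand_query` and the parameter `per_term` are hypothetical. It is not the method evaluated in this thesis.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(documents):
    """Count, for every pair of distinct terms, the documents containing both."""
    cooc = defaultdict(Counter)
    for tokens in documents:
        # Use the set of terms so a pair is counted once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def expand_query(query_terms, cooc, per_term=2):
    """Append up to `per_term` most frequent co-occurring terms per query term."""
    expanded = list(query_terms)
    for term in query_terms:
        for candidate, _count in cooc.get(term, Counter()).most_common(per_term):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    docs = [["engine", "piston", "valve"], ["engine", "piston"], ["valve", "pump"]]
    print(expand_query(["engine"], build_cooccurrence(docs), per_term=1))
```

In practice raw counts favour frequent words, so a normalized association measure would usually replace the plain count; that refinement is left out to keep the sketch short.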

Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:

Relevance feedback (RF) involves the user in the IR process so as to improve the final result set (Christopher D. Manning and Schütze 2008). In the basic RF procedure, the user first issues an initial query and the system returns a set of initial retrieval results. Next, the user marks some of the retrieval results as relevant or non-relevant. The system then formulates a better query according to the user feedback. Finally, the system displays a revised set of retrieval results. RF can go through one or more iterations of this sort.

Pseudo-relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis (Christopher D. Manning and Schütze 2008). It automates the manual part of RF, so that the user gets improved retrieval performance without an extended interaction. The method is to do normal retrieval to find an initial set of most relevant documents, then assume that the top k ranked documents are relevant, and finally do RF as before under this assumption.

2.3 Prior work in Patent Retrieval and Visualization

This section reviews existing work on patent search and patent visualization tasks and summarizes the special characteristics of patent search according to our review of that work.

Prior Work in Patent Prior Art Retrieval

(Walid Magdy 2012) explores the special nature of recall-oriented IR and patent search, and proposes a new evaluation metric for recall-oriented information retrieval tasks called Patent Retrieval Evaluation Score (PRES). (Far et al. 2015) build an oracular relevance feedback system that selects optimal query terms by acquiring pseudo-relevance feedback from the initial query results and assigning a score to each term for query reduction.
Using the whole patent document as a query is neither practical nor accurate, because one patent contains a huge amount of text and most of it is not useful for retrieval. Hence we need to extract useful terms from the whole patent document. As introduced before, patents are structured documents that consist of several different sections such as title, abstract, description, claims, and classification code. Different sections use different types of language to describe the invention. The abstract and description sections tend to use more technical terminology, while the claims usually use legal jargon. These differences arise from the different functions of the sections in the patent document. The abstract and description are responsible for explaining what the new invention is and how it works, while the claims are responsible for asserting the legal protection sought for the new invention.

Previous work has produced contrasting findings on which fields should be used for query term extraction. According to early patent research, the claims section is the primary section for building the query, which agrees with where examiners start in the novelty checking process (Takaki, Fujii, and Ishikawa n.d.). However, more recent work shows that building queries from the description field can obtain better results in patent retrieval (Xue and Croft n.d.). In contrast, one experiment shows that discarding the description from queries improves MAP by up to 30%, because the description section contains more noise than information (Gobeill et al. 2010). There is also research suggesting that extracting terms according to their TF-IDF scores from every field of the query patent, and giving higher importance to the terms extracted from the title field, is an effective way of constructing a search query (Cetintas and Si 2012).

Prior Work in Patent Prior Art Visualization

Patent visualization has been tackled from many different angles, such as document visualization, collection visualization and exploration, and patent landscapes, but less has been done on designing visualizations for patent prior art specifically. (Kucher and Kerren 2015) present an interactive visual survey of text visualization techniques that can be used to search for related work. The survey also serves to introduce the subfield and to give insight into research trends in text visualization techniques. A taxonomy of text visualization techniques is also summarized in that paper.

3 Materials and Evaluation Metrics

This chapter introduces the patent data used in this project and the three evaluation metrics used in the experiment to evaluate the retrieval results.

3.1 Data Collection

The patent data collection we use in this experiment is from the Cross Language Evaluation Forum for Intellectual Property evaluation track (CLEF-IP). The CLEF-IP track was launched in 2009 to investigate IR techniques for patent retrieval and was part of the CLEF 2009 evaluation campaign (TUWIEN 2018a). The prior art candidate search task (PAC) ran in five consecutive years: 2009, 2010, 2011, 2012, and 2013 (TUWIEN 2018b). The task we use in this experiment is CLEF-IP 2010, which is a benchmarking activity of the CLEF-IP 2010 conference. This track contains 1.3 million patent documents derived from the European Patent Office (EPO) (Piroi 2010).
The data collection covers English, French, and German. Figure 5 shows the percentage of English, German and French patents in the CLEF-IP 2010 collection. Only 68% of the patents in the data collection are English patents, and we use only the English patents in this experiment. The patent documents in the collection are stored as XML files. There are two tasks in the 2010 track. Our experiment performs the first task, which is to find patent documents that are candidates to constitute prior art for a given document. The target data set contains all EPO documents that have an application date prior to a cut-off date (2,680,698 patent documents constituting 1,331,106 patents) (Piroi 2010). The data collection stores the patent documents without merging the documents related to the same patent into one document, so we do the merging in our data preprocessing. Each patent in the collection is identified by a unique patent number, which is a string starting with "EP" followed by 7 digits. For each patent there is a directory containing the patent documents related to that patent, and each document is further identified by a two-character kind code. The kind codes represent different stages of the patent's life-cycle. A kind code starting with "A" means the document is a patent application, while a kind code starting with "B" refers to a granted patent. Table 1 shows the meaning of common kind codes.

Figure 5: Percentage of English, German, and French patents in the CLEF-IP 2010 collection

Figure 6: Completeness of the presence of English patents in the CLEF-IP 2010 collection

Not all the patents in the collection contain all sections. Figure 6 shows the completeness of the English patents in the data collection: only 52% of the English patents are complete. We use not only the complete English patents but also the other English patents, even though they are incomplete.

The query data set contains 2000 queries (Piroi 2010). Each query document is a patent application, A1 or A2, from which the citation information has been removed. The query documents also cover three languages, with 1,348 English queries, 518 German queries, and 134 French queries. We use only the English query documents as queries in our experiment.
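The merging of documents that belong to the same patent can be sketched as follows. This is only an illustration under assumptions: each document is a dict with an "id" such as "EP1234567A1" plus optional text fields, and, for each field, the text from the latest kind code that provides it is kept. That overwrite policy, the field list and the name `merge_versions` are hypothetical and are not stated in the source.

```python
from collections import defaultdict

# Text fields carried over into the merged record (assumed field names).
FIELDS = ("title", "abstract", "description", "claims")


def merge_versions(documents):
    """Merge all documents that share a patent number into one record."""
    grouped = defaultdict(list)
    for doc in documents:
        # The patent number is the ID without its two-character kind code.
        grouped[doc["id"][:-2]].append(doc)

    merged = {}
    for number, versions in grouped.items():
        # Sort by kind code so that A1 < A2 < B1 and later versions come last.
        versions.sort(key=lambda d: d["id"][-2:])
        record = {"id": number}
        for field in FIELDS:
            for doc in versions:
                if doc.get(field):
                    record[field] = doc[field]  # later versions overwrite earlier ones
        merged[number] = record
    return merged


if __name__ == "__main__":
    docs = [
        {"id": "EP1234567B1", "claims": "granted claims"},
        {"id": "EP1234567A1", "abstract": "abs", "claims": "draft claims"},
    ]
    print(merge_versions(docs))
```

Other policies, such as concatenating the text of all versions, would fit the same grouping step; the choice affects which wording ends up in the index.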

Kind Code   Meaning
A1          publication of application with search report
A2          publication of application without search report
A3          publication of search report
A4          supplementary search report
A8          corrected title page of an EP A document
A9          complete reprint of an EP A document
B1          granted patent
B2          granted patent after modification
B8          corrected front page of an EP B document
B9          complete reprint of an EP B document

Table 1: The patent ID kind codes and their meaning

3.2 Evaluation Metrics

Precision and Recall

An ideal retrieval system retrieves all relevant documents, and all the documents it retrieves are relevant. Recall and precision are the two evaluation metrics used to evaluate these two aspects respectively, and they are the most basic and most frequently used measures of information retrieval effectiveness. Equations 4 and 5 give the formulas for precision and recall.

Precision is the fraction of the retrieved documents that are relevant:

$$Precision = P(relevant \mid retrieved) = \frac{TP}{TP + FP} \qquad (4)$$

Recall is the fraction of the relevant documents that are retrieved:

$$Recall = P(retrieved \mid relevant) = \frac{TP}{TP + FN} \qquad (5)$$

where:
True Positive (TP): number of retrieved relevant documents
False Positive (FP): number of retrieved irrelevant documents
True Negative (TN): number of not-retrieved irrelevant documents
False Negative (FN): number of not-retrieved relevant documents

Prior art search is a recall-oriented search, so we pay more attention to recall.
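As a small worked illustration of the two definitions, the sketch below computes precision and recall from a retrieved list and a set of relevant documents. The function name and the toy document IDs are made up for the example, and returning 0.0 for an empty denominator is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but irrelevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    # Assumed convention: a ratio with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four retrieved documents, three relevant ones, two in common.
    print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"]))
```

With two of four retrieved documents relevant and two of three relevant documents retrieved, precision is 2/4 and recall is 2/3.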
Precision is less informative for patent retrieval results, and more suitable metrics are available for evaluating the accuracy of our results, so we do not use precision as an evaluation metric in this experiment.

Patent Retrieval Evaluation Score (PRES)

A new evaluation metric called Patent Retrieval Evaluation Score (PRES) is introduced by (Walid Magdy 2012). It is based on the same idea as normalized recall ($R_{norm}$) (Joseph Rocchio 1964; ROBERTSON 1969), shown in Equation 6, but with a different definition of the worst case.

$$R_{norm} = \frac{A_2}{A_1 + A_2} = 1 - \frac{\sum r_i - \sum i}{n(N - n)} \qquad (6)$$

where:
$A_1$, $A_2$: areas shown in Figure 7
$r_i$: the rank at which the $i$th relevant document is retrieved
$N$: collection size
$n$: number of relevant documents

The $R_{norm}$ score can reflect the precision-recall curve in one number, but it requires all documents in the collection to be ranked according to their relevance to the query (Joseph Rocchio 1964; ROBERTSON 1969). This metric measures a system's effectiveness in ranking documents relative to the best and worst ranking cases (Walid Magdy 2012), where the best case retrieves all relevant documents at the top of the result list and the worst case retrieves them at the bottom of a result list that ranks all documents in the collection. Figure 7 illustrates the calculation of $R_{norm}$, where $A_1$ represents the area between the best case and the actual case and $A_2$ represents the area between the actual case and the worst case.

Figure 7: Illustration of how the $R_{norm}$ curve is bounded by the best and worst cases (Rijsbergen 1979)

Figure 8: PRES curve is bounded between the best case and the newly defined worst case (Walid Magdy 2012)

Unlike $R_{norm}$, PRES assumes that in the worst case all the relevant documents are retrieved just after the maximum number of documents to be checked by the user, $N_{max}$. Any relevant document not retrieved in the top $N_{max}$ is treated as the worst case. Figure 8 illustrates the calculation of PRES. Applying this assumption to Equation 6 replaces $N$ with $N_{max} + n$, where $N_{max}$ is the number of retrieved documents, which is also the maximum number of documents to be checked by the user:

$$PRES = R_{norm}\big|_{N = N_{max} + n} = 1 - \frac{\sum r_i - \sum i}{n(N - n)}\bigg|_{N = N_{max} + n} = 1 - \frac{\sum r_i - \sum i}{n \, N_{max}}$$

The sum of the ranks of all the relevant documents in the best case is

$$\sum_{i=1}^{n} i = \frac{n(n + 1)}{2} \qquad (7)$$

so that

$$PRES = 1 - \frac{\frac{\sum r_i}{n} - \frac{n + 1}{2}}{N_{max}} \qquad (8)$$

Equation 9 shows the direct calculation of the sum of the ranks of the relevant documents in the general case, when some relevant documents are missing from the top $N_{max}$:

$$\sum r_i = \sum_{i=1}^{R} r_i + (n - R)(N_{max} + n) - \frac{(n - R)(n - R - 1)}{2} \qquad (9)$$

where:
$R$: number of relevant documents retrieved in the first $N_{max}$ documents

Mean Average Precision (MAP)

Mean average precision (MAP) is by far the most popular evaluation metric in general use for ad hoc IR tasks (Baeza-Yates and Ribeiro-Neto 2010). Equation 10 defines the average precision (AP) for a given topic, and MAP, shown in Equation 11, is the mean of AP taken over all topics in the test collection.

Average precision (AP) is the average of the precision values at each point where a relevant document is found:

$$AP = \frac{\sum_{r=1}^{N} \big(P(r) \times rel(r)\big)}{n} \qquad (10)$$

Mean Average Precision (MAP) is the average of the average precision scores over a query set:

$$MAP(Q) = \frac{\sum_{q \in Q} AP(q)}{|Q|} \qquad (11)$$

where:
r: the rank
P(r): the precision at cut-off rank r, i.e. Precision(r)
rel(r): a binary function of document relevance at a given rank, where rel(r) = 1 when the document at rank r is relevant and rel(r) = 0 otherwise
n: the total number of relevant documents
Q: the query set

As its name implies, MAP is a precision metric. Equation 10 shows that the lower a relevant document is ranked, the weaker its impact on AP. Hence, even when two result lists have the same recall, a list with most of its relevant documents at the top has a much higher MAP than a list with most of its relevant documents at the bottom. That is why MAP provides a good and intuitive evaluation for IR tasks that emphasize precision, but often does not give a meaningful interpretation for recall-focused tasks (Walid Magdy 2012).

4 Improving Patent Prior Art Search with Query Reduction and Query Structure Formulation

This section presents an overview of the retrieval framework of this study and two general methods to improve patent prior art search. The methods are inspired by two ideas: (i) query refinement via term selection, and (ii) query structure formulation. The proposed methods are compared against the baseline described in Section 5.1.

4.1 Patent Retrieval System Framework

Figure 9: The overall retrieval process in the patent retrieval experiment

The overall process of the retrieval experiment is described in Figure 9. The

most important module is query formulation. The query formulation techniques used in this study to improve the retrieval results are explained in detail in the following sections.

Data Preprocessing

First, we preprocess all the English patents in the CLEF-IP 2010 collection and all the English patent application topics in the CLEF-IP topic set. Preprocessing consists of converting each XML document into a JSON document, merging the different versions of a patent in the collection into one document, and filtering out all patent sections except title, abstract, description, claims, and classification.

Indexing

Structured indexing (parameters are shown in Table 2) is applied to the patent documents in the collection: the document structure is preserved in the index, so we can search a specific field of a document or the full document as a whole. A custom analyzer, shown in Table 3, is used in the index mapping. An analyzer consists of character filters, a tokenizer, and token filters. Our custom analyzer uses the lowercase tokenizer, the English stop word token filter, and the Porter stem token filter. The term vector parameter is also set at indexing time so that we can obtain the term vector information, which is used later in the term selection process.

Parameters       Title, Abstract, Description, Claims   ipcr     ucid
field datatype   text                                   [text]   keyword
term_vector      with_positions_offsets                 none     none
analyzer         my_analyzer                            none     none

Table 2: Index mapping parameters

              character filter   tokenizer   token filter
my_analyzer   none               lowercase   porter stemmer, english stop

Table 3: Custom analyzer

4.2 Query Reduction Via Term Selection

Global Frequency-Based Term Selection

Removing terms with a high document frequency in a global context means building a stop word list for each field (title, abstract, description, claims) based on the whole patent collection.
Unlike a common stop word list for the language, these patent-specific stop word lists depend on the data collection used, so they can identify collection-specific stop words. Patent-specific stop words are extracted from each individual patent field following (Corremans. and G. 2000). To obtain the field stop words, we retrieve from Elasticsearch the field frequency of each term identified in the field. The field

frequency for a term T in field F is the number of F fields, across all documents in the index, that contain T. We obtain the patent-specific stop words for each text field (title, abstract, description, claims). For each field, the terms with a field frequency higher than 1% of the highest term field frequency for that field are selected as stop words. The value of 1% was chosen subjectively, based on our observations and experiments on the data.

Local Frequency-Based Term Selection

Removing terms with a high document frequency in a local context means removing a percentage of the high document frequency terms from a specific field of a patent document. We first obtain the document frequency of every term in the field and sort the terms by document frequency. We then remove the x% of terms with the highest document frequency. The threshold x is set separately for each field.

Query Key-Phrase Selection

Query key-phrase selection extracts key-phrases automatically based on their informativeness score. Automatic key-phrase extraction consists of two steps: (i) identify a set of noun phrases from the given text as candidates, and (ii) score the candidate phrases with a score function and select the phrases with a high score as key-phrases.

Candidate Key-phrase Identification

In principle, every word or phrase in a document can be considered a candidate phrase. However, not all candidate phrases are informative for the retrieval task, so we need to identify key-phrases among them to reduce the computational cost and improve the retrieval accuracy. Heuristics are typically used to identify a smaller subset of better candidates (DeWilde n.d.). Common heuristics include removing common stop words, digits, and punctuation, and filtering words by part of speech.
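The global and local frequency-based selection steps described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names and the toy frequency tables are hypothetical, the input dictionaries stand in for the field and document frequencies that the study reads from Elasticsearch, and rounding the x% cut down to a whole number of terms is an assumption.

```python
def global_stop_words(field_freq, ratio=0.01):
    """Return the terms whose field frequency exceeds `ratio` times the
    highest field frequency observed for this field (1% in the text)."""
    if not field_freq:
        return set()
    cutoff = ratio * max(field_freq.values())
    return {term for term, freq in field_freq.items() if freq > cutoff}


def drop_high_df_terms(terms, doc_freq, percent):
    """Remove the `percent`% of distinct terms with the highest document
    frequency from one field of a query patent, keeping term order."""
    # Rank distinct terms by descending document frequency (ties: alphabetical).
    ranked = sorted(set(terms), key=lambda t: (-doc_freq.get(t, 0), t))
    n_remove = int(len(ranked) * percent / 100)  # assumption: round down
    removed = set(ranked[:n_remove])
    return [t for t in terms if t not in removed]


# Hypothetical field frequencies for one field of the collection.
claims_freq = {"said": 10000, "method": 9000, "device": 5000, "rotor": 80, "gasket": 12}
stop_words = global_stop_words(claims_freq)  # cut-off is 1% of 10000

# Hypothetical terms of one query field and their document frequencies.
query_terms = ["said", "rotor", "blade", "rotor", "gasket"]
doc_freq = {"said": 10000, "blade": 300, "rotor": 80, "gasket": 12}
reduced = drop_high_df_terms(query_terms, doc_freq, percent=25)
```

With these toy numbers the global step flags the three very frequent terms, and the local step with x = 25 drops only the single most frequent of the four distinct query terms. Keeping `ratio` and `percent` as parameters reflects that both thresholds were tuned empirically and per field.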
For multiword phrases, further heuristics include using certain POS patterns to identify noun phrases, and using external knowledge bases such as WordNet or Wikipedia as a reference source of good and bad key-phrases. In our study, we use part-of-speech patterns, expressed as a regular expression, to extract noun phrases as key-phrase candidates. The regular expression is {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}, written in the simplified format used by NLTK's RegexpParser(). It matches any number of adjectives followed by at least one noun, optionally joined by a preposition to one other adjective(s)+noun(s) sequence (DeWilde n.d.).

Keyphrase Selection

There are many methods for distinguishing key-phrase candidates from noise candidates. The simplest is to score candidates solely on frequency statistics, such as TF*IDF or BM25. For this method, we assume that the key-phrases of a document tend to be phrases with a high phrase frequency and a low document frequency.
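The two steps above, candidate identification by POS pattern and frequency-based scoring, can be sketched in plain Python. This is only an illustration under several assumptions: it uses the standard `re` module on one-letter tag codes instead of NLTK's RegexpParser, the input is already POS-tagged with Penn Treebank tags, a phrase is scored by the mean TF-IDF of its words with idf taken as log((N + 1) / df), and words unseen in the collection score zero. All function names and the toy statistics are hypothetical.

```python
import math
import re


def extract_candidates(tagged_tokens):
    """Return noun-phrase candidates (tuples of lowercased words) from
    a list of (word, POS tag) pairs."""
    # One letter per token, so the chunk grammar becomes an ordinary regex:
    # J = adjective (JJ), N = any noun tag (NN, NNS, NNP, NNPS),
    # I = preposition (IN), O = anything else.
    def code(tag):
        if tag == "JJ":
            return "J"
        if tag.startswith("NN"):
            return "N"
        if tag == "IN":
            return "I"
        return "O"

    codes = "".join(code(tag) for _, tag in tagged_tokens)
    pattern = re.compile(r"(?:J*N+I)?J*N+")
    return [
        tuple(word.lower() for word, _ in tagged_tokens[m.start():m.end()])
        for m in pattern.finditer(codes)
    ]


def tfidf(term, query_tf, doc_freq, n_docs):
    """tf comes from the query document, df from the target collection."""
    df = doc_freq.get(term, 0)
    if df == 0:
        return 0.0  # assumption: terms unseen in the collection score zero
    return query_tf.get(term, 0) * math.log((n_docs + 1) / df)


def rank_phrases(candidates, query_tf, doc_freq, n_docs):
    """Score each phrase by the mean TF-IDF of its words, highest first."""
    scored = {
        phrase: sum(tfidf(t, query_tf, doc_freq, n_docs) for t in phrase) / len(phrase)
        for phrase in set(candidates)
    }
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical tagged sentence and frequency statistics.
tagged = [("The", "DT"), ("rotary", "JJ"), ("shaft", "NN"), ("of", "IN"),
          ("electric", "JJ"), ("motors", "NNS"), ("drives", "VBZ"),
          ("a", "DT"), ("pump", "NN")]
candidates = extract_candidates(tagged)

query_tf = {"rotary": 1, "shaft": 2, "of": 5, "electric": 1, "motors": 1, "pump": 3}
doc_freq = {"rotary": 10, "shaft": 100, "of": 1000, "electric": 100, "motors": 100, "pump": 10}
ranked = rank_phrases(candidates, query_tf, doc_freq, n_docs=999)
```

Here the pattern yields two candidates, the long phrase "rotary shaft of electric motors" and the single noun "pump". Averaging over words keeps long phrases from outscoring short ones merely because they contain more terms.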

To perform key-phrase selection, we use the TF-IDF scoring method shown in Equation 13 to score each term. The score of a phrase, shown in Equation 14, is the average score of the terms it consists of. Since the domain of the queries differs from that of the retrieval target in the distribution of term occurrences, using the distribution of only one corpus can cause incorrect term weighting (Hideo Itoh 2003). Thus, in this experiment, the document frequency of a query term is obtained from the target data collection and the term frequency of a query term is obtained from the query document.

\[ idf(t_k) = \log \frac{N + 1}{df(t_k)} \tag{12} \]

\[ TFIDF(t_k, Q_i) = tf(t_k, Q_i) \cdot idf(t_k) \tag{13} \]

\[ TFIDF(p_k, Q_i) = \frac{\sum_{t \in p_k} TFIDF(t, Q_i)}{|p_k|} \tag{14} \]

where tf(t_k, Q_i) is the number of occurrences of the term t_k in the query document Q_i, df(t_k) is the number of documents in the target data collection D that contain at least one occurrence of t_k, N is the number of documents in D, and |p_k| is the number of terms in the phrase p_k.

4.3 Query Structure Formulation

In this section, we present the two query structure formulation methods used in this research project.

Compound Query

A compound query is composed of leaf queries, each of which uses the text of a specific field as its query input. The compound query combines the results and scores of the leaf queries to form a new score and a new result list. There are two types of compound structure. In the first type, each leaf query searches within the corresponding field of the patent collection: for example, the query string extracted from the "abstract" field of the query patent is searched against the "abstract" field in the patent collection, the query string extracted from the "description" field is searched against the "description" field, etc.
The search results of the leaf queries are then combined with the "OR" operator, and the leaf query scores are summed to give the score of the compound query. In the second type, the leaf queries are again built from specific fields of the query patent, but each leaf query searches the whole document. For example, the "abstract" leaf query uses the "abstract" field of the query patent as its query string and searches the full patent document as a whole. All leaf queries are combined with the "OR" operator: their scores are added up to give the compound score, and their search results are merged into the new result set.
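The two compound structures can be sketched as Elasticsearch query bodies built in Python. This is a hedged illustration rather than the queries used in the study: the lowercase field names, the helper name `compound_query`, and the use of `match` and `multi_match` leaf queries are assumptions, and `multi_match` over the four text fields only stands in for searching "the full patent document as a whole". In the Elasticsearch query DSL, the scores of the matching `should` clauses of a `bool` query are added together, which corresponds to the "OR" combination with summed scores described above.

```python
TEXT_FIELDS = ("title", "abstract", "description", "claims")


def compound_query(query_patent, per_field=True):
    """Build a bool/should query with one leaf query per text field.

    per_field=True  -> each leaf searches only its own field (first type).
    per_field=False -> each leaf searches all text fields (second type).
    """
    leaves = []
    for field in TEXT_FIELDS:
        text = query_patent.get(field)
        if not text:
            continue  # skip fields that are missing or empty in the query patent
        if per_field:
            leaf = {"match": {field: text}}
        else:
            leaf = {"multi_match": {"query": text, "fields": list(TEXT_FIELDS)}}
        leaves.append(leaf)
    return {"query": {"bool": {"should": leaves}}}


# Hypothetical query patent with an empty description.
patent = {
    "title": "wind turbine rotor",
    "abstract": "a rotor blade with a heated edge",
    "description": "",
    "claims": "a rotor blade comprising a heating element",
}
per_field_body = compound_query(patent, per_field=True)
whole_doc_body = compound_query(patent, per_field=False)
```

Either body could be passed as the request body of a search call. Building the query as a plain dictionary keeps the two variants identical except for the leaf query, which makes the comparison between them easy to reproduce.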

We have implemented both compound structures in our research and compare their results in Section 5.2.

Common Term/Phrase Weight Assignment

We assume that the terms or phrases that appear in several fields of the query patent are more important than those that appear in a single field, and that the more fields a term or phrase appears in, the more important it is. The common term selection method assigns to each common term a weight equal to the number of fields of the document in which it appears; this weight is used in the query structure formulation.

5 Experiments

In this section, we first present the baseline retrieval system. We then compare and analyze the results of the baseline system and of several improved systems that use the query formulation techniques described in Section 4. Finally, we introduce our visualization experiment.

5.1 Baseline definition

In the baseline query formulation, we use the lowercase tokenizer as the indexing analyzer and remove common English stop words (countwordsfree 2018), digits, and words shorter than 3 letters. We also use the IPC codes assigned to each topic as metadata to filter the search results, so that each returned result shares at least one IPC code with the topic query patent application. Performance is evaluated with the three metrics defined earlier, average recall, MAP, and PRES, on the top 100 results for each query. The results are in Table 4:

Metric     Title   Abstract   Description   Claims
PRES
MAP
A.Recall

Table 4: The baseline results using different patent sections as queries

According to the results in Table 4, the best section to query with in the baseline system is the claims section.

5.2 Retrieval Experimental Results and Analysis

This section presents the evaluation and analysis of the results obtained with different techniques that improve on the baseline system.

Frequency-based term selection

We first apply global frequency-based term selection to the baseline system.
We obtain the patent-specific field stop word lists by extracting the terms with a high field document frequency from the target data collection. After applying the baseline query formulation, we reformulate the baseline query by filtering it with our patent-specific field stop word lists to reduce the noise terms in

the query. Table 5 shows the evaluation results of frequency-based term selection on the baseline system. Comparing these results with the baseline results, we find that frequency-based term selection improves the baseline system.

Metric     Title   Abstract   Description   Claims
PRES
MAP
A.Recall

Table 5: Adding frequency-based term selection to the baseline system

Compound query structure

Second, we apply the compound query structure methods on top of the system improved by frequency-based query reduction. There are two combination methods. The first uses all four text sections (title, abstract, description, and claims) of the query patent as leaf queries; each leaf query searches the corresponding section in the target data collection, and the leaf query scores are summed to give the final score. We denote this type of combination as Combination(1). The results of Combination(1) are shown in Table 6.

Metric     Title + Abstract + Description + Claims (Combination1)
PRES
MAP
A.Recall

Table 6: Adding the Combination(1) method to the system improved by frequency-based term selection

The second combination method also uses all four text sections of the query patent as leaf queries, but each leaf query searches the full patent documents in the target data collection, and the leaf query scores are summed to give the final score. We denote this type of combination as Combination(2). The results of Combination(2) are shown in Table 7.

Metric     Title + Abstract + Description + Claims (Combination2)
PRES
MAP
A.Recall

Table 7: Adding the Combination(2) method to the system improved by frequency-based term selection

Comparing the results of Combination(1) in Table 6 and Combination(2) in Table 7 with the frequency-based results in Table 5, we find that Combination(1) does not help, whereas Combination(2) improves on the results of the frequency-based method.
Thus, in the following experiments, we abandon Combination(1) and build further query improvements on the Combination(2) method.

Query Key-Phrase selection

We add a key-phrase selection method to the Combination(2) query formulation method. We select key-phrases using the method described in Section 4, and then construct phrase queries from the key-phrases. Table 8 shows the retrieval results using frequency-based term selection, phrase queries, and the Combination(2) query formulation method. The results are better than those of the Combination(2) method alone (see Table 7). In conclusion, the key-phrase selection query formulation method improves the retrieval results.

Metric     Combination2 + key-phrases
PRES
MAP
A.Recall

Table 8: Combination(2) combined with key-phrases

Common term weight reassignment

To test whether the common term weight reassignment query formulation method can improve the retrieval results, we apply it to the Combination(2) query formulation. We identify the common terms in the four sections of a patent document and assign a higher weight to these common terms according to the method described earlier. As shown in Table 9, adding the common term query formulation method to Combination(2) outperforms Combination(2) alone (Table 7), but not the method that combines Combination(2) and key-phrases.

Metric     Combination2 + common term
PRES
MAP
A.Recall

Table 9: Combination(2) full-text search with a higher weight assigned to common terms

Combination of all the presented query formulation methods

Finally, we combine all the useful query formulation methods above to form a query. Table 10 shows the results of using the frequency filter, Combination(2), key-phrases, and common term weight reassignment together to form the query. As shown in Table 10, combining all the query formulation methods obtains, in our experiments, better retrieval results than any of the results mentioned before. Table 11 compares the baseline retrieval results with all the retrieval results obtained using the different query formulation methods.
We can conclude from the results that our query formulation methods improve the retrieval results substantially over the baseline system.
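The common term weight reassignment evaluated above, in which a term is weighted by the number of query patent fields it appears in, can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data. How the weights enter the final query is not specified in the text, so expressing them with the `term^weight` boost syntax of Lucene-style query strings is an assumption.

```python
from collections import Counter


def common_term_weights(fields):
    """Map each term to the number of fields of the query patent it appears in.

    `fields` maps a field name to the list of terms extracted from that field.
    """
    counts = Counter()
    for terms in fields.values():
        counts.update(set(terms))  # a field counts at most once per term
    return dict(counts)


def boosted_query_string(weights):
    """Render the weights as a query string, boosting terms found in several fields."""
    parts = []
    for term in sorted(weights):
        weight = weights[term]
        parts.append(f"{term}^{weight}" if weight > 1 else term)
    return " ".join(parts)


# Hypothetical terms per field of one query patent.
query_fields = {
    "title": ["rotor", "blade"],
    "abstract": ["rotor", "blade", "hub", "rotor"],
    "claims": ["rotor"],
}
weights = common_term_weights(query_fields)
query_string = boosted_query_string(weights)
```

Counting fields rather than occurrences means that a term repeated many times inside one section is not boosted; only terms shared across sections are.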

Metric     frequency filter + combination(2) + phrase search + common term
PRES
MAP
A.Recall

Table 10: Combining the frequency filter, Combination(2), key-phrases, and common term query formulation

5.3 Visualization Experiment

The overall process of the visualization experiment is shown in Figure 10. We first set up a MongoDB database with the query patent documents and all the related patent documents, and then build a web application on the Django framework that uses this database.

Figure 10: The overall process of the visualization experiment

Database Setup

The format of the retrieval results from the retrieval experiment is shown in Figure 11. The database we use in this visualization experiment is MongoDB. Unlike MySQL, which is a relational database, MongoDB is a document-oriented database, which fits our experiment well because our data are documents without a uniform field format. MongoDB offers good scalability and flexibility, which makes it a good choice for managing a large amount of JSON data efficiently (MongoDB 2018). MongoDB stores data in flexible, JSON-like documents, meaning that fields can vary from document to document and the data structure can be changed over time. Our patent and qrel documents are stored in JSON format with varying fields, and MongoDB makes data collections like this easier to work with. MongoDB also provides ad hoc queries, indexing, and real-time aggregation, which are powerful ways to access and analyze our data. MongoDB is free and open-source, and it provides drivers for

languages, and the community has built dozens more, all of which makes MongoDB a popular database nowadays.

Query Type                                                                 Query Section   PRES   MAP   A.Recall
baseline                                                                   Title
                                                                           Abstract
                                                                           Description
                                                                           Claims
baseline + frequency filter                                                Title
                                                                           Abstract
                                                                           Description
                                                                           Claims
baseline + frequency filter + combination1                                 All section
baseline + frequency filter + combination2                                 All section
baseline + frequency filter + combination2 + phrase search                 All section
baseline + frequency filter + combination2 + common term                   All section
baseline + frequency filter + combination2 + phrase search + common term   All section

Table 11: Comparison of the retrieval results using different query formulation methods

There are two collections in my database: the qrel collection stores the query patent documents, and the patent collection stores the patent documents related to the query documents. The related patent documents obtained from the result list (Figure 11) are given an additional field called Qkey, whose value is the qrel identity that we enter in the search bar. The qrel patents obtained from the result list also have a field added: a field called PAC, which holds the qrel identity of each qrel. These added fields are useful for making queries.

Web Framework

Our visualization system is built on Django. Django is a free and open-source high-level web framework, written in Python, which follows the model-view-template (MVT) architectural pattern (Django 2018). Django was created to ease the creation of complex, database-driven websites, which makes it a good fit for our visualization experiment. Since the architectural pattern of Django is model-view-template (MVT), we explain these three parts in detail below. First, in the model part, we map the collections in the database to the classes we created in the models. With the help of the Python database driver

provided by MongoDB, we can easily connect to the database we use. We then edit the metadata of each class to specify which collection it refers to, and explicitly declare in the class the fields that the documents in this collection must have. The class and the documents in the collection are then mapped together automatically, so we can use a document in the database as if it were a class instance in the view. All of this is done in the models.py file of the Django framework.

Figure 11: Qrel format

Second, the view part is contained in views.py. A Python function in views.py represents a view, which takes a web request and returns a web response. This response can be the HTML contents of a web page, a redirect, a 404 error, etc. Generally, a view retrieves data from the model objects according to the request parameters, loads a template, and renders the template with the retrieved data. To see the template rendered by the view in a web page, we also need to associate the view with a URL in urls.py.

Third, the template folder represents the template part of the MVT architecture. All the HTML files that describe the web pages rendered by the view functions are stored in this folder, and their related CSS and JS files are stored in the static folder. Variables passed from the view function to the HTML can be used by surrounding them with double curly braces. Django also has a template search path, which allows you to minimize redundancy among templates. With the help of the Django framework and MongoDB, we can develop a visualization system that has a clear structure and can scale quickly and flexibly.

Visualization Results

The visualization application is deployed on Heroku, a cloud application platform. The database used by the visualization application is stored in mLab, which provides a fully managed cloud database service that hosts MongoDB databases.

Figure 12: Visualization website use case diagram

Here is the link for the visualization application: https://patentvis.herokuapp.com/visualization/

To help the user explore the retrieval result visualization website thoroughly, a use case diagram is provided in Figure 12. To explore the retrieval results, the user first enters a query string in the search bar. The query string is a query id from the CLEF-IP 2010 query topic set, for example PAC-1. The page is then redirected to the retrieval results page, which contains the related patent list, the results analysis, and the related results network. You can view a related patent by clicking "view patent" in the corresponding element of the related patent list. You can also explore the commonalities and differences between a specific related patent and the query topic patent by clicking on that patent in the related results network. You are then redirected to the comparison page for the two patents, which contains a common analysis panel and patent text comparison panels.

6 Conclusion

6.1 What I learned during this project

During this project I developed an information retrieval system and a website that visualizes the retrieval results. What I learned from the project is described below.

Information Retrieval

I learned a lot from the patent retrieval experiment. Before the experiment, I searched for and read related material about information retrieval to figure out how an information retrieval system works and to learn the theory of information retrieval models and the related evaluation metrics. Also, I
