A Patent Retrieval and Visualization Case Study


COMP8755 Individual Computing Project
Honggu Lin
May 24,

Abstract

Most Information Retrieval (IR) tasks, such as web search and news search, are dedicated to improving precision. However, other IR tasks, such as patent search and legal search, value recall more than precision. The users of patent search are usually professional patent analysts who are willing to check hundreds or even thousands of patent documents to make sure there is no infringement of intellectual property rights. The main concerns of these recall-oriented tasks differ greatly from those of precision-oriented tasks, whose users tend to focus only on a limited number of top-ranked results. In this research project, we explore the distinguishing features of patent prior art retrieval compared to generic information retrieval. We then build a baseline patent retrieval system on the CLEF-IP 2010 patent data collection using the vector space retrieval model. Moreover, we experiment with several query formulation methods with the aim of improving the baseline system. Finally, we build a website to visualize the retrieval results with the goal of improving access to patent prior art information.

Acknowledgment

First, I would like to express my sincere gratitude to my project supervisor Gabriela Ferraro for her continuous support during this project. She always gave me good advice whenever I got stuck and steered me in the right direction whenever she thought I needed it. Besides my supervisor, I would like to thank my formal supervisor and project examiner Hanna Suominen, who was involved in the examination of this research project, for her passionate participation and for the time she devoted to it. I would also like to thank my course convenor Peter Strazdins for his great advice on this course, and for his insightful comments and hard questions during the project presentation. Finally, I am grateful to my family and my friends for providing me with reliable support and continuous encouragement throughout my student career and through the process of researching and writing this thesis.

Contents

Abstract
Acknowledgement
1 Introduction
    Motivation
    Objectives of this Study
    Structure Of The Thesis
2 Background
    Patents
    Overview Of An Information Retrieval System
        Retrieval Model
        Query Formulation
        Query Reduction
        Query Expansion
    Prior work in Patent Retrieval and Visualization
3 Materials and Evaluation Metrics
    Data Collection
    Evaluation Metrics
        Precision and Recall
        Patent Retrieval Evaluation Score (PRES)
        Mean Average Precision (MAP)
4 Improving Patent Prior Art Search with Query Reduction and Query Structure Formulation
    Patent Retrieval System Framework
        Data Preprocessing
        Indexing
    Query Reduction Via Term Selection
        Global Frequency-Based Term Selection
        Local Frequency-Based Term Selection
        Query Key-Phrase Selection
    Query Structure Formulation
        Compound Query
        Common Term/Phrase Weight Assignment
5 Experiments
    Baseline definition
    Retrieval Experimental Results and Analysis
    Visualization Experiment
        Database Setup
        Web Framework
        Visualization Results
6 Conclusion
    What I learned during this project
    Future Work

1 Introduction

Figure 1: Comparison Between Web Search And Prior Art Search

1.1 Motivation

A patent is a legal right granted by a patent office to an inventor or assignee for a device, substance, method or process that is new, useful and inventive (IP Australia 2018). This right protects the commercial interest of the inventor or assignee during the life of the patent (WIPO 2018b). Patents, therefore, as a kind of intellectual property, have a great impact on the market value of enterprises (IP Australia 2018). With the number of patent applications rising every year, an accurate and efficient system that can return all possibly relevant patents for a patent application becomes increasingly necessary. With the help of such a system, patent analysts, whose duty is to prevent any intellectual property infringement by a new patent application, can identify all relevant patents more efficiently and more accurately. Other users, such as inventors and patent lawyers, can also use this system to check the novelty of a patent application during the application process.

There are great differences between prior art search and the standard web search that we are familiar with (Far et al. 2015). Figure 1 illustrates three main differences between them. The main objective of prior art search is to find all the relevant documents given a patent application as a query. Thus, its evaluation focuses more on recall. The returned result list can contain hundreds or even thousands of documents, because the users of prior art search are usually professional patent analysts who are willing to work through a long result list to find any possibly relevant document. In contrast, the major goal of standard web search is to find one or a few top relevant documents to explore, and its main evaluation criterion is precision. It is widely used by people who want to acquire the information they need as quickly as possible.
Another main difference between prior art search and web search is the content of the query. A web search query is usually very short and contains only a few keywords, while the query of a prior art search is a whole patent document, which makes query formulation far more difficult than in general web search.

1.2 Objectives of this Study

There are three objectives of this research project:

Objective 1: build a baseline patent prior art retrieval system.
Objective 2: improve the baseline system by applying query formulation methods.
Objective 3: experiment with visualization strategies with the aim of improving information access to patent prior art search content.

In order to achieve these three objectives, we implement several query formulation methods and analyze their retrieval results to see whether these methods work and how to combine them to formulate the query that provides the best results.

1.3 Structure Of The Thesis

This chapter introduces the research problem of the thesis and analyzes the differences between prior art search and general web search. Chapter 2 provides the background knowledge related to generic information retrieval and patent retrieval. We also review previous work on existing query formulation techniques in that chapter. In Chapter 3, we introduce the data materials and evaluation metrics used in this experiment. Our main experiment is described in Chapters 4 and 5. Chapter 4 explains the query formulation techniques used in our improved patent retrieval system. The experimental setting and the baseline system are described in Chapter 5, which also covers the results of the experiments and their analysis. Chapter 6 concludes the thesis by summarizing the results and observations, as well as proposing possible directions for future work.

2 Background

This chapter first explains the details of patent structure, then introduces information retrieval systems and query formulation. Moreover, we present and discuss previous work on patent prior art search and how the patent structure has been exploited to improve retrieval.

2.1 Patents

A patent is a structured document, associated with a legal right for a new invention, that contains specific sections such as title, abstract, description, and claims to define the protected invention.
Patent Classification

A patent passes through several versions before it becomes a granted patent. The initial version that the inventor submits to the patent office for novelty checking is called a patent application. The two-character code at the end of the ID of a patent application starts with "A". There may be several updated patent applications before the application is finally either accepted as a granted patent or rejected. For a granted patent, the two-character code at the end of the ID starts with "B".

Figure 2: A sample XML patent document from EPO

Patent Structure

Different patent offices have different patent structure requirements. There are many patent offices across the world, such as the United States Patent and Trademark Office (USPTO), the European Patent Office (EPO), and the Japan Patent Office (JPO). The dataset we use in this experiment is from the EPO, so we explain the EPO patent structure in more detail. Figure 2 is a sample XML patent document from the EPO. As Figure 2 shows, an EPO patent is a structured document consisting of several sections, such as ID, abstract, description, claims, and bibliography-data. The sections of EPO patents that are commonly used in a patent retrieval system are explained below.

ID: a unique identifier for EPO patents. It is a string starting with "EP", followed by 7 digits and then a two-character version string. A version string starting with "A" stands for a patent application, and one starting with "B" stands for a granted patent.

Abstract: a short summary of the patent in three languages, English (EN), German (DE) and French (FR). This section does not always exist because it is optional in EPO patents.

Description: the description section is the core of the invention in an EPO patent (Walid Magdy 2012). All the technical details of the patent are contained in this section. It consists of a number of paragraphs, each of which describes one aspect of the invention in detail. The description section can contain tables, experiments on the performance of the invention, and descriptions of figures relating to the invention. Its first paragraph usually contains information about the topical field of the invention. The description text also contains references to other patent documents, which are very important information that patent analysts examine to measure the contribution of the invention against prior art.
Claims: the claims section of the patent document lists the aspects of the invention that the patent is going to protect. A successful patent does not need to have all its claims accepted, but at least one of them must be (Walid Magdy 2012). The examination can lead to some claims being dropped by showing that they are not novel. This usually happens because patent applicants try to generalize their invention as much as possible, which can lead to some of the very general claims being found invalid for lack of novelty.

Invention-title: the title of the patent, presented in three languages, English (EN), German (DE) and French (FR), in the bibliography-data section.

Patent Classification Code: patent classification schemes are used to organize and index the technical content of patent specifications so that specifications on a specific topic can be identified easily and accurately (The British Library Board 2017). There are several patent classification schemes: the International Patent Classification (IPC), ECLA, the US classification, and the British classification. The IPC is widely used around the world and is used by the European Patent Office

(EPO), from which our experimental data is obtained. The International Patent Classification (IPC) provides a hierarchical system of language-independent symbols for the classification of patents and utility models according to the different areas of technology to which they pertain (WIPO 2018c). The IPC divides technology into eight main sections with approximately 70,000 subdivisions. Each subdivision has a symbol consisting of Arabic numerals and letters of the Latin alphabet (WIPO 2018a). The IPC is revised once a year to keep it up to date (WIPO 2018c). Figure 3 illustrates the components of an IPC classification symbol and gives two examples of IPC codes in patents.

Figure 3: Complete Classification Symbol

2.2 Overview Of An Information Retrieval System

The overall process of an Information Retrieval (IR) system is illustrated in Figure 4. On the collection side, each document in the data collection is indexed before being searched. The user formulates the query using the provided information and submits the formulated query to the IR system. In the matching process, the query and the document representations are compared using a retrieval model, and the result is a ranked list of documents. The returned ranked list can either be the final result list or be used as feedback and passed to the query formulation module to reformulate the query.

Retrieval Model

Vector Space Model

In a vector space model, documents and queries are represented by vectors of term weights, and the collection is represented by a matrix of term weights as follows.

A document term weight vector:

$$D_i = [d_{i1}, d_{i2}, d_{i3}, \ldots, d_{im}]$$

A query term weight vector:

$$Q_j = [q_{j1}, q_{j2}, q_{j3}, \ldots, q_{jm}]$$

Figure 4: The overall Process Of A Prior Art Information Retrieval System

A document collection term weight matrix:

$$D = \begin{bmatrix} d_{11} & d_{12} & d_{13} & \cdots & d_{1m} \\ d_{21} & d_{22} & d_{23} & \cdots & d_{2m} \\ d_{31} & d_{32} & d_{33} & \cdots & d_{3m} \\ \vdots & \vdots & \vdots & & \vdots \\ d_{N1} & d_{N2} & d_{N3} & \cdots & d_{Nm} \end{bmatrix}$$

where $D_i$ is a document in the collection $D$, $d_{ik}$ is the weight of term $t_k$ in the document $D_i$, and $q_{jk}$ is the weight of term $t_k$ in the query $Q_j$. The indexed collection is represented by the matrix $D_{N \times m}$, where $N$ is the number of documents in the indexed collection and $m$ is the number of unique terms in the collection. If a term does not appear in a document, the weight for that term is zero. The TFIDF weight of a term in a document, shown in Equation 2, is calculated by multiplying the term frequency (tf) of the term in that document by the inverse document frequency (idf) of the term in the collection.

$$idf(t_k) = \log \frac{N + 1}{df(t_k)} \tag{1}$$

$$TFIDF(t_k, D_i) = tf(t_k, D_i) \cdot idf(t_k) \tag{2}$$

where $tf(t_k, D_i)$ is the number of occurrences of the term $t_k$ in the document $D_i$, and $df(t_k)$ is the number of documents in the collection $D$ that contain at least one occurrence of the term $t_k$. Given a query $Q$, documents are ranked by the overlap score measure given in Equation 3:

$$TFIDF(Q, D_i) = \sum_{q \in Q \cap D_i} \frac{TFIDF(q, D_i)}{(1 - b) + b \cdot \frac{|D_i|}{avdl}} \tag{3}$$

where $|D_i|$ is the length of document $D_i$, $avdl$ is the average document length, and $b$ is a parameter that can be adjusted by the user and otherwise takes a default value.

Query Formulation

Query formulation is the process in which the original keyword query issued by the user is transformed into a structured query representation that is used by the search engine (M.Malathi 2013). The main goal of query formulation is to improve the overall quality of the ranking presented to the user in response to their query (M.Malathi 2013). Query formulation can generally be divided into two main processing stages.

The first processing stage is query refinement or reformulation, which alters the query on the morphological level (M.Malathi 2013). Query term processing starts with tokenization, which splits a character sequence into word tokens. Normalization is then performed on the tokens by mapping text and query terms to the same form. If different forms of a root should match, stemming is added as a third step. Stop word removal and spelling correction are also commonly used in query term processing. After query term processing, we can apply query reduction and/or query expansion to the query terms for query refinement. These two seemingly incompatible approaches can both improve document retrieval performance. We discuss them in detail in the next two sections.

The second processing stage alters the query on the structural level and is performed after the query refinement stage is completed. The structural alterations may include, among other actions, segmenting the query into atomic concepts, assigning weights to these concepts, or expanding the query with related weighted concepts (M.Malathi 2013).

Query Reduction

Query reduction reduces the length of the query. It is widely used in patent prior art search because a patent retrieval query is as long as a whole patent document.
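As a rough illustration of the retrieval model in Equations 1 to 3 together with the basic query term processing steps described above, the following minimal sketch ranks a toy collection. The stop word list, the punctuation handling, the function names, and the value `b=0.75` are all assumptions made for the example; they are not taken from the system described in this report, and stemming is left out for brevity.

```python
import math
from collections import Counter

# Illustrative stop word list only; a real system would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "is", "to"}


def process_terms(text):
    """Tokenize on whitespace, strip punctuation, lowercase, drop stop words."""
    tokens = (tok.strip(".,;:()").lower() for tok in text.split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def idf(term, docs):
    """Equation 1: log((N + 1) / df(t)); 0.0 if the term is not in the collection."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / df) if df else 0.0


def score(query_terms, doc, docs, b=0.75):
    """Equation 3: length-normalized sum of TF-IDF weights over shared terms."""
    avdl = sum(len(d) for d in docs) / len(docs)
    norm = (1 - b) + b * len(doc) / avdl
    tf = Counter(doc)
    shared = set(query_terms) & set(doc)
    return sum(tf[t] * idf(t, docs) for t in shared) / norm


def rank(query_text, raw_docs, b=0.75):
    """Return document indices ordered from highest to lowest score."""
    docs = [process_terms(d) for d in raw_docs]
    query_terms = process_terms(query_text)
    scored = [(score(query_terms, d, docs, b), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


collection = [
    "A wind turbine blade with a heating element.",
    "Method for brewing coffee.",
    "Turbine blade coating for corrosion protection.",
]
ranking = rank("heating element for a turbine blade", collection)
```

In this toy run the first document shares all four query terms and is ranked first, while the coffee document shares none and scores zero. Setting `b=0` switches length normalization off entirely.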
There are three main methods to reduce the query.

Query Summarization

(Mahdabi et al. 2011) utilizes a known text summarization technique, called TextTiling, to summarize patent documents. The summary-based query aims to capture the main topic of the document as well as its most important subtopics, and to discard subtopics that are only marginally discussed in the patent document.

Query Segmentation

(Hearst n.d.) introduces a technique called TextTiling for subdividing texts into multi-paragraph units that represent different subtopics. Multi-paragraph subtopic segmentation can be used for patent query segmentation: each patent query is segmented into several sub-queries that represent different subtopics, each sub-query is searched separately, and the resulting retrieval results are finally merged into one result list that represents the result of the original query.

Query Term Selection

Query term selection either removes from a query the noise terms that may harm retrieval performance, or selects informative terms from the long query and uses only these selected terms as the query. Words with a high document frequency, which appear in many documents of the data collection, are considered stop words, for example "you", "and", and "this". There are several methods to identify stop words. Using a common-language stop word list compiled by experts is a frequently used method. However, such a list is not based on the data set we use. For example, "system" and "machine" can be considered stop words in a patent data collection, although they are not contained in common stop word lists. Thus, another approach is to obtain stop words specific to the data set by computing the document frequency of each individual word in the data set and treating the words whose document frequency falls in a specified top percentage as stop words (Corremans. and G. 2000).

In cross-database retrieval, query terms can be weighted based on their perceived significance in the target corpus, combined with their significance in the query (Hideo Itoh 2003). Since the domain of the queries differs from that of the retrieval target in the distribution of term occurrences, using the distribution of only one corpus can cause incorrect term weighting. Thus, in this experiment, the document frequency of a query term is obtained from the target data collection and the term frequency of the query term is obtained from the query document. The term frequency is multiplied by the inverse document frequency to produce the query term weight used for query term selection.

Query Expansion

In most collections, one concept can be represented by different words. This phenomenon, known as synonymy, has an impact on the recall of most information retrieval (IR) systems (Christopher D. Manning and Schütze 2008). The methods for tackling this problem fall into two major classes: global methods and local methods.
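Returning to the query term weighting described under Query Term Selection above (term frequency taken from the query document, document frequency taken from the target collection), a minimal sketch might look as follows. The function name `select_query_terms`, the cut-off `k`, and the floor of 1 on the document frequency for terms unseen in the collection are assumptions made for illustration, not details of the system in this report.

```python
import math
from collections import Counter


def select_query_terms(query_tokens, collection_docs, k=3):
    """Keep the k query terms with the highest tf (query) * idf (collection) weight."""
    n_docs = len(collection_docs)
    doc_sets = [set(doc) for doc in collection_docs]
    tf = Counter(query_tokens)  # term frequency measured in the query document
    weights = {}
    for term, freq in tf.items():
        # document frequency measured in the target collection
        df = sum(1 for doc in doc_sets if term in doc)
        # Assumption: unseen terms get df = 1 so that the logarithm stays defined.
        idf = math.log((n_docs + 1) / max(df, 1))
        weights[term] = freq * idf
    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:k]


collection = [
    ["system", "rotor", "blade"],
    ["system", "valve"],
    ["system", "rotor", "pump"],
    ["system", "sensor"],
]
query = ["system", "system", "rotor", "blade", "blade", "heater"]
selected = select_query_terms(query, collection, k=3)
```

Here "system" occurs twice in the query but appears in every collection document, so its weight is low and it is dropped, which mirrors the observation that such words behave like stop words in a patent collection.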
Global methods are techniques for expanding or reformulating query terms independently of the query and the results returned from it; they use prior knowledge to find terms that are semantically similar to the query words (Christopher D. Manning and Schütze 2008). Global methods include:

Query expansion/reformulation with a thesaurus or WordNet. We can obtain the synonyms of a term by using a controlled vocabulary that is maintained by human editors. One popular synonym resource is NLTK WordNet.

Query expansion via automatic thesaurus generation. We can attempt to generate a thesaurus automatically by analyzing a collection of documents. There are two main approaches. One is simply to exploit word co-occurrence statistics: words that co-occur in a document or paragraph are assumed to be semantically similar to each other (Christopher D. Manning and Schütze 2008), so counting co-occurrences is a simple way to find the most similar words. The other approach is to use a shallow grammatical analysis of the text and to exploit grammatical relations or grammatical dependencies (Christopher D. Manning and Schütze 2008).
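The co-occurrence approach to automatic thesaurus generation can be sketched in a few lines. The version below counts co-occurrence at document level and expands each query term with its most frequent co-occurring term; the function names and the `per_term` parameter are hypothetical, and a practical system would typically normalize the counts rather than use them raw.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(docs):
    """For every pair of distinct terms, count the documents containing both."""
    counts = defaultdict(Counter)
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def expand_query(query_terms, counts, per_term=1):
    """Append the top co-occurring terms of each query term to the query."""
    expanded = list(query_terms)
    for term in query_terms:
        neighbours = sorted(
            counts.get(term, Counter()).items(),
            key=lambda kv: (-kv[1], kv[0]),  # highest count first, then alphabetical
        )
        for candidate, _ in neighbours[:per_term]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


docs = [
    ["engine", "motor", "piston"],
    ["engine", "motor", "fuel"],
    ["motor", "rotor"],
    ["engine", "piston"],
]
cooc = build_cooccurrence(docs)
expanded_query = expand_query(["engine"], cooc)
```

In this toy collection "motor" and "piston" both co-occur with "engine" twice; the alphabetical tie-break is an arbitrary choice that only serves to make the output deterministic.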

Local methods adjust a query relative to the documents that initially appear to match the query. The basic methods here are:

Relevance feedback (RF) involves the user in the IR process so as to improve the final result set (Christopher D. Manning and Schütze 2008). The basic procedure is that the user first issues an initial query, and the system returns a set of initial retrieval results. Next, the user marks some of the retrieval results as relevant or non-relevant. The system then formulates a better query according to the user feedback. Finally, the system displays a revised set of retrieval results. RF can go through one or more iterations of this sort.

Pseudo-relevance feedback, also known as blind relevance feedback, provides a method for automatic local analysis (Christopher D. Manning and Schütze 2008). It automates the manual part of RF, so that the user gets improved retrieval performance without an extended interaction. The method is to run a normal retrieval to find an initial set of most relevant documents, then assume that the top k ranked documents are relevant, and finally perform RF as before under this assumption.

2.3 Prior work in Patent Retrieval and Visualization

This section reviews existing work on patent search and patent visualization tasks and, based on this review, summarizes the special characteristics of patent search.

Prior Work in Patent Prior Art Retrieval

(Walid Magdy 2012) explores the special nature of recall-oriented IR and patent search, and proposes a new evaluation metric for recall-oriented information retrieval tasks called the Patent Retrieval Evaluation Score (PRES). (Far et al. 2015) build an Oracular Relevance Feedback System that selects optimal query terms by acquiring pseudo-relevance feedback from the initial query results and assigning a score to each term for query reduction.
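The pseudo-relevance feedback procedure described above (retrieve, assume the top k results are relevant, reformulate, retrieve again) can be sketched as follows. This is a generic illustration and not the system of (Far et al. 2015): the toy overlap ranking, the frequency-based choice of new terms, and all names and parameter values are assumptions.

```python
from collections import Counter


def overlap_rank(query_terms, docs):
    """Toy first-pass retrieval: rank documents by the number of shared query terms."""
    query = set(query_terms)
    scored = [(len(query & set(doc)), i) for i, doc in enumerate(docs)]
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


def pseudo_relevance_feedback(query_terms, docs, top_k=2, n_new_terms=2):
    """Expand the query with frequent terms from the top_k initially retrieved documents."""
    initial = overlap_rank(query_terms, docs)
    feedback_docs = [docs[i] for i in initial[:top_k]]  # assumed relevant, no user input
    term_counts = Counter(
        term for doc in feedback_docs for term in doc if term not in query_terms
    )
    best = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_new_terms]
    expanded = list(query_terms) + [term for term, _ in best]
    return expanded, overlap_rank(expanded, docs)


docs = [
    ["turbine", "blade", "rotor"],
    ["turbine", "rotor", "hub"],
    ["rotor", "hub", "bearing"],
    ["coffee", "filter"],
]
first_pass = overlap_rank(["turbine"], docs)
expanded, second_pass = pseudo_relevance_feedback(["turbine"], docs)
```

In the toy run, the third document is not retrieved by the original one-word query but is found after expansion, which illustrates how feedback can raise recall. If the top k documents are not actually relevant, the same mechanism can pull the query off topic.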
Using the whole patent document as a query is neither practical nor accurate, because a patent contains a huge amount of text and most of it is not useful for retrieval. Hence we need to extract useful terms from the whole patent document. As introduced before, patents are structured documents that consist of several different sections such as title, abstract, description, claims, and classification code. Different sections use different types of language to describe the invention. The abstract and description sections tend to use more technical terminology, while the claims usually use legal jargon. These differences stem from the different functions of the sections in the patent document: the abstract and description explain what the new invention is and how it works, while the claims assert the legal protection sought for the new invention.

Previous work reports contrasting findings on which fields should be used for query term extraction. According to early patent research, the claims section is the primary section for building the query, which agrees with where examiners start in the novelty checking process (Takaki, Fujii, and Ishikawa n.d.). However, more recent work shows that building queries from the description field can obtain better results in patent retrieval (Xue and

Croft n.d.). In contrast, one experiment shows that discarding the description from queries improves MAP by up to 30%, because the description section contains more noise than information (Gobeill et al. 2010). Other research suggests that extracting terms according to their TF-IDF scores from every field of the query patent, and giving higher importance to the terms extracted from the title field, is an effective way of constructing a search query (Cetintas and Si 2012).

Prior Work in Patent Prior Art Visualization

Patent visualization has been tackled from many different angles, such as document visualization, collection visualization and exploration, and patent landscapes, but less has been done on designing visualizations for patent prior art specifically. (Kucher and Kerren 2015) present an interactive visual survey of text visualization techniques that can be used to search for related work. The survey can also be used to introduce the subfield and to gain insight into research trends in text visualization techniques. The paper also summarizes a taxonomy of text visualization techniques.

3 Materials and Evaluation Metrics

This section introduces the patent data we use in this project and the three evaluation metrics we use in the experiment to evaluate the retrieval results.

3.1 Data Collection

The patent data collection we use in this experiment is from the Cross Language Evaluation Forum for Intellectual Property evaluation track (CLEF-IP). The CLEF-IP track was launched in 2009 to investigate IR techniques for patent retrieval and was part of the CLEF 2009 evaluation campaign (TUWIEN 2018a). The prior art candidate search task (PAC) ran in five consecutive years: 2009, 2010, 2011, 2012, and 2013 (TUWIEN 2018b). The task we use in this experiment is CLEF-IP 2010, which is a benchmarking activity of the CLEF-IP 2010 conference. This track contains about 1.3 million patents derived from the European Patent Office (EPO) (Piroi 2010).
The data collection covers English, French, and German. Figure 5 shows the percentage of English, German and French patents in the CLEF-IP 2010 collection. Only 68% of the patents in the data collection are English patents, and we use only the English patents in this experiment. The patent documents in the collection are stored as XML files. There are two tasks in the 2010 track. Our experiment performs the first task, which is to find patent documents that are candidates to constitute prior art for a given document. The target data set contains all EPO documents that have an application date previous (2,680,698 patent documents constituting 1,331,106 patents) (Piroi 2010). The data collection contains the patent documents without merging the documents related to the same patent into one document, so we perform this merging in our data preprocessing. Each patent in the collection is identified by a unique patent number, which is a string starting with "EP", followed by 7 digits and then a two-character kind code. Corresponding to each patent is a directory containing the patent documents related to that patent. The kind codes represent different stages of the patent's life cycle. A kind code starting with "A" means that the document is a patent application, while

a kind code starting with "B" refers to a granted patent. Table 1 shows the meaning of common kind codes. Not all patents in the collection contain all sections. Figure 6 shows the completeness of the English patents in the data collection, where only 52% of the English patents are complete. We use not only the complete English patents but also the other English patents, although they are incomplete.

Figure 5: Percentage of English, German, and French patents in the CLEF-IP 2010 collection

Figure 6: Completeness of the presence of English patents in the CLEF-IP 2010 collection

The query data set contains 2000 queries (Piroi 2010). Each query document is a patent application, A1 or A2, from which the citation information was removed. The query documents also cover three languages: 1,348 English queries, 518 German queries, and 134 French queries. We use only the English query documents as queries in our experiment.
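As a small illustration of the patent number format described above ("EP", 7 digits, then a kind code whose first character is "A" or "B"), the following sketch parses an identifier and reports whether it denotes an application or a granted patent. The names `PatentId` and `parse_patent_id` are hypothetical, and accepting optional hyphen separators (for example "EP-1234567-A1") is an assumption about how identifiers may be written in the XML files.

```python
import re
from typing import NamedTuple

# "EP" + 7 digits + kind code (A or B followed by one digit), hyphens optional.
PATENT_ID = re.compile(r"^EP-?(\d{7})-?([AB]\d)$")


class PatentId(NamedTuple):
    number: str           # the 7-digit patent number
    kind_code: str        # e.g. "A1" or "B1"
    is_application: bool  # True for "A" kind codes, False for "B"


def parse_patent_id(identifier):
    """Split an EPO patent identifier into number and kind code."""
    match = PATENT_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised EPO patent identifier: {identifier!r}")
    number, kind_code = match.groups()
    return PatentId(number, kind_code, kind_code.startswith("A"))


application = parse_patent_id("EP1234567A1")
granted = parse_patent_id("EP-0123456-B1")
```

Grouping parsed identifiers by `number` is one straightforward way to collect all versions of the same patent before merging them into a single document.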

Kind Code   Meaning
A1          publication of application with search report
A2          publication of application without search report
A3          publication of search report
A4          supplementary search report
A8          corrected title page of an EP A document
A9          complete reprint of an EP A document
B1          granted patent
B2          granted patent after modification
B8          corrected front page of an EP B document
B9          complete reprint of an EP B document

Table 1: The patent ID kind codes and their meaning

3.2 Evaluation Metrics

Precision and Recall

An ideal retrieval system retrieves all relevant documents, and all the documents it retrieves are relevant. Recall and precision are the two evaluation metrics used to evaluate these two aspects respectively, and they are the most basic and most frequently used measures of information retrieval effectiveness. Equations 4 and 5 give the formulas for precision and recall.

Precision is the fraction of the retrieved documents that are relevant:

$$Precision = P(relevant \mid retrieved) = \frac{TP}{TP + FP} \tag{4}$$

Recall is the fraction of the relevant documents that are retrieved:

$$Recall = P(retrieved \mid relevant) = \frac{TP}{TP + FN} \tag{5}$$

where:
True Positive (TP): number of retrieved relevant documents
False Positive (FP): number of retrieved irrelevant documents
True Negative (TN): number of not-retrieved irrelevant documents
False Negative (FN): number of not-retrieved relevant documents

Prior art search is a recall-oriented search, so we pay more attention to recall.
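Equations 4 and 5 translate directly into set operations on document identifiers. The sketch below is a minimal illustration; the function name and the convention of returning 0.0 when a denominator would be zero are assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute precision (Equation 4) and recall (Equation 5) from document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # retrieved and relevant
    fp = len(retrieved - relevant)  # retrieved but irrelevant
    fn = len(relevant - retrieved)  # relevant but not retrieved
    # Assumption: report 0.0 instead of dividing by zero for empty sets.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

With four retrieved documents of which two are relevant, and three relevant documents in total, precision is 2/4 and recall is 2/3. Retrieving a longer list can only keep recall the same or raise it, which is why recall-oriented users accept long result lists.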
Precision is less informative for patent retrieval results, and more suitable evaluation metrics are available for assessing the accuracy of our results, so we do not use precision as an evaluation metric in this experiment.

Patent Retrieval Evaluation Score (PRES)

A new evaluation metric called the Patent Retrieval Evaluation Score (PRES) was introduced by (Walid Magdy 2012). It is based on the same idea as normalized recall $R_{norm}$ (Joseph Rocchio 1964; ROBERTSON 1969), shown in Equation 6, but with a different definition of the worst case.

$$R_{norm} = \frac{A_2}{A_1 + A_2} = 1 - \frac{\sum r_i - \sum i}{n(N - n)} \tag{6}$$

where:
$A_1$, $A_2$: areas shown in Figure 7
$r_i$: the rank at which the $i$-th relevant document is retrieved
$N$: collection size
$n$: number of relevant documents

The $R_{norm}$ score reflects the precision-recall curve in one number, with the requirement that all documents in the collection be ranked according to their relevance to a query (Joseph Rocchio 1964; ROBERTSON 1969). This metric measures a system's effectiveness in ranking documents relative to the best and worst ranking cases (Walid Magdy 2012), where the best case is retrieving all relevant documents at the top of the result list, and the worst case is retrieving them at the bottom of a result list that ranks all documents in the collection. Figure 7 illustrates the calculation of $R_{norm}$, where $A_1$ represents the area between the best case and the actual case, and $A_2$ represents the area between the actual case and the worst case.

Figure 7: Illustration of how the $R_{norm}$ curve is bounded by the best and worst cases (Rijsbergen 1979)

Figure 8: PRES curve is bounded between the best case and the newly defined worst case (Walid Magdy 2012)

Unlike $R_{norm}$, PRES assumes that the worst case is to retrieve all the relevant documents just after the maximum number of documents to be checked by the user, $N_{max}$. Any relevant document not retrieved in the

top $N_{max}$ is assumed to be the worst case. Figure 8 illustrates the calculation of PRES. Applying this assumption to Equation 6 replaces $N$ with $N_{max} + n$, where $N_{max}$ is the number of retrieved documents, which is also the maximum number of documents to be checked by the user:

$$PRES = R_{norm}\Big|_{N = N_{max} + n} = 1 - \frac{\sum r_i - \sum i}{n(N - n)}\Bigg|_{N = N_{max} + n} = 1 - \frac{\sum r_i - \sum i}{n \, N_{max}}$$

and the sum of the ideal ranks of the $n$ relevant documents is

$$\sum_{i=1}^{n} i = \frac{n(n + 1)}{2} \tag{7}$$

so that

$$PRES = 1 - \frac{\frac{\sum r_i}{n} - \frac{n + 1}{2}}{N_{max}} \tag{8}$$

Equation 9 shows the direct calculation of the sum of the ranks of the relevant documents in the general case, when some relevant documents are missing from the top $N_{max}$:

$$\sum r_i = \sum_{i=1}^{R} r_i + (n - R)(N_{max} + n) - \frac{(n - R)(n - R - 1)}{2} \tag{9}$$

where:
$R$: number of relevant documents retrieved in the first $N_{max}$ documents

Mean Average Precision (MAP)

Mean average precision (MAP) is by far the most popular evaluation metric in general use for ad hoc IR tasks (Baeza-Yates and Ribeiro-Neto 2010). Equation 10 shows the definition of average precision (AP) for a given topic, and MAP, shown in Equation 11, is the mean of AP taken over all topics in the test collection.

Average precision (AP) is the average of the precision values at each point where a relevant document is found:

$$AP = \frac{\sum_{r=1}^{N} \left( P(r) \times rel(r) \right)}{n} \tag{10}$$

Mean Average Precision (MAP) is the average of the average precision scores over a query set:

$$MAP(Q) = \frac{\sum_{q \in Q} AP(q)}{|Q|} \tag{11}$$
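As a numerical check of Equations 8 to 11, the following sketch computes PRES, AP and MAP from a ranked list of document identifiers and a set of relevant identifiers. The function names and the toy data are assumptions made for illustration; this is not the evaluation code used in the experiments of this report.

```python
def pres(ranked, relevant, n_max):
    """PRES via Equations 8 and 9; only the top n_max results are inspected."""
    relevant = set(relevant)
    n = len(relevant)
    found_ranks = [
        rank for rank, doc in enumerate(ranked[:n_max], start=1) if doc in relevant
    ]
    missing = n - len(found_ranks)  # n - R in Equation 9
    # Equation 9: missing documents are placed just after rank n_max.
    rank_sum = sum(found_ranks) + missing * (n_max + n) - missing * (missing - 1) / 2
    # Equation 8
    return 1 - (rank_sum / n - (n + 1) / 2) / n_max


def average_precision(ranked, relevant):
    """Equation 10: mean of the precision values at the ranks of relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # P(r) at a rank where rel(r) = 1
    return total / len(relevant)


def mean_average_precision(runs):
    """Equation 11: runs is a list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "z"}  # "z" is never retrieved
pres_score = pres(ranked, relevant, n_max=5)
ap_score = average_precision(ranked, relevant)
```

Two sanity checks follow from the definitions: a list with all relevant documents at the top yields a PRES of 1, and a list that retrieves none of them within the top $N_{max}$ yields 0.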

In Equations 10 and 11:
$r$: the rank
$P(r)$: precision at a given cut-off rank, i.e. Precision($r$)
$rel(r)$: a binary function of the document relevance at a given rank, where $rel(r)=1$ when the document at rank $r$ is relevant and $rel(r)=0$ otherwise
$n$: the total number of relevant documents
$Q$: the query set

As its name implies, MAP is a precision metric. Equation 10 shows that the larger the rank number of a relevant document, the weaker its impact on AP. Consequently, even when two result lists have the same recall, the list with most of its relevant documents near the top has a much higher MAP than the list with most of its relevant documents near the bottom. That is why MAP provides a good and intuitive evaluation for IR tasks that emphasize precision, but often does not give a meaningful interpretation for recall-focused tasks (Walid Magdy 2012).

4 Improving Patent Prior Art Search with Query Reduction and Query Structure Formulation
This section presents an overview of the retrieval framework of this study and two general methods to improve patent prior art search. The methods are inspired by two ideas: (i) query refinement via term selection, and (ii) query structure formulation. The proposed methods are compared against the baseline described in Section 5.1.

4.1 Patent Retrieval System Framework

Figure 9: The overall retrieval process in the patent retrieval experiment

The overall process of the retrieval experiment is shown in Figure 9. The

most important module is query formulation. The query formulation techniques used in this study to improve the retrieval results are explained in detail in the following sections.

4.1.1 Data Preprocessing
First, we preprocess all the English patents in the CLEF-IP 2010 collection and all the English patent application topics in the CLEF-IP 2010 topic set. Preprocessing converts each XML document into a JSON document, merges the different versions of a patent in the collection into one document, and filters out all patent sections except title, abstract, description, claims, and classification.

4.1.2 Indexing
Structured indexing (parameters shown in Table 2) is applied to the patent documents in the collection. The document structure is therefore preserved in the index, and we can search a specific field of a document or the full document as a whole. A custom analyzer, shown in Table 3, is used in the index mapping. An analyzer consists of character filters, a tokenizer, and token filters. Our custom analyzer uses the lowercase tokenizer, the English stop word token filter, and the Porter stem token filter. A term vector parameter is also set at indexing time so that we can obtain the term vector information, which is used later in the term selection process.

Parameters      | Title, Abstract, Description, Claims | ipcr   | ucid
field datatype  | text                                 | [text] | keyword
term_vector     | with_positions_offsets               | none   | none
analyzer        | my_analyzer                          | none   | none
Table 2: Index mapping parameters

my_analyzer
character filter | none
tokenizer        | lowercase
token filter     | porter stemmer, english stop
Table 3: Custom analyzer

4.2 Query Reduction Via Term Selection

4.2.1 Global Frequency-Based Term Selection
Global frequency-based term selection removes terms with high document frequency in a global context by building a stop word list for each field (title, abstract, description, claims) from the whole patent collection.
Unlike a common English stop word list, these patent-specific stop word lists depend on the data collection used, so they can identify collection-specific stop words. Patent-specific stop words are extracted from each individual patent field following (Corremans. and G. 2000). To obtain the field stop words, we retrieve from Elasticsearch the field frequency of each term identified in the field. The field

frequency of a term T in field F is the number of instances of field F, across all documents in the index, that contain T. We obtain the patent-specific stop words for each text field (title, abstract, description, claims). For each field, the terms whose field frequency is higher than 1% of the highest term field frequency in that field are selected as stop words. The 1% value was chosen subjectively, based on our observations and experiments on the data.

4.2.2 Local Frequency-Based Term Selection
Local frequency-based term selection removes a percentage of the high-document-frequency terms in a specific field of a patent document. We first obtain the document frequency of every term in the field and sort the terms by document frequency. We then remove the x% of terms with the highest document frequency. The threshold x is the percentage of terms removed, and each field has its own threshold.

4.2.3 Query Key-Phrase Selection
Query phrase selection extracts key-phrases automatically based on their informativeness score. Automatic key-phrase extraction consists of two steps: (i) identify a set of noun phrases in the given text as candidates, and (ii) score the candidate phrases with a score function and select the high-scoring phrases as key-phrases.

Candidate Key-phrase Identification
In principle, all words and phrases in a document can be considered candidate phrases. However, not all candidate phrases are informative for the retrieval task, so we need to identify key-phrases among them to reduce the computational cost and improve the retrieval accuracy. Heuristics are typically used to identify a smaller subset of better candidates (DeWilde n.d.). Common heuristics include removing common stop words, digits, and punctuation, and filtering for words with certain parts of speech.
More specifically, for multiword phrases, certain POS patterns can be used to identify noun phrases, and external knowledge bases such as WordNet or Wikipedia can serve as a reference source of good and bad key-phrases. In our study, we use part-of-speech patterns, expressed as a regular expression, to extract noun phrases as key-phrase candidates. The regular expression is {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}, written in the simplified format used by NLTK's RegexpParser(). It matches any number of adjectives followed by at least one noun, optionally joined by a preposition to one other adjective(s)+noun(s) sequence (DeWilde n.d.).

Keyphrase Selection
There are many methods for distinguishing key-phrase candidates from noise phrase candidates. The simplest is to score candidates solely on frequency statistics, such as TF*IDF or BM25. For this method, we assume that the key-phrases of a document tend to be phrases with high phrase frequency and low document frequency.
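The candidate identification and frequency-based scoring described above can be sketched as follows. This is an illustrative sketch, not the implementation used in the study: instead of calling NLTK, it re-expresses the same tag pattern with Python's standard `re` module over already POS-tagged tokens, and the function names, the input format (a list of `(word, tag)` pairs), the idf form `log((N + 1) / df)`, and the choice to treat unseen terms as having document frequency 1 are all assumptions made for the example.

```python
import math
import re
from typing import Dict, List, Tuple

# Tag-level equivalent of {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}.
# Each token contributes "<TAG>" to one string so a plain regex can scan it.
NOUN_GROUP = r"(?:<JJ>)*(?:<NN[^<>]*>)+"
CHUNK = re.compile(rf"(?:{NOUN_GROUP}<IN>)?{NOUN_GROUP}")


def candidate_phrases(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Return candidate key-phrases as lists of lowercased words."""
    tag_string = ""
    offset_to_token = {}  # character offset in tag_string -> token index
    for index, (_, tag) in enumerate(tagged):
        offset_to_token[len(tag_string)] = index
        tag_string += f"<{tag}>"
    offset_to_token[len(tag_string)] = len(tagged)

    phrases = []
    for match in CHUNK.finditer(tag_string):
        start = offset_to_token[match.start()]
        end = offset_to_token[match.end()]
        phrases.append([word.lower() for word, _ in tagged[start:end]])
    return phrases


def phrase_score(phrase: List[str], query_tf: Dict[str, int],
                 collection_df: Dict[str, int], n_docs: int) -> float:
    """Average TF*IDF of the terms in a phrase.

    tf comes from the query document and df from the target collection.
    Assumption: terms absent from the collection are given df = 1.
    """
    total = 0.0
    for term in phrase:
        df = max(collection_df.get(term, 1), 1)
        idf = math.log((n_docs + 1) / df)  # assumed idf form
        total += query_tf.get(term, 0) * idf
    return total / len(phrase)
```

For a tagged sentence such as "A wireless communication device for mobile terminals is disclosed", the pattern yields a single candidate spanning "wireless communication device for mobile terminals", because the preposition joins the two noun groups.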

To perform key-phrase selection, we use the TF-IDF scoring method shown in Equation 13 to score each term. The score of each phrase, shown in Equation 14, is the average score of the terms it consists of. Since the domain of the queries differs from that of the retrieval target in the distribution of term occurrences, using the distribution of only one corpus can cause incorrect term weighting (Hideo Itoh 2003). Thus, in this experiment, the document frequency of a query term is obtained from the target data collection and the term frequency of a query term is obtained from the query document.

\[ idf(t_k) = \log \frac{N + 1}{df(t_k)} \tag{12} \]

\[ TFIDF(t_k, Q_i) = tf(t_k, Q_i) \times idf(t_k) \tag{13} \]

\[ TFIDF(p_k, Q_i) = \frac{\sum_{t \in p_k} TFIDF(t, Q_i)}{|p_k|} \tag{14} \]

where $tf(t_k, Q_i)$ is the number of occurrences of the term $t_k$ in the query document $Q_i$, $df(t_k)$ is the number of documents in the target data collection $D$ that contain at least one occurrence of the term $t_k$, $N$ is the number of documents in $D$, and $|p_k|$ is the number of terms that the phrase $p_k$ contains.

4.3 Query Structure Formulation
In this section, we present the two query structure formulation methods used in this research project.

4.3.1 Compound Query
The compound query is composed of leaf queries, each of which uses the text of one specific field as query input. The compound query combines the results and scores of the leaf queries to form a new score and a new result list. There are two types of compound structure.

In the first type, each leaf query is built from one field of the query patent and searches only the corresponding field in the patent collection. For example, the query string extracted from the "abstract" field of the query patent searches the "abstract" field in the patent collection, the query string extracted from the "description" field searches the "description" field, and so on.
The search results of the leaf queries are then combined with the "OR" operator, and the leaf query scores are summed to give the score of the compound query.

In the second type, the leaf queries are again built from specific fields of the query patent, but each leaf query searches the whole document. For example, the "abstract" leaf query uses the "abstract" field of the query patent as its query string and searches the full patent document as a whole. All such leaf queries are combined with the "OR" operator: the leaf query scores are added up to give the new compound score, and the leaf query results are merged to form the new result set.
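Since the index is held in Elasticsearch, the two compound structures described above could be expressed as `bool` queries whose `should` clauses are the leaf queries: with no other clauses, at least one `should` clause must match (the "OR" behaviour), and the scores of the matching clauses are added together. The sketch below only builds the request bodies as Python dictionaries. The lowercase field names, the function names, and the use of `multi_match` over the four text fields to stand in for "searching the full document" are assumptions for illustration, not the queries used in the study.

```python
from typing import Dict, List

# Assumed index field names (lowercased versions of the Table 2 text fields).
TEXT_FIELDS: List[str] = ["title", "abstract", "description", "claims"]


def compound_query_by_field(query_patent: Dict[str, str]) -> dict:
    """Type 1: each leaf query searches only its corresponding field."""
    leaves = [
        {"match": {field: query_patent[field]}}
        for field in TEXT_FIELDS
        if query_patent.get(field)
    ]
    # bool/should: OR of the leaves; matching leaf scores are summed.
    return {"query": {"bool": {"should": leaves}}}


def compound_query_full_document(query_patent: Dict[str, str]) -> dict:
    """Type 2: each leaf query searches all text fields of the document."""
    leaves = [
        {"multi_match": {"query": query_patent[field], "fields": TEXT_FIELDS}}
        for field in TEXT_FIELDS
        if query_patent.get(field)
    ]
    return {"query": {"bool": {"should": leaves}}}
```

Either dictionary could be passed as the body of a search request; empty query fields are skipped so that they do not produce empty leaf queries.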

We implemented both compound structures in our research and compare their results in Section 5.2.

4.3.2 Common Term/Phrase Weight Assignment
We assume that terms or phrases that appear in several fields of the query patent are more important than terms or phrases that appear in only a single section, and that the more fields a term or phrase appears in, the more important it is. The Common Term Selection method assigns each common term a weight equal to the number of fields of the document in which the term appears; this weight is used in the query structure formulation.

5 Experiments
In this section, we first present the baseline retrieval system. We then compare and analyze the results of the baseline system and of several improved systems that use the query formulation techniques described in Section 4. Finally, we introduce our visualization experiment.

5.1 Baseline definition
In the baseline query formulation, we use the lowercase tokenizer as the indexing analyzer and remove common English stop words (countwordsfree 2018), digits, and words shorter than 3 letters. We also use the IPC codes assigned to each topic as metadata to filter the search results, so that every returned result shares at least one IPC code with the topic query patent application. Performance is evaluated with the three popular metrics defined earlier (average recall, MAP, and PRES) on the top 100 results for each query. The results are in Table 4.

Table 4: The baseline results using different patent sections as queries (rows: PRES, MAP, A.Recall; columns: Title, Abstract, Description, Claims)

According to the results in Table 4, the best section to query with in the baseline system is the claims section.

5.2 Retrieval Experimental Results and Analysis
This section presents the evaluation and analysis of the results obtained by adding different techniques to the baseline system.

Frequency-based term selection
We first apply global frequency-based term selection to the baseline system.
We obtain the patent-specific field stop word lists by extracting the terms with high field document frequency from the target data collection. After applying the baseline query formulation, we reformulate the baseline query by filtering it with these patent-specific field stop word lists to reduce the noise terms in

the query. Table 5 shows the evaluation results of frequency-based term selection on the baseline system. Comparing these results with the baseline results, we find that frequency-based term selection improves the baseline system.

Table 5: Adding frequency-based term selection to the baseline system (rows: PRES, MAP, A.Recall; columns: Title, Abstract, Description, Claims)

Compound query structure
Second, we apply the compound query structure methods on top of the system improved by frequency-based query reduction. There are two kinds of combination method. The first uses all four text sections (title, abstract, description, and claims) of the query patent as leaf queries, with each leaf query searching the corresponding section in the target data collection, and sums the leaf query scores to give the final score. We denote this type of combination as Combination(1). The result of Combination(1) is shown in Table 6.

Table 6: Adding the Combination(1) method to the system improved by frequency-based term selection (rows: PRES, MAP, A.Recall; column: Title + Abstract + Description + Claims, Combination1)

The other kind of combination method also uses all four text sections of the query patent as leaf queries, but each leaf query searches the full patent documents in the target data collection, and the leaf query scores are again summed to give the final score. We denote this type of combination as Combination(2). The result of Combination(2) is shown in Table 7.

Table 7: Adding the Combination(2) method to the system improved by frequency-based term selection (rows: PRES, MAP, A.Recall; column: Title + Abstract + Description + Claims, Combination2)

Comparing the results of Combination(1) in Table 6 and Combination(2) in Table 7 with the frequency-based results in Table 5, we find that Combination(1) does not help and that Combination(2) improves on the results of the frequency-based method.
Thus, in the following experiments, we abandon Combination(1) and continue to improve the query on top of the Combination(2) method.

Query key-phrase selection
We add a key-phrase selection method to the Combination(2) query formulation method. We select key-phrases using the method described in Section 4 and then construct phrase queries from them. Table 8 shows the retrieval results using frequency-based term selection, phrase queries, and the Combination(2) query formulation method. The result is better than that of the Combination(2) method alone (see Table 7). In conclusion, the key-phrase selection query formulation method improves the retrieval results.

Table 8: Combination(2) combined with key-phrases (rows: PRES, MAP, A.Recall; column: Combination2 + key-phrases)

Common term weight reassignment
To test whether the common term weight reassignment query formulation method can improve the retrieval result, we apply it to the Combination(2) query formulation. We identify the common terms in the four sections of a patent document and assign a higher weight to them according to the method described in Section 4. As shown in Table 9, adding the common term query formulation method to Combination(2) outperforms Combination(2) alone (Table 7), but not the method that combines Combination(2) and key-phrases.

Table 9: Combination(2) full-text search with higher weights assigned to common terms (rows: PRES, MAP, A.Recall; column: Combination2 + common term)

Combination of all the presented query formulation methods
Finally, we combine all the useful query formulation methods above to form a query. Table 10 shows the results of using the frequency filter, Combination(2), key-phrases, and common term weight reassignment together to form the query. According to our experiments, combining all the query formulation methods obtains a better retrieval result than any of the results mentioned before. Table 11 compares the baseline retrieval result with all the retrieval results obtained with the different query formulation methods.
We can conclude from the results that our query formulation methods substantially improve the retrieval results over the baseline system.
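Two of the simpler steps that feed into the combined query, the global frequency filter of Section 4.2 and the common term weights of Section 4.3, can be sketched as below. The function names and input formats are hypothetical: field frequencies are assumed to be already available as a term-to-count dictionary (in the study they are obtained from Elasticsearch), and each query field is assumed to be already tokenized into a list of terms.

```python
from typing import Dict, List, Set


def field_stop_words(field_frequency: Dict[str, int],
                     ratio: float = 0.01) -> Set[str]:
    """Terms whose field frequency exceeds `ratio` times the highest one.

    The default ratio of 0.01 mirrors the 1% threshold described in
    Section 4.2; the input dictionary maps term -> field frequency.
    """
    if not field_frequency:
        return set()
    threshold = ratio * max(field_frequency.values())
    return {term for term, freq in field_frequency.items() if freq > threshold}


def common_term_weights(fields: Dict[str, List[str]]) -> Dict[str, int]:
    """Weight of a term = number of query fields in which it appears."""
    weights: Dict[str, int] = {}
    for terms in fields.values():
        for term in set(terms):  # count each field at most once per term
            weights[term] = weights.get(term, 0) + 1
    return weights


def filter_query_terms(terms: List[str], stop_words: Set[str]) -> List[str]:
    """Drop the patent-specific stop words from one field of the query."""
    return [term for term in terms if term not in stop_words]
```

A term that occurs in the title, abstract, and claims of the query patent would receive weight 3 here, while a term confined to the description would receive weight 1; how that weight is turned into a query-time boost is left open in this sketch.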

Table 10: Combining the frequency filter, Combination(2), key-phrases, and common term query formulation (rows: PRES, MAP, A.Recall; column: frequency filter + combination(2) + phrase search + common term)

5.3 Visualization Experiment
The overall process of the visualization experiment is shown in Figure 10. We first set up the MongoDB database with the query patent documents and all the related patent documents, and then build a web application on the Django framework that uses this database.

Figure 10: The overall process of the visualization experiment

5.3.1 Database Setup
The format of the retrieval results from the retrieval experiment is shown in Figure 11. The database used in this visualization experiment is MongoDB. Unlike MySQL, which is a relational database, MongoDB is a document-oriented database, which suits our experiment because our data are documents without a uniform field format. MongoDB offers good scalability and flexibility, which makes it a good choice for managing a large amount of JSON data efficiently (MongoDB 2018). MongoDB stores data in flexible, JSON-like documents, meaning that fields can vary from document to document and the data structure can change over time. Our patent and qrel documents are stored in JSON format with varying fields, and MongoDB makes it easier to work with data collections like this. MongoDB also provides ad hoc queries, indexing, and real-time aggregation, which are powerful ways to access and analyze our data. MongoDB is free and open-source, and it provides drivers for

languages, and the community has built dozens more, all of which make MongoDB a popular database today.

Query Type                                                              | Query Section
baseline                                                                | Title, Abstract, Description, Claims
baseline + frequency filter                                             | Title, Abstract, Description, Claims
baseline + frequency filter + combination1                              | All sections
baseline + frequency filter + combination2                              | All sections
baseline + frequency filter + combination2 + phrase search              | All sections
baseline + frequency filter + combination2 + common term                | All sections
baseline + frequency filter + combination2 + phrase search + common term | All sections
Table 11: Comparison of the retrieval results using different query formulation methods (metrics: PRES, MAP, A.Recall)

There are two collections in my database: the qrel collection stores the query patent documents, and the patent collection stores the patent documents related to the query documents. The related patent documents obtained from the result list (Figure 11) are given an added field called Qkey, whose value is the qrel identity that we enter in the search bar. Each qrel patent obtained from the result list also has an added field, PAC, which holds the identity of the qrel. These added fields are useful for making queries.

5.3.2 Web Framework
Our visualization system is built on Django. Django is a free and open-source high-level web framework, written in Python, which follows the model-view-template (MVT) architectural pattern (Django 2018). Django was created to ease the creation of complex, database-driven websites, which makes it a good fit for our visualization experiment. Since the architectural pattern of Django is model-view-template (MVT), we explain these three parts in detail below.

First, in the model part, we map the collections in the database to the classes we create in the models. With the help of the Python database driver

provided by MongoDB, we can connect to our database easily. We then edit the metadata of each class to specify which collection it refers to, and we explicitly declare in the class the fields that the documents in this collection must have. The class and the documents in the collection are then mapped together automatically, so we can use the documents in the database as if they were class instances in the view. All of this is done in the models.py file of the Django framework.

Figure 11: Qrel format

Second, the view part is contained in the views.py file of the Django framework. A Python function in views.py represents a view, which takes a web request and returns a web response. This response can be the HTML contents of a web page, a redirect, a 404 error, and so on. Generally, a view retrieves data from the model objects according to the request parameters, loads a template, and renders the template with the retrieved data. To see the template rendered by the view in a web page, we also need to associate the view with a URL in urls.py.

Third, the template folder represents the template part of the MVT architecture. All the HTML files that describe the web pages rendered by the view functions are stored in this folder, and their related CSS and JS files are stored in the static folder. Variables passed from the view function to the HTML can be used by surrounding them with double curly braces. Django also has a template search path, which allows you to minimize redundancy among templates.

With the help of the Django framework and MongoDB, we can develop a visualization system that has a clear structure and can scale quickly and flexibly.

5.3.3 Visualization Results
The visualization application is deployed on Heroku, a cloud application platform. The database used by the visualization application is stored in mLab, which provides a fully managed cloud database service that hosts MongoDB databases.

Figure 12: Visualization website use case diagram

Here is the link for the visualization application: herokuapp.com/visualization/

To help the user explore the patent retrieval result visualization website thoroughly, a use case diagram is provided in Figure 12. To explore the retrieval results, the user first enters the query string in the search bar. The query string is the query id in the CLEF-IP 2010 query topic set; PAC-1 is an example query string. The page is then redirected to the retrieval results page, which contains the related patent list, the results analysis, and the related results network. You can view a related patent by clicking "view patent" within the corresponding element of the related patent list. You can also explore the commonalities and differences between a specific related patent and the query topic patent by clicking on a patent element in the related results network. You are then redirected to the comparison page for that related patent and the query topic patent, which contains common analysis and patent text comparison panels.

6 Conclusion

6.1 What I learned during this project
During this project I developed an information retrieval system and a retrieval result visualization website. What I learned from the project is described below.

Information Retrieval
I learned a lot from the patent retrieval experiment. Before the experiment, I searched for and read related material about information retrieval to figure out how an information retrieval system works and to learn the theory behind information retrieval models and the related evaluation metrics. Also, I


More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

VK Multimedia Information Systems

VK Multimedia Information Systems VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Results Exercise 01 Exercise 02 Retrieval

More information

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time: English Student no:... Page 1 of 14 Contact during the exam: Geir Solskinnsbakk Phone: 735 94218/ 93607988 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

More information

A Study on Query Expansion with MeSH Terms and Elasticsearch. IMS Unipd at CLEF ehealth Task 3

A Study on Query Expansion with MeSH Terms and Elasticsearch. IMS Unipd at CLEF ehealth Task 3 A Study on Query Expansion with MeSH Terms and Elasticsearch. IMS Unipd at CLEF ehealth Task 3 Giorgio Maria Di Nunzio and Alexandru Moldovan Dept. of Information Engineering University of Padua giorgiomaria.dinunzio@unipd.it,alexandru.moldovan@studenti.unipd.it

More information

Query Refinement and Search Result Presentation

Query Refinement and Search Result Presentation Query Refinement and Search Result Presentation (Short) Queries & Information Needs A query can be a poor representation of the information need Short queries are often used in search engines due to the

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural

More information

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline Relevance Feedback and Query Reformulation Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price IR on the Internet, Spring 2010 1 Outline Query reformulation Sources of relevance

More information

Prior Art Search using International Patent Classification Codes and All-Claims-Queries

Prior Art Search using International Patent Classification Codes and All-Claims-Queries Prior Art Search using International Patent Classification Codes and All-Claims-Queries György Szarvas, Benjamin Herbert, Iryna Gurevych UKP Lab, Technische Universität Darmstadt, Germany http://www.ukp.tu-darmstadt.de

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Tansu Alpcan C. Bauckhage S. Agarwal

Tansu Alpcan C. Bauckhage S. Agarwal 1 / 16 C. Bauckhage S. Agarwal Deutsche Telekom Laboratories GBR 2007 2 / 16 Outline 3 / 16 Overview A novel expert peering system for community-based information exchange A graph-based scheme consisting

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad

More information

Hyperlink-Extended Pseudo Relevance Feedback for Improved. Microblog Retrieval

Hyperlink-Extended Pseudo Relevance Feedback for Improved. Microblog Retrieval THE AMERICAN UNIVERSITY IN CAIRO SCHOOL OF SCIENCES AND ENGINEERING Hyperlink-Extended Pseudo Relevance Feedback for Improved Microblog Retrieval A thesis submitted to Department of Computer Science and

More information

Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming

Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming Florian Boudin LINA - UMR CNRS 6241, Université de Nantes, France Keyphrase 2015 1 / 22 Errors made by

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

A Text Retrieval Approach to Recover Links among s and Source Code Classes

A Text Retrieval Approach to Recover Links among  s and Source Code Classes 318 A Text Retrieval Approach to Recover Links among E-Mails and Source Code Classes Giuseppe Scanniello and Licio Mazzeo Universitá della Basilicata, Macchia Romana, Viale Dell Ateneo, 85100, Potenza,

More information

International ejournals

International ejournals Available online at www.internationalejournals.com International ejournals ISSN 0976 1411 International ejournal of Mathematics and Engineering 112 (2011) 1023-1029 ANALYZING THE REQUIREMENTS FOR TEXT

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Automatic Generation of Query Sessions using Text Segmentation

Automatic Generation of Query Sessions using Text Segmentation Automatic Generation of Query Sessions using Text Segmentation Debasis Ganguly, Johannes Leveling, and Gareth J.F. Jones CNGL, School of Computing, Dublin City University, Dublin-9, Ireland {dganguly,

More information

Analyzing Document Retrievability in Patent Retrieval Settings

Analyzing Document Retrievability in Patent Retrieval Settings Analyzing Document Retrievability in Patent Retrieval Settings Shariq Bashir and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna University of Technology, Austria {bashir,rauber}@ifs.tuwien.ac.at

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

Midterm Exam Search Engines ( / ) October 20, 2015

Midterm Exam Search Engines ( / ) October 20, 2015 Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points

More information

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science

More information

Clustering (COSC 488) Nazli Goharian. Document Clustering.

Clustering (COSC 488) Nazli Goharian. Document Clustering. Clustering (COSC 488) Nazli Goharian nazli@ir.cs.georgetown.edu 1 Document Clustering. Cluster Hypothesis : By clustering, documents relevant to the same topics tend to be grouped together. C. J. van Rijsbergen,

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts

Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,

More information

Chapter 8. Evaluating Search Engine

Chapter 8. Evaluating Search Engine Chapter 8 Evaluating Search Engine Evaluation Evaluation is key to building effective and efficient search engines Measurement usually carried out in controlled laboratory experiments Online testing can

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer

More information

UTA and SICS at CLEF-IP

UTA and SICS at CLEF-IP UTA and SICS at CLEF-IP Järvelin, Antti*, Järvelin, Anni* and Hansen, Preben** * Department of Information Studies and Interactive Media University of Tampere {anni, antti}.jarvelin@uta.fi ** Swedish Institute

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

The Goal of this Document. Where to Start?

The Goal of this Document. Where to Start? A QUICK INTRODUCTION TO THE SEMILAR APPLICATION Mihai Lintean, Rajendra Banjade, and Vasile Rus vrus@memphis.edu linteam@gmail.com rbanjade@memphis.edu The Goal of this Document This document introduce

More information

Query Expansion using Wikipedia and DBpedia

Query Expansion using Wikipedia and DBpedia Query Expansion using Wikipedia and DBpedia Nitish Aggarwal and Paul Buitelaar Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway firstname.lastname@deri.org

More information

---(Slide 0)--- Let s begin our prior art search lecture.

---(Slide 0)--- Let s begin our prior art search lecture. ---(Slide 0)--- Let s begin our prior art search lecture. ---(Slide 1)--- Here is the outline of this lecture. 1. Basics of Prior Art Search 2. Search Strategy 3. Search tool J-PlatPat 4. Search tool PATENTSCOPE

More information

Research Article Relevance Feedback Based Query Expansion Model Using Borda Count and Semantic Similarity Approach

Research Article Relevance Feedback Based Query Expansion Model Using Borda Count and Semantic Similarity Approach Computational Intelligence and Neuroscience Volume 215, Article ID 568197, 13 pages http://dx.doi.org/1.1155/215/568197 Research Article Relevance Feedback Based Query Expansion Model Using Borda Count

More information

Using Coherence-based Measures to Predict Query Difficulty

Using Coherence-based Measures to Predict Query Difficulty Using Coherence-based Measures to Predict Query Difficulty Jiyin He, Martha Larson, and Maarten de Rijke ISLA, University of Amsterdam {jiyinhe,larson,mdr}@science.uva.nl Abstract. We investigate the potential

More information

Chapter 3 - Text. Management and Retrieval

Chapter 3 - Text. Management and Retrieval Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 3 - Text Management and Retrieval Literature: Baeza-Yates, R.;

More information

Representation of Documents and Infomation Retrieval

Representation of Documents and Infomation Retrieval Representation of s and Infomation Retrieval Pavel Brazdil LIAAD INESC Porto LA FEP, Univ. of Porto http://www.liaad.up.pt Escola de verão Aspectos de processamento da LN F. Letras, UP, th June 9 Overview.

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Automatic prior art searching and patent encoding at CLEF-IP 10

Automatic prior art searching and patent encoding at CLEF-IP 10 Automatic prior art searching and patent encoding at CLEF-IP 10 1 Douglas Teodoro, 2 Julien Gobeill, 1 Emilie Pasche, 1 Dina Vishnyakova, 2 Patrick Ruch and 1 Christian Lovis, 1 BiTeM group, Medical Informatics

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

Automatic Boolean Query Suggestion for Professional Search

Automatic Boolean Query Suggestion for Professional Search Automatic Boolean Query Suggestion for Professional Search Youngho Kim yhkim@cs.umass.edu Jangwon Seo jangwon@cs.umass.edu Center for Intelligent Information Retrieval Department of Computer Science University

More information

Prior Art Search - Entry level - Japan Patent Office

Prior Art Search - Entry level - Japan Patent Office Prior Art Search - Entry level - Japan Patent Office 0 Outline I. Basics of Prior Art Search II. Search Strategy III. Search Tool - J-PlatPat IV. Search Tool - PATENTSCOPE 1 Outline I. Basics of Prior

More information

Application of Patent Networks to Information Retrieval: A Preliminary Study

Application of Patent Networks to Information Retrieval: A Preliminary Study Application of Patent Networks to Information Retrieval: A Preliminary Study CS224W (Jure Leskovec): Final Project 12/07/2010 Siddharth Taduri Civil and Environmental Engineering, Stanford University,

More information

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012

Patent documents usecases with MyIntelliPatent. Alberto Ciaramella IntelliSemantic 25/11/2012 Patent documents usecases with MyIntelliPatent Alberto Ciaramella IntelliSemantic 25/11/2012 Objectives and contents of this presentation This presentation: identifies and motivates the most significant

More information

Query Expansion Based on Crowd Knowledge for Code Search

Query Expansion Based on Crowd Knowledge for Code Search PAGE 1 Query Expansion Based on Crowd Knowledge for Code Search Liming Nie, He Jiang*, Zhilei Ren, Zeyi Sun, Xiaochen Li Abstract As code search is a frequent developer activity in software development

More information

Problem 1: Complexity of Update Rules for Logistic Regression

Problem 1: Complexity of Update Rules for Logistic Regression Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1

More information

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges

More information

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report Technical Report A B2B Search Engine Abstract In this report, we describe a business-to-business search engine that allows searching for potential customers with highly-specific queries. Currently over

More information

Measurements of the effect of linear interpolation values and reduced bigram model size for text prediction

Measurements of the effect of linear interpolation values and reduced bigram model size for text prediction Measurements of the effect of linear interpolation s and reduced bigram model size for text prediction Marit Ånestad Lunds Tekniska Högskola man039@post.uit.no Michael Geier Lunds Tekniska Högskola michael.geier@student.tugraz.at

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

A Document-centered Approach to a Natural Language Music Search Engine

A Document-centered Approach to a Natural Language Music Search Engine A Document-centered Approach to a Natural Language Music Search Engine Peter Knees, Tim Pohle, Markus Schedl, Dominik Schnitzer, and Klaus Seyerlehner Dept. of Computational Perception, Johannes Kepler

More information