Web Search Engine Question Answering


Web Search Engine Question Answering

Reena Pindoria
Supervisor: Dr Steve Renals
07/05/2003

This report is submitted in partial fulfilment of the requirements for the degree of Bachelor of Science with Honours in Computer Science by Reena Pindoria.

Declaration

All sentences or passages quoted in this report from other people's work have been specifically acknowledged by clear cross-referencing to author, work and page(s). Any illustrations which are not the work of the author of this report have been used with the explicit permission of the originator and are specifically acknowledged. I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this project and the degree examination as a whole.

Name: Reena Pindoria
Signature:
Date: 07/05/2003

Abstract

Question answering has been researched extensively, with the focus mostly on information contained within a fixed-size corpus, although the web has also been considered. Question answering involves querying an information retrieval system where the query is expressed in natural language. The aim of this project is to develop a tool that will effectively answer questions submitted to a web search engine. The answer will take the form of the most relevant document, and the retrieval system will be the popular web search engine Google. In this project, techniques that potentially improve text retrieval, namely query expansion and local context analysis, have been applied to question answering on the web. The procedure begins with a training process and then applies these techniques. Achievements to date consist of developing the learning process, that is, obtaining a large database of FAQs to form the training set, and reformulating the query. The reformulated query is submitted to Google via the Google API, so that the results can be analysed for query expansion and re-ranking of documents.

Acknowledgements

I would like to thank my supervisor, Dr Steve Renals, for providing much appreciated assistance throughout the year. Thanks also to all my family, who have provided me with unconditional support when it was needed most.

Contents

Chapter 1: Introduction
  1.1 Aims and Objectives
  1.2 Overview of Report
Chapter 2: Literature Review
  2.1 Information Retrieval
    Web Retrieval
    Standard Text Retrieval
    Information Retrieval and the WWW
  2.2 Natural Language Processing and Text Processing Techniques
    Stemming
    Stop Words
    Part of Speech Tagging
    Retrieval Models
    Precision and Recall
    Evaluation of Precision and Recall
    Weighting
  2.3 Query Expansion/Refinement
    Robertson Sparck-Jones Query Expansion
    Local Context Analysis
  2.4 Question Answering
    Question Answering and TREC
    Question Answering Systems
  2.5 Evaluation of Techniques
Chapter 3: The Tritus Question Answering System
  3.1 The Learning Process
    Generating Question Phrases
    Generate Candidate Transforms
    Ranking Candidate Transforms
    Weighting and Re-ranking Transforms using Search Engines
  3.2 Apply Transformation to Query
Chapter 4: Implementation and Testing Strategy
  4.1 Learning Process
    Extracting Question and Answer Pairs
    Generate Question Phrases
    Generate Candidate Transforms
    Weighting of Candidate Transforms
    Apply Transformation
    Query Expansion
  4.2 Testing Strategy
Chapter 5: The Google Application
  Java Interface
  The Google API
Chapter 6: Results and Evaluation
Chapter 7: Conclusion
Appendix A: Code Excerpts

Chapter 1: Introduction

Search engines on the web, like Google [2], are often provided with questions as queries. The user hopes that he or she will be provided with the most suitable answers by supplying a question. However, existing web search engines sometimes remove the question words from the query before searching, or treat the question words as ordinary query terms. Most search engines work by looking through the text of a document, and some require that a document contain all the query terms, including the question words, for it to be retrieved. On the other hand, if the search engine removes the question words, there seems to be little point in entering a question when a few keywords will suffice. Often, it is up to the user to scan through the results to find the answer they are looking for.

A search engine typically performs well with keyword search. When a query like "Stonehenge" is supplied, a good search engine will return documents containing information about Stonehenge. However, the same search engine may struggle with a specific query like "Who won the London marathon last year?". A document may not exist that contains this combination of words, or common words may be removed, as mentioned above. Note that search engines don't typically employ a thesaurus, so they will not attempt to search for words with similar meanings if the first search fails. Experienced users will therefore try to add query words like "refers", since you might expect, for example, a definition of an object to begin with the name of the object followed by "refers to".

A question of the form "What is a microphone?" is a natural language question, and some search engines encourage users to enter such questions as the query, a prime example being Ask Jeeves [3]. Yet some search engines still do not fully identify when the query is a question. If a search engine can do this, then it can also decide how the question should be answered and hence be able to answer it; that is question answering. For example, the top document returned by Google for this query refers to a new technology for use on Mars, Ask Jeeves returns a web page offering to sell mobile phone accessories, and AltaVista [4] returns a web page offering a description of microphones. AltaVista uses the whole query to perform the search and prefers web pages that match the query exactly. The title of the returned web page for the above query happens to be "What is a microphone?", so this page is ranked top; AltaVista does nothing more than simple pattern matching.

Ranking also influences the display of results. The rank given by a search engine to a document may not match the rank a user would give to the same document. This is because the search engine uses its own ranking method, based on technical aspects of documents and queries rather than on the information in the documents. The search is initiated using the query as the base, so a better query should presumably produce better results. Another point to note is that some search engines may rank documents according to the frequency of query terms; what if the most relevant document contains few of the query terms? It will obviously be ranked lower than it should be.

The way a question is expressed indicates how the question should be answered. For this reason, a question, rather than a list of terms, should presumably obtain the best answer. "What is a microphone?" requires relevant documents to consist of a description of a microphone, and not necessarily where a microphone can be bought. Nevertheless, since this sort of query is at present treated as a list of terms, those are the results that most search engines will retrieve.

Therefore, it stands to reason that a tool should be developed that can take a question as input and provide a more specific and relevant answer than is presently obtainable with a question query. There are many such systems available; however, the concentration more often lies in trying to form a single-sentence answer or a short paragraph. This offers little opportunity to obtain any further relevant information that may be contained in the original document (or documents, if several were combined) without some means of accessing the whole document(s).

1.1 Aims and Objectives

The main aim of this project is to devise a tool that will take as input a query in the form of a natural language question, manipulate it in some way, and use the optimised query to retrieve the most relevant web documents using Google, thereby improving retrieval performance. Google is a popular web search engine, and another aim of this project is to determine whether any improvements are recorded using query transformations on this particular search engine, although other search engines may be used in the future. The answer should take the form of documents that answer the question. The objective of this project, therefore, is to create a tool that is useful on the web, as opposed to a system with a limited corpus, and more importantly to improve the effectiveness of the search engine.

The manipulation of the query will involve applying natural language processing techniques and a transformation procedure based on the work of Agichtein et al. [1]. Consequently, another aim is to attempt to confirm the claims made by Agichtein et al. about the success of the Tritus system [1].

A constraint that has been identified is that the learning process requires processing of FAQs (Frequently Asked Questions) [5], and a significant number need to be processed in order to create a good set of transformations to apply to the query. As it is not possible to process the whole set, some decision must be made as to how many questions will be adequate. At present, transformation of the query will be learned from a database of about 1000 questions and their corresponding answers. Naturally, there is scope to increase the training set size to include more questions in the future.

1.2 Overview of Report

Chapter 2: Literature Review provides details of work and techniques relevant to this project, including background on question answering, progressing to specific question answering systems. Chapter 3: The Tritus Question Answering System details the Tritus system that is the basis for this project. Chapter 4: Implementation and Testing Strategy outlines how the system was developed for this project and the testing strategy used to assess it. Chapter 5: The Google Application provides details of the Java interface that wraps the system in order to provide a user-friendly means of input, together with background information on the Google API. In Chapter 6: Results and Evaluation the results of the testing can be found, along with the analysis of these results and suggestions for future work. Chapter 7: Conclusion concludes the report.

Chapter 2: Literature Review

There are several techniques that may be applicable to this project, which are described in more detail below. This chapter also provides details and analyses of previous work on question answering, as well as background information on question answering itself. The previous works have employed different techniques and methods to accomplish the task of question answering, and the analysis of each will show the level of effectiveness of these systems. From this it will be possible to identify which general techniques or methods are most successful. First, the broad techniques of natural language processing and text processing will be outlined; the chapter will then move on to the topic of question answering.

2.1 Information Retrieval

Information retrieval [6] is the process of obtaining documents that are located in a database, and can also be viewed as a communication process. The user has an information need and expresses this need to the information retrieval (IR) system in a way that can be processed by the system. The system will usually process text, which is the typical means of describing the content of a document. The system then endeavours to satisfy the need, expressed as the query, by linking the content of documents with the query. A document is the word given to a unit of information, which can take the form of a textual document, an image, a sound or any other type of information that may need to be retrieved.

The focus in the development of retrieval systems may lie in several areas. One is the manipulation of the query. It is beneficial to concentrate on the query, as it is the query that expresses the information need of the user; improving the query, for example by making it more precise, should presumably return different and perhaps more accurate results. The objective normally lies in trying to increase the number of relevant documents retrieved. Relevance is a measure used to judge the outcome of a search and the outcome of an IR system. When a document is regarded as relevant, it usually means that the document is related to the query, where the degree of relevance is determined by techniques applied by the system. Relevancy of documents to queries can be incorporated into two other measures of the effectiveness of an IR system, which are easier to understand and to measure. The reason is that the degree of relevancy may well be subjective, whereas the two measures (precision and recall) are objective; these measures regard only how many documents are relevant, not how relevant they are.

Web Retrieval

Web-based retrieval takes into account the links between web pages, as Google's PageRank does (described in the Weighting section below). Web retrieval also analyses the keyword Meta tags in web pages, and these tags are considered when a search is performed on the web. The use of keyword Meta tags can often lead to spamming, where web pages attempt to deceive web search engines into assigning them a higher rank, for example by adding irrelevant but popular keywords to the Meta tags.

Web search engines therefore have to guard against spamming and attempt to assign true ranks to web pages, so that the user obtains the most relevant results from a search.

Standard Text Retrieval

Standard text retrieval, on the other hand, does no more than analyse documents at word level. When a search is performed on a non-web-based retrieval system, the query words are simply matched against the words in the document and nothing more. This project will concentrate on the standard text retrieval idea but will employ a web search engine to retrieve web documents. This means that the query will be manipulated using pattern matching, and then Google, which uses links between web pages, will be used to retrieve web documents. Finally, the web documents will be analysed at text level for ranking and query expansion purposes, thus reverting to the standard text retrieval idea.

Information Retrieval and the WWW

The World Wide Web can be thought of as a database containing a vast collection of documents and information, with a web search engine as the information retrieval system. As the web contains such great amounts of information, it has become increasingly important to develop search engines that can return the most relevant documents quickly. The motivation for this development is that more people are using the web as their primary source of information. Users wish to obtain information as quickly as possible, and for the information to be as accurate as possible. One reason for the increased use of the web is that there is a designated tool, the search engine, which will find information for you provided a good query is supplied, whereas with a book the user has to scan through the pages to find the right information, which is time consuming. Therefore, ways to improve retrieval are a popular research area.

2.2 Natural Language Processing and Text Processing Techniques

Stemming

Stemming [7] is a technique to transform different inflections and derivations of the same word to one common "stem". Stemming can mean both prefix and suffix removal. The purpose of applying stemming to words is to obtain the most basic form of a particular word, which can then be used in some way in the retrieval process. An example is the replacement of a plural word in the query with its singular; the singular may appear in a document when the plural does not, and if this is not taken into account, it is possible that this document will be ranked lower than it should be.

A general difficulty with stemming is that this technique primarily reduces query terms. If a query contains many words that all have the same stem and stemming is applied directly to the query, these words will be discarded and replaced with the stemmed word. This can reduce the number of words in the query, which may not always be desirable, as sometimes more query words narrow the search or make the query more precise. The way the stemmer is used affects whether or not this is a problem: the stemmed words can either replace the originals or simply be added. If the words are added, then care must be taken as to which kind of search the search engine performs: an AND search attempts to find all the words in a document, whereas an OR search will find any (or all) of the words in a document.

A widely used stemming algorithm is the Porter stemmer [8], which removes suffixes and has performed very well in previously conducted evaluations [7]. Web search engines generally do not use stemming, because it is not easy to identify when a word should be stemmed and when it shouldn't. One reason, which may be considered quite important, is that the Porter algorithm frequently over-stems: a word that shouldn't be stemmed is stemmed anyway, and the result can be a word that is not in the English language, which is obviously not very useful. Nevertheless, as mentioned above, the Porter algorithm has shown improvements in precision and recall when used.
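To illustrate, the following is a minimal suffix-stripping sketch in Perl (the implementation language used later in this project). It is not the Porter algorithm: the suffix list and the minimum stem length are invented for illustration. The last example shows over-stemming of the kind described above.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive suffix stripping (not the Porter algorithm): remove the
    # first matching suffix, keeping at least three stem characters.
    sub naive_stem {
        my ($word) = @_;
        for my $suffix (qw(ational ation ness ing ed es s)) {
            return $1 if $word =~ /^(.{3,})\Q$suffix\E$/;
        }
        return $word;
    }

    print join(' ', map { naive_stem($_) } qw(connections connected news)), "\n";
    # prints: connection connect new  -- "news" -> "new" is over-stemming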

Stop Words

Stop words are natural language words that are likely to appear very frequently in any given document, more so than other words, and so are often removed from the query. There is no definitive list of stop words, which means some systems may remove a particular word while others may not. Examples of such words are "the", "of", "it" and "and". Even though users are encouraged to provide many query terms to narrow the search, the removal of stop words is insignificant; in fact, it may be more efficient to omit these words so that the search engine doesn't have to search for them, as they are so common that they will occur in virtually all documents. Google, for example, removes this class of words, which can include question words ("who", "what", "how" etc.). This is undesirable for the initial question processing stages of a question answering system, as the sketch below illustrates.
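As a minimal sketch in Perl (the stop list here is invented for illustration; real lists are longer and system-specific), the following shows how stop-word removal, applied to a question, can strip it down to a single keyword and discard the question words that a QA system needs:

    use strict;
    use warnings;

    # A small illustrative stop list; note it includes question words.
    my %stop = map { $_ => 1 } qw(the of it and a an is to in who what how);

    my $query = "What is a microphone";
    my @kept  = grep { !$stop{lc $_} } split /\s+/, $query;
    print "@kept\n";   # prints: microphone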

Part of Speech Tagging

Part-of-speech tagging is a very useful natural language processing technique. A tagger assigns to each word in a sentence its proper part-of-speech label (noun, verb, etc.). The benefit of such a tool becomes apparent when a particular type of word needs to be removed from certain phrases. There are several such taggers available [9], falling into two main groups: stochastic taggers and rule-based taggers (which use rules to tag). Stochastic taggers have generally been found to demonstrate a higher degree of accuracy than rule-based taggers. However, the part-of-speech tagger presented by Brill [9] claims high accuracy despite being a rule-based tagger. This tagger [10] is reportedly able to improve its performance, and hence increase its accuracy over time; for this reason alone it may be the more favourable choice, although many other taggers are available.

Retrieval Models

There are three main retrieval models used in information retrieval, and each is briefly described in this section. The models [11] are the Boolean model, the vector space model and the probabilistic model.

Boolean Model

The Boolean model of retrieval allows the use of Boolean operators (AND, OR etc.) when forming the query. However, it is not always easy to formulate the query with Boolean algebra, and a major problem is that it is extremely difficult to rank documents using this model. Documents either match the query or they don't, so this model simply returns all the documents that match the query, in no particular order. This has the disadvantage that the user cannot quickly see which document best satisfies their need. For these reasons this model is not often used.

Vector Space Model

The vector space model represents each document and the query as a vector in n dimensions, where n is the number of all possible words. The similarity of, for example, two documents is measured mathematically by the cosine measure. The angle between two vectors is computed using the dot product, a . b = |a| |b| cos(theta), where a and b are vectors and theta is the angle between them; in the most basic form, the closer the angle between the vectors is to 0, the more similar the documents are. There is a more specific formula for computing similarity [11] in which the document is represented by weighted index terms. That similarity measure is a variation of the cosine measure and is used to gauge similarity between the query and document vectors, rather than comparing two documents; its outcome determines the ranking of the documents. A minimal sketch of the cosine measure appears at the end of this section.

Word ordering, semantics and syntax are irrelevant to the basic vector representation, where each document is represented by words and their frequencies. So if two documents happen to contain the same number of every word, it does not necessarily mean that the documents are identical; only the representations are identical. Whereas word ordering is completely lost in this model, in question answering word order is vital: the meaning of the question is inferred from the order of its words. If the vector space model is implemented, the meaning of the question cannot be identified; the vector space representation is therefore unsuitable for question answering systems.

Probabilistic Model

The probabilistic model [12] ranks documents according to how relevant they are to a given query. Each document is assigned a probability of relevance; if the probability that a document is relevant is p, the probability that it is irrelevant is 1 - p. The model then attempts to retrieve the set of documents that contains only the relevant documents. This model requires relevance information; if it is unable to obtain this information it behaves as a vector space model using only term frequency information [11], although weights can be determined using the ideas in [13]. Like the vector space model, this model is popular, more so because of its capability of ranking according to relevance. This model will be used in this project because the retrieval process is required to rank the answers to the question in decreasing order of relevance.
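The following is the minimal sketch of the cosine measure referred to in the vector space model section above; the two example "documents" are invented. It also demonstrates the loss of word order: "dog bites man" and "man bites dog" produce identical term-frequency vectors.

    use strict;
    use warnings;

    # Cosine similarity between two bag-of-words term-frequency hashes.
    sub cosine {
        my ($x, $y) = @_;
        my ($dot, $nx, $ny) = (0, 0, 0);
        $dot += $x->{$_} * ($y->{$_} // 0) for keys %$x;
        $nx  += $_ ** 2 for values %$x;
        $ny  += $_ ** 2 for values %$y;
        return ($nx && $ny) ? $dot / (sqrt($nx) * sqrt($ny)) : 0;
    }

    my %d1 = (dog => 1, bites => 1, man => 1);   # "dog bites man"
    my %d2 = (man => 1, bites => 1, dog => 1);   # "man bites dog"
    printf "%.2f\n", cosine(\%d1, \%d2);         # 1.00: word order is lost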

Precision and Recall

Precision can be used to measure the performance of a retrieval system. Precision, along with the other measure, recall, can be used to measure the efficiency or effectiveness of the retrieval system. The most basic formula for measuring precision is:

    Precision = |Retrieved INTERSECT Relevant| / |Retrieved|    (Equation 1)

Retrieved means the set of documents retrieved by the IR system, and |Retrieved| denotes the number of documents retrieved. The numerator is the size of the intersection of the set of retrieved documents and the set of relevant documents, i.e. the number of relevant documents among all the documents retrieved. Precision is a ratio, so its value ranges from 0 to 1; it is therefore easy to judge whether the value is good or not, and the closer it is to 1, the better the IR system. Ideally, of the total number of documents retrieved, more should be relevant than irrelevant, meaning that a good IR system should have a precision value greater than 0.5. The IR system can be evaluated using precision since it takes into account how many relevant documents are retrieved. Precision assigns a mathematical value to the system, which is easier to measure as it is discrete. Relevancy alone, on the other hand, does not easily reflect the effectiveness of the system, because each individual retrieval system may have a unique method of deciding how relevant a document is to the query. What is required is whether or not the system returns relevant documents; precision reflects this and is therefore useful for web search engines as well.

If the objective of an IR system is to retrieve as many relevant documents as possible, then recall is a useful measure of effectiveness, as it is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the database. Recall is calculated as:

    Recall = |Retrieved INTERSECT Relevant| / |Relevant|    (Equation 2)

Again, recall values range from 0 to 1, so a value closer to 1 is more desirable. A user would ideally like all the relevant documents to be returned, and recall is a representation of what the user actually gets from a search. Improvements in recall therefore offer a more effective IR system, as a larger portion of the relevant documents is retrieved. A recall value greater than 0.5 indicates a good IR system, which means that the system is able to retrieve more than 50% of the total relevant documents.

Evaluation of Precision and Recall

Precision is a more important measure of effectiveness than recall for the web. This is because it is difficult to measure the number of documents relevant to a query on the web: there are simply too many documents, and at any one time not all documents exist in a search engine's index. This is one reason the top n returned documents are often regarded, or assumed, to be relevant. Recall is appropriate when the collection of documents is small and it is easy to determine which documents are relevant and which are not. If the aim of a system is to improve its retrieval, then precision will always be an appropriate and useful measure, because the retrieved documents can be quickly scanned for relevancy, and if more of the returned documents are relevant than irrelevant, the system is performing well.
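As a minimal sketch (the document IDs are invented), Equations 1 and 2 can be computed as follows:

    use strict;
    use warnings;

    # Precision and recall for one search, given the IDs of the
    # retrieved documents and of all the relevant documents.
    sub precision_recall {
        my ($retrieved, $relevant) = @_;          # array refs of doc IDs
        my %rel  = map { $_ => 1 } @$relevant;
        my $hits = grep { $rel{$_} } @$retrieved; # |Retrieved INTERSECT Relevant|
        my $p = @$retrieved ? $hits / @$retrieved : 0;
        my $r = @$relevant  ? $hits / @$relevant  : 0;
        return ($p, $r);
    }

    my ($p, $r) = precision_recall([1, 2, 3, 4], [2, 4, 7]);
    printf "precision %.2f, recall %.2f\n", $p, $r;   # 0.50 and 0.67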

Weighting

Most weighting has to occur with some prior information, especially if ranking needs to be done with respect to relevancy. If no relevance information is available, then it is useful to rely on the number of documents in the collection (idf, see below) and on probability (i.e. the probabilistic retrieval model; the derivation of such weighting schemes is extensively explained in [13]). The whole idea of presenting documents to the user in a specific order after retrieval is referred to as ranked retrieval. Usually a ranked retrieval system treats a document as just a collection of words; ranking is performed by giving each document a score, and the document with the highest score is ranked above the documents that gain lower scores. A basic scheme for scoring is the product of the Term Frequency (tf) and Inverse Document Frequency (idf) scores:

    TF.IDF = tf_t . log(N / n_t)    (Equation 3)

where the term frequency tf_t is the number of times a particular query term t appears in a given document, and in the inverse document frequency log(N / n_t), N is the number of documents in the whole collection and n_t is the number of documents in which the term under observation occurs.

This weighting scheme often forms the basis of subsequently developed schemes and is quite effective despite being very simple. It is often used with the vector space representation of documents. However, this scoring relies only on how many times each query word appears in a particular document, so, as mentioned earlier, it is quite possible that a relevant document containing few of the query words will receive a low score; this is obviously undesirable, as the most relevant documents are required. Thus, the query needs to be changed or a new scoring scheme needs to be applied. A sum-of-weights approach [14] achieves good performance, where the weight of a term is determined by probability (which can be estimated from relevance feedback, see query expansion below):

    w_t = log [ p_t (1 - q_t) / ( q_t (1 - p_t) ) ]    (Equation 4)

where p_t is the probability that a given relevant document is assigned the term t, and q_t is the equivalent non-relevant probability. This weighting method has the advantage of using probability/odds to calculate the weight of a term and, more importantly, will assign heavier weights to important terms if they occur in relevant documents.
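A minimal sketch of Equation 3 applied to score one document against a query; the document, collection size and document frequencies are invented for illustration:

    use strict;
    use warnings;

    # Score a document against a query with the basic tf.idf scheme
    # (Equation 3). The document is a term-frequency hash; $N is the
    # collection size and %$df holds document frequencies per term.
    sub tfidf_score {
        my ($query_terms, $doc, $N, $df) = @_;
        my $score = 0;
        for my $t (@$query_terms) {
            next unless $doc->{$t} && $df->{$t};
            $score += $doc->{$t} * log($N / $df->{$t});
        }
        return $score;
    }

    my %doc = (stonehenge => 3, stones => 1, visit => 2);
    my %df  = (stonehenge => 10, visit => 900);
    printf "%.3f\n", tfidf_score([qw(stonehenge visit)], \%doc, 1000, \%df);
    # The rare term "stonehenge" dominates; the common "visit" adds little.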

Another popular weighting scheme is BM25, used in the Okapi system since TREC-3 [15]. This scheme is more elaborate than the simple tf.idf scheme:

    SUM over T in Q of  w^(1) . ( (k1 + 1) tf / (K + tf) ) . ( (k3 + 1) qtf / (k3 + qtf) )
        + k2 . |Q| . (avdl - dl) / (avdl + dl)    (Equation 5)

where Q is a query containing terms T, w^(1) is the Robertson-Sparck Jones weight of T in Q (see the Query Expansion section), K is k1((1 - b) + b . dl/avdl), k1, b, k2 and k3 are parameters which depend on the database, tf is the term frequency in a document, qtf is the query term frequency, dl is the document length (arbitrary units), and avdl is the average document length.

The BM25 weighting method takes more factors into account, so it is a more precise weighting method, and this is the reason it is more widely used and more accurate. Simple is good, but a more precise weight is much more desirable, as it is the initial weighting that influences which documents are returned to the user. It is therefore imperative to weight documents accurately.
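A sketch of the per-term part of Equation 5 follows; the k2 document-length correction is omitted for brevity, and the parameter values are common published settings rather than those of any particular system:

    use strict;
    use warnings;

    # Per-term part of BM25 (Equation 5), without the k2 correction.
    # k1, b and k3 are tuning parameters; these values are typical
    # published settings, assumed here for illustration.
    my ($k1, $b, $k3) = (1.2, 0.75, 7);

    sub bm25_term {
        my ($w1, $tf, $qtf, $dl, $avdl) = @_;
        my $K = $k1 * ((1 - $b) + $b * $dl / $avdl);
        return $w1 * (($k1 + 1) * $tf) / ($K + $tf)
                   * (($k3 + 1) * $qtf) / ($k3 + $qtf);
    }

    # One query term with weight 2.0, occurring 3 times in a document
    # of average length:
    printf "%.3f\n", bm25_term(2.0, 3, 1, 250, 250);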

Web search engines, however, rank in different ways, and Google has a unique method of doing so, called PageRank [16, 17].

Google PageRank

Google PageRank [17] assigns a rank to each web page depending on the links between pages. If a web page A links to another web page B, then A is voting for B. The number of votes each page receives, called its importance, is considered when ranking occurs. The voting page itself is also considered: if the voting page has many votes, then the pages it links to gain increased importance. A more important page is one with many votes and will have a higher PageRank, which Google takes note of. This process is performed without a query; when a query is input, the PageRank technique is combined with text matching techniques, resulting in the Google web search engine.

2.3 Query Expansion/Refinement

Query refinement is a tool that recommends new terms for a query; refinement is the process of reformulating the query in order to improve precision and recall. Query expansion is the process of adding words to an original query in order to improve retrieval performance. There are several ways of performing query expansion, explained below, namely automatically, manually and interactively [18], each having its own advantages and disadvantages.

In automatic query expansion, as the name suggests, the new terms are selected by the retrieval system automatically, by whatever method, and the user doesn't participate in any way. The system decides which documents are relevant, and from these documents it extracts terms to add to the query.

Manual query expansion requires the user to perform the expansion, giving the user the task of deciding which terms to add and therefore how important they are. This form of query expansion relies heavily on the user, and the aim is therefore to encourage the user to think about the query more carefully. A user sometimes changes the query if the first search was fruitless, and this can be thought of as manual expansion; the system doesn't expand.

In interactive query expansion, the retrieval system chooses terms from documents regarded as relevant by the user. The terms are then weighted and ranked by the system accordingly, and usually the system adds the terms to the query. Similarly, semi-automatic expansion involves the system finding terms and offering them to the user in some order, usually with the more important or useful terms at the top of the list.

Query expansion can also include the concept of relevance feedback, where the initial query generates results, the user decides which documents are relevant, and these documents are resubmitted for the retrieval system to extract terms for searching once again. It is the system that decides which terms are extracted; the user is only required to judge which documents are relevant. Relevance feedback can also be applied to terms: the retrieval system provides the user with terms (from retrieved documents) in a particular order, and the user selects the most relevant terms for another search. The ultimate decision of which terms are useful lies with the user. This has the advantage that the user presumably has a greater understanding of what he or she is looking for, so can easily identify relevant query terms; the search engine doesn't have this understanding, it just deals with the terms the user gives and finds the documents it deems most relevant. For this same reason, however, the performance of a search using interactive query expansion is difficult to measure, because the user is potentially a random and uncontrollable factor: one user may find a word relevant where another might not, and each user has a unique need.

Robertson Sparck-Jones Query Expansion

The following equation weights terms for expansion based on the probabilistic retrieval model of Robertson and Sparck-Jones [19]. It treats the top n retrieved documents as relevant and uses the probabilities of term occurrence in documents, under the assumption of term independence; it is thus called the relevance weight:

    RW_t = log [ ( r / (R - r) ) / ( (n - r) / (N - n - R + r) ) ]    (Equation 6)

where R is the number of relevant documents, r is the number of relevant documents that contain the term t (t is not in the query), n is the number of documents containing t, and N is the total number of documents. The query expansion weight, called the Offer Weight, is then defined as QEW_t = r . RW_t.

This weighting is for automatic query expansion, as the user does not interact with the IR system. This enables the weighting to be done probabilistically, which offers the ability to choose terms independently.
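A minimal sketch of Equation 6 and the offer weight, with invented counts; note that the formula as written divides by zero when r = R or n = r, so practical implementations usually add 0.5 to each count:

    use strict;
    use warnings;

    # Robertson/Sparck-Jones relevance weight (Equation 6) and the
    # offer weight QEW_t = r * RW_t, used to rank expansion terms.
    sub relevance_weight {
        my ($r, $R, $n, $N) = @_;
        return log( ($r / ($R - $r)) / (($n - $r) / ($N - $n - $R + $r)) );
    }

    my ($r, $R, $n, $N) = (8, 10, 40, 10_000);   # invented counts
    my $rw  = relevance_weight($r, $R, $n, $N);
    my $qew = $r * $rw;
    printf "RW = %.2f, offer weight = %.2f\n", $rw, $qew;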

Local Context Analysis

Local analysis [20, 21] is usually employed with automatic query expansion, i.e. keeping the user out of the expansion process. This is because the user may not be able to suggest the most relevant words, whereas the system holds more information and can apply several factors and techniques to find the best alternatives. In local analysis, a search is initially carried out and the retrieved documents are then analysed for word relationships, that is, relationships between the initial query terms and words in the top-ranked documents. Although there is a technique which analyses the whole corpus for such relations (global analysis), local analysis has been found to be more effective [21]; research on the TREC collections has shown that local analysis indeed outperforms global analysis.

Local analysis, specifically local feedback, takes into account all the terms in the query; the addition of extra query terms is therefore done by analysing the whole query rather than just the individual words. The advantage of this is that the new words that are added will be in the same context as the query, avoiding any possible deviation from the original topic and avoiding ambiguity. Local analysis has been applied in many forms and includes the topic of local feedback. Local feedback can consist of analysing the top-ranked documents and thereby building an appropriate thesaurus for expanding the query. Another method is re-weighting terms after retrieval of documents, rather than adding new terms. However, as some search engines (such as Google) suggest, more query terms will narrow a search. Whether or not adding query terms is an improvement is unclear, although more terms can make the query more precise, and this is possibly something to investigate in this project.

One reason this technique may be more effective is that the documents analysed are the top-ranked documents from the initial search. It is therefore reasonable to assume that these documents are relevant and that further analysis of them will provide useful information for another search. On the other hand, the initial query has to be quite precise, and not so vague as to provoke the retrieval system into retrieving varied documents; if this happens, the local analysis of these documents will also be very varied.

Local Context Analysis is the use of global techniques (context, phrase structure) applied to a local set of documents: a blend of global analysis techniques and local feedback. It is useful to combine the techniques of global analysis and local feedback, which have each already proven to be effective, and a combination should presumably be even more effective. In Local Context Analysis [21], noun groups are considered as concepts and are selected depending on their co-occurrence with query terms. This additional information is gathered from the passages contained in the top-ranked documents; consequently, we again assume that the top-ranked documents are relevant. Passages are used instead of whole documents in order to save on computation costs and to avoid deviations in topic in case a particular document contains information on several topics. For the latter reason, global analysis is not always the better technique: global analysis looks at the whole document at once, then creates something called a concept, and the global context of this is used to determine similarities between concepts. Local context analysis, on the other hand, looks through the passages of the document and can create several concepts, so whichever concept applies is used.

The extracted concepts are ranked according to the following LCA formula, which incorporates the simple tf.idf weighting scheme mentioned above. The LCA formula rewards and penalises concepts, which affects the weights: for example, idf_c penalises concepts occurring frequently in the collection, and idf_i takes into account infrequently occurring terms.

    bel(Q, c) = PRODUCT over t_i in Q of ( delta + log(af(c, t_i)) . idf_c / log(n) )^idf_i    (Equation 7)

    af(c, t_i) = SUM for j = 1 to n of ft_ij . fc_j
    idf_i = max(1.0, log10(n / n_i) / 5.0)
    idf_c = max(1.0, log10(n / n_c) / 5.0)

where c is a concept, ft_ij is the number of occurrences of t_i in passage p_j, fc_j is the number of occurrences of c in p_j, n is the number of passages in the collection, n_i is the number of passages containing t_i, n_c is the number of passages containing c, and delta is 0.1. The terms retained are determined by the average weighting, based on the INQUERY system [26].

The use of passage-based techniques is more applicable and important than concept extraction. Each passage is analysed in turn for related terms, then weighting can be applied, thereby producing a collection of candidate terms for expansion of the query. An approach which obtains expansion terms directly from a document is preferable to choosing terms by some other means: if a document is relevant, it will contain other important keywords which, when added to the original query, will be beneficial in terms of retrieving more specific documents. Local context analysis is thus a valuable technique for doing this. The techniques used to select the terms to add vary, and probabilistic techniques can ensure independence, i.e. an unbiased selection process.
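The following sketch computes the belief score of Equation 7 for one concept over a set of passages. The data structures are assumptions made for illustration, and log(1 + af) is used in place of log(af) to guard against a zero co-occurrence count:

    use strict;
    use warnings;

    # Belief score of Equation 7 for concept $c against the query terms,
    # over the top-ranked passages. Each passage is a term-frequency
    # hash; %$pf counts how many passages contain each term or concept.
    sub lca_bel {
        my ($query_terms, $c, $passages, $pf) = @_;
        my $n     = @$passages;       # assumes more than one passage
        my $delta = 0.1;
        my $idf_c = idf($n, $pf->{$c});
        my $bel   = 1;
        for my $t (@$query_terms) {
            my $af = 0;                                   # af(c, t_i)
            $af += ($_->{$t} // 0) * ($_->{$c} // 0) for @$passages;
            $bel *= ($delta + log(1 + $af) * $idf_c / log($n))
                        ** idf($n, $pf->{$t});
        }
        return $bel;
    }

    sub idf {                         # max(1.0, log10(n/n_i) / 5.0)
        my ($n, $ni) = @_;
        return 1.0 unless $ni;
        my $v = log($n / $ni) / log(10) / 5.0;
        return $v > 1.0 ? $v : 1.0;
    }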

2.4 Question Answering

Question answering (QA) is used in information retrieval and refers to providing an answer to a question, where the question is the query to the retrieval system. Question answering systems endeavour to return specific answers to natural language questions from information retrieval systems [22]; the query is always a proper question rather than a list of terms, because it is more straightforward to express what you are looking for with a proper question than with a number of keywords. For these reasons, question answering is a popular research subject for those in the NLP field, and generally for users who would appreciate the availability of such systems. Traditionally, when a QA system is used, for example on the web, the aim is to return a fixed-size answer, i.e. a short paragraph or a sentence.

Extensive research has been undertaken to apply question answering to the web [34, 42]; the web contains vast amounts of information and is nowadays one of the prime sources for obtaining information. Because the web contains so much information, such a system is more useful in this domain: a user wants to obtain the information they need as quickly as possible and, at the same time, to obtain the most relevant material possible. The aim of such systems at present has usually been to extract a short paragraph or a sentence from a document that contains the answer to the question. However, this is restrictive in that the user will get the specific answer only; if the document contains other useful information, it becomes unavailable to the user. Also, the source of the answer is lost if many documents are combined to form one answer.

Question Answering and TREC

TREC [48] was formed in order to undertake research on text retrieval methods. The focus of TREC is therefore not solely on question answering but also on other useful topics in text retrieval. TREC has undertaken research on question answering since 1999, and each track has had its own specification, whereby small paragraphs of text containing the answer are required to be formed. The focus of TREC question answering is obtaining an exact short paragraph, whereas this project is concerned with improving the retrieval of a web search engine specifically, using question answering.

The Question Answering task as undertaken by TREC [49] aims to progress from obtaining simple answers to simple questions from single documents, to obtaining detailed answers to complex questions from multiple sources. The main aim is to reach a point where capabilities can be provided for users who wish to ask complex questions and obtain detailed answers. This task is being completed in stages over a period of time, with milestones set which can be used to assess performance and progress. In each new track the level of difficulty is increased in order to improve QA systems. The roadmap endeavours to provide basic standards by which to gauge QA systems, such as that answers must not be completely incorrect, and identifies new problems for each track to overcome. A further aim is to develop such systems for multiple languages using English questions, thereby, for example, developing translation tools. Having some sort of standard and path for development proves very useful, as it can be used to judge the level of QA systems.

Another requirement is that answers to simple questions be presented in natural language and be easily understood by any type of user. This requirement is justified, since questions are in natural language form, so answers should be as well. Where the question gets more complex, the form of the answer should reflect the understanding of the individual who posed it. The evaluation of QA systems is not simply defined, since the source of the questions and answers, for example, can influence the outcome. Does a QA system fail if it cannot find an answer to a question? Or is the QA system good, but the answer simply doesn't exist?

Question Answering Systems

Many question answering systems have been developed, and a vast collection is available in the TREC [43] question answering tracks. The main question answering system for this project, which uses query transformations, is the Tritus system [1]. Query transformation is effective because the question words are considered and the query is altered appropriately using sample questions and answers. This means concrete data, containing words that normally appear in answers to a particular question type, is used at run time; that is, learning is done by example.

The disadvantage of this is that the learning data has to represent the typical phrases that are contained in answers of a particular question type.

Pattern Matching Systems

AnswerBus [23], taking part in TREC, concentrates on making the retrieval process faster by generating one simple query. The objective is to return sentences that contain the correct answer. The sentences are analysed in turn by a process of word matching to determine whether or not a particular sentence is a candidate answer. The main focus lies in trying to shorten the query by applying a simple query transformation. The process begins by retrieving documents, which the system itself then processes. The next step is to look through each sentence and identify how many query words it contains. The system assigns a type to each question and uses a QA dictionary that contains information about the relationships of words between questions and answers. This dictionary is used to determine the question type and whether a sentence is the right answer. The question type is a category; for example, a question beginning with "Who is" will have the type PERSON/ORGANISATION. The system also tags the whole document to extract named entities and employs coreference resolution (words that refer to other objects). Each sentence is scored according to the specific search engine, and sentences are compared for redundancy. This kind of answering provides no means of gaining additional information which may be useful, as only sentences are considered. This approach is similar to that of the Tritus system, whereby the question type is identified in order to produce candidate answers; i.e. in the Tritus system a question of type "What" may have in the answer a phrase such as "refers". In the same way, the AnswerBus system identifies these kinds of relationships.

The system described in [24] is focused on the particular search engine AltaVista. It therefore tailors questions into queries particular to AltaVista, i.e. the query is converted into a logical representation. The system uses both probabilistic and Boolean retrieval approaches for first-stage retrieval. Passages of documents are then used for the second retrieval, whereby paragraphs are detected according to the formatting of the text, i.e. indentation. Paragraph windows are created within each of the paragraphs, and weighting is applied to the paragraphs to reflect which ones potentially contain the most exact answer. The selection of paragraphs is determined by the number of occurrences of keywords, where weighting is determined by the distance between keywords. So this system moves from simple pattern matching to statistical methods as well, where the focus of the question should be considered by the potential answer candidates. The answer candidates are then weighted and presented to the user.

Statistical Methods

NTT DATA [25]: This system, based on the BM25 algorithm, extracts query terms from the query and then uses the terms to retrieve documents. The top n retrieved documents are candidates for answer extraction. If a query term occurs frequently in a particular passage in a document, then this passage is extracted from the document, assuming that this part contains the answer. The passages are then scored using IDF. Information types are assigned to the text when extracted. This then leads to the use of a dictionary which contains names of, for example, countries. Thus, when an information type suits the answer type enforced by the query, the scores for the candidate parts are changed. However, this system doesn't seem to take into consideration the actual question that was asked: the terms are simply extracted from the query and used as a search engine does at present. The aim seems to be just extracting an answer, with little interest in improving the retrieval of relevant documents.

2.5 Evaluation of Techniques

Various techniques that can be implemented to achieve question answering have been reviewed above, and their benefits and disadvantages have been outlined. Simple matching of query words against documents is a good general retrieval method; however, it is not as effective as it could possibly be. Therefore, methods need to be applied that can improve the retrieval process, namely Local Context Analysis and hence Query Expansion. For the more specific area of question answering, query transformations are effective due to the use of example data. Also, query expansion is useful, as more precise terms will improve retrieval performance. Stop word removal is implemented by Google automatically unless the terms are surrounded by quotation marks, so this does not need to be explicitly implemented. The part-of-speech tagger is required by the learning process in the Tritus system; stemming may not be appropriate for the learning stage but may be implemented later in the project. This project will focus on the Tritus system, on the query expansion research carried out by Robertson and Sparck-Jones, and on the probabilistic retrieval model in conjunction with the probabilistic methods of query expansion.

Chapter 3: The Tritus Question Answering System

The Tritus system was developed by Agichtein et al. [1] for the purpose of answering questions. These questions are in natural language form and are transformed with respect to several web search engines. The Tritus system does not attempt to create an answer from several sources; instead, it endeavours to obtain the document containing the answer as the top-ranked document. The technicalities of the Tritus system, which forms the basis of this project, are described in depth in this chapter.

3.1 The Learning Process

The aim of the Tritus system is to automatically create web-search-engine-style queries from natural language questions, that is, queries of the form that the specific web search engine can search with. As mentioned in Chapter 1, Google specifically removes question words such as "What is a" from a query such as "What is a hard disk?"; the question is therefore transformed into a query containing phrases likely to occur in the web documents that contain the relevant answers. The way the Tritus system performs this transformation is called the learning process and is described in detail below.

Generating Question Phrases

The initial stage of the learning process is to generate question phrases, of varying length, which will in effect represent the different classes of questions. This stage is important, as it is unique in taking account of what the natural language question is asking. For example, at present most search engines accepting queries such as "What is a treasurer?" and "Who is the treasurer?" will not recognise the different type of answer required by each question. The question phrases therefore reflect how the questions should be answered, and hence questions having the same general goal will lie under the same category of question phrase. With respect to the example above, "What is a treasurer?", the goal of this question is to obtain a definition or description; hence the goal can be categorised by the question phrase "What is a". Consequently, when a new query is input to the system it can immediately be categorised by pattern matching, and thus the goal of this new query can be identified.

The process begins by extracting question and answer pairs from a database of FAQs. These FAQs contain simply questions and their corresponding answers; question and answer pairs are what constitute the training data. For this particular stage of the learning process, only the questions are considered. Each question is analysed in turn, and phrases of varying lengths (2-4 words) are generated from the beginning of the question. For example, the question "What is a CDROM?" can generate the question phrases "What is", "What is a" and "What is a CDROM". It is convenient to think of the generated phrases as being in a list; the decision as to whether a generated phrase will be used in the following stages of the learning process is made by counting the number of times this particular phrase occurs in the list. If the phrase under consideration appears at least some number of times (30) in this list, then it is retained for future use. A sketch of this stage appears below.
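A minimal Perl sketch of this stage; the file handling is simplified to one question per line on standard input, and the threshold of 30 is taken from the text:

    use strict;
    use warnings;

    # Generate candidate question phrases: the leading 2- to 4-word
    # prefixes of each training question, kept only if a prefix occurs
    # at least $min_count times across the whole question set.
    my $min_count = 30;
    my %freq;

    while (my $question = <STDIN>) {
        chomp $question;
        my @words = split /\s+/, $question;
        for my $len (2 .. 4) {
            last if $len > @words;
            $freq{ join ' ', @words[0 .. $len - 1] }++;
        }
    }

    for my $phrase (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
        print "$phrase\t$freq{$phrase}\n" if $freq{$phrase} >= $min_count;
    }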

This process can also be restricted with regular expressions if the question phrases need to be constrained in some way. For example, if only the most useful and common phrases are adequate, then regular expressions can be used, which is what was done in the Tritus system. Regular expressions make this very easy, as computation using them is simple and cost-effective, requiring only the matching of strings. Table 1 provides examples of question phrases and the categories they fall under.

    Question Type    Question Phrases
    What             What is; What is a; What are
    When             When did; When is; When is the
    How              How can I; How can

    Table 1: Examples of generated question phrases

Generate Candidate Transforms

This second stage of the learning process uses the question and answer pairs in the training data and the question phrases generated in the previous stage. Here the answers are tagged with a part-of-speech tagger in order to eliminate nouns, words that can potentially alter the topic of a query. Each question phrase is assigned candidate transforms that are generated from the answers in the training data. The candidate transforms specify what kind of phrases should be present in the documents to be retrieved in order to answer the question. Hence, forming a query using example answers, i.e. candidate transforms, should ensure that a more relevant answer (document) is obtained.

The first part of this stage is to examine each question and answer pair in turn. The examination involves generating answer phrases from the beginning of each answer, where the beginning of the corresponding question matches a phrase from the list of question phrases. These answer phrases, like the question phrases, are of varying length (1-5 words) and are filtered according to frequency: how many times each answer phrase has occurred before for this specific question phrase. Again, the frequency must be greater than a minimum occurrence value (3). At this point in the process it is necessary to remove candidate transform phrases that contain nouns. Even though the use of nouns makes a query more precise, during the learning process nouns have adverse effects, namely redirecting the focus of the query, because these transforms will be applied to arbitrary queries after the learning process has been completed. Upon removal of phrases containing nouns, the top n phrases in decreasing order of frequency are retained. Table 2 provides examples of the transforms.

    Question Phrase    Candidate Transform Phrases
    What is a          stands for; means; refers to

    Table 2: Examples of generated candidate transforms

Ranking Candidate Transforms

At this point, phrases have been generated from sample answers that correspond to question phrases. These phrases now need to be ranked, initially using ranking methods from ordinary information retrieval. The weighting technique is a variation of the Robertson-Sparck Jones weighting scheme, which assigns weights to terms specific to a query topic. As there are no query topics here, the scheme cannot be applied in its entirety, so the meaning of "query topic" is changed to mean "question phrase". A relevant document is then an answer in the training set that corresponds to the question phrase. Another change is that weights are assigned to phrases, tr_i, rather than to single-word terms. The number of <Question, Answer> pairs in which the phrase occurs in the Answer and the Question matches the question phrase is taken as the number of relevant documents; all the other <Question, Answer> pairs where the phrase occurs in the Answer are counted as irrelevant. The relevance weight (RW) formula (Equation 6) is applied with these modifications. Term selection weights are calculated as in automatic query expansion. The final weight is calculated using the formula below, where qtf_i is the co-occurrence count of tr_i with respect to the question phrase and w_i^(1) is the relevance-based term weight of tr_i, also with respect to the question phrase:

    wtr_i = qtf_i . w_i^(1)    (Modified Offer Weight Formula)

The candidate transforms are sorted into groups according to the number of words in each phrase, and up to 25 transforms with the highest values of wtr_i are retained from each group.

Weighting and Re-ranking Transforms using Search Engines

The transforms are then evaluated for performance on web search engines. The process is completed for all question phrases and all web search engines required. It starts by retrieving a set of <Question, Answer> pairs where the Question begins with a question phrase; this becomes the set of training examples. The pairs are sorted in increasing order of answer length, and the top 100 pairs are retained. The example <Question, Answer> pairs are examined one at a time, along with the candidate transforms generated in the previous stage. Each transform is applied to the Question in turn. When a Question begins with a question phrase, the application of the transform removes the question phrase from the question; a rewritten query is then obtained in the form of the remainder of the question AND the candidate transform. The AND is Boolean, so the query is a set of words or phrases. For example, if Question = {QP R}, where QP is the question phrase and R is the remainder of the terms of the question, then the transformation will result in Query = {R AND tr_i}, where tr_i is the i-th transform in the set of transforms for this particular QP. A sketch of this rewriting step follows.
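A minimal sketch of the rewriting step; the quoting convention is an assumption, since, as the next section notes, the actual phrase syntax depends on the search engine:

    use strict;
    use warnings;

    # Rewrite a question {QP R} into the query {R AND "tr_i"}: strip the
    # question phrase, then conjoin the remainder with a candidate
    # transform, quoted so the search engine treats it as a phrase.
    sub apply_transform {
        my ($question, $phrase, $transform) = @_;
        return undef unless $question =~ /^\Q$phrase\E\s*(.+?)\s*\??$/;
        my $remainder = $1;
        return qq{$remainder AND "$transform"};
    }

    print apply_transform('What is a compiler?', 'What is a', 'refers to'), "\n";
    # prints: compiler AND "refers to"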

This stage requires the transforms to be treated as phrases, which is achieved using the appropriate syntax of each search engine. The rewritten query is submitted to the search engine, and the top ten documents are retained and analysed. Subdocuments are generated for calculating the similarity between the original answers and the retrieved documents. The next step is to calculate the score of each retrieved document with respect to the original answer from the <Question, Answer> pair. The score for a document is the maximum of the similarities of its subdocuments. The BM25 scoring scheme (Equation 5) is applied, with the changes mentioned above to allow for the incorporation of phrase weights, as described in further detail in [1]. Next, the weight for a particular transform is calculated as the average similarity between the original answers and the documents returned using the rewritten query. The calculations are performed over the whole set of <Question, Answer> pair examples. This automatically ranks the transforms in order of weight, and each transform can be applied to a new query according to these weights.

3.2 Apply Transformation to query

The transformations are then stored as rules, so that when a new question is entered the transforms are applied as follows. The question phrase is identified from the question entered, and thus the set of transforms can be retrieved. When categorising the new question, the longer, and therefore more specific, question phrase is preferred: a question phrase such as "How do I" is preferable to "How do". For each transform in the set, the transform is applied to the question to obtain a new query. The application process is as above, when weighting the transforms with respect to search engines. Each transformed query is then submitted, the resulting documents are scored and hence ranked in decreasing order of score, and the top documents are returned to the user. This completes the Tritus system's question answering process.
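For illustration, the transform-weighting loop of section 3.1.4 might be sketched as below. The two stub subroutines stand in for the search-engine call and the modified BM25 similarity, and the subdocument maximum is omitted; this is an outline of the idea rather than the Tritus implementation.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stubs: a real implementation would fetch the top ten documents for
    # the rewritten query and apply the modified BM25 similarity.
    sub search_top_docs { my ($query) = @_; return ('placeholder document text'); }
    sub similarity      { my ($doc, $answer) = @_; return 0.5; }

    # Weight one transform as the average similarity between the original
    # answers and the documents returned for the rewritten queries.
    sub weight_transform {
        my ($tr, $qp, $pairs) = @_;     # $pairs: arrayref of [question, answer]
        my ($sum, $n) = (0, 0);
        for my $pair (@$pairs) {
            my ($q, $a) = @$pair;
            next unless $q =~ /^\Q$qp\E/;        # training example for this QP
            (my $rest = $q) =~ s/^\Q$qp\E\s*//;  # rewrite as {R AND tr_i}
            $rest =~ s/\?\s*$//;
            my @docs = search_top_docs(qq{$rest "$tr"});
            $sum += $_ for map { similarity($_, $a) } @docs;
            $n   += @docs;
        }
        return $n ? $sum / $n : 0;      # average similarity = transform weight
    }

    my $w = weight_transform('refers to', 'What is a',
                             [ ['What is a modem?', 'A device that ...'] ]);
    print "$w\n";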

Chapter 4: Implementation and Testing Strategy

The following sections describe the system implemented for this project. Where the implementation deviates from the original method of the Tritus system, this is clearly indicated and the reasons are outlined in detail. Since this project follows the work of Agichtein et al., a separate design was not necessary for the learning process, and I have endeavoured to follow the implementation of the Tritus system as closely as possible within the constraints.

At this point it is important to make clear that the pattern matching is case sensitive. This means that the QA system will match against "What is" but not "what is". This may be corrected in future development of this QA system; however, for this project the preferred input is a natural language question that is grammatically correct, i.e. a question that begins with a capital letter and ends with a question mark.

At each stage of the learning process, testing could only be carried out by scanning the relevant output by eye, to ensure that the appropriate information was output to each file. The only point where the system can be fully tested is at the transformation application stage, where the new query created after transformation is verified for correctness.

4.1 Learning Process

Throughout the learning process the Perl programming language has been used, for the reason that Perl handles pattern matching more competently than, say, Java, and manipulates plain text much faster with regular expressions.

4.1.1 Extracting Question and Answer pairs

The initial stage of the learning process requires the formation of a set of training data. This involves extracting questions and corresponding answers from FAQs. The FAQs were obtained from the Internet FAQ Consortium's archive of Usenet FAQs [5]. The FAQs in the archive are sorted into categories according to subject. The first task was therefore to download these files; only the files written in English were considered. This was easily achieved by downloading multiple web pages. Any of the files in the FAQ archive may be used for this stage, although note that the archive also contains FAQs written in languages other than English. The FAQs selected at this stage should cover a broad variety of questions, so that the transforms to be generated are also broad; that is to say, the Questions and Answers should not all relate to one subject. I tried to achieve this by using questions relating to topics such as Astronomy, Perl, C++, copyright law, J R R Tolkien and more.

The next task was to extract the questions and the corresponding answers. A Perl program was written to achieve this, since the pattern matching required for extraction is handled well by Perl. The result was a set of 1073 questions and their corresponding answers. The files containing the questions and answers first had to be scanned by eye to identify the exact layout. Only the questions and answers were required for this stage, and the FAQ files also contained other text and non-alphabetical characters that were not required and so had

to be discarded. Also, some FAQ files contained a table of contents, which is also not required; therefore the Perl program had to identify whether a question located in the file was part of the table of contents or not. Finally, the questions themselves had to be identified. Some questions in the files begin with a combination of, for example, numbers, dots, brackets, "Q" and "Subject", so these had to be identified and subsequently removed. The line in the file also had to end in a question mark. Only questions that fit on one line in the file were considered, in order to simplify the implementation. Since Google limits queries to 10 words, I felt that shorter questions would better reflect the variety of questions that users typically ask a search engine. Due to the lack of uniform formatting of the FAQs, the question and answer extraction process was extremely time consuming: the exact information required had to be extracted while useless characters, such as dashes, were avoided. Eventually, all the different formats of the FAQ files were catered for by one Perl file.

The file produced at this point was examined by eye to ensure that the questions appeared under the correct tags and that the corresponding answers appeared under the equivalent tags. Any changes that needed to be made were made before proceeding, as advancement cannot occur until this stage is absolutely complete; therefore the results were checked after each new set of FAQs was added to the collection.

The learning process requires that the question and answer pairs be tagged by a part-of-speech tagger. The tagger chosen was Brill's part-of-speech tagger, which requires the text to be formatted in a particular way, namely one sentence per line. For this, a Perl program (splitter [44]) was used, which identifies the sentences and creates a new file with each sentence on its own line. However, before this could be carried out, another Perl program had to be implemented to make sure that paragraphs ended with full stops, because the splitter uses textual cues and rules to identify sentence endings [45]. The split text was then formatted once again to remove the tags that the splitter inserted, and the result is the set of question and answer pairs. Again, the results were scanned and verified for any unwanted formatting and corrected as required. This testing had to be completed in order to proceed to the next stage of the learning process.

4.1.2 Generate Question Phrases

The next stage of the process is to generate question phrases. This was accomplished using the questions and answers extracted in the previous stage. This stage involved scanning through the questions and answers and extracting the questions alone. Once the questions are extracted, phrases of length two to four words are generated. For example, taking a question from the FAQs used in the project, "How do I create my own default style sheet?", the phrases would be generated as shown in Table 3.

Question Type    Question Phrase
How              How do
                 How do I
                 How do I create

Table 3: Examples of generated Question Phrases
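As a rough sketch (with a toy question list rather than the extracted question file), the phrase generation can be expressed as follows:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count question phrases of two to four words taken from the start
    # of each question. The question list here is an invented sample.
    my @questions = (
        'How do I create my own default style sheet?',
        'How do I compile a Perl program?',
    );
    my %freq;
    for my $q (@questions) {
        my @words = split /\s+/, $q;
        for my $len (2 .. 4) {
            next if $len > @words;
            $freq{ join ' ', @words[0 .. $len - 1] }++;
        }
    }

    # Keep phrases occurring at least five times across the collection;
    # the threshold is lowered here so the toy sample prints something.
    my $MIN_FREQ = 2;
    for my $phrase (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
        print "$phrase\t$freq{$phrase}\n" if $freq{$phrase} >= $MIN_FREQ;
    }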

These phrases are then stored in a file, along with the corresponding question type, to be accessed later in the process. The phrases to be retained are determined according to the frequency of each phrase with respect to the whole collection. Since only approximately 1000 questions formed the training data, the minimum frequency of a retainable phrase was lowered to five. That is to say, a question phrase had to occur at least five times to be considered a common phrase; as a result, the set consisted of 51 question phrases ranging across the different question types such as How, What etc. Despite this constraint, this stage of the process was accomplished successfully, since some examples were available for comparison [1] and the results are similar. To ensure the correct phrases were obtained, each phrase was associated with its frequency using a hash structure, and the phrases were written to the file according to their counts. Before continuing, the output was thoroughly verified for any incorrect data, that is to say, checked that only the correct question phrases were output and nothing more, with changes made if required.

During this phrase generation process, the answers were also extracted separately in order to be tagged by a part-of-speech tagger. The extracted questions are also saved at this point for access later in the process. The Brill tagger was run on the questions and answers separately, and the output was then combined in order to associate answers with questions. The original Brill tagger [10] could not be compiled for the operating system used in the development of this project, so a pre-compiled, slightly modified version was obtained [46] and employed. For this reason the questions and answers were tagged separately, as the whole text was failing to be tagged, and tagging may still fail if any line in the text is too long. To get around this, some editing may need to be performed on the questions and answers file, such as inserting full stops for the splitter tool or inserting new line markers.

4.1.3 Generate Candidate Transforms

The part-of-speech tagger assigns syntactic part-of-speech labels to the words in the text in the context of the sentence, as mentioned in chapter 2. In order to proceed to the next stage of the learning process, the nouns have to be removed from the text, principally from the answers rather than the questions. The nouns are removed so that when the transforms are generated and applied to a query, the query's objective does not deviate.

Once the nouns have been removed from the answers, each answer is considered in turn to generate the candidate transforms. The question phrase of the answer's corresponding question is identified. The transforms are then associated with this question phrase, and the most frequently occurring transforms for each question phrase are retained; the result was approximately 400 transforms across the range of the 51 question phrases.

This stage, unfortunately, was not as successful as the question phrase generation stage. Due to the scale of this project, transforms are generated from the first five words of the answers only. If a larger portion of the answer were to be used, a more sophisticated algorithm would be required, which was not achievable in the available time. Nor did I obtain results similar to the examples provided in [1]: for the question phrase "What is a" I did not obtain transforms like "refers to" or "to describe".
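For reference, this generation step can be sketched as follows. The tagged answer, the word/TAG format and the Penn Treebank noun tags (NN, NNS, NNP, NNPS) are assumptions about the Brill tagger's output; this is an illustration, not the project's exact code.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Generate candidate transforms (1-5 word n-grams) from the first
    # five words of a tagged answer, after dropping nouns.
    my $tagged = 'The/DT following/VBG may/MD be/VB used/VBN to/TO enable/VB';
    my @words = grep { !m{/NNP?S?$} }          # remove NN, NNS, NNP, NNPS
                (split /\s+/, $tagged)[0 .. 4];
    s{/[A-Z\$]+$}{} for @words;                # strip the POS tags

    my %count;
    for my $len (1 .. 5) {
        for my $start (0 .. @words - $len) {
            # count each phrase per question phrase (frequency filter later)
            $count{ join ' ', @words[$start .. $start + $len - 1] }++;
        }
    }
    print "$_\n" for sort keys %count;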

The most likely reason for this is that a smaller portion of each answer was used than in the Tritus system, or that the answers are of an informal nature. An example of the transforms that were generated is shown in Table 4, for an example answer in the training data. The answer begins "The following may be used to enable in some popular valid ..."; this portion of the answer does not read sensibly because the nouns have been removed.

Question Phrase    Candidate Transform Phrases
How do I           The following
                   following may
                   may be
                   be used
                   The following may
                   following may be
                   may be used
                   The following may be
                   following may be used
                   The following may be used

Table 4: Examples of generated candidate transforms

4.1.4 Weighting of Candidate Transforms

This process involves calculating a score for each transform with respect to its question phrase, as generated in the previous step. The initial weighting was calculated as understood from the method outlined in the Tritus system. This involved counting the total number of relevant answers, the number of relevant and non-relevant answers for each individual transform, and the total number of answers in the training set.

The number of relevant answers for a given transform is calculated as follows. Firstly, the whole collection is iterated through; for each transform, its question phrase is identified. While reading the answers in the training set, each transform is compared in turn with the current line of the current answer. If the transform appears in the line being examined and the answer belongs to the question phrase of the transform, then the answer counts as relevant. An answer belongs to the question phrase of the transform if its question begins with that question phrase. On the other hand, if the transform appears in the line being examined and the answer does not belong to the question phrase of the transform, then the answer counts as non-relevant. All other answers are ignored. Once an answer has been identified as relevant or non-relevant to a particular transform, it will not be added to that transform's list of relevant or non-relevant documents again.
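A sketch of this counting, with invented <Question, Answer> records, might look as follows:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count relevant (r) and non-relevant answers for one transform;
    # each answer is classified at most once. Sample data only.
    my $transform = 'stands for';
    my $qp        = 'What is a';
    my @pairs = (
        { qp => 'What is a', answer => "XML stands for Extensible Markup Language." },
        { qp => 'How do I',  answer => "The acronym stands for something else." },
        { qp => 'What is a', answer => "It is a small handheld device." },
    );

    my ($r, $nonrel) = (0, 0);
    for my $pair (@pairs) {
        for my $line (split /\n/, $pair->{answer}) {
            next unless index($line, $transform) >= 0;  # transform on this line?
            $pair->{qp} eq $qp ? $r++ : $nonrel++;      # relevant iff same phrase
            last;   # do not count the same answer again on later lines
        }
    }
    my $n = $r + $nonrel;                      # answers containing tr_i at all
    my $N = scalar @pairs;                     # total answers in the training set
    my $R = grep { $_->{qp} eq $qp } @pairs;   # total relevant answers
    print "r=$r n=$n R=$R N=$N\n";             # here: r=1 n=2 R=2 N=3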

Once this information has been collected, the weight is calculated using the relevance weight formula from chapter 2, modified to incorporate the concept of question phrases and of phrases as opposed to single-word terms. The modified formula is shown below.

    w_i(1) = log [ ( r / (R - r) ) / ( (n - r) / (N - n - R + r) ) ]

where
    R is the total number of relevant answers
    r is the number of relevant answers that contain the term tr_i
    n is the number of answers that contain tr_i
    N is the total number of answers in the training set

The final weight is then calculated as follows, where the co-occurrence count qtf_i of the Modified Offer Weight Formula is instantiated as r:

    wtr_i = r . w_i(1)

where r is the number of relevant answers that contain the term tr_i and w_i(1) is as above.

Unfortunately, due to time constraints, the transforms could not be weighted with respect to Google. This process is extremely long and requires a great deal of calculation and computational resources, so it was not achievable in the available time. Therefore, the initial wtr_i weight is taken as the final score.

4.2 Apply Transformation

The process of applying a transformation to a new query involves first identifying the question phrase of the query and then retrieving the appropriate transforms. The question phrase is replaced by the top-ranking transform, and the result is the query to be supplied to Google. Again, note that the pattern matching is case sensitive.

When a question is supplied to the system for transformation, the first step is to read the scores for each transform into memory. The next step is to split the question into individual words and form question phrases, as in the Generate Question Phrases step of the learning process. Once the question phrases have been generated, they are matched against the phrases in the training data. Longer phrases are preferred, and if such a phrase exists in the training set its transforms are retrieved. The best-scoring transform is added to the query, and the question phrase words and the question mark are removed from the question, thus forming the new query. This query is then submitted to the Google API via the Perl implementation. For example, a question such as "What is a microphone?" would obtain the top transform TR. The result of the transformation will be microphone TR, where "What is a" is the question phrase and TR is the transform, encoded as a phrase for Google.
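A sketch of this transformation step, with an invented transform table, is given below; the phrase scores are assumed to have been read into a hash already, so each phrase simply maps to its transforms in decreasing order of score.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Question phrase => transforms, best-scoring first (sample values).
    my %transforms = (
        'What is'   => ['means'],
        'What is a' => ['refers to', 'stands for'],
    );

    sub transform_question {
        my ($question) = @_;
        # Prefer the longest (most specific) matching question phrase.
        my ($qp) = sort { length($b) <=> length($a) }
                   grep { $question =~ /^\Q$_\E\b/ } keys %transforms;
        return $question unless $qp;    # non-question: keyword search as-is
        (my $rest = $question) =~ s/^\Q$qp\E\s*//;
        $rest =~ s/\?\s*$//;            # drop the trailing question mark
        my $tr = $transforms{$qp}[0];   # highest-scoring transform
        return qq{$rest "$tr"};         # transform quoted as a Google phrase
    }

    print transform_question('What is a microphone?'), "\n";
    # prints: microphone "refers to"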

If the query does not begin with a question phrase it will not be transformed; that is to say, the system will not attempt to transform a non-question query. There is therefore also the capability, in this implementation, to perform normal keyword searching.

Originally, the query was to be transformed using all of the top n transforms for the question phrase. Unfortunately, adequate resources were unavailable to carry this out. This step involves retrieving all the documents for each new query and assigning scores to each document; if a document has previously been scored then its score is updated, and finally the top n documents with the highest scores are returned in order. The number of documents to be analysed would be the number of transforms multiplied by 10 (the Google API returns 10 documents for every search), and I was unable to gather enough resources to do this analysis. The problem was not the identification of documents, which can be achieved using the URL of each document, but simply a lack of computational resources, together with time constraints, which forced this decision. Therefore, only one transform is applied to each question.

4.3 Query Expansion

After the query has been submitted to Google, the retrieved documents are parsed, i.e. the HTML tags are removed, using the Perl module HTML::Parser. Each line of each document is then analysed, and the neighbouring words of the query words (up to two to the left and up to two to the right) are recorded as potential query expansion words. The words are then ranked in increasing order of frequency, lowest frequency first, in order to avoid common words being favoured for expansion. These words are then ready for selection by the user, to be added to the existing query for a new search.

At this point, each retrieved document is analysed and ranked using the tf scoring scheme. tf.idf is not used because all the query words appear in all the retrieved documents, so idf would equal 0 (idf = log(N/n) = 0 when N = n), which would force the score of every document to 0. The results are written to a file readable by a web browser, ready to be displayed. The system is presented using Java, which provides an interface to enter the query and execute the search.

4.4 Testing Strategy

Upon completion of the learning process, the generated candidate transforms were examined in order to identify which question phrases possessed the greatest number of transforms. I then made the assumption that the phrases with the greatest number of candidate transforms would perform best on this QA system, as opposed to the phrases with few or no transforms. Therefore, as well as comparing this QA system with Google, the test results may also prove or disprove this assumption. Since there are over 50 different question phrases, the system could not be tested as extensively in the available time as would have been required. Thus, each question phrase tested is exercised using three questions that begin with that phrase, and a number of random questions have also been included. The aim is to score both systems; the results of doing so follow in Chapter 6.
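Returning briefly to the expansion step of section 4.3, the neighbouring-word collection can be sketched as follows (with invented sample lines standing in for the tag-stripped documents; not the exact code used):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # For every query word found in a line, take up to two words on each
    # side as expansion candidates, then rank rare words first.
    my @query_words = qw(microphone);
    my @lines = (
        'a microphone converts sound into an electrical signal',
        'plug the microphone into the sound card',
    );

    my %freq;
    for my $line (@lines) {
        my @w = split /\s+/, lc $line;
        for my $i (0 .. $#w) {
            next unless grep { $w[$i] eq lc $_ } @query_words;
            for my $j ($i - 2 .. $i + 2) {
                next if $j == $i || $j < 0 || $j > $#w;
                $freq{ $w[$j] }++;      # neighbouring word is a candidate
            }
        }
    }

    # Increasing order of frequency: common words sink to the bottom.
    for my $word (sort { $freq{$a} <=> $freq{$b} } keys %freq) {
        print "$word\t$freq{$word}\n";
    }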

Chapter 5: The Google Application

This chapter provides details of how the system was made presentable to a user. An interface was created which offers a means to enter a query and search easily, thereby making the system more user-friendly than, say, command-line entry. Background information on the Google API, as well as other useful information, is also provided in this chapter.

5.1 Java Interface

An interface has been created in Java which accepts a query of any type, not necessarily a question, and displays a list of words for query expansion. When a query is submitted via this Java application, the list of query expansion words is checked to see whether any words have been selected, and the query is submitted accordingly for transformation by a Perl program. The query is submitted to Google through a Perl implementation of the Google API. Finally, the query is expanded again by a Perl program, as in section 4.3 above, and the expansion words are displayed in the Java interface. The results are displayed in a Microsoft Internet Explorer window after being ranked using the tf score, as in section 4.3.

The application consists of three Java classes and three Perl programs. The first Java class creates the interface components of the application using Java Swing components and the ActionListener interface. From this interface, a user can enter a question and search. If a search has previously been performed, the list under the search field will contain several words which the user can select to be added to the query before searching again; multiple words may be selected. The user can also be directed to the Google homepage via the button at the top of the screen with the Google logo. Picture 1 below shows the interface of this application.

Picture 1: Screenshot of interface
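The Perl route to Google uses the Google Web API, which at the time of writing is a SOAP service. A sketch of such a call using SOAP::Lite is shown below; the WSDL file name, licence key and result handling are illustrative placeholders, and the parameter order follows the doGoogleSearch method as I understand it from the Google Web APIs documentation, rather than reproducing this project's code.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use SOAP::Lite;

    # Illustrative only: error handling is omitted.
    my $key   = 'YOUR-GOOGLE-LICENCE-KEY';
    my $query = 'microphone "refers to"';

    my $google = SOAP::Lite->service('file:GoogleSearch.wsdl');
    my $result = $google->doGoogleSearch(
        $key, $query,
        0, 10,               # start index and number of results (at most 10)
        'false', '',         # filter duplicates, site/topic restrict
        'false', '',         # SafeSearch, language restrict
        'latin1', 'latin1',  # input and output encodings
    );

    for my $element (@{ $result->{resultElements} || [] }) {
        print $element->{URL}, "\n";   # one retrieved document per line
    }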
