Quoogle: A Query Expander for Google


Michael Smit
Faculty of Computer Science, Dalhousie University
6050 University Avenue, Halifax, NS B3H 1W5
smit@cs.dal.ca

Whitney Thiele
Faculty of Computer Science, Dalhousie University
6050 University Avenue, Halifax, NS B3H 1W5
thiele@cs.dal.ca

ABSTRACT
The query is the fundamental way through which a user interacts with a body of data. It is the sole vehicle for a user to obtain the information they are looking for with any degree of efficiency. Hence, providing a means for the user to improve the query itself is an essential step toward obtaining better results. We investigate query revision by creating a system that allows users to refine their query with vocabulary obtained during the searching process. Common terms in the hit set generated by the user's query are shown to the user as suggestions for interactively refining the query. This directs the search to particular clusters of results that should be more relevant to the user.

Keywords
Information retrieval, query expansion, query augmentation

INTRODUCTION
There has been much discussion in the literature regarding query reformulation. Different techniques, such as using a thesaurus or a corpus analysis [3] to expand the query, have been suggested. Other techniques include automatic query reformulation or expansion [4], nearest neighbor expansions [14], and direct relevance feedback [1]. It has been shown that under certain circumstances these methods are effective in providing better results. However, each method has its own strengths and weaknesses. Direct relevance feedback is not commonly used in today's commercial information retrieval systems. The cost of learning to use the feedback mechanism often outweighs the perceived benefit, and users want a click-and-go search, not complicated feedback measures.

Our investigation approaches user feedback in a way that is intuitive and easily understood by the end user. The search algorithm should do any complicated computations, such as clustering, indexing, or topic relevance determinations, in the background, where the user cannot see the details. However, the user should be given the final choice on what modifications, if any, are made to their query. In short, we attempt to refine (or broaden) a query as necessary by suggesting possible additions to the query, guided by the user's feedback.

We have created a system that allows users to refine their query with vocabulary obtained from the results of their initial query. High-weighted terms in the hit set returned by the user's query are shown to the user as suggestions for ways to enhance the query. This directs the search to particular clusters of results that are more relevant to the user. The choice the user makes is stored to aid in the automatic refinement of other queries. Query term suggestions should be a tool that both gives the user insight into the representative terms in the resulting hit set and helps the user augment the query quickly.

BACKGROUND AND MOTIVATION
One shortcoming of many modern information retrieval systems is the interface between the retrieval system and the end user. Specifically, the query formulated by the user is the only input a system has for determining the intentions of the user and providing suitable results. As data sets like the World Wide Web grow larger and cover an ever-widening range of topics, more sophistication in query generation is required. However, new users of search engines that search these large data sets are generally not capable of entering advanced queries; they expect relevant results while providing only the minimum of input.
It has been shown that in some cases query expansion can improve the results obtained, compared to doing no expansion at all [13]. Unfortunately, the cumbersome interfaces and complex processes of some current query modification techniques deter the users who would receive the most benefit from these methods. Users do not perceive any immediate benefit to providing extra information to the search engine, and view more complicated methods with skepticism.

There have been many investigations into different aspects of query modification using various techniques. These methods include relevance feedback, term co-occurrence, word stemming, and thesauruses.

Augmenting the query with terms obtained through relevance feedback has been shown in [6] to produce better results. Relevance feedback is the method in which the system adjusts the query to produce a greater number of relevant documents. In the case of the vector space model, this is accomplished by adjusting the term weights in the query so that the resulting query vector moves closer to large groupings or clusters of relevant documents. This is dependent on factors such as the document-ranking algorithm, which determines which documents are relevant.

Another modification that has been examined is the use of terms that co-occur in the document set. Terms that frequently occur in the same document are added to the query, thereby expanding it. This has been shown to be a highly variable technique that can produce both relevant and irrelevant results depending on the terms that are added to the query [5].

Word stemming algorithms are used in query expansion by conflating word stems with a collection of suffixes. In other words, stemming extends the terms in the query by adding similar terms based on lexical structure. This strategy for automatically expanding the query terms is also highly variable, depending on the term root, or stem. In some cases the returned results are improved, and in others they are poorer [7][11].

A thesaurus is used to provide terms that are related to the query terms by similarity in meaning instead of by word stems. This strategy has also been studied for use in automatic query expansion.
The query term is augmented with synonyms from a thesaurus in an attempt to include related terms that might occur in related documents. This tends to increase the recall of the system, since documents containing a wider variety of terms will be included in the results [12]. However, it can also increase the size of the hit set, because the new query matches any of the added terms, not all of them. The increase in recall comes at a cost to precision, and forces the user to sort through a larger hit set if their desired document is not the first result.

New retrieval algorithms use different techniques to rank documents in an attempt to provide better results to the end user. For example, some systems use link topology to assign a measure of popularity, and others use geographical constraints to localize potential search results. In commercial systems, especially Internet search engines, the retrieval mechanism and any resultant processing are kept out of the user's sight. This means that additional processing of the query can be done without the knowledge of the user, with the user seeing only the suggested queries.

IMPLEMENTATION
To test our approach to query augmentation, we developed a system that makes use of an existing search engine, which has the infrastructure and a proven searching method already in place. Quoogle, our augmentation to the existing search engine Google, was written in Perl and tested on a 2.4 GHz Pentium machine running Windows XP and Apache. To use Quoogle, a user enters their query into the Quoogle search page. The results page is divided into two sections: the top portion is for Quoogle's suggested augmented queries, and the bottom section is the standard Google results page.
If one of the top ten Google responses is sufficient for the user, they can ignore Quoogle's suggestions; if they find that their query was too broad, they can select one of the augmented queries, or try another query of their own invention. This method of user choice regarding query augmentation is very similar to Google's own spelling correction suggestions. Although Google was chosen for our implementation, other search engines such as AltaVista or Yahoo could have been augmented in a similar fashion.

To generate the augmented queries, the Quoogle engine performs an analysis of the hit set. It takes a random sample of 30 result pages from the first 100 results returned by Google. It then downloads each of those pages, removes HTML tags and comments, tokenizes the documents based on white space, and removes any words that are on the stop word list. It then calculates normalized term frequency and inverse document frequency scores for each document and term in this set of 30, and assigns each term a normalized weight according to this formula:
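The weighting formula itself is missing from this copy of the text. As a stand-in, the following minimal Python sketch (Quoogle itself was written in Perl) walks through the analysis steps just described, assuming a common length-normalized TF-IDF weighting summed over the sampled documents. The function names, the tiny stop word list, and the exact normalization are illustrative assumptions, not the authors' formula.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stop word list; the list Quoogle used is not given in the text.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}


def sample_pages(pages, sample_size=30, pool_size=100, seed=None):
    """Pick a random sample from the first `pool_size` result pages."""
    pool = pages[:pool_size]
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))


def tokenize(html):
    """Strip HTML comments and tags, split on white space, drop stop words."""
    text = re.sub(r"<!--.*?-->", " ", html, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", text)
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def top_terms(documents, k=5):
    """Rank terms by summed length-normalized TF-IDF and keep the top k."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many sampled documents each term occurs.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    weights = Counter()
    for tokens in token_lists:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)                 # normalize by document length
            idf = math.log(n_docs / doc_freq[term])  # rarer terms score higher
            weights[term] += tf * idf

    # Highest weight first; ties are broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]
```

Under this assumed weighting, a term that appears in every sampled page receives an inverse document frequency of zero, so the original query terms themselves tend to drop out of the suggestions.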

Figure 1: The Quoogle user interface.

This normalization is done to minimize the effect that document length would have on the term weights. The terms are ranked according to their weight, and all but the top 5 terms are discarded. These remaining terms are used to generate five suggested queries that are displayed to the user. If the user clicks on one of these links, the system stores which keyword was chosen and redirects the user to a Google search page with the results for the augmented query. This ends the user's interaction with Quoogle.

Limitations
There are a number of limitations to the current implementation of Quoogle. In particular, Quoogle is an augmentation to an existing system, Google. Although Google provides a convenient method for programmers to interface with their search engine, this method was not well suited to this application. It limits each query to 10 results; to obtain the top 100 documents, ten separate queries were run. Once the 30 sample documents were selected, each page had to be downloaded. Running the queries and downloading the pages is very time consuming; the time taken for the actual analysis of the pages is insignificant in comparison. This prevents the current implementation from being an effective real-time system. If Google chose to deploy an augmentation such as ours, these limitations would not be a factor.

Look and Feel
The interface to the Quoogle system was carefully designed so that it remains consistent with the interface of the augmented system. In this particular case, the term suggestions are presented via simple, single-click hypertext links located in a frame directly above the main Google search results page.

Term      Relevant   Explanation
casinos   yes        Casino gambling
sports    yes        Gambling on sports
cyber     yes        Gambling online
thrill    no
money     yes        Gambling with money

Table 1: The term suggestions for gambling, their relevance, and the reason given for their relevance.
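As a rough illustration of the single-click suggestion links, the sketch below builds one augmented query per suggested term and a link to the corresponding Google results page. It assumes that an augmented query is simply the original query with the term appended, and that the link targets the public `https://www.google.com/search` URL; both are assumptions, and the helper names are hypothetical rather than taken from Quoogle.

```python
from urllib.parse import urlencode


def suggested_queries(query, terms):
    """Build one augmented query per suggested term (assumed: term appended)."""
    return [f"{query} {term}" for term in terms]


def search_link(query, base_url="https://www.google.com/search"):
    """Return a link to the results page for `query` (assumed URL pattern)."""
    return f"{base_url}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # Terms taken from Table 1 for the query "gambling".
    for augmented in suggested_queries("gambling", ["casinos", "sports", "cyber"]):
        print(augmented, "->", search_link(augmented))
```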
Figure 1 illustrates how an augmented system can be easily integrated into an existing system.

Experiment
The Quoogle system was evaluated by considering the relevance of the term suggestions that were returned by a typical query. The hypothesis was that the Quoogle system would be able to return term suggestions that were related or relevant to the original query. The first step in determining the feasibility of this approach was to analyze the terms returned by Quoogle from a typical search. A collection of 100 search queries was randomly selected from the Metaspy website. This site provides a random sample of queries that users have made on a selection of online search engines. Each search query was entered into the Quoogle system, and each resulting suggested term was judged to be either relevant or not relevant to the original query. The system created 500 suggested terms over the course of the evaluation of 100 queries.

Term Position   Average Relevance
1               94%
2               88%
3               80%
4               87%
5               28%

Table 2: Relevance by rank of weighted term suggestions.

The determination of relevance was based on relational information from WordNet [10] and a system similar to Joho [8] in which conceptual or contextual relations are considered. These types of relations can be considered for terms that have a conceptual relation or association outside the entries of a typical thesaurus or WordNet. For instance, the term elevator is related to the terms stairs and office tower (Table 1). The size of the hit set after the query modification was recorded for further analysis.

RESULTS
The Quoogle augmentation of the Google system showed that the system may provide some useful suggested terms. The analysis of the suggested terms shows an 85.2% relevance rating when compared to the original terms in the query. Some terms co-occur within a specific topic category and may be included in the returned suggestions, which might account for such a high relevance result. There may be some natural clustering of vocabulary terms in the returned hit set, which should be picked up by the term-weighting mechanism.

Another interesting feature of the relevance result is that, on average, there was a substantial drop in relevance for the last suggestion returned by the system (Table 2). This could be explained by terms that should have been captured by Quoogle in a stop list. For instance, in a number of specific cases, errors on the hit set pages caused some HTML to appear in the term index. This in turn allowed some erroneous results to be returned to the end user in the form of suggestions that had little value for query modification.

The hit set of the augmented queries was reduced on average by 71% from the hit set of the original query.
It should be noted that although we measure relevance to the initial query, we have no way of knowing how frequently the terms are relevant to what the user wished to find. There is a random factor in the equation, as there could be many terms relevant to the initial query. It is impossible to determine what terms the original creator of the query would find relevant.

Query                Suggested terms
Cheap cigarettes     Tobacco, search, information, free, products
Affirmative Action   Admissions, court, office, black, opportunity
Cell phones          Radiation, cellular, cancer, rf, middot
Cheap Tickets        Airline, pm, air, airlines, flights, eacute

Table 3: The first four queries from Metaspy and the terms suggested for addition to the queries. Notice that airline and airlines both appear in the cheap tickets expansion, which indicates that some type of stemming may be useful for the query suggestions.

FUTURE WORK
In its current form, Quoogle presents relevant terms to the user that might narrow and focus their search. There have been many studies on hypertext systems and how the structure of hypertext can have an effect on the end user. Some studies illustrate that this structure is important in developing mental models of concepts [2]. It is therefore important to facilitate the information retrieval task while also supporting the development of the end user's mental model.

The addition of term suggestions to current online systems should help the end user focus their query and obtain better results. How well this works is dependent on several usability issues that should be investigated to determine the role they may play. For instance, there have been investigations into different interface presentation mechanisms for interactive query expansion systems [8]. Hierarchical lists are an alternative to single-click expansion tools that could be offered to the end user to limit the amount of typing required.
Some research suggests that there is little difference in performance between automatic and interactive query expansion [9]. However, other studies show that the end user may prefer to have control of the system and make the decisions instead of having the system automatically choose for them [8]. This effect should be investigated further in a comprehensive user study.

Other future work may include ways to add more intelligence to the weighting mechanism and to reduce the random factor. For example, words that co-occur in the hit set with the original query terms could be given a higher weight. A user's history could be stored to assign additional weight to hit set terms based on the past interests of the user.

CONCLUSION
We have presented a system to augment a user's query using relevant terms from the hit set. The system is an addition to an existing service, and therefore has some speed limitations. However, initial results show that the system holds some promise. Future work should include a user study to perform a more detailed analysis.

ACKNOWLEDGMENTS
We thank Google for providing a method for researchers to query their system. Their API made the development of Quoogle easier. Thanks also to the CSCI 6403 class for their helpful comments, questions, and criticisms.

REFERENCES
1. ALLAN, J. (1996) Incremental relevance feedback for information filtering. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 18-22, Zurich, Switzerland.
2. DILLON, A. (1991) Readers' models of text structures: the case of academic articles. International Journal of Man-Machine Studies, 35.
3. GAUCH, S., WANG, J., AND MAHESH, S. (1999) A corpus analysis approach for automatic query expansion and its extension to multiple databases. ACM Transactions on Information Systems (TOIS), v.17 n.3.
4. GAUCH, S. AND SMITH, J. B. (1991) Search improvement via automatic query reformulation. ACM Transactions on Information Systems (TOIS), v.9 n.3.
5. HARMAN, D. (1988) Towards interactive query expansion. In Proceedings of the Eleventh International Conference on Research & Development in Information Retrieval (New York, NY).
6. HARMAN, D. (1992) Relevance feedback revisited. In Proceedings of the 15th International ACM/SIGIR Conference on Research and Development in Information Retrieval.
7. HARMAN, D. (1987) A failure analysis on the limitations of suffixing in an online environment. In SIGIR '87.
8. JOHO, H. et al. (2002) Hierarchical presentation of expansion terms. Proceedings of the 2002 ACM Symposium on Applied Computing, March 11-14, Madrid, Spain.
9. MAGENNIS, M. AND VAN RIJSBERGEN, C. J. (1997) The potential and actual effectiveness of interactive query expansion. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
10. MILLER, G. A. (1995) WordNet: A lexical database for English. Communications of the ACM, 38(11).
11. PAICE, C. D. (1994) An evaluation method for stemming algorithms. Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
12. QIU, Y. AND FREI, H. P. (1993) Concept based query expansion. Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval.
13. RUTHVEN, I. (2003) Re-examining the potential effectiveness of interactive query expansion. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
14. SMEATON, A. F., AND VAN RIJSBERGEN (1981) The nearest neighbour problem in information retrieval: an algorithm using upperbounds. Proceedings of the 4th Annual International ACM SIGIR Conference on Information Storage and Retrieval: Theoretical Issues in Information Retrieval, Oakland, California.
