Evaluating Clustering Techniques for Collaborative Querying

Size: px

Start display at page:

Download "Evaluating Clustering Techniques for Collaborative Querying"

Brenda Willis
5 years ago
Views:

1 Evaluating clustering techniques for collaborative querying Ray, C.S., Goh, H.L.D., Foo, S., & Win,.C.S. (006). Proc. 006 International Conference on Hybrid Information Technology (ICHIT006), Cheu Island, Korea, ovember 9- Evaluating Clustering Techniques for Collaborative Querying Chandrani Sinha Ray, Dion Hoe-Lian Goh, Schubert Foo, yein Chan Soe Win Division of Information Studies, School of Communication and Information anyang Technological University, Singapore {rayc000, ashlgoh, assfoo, Abstract. Collaborative querying is a technique which makes use of past users search experiences in order to help the current user formulate an appropriate query. In this technique, related queries are extracted from query logs and clustered. Queries from these clusters that are related to the user s query are then recommended. This work compares a range of similarity measures and features (e.g. titles, URLs, snippets) used for clustering, and also proposes an extended K-means clustering algorithm. Experimental results reveal that the best clusters are obtained by using a combination of these sources rather than using only query terms or only result URLs alone. Keywords: Query Clustering, Information Retrieval, Collaborative Querying Introduction Web search engines play a vital role in retrieving information from the Internet, but there are several challenges faced by users of information retrieval (IR) systems in general. Users are often overwhelmed by the number of documents returned by an IR system or fail to express their information needs in terms compatible to those used in the system. While new IR techniques can potentially alleviate these problems, specifying information needs properly in the form of a query is still a fundamental problem with users. [] attributed this to the lack of conceptual knowledge of the information retrieval process. Put differently, users need to know how to translate their information needs into searchable queries. Research in information seeking behavior suggests an alternative approach in helping users meet their information needs. Studies have found that collaboration among searchers is an important step in the process of information seeking and use. For example [3] highlights the importance of the interaction between the inquirer and the librarian, while [4] argues that communication with colleagues is a key component in the initial search for information. We use the term collaborative IR to refer to a family of techniques that facilitate collaboration among users while conducting searches. These techniques can be categorized into: collaborative filtering, collaborative browsing and collaborative querying []. In particular, collaborative querying helps searchers in query formulation by harnessing other users search experience or expert knowledge [7]. Queries, being expressions of information needs, have the potential to provide a wealth of information that could be used to guide other searchers with similar information needs, helping them with query reformulation. If large quantities of queries are collected, they may be mined for emerging patterns and relationships that could be used to define communities of interest (e.g. Lycos 50), and even help users explore not only their current information needs but related ones as well. One can imagine a network of queries, amassed from the collective expertise of numerous searchers representing what people think as helping them meet their information needs. Query formulation and reformulation will then be a matter of exploring the query network, executing relevant queries, and retrieving the resulting documents.

2 There are many techniques for uncovering related queries to support collaborative querying, one of which is query clustering [5]. Here, query logs of IR systems are mined, and similar queries are grouped and used as recommendations. A key issue in query clustering is the measure of the similarity between queries. Traditional term-based similarity measures borrowed from classical IR [] is one example. An alternative is based on the overlap of the top results (such as URLs) returned from an IR system [6, 9]. Further, a hybrid approach was proposed by [5], which is a linear combination of the content-based and results-based approaches. Experiments showed that this approach produces better query clusters in terms of precision, recall and coverage, than using either the content-based or results-based method alone. Although these results show potential, there is still room for improvement by using additional information for measuring similarity. Examples include titles, snippets (document excerpts) or other metadata in the search results listings. Such information is commonly returned by IR systems and is used by searchers to udge whether the results satisfy their information needs. These can therefore be used to complement query terms and result URLs as criteria to measure the similarity between queries. Thus, in this paper, we try to improve on existing work in query clustering by [5] in two ways. First, the type of features used for calculating similarity between queries is expanded to include titles and snippets from search results documents in addition to query terms and results URLs. Secondly, a new algorithm, a variation of K- means, is proposed for clustering queries. The results of this work are then compared with previous work.. Related Work Query logs maintained by IR systems contain information related to users queries including search sessions, query terms submitted, documents selected, and so on. Hence, query logs provide valuable indications for understanding the types of documents users intend to retrieve when formulating a query [3]. One of the common techniques for collaborative querying through the reuse of previously submitted queries obtained from query logs is via query clustering [5]. Queries within the clusters are then used as candidates for query reformulation. For example, [7] developed a collaborative IR system that recommended related queries extracted and clustered from query logs. Both query terms and results returned were utilized in their hybrid query clustering approach. Query clustering approaches can be categorized into content-based, results-based, feedback-based, and a hybrid of these approaches, depending on the type of information used to determine the similarity between queries. The content-based approach borrows from classical term-based techniques of traditional IR. Here, features are constructed from the query keywords. A maor drawback is that on an average queries submitted to a web search engine are very short [], comprising -3 words in contrast to documents which are rich in content. Since the information content in queries is sparse [], it is difficult to deduce the semantics behind queries which in turn makes it difficult to establish similarity between queries [5]. An alternative is to use clickthrough data (feedback approach) which uses documents selected by users returned in response to a query. The idea behind this approach is that two queries are similar if they lead to the same selection of documents by searchers [3]. Work by [5] demonstrated that a combination of the contentbased and feedback-based approaches performed better in terms of the F-measure than the individual approaches. The disadvantage however is that the quality of query clusters may suffer if users select too many irrelevant documents. In results-based approach, documents returned in response to a query are used to measure similarity between queries. Here, document identifiers (e.g. URLs) as well as other metadata such as the document title, snippet or even the entire document are used. The latter was used by [9] to determine the similarity between queries by calculating the overlap in document content but this method is time consuming to execute. [6] therefore determined query similarity by the amount of overlap in their respective top-50 search results listings. Finally, [0] uses the results documents containing the query terms to extract other terms which often occur in context with the original query terms and these are used to establish similarity between queries. Our work differs from previous work in that it uses not only the content of queries and results URLs, but also the titles and snippets from search results documents since they are considered good descriptors of a query [6]. Also, since the entire result document is not used as feature set, it reduces the processing time, which is

3 important in a web search scenario. Moreover, our work uses the extended K-means algorithm [4] to define the similarity between queries rather than a simple measure of overlap, as in [5]. The K-means algorithm is popular for its speed and simplicity and is effective in computing the distances between obects in very large data sets [] which makes it appropriate for our dataset. The need for specifying the initial k centroids in advance which is the greatest disadvantage of K-means is overcome by using the extended K-means algorithm [4] described in section Query Clustering Algorithm Let W denote a collection of queries to be clustered. We represent each query Q i in W with composite unit vectors: Q i = (D i, E i, F i ) () where Q i is the composite unit vector representing i-th query in the query set W, i n (n is the total number of queries in the dataset) ; D i is the component unit vector representing keywords in Q i ; E i is the component unit vector representing the terms from the snippets and the title of the documents returned in response to Q i ; F i is the component unit vector representing the results URLs returned in response to Q i.. 3. The Content-based Similarity Measure A query is considered a document and query terms are vectors representing the query []. All terms are extracted after which stop words are removed and the remainder are stemmed. Terms that appear in less than two queries are eliminated, as they do not contribute in defining the relationship between two or more queries. The extracted terms are used to represent each query (Q i ) in the query set W as the content-based component vector (D i ) in Equation (). Specifically, D i is a weight vector obtained using the standard TFIDF formula: wnq = tfnq * idfn idfn = log dfn () where w nqi is the weight of the n-th term in query Q ; tf nqi is the query term frequency; is the number of queries in the query set W ; df n is the number of queries containing the n-th term ; idf n is the inverted term frequency for the n-th term. The content-based similarity between query Q i and Q is then measured using the cosine similarity function as defined in Equation (3): sim _ content( Q, Q ) = i n= w nqi w wnqi n= n= where w nqi and w nq are weights of the common n-th term in query Q i and Q respectively obtained using Equation (). nq w nq (3) 3. The Results-based Similarity Measure As discussed, limitations exist when using only query terms for clustering. [6] notes that the results lists returned in response to queries are readily accessible sources of rich information for measuring similarity. In our work, the results URLs as well as results terms (titles plus snippets) are used to represent each query as a results-based component vector (E) in Equation (). For the results URLs component of E, let U(Q ) be a set of query results URLs to a query Q, U(Q ) = {u, u,, u n } where u i represents the i-th result URL for Q. We then define R i as the overlap between two results URLs vectors between two queries Qi and Q: R = u : u U ( Q ) U ( Q )} (4) i { i 3

4 The similarity measure based on results URLs (adopted from [6]) is defined as: Ri (5) sim _ resultsurls ( Qi, Q ) = Max( Q, Q ) i where Qi and Q represent the number of results URLs in queries Qi and Q respectively ; R i is the number of common results URLs between Qi and Q. For the results terms component of E, we employ the cosine similarity measure as presented in Equation (3): sim_ results terms ( Q, Q i wnqi wnq n= ) = wnqi wnq n= n= where w nqi and w nq are weights of the common n-th term in query Q i and Q extracted from the results titles and snippets, using the TFIDF formula in Equation (). Finally, the complete results-based similarity measure is a linear combination of both URLs and terms and is defined as: sim _ results ( Qi, Q ) = ω k * sim _ results URLs + ω l * sim _ results terns (7) (6) where ω k is the weight assigned to sim_results URLs ; ω l is the weight assigned to sim_results terms ; ω k and ω l are non negative numbers such that ω k + ω l =. 3.3 The Hybrid Similarity Measure Similar to [5], the hybrid approach adopted in the present work is a linear combination of the content-based and results-based approaches, and is expected to overcome the disadvantages of using either approach alone: sim _ hybrid( Q, Q ) = α * sim _ results + β * sim _ content where α and β are non-negative weights and α + β =. i (8) 3.4 The Extended K-Means Algorithm. Define a similarity threshold, t where 0 t.. Randomly select a query as a cluster centroid. 3. Assign the remaining queries to the existing cluster if the similarity between the new query and the cluster centroid exceeds the defined threshold. 4. The query becomes a cluster by itself if condition in step 3 is not satisfied. 5. The centroids of the clusters are recomputed if its cluster members are changed. 6. Repeat steps 3 to 5 until all queries are assigned and cluster centroids do not change any more. Fig.. K-Means Clustering Algorithm Figure summarizes the extended K-means algorithm [4], an iterative partitioning algorithm which tries to overcome some of the limitations of the standard K-means. It uses similarity thresholds instead of pre-defined k centroids. This is the maor advantage as the number of query clusters are not known in advance. 4

5 4. Methodology Six months worth of query logs obtained from the anyang Technological University (TU, Singapore) digital library was used in this work. The queries belong to various domains including engineering disciplines, business and the arts. Each entry in a query log contains a timestamp indicating when the query was submitted, the submitted query, session ID, and the number of records returned. In the preprocessing stage, only the queries from the logs were extracted as the remainder of the content did not contribute to the query clustering technique used in this work. From the logs, approximately,700 queries were randomly sampled for use. Each query was then submitted to the Google search engine using the Google Web API. From the ensuing results listing, the top 0 URLs, titles and snippets were extracted for use in the results-based component of the hybrid similarity measure. 4. Computing Similarity Recall from Equation (8) that hybrid similarity is a linear combination of the content and results-based similarity. In our experiments, five pairs of values for (α,β) were used to vary the levels of contribution of each of these similarity measures: (0.0,.0), (.0, 0.0), (0.75, 0.5), (0.50, 0.50) and (0.5, 0.75) respectively for the content-based (sim_content), results-based (sim_result), hybrid (sim_hybrid), hybrid (sim_hybrid) and hybrid3 (sim_hybrid3) similarity measures. ote also that the results-based approach, defined in Equation (7), is a linear combination of results terms and URLs. Here, three pairs of values of (ω k, ω l ), namely, (.0, 0.0), (0.5, 0.5) and (0.0,.0) were used to represent the varying levels of contribution of results terms and URLs. In order to manage the number of experiments to be carried out to facilitate easier comparison, the following technique was used. We first set ω k =.0 and ω l = 0.0 to observe the results-based method using results URLs alone. These similarity measures are denoted with a subscript basic-link. Using this we had 5 sets of experiments - sim_content, sim_result basic-link, sim_hybrid basic-link, sim_hybrid basic-link and sim_hybrid3 basic-link through varying (α,β) as mentioned above. These measures are the same ones defined in [5]. From these, query clustering experiments were conducted using the extended K-Means algorithm at two different similarity thresholds (0.5 and 0.9) to determine the threshold that will provide better query clustering results in terms of the F-measure. That threshold would then be used for the remainder of the experiments. Given the selected similarity threshold, we next set ω k = ω l =0.5 so that the results-based method considered both the result URLs and terms equally during clustering. These similarity measures are denoted by the subscript extended. This gave rise to 4 sets of experiements- sim_result extended, sim_hybrid extended, sim_hybrid extended, sim_hybrid3 extended for varying (α,β). Finally, we considered result terms alone by setting ω k =0.0 and ω l =.0. These similarity measures are denoted by the subscript basic-term and we again had 4 similarity measures- sim_result basic-term, sim_hybrid basicterm, sim_hybrid basic-, sim_hybrid basic-term for varying (α,β). ote that the content-based similarity measure remains unchanged in the hybrid equation but the resultsbased component has been changed by the three combinations of results URLs and terms. Overall, a total of 3 separate experiments were run. 4. Comparison Criteria The quality of the query clusters generated from the 3 aforementioned approaches was compared using average cluster size, coverage, precision, recall and F-measures similar to [5] and others. Average cluster size is the average of all the cluster sizes. Coverage is the percentage of queries for which the similarity measure is able to provide a cluster. Precision is the ratio of the number of similar queries to the total number of queries in a cluster. Recall is the ratio of the number of similar queries to the total number of all similar queries in the dataset (those in the current cluster and others). It is difficult to calculate recall directly because no standard clusters are available in our dataset. Therefore, an alternative normalized recall [5] was used instead (explained 5

6 in the next section). For measuring precision and recall, 00 sample clusters were selected. Finally, F-measure is used to generate a single measure by combining recall and precision. 5. Findings and Analyses Recall that two similarity thresholds 0.5 and 0.9 were used for the basic-link approach to determine which threshold provided the best result in terms of F-measure. It was found that the threshold 0.5 was the consistent best performer (Table ) and hence this value was used in our experiments. Average cluster size indicates the ability of the clustering algorithm to recommend queries for a user s query. Figure (a) shows a plot of the average cluster size against different values of α and β (see Equation 8). Average cluster size exhibits an increasing trend along with the increase in β, the weight for the content-based component. The average cluster size of sim_content at similarity threshold 0.5 ranks highest at 6.93 indicating that users have a higher likelihood of finding more queries from a given query using the content-based approach. This is because different queries share a number of identical terms. Conversely, the average cluster size of sim_result basic-link is the smallest (.8) meaning that users will have the fewest number of recommended queries per issued query. This may be due to the fact that the number of distinct URLs is usually large and thus many similar queries cannot be clustered together for lack of common URLs []. Figure (a) also shows that the basic-term approaches have consistently higher average cluster sizes than that of other approaches. Thus when result terms were used, these similarity measures provided more recommended queries than those without using results terms. Coverage [] indicates the probability that a user can obtain recommended queries from his/her issued query. The content-based approach, sim_content, ranks highest in coverage (8.40%) followed by the hybrid approaches sim_hybrid3 basic-term (78.55%) and sim_hybrid3 extended (76.99%), indicating that the content-based approach is better able to find similar queries for a given query than other approaches. The coverage of sim_result basic-link is the lowest (.5%) which means that users will be less likely to receive recommended queries from an issued query. This reason may be the same for average cluster size in that the number of distinct URLs is usually large and many similar queries cannot be clustered for lack of common URLs []. Figure (b) shows that like average cluster size, coverage exhibits an increasing trend along with the increase in β. In addition, the coverage of the basic-term approaches is consistently higher than that of the other approaches, suggesting the importance of the use of results terms. Hence, users have a higher likelihood of receiving recommendations from a given query by using result terms during clustering than not using them. basic-link (0.5) extended (0.5) basic-term (0.5) 8 basic-link (0.5) extended (0.5) basic-term (0.5) 7 Average Cluster Size α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybrid sim_hybrid sim_hybrid3 sim_content Coverage α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybridsim_hybridsim_hybrid3sim_content (a) (b) Fig.. (a) Average Cluster Size and (b) Coverage for different similarity measures In terms of precision, the hybrid approach, sim_hybrid basic-link, performs best (99.73%) followed by sim_hybrid extended (99.50%) and sim_result extended (99.00%). The worst performer was sim_content (65.9%) suggesting that the use of query terms alone are not sufficient for producing good quality clusters. Figure 3(a) shows that precision decreases with the increase in α while it increases with a corresponding increase in β, the weight for the results-based component. In general, sim_hybrid generally performs better and this is attributed to the fact that the search engine used in our experiments (Google) tends to return results URLs that are the 6

7 same for semantically related queries. It is also interesting to observe that the precision of the basic-term approaches are consistently lower than that of other approaches indicating that the use of results terms compromises precision. However, these differences tend to be small. 00% 80% basic-link (0.5) extended (0.5) basic-term (0.5) 00% 80% basic-link (0.9) basic-link (0.5) extended (0.5) basic-term (0.5) Precision 60% 40% Recall 60% 40% 0% 0% 0% α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybridsim_hybridsim_hybrid3sim_content (a) (b) Fig. 3. (a) Precision and (b) Recall for different similarity measures. As mentioned, the calculation of recall posed a problem as no standard query clusters were available. Therefore, the normalized recall [5] measure was used. It is defined as the ratio of the number of correctly clustered queries within the 00 selected clusters to the maximum number of correctly clustered queries across the dataset (that is, clusters from the 3 experiments). In our work, the maximum number of correctly clustered queries was 54, obtained by sim_hybrid3 basic-term. Thus, the normalized recall of sim_hybrid3 basic-term ranks highest at 00%, followed by sim_hybrid3 extended (96.03%) and sim_content (95%). This indicates that sim_hybrid3 basic-term is better able to uncover clusters of similar queries generated by different similarity functions on the given query set used in this experiment. At the similarity threshold of 0.5, the recall of sim_result basic-link is the lowest (4.03%) meaning that users have the lowest likelihood of finding more query clusters. This could be because the number of distinct URLs is usually large and similar queries cannot be clustered together for lack of common URLs [], thus lowering the ability to uncover clusters of similar queries. 0% α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybrid sim_hybrid sim_hybrid3 sim_content Figure 3(b) shows that sim_result extended has lowest normalized recall (4.35%). The figure also indicates that the recall measures resulting from the basic-term approaches are consistently higher than other approaches though the extended approaches outperform basic-link approaches. It is evident that the use of result terms alone is better able to uncover clusters of similar queries than using result links alone. This may explain why the use of result terms in all similarity functions improves the ability to uncover more clusters of similar queries. The reason behind this finding could be that as the number of common terms increases due to the contribution of result terms used, the chances of finding related queries increase. Table. F-Measure Comparison Similarity Function basic-link (threshold = 0.9) basic-link (0.5) basic-term (0.5) extended (0.5) sim_result sim_hybrid sim_hybrid sim_hybrid sim_content To measure the global quality of the clusters, the F-measure is used [5]. Table shows that the F-measures returned from basic-link experiments at threshold 0.5 were consistently better than those at threshold 0.9. Hence similarity threshold 0.5 was used in the other approaches to generate query clusters. It was observed that the use of result terms consistently improved the F-measure compared to the approaches that did not use them. The reasons for this are similar to those cited for precision and recall since the F-measure itself is the harmonic mean of precision and recall. 7

8 6. Discussion We have observed that the basic-term approach performs better than the other approaches in terms of all the performance criteria except precision. As there are five similarity functions used in the basic-term approach, the comparison between the qualities of results for each is presented in Table. Only F-measures are reported in place of recall and precision. As can be seen in the table, the average cluster sizes of the results-based approach sim_result basic-term and hybrid approaches sim_hybrid basic-term and sim_hybrid basic-term are low. However sim_hybrid3 basic-term and sim_content basic-term provide 5.7 and 6.93 queries per cluster respectively on average. Since users are already burdened with a number of search results, too many query suggestions to the original queries will not be of much benefit to users. Hence, the average of 5 to 7 suggestions per query is quite reasonable. Table 3 shows sample query clusters that resulted from sim_hybrid3 basic-term. Table. Summary of Evaluation Results Similarity Functions Average Range of (similarity threshold Cluster Size Cluster Size = 0.5) Coverage F-measure sim_result basic-term 3.04 ~7 7.89% 0.76 sim_hybrid basic-term.98 ~ % 0.77 sim_hybrid basic-term 3.60 ~ % 0.8 sim_hybrid3 basic-term 5.7 ~ % 0.95 sim_content basic-term 6.93 ~8 8.40% 0.78 Table 3. Two Query Clusters from sim_hybrid3 basic-term A wavelet tour of signal processing, wavelet processing, wavelet tool signal processing Strains stresses, handbook formulas stress strain, aluminum stress-strain curve, strain material, stress transformation, distribution curve, engineering stress, failure curve Moreover, sim_content basic-term provides the highest coverage at 8.40% while sim_hybrid3 basic-term ranks second highest at 78.55%.. Stated differently, a user is more likely to receive recommendations to his original query using sim_content basic-term and sim_hybrid3 basic-term. In terms of F-measure, sim_hybrid3 basic-term ranks highest at 0.95 which suggests that the recommended queries will be useful to the searcher. Also, sim_hybrid3 extended achieved the second highest score at 0.94, while sim_hybrid3 basic-link ranks third at 0.9 (refer to Table ). It is observed that although sim_content basic-term achieved the largest average cluster size (6.93) among other approaches and the highest coverage (8.40%), it has a lower F-measure (0.78) compared to sim_hybrid3 basic-term (0.95) (from Table ). Table 4. Comparison of Query Clustering Approaches Best Worst Avg Cluster Size sim_hybrid3 basic-term sim_result basic-link Coverage sim_content basic-term sim_result basic-link Precision sim_hybrid basic-link sim_content basic-term Recall sim_hybrid3 basic-term sim_result basic-link F-measure sim_hybrid3 basic-term sim_result basic-link It can thus be deduced that sim_hybrid3 basic-term provides the most balanced results for query clusters followed by sim_hybrid3 extended and sim_hybrid3 basic-link. Table 4 gives an alternative view by comparing the performances of all approaches at similarity threshold 0.5. Although there is no single best performer, the table shows that the combination of the content-based and results-based approach outperforms the use of either approach alone. Further, the hybrid clustering approach, sim_hybrid3 basic-term, yields the best performance in terms of average cluster size, recall and F-measure. 8

9 7. Conclusion Collaborative querying is inspired from research in information seeking behaviour, which shows that collaboration among searchers is important in the process of information seeking and use. In our work, we use query clustering to discover similar queries to assist in query formulation and reformulation. We compared the content-based approach, results-based approach and a hybrid approach of both measures to cluster similar queries from query logs using the extended K-means algorithm. Our experiments also explored whether the use of more features (terms from title and snippets from the search results lists), in addition to the query terms and results URL, produces better clusters. Results showed that the use of either content-based or results-based approaches alone was inadequate. Instead, the hybrid content and results-based approach (sim_hybrid3 basic-term ), considering query terms, title terms and snippet terms, produced better query clustering results than approaches that do not consider terms from titles and snippets. The reason could be due to the increased number of common terms between two queries extracted from titles and snippets. With this increased number, term weights play a more important role in finding related queries. Our proposed hybrid clustering approach is seen to perform better than other clustering algorithms. It is able to overcome the drawbacks of the results-based approach through the addition of additional information from result documents such as titles and snippets. Besides, it also takes into account the important information in the queries submitted by users by combining the content-based approach with the results-based approach. Our clustering algorithm will allow IR systems to identify high-quality clusters to help users reformulate queries to meet their information needs by sharing queries issued by other users. High quality query clusters can also be used in automatic query expansion. We propose to extend our work in the following directions. Firstly, phrases could be used in the calculation of query similarity since they are less sensitive to noise due to the lower probability of finding matching phrases in non-related queries [8]. Hence, this technique could increase precision. Further, lexical knowledge through resources such as Wordnet could be used to improve the recall of the clusters. Similar to [], the use of lexical knowledge can be used to create a query hierarchy representing broader or specialized queries and allow users to modify their queries by either broadening or narrowing their searches. Acknowledgments. This proect is partially supported by TU research grant number RG5/05. References. Chien, S. & Immorlica,. Semantic similarity between search engine queries using temporal correlation. WWW 005, (005) -.. Chuang, S.L. & Chien, L.F. Towards automatic generation of query taxonomy: a hierarchical query clustering approach. Proceedings of IEEE 00 International Conference on Data Mining, (00) Cui, H., Wen, J.R., ie, J.Y. & Ma, W.Y. Probabilistic query expansion using query logs. Proceedings of the eleventh international conference on World Wide Web, (00) Fonseca, B.M., Golgher, P.B., De Moura, E.S. & Ziviani,. Using association rules to discover search engines related queries. First Latin American Web Congress (LA-WEB 03), (003) Fu, L., Goh, D.H., Foo, S. & a, J.C. Collaborative querying through a hybrid query clustering approach. Digital libraries: Technology and management of indigenous knowledge for global access, ICADL 003, (003) Fu, L., Goh, D.H., Foo, S. & Supangat, Y. Collaborative querying for enhanced information retrieval. European conference on research and advanced technology for digital libraries (ECDL 004), (004) Furnas, G.W., Landauer, T.K., Gomez, L.M. & Dumais, S.T. The vocabulary problem in human-system communication. Communications of the ACM, 30(), (987) Huang, C.K., Oyang, Y.J. & Chien, L.F. A contextual term suggestion mechanism for interactive search. Proceedings of 00 Web Intelligence Conference (WI 00), (00) pp Jansen, B. J., Spink, A. & Saracevic, T. Real life, real users and real needs: A study and analysis of users queries on the Web. Information Processing and Management, 36(), (000) Lokman, I.M. & Stephanie, W.H. Information seeking behavior and use of social science faculty studying stateless nations: A case study. Journal of library and Information Science Research, 3(), (00) Matwin, S. & Scott, S. Text classification using wordnet hypernyms. Proceedings of the COLIG/ACL Workshop on Usage of Wordet in atural Language Processing Systems, COLIG-ACL'98, (998) Raghavan, V.V. & Sever, H. On the reuse of past optimal queries. Proceedings of the Eighteenth International ACM SIGIR Conference on Research and Development in Information Retrieval, (995)

10 3. Turney, P.D. Word sense disambiguation by web mining for word co-occurrence probabilities. Proceedings Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (SESEVAL-3), (004) Velardi, P. & avigli, R. An analysis of ontology-based query expansion strategies. Workshop on Adaptive Text Extraction and Mining at the 4th European Conference on Machine Learning, ECML 003, (003) Voorhees, E. Using wordnet to disambiguate word senses for text retrieval. Proceedings of the 6th annual international ACM SIGIR conference on Research and development in information retrieval, (993) Wen, J.R., ie, J.Y. & Zhang, H.J. Query clustering using user logs. ACM Transactions on Information Systems, 0 (), (00)

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval