Evaluating Clustering Techniques for Collaborative Querying

Size: px
Start display at page:

Download "Evaluating Clustering Techniques for Collaborative Querying"

Transcription

1 Evaluating clustering techniques for collaborative querying Ray, C.S., Goh, H.L.D., Foo, S., & Win,.C.S. (006). Proc. 006 International Conference on Hybrid Information Technology (ICHIT006), Cheu Island, Korea, ovember 9- Evaluating Clustering Techniques for Collaborative Querying Chandrani Sinha Ray, Dion Hoe-Lian Goh, Schubert Foo, yein Chan Soe Win Division of Information Studies, School of Communication and Information anyang Technological University, Singapore {rayc000, ashlgoh, assfoo, Abstract. Collaborative querying is a technique which makes use of past users search experiences in order to help the current user formulate an appropriate query. In this technique, related queries are extracted from query logs and clustered. Queries from these clusters that are related to the user s query are then recommended. This work compares a range of similarity measures and features (e.g. titles, URLs, snippets) used for clustering, and also proposes an extended K-means clustering algorithm. Experimental results reveal that the best clusters are obtained by using a combination of these sources rather than using only query terms or only result URLs alone. Keywords: Query Clustering, Information Retrieval, Collaborative Querying Introduction Web search engines play a vital role in retrieving information from the Internet, but there are several challenges faced by users of information retrieval (IR) systems in general. Users are often overwhelmed by the number of documents returned by an IR system or fail to express their information needs in terms compatible to those used in the system. While new IR techniques can potentially alleviate these problems, specifying information needs properly in the form of a query is still a fundamental problem with users. [] attributed this to the lack of conceptual knowledge of the information retrieval process. Put differently, users need to know how to translate their information needs into searchable queries. Research in information seeking behavior suggests an alternative approach in helping users meet their information needs. Studies have found that collaboration among searchers is an important step in the process of information seeking and use. For example [3] highlights the importance of the interaction between the inquirer and the librarian, while [4] argues that communication with colleagues is a key component in the initial search for information. We use the term collaborative IR to refer to a family of techniques that facilitate collaboration among users while conducting searches. These techniques can be categorized into: collaborative filtering, collaborative browsing and collaborative querying []. In particular, collaborative querying helps searchers in query formulation by harnessing other users search experience or expert knowledge [7]. Queries, being expressions of information needs, have the potential to provide a wealth of information that could be used to guide other searchers with similar information needs, helping them with query reformulation. If large quantities of queries are collected, they may be mined for emerging patterns and relationships that could be used to define communities of interest (e.g. Lycos 50), and even help users explore not only their current information needs but related ones as well. One can imagine a network of queries, amassed from the collective expertise of numerous searchers representing what people think as helping them meet their information needs. Query formulation and reformulation will then be a matter of exploring the query network, executing relevant queries, and retrieving the resulting documents.

2 There are many techniques for uncovering related queries to support collaborative querying, one of which is query clustering [5]. Here, query logs of IR systems are mined, and similar queries are grouped and used as recommendations. A key issue in query clustering is the measure of the similarity between queries. Traditional term-based similarity measures borrowed from classical IR [] is one example. An alternative is based on the overlap of the top results (such as URLs) returned from an IR system [6, 9]. Further, a hybrid approach was proposed by [5], which is a linear combination of the content-based and results-based approaches. Experiments showed that this approach produces better query clusters in terms of precision, recall and coverage, than using either the content-based or results-based method alone. Although these results show potential, there is still room for improvement by using additional information for measuring similarity. Examples include titles, snippets (document excerpts) or other metadata in the search results listings. Such information is commonly returned by IR systems and is used by searchers to udge whether the results satisfy their information needs. These can therefore be used to complement query terms and result URLs as criteria to measure the similarity between queries. Thus, in this paper, we try to improve on existing work in query clustering by [5] in two ways. First, the type of features used for calculating similarity between queries is expanded to include titles and snippets from search results documents in addition to query terms and results URLs. Secondly, a new algorithm, a variation of K- means, is proposed for clustering queries. The results of this work are then compared with previous work.. Related Work Query logs maintained by IR systems contain information related to users queries including search sessions, query terms submitted, documents selected, and so on. Hence, query logs provide valuable indications for understanding the types of documents users intend to retrieve when formulating a query [3]. One of the common techniques for collaborative querying through the reuse of previously submitted queries obtained from query logs is via query clustering [5]. Queries within the clusters are then used as candidates for query reformulation. For example, [7] developed a collaborative IR system that recommended related queries extracted and clustered from query logs. Both query terms and results returned were utilized in their hybrid query clustering approach. Query clustering approaches can be categorized into content-based, results-based, feedback-based, and a hybrid of these approaches, depending on the type of information used to determine the similarity between queries. The content-based approach borrows from classical term-based techniques of traditional IR. Here, features are constructed from the query keywords. A maor drawback is that on an average queries submitted to a web search engine are very short [], comprising -3 words in contrast to documents which are rich in content. Since the information content in queries is sparse [], it is difficult to deduce the semantics behind queries which in turn makes it difficult to establish similarity between queries [5]. An alternative is to use clickthrough data (feedback approach) which uses documents selected by users returned in response to a query. The idea behind this approach is that two queries are similar if they lead to the same selection of documents by searchers [3]. Work by [5] demonstrated that a combination of the contentbased and feedback-based approaches performed better in terms of the F-measure than the individual approaches. The disadvantage however is that the quality of query clusters may suffer if users select too many irrelevant documents. In results-based approach, documents returned in response to a query are used to measure similarity between queries. Here, document identifiers (e.g. URLs) as well as other metadata such as the document title, snippet or even the entire document are used. The latter was used by [9] to determine the similarity between queries by calculating the overlap in document content but this method is time consuming to execute. [6] therefore determined query similarity by the amount of overlap in their respective top-50 search results listings. Finally, [0] uses the results documents containing the query terms to extract other terms which often occur in context with the original query terms and these are used to establish similarity between queries. Our work differs from previous work in that it uses not only the content of queries and results URLs, but also the titles and snippets from search results documents since they are considered good descriptors of a query [6]. Also, since the entire result document is not used as feature set, it reduces the processing time, which is

3 important in a web search scenario. Moreover, our work uses the extended K-means algorithm [4] to define the similarity between queries rather than a simple measure of overlap, as in [5]. The K-means algorithm is popular for its speed and simplicity and is effective in computing the distances between obects in very large data sets [] which makes it appropriate for our dataset. The need for specifying the initial k centroids in advance which is the greatest disadvantage of K-means is overcome by using the extended K-means algorithm [4] described in section Query Clustering Algorithm Let W denote a collection of queries to be clustered. We represent each query Q i in W with composite unit vectors: Q i = (D i, E i, F i ) () where Q i is the composite unit vector representing i-th query in the query set W, i n (n is the total number of queries in the dataset) ; D i is the component unit vector representing keywords in Q i ; E i is the component unit vector representing the terms from the snippets and the title of the documents returned in response to Q i ; F i is the component unit vector representing the results URLs returned in response to Q i.. 3. The Content-based Similarity Measure A query is considered a document and query terms are vectors representing the query []. All terms are extracted after which stop words are removed and the remainder are stemmed. Terms that appear in less than two queries are eliminated, as they do not contribute in defining the relationship between two or more queries. The extracted terms are used to represent each query (Q i ) in the query set W as the content-based component vector (D i ) in Equation (). Specifically, D i is a weight vector obtained using the standard TFIDF formula: wnq = tfnq * idfn idfn = log dfn () where w nqi is the weight of the n-th term in query Q ; tf nqi is the query term frequency; is the number of queries in the query set W ; df n is the number of queries containing the n-th term ; idf n is the inverted term frequency for the n-th term. The content-based similarity between query Q i and Q is then measured using the cosine similarity function as defined in Equation (3): sim _ content( Q, Q ) = i n= w nqi w wnqi n= n= where w nqi and w nq are weights of the common n-th term in query Q i and Q respectively obtained using Equation (). nq w nq (3) 3. The Results-based Similarity Measure As discussed, limitations exist when using only query terms for clustering. [6] notes that the results lists returned in response to queries are readily accessible sources of rich information for measuring similarity. In our work, the results URLs as well as results terms (titles plus snippets) are used to represent each query as a results-based component vector (E) in Equation (). For the results URLs component of E, let U(Q ) be a set of query results URLs to a query Q, U(Q ) = {u, u,, u n } where u i represents the i-th result URL for Q. We then define R i as the overlap between two results URLs vectors between two queries Qi and Q: R = u : u U ( Q ) U ( Q )} (4) i { i 3

4 The similarity measure based on results URLs (adopted from [6]) is defined as: Ri (5) sim _ resultsurls ( Qi, Q ) = Max( Q, Q ) i where Qi and Q represent the number of results URLs in queries Qi and Q respectively ; R i is the number of common results URLs between Qi and Q. For the results terms component of E, we employ the cosine similarity measure as presented in Equation (3): sim_ results terms ( Q, Q i wnqi wnq n= ) = wnqi wnq n= n= where w nqi and w nq are weights of the common n-th term in query Q i and Q extracted from the results titles and snippets, using the TFIDF formula in Equation (). Finally, the complete results-based similarity measure is a linear combination of both URLs and terms and is defined as: sim _ results ( Qi, Q ) = ω k * sim _ results URLs + ω l * sim _ results terns (7) (6) where ω k is the weight assigned to sim_results URLs ; ω l is the weight assigned to sim_results terms ; ω k and ω l are non negative numbers such that ω k + ω l =. 3.3 The Hybrid Similarity Measure Similar to [5], the hybrid approach adopted in the present work is a linear combination of the content-based and results-based approaches, and is expected to overcome the disadvantages of using either approach alone: sim _ hybrid( Q, Q ) = α * sim _ results + β * sim _ content where α and β are non-negative weights and α + β =. i (8) 3.4 The Extended K-Means Algorithm. Define a similarity threshold, t where 0 t.. Randomly select a query as a cluster centroid. 3. Assign the remaining queries to the existing cluster if the similarity between the new query and the cluster centroid exceeds the defined threshold. 4. The query becomes a cluster by itself if condition in step 3 is not satisfied. 5. The centroids of the clusters are recomputed if its cluster members are changed. 6. Repeat steps 3 to 5 until all queries are assigned and cluster centroids do not change any more. Fig.. K-Means Clustering Algorithm Figure summarizes the extended K-means algorithm [4], an iterative partitioning algorithm which tries to overcome some of the limitations of the standard K-means. It uses similarity thresholds instead of pre-defined k centroids. This is the maor advantage as the number of query clusters are not known in advance. 4

5 4. Methodology Six months worth of query logs obtained from the anyang Technological University (TU, Singapore) digital library was used in this work. The queries belong to various domains including engineering disciplines, business and the arts. Each entry in a query log contains a timestamp indicating when the query was submitted, the submitted query, session ID, and the number of records returned. In the preprocessing stage, only the queries from the logs were extracted as the remainder of the content did not contribute to the query clustering technique used in this work. From the logs, approximately,700 queries were randomly sampled for use. Each query was then submitted to the Google search engine using the Google Web API. From the ensuing results listing, the top 0 URLs, titles and snippets were extracted for use in the results-based component of the hybrid similarity measure. 4. Computing Similarity Recall from Equation (8) that hybrid similarity is a linear combination of the content and results-based similarity. In our experiments, five pairs of values for (α,β) were used to vary the levels of contribution of each of these similarity measures: (0.0,.0), (.0, 0.0), (0.75, 0.5), (0.50, 0.50) and (0.5, 0.75) respectively for the content-based (sim_content), results-based (sim_result), hybrid (sim_hybrid), hybrid (sim_hybrid) and hybrid3 (sim_hybrid3) similarity measures. ote also that the results-based approach, defined in Equation (7), is a linear combination of results terms and URLs. Here, three pairs of values of (ω k, ω l ), namely, (.0, 0.0), (0.5, 0.5) and (0.0,.0) were used to represent the varying levels of contribution of results terms and URLs. In order to manage the number of experiments to be carried out to facilitate easier comparison, the following technique was used. We first set ω k =.0 and ω l = 0.0 to observe the results-based method using results URLs alone. These similarity measures are denoted with a subscript basic-link. Using this we had 5 sets of experiments - sim_content, sim_result basic-link, sim_hybrid basic-link, sim_hybrid basic-link and sim_hybrid3 basic-link through varying (α,β) as mentioned above. These measures are the same ones defined in [5]. From these, query clustering experiments were conducted using the extended K-Means algorithm at two different similarity thresholds (0.5 and 0.9) to determine the threshold that will provide better query clustering results in terms of the F-measure. That threshold would then be used for the remainder of the experiments. Given the selected similarity threshold, we next set ω k = ω l =0.5 so that the results-based method considered both the result URLs and terms equally during clustering. These similarity measures are denoted by the subscript extended. This gave rise to 4 sets of experiements- sim_result extended, sim_hybrid extended, sim_hybrid extended, sim_hybrid3 extended for varying (α,β). Finally, we considered result terms alone by setting ω k =0.0 and ω l =.0. These similarity measures are denoted by the subscript basic-term and we again had 4 similarity measures- sim_result basic-term, sim_hybrid basicterm, sim_hybrid basic-, sim_hybrid basic-term for varying (α,β). ote that the content-based similarity measure remains unchanged in the hybrid equation but the resultsbased component has been changed by the three combinations of results URLs and terms. Overall, a total of 3 separate experiments were run. 4. Comparison Criteria The quality of the query clusters generated from the 3 aforementioned approaches was compared using average cluster size, coverage, precision, recall and F-measures similar to [5] and others. Average cluster size is the average of all the cluster sizes. Coverage is the percentage of queries for which the similarity measure is able to provide a cluster. Precision is the ratio of the number of similar queries to the total number of queries in a cluster. Recall is the ratio of the number of similar queries to the total number of all similar queries in the dataset (those in the current cluster and others). It is difficult to calculate recall directly because no standard clusters are available in our dataset. Therefore, an alternative normalized recall [5] was used instead (explained 5

6 in the next section). For measuring precision and recall, 00 sample clusters were selected. Finally, F-measure is used to generate a single measure by combining recall and precision. 5. Findings and Analyses Recall that two similarity thresholds 0.5 and 0.9 were used for the basic-link approach to determine which threshold provided the best result in terms of F-measure. It was found that the threshold 0.5 was the consistent best performer (Table ) and hence this value was used in our experiments. Average cluster size indicates the ability of the clustering algorithm to recommend queries for a user s query. Figure (a) shows a plot of the average cluster size against different values of α and β (see Equation 8). Average cluster size exhibits an increasing trend along with the increase in β, the weight for the content-based component. The average cluster size of sim_content at similarity threshold 0.5 ranks highest at 6.93 indicating that users have a higher likelihood of finding more queries from a given query using the content-based approach. This is because different queries share a number of identical terms. Conversely, the average cluster size of sim_result basic-link is the smallest (.8) meaning that users will have the fewest number of recommended queries per issued query. This may be due to the fact that the number of distinct URLs is usually large and thus many similar queries cannot be clustered together for lack of common URLs []. Figure (a) also shows that the basic-term approaches have consistently higher average cluster sizes than that of other approaches. Thus when result terms were used, these similarity measures provided more recommended queries than those without using results terms. Coverage [] indicates the probability that a user can obtain recommended queries from his/her issued query. The content-based approach, sim_content, ranks highest in coverage (8.40%) followed by the hybrid approaches sim_hybrid3 basic-term (78.55%) and sim_hybrid3 extended (76.99%), indicating that the content-based approach is better able to find similar queries for a given query than other approaches. The coverage of sim_result basic-link is the lowest (.5%) which means that users will be less likely to receive recommended queries from an issued query. This reason may be the same for average cluster size in that the number of distinct URLs is usually large and many similar queries cannot be clustered for lack of common URLs []. Figure (b) shows that like average cluster size, coverage exhibits an increasing trend along with the increase in β. In addition, the coverage of the basic-term approaches is consistently higher than that of the other approaches, suggesting the importance of the use of results terms. Hence, users have a higher likelihood of receiving recommendations from a given query by using result terms during clustering than not using them. basic-link (0.5) extended (0.5) basic-term (0.5) 8 basic-link (0.5) extended (0.5) basic-term (0.5) 7 Average Cluster Size α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybrid sim_hybrid sim_hybrid3 sim_content Coverage α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybridsim_hybridsim_hybrid3sim_content (a) (b) Fig.. (a) Average Cluster Size and (b) Coverage for different similarity measures In terms of precision, the hybrid approach, sim_hybrid basic-link, performs best (99.73%) followed by sim_hybrid extended (99.50%) and sim_result extended (99.00%). The worst performer was sim_content (65.9%) suggesting that the use of query terms alone are not sufficient for producing good quality clusters. Figure 3(a) shows that precision decreases with the increase in α while it increases with a corresponding increase in β, the weight for the results-based component. In general, sim_hybrid generally performs better and this is attributed to the fact that the search engine used in our experiments (Google) tends to return results URLs that are the 6

7 same for semantically related queries. It is also interesting to observe that the precision of the basic-term approaches are consistently lower than that of other approaches indicating that the use of results terms compromises precision. However, these differences tend to be small. 00% 80% basic-link (0.5) extended (0.5) basic-term (0.5) 00% 80% basic-link (0.9) basic-link (0.5) extended (0.5) basic-term (0.5) Precision 60% 40% Recall 60% 40% 0% 0% 0% α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybridsim_hybridsim_hybrid3sim_content (a) (b) Fig. 3. (a) Precision and (b) Recall for different similarity measures. As mentioned, the calculation of recall posed a problem as no standard query clusters were available. Therefore, the normalized recall [5] measure was used. It is defined as the ratio of the number of correctly clustered queries within the 00 selected clusters to the maximum number of correctly clustered queries across the dataset (that is, clusters from the 3 experiments). In our work, the maximum number of correctly clustered queries was 54, obtained by sim_hybrid3 basic-term. Thus, the normalized recall of sim_hybrid3 basic-term ranks highest at 00%, followed by sim_hybrid3 extended (96.03%) and sim_content (95%). This indicates that sim_hybrid3 basic-term is better able to uncover clusters of similar queries generated by different similarity functions on the given query set used in this experiment. At the similarity threshold of 0.5, the recall of sim_result basic-link is the lowest (4.03%) meaning that users have the lowest likelihood of finding more query clusters. This could be because the number of distinct URLs is usually large and similar queries cannot be clustered together for lack of common URLs [], thus lowering the ability to uncover clusters of similar queries. 0% α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybrid sim_hybrid sim_hybrid3 sim_content Figure 3(b) shows that sim_result extended has lowest normalized recall (4.35%). The figure also indicates that the recall measures resulting from the basic-term approaches are consistently higher than other approaches though the extended approaches outperform basic-link approaches. It is evident that the use of result terms alone is better able to uncover clusters of similar queries than using result links alone. This may explain why the use of result terms in all similarity functions improves the ability to uncover more clusters of similar queries. The reason behind this finding could be that as the number of common terms increases due to the contribution of result terms used, the chances of finding related queries increase. Table. F-Measure Comparison Similarity Function basic-link (threshold = 0.9) basic-link (0.5) basic-term (0.5) extended (0.5) sim_result sim_hybrid sim_hybrid sim_hybrid sim_content To measure the global quality of the clusters, the F-measure is used [5]. Table shows that the F-measures returned from basic-link experiments at threshold 0.5 were consistently better than those at threshold 0.9. Hence similarity threshold 0.5 was used in the other approaches to generate query clusters. It was observed that the use of result terms consistently improved the F-measure compared to the approaches that did not use them. The reasons for this are similar to those cited for precision and recall since the F-measure itself is the harmonic mean of precision and recall. 7

8 6. Discussion We have observed that the basic-term approach performs better than the other approaches in terms of all the performance criteria except precision. As there are five similarity functions used in the basic-term approach, the comparison between the qualities of results for each is presented in Table. Only F-measures are reported in place of recall and precision. As can be seen in the table, the average cluster sizes of the results-based approach sim_result basic-term and hybrid approaches sim_hybrid basic-term and sim_hybrid basic-term are low. However sim_hybrid3 basic-term and sim_content basic-term provide 5.7 and 6.93 queries per cluster respectively on average. Since users are already burdened with a number of search results, too many query suggestions to the original queries will not be of much benefit to users. Hence, the average of 5 to 7 suggestions per query is quite reasonable. Table 3 shows sample query clusters that resulted from sim_hybrid3 basic-term. Table. Summary of Evaluation Results Similarity Functions Average Range of (similarity threshold Cluster Size Cluster Size = 0.5) Coverage F-measure sim_result basic-term 3.04 ~7 7.89% 0.76 sim_hybrid basic-term.98 ~ % 0.77 sim_hybrid basic-term 3.60 ~ % 0.8 sim_hybrid3 basic-term 5.7 ~ % 0.95 sim_content basic-term 6.93 ~8 8.40% 0.78 Table 3. Two Query Clusters from sim_hybrid3 basic-term A wavelet tour of signal processing, wavelet processing, wavelet tool signal processing Strains stresses, handbook formulas stress strain, aluminum stress-strain curve, strain material, stress transformation, distribution curve, engineering stress, failure curve Moreover, sim_content basic-term provides the highest coverage at 8.40% while sim_hybrid3 basic-term ranks second highest at 78.55%.. Stated differently, a user is more likely to receive recommendations to his original query using sim_content basic-term and sim_hybrid3 basic-term. In terms of F-measure, sim_hybrid3 basic-term ranks highest at 0.95 which suggests that the recommended queries will be useful to the searcher. Also, sim_hybrid3 extended achieved the second highest score at 0.94, while sim_hybrid3 basic-link ranks third at 0.9 (refer to Table ). It is observed that although sim_content basic-term achieved the largest average cluster size (6.93) among other approaches and the highest coverage (8.40%), it has a lower F-measure (0.78) compared to sim_hybrid3 basic-term (0.95) (from Table ). Table 4. Comparison of Query Clustering Approaches Best Worst Avg Cluster Size sim_hybrid3 basic-term sim_result basic-link Coverage sim_content basic-term sim_result basic-link Precision sim_hybrid basic-link sim_content basic-term Recall sim_hybrid3 basic-term sim_result basic-link F-measure sim_hybrid3 basic-term sim_result basic-link It can thus be deduced that sim_hybrid3 basic-term provides the most balanced results for query clusters followed by sim_hybrid3 extended and sim_hybrid3 basic-link. Table 4 gives an alternative view by comparing the performances of all approaches at similarity threshold 0.5. Although there is no single best performer, the table shows that the combination of the content-based and results-based approach outperforms the use of either approach alone. Further, the hybrid clustering approach, sim_hybrid3 basic-term, yields the best performance in terms of average cluster size, recall and F-measure. 8

9 7. Conclusion Collaborative querying is inspired from research in information seeking behaviour, which shows that collaboration among searchers is important in the process of information seeking and use. In our work, we use query clustering to discover similar queries to assist in query formulation and reformulation. We compared the content-based approach, results-based approach and a hybrid approach of both measures to cluster similar queries from query logs using the extended K-means algorithm. Our experiments also explored whether the use of more features (terms from title and snippets from the search results lists), in addition to the query terms and results URL, produces better clusters. Results showed that the use of either content-based or results-based approaches alone was inadequate. Instead, the hybrid content and results-based approach (sim_hybrid3 basic-term ), considering query terms, title terms and snippet terms, produced better query clustering results than approaches that do not consider terms from titles and snippets. The reason could be due to the increased number of common terms between two queries extracted from titles and snippets. With this increased number, term weights play a more important role in finding related queries. Our proposed hybrid clustering approach is seen to perform better than other clustering algorithms. It is able to overcome the drawbacks of the results-based approach through the addition of additional information from result documents such as titles and snippets. Besides, it also takes into account the important information in the queries submitted by users by combining the content-based approach with the results-based approach. Our clustering algorithm will allow IR systems to identify high-quality clusters to help users reformulate queries to meet their information needs by sharing queries issued by other users. High quality query clusters can also be used in automatic query expansion. We propose to extend our work in the following directions. Firstly, phrases could be used in the calculation of query similarity since they are less sensitive to noise due to the lower probability of finding matching phrases in non-related queries [8]. Hence, this technique could increase precision. Further, lexical knowledge through resources such as Wordnet could be used to improve the recall of the clusters. Similar to [], the use of lexical knowledge can be used to create a query hierarchy representing broader or specialized queries and allow users to modify their queries by either broadening or narrowing their searches. Acknowledgments. This proect is partially supported by TU research grant number RG5/05. References. Chien, S. & Immorlica,. Semantic similarity between search engine queries using temporal correlation. WWW 005, (005) -.. Chuang, S.L. & Chien, L.F. Towards automatic generation of query taxonomy: a hierarchical query clustering approach. Proceedings of IEEE 00 International Conference on Data Mining, (00) Cui, H., Wen, J.R., ie, J.Y. & Ma, W.Y. Probabilistic query expansion using query logs. Proceedings of the eleventh international conference on World Wide Web, (00) Fonseca, B.M., Golgher, P.B., De Moura, E.S. & Ziviani,. Using association rules to discover search engines related queries. First Latin American Web Congress (LA-WEB 03), (003) Fu, L., Goh, D.H., Foo, S. & a, J.C. Collaborative querying through a hybrid query clustering approach. Digital libraries: Technology and management of indigenous knowledge for global access, ICADL 003, (003) Fu, L., Goh, D.H., Foo, S. & Supangat, Y. Collaborative querying for enhanced information retrieval. European conference on research and advanced technology for digital libraries (ECDL 004), (004) Furnas, G.W., Landauer, T.K., Gomez, L.M. & Dumais, S.T. The vocabulary problem in human-system communication. Communications of the ACM, 30(), (987) Huang, C.K., Oyang, Y.J. & Chien, L.F. A contextual term suggestion mechanism for interactive search. Proceedings of 00 Web Intelligence Conference (WI 00), (00) pp Jansen, B. J., Spink, A. & Saracevic, T. Real life, real users and real needs: A study and analysis of users queries on the Web. Information Processing and Management, 36(), (000) Lokman, I.M. & Stephanie, W.H. Information seeking behavior and use of social science faculty studying stateless nations: A case study. Journal of library and Information Science Research, 3(), (00) Matwin, S. & Scott, S. Text classification using wordnet hypernyms. Proceedings of the COLIG/ACL Workshop on Usage of Wordet in atural Language Processing Systems, COLIG-ACL'98, (998) Raghavan, V.V. & Sever, H. On the reuse of past optimal queries. Proceedings of the Eighteenth International ACM SIGIR Conference on Research and Development in Information Retrieval, (995)

10 3. Turney, P.D. Word sense disambiguation by web mining for word co-occurrence probabilities. Proceedings Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (SESEVAL-3), (004) Velardi, P. & avigli, R. An analysis of ontology-based query expansion strategies. Workshop on Adaptive Text Extraction and Mining at the 4th European Conference on Machine Learning, ECML 003, (003) Voorhees, E. Using wordnet to disambiguate word senses for text retrieval. Proceedings of the 6th annual international ACM SIGIR conference on Research and development in information retrieval, (993) Wen, J.R., ie, J.Y. & Zhang, H.J. Query clustering using user logs. ACM Transactions on Information Systems, 0 (), (00)

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title Query formulation with a search assistant. Author(s) Citation Fu, Lin.; Goh, Dion Hoe-Lian.; Foo, Schubert

More information

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces

A User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces A user study on features supporting subjective relevance for information retrieval interfaces Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S. (2006). Proc. 9th International Conference of Asian Digital Libraries

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title Collaborative querying using the Query Graph Visualizer( Main article ) Author(s) Goh, Dion Hoe-Lian;

More information

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines

Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines Appears in WWW 04 Workshop: Measuring Web Effectiveness: The User Perspective, New York, NY, May 18, 2004 Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines Anselm

More information

Collaborative querying using query graph visualizer Goh, H.L..D, Lin, F., & Foo, S (2005). Online Information Review, 29(3),

Collaborative querying using query graph visualizer Goh, H.L..D, Lin, F., & Foo, S (2005). Online Information Review, 29(3), Collaborative querying using query graph visualizer Goh, H.L..D, Lin, F., & Foo, S (2005). Online Information Review, 29(3), 266-282. Collaborative Querying using the Query Graph Visualizer Lin Fu, Dion

More information

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using

More information

Web Information Retrieval using WordNet

Web Information Retrieval using WordNet Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Patent Classification Using Ontology-Based Patent Network Analysis

Patent Classification Using Ontology-Based Patent Network Analysis Association for Information Systems AIS Electronic Library (AISeL) PACIS 2010 Proceedings Pacific Asia Conference on Information Systems (PACIS) 2010 Patent Classification Using Ontology-Based Patent Network

More information

Collaborative Filtering using Euclidean Distance in Recommendation Engine

Collaborative Filtering using Euclidean Distance in Recommendation Engine Indian Journal of Science and Technology, Vol 9(37), DOI: 10.17485/ijst/2016/v9i37/102074, October 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Collaborative Filtering using Euclidean Distance

More information

Repositorio Institucional de la Universidad Autónoma de Madrid.

Repositorio Institucional de la Universidad Autónoma de Madrid. Repositorio Institucional de la Universidad Autónoma de Madrid https://repositorio.uam.es Esta es la versión de autor de la comunicación de congreso publicada en: This is an author produced version of

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks

CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,

More information

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL

CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline Relevance Feedback and Query Reformulation Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price IR on the Internet, Spring 2010 1 Outline Query reformulation Sources of relevance

More information

Extraction of Web Image Information: Semantic or Visual Cues?

Extraction of Web Image Information: Semantic or Visual Cues? Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Query Phrase Expansion using Wikipedia for Patent Class Search

Query Phrase Expansion using Wikipedia for Patent Class Search Query Phrase Expansion using Wikipedia for Patent Class Search 1 Bashar Al-Shboul, Sung-Hyon Myaeng Korea Advanced Institute of Science and Technology (KAIST) December 19 th, 2011 AIRS 11, Dubai, UAE OUTLINE

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

NUS-I2R: Learning a Combined System for Entity Linking

NUS-I2R: Learning a Combined System for Entity Linking NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm

More information

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK 1 Mount Steffi Varish.C, 2 Guru Rama SenthilVel Abstract - Image Mining is a recent trended approach enveloped in

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Subjective Relevance: Implications on Interface Design for Information Retrieval Systems

Subjective Relevance: Implications on Interface Design for Information Retrieval Systems Subjective : Implications on interface design for information retrieval systems Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S (2005). Proc. 8th International Conference of Asian Digital Libraries (ICADL2005),

More information

Recommendation System for Location-based Social Network CS224W Project Report

Recommendation System for Location-based Social Network CS224W Project Report Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless

More information

Improving Difficult Queries by Leveraging Clusters in Term Graph

Improving Difficult Queries by Leveraging Clusters in Term Graph Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science

More information

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining

Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining 1 Vishakha D. Bhope, 2 Sachin N. Deshmukh 1,2 Department of Computer Science & Information Technology, Dr. BAM

More information

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision

Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Suruchi Chawla 1, Dr Punam Bedi 2 1 Department of Computer Science, University of Delhi, Delhi, INDIA

More information

Impact of Term Weighting Schemes on Document Clustering A Review

Impact of Term Weighting Schemes on Document Clustering A Review Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan

More information

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Query Refinement and Search Result Presentation

Query Refinement and Search Result Presentation Query Refinement and Search Result Presentation (Short) Queries & Information Needs A query can be a poor representation of the information need Short queries are often used in search engines due to the

More information

Search Costs vs. User Satisfaction on Mobile

Search Costs vs. User Satisfaction on Mobile Search Costs vs. User Satisfaction on Mobile Manisha Verma, Emine Yilmaz University College London mverma@cs.ucl.ac.uk, emine.yilmaz@ucl.ac.uk Abstract. Information seeking is an interactive process where

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval

CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,

More information

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Copyright 2011 please consult the authors

Copyright 2011 please consult the authors Alsaleh, Slah, Nayak, Richi, Xu, Yue, & Chen, Lin (2011) Improving matching process in social network using implicit and explicit user information. In: Proceedings of the Asia-Pacific Web Conference (APWeb

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

VIDEO SEARCHING AND BROWSING USING VIEWFINDER

VIDEO SEARCHING AND BROWSING USING VIEWFINDER VIDEO SEARCHING AND BROWSING USING VIEWFINDER By Dan E. Albertson Dr. Javed Mostafa John Fieber Ph. D. Student Associate Professor Ph. D. Candidate Information Science Information Science Information Science

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India

More information

Text clustering based on a divide and merge strategy

Text clustering based on a divide and merge strategy Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

More information

A probabilistic model to resolve diversity-accuracy challenge of recommendation systems

A probabilistic model to resolve diversity-accuracy challenge of recommendation systems A probabilistic model to resolve diversity-accuracy challenge of recommendation systems AMIN JAVARI MAHDI JALILI 1 Received: 17 Mar 2013 / Revised: 19 May 2014 / Accepted: 30 Jun 2014 Recommendation systems

More information

Web-page Indexing based on the Prioritize Ontology Terms

Web-page Indexing based on the Prioritize Ontology Terms Web-page Indexing based on the Prioritize Ontology Terms Sukanta Sinha 1, 4, Rana Dattagupta 2, Debajyoti Mukhopadhyay 3, 4 1 Tata Consultancy Services Ltd., Victoria Park Building, Salt Lake, Kolkata

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California

More information

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System

A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 1603-1 Kamitomioka-cho, Nagaoka-shi Niigata, 940-2188 JAPAN

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

An Attempt to Identify Weakest and Strongest Queries

An Attempt to Identify Weakest and Strongest Queries An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics

More information

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search 1 / 33 Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search Bernd Wittefeld Supervisor Markus Löckelt 20. July 2012 2 / 33 Teaser - Google Web History http://www.google.com/history

More information

Enhancing Cluster Quality by Using User Browsing Time

Enhancing Cluster Quality by Using User Browsing Time Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

DESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL

DESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL DESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL S. Ruban 1, Vanitha T 2. and S. Behin Sam 3 1 Department of Computer Science, Bharathiar University, Coimbatore,

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Content-based Dimensionality Reduction for Recommender Systems

Content-based Dimensionality Reduction for Recommender Systems Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ - 1 - ITERATIVE SEARCHING IN AN ONLINE DATABASE Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ 07962-1910 ABSTRACT An experiment examined how people use

More information

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC

Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily

More information

Search Engine Architecture II

Search Engine Architecture II Search Engine Architecture II Primary Goals of Search Engines Effectiveness (quality): to retrieve the most relevant set of documents for a query Process text and store text statistics to improve relevance

More information

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus Outline Evaluation of browsing behaviour and automated subject classification: examples from KnowLib Subject browsing Automated subject classification Koraljka Golub, Knowledge Discovery and Digital Library

More information

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING

A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING Jianhua Wang, Ruixu Li Computer Science Department, Yantai University, Yantai, Shandong, China Abstract: Key words: Document clustering methods

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Query Modifications Patterns During Web Searching

Query Modifications Patterns During Web Searching Bernard J. Jansen The Pennsylvania State University jjansen@ist.psu.edu Query Modifications Patterns During Web Searching Amanda Spink Queensland University of Technology ah.spink@qut.edu.au Bhuva Narayan

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

Contextual Search Using Ontology-Based User Profiles Susan Gauch EECS Department University of Kansas Lawrence, KS

Contextual Search Using Ontology-Based User Profiles Susan Gauch EECS Department University of Kansas Lawrence, KS Vishnu Challam Microsoft Corporation One Microsoft Way Redmond, WA 9802 vishnuc@microsoft.com Contextual Search Using Ontology-Based User s Susan Gauch EECS Department University of Kansas Lawrence, KS

More information

A Document-centered Approach to a Natural Language Music Search Engine

A Document-centered Approach to a Natural Language Music Search Engine A Document-centered Approach to a Natural Language Music Search Engine Peter Knees, Tim Pohle, Markus Schedl, Dominik Schnitzer, and Klaus Seyerlehner Dept. of Computational Perception, Johannes Kepler

More information

Personalized Information Retrieval by Using Adaptive User Profiling and Collaborative Filtering

Personalized Information Retrieval by Using Adaptive User Profiling and Collaborative Filtering Personalized Information Retrieval by Using Adaptive User Profiling and Collaborative Filtering Department of Computer Science & Engineering, Hanyang University {hcjeon,kimth}@cse.hanyang.ac.kr, jmchoi@hanyang.ac.kr

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

Inverted List Caching for Topical Index Shards

Inverted List Caching for Topical Index Shards Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed

More information

Applying Multi-Armed Bandit on top of content similarity recommendation engine

Applying Multi-Armed Bandit on top of content similarity recommendation engine Applying Multi-Armed Bandit on top of content similarity recommendation engine Andraž Hribernik andraz.hribernik@gmail.com Lorand Dali lorand.dali@zemanta.com Dejan Lavbič University of Ljubljana dejan.lavbic@fri.uni-lj.si

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

arxiv: v3 [cs.dl] 23 Sep 2017

arxiv: v3 [cs.dl] 23 Sep 2017 A Complete Year of User Retrieval Sessions in a Social Sciences Academic Search Engine Philipp Mayr, Ameni Kacem arxiv:1706.00816v3 [cs.dl] 23 Sep 2017 GESIS Leibniz Institute for the Social Sciences,

More information

A Miniature-Based Image Retrieval System

A Miniature-Based Image Retrieval System A Miniature-Based Image Retrieval System Md. Saiful Islam 1 and Md. Haider Ali 2 Institute of Information Technology 1, Dept. of Computer Science and Engineering 2, University of Dhaka 1, 2, Dhaka-1000,

More information

Term Frequency Normalisation Tuning for BM25 and DFR Models

Term Frequency Normalisation Tuning for BM25 and DFR Models Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

IRCE at the NTCIR-12 IMine-2 Task

IRCE at the NTCIR-12 IMine-2 Task IRCE at the NTCIR-12 IMine-2 Task Ximei Song University of Tsukuba songximei@slis.tsukuba.ac.jp Yuka Egusa National Institute for Educational Policy Research yuka@nier.go.jp Masao Takaku University of

More information

Automatically Generating Queries for Prior Art Search

Automatically Generating Queries for Prior Art Search Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation

More information

Academic Paper Recommendation Based on Heterogeneous Graph

Academic Paper Recommendation Based on Heterogeneous Graph Academic Paper Recommendation Based on Heterogeneous Graph Linlin Pan, Xinyu Dai, Shujian Huang, and Jiajun Chen National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023,

More information

Studying the Impact of Text Summarization on Contextual Advertising

Studying the Impact of Text Summarization on Contextual Advertising Studying the Impact of Text Summarization on Contextual Advertising G. Armano, A. Giuliani, and E. Vargiu Intelligent Agents and Soft-Computing Group Dept. of Electrical and Electronic Engineering University

More information