Evaluating Clustering Techniques for Collaborative Querying
|
|
- Brenda Willis
- 5 years ago
- Views:
Transcription
1 Evaluating clustering techniques for collaborative querying Ray, C.S., Goh, H.L.D., Foo, S., & Win,.C.S. (006). Proc. 006 International Conference on Hybrid Information Technology (ICHIT006), Cheu Island, Korea, ovember 9- Evaluating Clustering Techniques for Collaborative Querying Chandrani Sinha Ray, Dion Hoe-Lian Goh, Schubert Foo, yein Chan Soe Win Division of Information Studies, School of Communication and Information anyang Technological University, Singapore {rayc000, ashlgoh, assfoo, Abstract. Collaborative querying is a technique which makes use of past users search experiences in order to help the current user formulate an appropriate query. In this technique, related queries are extracted from query logs and clustered. Queries from these clusters that are related to the user s query are then recommended. This work compares a range of similarity measures and features (e.g. titles, URLs, snippets) used for clustering, and also proposes an extended K-means clustering algorithm. Experimental results reveal that the best clusters are obtained by using a combination of these sources rather than using only query terms or only result URLs alone. Keywords: Query Clustering, Information Retrieval, Collaborative Querying Introduction Web search engines play a vital role in retrieving information from the Internet, but there are several challenges faced by users of information retrieval (IR) systems in general. Users are often overwhelmed by the number of documents returned by an IR system or fail to express their information needs in terms compatible to those used in the system. While new IR techniques can potentially alleviate these problems, specifying information needs properly in the form of a query is still a fundamental problem with users. [] attributed this to the lack of conceptual knowledge of the information retrieval process. Put differently, users need to know how to translate their information needs into searchable queries. Research in information seeking behavior suggests an alternative approach in helping users meet their information needs. Studies have found that collaboration among searchers is an important step in the process of information seeking and use. For example [3] highlights the importance of the interaction between the inquirer and the librarian, while [4] argues that communication with colleagues is a key component in the initial search for information. We use the term collaborative IR to refer to a family of techniques that facilitate collaboration among users while conducting searches. These techniques can be categorized into: collaborative filtering, collaborative browsing and collaborative querying []. In particular, collaborative querying helps searchers in query formulation by harnessing other users search experience or expert knowledge [7]. Queries, being expressions of information needs, have the potential to provide a wealth of information that could be used to guide other searchers with similar information needs, helping them with query reformulation. If large quantities of queries are collected, they may be mined for emerging patterns and relationships that could be used to define communities of interest (e.g. Lycos 50), and even help users explore not only their current information needs but related ones as well. One can imagine a network of queries, amassed from the collective expertise of numerous searchers representing what people think as helping them meet their information needs. Query formulation and reformulation will then be a matter of exploring the query network, executing relevant queries, and retrieving the resulting documents.
2 There are many techniques for uncovering related queries to support collaborative querying, one of which is query clustering [5]. Here, query logs of IR systems are mined, and similar queries are grouped and used as recommendations. A key issue in query clustering is the measure of the similarity between queries. Traditional term-based similarity measures borrowed from classical IR [] is one example. An alternative is based on the overlap of the top results (such as URLs) returned from an IR system [6, 9]. Further, a hybrid approach was proposed by [5], which is a linear combination of the content-based and results-based approaches. Experiments showed that this approach produces better query clusters in terms of precision, recall and coverage, than using either the content-based or results-based method alone. Although these results show potential, there is still room for improvement by using additional information for measuring similarity. Examples include titles, snippets (document excerpts) or other metadata in the search results listings. Such information is commonly returned by IR systems and is used by searchers to udge whether the results satisfy their information needs. These can therefore be used to complement query terms and result URLs as criteria to measure the similarity between queries. Thus, in this paper, we try to improve on existing work in query clustering by [5] in two ways. First, the type of features used for calculating similarity between queries is expanded to include titles and snippets from search results documents in addition to query terms and results URLs. Secondly, a new algorithm, a variation of K- means, is proposed for clustering queries. The results of this work are then compared with previous work.. Related Work Query logs maintained by IR systems contain information related to users queries including search sessions, query terms submitted, documents selected, and so on. Hence, query logs provide valuable indications for understanding the types of documents users intend to retrieve when formulating a query [3]. One of the common techniques for collaborative querying through the reuse of previously submitted queries obtained from query logs is via query clustering [5]. Queries within the clusters are then used as candidates for query reformulation. For example, [7] developed a collaborative IR system that recommended related queries extracted and clustered from query logs. Both query terms and results returned were utilized in their hybrid query clustering approach. Query clustering approaches can be categorized into content-based, results-based, feedback-based, and a hybrid of these approaches, depending on the type of information used to determine the similarity between queries. The content-based approach borrows from classical term-based techniques of traditional IR. Here, features are constructed from the query keywords. A maor drawback is that on an average queries submitted to a web search engine are very short [], comprising -3 words in contrast to documents which are rich in content. Since the information content in queries is sparse [], it is difficult to deduce the semantics behind queries which in turn makes it difficult to establish similarity between queries [5]. An alternative is to use clickthrough data (feedback approach) which uses documents selected by users returned in response to a query. The idea behind this approach is that two queries are similar if they lead to the same selection of documents by searchers [3]. Work by [5] demonstrated that a combination of the contentbased and feedback-based approaches performed better in terms of the F-measure than the individual approaches. The disadvantage however is that the quality of query clusters may suffer if users select too many irrelevant documents. In results-based approach, documents returned in response to a query are used to measure similarity between queries. Here, document identifiers (e.g. URLs) as well as other metadata such as the document title, snippet or even the entire document are used. The latter was used by [9] to determine the similarity between queries by calculating the overlap in document content but this method is time consuming to execute. [6] therefore determined query similarity by the amount of overlap in their respective top-50 search results listings. Finally, [0] uses the results documents containing the query terms to extract other terms which often occur in context with the original query terms and these are used to establish similarity between queries. Our work differs from previous work in that it uses not only the content of queries and results URLs, but also the titles and snippets from search results documents since they are considered good descriptors of a query [6]. Also, since the entire result document is not used as feature set, it reduces the processing time, which is
3 important in a web search scenario. Moreover, our work uses the extended K-means algorithm [4] to define the similarity between queries rather than a simple measure of overlap, as in [5]. The K-means algorithm is popular for its speed and simplicity and is effective in computing the distances between obects in very large data sets [] which makes it appropriate for our dataset. The need for specifying the initial k centroids in advance which is the greatest disadvantage of K-means is overcome by using the extended K-means algorithm [4] described in section Query Clustering Algorithm Let W denote a collection of queries to be clustered. We represent each query Q i in W with composite unit vectors: Q i = (D i, E i, F i ) () where Q i is the composite unit vector representing i-th query in the query set W, i n (n is the total number of queries in the dataset) ; D i is the component unit vector representing keywords in Q i ; E i is the component unit vector representing the terms from the snippets and the title of the documents returned in response to Q i ; F i is the component unit vector representing the results URLs returned in response to Q i.. 3. The Content-based Similarity Measure A query is considered a document and query terms are vectors representing the query []. All terms are extracted after which stop words are removed and the remainder are stemmed. Terms that appear in less than two queries are eliminated, as they do not contribute in defining the relationship between two or more queries. The extracted terms are used to represent each query (Q i ) in the query set W as the content-based component vector (D i ) in Equation (). Specifically, D i is a weight vector obtained using the standard TFIDF formula: wnq = tfnq * idfn idfn = log dfn () where w nqi is the weight of the n-th term in query Q ; tf nqi is the query term frequency; is the number of queries in the query set W ; df n is the number of queries containing the n-th term ; idf n is the inverted term frequency for the n-th term. The content-based similarity between query Q i and Q is then measured using the cosine similarity function as defined in Equation (3): sim _ content( Q, Q ) = i n= w nqi w wnqi n= n= where w nqi and w nq are weights of the common n-th term in query Q i and Q respectively obtained using Equation (). nq w nq (3) 3. The Results-based Similarity Measure As discussed, limitations exist when using only query terms for clustering. [6] notes that the results lists returned in response to queries are readily accessible sources of rich information for measuring similarity. In our work, the results URLs as well as results terms (titles plus snippets) are used to represent each query as a results-based component vector (E) in Equation (). For the results URLs component of E, let U(Q ) be a set of query results URLs to a query Q, U(Q ) = {u, u,, u n } where u i represents the i-th result URL for Q. We then define R i as the overlap between two results URLs vectors between two queries Qi and Q: R = u : u U ( Q ) U ( Q )} (4) i { i 3
4 The similarity measure based on results URLs (adopted from [6]) is defined as: Ri (5) sim _ resultsurls ( Qi, Q ) = Max( Q, Q ) i where Qi and Q represent the number of results URLs in queries Qi and Q respectively ; R i is the number of common results URLs between Qi and Q. For the results terms component of E, we employ the cosine similarity measure as presented in Equation (3): sim_ results terms ( Q, Q i wnqi wnq n= ) = wnqi wnq n= n= where w nqi and w nq are weights of the common n-th term in query Q i and Q extracted from the results titles and snippets, using the TFIDF formula in Equation (). Finally, the complete results-based similarity measure is a linear combination of both URLs and terms and is defined as: sim _ results ( Qi, Q ) = ω k * sim _ results URLs + ω l * sim _ results terns (7) (6) where ω k is the weight assigned to sim_results URLs ; ω l is the weight assigned to sim_results terms ; ω k and ω l are non negative numbers such that ω k + ω l =. 3.3 The Hybrid Similarity Measure Similar to [5], the hybrid approach adopted in the present work is a linear combination of the content-based and results-based approaches, and is expected to overcome the disadvantages of using either approach alone: sim _ hybrid( Q, Q ) = α * sim _ results + β * sim _ content where α and β are non-negative weights and α + β =. i (8) 3.4 The Extended K-Means Algorithm. Define a similarity threshold, t where 0 t.. Randomly select a query as a cluster centroid. 3. Assign the remaining queries to the existing cluster if the similarity between the new query and the cluster centroid exceeds the defined threshold. 4. The query becomes a cluster by itself if condition in step 3 is not satisfied. 5. The centroids of the clusters are recomputed if its cluster members are changed. 6. Repeat steps 3 to 5 until all queries are assigned and cluster centroids do not change any more. Fig.. K-Means Clustering Algorithm Figure summarizes the extended K-means algorithm [4], an iterative partitioning algorithm which tries to overcome some of the limitations of the standard K-means. It uses similarity thresholds instead of pre-defined k centroids. This is the maor advantage as the number of query clusters are not known in advance. 4
5 4. Methodology Six months worth of query logs obtained from the anyang Technological University (TU, Singapore) digital library was used in this work. The queries belong to various domains including engineering disciplines, business and the arts. Each entry in a query log contains a timestamp indicating when the query was submitted, the submitted query, session ID, and the number of records returned. In the preprocessing stage, only the queries from the logs were extracted as the remainder of the content did not contribute to the query clustering technique used in this work. From the logs, approximately,700 queries were randomly sampled for use. Each query was then submitted to the Google search engine using the Google Web API. From the ensuing results listing, the top 0 URLs, titles and snippets were extracted for use in the results-based component of the hybrid similarity measure. 4. Computing Similarity Recall from Equation (8) that hybrid similarity is a linear combination of the content and results-based similarity. In our experiments, five pairs of values for (α,β) were used to vary the levels of contribution of each of these similarity measures: (0.0,.0), (.0, 0.0), (0.75, 0.5), (0.50, 0.50) and (0.5, 0.75) respectively for the content-based (sim_content), results-based (sim_result), hybrid (sim_hybrid), hybrid (sim_hybrid) and hybrid3 (sim_hybrid3) similarity measures. ote also that the results-based approach, defined in Equation (7), is a linear combination of results terms and URLs. Here, three pairs of values of (ω k, ω l ), namely, (.0, 0.0), (0.5, 0.5) and (0.0,.0) were used to represent the varying levels of contribution of results terms and URLs. In order to manage the number of experiments to be carried out to facilitate easier comparison, the following technique was used. We first set ω k =.0 and ω l = 0.0 to observe the results-based method using results URLs alone. These similarity measures are denoted with a subscript basic-link. Using this we had 5 sets of experiments - sim_content, sim_result basic-link, sim_hybrid basic-link, sim_hybrid basic-link and sim_hybrid3 basic-link through varying (α,β) as mentioned above. These measures are the same ones defined in [5]. From these, query clustering experiments were conducted using the extended K-Means algorithm at two different similarity thresholds (0.5 and 0.9) to determine the threshold that will provide better query clustering results in terms of the F-measure. That threshold would then be used for the remainder of the experiments. Given the selected similarity threshold, we next set ω k = ω l =0.5 so that the results-based method considered both the result URLs and terms equally during clustering. These similarity measures are denoted by the subscript extended. This gave rise to 4 sets of experiements- sim_result extended, sim_hybrid extended, sim_hybrid extended, sim_hybrid3 extended for varying (α,β). Finally, we considered result terms alone by setting ω k =0.0 and ω l =.0. These similarity measures are denoted by the subscript basic-term and we again had 4 similarity measures- sim_result basic-term, sim_hybrid basicterm, sim_hybrid basic-, sim_hybrid basic-term for varying (α,β). ote that the content-based similarity measure remains unchanged in the hybrid equation but the resultsbased component has been changed by the three combinations of results URLs and terms. Overall, a total of 3 separate experiments were run. 4. Comparison Criteria The quality of the query clusters generated from the 3 aforementioned approaches was compared using average cluster size, coverage, precision, recall and F-measures similar to [5] and others. Average cluster size is the average of all the cluster sizes. Coverage is the percentage of queries for which the similarity measure is able to provide a cluster. Precision is the ratio of the number of similar queries to the total number of queries in a cluster. Recall is the ratio of the number of similar queries to the total number of all similar queries in the dataset (those in the current cluster and others). It is difficult to calculate recall directly because no standard clusters are available in our dataset. Therefore, an alternative normalized recall [5] was used instead (explained 5
6 in the next section). For measuring precision and recall, 00 sample clusters were selected. Finally, F-measure is used to generate a single measure by combining recall and precision. 5. Findings and Analyses Recall that two similarity thresholds 0.5 and 0.9 were used for the basic-link approach to determine which threshold provided the best result in terms of F-measure. It was found that the threshold 0.5 was the consistent best performer (Table ) and hence this value was used in our experiments. Average cluster size indicates the ability of the clustering algorithm to recommend queries for a user s query. Figure (a) shows a plot of the average cluster size against different values of α and β (see Equation 8). Average cluster size exhibits an increasing trend along with the increase in β, the weight for the content-based component. The average cluster size of sim_content at similarity threshold 0.5 ranks highest at 6.93 indicating that users have a higher likelihood of finding more queries from a given query using the content-based approach. This is because different queries share a number of identical terms. Conversely, the average cluster size of sim_result basic-link is the smallest (.8) meaning that users will have the fewest number of recommended queries per issued query. This may be due to the fact that the number of distinct URLs is usually large and thus many similar queries cannot be clustered together for lack of common URLs []. Figure (a) also shows that the basic-term approaches have consistently higher average cluster sizes than that of other approaches. Thus when result terms were used, these similarity measures provided more recommended queries than those without using results terms. Coverage [] indicates the probability that a user can obtain recommended queries from his/her issued query. The content-based approach, sim_content, ranks highest in coverage (8.40%) followed by the hybrid approaches sim_hybrid3 basic-term (78.55%) and sim_hybrid3 extended (76.99%), indicating that the content-based approach is better able to find similar queries for a given query than other approaches. The coverage of sim_result basic-link is the lowest (.5%) which means that users will be less likely to receive recommended queries from an issued query. This reason may be the same for average cluster size in that the number of distinct URLs is usually large and many similar queries cannot be clustered for lack of common URLs []. Figure (b) shows that like average cluster size, coverage exhibits an increasing trend along with the increase in β. In addition, the coverage of the basic-term approaches is consistently higher than that of the other approaches, suggesting the importance of the use of results terms. Hence, users have a higher likelihood of receiving recommendations from a given query by using result terms during clustering than not using them. basic-link (0.5) extended (0.5) basic-term (0.5) 8 basic-link (0.5) extended (0.5) basic-term (0.5) 7 Average Cluster Size α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybrid sim_hybrid sim_hybrid3 sim_content Coverage α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybridsim_hybridsim_hybrid3sim_content (a) (b) Fig.. (a) Average Cluster Size and (b) Coverage for different similarity measures In terms of precision, the hybrid approach, sim_hybrid basic-link, performs best (99.73%) followed by sim_hybrid extended (99.50%) and sim_result extended (99.00%). The worst performer was sim_content (65.9%) suggesting that the use of query terms alone are not sufficient for producing good quality clusters. Figure 3(a) shows that precision decreases with the increase in α while it increases with a corresponding increase in β, the weight for the results-based component. In general, sim_hybrid generally performs better and this is attributed to the fact that the search engine used in our experiments (Google) tends to return results URLs that are the 6
7 same for semantically related queries. It is also interesting to observe that the precision of the basic-term approaches are consistently lower than that of other approaches indicating that the use of results terms compromises precision. However, these differences tend to be small. 00% 80% basic-link (0.5) extended (0.5) basic-term (0.5) 00% 80% basic-link (0.9) basic-link (0.5) extended (0.5) basic-term (0.5) Precision 60% 40% Recall 60% 40% 0% 0% 0% α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybridsim_hybridsim_hybrid3sim_content (a) (b) Fig. 3. (a) Precision and (b) Recall for different similarity measures. As mentioned, the calculation of recall posed a problem as no standard query clusters were available. Therefore, the normalized recall [5] measure was used. It is defined as the ratio of the number of correctly clustered queries within the 00 selected clusters to the maximum number of correctly clustered queries across the dataset (that is, clusters from the 3 experiments). In our work, the maximum number of correctly clustered queries was 54, obtained by sim_hybrid3 basic-term. Thus, the normalized recall of sim_hybrid3 basic-term ranks highest at 00%, followed by sim_hybrid3 extended (96.03%) and sim_content (95%). This indicates that sim_hybrid3 basic-term is better able to uncover clusters of similar queries generated by different similarity functions on the given query set used in this experiment. At the similarity threshold of 0.5, the recall of sim_result basic-link is the lowest (4.03%) meaning that users have the lowest likelihood of finding more query clusters. This could be because the number of distinct URLs is usually large and similar queries cannot be clustered together for lack of common URLs [], thus lowering the ability to uncover clusters of similar queries. 0% α=.0 α=0.75 α=0.5 α=0.5 α=0.0 β=0.0 β=0.5 β=0.5 β=0.75 β=.0 sim_result sim_hybrid sim_hybrid sim_hybrid3 sim_content Figure 3(b) shows that sim_result extended has lowest normalized recall (4.35%). The figure also indicates that the recall measures resulting from the basic-term approaches are consistently higher than other approaches though the extended approaches outperform basic-link approaches. It is evident that the use of result terms alone is better able to uncover clusters of similar queries than using result links alone. This may explain why the use of result terms in all similarity functions improves the ability to uncover more clusters of similar queries. The reason behind this finding could be that as the number of common terms increases due to the contribution of result terms used, the chances of finding related queries increase. Table. F-Measure Comparison Similarity Function basic-link (threshold = 0.9) basic-link (0.5) basic-term (0.5) extended (0.5) sim_result sim_hybrid sim_hybrid sim_hybrid sim_content To measure the global quality of the clusters, the F-measure is used [5]. Table shows that the F-measures returned from basic-link experiments at threshold 0.5 were consistently better than those at threshold 0.9. Hence similarity threshold 0.5 was used in the other approaches to generate query clusters. It was observed that the use of result terms consistently improved the F-measure compared to the approaches that did not use them. The reasons for this are similar to those cited for precision and recall since the F-measure itself is the harmonic mean of precision and recall. 7
8 6. Discussion We have observed that the basic-term approach performs better than the other approaches in terms of all the performance criteria except precision. As there are five similarity functions used in the basic-term approach, the comparison between the qualities of results for each is presented in Table. Only F-measures are reported in place of recall and precision. As can be seen in the table, the average cluster sizes of the results-based approach sim_result basic-term and hybrid approaches sim_hybrid basic-term and sim_hybrid basic-term are low. However sim_hybrid3 basic-term and sim_content basic-term provide 5.7 and 6.93 queries per cluster respectively on average. Since users are already burdened with a number of search results, too many query suggestions to the original queries will not be of much benefit to users. Hence, the average of 5 to 7 suggestions per query is quite reasonable. Table 3 shows sample query clusters that resulted from sim_hybrid3 basic-term. Table. Summary of Evaluation Results Similarity Functions Average Range of (similarity threshold Cluster Size Cluster Size = 0.5) Coverage F-measure sim_result basic-term 3.04 ~7 7.89% 0.76 sim_hybrid basic-term.98 ~ % 0.77 sim_hybrid basic-term 3.60 ~ % 0.8 sim_hybrid3 basic-term 5.7 ~ % 0.95 sim_content basic-term 6.93 ~8 8.40% 0.78 Table 3. Two Query Clusters from sim_hybrid3 basic-term A wavelet tour of signal processing, wavelet processing, wavelet tool signal processing Strains stresses, handbook formulas stress strain, aluminum stress-strain curve, strain material, stress transformation, distribution curve, engineering stress, failure curve Moreover, sim_content basic-term provides the highest coverage at 8.40% while sim_hybrid3 basic-term ranks second highest at 78.55%.. Stated differently, a user is more likely to receive recommendations to his original query using sim_content basic-term and sim_hybrid3 basic-term. In terms of F-measure, sim_hybrid3 basic-term ranks highest at 0.95 which suggests that the recommended queries will be useful to the searcher. Also, sim_hybrid3 extended achieved the second highest score at 0.94, while sim_hybrid3 basic-link ranks third at 0.9 (refer to Table ). It is observed that although sim_content basic-term achieved the largest average cluster size (6.93) among other approaches and the highest coverage (8.40%), it has a lower F-measure (0.78) compared to sim_hybrid3 basic-term (0.95) (from Table ). Table 4. Comparison of Query Clustering Approaches Best Worst Avg Cluster Size sim_hybrid3 basic-term sim_result basic-link Coverage sim_content basic-term sim_result basic-link Precision sim_hybrid basic-link sim_content basic-term Recall sim_hybrid3 basic-term sim_result basic-link F-measure sim_hybrid3 basic-term sim_result basic-link It can thus be deduced that sim_hybrid3 basic-term provides the most balanced results for query clusters followed by sim_hybrid3 extended and sim_hybrid3 basic-link. Table 4 gives an alternative view by comparing the performances of all approaches at similarity threshold 0.5. Although there is no single best performer, the table shows that the combination of the content-based and results-based approach outperforms the use of either approach alone. Further, the hybrid clustering approach, sim_hybrid3 basic-term, yields the best performance in terms of average cluster size, recall and F-measure. 8
9 7. Conclusion Collaborative querying is inspired from research in information seeking behaviour, which shows that collaboration among searchers is important in the process of information seeking and use. In our work, we use query clustering to discover similar queries to assist in query formulation and reformulation. We compared the content-based approach, results-based approach and a hybrid approach of both measures to cluster similar queries from query logs using the extended K-means algorithm. Our experiments also explored whether the use of more features (terms from title and snippets from the search results lists), in addition to the query terms and results URL, produces better clusters. Results showed that the use of either content-based or results-based approaches alone was inadequate. Instead, the hybrid content and results-based approach (sim_hybrid3 basic-term ), considering query terms, title terms and snippet terms, produced better query clustering results than approaches that do not consider terms from titles and snippets. The reason could be due to the increased number of common terms between two queries extracted from titles and snippets. With this increased number, term weights play a more important role in finding related queries. Our proposed hybrid clustering approach is seen to perform better than other clustering algorithms. It is able to overcome the drawbacks of the results-based approach through the addition of additional information from result documents such as titles and snippets. Besides, it also takes into account the important information in the queries submitted by users by combining the content-based approach with the results-based approach. Our clustering algorithm will allow IR systems to identify high-quality clusters to help users reformulate queries to meet their information needs by sharing queries issued by other users. High quality query clusters can also be used in automatic query expansion. We propose to extend our work in the following directions. Firstly, phrases could be used in the calculation of query similarity since they are less sensitive to noise due to the lower probability of finding matching phrases in non-related queries [8]. Hence, this technique could increase precision. Further, lexical knowledge through resources such as Wordnet could be used to improve the recall of the clusters. Similar to [], the use of lexical knowledge can be used to create a query hierarchy representing broader or specialized queries and allow users to modify their queries by either broadening or narrowing their searches. Acknowledgments. This proect is partially supported by TU research grant number RG5/05. References. Chien, S. & Immorlica,. Semantic similarity between search engine queries using temporal correlation. WWW 005, (005) -.. Chuang, S.L. & Chien, L.F. Towards automatic generation of query taxonomy: a hierarchical query clustering approach. Proceedings of IEEE 00 International Conference on Data Mining, (00) Cui, H., Wen, J.R., ie, J.Y. & Ma, W.Y. Probabilistic query expansion using query logs. Proceedings of the eleventh international conference on World Wide Web, (00) Fonseca, B.M., Golgher, P.B., De Moura, E.S. & Ziviani,. Using association rules to discover search engines related queries. First Latin American Web Congress (LA-WEB 03), (003) Fu, L., Goh, D.H., Foo, S. & a, J.C. Collaborative querying through a hybrid query clustering approach. Digital libraries: Technology and management of indigenous knowledge for global access, ICADL 003, (003) Fu, L., Goh, D.H., Foo, S. & Supangat, Y. Collaborative querying for enhanced information retrieval. European conference on research and advanced technology for digital libraries (ECDL 004), (004) Furnas, G.W., Landauer, T.K., Gomez, L.M. & Dumais, S.T. The vocabulary problem in human-system communication. Communications of the ACM, 30(), (987) Huang, C.K., Oyang, Y.J. & Chien, L.F. A contextual term suggestion mechanism for interactive search. Proceedings of 00 Web Intelligence Conference (WI 00), (00) pp Jansen, B. J., Spink, A. & Saracevic, T. Real life, real users and real needs: A study and analysis of users queries on the Web. Information Processing and Management, 36(), (000) Lokman, I.M. & Stephanie, W.H. Information seeking behavior and use of social science faculty studying stateless nations: A case study. Journal of library and Information Science Research, 3(), (00) Matwin, S. & Scott, S. Text classification using wordnet hypernyms. Proceedings of the COLIG/ACL Workshop on Usage of Wordet in atural Language Processing Systems, COLIG-ACL'98, (998) Raghavan, V.V. & Sever, H. On the reuse of past optimal queries. Proceedings of the Eighteenth International ACM SIGIR Conference on Research and Development in Information Retrieval, (995)
10 3. Turney, P.D. Word sense disambiguation by web mining for word co-occurrence probabilities. Proceedings Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (SESEVAL-3), (004) Velardi, P. & avigli, R. An analysis of ontology-based query expansion strategies. Workshop on Adaptive Text Extraction and Mining at the 4th European Conference on Machine Learning, ECML 003, (003) Voorhees, E. Using wordnet to disambiguate word senses for text retrieval. Proceedings of the 6th annual international ACM SIGIR conference on Research and development in information retrieval, (993) Wen, J.R., ie, J.Y. & Zhang, H.J. Query clustering using user logs. ACM Transactions on Information Systems, 0 (), (00)
Chapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationThis document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.
This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title Query formulation with a search assistant. Author(s) Citation Fu, Lin.; Goh, Dion Hoe-Lian.; Foo, Schubert
More informationA User Study on Features Supporting Subjective Relevance for Information Retrieval Interfaces
A user study on features supporting subjective relevance for information retrieval interfaces Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S. (2006). Proc. 9th International Conference of Asian Digital Libraries
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationThis document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.
This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Title Collaborative querying using the Query Graph Visualizer( Main article ) Author(s) Goh, Dion Hoe-Lian;
More informationEnabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines
Appears in WWW 04 Workshop: Measuring Web Effectiveness: The User Perspective, New York, NY, May 18, 2004 Enabling Users to Visually Evaluate the Effectiveness of Different Search Queries or Engines Anselm
More informationCollaborative querying using query graph visualizer Goh, H.L..D, Lin, F., & Foo, S (2005). Online Information Review, 29(3),
Collaborative querying using query graph visualizer Goh, H.L..D, Lin, F., & Foo, S (2005). Online Information Review, 29(3), 266-282. Collaborative Querying using the Query Graph Visualizer Lin Fu, Dion
More informationOutline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity
Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using
More informationWeb Information Retrieval using WordNet
Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT
More informationEffect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching
Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna
More informationAutomatic New Topic Identification in Search Engine Transaction Log Using Goal Programming
Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationIMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM
IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationPatent Classification Using Ontology-Based Patent Network Analysis
Association for Information Systems AIS Electronic Library (AISeL) PACIS 2010 Proceedings Pacific Asia Conference on Information Systems (PACIS) 2010 Patent Classification Using Ontology-Based Patent Network
More informationCollaborative Filtering using Euclidean Distance in Recommendation Engine
Indian Journal of Science and Technology, Vol 9(37), DOI: 10.17485/ijst/2016/v9i37/102074, October 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Collaborative Filtering using Euclidean Distance
More informationRepositorio Institucional de la Universidad Autónoma de Madrid.
Repositorio Institucional de la Universidad Autónoma de Madrid https://repositorio.uam.es Esta es la versión de autor de la comunicación de congreso publicada en: This is an author produced version of
More informationAutomated Online News Classification with Personalization
Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationCS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks
CS224W: Social and Information Network Analysis Project Report: Edge Detection in Review Networks Archana Sulebele, Usha Prabhu, William Yang (Group 29) Keywords: Link Prediction, Review Networks, Adamic/Adar,
More informationCHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL
85 CHAPTER 5 OPTIMAL CLUSTER-BASED RETRIEVAL 5.1 INTRODUCTION Document clustering can be applied to improve the retrieval process. Fast and high quality document clustering algorithms play an important
More informationCLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper
More informationRelevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline
Relevance Feedback and Query Reformulation Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price IR on the Internet, Spring 2010 1 Outline Query reformulation Sources of relevance
More informationExtraction of Web Image Information: Semantic or Visual Cues?
Extraction of Web Image Information: Semantic or Visual Cues? Georgina Tryfou and Nicolas Tsapatsoulis Cyprus University of Technology, Department of Communication and Internet Studies, Limassol, Cyprus
More informationCHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES
188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two
More informationQuery Phrase Expansion using Wikipedia for Patent Class Search
Query Phrase Expansion using Wikipedia for Patent Class Search 1 Bashar Al-Shboul, Sung-Hyon Myaeng Korea Advanced Institute of Science and Technology (KAIST) December 19 th, 2011 AIRS 11, Dubai, UAE OUTLINE
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationNUS-I2R: Learning a Combined System for Entity Linking
NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm
More informationIMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK
IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK 1 Mount Steffi Varish.C, 2 Guru Rama SenthilVel Abstract - Image Mining is a recent trended approach enveloped in
More informationSemi-Supervised Clustering with Partial Background Information
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More informationSubjective Relevance: Implications on Interface Design for Information Retrieval Systems
Subjective : Implications on interface design for information retrieval systems Lee, S.S., Theng, Y.L, Goh, H.L.D., & Foo, S (2005). Proc. 8th International Conference of Asian Digital Libraries (ICADL2005),
More informationRecommendation System for Location-based Social Network CS224W Project Report
Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless
More informationImproving Difficult Queries by Leveraging Clusters in Term Graph
Improving Difficult Queries by Leveraging Clusters in Term Graph Rajul Anand and Alexander Kotov Department of Computer Science, Wayne State University, Detroit MI 48226, USA {rajulanand,kotov}@wayne.edu
More informationSemantic Website Clustering
Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic
More informationExploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge
Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science
More informationInformation Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining
Information Retrieval using Pattern Deploying and Pattern Evolving Method for Text Mining 1 Vishakha D. Bhope, 2 Sachin N. Deshmukh 1,2 Department of Computer Science & Information Technology, Dr. BAM
More informationFinding Hubs and authorities using Information scent to improve the Information Retrieval precision
Finding Hubs and authorities using Information scent to improve the Information Retrieval precision Suruchi Chawla 1, Dr Punam Bedi 2 1 Department of Computer Science, University of Delhi, Delhi, INDIA
More informationImpact of Term Weighting Schemes on Document Clustering A Review
Volume 118 No. 23 2018, 467-475 ISSN: 1314-3395 (on-line version) url: http://acadpubl.eu/hub ijpam.eu Impact of Term Weighting Schemes on Document Clustering A Review G. Hannah Grace and Kalyani Desikan
More informationA Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2
A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2 1 Student, M.E., (Computer science and Engineering) in M.G University, India, 2 Associate Professor
More informationTERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES
TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.
More informationDomain Specific Search Engine for Students
Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationA BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK
A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific
More informationA novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems
A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics
More informationQuery Refinement and Search Result Presentation
Query Refinement and Search Result Presentation (Short) Queries & Information Needs A query can be a poor representation of the information need Short queries are often used in search engines due to the
More informationSearch Costs vs. User Satisfaction on Mobile
Search Costs vs. User Satisfaction on Mobile Manisha Verma, Emine Yilmaz University College London mverma@cs.ucl.ac.uk, emine.yilmaz@ucl.ac.uk Abstract. Information seeking is an interactive process where
More informationDomain-specific Concept-based Information Retrieval System
Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical
More informationA Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2
A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,
More informationCLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval
DCU @ CLEF-IP 2009: Exploring Standard IR Techniques on Patent Retrieval Walid Magdy, Johannes Leveling, Gareth J.F. Jones Centre for Next Generation Localization School of Computing Dublin City University,
More informationMEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI
MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,
More informationLeveraging Transitive Relations for Crowdsourced Joins*
Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,
More informationCopyright 2011 please consult the authors
Alsaleh, Slah, Nayak, Richi, Xu, Yue, & Chen, Lin (2011) Improving matching process in social network using implicit and explicit user information. In: Proceedings of the Asia-Pacific Web Conference (APWeb
More informationAn Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data
An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University
More informationVIDEO SEARCHING AND BROWSING USING VIEWFINDER
VIDEO SEARCHING AND BROWSING USING VIEWFINDER By Dan E. Albertson Dr. Javed Mostafa John Fieber Ph. D. Student Associate Professor Ph. D. Candidate Information Science Information Science Information Science
More informationComment Extraction from Blog Posts and Its Applications to Opinion Mining
Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan
More informationLetter Pair Similarity Classification and URL Ranking Based on Feedback Approach
Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India
More informationText clustering based on a divide and merge strategy
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and
More informationA probabilistic model to resolve diversity-accuracy challenge of recommendation systems
A probabilistic model to resolve diversity-accuracy challenge of recommendation systems AMIN JAVARI MAHDI JALILI 1 Received: 17 Mar 2013 / Revised: 19 May 2014 / Accepted: 30 Jun 2014 Recommendation systems
More informationWeb-page Indexing based on the Prioritize Ontology Terms
Web-page Indexing based on the Prioritize Ontology Terms Sukanta Sinha 1, 4, Rana Dattagupta 2, Debajyoti Mukhopadhyay 3, 4 1 Tata Consultancy Services Ltd., Victoria Park Building, Salt Lake, Kolkata
More informationMinoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University
Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University
More informationhighest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate
Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California
More informationA Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System
A Vector Space Equalization Scheme for a Concept-based Collaborative Information Retrieval System Takashi Yukawa Nagaoka University of Technology 1603-1 Kamitomioka-cho, Nagaoka-shi Niigata, 940-2188 JAPAN
More informationQuery Likelihood with Negative Query Generation
Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer
More informationAn Attempt to Identify Weakest and Strongest Queries
An Attempt to Identify Weakest and Strongest Queries K. L. Kwok Queens College, City University of NY 65-30 Kissena Boulevard Flushing, NY 11367, USA kwok@ir.cs.qc.edu ABSTRACT We explore some term statistics
More informationLearning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search
1 / 33 Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search Bernd Wittefeld Supervisor Markus Löckelt 20. July 2012 2 / 33 Teaser - Google Web History http://www.google.com/history
More informationEnhancing Cluster Quality by Using User Browsing Time
Enhancing Cluster Quality by Using User Browsing Time Rehab Duwairi Dept. of Computer Information Systems Jordan Univ. of Sc. and Technology Irbid, Jordan rehab@just.edu.jo Khaleifah Al.jada' Dept. of
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationDESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL
DESIGN AND IMPLEMENTATION OF AN INTERACTIVE QUERY EXPANSION METHODOLOGY FOR INFORMATION RETRIEVAL S. Ruban 1, Vanitha T 2. and S. Behin Sam 3 1 Department of Computer Science, Bharathiar University, Coimbatore,
More informationCMPSCI 646, Information Retrieval (Fall 2003)
CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where
More informationContent-based Dimensionality Reduction for Recommender Systems
Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender
More informationFrom Passages into Elements in XML Retrieval
From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles
More informationAn Empirical Study of Lazy Multilabel Classification Algorithms
An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
More informationLeveraging Set Relations in Exact Set Similarity Join
Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ
- 1 - ITERATIVE SEARCHING IN AN ONLINE DATABASE Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ 07962-1910 ABSTRACT An experiment examined how people use
More informationIndexing in Search Engines based on Pipelining Architecture using Single Link HAC
Indexing in Search Engines based on Pipelining Architecture using Single Link HAC Anuradha Tyagi S. V. Subharti University Haridwar Bypass Road NH-58, Meerut, India ABSTRACT Search on the web is a daily
More informationSearch Engine Architecture II
Search Engine Architecture II Primary Goals of Search Engines Effectiveness (quality): to retrieve the most relevant set of documents for a query Process text and store text statistics to improve relevance
More informationOutline. Structures for subject browsing. Subject browsing. Research issues. Renardus
Outline Evaluation of browsing behaviour and automated subject classification: examples from KnowLib Subject browsing Automated subject classification Koraljka Golub, Knowledge Discovery and Digital Library
More informationA NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING
A NEW CLUSTER MERGING ALGORITHM OF SUFFIX TREE CLUSTERING Jianhua Wang, Ruixu Li Computer Science Department, Yantai University, Yantai, Shandong, China Abstract: Key words: Document clustering methods
More informationResPubliQA 2010
SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first
More informationQuery Modifications Patterns During Web Searching
Bernard J. Jansen The Pennsylvania State University jjansen@ist.psu.edu Query Modifications Patterns During Web Searching Amanda Spink Queensland University of Technology ah.spink@qut.edu.au Bhuva Narayan
More informationWEIGHTING QUERY TERMS USING WORDNET ONTOLOGY
IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk
More informationMulti-Stage Rocchio Classification for Large-scale Multilabeled
Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale
More informationContextual Search Using Ontology-Based User Profiles Susan Gauch EECS Department University of Kansas Lawrence, KS
Vishnu Challam Microsoft Corporation One Microsoft Way Redmond, WA 9802 vishnuc@microsoft.com Contextual Search Using Ontology-Based User s Susan Gauch EECS Department University of Kansas Lawrence, KS
More informationA Document-centered Approach to a Natural Language Music Search Engine
A Document-centered Approach to a Natural Language Music Search Engine Peter Knees, Tim Pohle, Markus Schedl, Dominik Schnitzer, and Klaus Seyerlehner Dept. of Computational Perception, Johannes Kepler
More informationPersonalized Information Retrieval by Using Adaptive User Profiling and Collaborative Filtering
Personalized Information Retrieval by Using Adaptive User Profiling and Collaborative Filtering Department of Computer Science & Engineering, Hanyang University {hcjeon,kimth}@cse.hanyang.ac.kr, jmchoi@hanyang.ac.kr
More informationDisambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity
Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University
More informationInverted List Caching for Topical Index Shards
Inverted List Caching for Topical Index Shards Zhuyun Dai and Jamie Callan Language Technologies Institute, Carnegie Mellon University {zhuyund, callan}@cs.cmu.edu Abstract. Selective search is a distributed
More informationApplying Multi-Armed Bandit on top of content similarity recommendation engine
Applying Multi-Armed Bandit on top of content similarity recommendation engine Andraž Hribernik andraz.hribernik@gmail.com Lorand Dali lorand.dali@zemanta.com Dejan Lavbič University of Ljubljana dejan.lavbic@fri.uni-lj.si
More informationKnowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.
Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European
More informationarxiv: v3 [cs.dl] 23 Sep 2017
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Search Engine Philipp Mayr, Ameni Kacem arxiv:1706.00816v3 [cs.dl] 23 Sep 2017 GESIS Leibniz Institute for the Social Sciences,
More informationA Miniature-Based Image Retrieval System
A Miniature-Based Image Retrieval System Md. Saiful Islam 1 and Md. Haider Ali 2 Institute of Information Technology 1, Dept. of Computer Science and Engineering 2, University of Dhaka 1, 2, Dhaka-1000,
More informationTerm Frequency Normalisation Tuning for BM25 and DFR Models
Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter
More informationProximity Prestige using Incremental Iteration in Page Rank Algorithm
Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:
IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T
More informationIRCE at the NTCIR-12 IMine-2 Task
IRCE at the NTCIR-12 IMine-2 Task Ximei Song University of Tsukuba songximei@slis.tsukuba.ac.jp Yuka Egusa National Institute for Educational Policy Research yuka@nier.go.jp Masao Takaku University of
More informationAutomatically Generating Queries for Prior Art Search
Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation
More informationAcademic Paper Recommendation Based on Heterogeneous Graph
Academic Paper Recommendation Based on Heterogeneous Graph Linlin Pan, Xinyu Dai, Shujian Huang, and Jiajun Chen National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023,
More informationStudying the Impact of Text Summarization on Contextual Advertising
Studying the Impact of Text Summarization on Contextual Advertising G. Armano, A. Giuliani, and E. Vargiu Intelligent Agents and Soft-Computing Group Dept. of Electrical and Electronic Engineering University
More information