Searching Information Servers Based on Customized Profiles

Technical Report USC-CS

Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California
fshli,

Abstract

We investigate the effect of using customized profiles to help search for relevant servers on the Internet. Our experiments demonstrate that using customized profiles with the latent semantic indexing (LSI) technique can improve the performance of Internet searching.

1 Introduction

When searching for information in a retrieval system, people use different terms to describe their information needs. The retrieval system searches through its database and returns documents indexed with matching terms. Since a concept can be represented by a variety of terms, users may fail to obtain the information they require. This is called the vocabulary problem [1].

The vocabulary problem occurs not only in traditional information retrieval, but also in Internet resource discovery, where users seek relevant information servers to which they can submit their queries. Previously, we proposed using Latent Semantic Indexing (LSI) [2] to ameliorate the vocabulary problem in Internet search [3]. Here, we expand the idea by integrating a customized profile with LSI to assist the search. We demonstrate that customized profiles can help a retrieval system better understand a user's terminology, and thus improve performance.

2 Background

LSI [2] was originally developed to address the vocabulary problem in Salton's Vector Space Model (VSM) [4], where documents and queries are represented as vectors of term frequencies or weights. LSI assumes that some underlying semantic structure exists in the pattern of term usage across documents. To capture this information, LSI applies Singular Value Decomposition (SVD) to a term-document matrix representing a database and generates vectors of k (typically 100 to 300) orthogonal indexing dimensions, where each dimension represents a linearly independent concept.
The decomposed vectors are used to represent both documents and query terms in the same semantic space, while their values indicate the degrees of association with the k underlying concepts. A query vector in LSI is the weighted sum of its component term vectors. For example, a p-term query is represented as the average of the p decomposed term vectors. To determine relevant documents, the query vector is compared with all document vectors, and those with the highest cosine coefficient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate the concepts, not the exact terms used. Hence, LSI improves search performance by ameliorating the vocabulary problem.

A profile (or user profile) is a collection of data, specified by users to reflect their interests. It can be used as a filter to select new documents or information that match users' interests [6, 7, 8], or used to augment the query to improve retrieval effectiveness [9, 10].

Foltz and Dumais used LSI to filter new incoming documents based on user profiles [8]. They compared new documents against users' word and document profiles, and ranked them based on their similarities to the profile. For the word profile, users indicate words or phrases of interest, each of which is represented as a separate vector and compared with new documents using the standard vector and LSI vector methods. Similarly, each document in the document profile is expressed as a vector and compared to all new documents using the same matching methods. In their experiment, they found that LSI matching with the document profile had the best performance.

Earlier, we proposed Two-Level LSI in the Internet environment [3], where a "directory of services" records the descriptions of information servers using LSI. A user sends a query to the directory of services, which determines and ranks the information servers relevant to the user's request. The user employs the rankings when selecting the most relevant information servers to query directly.

Here we investigate the use of customized profiles in two-level LSI. In this research, a profile provides background information to the query. It could be a set of documents reflecting a user's interests or a discipline's taxonomy representing specialized knowledge. To distinguish these two types of profile, we call the former "user profile" and the latter "taxonomy" in the paper.
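The LSI matching described earlier in this section can be illustrated with a minimal sketch. It assumes NumPy, a toy five-term vocabulary, and k = 2, none of which come from the report. The scaling convention (term vectors taken as rows of U_k, document vectors as rows of V_k scaled by the singular values) is one common choice and is also an assumption, since the report does not state which scaling it uses.

```python
import numpy as np

def lsi_decompose(term_doc, k):
    """Truncated SVD of a term-document matrix (terms x documents).

    Returns k-dimensional term vectors and document vectors.
    The scaling used here is an assumption, not taken from the report.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    term_vecs = u[:, :k]               # one row per term
    doc_vecs = vt[:k, :].T * s[:k]     # one row per document
    return term_vecs, doc_vecs

def query_vector(term_vecs, term_ids):
    """A p-term query is the average of its p decomposed term vectors."""
    return term_vecs[term_ids].mean(axis=0)

def cosine_scores(query, doc_vecs):
    """Cosine coefficient between the query and every document vector."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query)
    return doc_vecs @ query / norms

# Hypothetical toy collection: rows are terms, columns are documents.
vocab = ["car", "automobile", "engine", "heart", "blood"]
term_doc = np.array([
    [1, 0, 0],   # car        appears in doc 0
    [0, 1, 0],   # automobile appears in doc 1
    [1, 1, 0],   # engine     appears in docs 0 and 1
    [0, 0, 1],   # heart      appears in doc 2
    [0, 0, 1],   # blood      appears in doc 2
], dtype=float)

term_vecs, doc_vecs = lsi_decompose(term_doc, k=2)
q = query_vector(term_vecs, [vocab.index("car")])
scores = cosine_scores(q, doc_vecs)
print(scores)
```

In this toy collection the query "car" scores highly against document 1, which contains "automobile" and "engine" but not "car", because both documents load on the same indexing dimension.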
Below, we compare the effect of merging a taxonomy at the directory of services against expanding queries with a user profile at the client site.

3 Experiment

In [3], we showed that two-level LSI can outperform VSM in estimating server rankings in the Internet environment. Here we focus on comparing the use of a user profile and a taxonomy with LSI. We generate three server rankings using (1) the original LSI, (2) LSI with user profile, and (3) LSI with taxonomy. In this experiment, a user profile is used to expand a user's query before it is sent to the directory of services, while a taxonomy is merged with server descriptions at the directory of services. Figure 1 shows the three processes.

We use the standard CACM and MED document collections, for which queries and relevance judgments are available. We compute the rankings estimated by the three methods and calculate their rank-order coefficient and accumulated recall.

3.1 Methodology

We combine the documents from both the CACM and MED collections, and divide them into nine sub-collections, each representing a server's database. Notice that these documents may use the same terms for totally different meanings because they belong to two different disciplines (computer science and medicine). A severe vocabulary problem could therefore exist in such an environment.

Documents in each database are indexed with terms that occur in the title and abstract but not on a stop list of 429 common words. While queries are written in natural language, terms in a query are used only if they do not appear on the same stop list and if they appear in at least one document. All indexed terms are stored in their original forms without stemming. Table 1 gives the additional characteristics of our experiment.

Figure 1: The three ranking processes: (1) the original LSI, (2) LSI with user profile, and (3) LSI with taxonomy.

Number of documents                  4237
Number of queries                    64
Number of indexing terms
Mean number of terms per document
Mean number of terms per query

Table 1: The characteristics of the test collection.

In LSI ranking, we apply the single link clustering algorithm [11] to construct server descriptions. We cluster documents when their similarity is greater than a predefined threshold. Each of the remaining documents forms a cluster of its own. Each cluster is represented by the mean vector of its component document vectors, and the server description is the set of its cluster vectors.

The directory of services collects the server descriptions from all the servers, and determines the ranking using SVD for each user query. In this experiment, server descriptions are decomposed into vectors of 100 dimensions, as suggested in Deerwester's LSI experiments [2]. The ranking is based on the cosine similarity between the server descriptions and the user query.
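The construction of a server description can be sketched as follows. This is a rough illustration, not the implementation used in the experiment: it assumes cosine similarity as the document similarity measure and treats single-link clustering at a fixed threshold as finding connected components of the "similarity above threshold" graph. The function names, the sample vectors, and the threshold value are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def single_link_clusters(vectors, threshold):
    """Group documents linked by any chain of pairs with similarity > threshold.

    Documents with no such link end up in a cluster of their own.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def server_description(vectors, threshold):
    """Represent each cluster by the mean of its document vectors."""
    dim = len(vectors[0])
    description = []
    for members in single_link_clusters(vectors, threshold):
        mean = [sum(vectors[m][d] for m in members) / len(members)
                for d in range(dim)]
        description.append(mean)
    return description

# Hypothetical document vectors for one server.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
clusters = single_link_clusters(docs, threshold=0.8)
desc = server_description(docs, threshold=0.8)
print(clusters, desc)
```

The pairwise loop is quadratic in the number of documents, which is acceptable for a sketch but would need a more efficient algorithm, such as those surveyed in [11], for large collections.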

In LSI with user profile ranking, we select half of the relevant documents of a query to construct the user profile for that query. Since those documents have been judged relevant by the user, they can reflect the interests of the user. Before sending a query to the directory of services, we expand it by adding the "profile vector", which is the centroid of all the document vectors in the profile. The directory of services applies the typical LSI algorithm to rank servers for the remaining half of the relevant documents.

In LSI with taxonomy ranking, we generate "pseudo-documents" from the ACM taxonomy, which contains a listing of computer science classification schemes [12]. We then merge these pseudo-documents with the server descriptions in the directory of services before applying the LSI algorithm. We postulate that adding the taxonomy as pseudo-documents may reinforce the computer science interpretation of the terms in the CACM documents. Therefore, it can help increase the likelihood that computer science rather than medical documents are returned from the new superset collection.

Below, we use two methods to evaluate the rankings estimated by the above three approaches. Our criterion is to give high ranks to the servers that contain the most relevant documents.

3.2 Rank-Order Correlation

To verify the estimated rankings, we generate a standard ranking (denoted as STD) by sorting servers based on their number of relevant documents, excluding those used in the user profile. We calculate the Spearman rank-order correlation coefficient ($r_s$) [13] to measure the closeness of STD and the estimated ranking. The value of $r_s$ ranges between -1 and 1. If two rankings are identical, $r_s = 1$. If one ranking is the reverse of the other, $r_s = -1$. The larger the $r_s$, the closer the rankings. The $r_s$ coefficient allows us to determine which of the above methods generates a ranking closest to that of STD.
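As an illustration of the rank-order comparison, the sketch below computes $r_s$ with the textbook tie-free formula $r_s = 1 - 6\sum d_i^2 / (n(n^2-1))$. The report does not say how ties in STD (servers with equal numbers of relevant documents) are handled, so the tie-free form is an assumption, and the server names are hypothetical.

```python
def spearman_rs(ranking_a, ranking_b):
    """Spearman rank-order correlation between two rankings of the same items.

    Each ranking is a list of item identifiers, best first.
    Uses the tie-free formula, which is an assumption for this sketch.
    """
    n = len(ranking_a)
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    d_squared = sum((rank_a - position_b[item]) ** 2
                    for rank_a, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Nine hypothetical servers, matching the nine sub-collections in the experiment.
std = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"]
estimated = ["s2", "s1", "s3", "s4", "s5", "s6", "s7", "s8", "s9"]

rs_identical = spearman_rs(std, std)
rs_reversed = spearman_rs(std, list(reversed(std)))
rs_one_swap = spearman_rs(std, estimated)
print(rs_identical, rs_reversed, rs_one_swap)
```

Swapping only the top two servers still leaves $r_s$ close to 1, which shows that the coefficient measures overall agreement rather than exact equality of the two rankings.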
To compare the rankings generated using the original LSI (denoted as LSI), LSI with user profile (denoted as LSI-PRO), and LSI with taxonomy (denoted as LSI-TAX), we calculate their $r_s$ against STD for each query. Among the 64 samples, $r_s$(LSI, STD) is larger than, equal to, and less than $r_s$(LSI-TAX, STD) 16, 19, and 29 times, respectively. This indicates that, with 100 indexing dimensions, LSI with taxonomy generates a ranking closer to STD than LSI without it 29 out of 64 times, whereas the latter is closer only 16 out of 64 times. Similarly, $r_s$(LSI-TAX, STD) is larger than, equal to, and less than $r_s$(LSI-PRO, STD) 18, 14, and 32 times, respectively. Therefore, LSI with user profile generates closer rankings than LSI with taxonomy.

To measure the confidence that LSI with user profile outperforms the other methods, we calculate the confidence interval for a proportion, defined as follows [14]:

$$\text{Sample proportion} = p = \frac{\max[n_1, n_2]}{n_1 + n_2},$$

$$\text{Confidence interval for proportion} = p \mp z_{1-\alpha/2} \sqrt{\frac{p(1-p)}{n_1 + n_2}},$$

where $n_1$ is the number of times one method is better than the other, and $n_2$ is the number of times it is worse. The $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of a unit normal variate. For a 95% confidence level, $z_{1-\alpha/2} = 1.960$. If the confidence interval does not include 0.5, we can say with 95% confidence that one method is superior to the other.

For $r_s$(LSI-PRO, STD) and $r_s$(LSI-TAX, STD), where $n_1 = 32$ and $n_2 = 18$, the confidence interval is (0.507, 0.773). Because it does not include 0.5, we can say with 95% confidence that LSI with user profile is superior to LSI with taxonomy. Similarly, the confidence interval for $r_s$(LSI-TAX, STD) and $r_s$(LSI, STD), where $n_1 = 29$ and $n_2 = 16$, is (0.505, 0.784). Therefore, LSI with taxonomy is superior to LSI with 95% confidence.

From the two ranking comparisons, we conclude that LSI with either user profile or taxonomy gives a better ranking than LSI without it. Additionally, LSI with user profile performs better than LSI with taxonomy. The reason could be that the user profile is query-specific and changes from query to query, while the taxonomy acts as a generic profile for all queries. Therefore, LSI with user profile gives more accurate results.

3.3 Accumulated Recall

To measure the performance of using estimated server rankings, we calculate the "accumulated recall" for the top n out of N total servers in the ranking. Let $rel_i$ be the set of relevant documents and $retr_i$ the set of retrieved documents for a given query on server i. We define:

Document Recall, denoted as $R_d(n)$, is the ratio of the number of relevant documents retrieved in the top n servers over the number of relevant documents in all servers,

$$R_d(n) = \frac{\sum_{i=1}^{n} |rel_i \cap retr_i|}{\sum_{i=1}^{N} |rel_i|}.$$

Server Recall, denoted as $R_s(n)$, is the ratio of the number of the top n servers having relevant documents over the total number of servers having relevant documents,

$$R_s(n) = \frac{|\{\mathrm{server}_i \mid rel_i \neq \emptyset,\ 1 \le i \le n\}|}{|\{\mathrm{server}_i \mid rel_i \neq \emptyset,\ 1 \le i \le N\}|}.$$

Because the returned documents are determined by the query processing engine in each server, we assume for simplicity that all relevant documents are returned. Table 2 shows the average document and server recalls as a function of the number of servers for 64 queries retrieved on the test collection.

In Table 2, LSI with user profile has the highest document recall except when the number of servers n is 6 and 7, and the original LSI has the lowest value except when n = 1 and 5. This means that when retrieving the top 1, 2, 3, 4, 5, and 8 servers in the ranking estimated by LSI with user profile, we can get more relevant documents than with LSI or LSI with taxonomy.
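The two recall measures defined in this section can be sketched as below. Following the simplifying assumption stated above, all relevant documents on a server are taken as retrieved, so $|rel_i \cap retr_i| = |rel_i|$. The server names and relevant-document counts are made up for illustration.

```python
def document_recall(ranking, rel_counts, n):
    """R_d(n): share of all relevant documents held by the top-n ranked servers.

    Assumes every relevant document on a visited server is retrieved,
    and that at least one relevant document exists overall.
    """
    total = sum(rel_counts.values())
    in_top_n = sum(rel_counts[server] for server in ranking[:n])
    return in_top_n / total

def server_recall(ranking, rel_counts, n):
    """R_s(n): share of servers with relevant documents that appear in the top n."""
    with_relevant = [s for s in ranking if rel_counts[s] > 0]
    in_top_n = [s for s in ranking[:n] if rel_counts[s] > 0]
    return len(in_top_n) / len(with_relevant)

# Hypothetical example: four servers and their relevant-document counts.
rel_counts = {"s1": 4, "s2": 0, "s3": 1, "s4": 3}
ranking = ["s1", "s3", "s2", "s4"]   # an estimated ranking, best first

rd_top2 = document_recall(ranking, rel_counts, 2)
rs_top2 = server_recall(ranking, rel_counts, 2)
print(rd_top2, rs_top2)
```

In this example the top two servers hold 5 of the 8 relevant documents, and 2 of the 3 servers with any relevant documents are among them. Both measures reach 1 once all N servers are included.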
If we retrieve the servers ranked by LSI only, we will obtain fewer relevant documents most of the time. This is consistent with LSI's lower rank-order correlation coefficient. The average orders of the nine document recalls for LSI, LSI-PRO, and LSI-TAX are 2.556, 1.222, and 1.889, respectively. Clearly, LSI with user profile performs best among the three methods.

For server recall, both LSI-PRO and LSI-TAX take first place 4 out of 9 times. The average orders for LSI, LSI-PRO, and LSI-TAX are 2.333, 1.556, and 1.667, respectively. Thus, LSI with user profile performs slightly better than LSI with taxonomy, while both of them are much better than the original LSI. Therefore, users can get more relevant servers using the ranking order estimated by LSI with either user profile or taxonomy than without it.

Recall   n   LSI   LSI-PRO   LSI-TAX
$R_d$    1   (2)   (1)       (3)
         2   (3)   (1)       (2)
         3   (3)   (1)       (2)
         4   (3)   (1)       (2)
         5   (2)   (1)       (3)
         6   (3)   (2)       (1)
         7   (3)   (2)       (1)
         8   (3)   (1)       (2)
         9   (1)   (1)       (1)
$R_s$    1   (2)   (1)       (3)
         2   (3)   (2)       (1)
         3   (3)   (1)       (2)
         4   (3)   (2)       (1)
         5   (1)   (1)       (3)
         6   (2)   (3)       (1)
         7   (3)   (2)       (1)
         8   (3)   (1)       (2)
         9   (1)   (1)       (1)

Table 2: The average document recall ($R_d$) and server recall ($R_s$) as a function of number of servers (n) for 64 queries retrieved on the test collection. The numbers in parentheses indicate the order among the three methods for a given n.

4 Conclusions

We proposed using Deerwester's latent semantic indexing with customized profiles to search and rank information servers on the Internet. We conducted experiments on standard document collections and compared the performance of LSI, LSI with user profile, and LSI with taxonomy using the rank-order coefficient and accumulated recall. The results show that LSI with user profile performs best, LSI with taxonomy second, and the original LSI third in estimating and ranking relevant servers for user queries.

A customized profile provides background information to the query. In practice, novice users can use LSI with taxonomy to find initial documents in a specific field, then construct their own profile for further searching. Users with a variety of interests can use a different profile for each query to get higher recall. As the number of servers on the Internet grows rapidly, we believe this technique can ameliorate the vocabulary problem and improve the user's searching process.

References

[1] George W. Furnas, Thomas K. Landauer, Louis M. Gomez, and Susan T. Dumais, "The vocabulary problem in human-system communication", Communications of the ACM, vol. 30, no. 11, pp. 964-971, November
[2] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, September

[3] Shih-Hao Li and Peter B. Danzig, "Vocabulary problem in Internet resource discovery", in Proceedings of the Second International Workshop on Next Generation Information Technologies and Systems, Naharia, Israel, June 1995, pp. 139-145, Available from ftp://catarina.usc.edu/shli/ngits.ps.gz.
[4] Gerard Salton and Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill Book Company,
[5] Gerard Salton, Automatic Information Organization and Retrieval, McGraw-Hill Book Company,
[6] K. H. Packer and D. Soergel, "The importance of SDI for current awareness in fields with severe scatter of information", Journal of the American Society for Information Science, vol. 30, no. 3, pp. 125-135,
[7] Shoshana Loeb, "Architecting personalized delivery of multimedia information", Communications of the ACM, vol. 35, no. 12, pp. 39-48, December
[8] Peter W. Foltz and Susan T. Dumais, "Personalized information delivery: An analysis of information filtering methods", Communications of the ACM, vol. 35, no. 12, pp. 51-60, December
[9] H. Grzelak and K. Kowalski, "Automatic construction of information queries", Information Processing and Management, vol. 19, pp. 381-389,
[10] Robert R. Korfhage, "Query enhancement by user profiles", in Proceedings of the Third Joint BCS and ACM Symposium, 1984, pp. 111-122.
[11] Ellen M. Voorhees, "Implementing agglomerative hierarchic clustering algorithms for use in document retrieval", Information Processing and Management, vol. 22, no. 6, pp. 465-476,
[12] Jean E. Sammet and Anthony Ralston, "The new (1982) computing review classification system - final version", Communications of the ACM, vol. 25, no. 1, pp. 13-25, January
[13] Maurice Kendall and Jean D. Gibbons, Rank Correlation Methods, Edward Arnold, London, fifth edition,
[14] Raj Jain, The Art of Computer Systems Performance Analysis, John Wiley & Sons, Inc., New York,
