Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California

Size: px

Start display at page:

Download "Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California"

Tyrone Brooks
5 years ago
Views:

1 Two-Dimensional Visualization for Internet Resource Discovery Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California fshli, Abstract Traditional information retrieval systems return documents in a list, where documents are sorted according to their publication dates, titles, or similarities to the user query. Users select interested documents by searching them through the returned list. In a distributed environment, documents are returned from more than one information server. A simple document list is not ecient to present a large volume data to the users. In this paper, we propose a two-dimensional visualization scheme for Internet resource discovery. Our method displays data by clusters according to their cross-similarities and still retains the rankings with respect to the user query. In addition, it provides a customized view that arranges data in favor of user's preference terms. Index Terms - information retrieval, resource discovery, similarity measure, visualization. 1 Introduction Traditional information retrieval systems return documents in lists or sets, where documents are sorted according to their publication dates, title alphabets, or computed similarities to the user query. In a distributed environment, documents are returned from more than one information server, and a simple document list is not ecient to present a large volume of data to the users. One solution to this problem is to present data using a two-dimensional graphical display. By arranging documents in a two-dimensional space based on their inter-relationships, users can more easily identify topics and browse topical clusters. In [1], Gower and Digby introduce several methods, such as principal components analysis, biplot, and correspondence analysis, for expressing complex relationships in two dimensions. These methods use Singular Value Decomposition (SVD) to transform documents into one or more matrices that carry the \compressed" relationships of the original data. In Information Agents [2], Cybenko et. al use Self Organizing Maps (SOM) [3] to map multi-dimensional data onto a two-dimensional space. SOM is based on neural networks and needs iterative training and adjustments. In this paper, we propose a two-dimensional visualization scheme based on Latent Semantic Indexing (LSI) [4] for Internet resource discovery. Our method displays data by clusters according to their cross-similarities and ranks documents with respect to the user query. In addition, it provides a customized view that arranges data in favor of user's preference terms. Below, we introduce LSI and describe our method in Internet resource discovery. 1

2 2 Latent Semantic Indexing LSI was originally developed for solving the vocabulary problem [5], which states that people with dierent backgrounds or in dierent contexts describe information dierently. LSI assumes there is some underlying semantic structure in the pattern of term usage across documents and uses SVD to capture this structure. Our goal is to present this structure instead of a simple ranked list to the users. In LSI, documents are represented as vectors of term frequencies following Salton's wellestablished Vector Space Model (VSM) [6]. An entire database is represented as a term-document matrix, where the rows and columns indicate the terms and documents, respectively. To capture the semantic structure among documents, LSI applies SVD to this matrix and generates vectors of k (typically 100 to 300) orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The decomposed vectors are used to represent both documents and terms in the same semantic space, while their values indicate the degrees of association with the k underlying concepts. Because k is chosen much smaller than the number of documents and terms in the database, the decomposed vectors are represented in a compressed dimensional space and are not independent. Therefore, two documents can be relevant without having common terms but with common concepts. Figure 1 shows SVD applied to a term-document matrix and Example 1 illustrates the document representation before and after SVD. Figure 1: SVD applies to an m n term-document matrix, where m and n are the numbers of terms and documents in the database, and k is the dimensionality of the SVD representation. Example 1 Let d i (1 i 5), and t j an information system, where (1 j 8) be a set of documents and their associated terms in d 1 = ft 1 ; t 3 ; t 4 ; t 4 g; d 2 = ft 1 ; t 2 ; t 2 ; t 3 g; d 3 = ft 2 ; t 3 ; t 4 g; d 4 = ft 6 ; t 7 ; t 8 g; d 5 = ft 5 ; t 6 ; t 6 ; t 7 g: 2

3 Table 1 shows their vector representations in VSM (i.e. before SVD) and LSI (i.e. after SVD). Figure 2 shows the two-dimensional plot of the decomposed document and term vectors in LSI. document term VSM LSI description (t 1 t 2 t 3 t 4 t 5 t 6 t 7 t 8 ) (dim1 dim2) d 1 t 1 t 3 t 4 t ?0.508 d 2 t 1 t 2 t 2 t d 3 t 2 t 3 t d 4 t 6 t 7 t ? d 5 t 5 t 6 t 6 t ? Table 1: The vector representations of documents d i (1 i 5) in VSM (i.e. before SVD) and LSI (i.e. after SVD). The \(t 1 : : : t 8 )" and \(dim1 dim2)" columns show the vectors in VSM and LSI, respectively D Plot of Terms and Documents d5 t6 Dimension cluster A d4 t7 t5 t8 t1 t3 cluster B t2 t4 d3 d2 d Dimension 1 Figure 2: Two-dimensional plot of decomposed vectors in Example 1, where documents d i (1 i 5) and terms t j (1 j 8) are represented as \" and \+", respectively. The above example shows that documents having a number of common terms are close to each other as well as their common terms. In Figure 2, we can roughly see two clusters, A and B. In cluster A, documents d 1, d 2, and d 3 are close to each other because each pair of them share t 1, t 2, t 3, or t 4. Similarly, d 4 and d 5 are close to each other because both of them have t 6 and t 7. To present documents and terms in this way, we can easily see their relationships by x?y coordinates. 2 3

4 3 Internet Resource Discovery In a distributed environment, people search information by sending requests to associated information servers using a client-server model (Figure 3(a)). In this approach, the client needs to know the server's name or address before sending a request. In the Internet where thousands of servers provide information, it becomes dicult and inecient to search all the servers manually. (a) (b) Figure 3: (a) Client-Server Model. (b) Client-Directory-Server Model. One step to solve this problem is to keep a \directory of services" that records the description or summary of each information server. A user sends his query to the directory of services which determines and ranks the information servers relevant to the user's request. The user employs the rankings when selecting the information servers to query directly. We call this the client-directoryserver model (Figure 3(b)). Internet resource discovery services [7], such as Archie [8], WAIS [9], WWW [10], Gopher [11], and Indie [12], allow users to retrieve information throughout the Internet. All of the above systems provide services similar to the client-directory-server model. For example, Archie, WAIS, and Indie support a global index like the directory of services in their systems. WWW and Gopher do not have a global index by themselves. Instead, they have an added-on indexing scheme built by other tools, such as Harvest [13], WWWW [14] for WWW and Veronica [15] for Gopher. 4 Visualization in Client-Directory-Server Model In Internet resource discovery, users search data from thousands of information servers. A simple ranked list is inecient to present a large volume data to the users. To solve this problem, we propose a LSI-based two-dimensional visualization scheme. Our method presents documents in a two-dimensional space based on their cross-similarities. Documents having a number of common terms are placed close to each other as well as their common terms. In addition, documents are colored based on their similarities to the user query. Documents with higher similarities are painted with darker colors or gray-levels. The darkest document in a cluster is used as the representative document of this cluster. The documents with similarities below a certain value are colorless. By doing this, relevant documents can be grouped together and still show their individual similarities to the query. Users can browse the common terms or representative documents and decide whether to see the details. In contrast, if returned documents are presented in a ranked list, the rst and second documents may have identical similarity values with the query but cover completely dierent topics. If the user wants to focus on the second topic, he still needs to search the whole list to nd other similar ones. 4

5 To implement our visualization method, we collect all the returned documents at the client site, generate a term-document matrix out of them, and apply SVD to this matrix. The resulting terms and documents are plotted on a two-dimensional graphical display and presented to the users. Similarly, the relevant servers returned by the directory of services are displayed in the same manner. Figure 4 shows the architecture of our proposed method in the client-directory-server environment. (a) (b) Figure 4: Two-dimensional visualization for (a) returned servers from the directory of services, and (b) returned documents from relevant servers in the client-directory-server environment. The returned data are colored based on their similarities to the user query. Because users have their own preference terms or taxonomies, displaying data in a customized view can assist users in searching and browsing documents. To achieve this, we rearrange returned data by merging them with a user prole that consists of a set of documents (called pseudo documents) selected by users to represent their interests. For each user prole, we gener- 5

6 ate a \term-prole" matrix like the term-document matrix. We merge it with the term-document matrix of the returned documents and apply SVD to them. The result is a customized view that contains the decomposed term and document vectors arranged in favor of the terminology used by the merged user prole. To scale to the number of returned documents, we replicate the pseudo documents or increase their weights in the merged matrix. Below, we dene the merge operation and demonstrate the customization in Example 2. Denition 1 Let A and B be two term-document matrices, where A consists of m 1 terms (rows) and n 1 documents (columns), and B consists of m 2 terms (rows) and n 2 documents (columns). Let T A and T B be the sets of terms in A and B, respectively. If A merges B is equal to C, denoted as A B = C, then C is a term-document matrix consisting of m terms (rows) and n documents (columns), where n = n 1 + n 2 and m = jt A [ T B j. Example 2 Let p 1 and p 2 be two pseudo documents in a user prole, where p 1 = ft 5 ; t 8 ; t 8 g; p 2 = ft 1 ; t 1 ; t 3 g: We generate a term-prole matrix from p 1 and p 2 and merge it with the original term-document matrix in Example 1 (see Figure 5). Figure 6 shows the two-dimensional plot of the merged matrix after SVD. d 1 d 2 d 3 d 4 d 5 t t t t t t t t p 1 p 2 t t t t = d 1 d 2 d 3 d 4 d 5 p 1 p 2 t t t t t t t t Figure 5: The term-document matrix in Example 1 merges with the termpseudo-document matrix generated from a user prole. To examine the eect of merging a user prole, we calculate the Euclidean distance between document d i and d j in each cluster before and after the merging. Table 2 and Figure 7 show the distances and two-dimensional coordinates of the decomposed document vectors before and after merging with the user prole. 2 In Figure 6, the documents and terms in each cluster are closer to each other than those in Figure 2. The distance changes between documents before and after the merging can be easily seen 6

7 1 2-D Plot of Terms and Documents Dimension d5 d4 p1 t5 t6 t8 t7 d1 p2 d2 d3 t1 t2 t3 t Dimension 1 Figure 6: Two-dimensional plot of the term-document matrix after merging with a user prole. The pseudo documents p 1 and p 2 are shown as \". The documents d 0 (1 i i 5) and terms t0 j (1 j 8) are represented as \" and \+", respectively. document before after change distance merging merging dierence (d 1 ; d 2 ) ?100.00% (d 1 ; d 3 ) ?69.87% (d 2 ; d 3 ) ?70.20% (d 4 ; d 5 ) ?48.53% Table 2: Document distances before and after merging with a user prole. The \change dierence" column shows the distance increasing (\+") or decreasing (\?") rates. 7

8 1 2-D Plot of Terms and Documents d d4 p1 d5 d4 d2 Dimension d3 d1 d2 d3 p d Dimension 1 Figure 7: The documents in two-dimensional space when merging with the a prole. The pseudo documents p 1 and p 2 are shown as \". The documents before the merging are represented by d i (1 i 5), shown as \", and linked by dashed lines. The documents after the merging are represented by (1 i 5), shown as \", and linked by solid lines. d 0 i 8

9 in Figure 7. In Table 2, the distances between documents d 1, d 2, and d 3 all decrease. These changes are because pseudo document p 2 contains both t 1 and t 3, which increases the correlation between any document containing either t 1 (such as d 1 and d 2 ) or t 3 (such as d 1, d 2, and d 3 ). Similarly, the distance between d 4 and d 5 decreases due to p 1 's containing t 5 and t 8. From the example above, we can see the distance changes reecting the correlations between terms and documents in the semantic space. When merging with a user prole, those documents having common terms with pseudo documents move toward each other. 5 Summary In Internet resource discovery, a simple ranked list is inecient to present a large volume data to the users. In this paper, we have proposed a two-dimensional visualization scheme based on Deerwester's latent semantic indexing. Our method displays data according to their cross-similarities and still retains the rankings with respect to the query. We believe integrating two-dimensional visualization with traditional information retrieval techniques, such as query reformation and relevance feedback, can greatly improve user's searching process. References [1] J. C. Gower and P. G. N. Digby, \Expressing complex relationships in two dimensions," in Interpreting Multivariate Data (V. Barnett, ed.), pp. 83{118, John Wiley & Son, INC., [2] G. Cybenko, Y. Wu, A. Khrabrov, and R. Gray, \Information agents as organizers," in CIKM Workshop on Intelligent Information Agents, December [3] T. Kohonen, \The self organizing map," Proceedings of the IEEE Transactions on Computers, vol. 78, no. 9, pp. 1464{1480, [4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, \Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391{407, September [5] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, \The vocabulary problem in human-system communication," Communications of the ACM, vol. 30, pp. 964{971, November [6] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, [7] K. Obraczka, P. B. Danzig, and S.-H. Li, \Internet resource discovery services," Computer, vol. 26, pp. 8{22, September [8] A. Emtage and P. Deutsch, \Archie: An electronic directory service for the internet," in Proceedings of the Winter 1992 USENIX Conference, pp. 93{110, [9] B. Kahle and A. Medlar, \An information system for corporate users: Wide Area Information Servers," ConneXions { The Interoperability Report, vol. 5, no. 11, pp. 2{9,

10 [10] T. Berners-Lee, R. Cailliau, J.-F. Gro, and B. Pollermann, \World-Wide Web: The information universe," Electronic Networking: Research, Applications and Policy, vol. 1, no. 2, pp. 52{58, [11] M. McCahill, \The internet gopher protocol: A distributed server information system," ConneXions { The Interoperability Report, vol. 6, pp. 10{14, July [12] P. B. Danzig, S.-H. Li, and K. Obraczka, \Distributed indexing of autonomous internet services," Computing Systems, vol. 5, no. 4, pp. 433{459, [13] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz, \Harvest: A scalable, customizable discovery and access system," Technical Report CU-CS , University of Colorado, [14] O. A. McBryan, \GENVL and WWWW: Tools for taming the Web," in Proceedings of the First International World-Wide Web Conference, May [15] S. Foster, \About the Veronica service," November Electronic bulletin board posting on the comp.infosystems.gopher newsgroup. 10

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

highest cosine coecient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate Searching Information Servers Based on Customized Proles Technical Report USC-CS-96-636 Shih-Hao Li and Peter B. Danzig Computer Science Department University of Southern California Los Angeles, California