Two-Dimensional Visualization for Internet Resource Discovery

Shih-Hao Li and Peter B. Danzig
Computer Science Department
University of Southern California
Los Angeles, California 90089-0781
{shli, danzig}@cs.usc.edu

Abstract

Traditional information retrieval systems return documents in a list, where documents are sorted according to their publication dates, titles, or similarities to the user query. Users select documents of interest by scanning the returned list. In a distributed environment, documents are returned from more than one information server, and a simple document list is not an efficient way to present a large volume of data to users. In this paper, we propose a two-dimensional visualization scheme for Internet resource discovery. Our method displays data in clusters according to their cross-similarities while still retaining the rankings with respect to the user query. In addition, it provides a customized view that arranges data in favor of the user's preference terms.

Index Terms - information retrieval, resource discovery, similarity measure, visualization.

1 Introduction

Traditional information retrieval systems return documents in lists or sets, where documents are sorted according to their publication dates, the alphabetical order of their titles, or their computed similarities to the user query. In a distributed environment, documents are returned from more than one information server, and a simple document list is not an efficient way to present a large volume of data to users. One solution to this problem is to present data using a two-dimensional graphical display. By arranging documents in a two-dimensional space based on their inter-relationships, users can more easily identify topics and browse topical clusters.

In [1], Gower and Digby introduce several methods, such as principal components analysis, biplot, and correspondence analysis, for expressing complex relationships in two dimensions. These methods use Singular Value Decomposition (SVD) to transform documents into one or more matrices that carry the "compressed" relationships of the original data. In Information Agents [2], Cybenko et al. use Self-Organizing Maps (SOM) [3] to map multi-dimensional data onto a two-dimensional space. SOM is based on neural networks and requires iterative training and adjustment.

In this paper, we propose a two-dimensional visualization scheme based on Latent Semantic Indexing (LSI) [4] for Internet resource discovery. Our method displays data in clusters according to their cross-similarities and ranks documents with respect to the user query. In addition, it provides a customized view that arranges data in favor of the user's preference terms. Below, we introduce LSI and describe how we apply it to Internet resource discovery.

2 Latent Semantic Indexing

LSI was originally developed to address the vocabulary problem [5], which states that people with different backgrounds or in different contexts describe information differently. LSI assumes there is some underlying semantic structure in the pattern of term usage across documents and uses SVD to capture this structure. Our goal is to present this structure to the users instead of a simple ranked list.

In LSI, documents are represented as vectors of term frequencies following Salton's well-established Vector Space Model (VSM) [6]. An entire database is represented as a term-document matrix, where the rows and columns correspond to the terms and documents, respectively. To capture the semantic structure among documents, LSI applies SVD to this matrix and generates vectors of k (typically 100 to 300) orthogonal indexing dimensions, where each dimension represents a linearly independent concept. The decomposed vectors represent both documents and terms in the same semantic space, and their values indicate the degrees of association with the k underlying concepts. Because k is chosen much smaller than the number of documents and terms in the database, the decomposed vectors live in a compressed dimensional space and are not independent. Therefore, two documents can be relevant to each other without having common terms, as long as they share common concepts. Figure 1 shows SVD applied to a term-document matrix, and Example 1 illustrates the document representation before and after SVD.

Figure 1: SVD applied to an m x n term-document matrix, where m and n are the numbers of terms and documents in the database, and k is the dimensionality of the SVD representation.

Example 1. Let d_i (1 <= i <= 5) and t_j (1 <= j <= 8) be a set of documents and their associated terms in an information system, where

d_1 = {t_1, t_3, t_4, t_4},
d_2 = {t_1, t_2, t_2, t_3},
d_3 = {t_2, t_3, t_4},
d_4 = {t_6, t_7, t_8},
d_5 = {t_5, t_6, t_6, t_7}.
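To make the decomposition step concrete, the following is a minimal Python/NumPy sketch that builds the term-document matrix of Example 1 and truncates its SVD to k = 2 dimensions. The paper does not state which term weighting or which scaling of the reduced vectors it uses, so the exact coordinates need not match Table 1; the sketch only illustrates the mechanics.

```python
import numpy as np

# Term-document matrix of Example 1: rows are terms t1..t8, columns are
# documents d1..d5, entries are raw term frequencies.
A = np.array([
    [1, 1, 0, 0, 0],   # t1
    [0, 2, 1, 0, 0],   # t2
    [1, 1, 1, 0, 0],   # t3
    [2, 0, 1, 0, 0],   # t4
    [0, 0, 0, 0, 1],   # t5
    [0, 0, 0, 1, 2],   # t6
    [0, 0, 0, 1, 1],   # t7
    [0, 0, 0, 1, 0],   # t8
], dtype=float)

# SVD: A = U diag(s) Vt, with singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k = 2 for a two-dimensional plot).
k = 2
term_coords = U[:, :k] * s[:k]     # one (dim1, dim2) row per term
doc_coords = Vt[:k, :].T * s[:k]   # one (dim1, dim2) row per document

print(np.round(doc_coords, 3))     # documents d1..d5 in the reduced space
```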

Table 1 shows their vector representations in VSM (i.e., before SVD) and LSI (i.e., after SVD). Figure 2 shows the two-dimensional plot of the decomposed document and term vectors in LSI.

document   term description     VSM (t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8)   LSI (dim1 dim2)
d_1        t_1 t_3 t_4 t_4       1  0  1  2  0  0  0  0                  0.863  -0.508
d_2        t_1 t_2 t_2 t_3       1  2  1  0  0  0  0  0                  0.873   0.521
d_3        t_2 t_3 t_4           0  1  1  1  0  0  0  0                  0.702   0.005
d_4        t_6 t_7 t_8           0  0  0  0  0  1  1  1                 -0.010   0.603
d_5        t_5 t_6 t_6 t_7       0  0  0  0  1  2  1  0                 -0.017   0.974

Table 1: The vector representations of documents d_i (1 <= i <= 5) in VSM (i.e., before SVD) and LSI (i.e., after SVD). The "(t_1 ... t_8)" and "(dim1 dim2)" columns show the vectors in VSM and LSI, respectively.

Figure 2: Two-dimensional plot of the decomposed vectors in Example 1 (Dimension 1 vs. Dimension 2), where documents d_i (1 <= i <= 5) and terms t_j (1 <= j <= 8) are plotted with different markers (terms as "+").

The above example shows that documents sharing a number of common terms are placed close to each other, as are their common terms. In Figure 2, we can roughly see two clusters, A and B. In cluster A, documents d_1, d_2, and d_3 are close to each other because each pair of them shares t_1, t_2, t_3, or t_4. Similarly, d_4 and d_5 are close to each other because both of them contain t_6 and t_7. By presenting documents and terms in this way, we can easily see their relationships from their x-y coordinates.
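The cluster structure can also be read off numerically. The short sketch below takes the (dim1, dim2) coordinates straight from Table 1 and computes all pairwise Euclidean distances; the four within-cluster pairs reproduce the "before merging" column of Table 2 given later.

```python
import numpy as np

# Two-dimensional LSI coordinates (dim1, dim2) of d1..d5, copied from Table 1.
docs = np.array([[ 0.863, -0.508],    # d1
                 [ 0.873,  0.521],    # d2
                 [ 0.702,  0.005],    # d3
                 [-0.010,  0.603],    # d4
                 [-0.017,  0.974]])   # d5

# Pairwise Euclidean distances in the reduced space.  The within-cluster
# pairs (d1,d2), (d1,d3), (d2,d3), and (d4,d5) come out as 1.029, 0.538,
# 0.544, and 0.371, matching the "before merging" column of Table 2.
dist = np.linalg.norm(docs[:, None, :] - docs[None, :, :], axis=-1)
print(np.round(dist, 3))
```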

3 Internet Resource Discovery

In a distributed environment, people search for information by sending requests to the associated information servers using a client-server model (Figure 3(a)). In this approach, the client needs to know the server's name or address before sending a request. In the Internet, where thousands of servers provide information, it becomes difficult and inefficient to search all the servers manually.

Figure 3: (a) Client-Server Model. (b) Client-Directory-Server Model.

One step toward solving this problem is to keep a "directory of services" that records a description or summary of each information server. A user sends his query to the directory of services, which determines and ranks the information servers relevant to the user's request. The user employs the rankings when selecting the information servers to query directly. We call this the client-directory-server model (Figure 3(b)).

Internet resource discovery services [7], such as Archie [8], WAIS [9], WWW [10], Gopher [11], and Indie [12], allow users to retrieve information throughout the Internet. All of the above systems provide services similar to the client-directory-server model. For example, Archie, WAIS, and Indie support a global index that acts like the directory of services in their systems. WWW and Gopher do not have a global index by themselves; instead, they rely on add-on indexing schemes built by other tools, such as Harvest [13] and WWWW [14] for WWW and Veronica [15] for Gopher.

4 Visualization in the Client-Directory-Server Model

In Internet resource discovery, users search data from thousands of information servers, and a simple ranked list is an inefficient way to present a large volume of data to the users. To solve this problem, we propose an LSI-based two-dimensional visualization scheme. Our method presents documents in a two-dimensional space based on their cross-similarities. Documents sharing a number of common terms are placed close to each other, as are their common terms. In addition, documents are colored based on their similarities to the user query. Documents with higher similarities are painted with darker colors or gray levels. The darkest document in a cluster is used as the representative document of that cluster, and documents with similarities below a certain value are left colorless. In this way, relevant documents are grouped together while still showing their individual similarities to the query. Users can browse the common terms or representative documents and decide whether to see the details. In contrast, if returned documents are presented in a ranked list, the first and second documents may have identical similarity values to the query but cover completely different topics. If the user wants to focus on the second topic, he still needs to search the whole list to find other similar documents.
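The coloring rule just described can be sketched as follows. This assumes a cosine similarity score per document and a linear similarity-to-gray mapping with an illustrative cut-off of 0.3; the paper does not fix either choice.

```python
import numpy as np

def gray_levels(similarities, cutoff=0.3):
    """Map query similarities to gray levels for the 2-D display: higher
    similarity -> darker shade (0.0 = black, 1.0 = white); documents whose
    similarity falls below the cut-off are left "colorless" (white here).
    The linear mapping and the 0.3 cut-off are illustrative choices."""
    sims = np.clip(np.asarray(similarities, dtype=float), 0.0, 1.0)
    shades = 1.0 - sims              # darker means more similar to the query
    shades[sims < cutoff] = 1.0      # colorless documents
    return shades

# Five returned documents with example cosine similarities to the query.
print(gray_levels([0.92, 0.87, 0.40, 0.15, 0.05]))   # -> [0.08 0.13 0.6 1. 1.]
```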

To implement our visualization method, we collect all the returned documents at the client site, generate a term-document matrix from them, and apply SVD to this matrix. The resulting terms and documents are plotted on a two-dimensional graphical display and presented to the users. The relevant servers returned by the directory of services are displayed in the same manner. Figure 4 shows the architecture of our proposed method in the client-directory-server environment.

Figure 4: Two-dimensional visualization for (a) returned servers from the directory of services, and (b) returned documents from relevant servers in the client-directory-server environment. The returned data are colored based on their similarities to the user query.
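The client-side pipeline just described can be sketched end to end as below. This is a minimal illustration, not the authors' implementation: it assumes raw term frequencies, the common LSI folding-in rule q_k = q U_k S_k^{-1} for mapping the query into the reduced space, and the same illustrative 0.3 similarity cut-off for colorless documents; all function and variable names are ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def term_document_matrix(docs):
    """Build a raw term-frequency term-document matrix from tokenized documents."""
    terms = sorted({t for d in docs for t in d})
    A = np.zeros((len(terms), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            A[terms.index(t), j] += 1
    return terms, A

def visualize(returned_docs, query_terms, k=2, cutoff=0.3):
    """Collect the returned documents, apply SVD, and plot terms and documents
    in two dimensions, shading each document by its similarity to the query."""
    terms, A = term_document_matrix(returned_docs)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]
    doc_xy = Vt[:k, :].T * sk                   # documents in the 2-D space
    term_xy = Uk * sk                           # terms in the same space
    q = np.zeros(len(terms))
    for t in query_terms:                       # query as a term-frequency vector
        if t in terms:
            q[terms.index(t)] += 1
    q_xy = (q @ Uk) / sk                        # fold the query into the space
    sims = doc_xy @ q_xy / (np.linalg.norm(doc_xy, axis=1) * np.linalg.norm(q_xy) + 1e-12)
    shades = np.where(sims < cutoff, 1.0, 1.0 - np.clip(sims, 0.0, 1.0))
    plt.scatter(doc_xy[:, 0], doc_xy[:, 1], c=shades, cmap='gray',
                vmin=0.0, vmax=1.0, edgecolors='black', label='documents')
    plt.scatter(term_xy[:, 0], term_xy[:, 1], marker='+', label='terms')
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.legend()
    plt.show()

# The five documents of Example 1 stand in for documents returned by servers.
returned = [['t1', 't3', 't4', 't4'], ['t1', 't2', 't2', 't3'], ['t2', 't3', 't4'],
            ['t6', 't7', 't8'], ['t5', 't6', 't6', 't7']]
visualize(returned, query_terms=['t6', 't7'])
```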

Because users have their own preference terms or taxonomies, displaying data in a customized view can assist users in searching and browsing documents. To achieve this, we rearrange the returned data by merging them with a user profile that consists of a set of documents (called pseudo documents) selected by the user to represent his interests. For each user profile, we generate a "term-profile" matrix analogous to the term-document matrix. We merge it with the term-document matrix of the returned documents and apply SVD to the result. The outcome is a customized view that contains the decomposed term and document vectors arranged in favor of the terminology used by the merged user profile. To scale with the number of returned documents, we replicate the pseudo documents or increase their weights in the merged matrix. Below, we define the merge operation and demonstrate the customization in Example 2.

Definition 1. Let A and B be two term-document matrices, where A consists of m_1 terms (rows) and n_1 documents (columns), and B consists of m_2 terms (rows) and n_2 documents (columns). Let T_A and T_B be the sets of terms in A and B, respectively. If the merge of A and B is equal to C, then C is a term-document matrix consisting of m terms (rows) and n documents (columns), where n = n_1 + n_2 and m = |T_A u T_B| (the size of the union of the two term sets).

Example 2. Let p_1 and p_2 be two pseudo documents in a user profile, where

p_1 = {t_5, t_8, t_8},
p_2 = {t_1, t_1, t_3}.

We generate a term-profile matrix from p_1 and p_2 and merge it with the original term-document matrix of Example 1 (see Figure 5). Figure 6 shows the two-dimensional plot of the merged matrix after SVD.

        d_1 d_2 d_3 d_4 d_5
  t_1    1   1   0   0   0
  t_2    0   2   1   0   0
  t_3    1   1   1   0   0
  t_4    2   0   1   0   0
  t_5    0   0   0   0   1
  t_6    0   0   0   1   2
  t_7    0   0   0   1   1
  t_8    0   0   0   1   0

merged with

        p_1 p_2
  t_1    0   2
  t_3    0   1
  t_5    1   0
  t_8    2   0

equals

        d_1 d_2 d_3 d_4 d_5 p_1 p_2
  t_1    1   1   0   0   0   0   2
  t_2    0   2   1   0   0   0   0
  t_3    1   1   1   0   0   0   1
  t_4    2   0   1   0   0   0   0
  t_5    0   0   0   0   1   1   0
  t_6    0   0   0   1   2   0   0
  t_7    0   0   0   1   1   0   0
  t_8    0   0   0   1   0   2   0

Figure 5: The term-document matrix of Example 1 merged with the term-pseudo-document matrix generated from the user profile.

To examine the effect of merging a user profile, we calculate the Euclidean distance between documents d_i and d_j in each cluster before and after the merging. Table 2 and Figure 7 show the distances and two-dimensional coordinates of the decomposed document vectors before and after merging with the user profile. In Figure 6, the documents and terms in each cluster are closer to each other than they are in Figure 2. The distance changes between documents before and after the merging can be easily seen in Figure 7.
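The merge operation of Definition 1 amounts to taking the union of the two term sets and placing the columns of both matrices side by side, filling in zeros where a term does not occur in a document. Below is a minimal sketch using the data of Examples 1 and 2; it reproduces the merged matrix of Figure 5. The function name and row ordering are our own choices.

```python
import numpy as np

def merge(terms_a, A, terms_b, B):
    """Merge two term-document matrices (Definition 1): the result has one row
    per term in the union of the two term sets and the columns of A followed by
    the columns of B, with zeros where a term does not occur in a document."""
    terms = sorted(set(terms_a) | set(terms_b))            # union of term sets
    C = np.zeros((len(terms), A.shape[1] + B.shape[1]))
    for i, t in enumerate(terms):
        if t in terms_a:
            C[i, :A.shape[1]] = A[terms_a.index(t)]
        if t in terms_b:
            C[i, A.shape[1]:] = B[terms_b.index(t)]
    return terms, C

# Term-document matrix from Example 1 (rows t1..t8, columns d1..d5).
terms_A = ['t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8']
A = np.array([[1, 1, 0, 0, 0],
              [0, 2, 1, 0, 0],
              [1, 1, 1, 0, 0],
              [2, 0, 1, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 2],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 1, 0]], dtype=float)

# Term-profile matrix from Example 2 (rows t1, t3, t5, t8; columns p1, p2).
terms_B = ['t1', 't3', 't5', 't8']
B = np.array([[0, 2],
              [0, 1],
              [1, 0],
              [2, 0]], dtype=float)

terms_C, C = merge(terms_A, A, terms_B, B)   # reproduces the merged matrix of Figure 5
print(terms_C)
print(C)
```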

Figure 6: Two-dimensional plot of the term-document matrix after merging with the user profile (Dimension 1 vs. Dimension 2). The pseudo documents p_1 and p_2, the documents d'_i (1 <= i <= 5), and the terms t'_j (1 <= j <= 8) are plotted with different markers (terms as "+").

document pair   distance before merging   distance after merging   change difference
(d_1, d_2)              1.029                     0.000                 -100.00%
(d_1, d_3)              0.538                     0.162                  -69.87%
(d_2, d_3)              0.544                     0.162                  -70.20%
(d_4, d_5)              0.371                     0.191                  -48.53%

Table 2: Document distances before and after merging with a user profile. The "change difference" column shows the rate at which each distance increases ("+") or decreases ("-").
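The "change difference" column is simply the relative change of each distance. The sketch below recomputes it from the distances listed in Table 2; tiny discrepancies from the printed percentages come from the three-decimal rounding of the distances.

```python
# Distances from Table 2, before and after merging with the user profile.
pairs  = ['(d1, d2)', '(d1, d3)', '(d2, d3)', '(d4, d5)']
before = [1.029, 0.538, 0.544, 0.371]
after  = [0.000, 0.162, 0.162, 0.191]

# Relative change of each distance (the "change difference" column).
for pair, b, a in zip(pairs, before, after):
    print(f"{pair}: {100.0 * (a - b) / b:+.2f}%")
```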

Figure 7: The documents in two-dimensional space before and after merging with the user profile (Dimension 1 vs. Dimension 2). The pseudo documents p_1 and p_2 are shown with their own marker. The documents before the merging, d_i (1 <= i <= 5), are linked by dashed lines; the documents after the merging, d'_i (1 <= i <= 5), are linked by solid lines.

In Table 2, the distances between documents d_1, d_2, and d_3 all decrease. These changes occur because pseudo document p_2 contains both t_1 and t_3, which increases the correlation between any documents containing either t_1 (such as d_1 and d_2) or t_3 (such as d_1, d_2, and d_3). Similarly, the distance between d_4 and d_5 decreases because p_1 contains t_5 and t_8. From the example above, we can see that the distance changes reflect the correlations between terms and documents in the semantic space. When merging with a user profile, documents that share terms with the pseudo documents move toward each other.

5 Summary

In Internet resource discovery, a simple ranked list is an inefficient way to present a large volume of data to the users. In this paper, we have proposed a two-dimensional visualization scheme based on Deerwester's latent semantic indexing. Our method displays data according to their cross-similarities and still retains the rankings with respect to the query. We believe that integrating two-dimensional visualization with traditional information retrieval techniques, such as query reformulation and relevance feedback, can greatly improve the user's searching process.

References

[1] J. C. Gower and P. G. N. Digby, "Expressing complex relationships in two dimensions," in Interpreting Multivariate Data (V. Barnett, ed.), pp. 83-118, John Wiley & Sons, Inc., 1981.

[2] G. Cybenko, Y. Wu, A. Khrabrov, and R. Gray, "Information agents as organizers," in CIKM Workshop on Intelligent Information Agents, December 1994.

[3] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464-1480, 1990.

[4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391-407, September 1990.

[5] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais, "The vocabulary problem in human-system communication," Communications of the ACM, vol. 30, pp. 964-971, November 1987.

[6] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.

[7] K. Obraczka, P. B. Danzig, and S.-H. Li, "Internet resource discovery services," Computer, vol. 26, pp. 8-22, September 1993.

[8] A. Emtage and P. Deutsch, "Archie: An electronic directory service for the Internet," in Proceedings of the Winter 1992 USENIX Conference, pp. 93-110, 1992.

[9] B. Kahle and A. Medlar, "An information system for corporate users: Wide Area Information Servers," ConneXions - The Interoperability Report, vol. 5, no. 11, pp. 2-9, 1991.

[10] T. Berners-Lee, R. Cailliau, J.-F. Groff, and B. Pollermann, "World-Wide Web: The information universe," Electronic Networking: Research, Applications and Policy, vol. 1, no. 2, pp. 52-58, 1992.

[11] M. McCahill, "The Internet Gopher protocol: A distributed server information system," ConneXions - The Interoperability Report, vol. 6, pp. 10-14, July 1992.

[12] P. B. Danzig, S.-H. Li, and K. Obraczka, "Distributed indexing of autonomous Internet services," Computing Systems, vol. 5, no. 4, pp. 433-459, 1992.

[13] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz, "Harvest: A scalable, customizable discovery and access system," Technical Report CU-CS-732-94, University of Colorado, 1994.

[14] O. A. McBryan, "GENVL and WWWW: Tools for taming the Web," in Proceedings of the First International World-Wide Web Conference, May 1994.

[15] S. Foster, "About the Veronica service," November 1992. Electronic bulletin board posting on the comp.infosystems.gopher newsgroup.