Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons

Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreciate it, it may be helpful to read Intelligent Icons by Keogh et al. first. Eamonn 6/30/2006

Jin Shieh
Scott Sirowy
June 15, 2006

Abstract

An unorganized bookmark list is a common problem for many internet users. This lack of organization makes looking through entries both time-consuming and tedious. We present an application which organizes Mozilla bookmark entries based on the contents of their target websites. We also incorporate Intelligent Icons into bookmark entries for a clear visualization of similarity.

1. Introduction

With the advent of news aggregators and social bookmarks, internet users have a greater means of locating and accessing sites of interest than ever before. Often, due to the overwhelming volume, bookmarks are saved in a haphazard manner, with little thought or organization. This makes looking up a specific bookmark at a later time a tedious and time-consuming task, likely requiring a sequential scan of nearly the entire bookmark listing. Our solution is an application capable of organizing a user's bookmark entries in an automatic as well as intuitive fashion.

In order to organize bookmark entries, we must have a means of determining similarity between the contents of different websites (in the remaining text, we will refer to websites generically as documents). Through a technique called Latent Semantic Analysis (LSA) [2], we are able to associate each document with a set of concepts. Using this we can then determine document-to-document similarity. Once document processing has been completed, we generate an Intelligent Icon for each document entry to provide users with a convenient visualization aid [1].
Intelligent Icons allow the user to easily identify similar items and, to some extent, the depth of similarity. These generated icons are then encoded into the bookmark file as page icons.

2. Methodology and Considerations

The application process follows a series of intermediate steps. The bookmark file must first be parsed and the text representative of each bookmark entry extracted. A term-document matrix is then constructed, and additional preprocessing is done to improve accuracy. LSA then takes this term-document matrix and performs singular value decomposition (SVD) for rank reduction. Once this is complete, we can use basic matrix operations to compute a document-to-document similarity matrix. Using this similarity information, we then cluster similar documents so that they are arranged together. Icon generation and bookmark construction complete the application process. The following subsections elaborate on each of the key phases of the application process, as well as the considerations we made during the construction of our application prototype.

2.1 Text Extraction

Individual bookmark entries are first extracted from the Mozilla bookmarks.html file. Presently, we use regular expressions to obtain the title and URL of the target website, though future extensions should include a formal parser which can prevent lossy extraction by saving the
complete set of metadata. Each website specified by an entry is then fetched and the relevant text is extracted.¹

During text extraction, there is some concern about the presence of advertisements, as well as text in different Unicode mappings. Advertisement text may distort the perceived relationship between documents, and Unicode characters may not be mapped to the correct text. These two issues warrant additional consideration in future development.

2.2 Latent Semantic Analysis

To use LSA, we first change the representation of the documents into a term-document matrix. This is simply a large frequency matrix consisting of all possible words (rows) in the set of documents and the number of occurrences of each word, if any, in each document (columns). To improve the accuracy of our results, we preprocess the text during construction of the matrix.

The first step of preprocessing is the stemming of words using Porter's algorithm [3]. This maps a large number of word variations to a single root word. For example, "connections", "connection", "connecting", and "connected" can all be reduced to a single term. Next, a list of common English stop words is used as an exclusionary list [4]. These words, such as "a" and "and", add little or no description and fail to help with the formulation of document concepts.

Following the construction of the term-document matrix, a number of weighting schemes may be applied (tf-idf, log, binary, etc.) [5]. The effectiveness of each is dependent on the nature of the dataset being used. For our documents, we found that taking log(TermDocument_i,j + 1) for each entry in the term-document matrix and then normalizing each document vector (column) resulted in the most effective weighting scheme.

Once the processing of the term-document matrix has been completed, we use the SVD process described by LSA to construct a lower-dimensional abstract semantic space approximating the original term-document matrix [6]. This is done by keeping only the n largest singular values during SVD.
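The weighting and rank-reduction steps described above can be sketched as follows. This is a minimal illustration, not our actual code: the documents, the stop list, and the choice n = 2 are invented for the example (the experiments used n = 11 for 79 documents), and Porter stemming is omitted for brevity.

```python
import numpy as np

# Toy stand-ins for fetched website text (real documents are much larger).
docs = {
    "python.org": "python programming language tutorials",
    "r-project.org": "statistical computing programming language",
    "espn.com": "sports scores baseball basketball news",
}

stop_words = {"a", "and", "the", "of"}   # stand-in for the Perseus stop list [4]

# Build the term-document frequency matrix (terms = rows, documents = columns).
vocab = sorted({w for text in docs.values() for w in text.split() if w not in stop_words})
td = np.zeros((len(vocab), len(docs)))
for j, text in enumerate(docs.values()):
    for w in text.split():
        if w not in stop_words:
            td[vocab.index(w), j] += 1

# Log weighting, then normalize each document vector (column) to unit length.
td = np.log(td + 1)
td /= np.linalg.norm(td, axis=0)

# Rank reduction: keep only the n largest singular values.
U, s, Vt = np.linalg.svd(td, full_matrices=False)
n = 2
td_reduced = U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]
```

From `td_reduced`, a document-to-document similarity matrix follows from column inner products (`td_reduced.T @ td_reduced`).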
The choice of n here is critical in determining the accuracy of the result: too high results in overfitting, while too low fails to capture an accurate representation of the dataset. While determining a good size for n is an inherently difficult choice, our empirical results indicate that keeping a relatively low number of singular values (11 for 79 documents) is sufficient to generate accurate results. Once SVD has been completed, we can use basic matrix operations to generate term-to-term, term-to-document, or document-to-document similarity matrices (for additional details on LSA and SVD, see [2]).

2.3 Hierarchical Clustering

Once we obtain the document-to-document similarity matrix, we use single-linkage hierarchical clustering to obtain an ordering in which similar items are clustered together. We note that while we do not know the actual number of clusters present in the dataset, this is unnecessary, as we only wish to return the ordering. To do this, we first create a singleton cluster for each document, and then proceed to merge the two most similar clusters. This merging step is repeated until a single cluster containing all documents is formed. The ordering is saved during the clustering process and will be used for icon generation as well as organization of bookmark entries.

¹ At the present time, extraction is done manually.

Figure 1. Using color map for icon generation
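The single-linkage merging procedure of Section 2.3 can be sketched in a few lines of Python. The 4x4 similarity matrix below is invented for illustration; the real input is the document-to-document similarity matrix produced by LSA.

```python
def single_linkage_order(sim):
    """Repeatedly merge the two most similar clusters until one remains;
    return the resulting left-to-right ordering of document indices."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: similarity of the closest pair across clusters.
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # the merge order yields the ordering
        del clusters[b]
    return clusters[0]

# Invented similarity matrix: documents 0/1 and 2/3 form two natural pairs.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
order = single_linkage_order(sim)   # similar documents end up adjacent
```

This quadratic-scan version is only for clarity; a real implementation would use an optimized linkage routine.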
2.4 Icon Generation

As clustering returns an ordering in which similar items are placed together, we use this information to generate Intelligent Icons in which similar documents are also visually similar. A linear color map is first created to provide a range of varying colors. Each document is then equidistantly mapped, according to the cluster order, onto the color map. The intuition is that more similar documents will have representative colorings which are more visually alike than those of dissimilar documents (see Figure 1).

To construct the icon for each document, we first find that document's n most similar neighbors (by performing a lookup in the document-to-document similarity matrix). Recall that each document can now be identified by a unique color as a result of the color-mapping process illustrated earlier. We then use the representative colors of the n most similar documents to fill in the icon in a left-to-right, top-to-bottom fashion, beginning with the most similar document. We note that since the choice of n dictates the level of granularity, it should be kept relatively low unless the true number and size of the clusters are known. This is because in a dataset of many small clusters, if n is excessively high, the representation shown in the icon may be overwhelmed by dissimilar documents.

2.5 Bookmarks.html Construction

In the last phase of the application process, the bookmarks.html file is reconstructed and bookmark entries are arranged according to the ordering obtained from hierarchical clustering. We then use base64 encoding to convert each of the generated icons to a string representation. This string is then embedded into the bookmark entry as its page icon. This visualization helps the user differentiate between similar and dissimilar bookmarks.
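The color-mapping and embedding steps of Sections 2.4 and 2.5 might look like the following sketch. The blue-to-red ramp, the 2x2 icon size, and the hard-coded neighbor list are all invented for illustration; they are not the color map or icon dimensions used in our prototype.

```python
import base64

def color_map(position, total):
    """Equidistant linear color map: position in the cluster order -> RGB."""
    t = position / max(total - 1, 1)
    return (int(255 * t), 0, int(255 * (1 - t)))   # blue -> red ramp (assumed)

order = [0, 1, 2, 3]                    # cluster ordering from Section 2.3
colors = {doc: color_map(pos, len(order)) for pos, doc in enumerate(order)}

# Fill a 2x2 icon for document 0 with its most similar neighbors' colors,
# left to right, top to bottom (the neighbor list would normally come from
# a lookup in the document-to-document similarity matrix).
neighbors = [0, 1, 3, 2]
pixels = bytes(c for doc in neighbors for c in colors[doc])

# Embed in the bookmark entry as a base64 string; a real implementation
# would encode image-file bytes (e.g. PNG/ICO) into the page-icon attribute.
icon_attr = base64.b64encode(pixels).decode("ascii")
```

Adjacent documents in the cluster order receive nearby colors, so icons of same-topic bookmarks look alike at a glance.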
3. Experimental Results

To test the effectiveness of our methodology, we constructed a contrived but complete dataset of 79 bookmark entries, with each entry falling into one of 9 major categories. We have constructed a screenshot of what an unorganized bookmark listing containing these entries may look like in Figure 2.² Looking up individual bookmarks in such a listing is neither straightforward nor obvious.³

For the experimental dataset, we manually extracted the text from each site and placed it into text files. Logarithmic weighting was applied, and the resulting term-document matrix was normalized. Singular value decomposition was then performed, keeping the 11 largest singular values. Once hierarchical clustering was complete, we constructed Intelligent Icons using the 4 most similar documents per icon.

To help visualize the result of LSA and Intelligent Icons, we projected the document-to-document similarity⁴ onto a 2D plot using Multi-Dimensional Scaling (Figure 3). We can immediately observe the differentiation between documents of varying topics, both in terms of spatial locality and icon color. The new bookmark file, complete with embedded page icons, is shown in Figure 4. The hierarchical clustering we used was able to accurately place bookmark entries of the same genre or topic together. The page icons for bookmark entries also proved to be valuable indicators of document similarity, as the icon colorings across different categories tend to have high contrast.

² During text processing, no ordering is maintained, as a result of Python's dictionary implementation.
³ The category names before the bookmark entries in Figures 2 and 4 are only used to assist visualization of the dataset. Titles are not used during LSA.
⁴ The dissimilarity matrix used by Multi-Dimensional Scaling is derived by taking the square root of one minus each element of the document-to-document similarity matrix.
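The similarity-to-dissimilarity conversion of footnote 4, followed by a 2D projection, can be sketched as below. Since the text does not specify which MDS implementation was used, we substitute classical (Torgerson) MDS as a stand-in, and the 3x3 similarity matrix is invented for the example.

```python
import numpy as np

# Invented document-to-document similarity matrix (documents 0 and 1 similar).
sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

# Footnote 4: dissimilarity = sqrt(1 - similarity), element-wise.
d = np.sqrt(1.0 - sim)

# Classical MDS: double-center the squared dissimilarities, then use the
# top-2 eigenvectors (scaled by sqrt of eigenvalues) as 2D coordinates.
n = len(d)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (d ** 2) @ J
vals, vecs = np.linalg.eigh(B)
idx = np.argsort(vals)[::-1][:2]
coords = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
```

In the resulting plot, similar documents (0 and 1 here) land close together, mirroring the spatial locality visible in Figure 3.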
Figure 2. Sample screenshot of 79 unordered bookmark entries

Figure 3. Using MDS for visualization following LSA and Intelligent Icon generation
Figure 4. Reorganized bookmark entries with embedded page icons

4. Conclusion

We formulated an application aimed at improving bookmark usability by automatically organizing bookmark listings so that similar entries are grouped together. We first used LSA to perform information retrieval and to determine document-to-document similarity. Hierarchical clustering was then performed to group similar documents together, and Intelligent Icons were generated to help users visualize the data. Our experiment, conducted with 79 bookmark entries, demonstrates the effectiveness and overall improvement achieved by our application process. The organized bookmark entries are easily identifiable by topic and provide a marked contrast to the original, unorganized listing.

References

[1] Eamonn Keogh, Kaushik Chakrabarti, Li Wei, Xiaopeng Xi, Stefano Lonardi. Intelligent Icons: Integrating Lite-Weight Visualization and Data Mining into Microsoft Windows Operating Systems.
[2] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41, pages 391-407, 1990.
[3] Martin F. Porter. An algorithm for suffix stripping. Program, Vol. 14, No. 3, pages 130-137, 1980.
[4] The Perseus Digital Library. Stopwords for the Perseus English Index. http://www.perseus.tufts.edu/texts/engstop.html
[5] Fridolin Wild. The lsa Package. http://cran.r-project.org/doc/packages/lsa.pdf
[6] InfoVis CyberInfrastructure. Latent Semantic Analysis. http://iv.slis.indiana.edu/sw/lsa.html