
Similar documents
Clustering Startups Based on Customer-Value Proposition

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Text Modeling with the Trace Norm

highest cosine coefficient [5] are returned. Notice that a query can hit documents without having common terms because the k indexing dimensions indicate

Information Retrieval. hussein suleman uct cs

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California

Information Retrieval. (M&S Ch 15)

Impact of Term Weighting Schemes on Document Clustering A Review

Chapter 6: Information Retrieval and Web Search. An introduction

SPE Copyright 2010, Society of Petroleum Engineers

CSE 494: Information Retrieval, Mining and Integration on the Internet

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

Mining Web Data. Lijun Zhang

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Introduction to Information Retrieval

Chapter 27 Introduction to Information Retrieval and Web Search

I_n = number of words appearing exactly n times; N = number of words in the collection of words; A = a constant. For example, if N=100 and the most

Information Retrieval

Part I: Data Mining Foundations

Document Clustering using Concept Space and Cosine Similarity Measurement

ENHANCEMENT OF METICULOUS IMAGE SEARCH BY MARKOVIAN SEMANTIC INDEXING MODEL

CS 6320 Natural Language Processing

Text Analytics (Text Mining)

Published in A R DIGITECH

Recommendation System Using Yelp Data CS 229 Machine Learning Jia Le Xu, Yingran Xu

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

Web Page Similarity Searching Based on Web Content

Vector Space Models: Theory and Applications

Vector Semantics. Dense Vectors

Tag-based Social Interest Discovery

Dimension Reduction CS534

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Context Based Web Indexing For Semantic Web

What is this Song About?: Identification of Keywords in Bollywood Lyrics

Self-organization of very large document collections

Domain-specific Concept-based Information Retrieval System

The Semantic Conference Organizer

Document Clustering in Reduced Dimension Vector Space

Sprinkled Latent Semantic Indexing for Text Classification with Background Knowledge

Keyword Extraction by KNN considering Similarity among Features

General Instructions. Questions

Lecture Topic Projects

Event Detection using Archived Smart House Sensor Data obtained using Symbolic Aggregate Approximation

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1

Semantic Website Clustering

Programming Exercise 7: K-means Clustering and Principal Component Analysis

A New Measure of the Cluster Hypothesis

Decomposition. November 20, Abstract. With the electronic storage of documents comes the possibility of

Content-based Dimensionality Reduction for Recommender Systems

ITERATIVE SEARCHING IN AN ONLINE DATABASE. Susan T. Dumais and Deborah G. Schmitt Cognitive Science Research Group Bellcore Morristown, NJ

Analysis and Latent Semantic Indexing

Clustered SVD strategies in latent semantic indexing

Document Clustering: Comparison of Similarity Measures

CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM

String Vector based KNN for Text Categorization

Visualization of Text Document Corpus

Improving Probabilistic Latent Semantic Analysis with Principal Component Analysis

Cluster Analysis for Microarray Data

CS 224N FINAL PROJECT REPORT REGBASE:

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. User Tasks

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Developing Focused Crawlers for Genre Specific Search Engines

Feature selection. LING 572 Fei Xia

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

Spotting Words in Latin, Devanagari and Arabic Scripts

Encoding Words into String Vectors for Word Categorization

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

The end of affine cameras

Similarity search in multimedia databases

Dimension reduction : PCA and Clustering

Hebei University of Technology A Text-Mining-based Patent Analysis in Product Innovative Process

Search Results Clustering in Polish: Evaluation of Carrot

Unsupervised learning, Clustering CS434

Planar Point Location

Clustering. Bruno Martins. 1 st Semester 2012/2013

A Content Vector Model for Text Classification

Week 7 Picturing Network. Vahe and Bethany

Module 5. Function-Oriented Software Design. Version 2 CSE IIT, Kharagpur

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

A Novel Method for Activity Place Sensing Based on Behavior Pattern Mining Using Crowdsourcing Trajectory Data

The Design and Implementation of an Intelligent Online Recommender System

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Browsing Heterogeneous Document Collections by a Segmentation-free Word Spotting Method

A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning

Collaborative Filtering Recommender System

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Chapter 2. Architecture of a Search Engine

A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014

Machine Learning HW4

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Improving Suffix Tree Clustering Algorithm for Web Documents

Keywords: clustering algorithms, unsupervised learning, cluster validity


Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons

Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreciate it, it may be helpful to read Intelligent Icons by Keogh et al. first. Eamonn, 6/30/2006

Jin Shieh and Scott Sirowy

June 15, 2006

Abstract

An unorganized bookmark list is a common problem for many internet users. This lack of organization makes looking through entries both time consuming and tedious. We present an application which organizes Mozilla bookmark entries based on the contents of their target websites. We also incorporate Intelligent Icons into bookmark entries for a clear visualization of similarity.

1. Introduction

With the advent of news aggregators and social bookmarking, internet users have a greater means of locating and accessing sites of interest than ever before. Often, owing to the overwhelming volume, bookmarks are saved in a haphazard manner, with little thought or organization. This makes looking up a specific bookmark at a later time a tedious and time-consuming task, likely requiring a sequential scan of nearly the entire bookmark listing. Our solution is an application capable of organizing a user's bookmark entries in an automatic as well as intuitive fashion.

In order to organize bookmark entries, we must have a means of determining similarity between the contents of different websites (in the remaining text, we will refer to websites generically as documents). Through a technique called Latent Semantic Analysis (LSA) [2], we are able to associate each document with a set of concepts, and from these we can determine document-to-document similarity. Once document processing has been completed, we generate an Intelligent Icon for each document entry to provide users with a convenient visualization aid [1]. Intelligent Icons allow the user to easily identify similar items and, to some extent, the depth of similarity. The generated icons are then encoded into the bookmark file as page icons.

2. Methodology and Considerations

The application process follows a series of intermediate steps. The bookmark file must first be parsed and the text representative of each bookmark entry extracted. A term-document matrix is then constructed, and additional preprocessing is done to improve accuracy. LSA then takes this term-document matrix and performs singular value decomposition (SVD) for rank lowering. Once this is complete, basic matrix operations yield a document-to-document similarity matrix. Using this similarity information, we cluster similar documents so that they are arranged together. Icon generation and bookmark construction then complete the application process. The following subsections elaborate on each of the key phases of the application process, as well as the considerations we made during the construction of our application prototype.

2.1 Text Extraction

Individual bookmark entries are first extracted from the Mozilla bookmarks.html file. Presently, we use regular expressions to obtain the title and URL of the target website, though future extensions should include a formal parser which can prevent lossy extraction by saving the complete set of metadata.
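As an illustration of this step, a minimal Python sketch of a regex-based extraction might look as follows. The paper gives no code; the pattern, the function name extract_entries, and the assumption that entries follow the standard Mozilla <DT><A HREF=...>title</A> format are ours.

    import re

    # Illustrative pattern for the standard Mozilla bookmark format:
    # <DT><A HREF="url" ...>title</A>
    BOOKMARK_RE = re.compile(r'<DT><A\s+HREF="([^"]+)"[^>]*>([^<]*)</A>',
                             re.IGNORECASE)

    def extract_entries(bookmarks_html):
        """Return (url, title) pairs for each bookmark entry in the file text."""
        return BOOKMARK_RE.findall(bookmarks_html)

    with open("bookmarks.html", encoding="utf-8") as f:
        entries = extract_entries(f.read())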
Each website specified by an entry is then fetched and the relevant text is extracted (at the present time, extraction is done manually). During text extraction, there is some concern about the presence of advertisements, as well as text in the form of different Unicode mappings. Advertisement text may distort the perceived relationship between documents, and Unicode may not be mapped to the correct text. These two issues warrant additional consideration in future development.

2.2 Latent Semantic Analysis

To use LSA, we first change the representation of the documents into that of a term-document matrix. This is simply a large frequency matrix whose rows are all possible words in the set of documents and whose columns give the number of occurrences of each word, if any, in each document. To improve the accuracy of our results, we preprocess the text during construction of the matrix. The first step of preprocessing is the stemming of words, using Porter's algorithm [3]. This maps a large number of word variations to a single root word; for example, "connections", "connection", "connecting", and "connected" can all be reduced to a single term. Next, a list of common English stop words is used as an exclusionary list [4]. These words, such as "a" and "and", add little or no description and fail to help with the formulation of document concepts. Following the construction of the term-document matrix, a number of weighting schemes may be applied (tf-idf, log, binary, etc.) [5]. The effectiveness of each is dependent on the nature of the dataset being used. For our documents, we found that taking log(TermDocument_{i,j} + 1) for each entry in the term-document matrix and then normalizing each document vector (column) resulted in the most effective weighting scheme.
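A minimal sketch of this construction, assuming NLTK's PorterStemmer stands in for the Porter algorithm of [3] and that the stop word list of [4] has been loaded into a Python set; the paper itself does not specify its implementation:

    import numpy as np
    from collections import Counter
    from nltk.stem import PorterStemmer  # assumed stand-in for Porter's algorithm [3]

    def term_document_matrix(docs, stopwords):
        """Build the log-weighted, column-normalized term-document matrix.

        docs: list of raw document strings; stopwords: set of excluded words.
        Assumes no document is empty after preprocessing.
        """
        stemmer = PorterStemmer()
        counts = []
        for text in docs:
            tokens = [stemmer.stem(w) for w in text.lower().split()
                      if w not in stopwords]
            counts.append(Counter(tokens))
        vocab = sorted(set().union(*counts))
        # Rows are terms, columns are documents; entries are raw frequencies.
        A = np.array([[c[t] for c in counts] for t in vocab], dtype=float)
        A = np.log(A + 1.0)                 # log(TermDocument_ij + 1) weighting
        A /= np.linalg.norm(A, axis=0)      # normalize each document (column) vector
        return A, vocab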
Once the processing of the term-document matrix has been completed, we use the SVD process as described by LSA to construct a lower-dimensional abstract semantic space approximating the original term-document matrix [6]. This is done by keeping only the n largest singular values during SVD. The choice of n here is critical in determining the accuracy of the result: too high results in overfitting, while too low fails to capture an accurate representation of the dataset. While determining a good size for n is an inherently difficult choice, our empirical results indicate that keeping a relatively low number of singular values (11 for 79 documents) is sufficient to generate accurate results. Once SVD has been completed, we can use basic matrix operations to generate term-to-term, term-to-document, or document-to-document similarity matrices (for additional details on LSA and SVD, see [2]).
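A sketch of the rank-lowering step and the document-to-document similarity computation, using NumPy's SVD; the variable names and the cosine formulation are our reading of the description above, not code from the paper:

    import numpy as np

    def doc_similarity(A, k=11):
        """Rank-k LSA on the term-document matrix A; returns doc-to-doc cosines."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        D = (np.diag(s[:k]) @ Vt[:k]).T     # one row per document in the k-dim space
        D /= np.linalg.norm(D, axis=1, keepdims=True)
        return D @ D.T                      # cosine similarity between documents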
2.3 Hierarchical Clustering

Once we obtain the document-to-document similarity matrix, we use single-linkage hierarchical clustering to obtain an ordering in which similar items are clustered together. We note that while we do not know the actual number of clusters present in the dataset, this is unnecessary, as we only wish to return the ordering. To do this, we first create a singleton cluster for each document, and then proceed to merge the two most similar clusters. This merging step is repeated until a single cluster containing all documents is formed. The ordering is saved during the clustering process and is used for icon generation as well as the organization of bookmark entries; a sketch of this step follows.
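A sketch of the clustering and ordering step using SciPy's single-linkage implementation; converting similarity to distance as 1 - similarity is our assumption (any monotone conversion yields the same single-linkage ordering):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, leaves_list
    from scipy.spatial.distance import squareform

    def cluster_order(sim):
        """Return a document ordering in which similar documents are adjacent."""
        dist = 1.0 - sim                    # similarity -> distance (assumed)
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="single")
        return leaves_list(Z)               # leaf order of the dendrogram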
Figure 1. Using color map for icon generation

2.4 Icon Generation

As clustering returns an ordering in which similar items are placed together, we use this information to generate Intelligent Icons in which similar documents are also visually similar. A linear color map is first created to provide a range of varying colors. Each document is then mapped equidistantly, according to the cluster order, onto the color map. The intuition is that more similar documents will have representative colorings which are more visually alike than those of dissimilar documents (see Figure 1). To construct the icon for each document, we first find a given document's n most similar neighbors, by performing a lookup in the document-to-document similarity matrix. Recall that each document can now be identified by a unique color, as a result of the color mapping process illustrated earlier. We then use the representative colors of the n most similar documents to fill in the icon in a left-to-right, top-to-bottom fashion, beginning with the most similar document. We note that as the choice of n dictates the level of granularity, it should be kept relatively low unless the true cluster number and size are known: in a dataset of many small clusters, if n is exorbitantly high, the representation shown in the icon may be overwhelmed by dissimilar documents.
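A sketch of the color mapping and icon filling; the blue-to-red ramp and the 2x2 grid of color cells are our illustration, since the paper does not specify its color map or icon dimensions:

    import numpy as np

    def icon_colors(order):
        """Map documents equidistantly onto a linear color map by cluster order."""
        pos = np.empty(len(order))
        pos[order] = np.linspace(0.0, 1.0, len(order))
        # A simple blue-to-red ramp stands in for the paper's linear color map.
        return np.stack([pos, np.zeros_like(pos), 1.0 - pos], axis=1)

    def make_icon(doc, sim, colors, n=4, grid=2, cell=8):
        """Fill a grid x grid icon with the colors of doc's n most similar
        neighbors, left to right, top to bottom, most similar first
        (the document itself ranks first, since its self-similarity is 1)."""
        neighbors = np.argsort(sim[doc])[::-1][:n]
        icon = np.zeros((grid * cell, grid * cell, 3))
        for i, nb in enumerate(neighbors):
            r, c = divmod(i, grid)
            icon[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell] = colors[nb]
        return icon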
2.5 Bookmarks.html Construction

In the last phase of the application process, the bookmarks.html file is reconstructed, and bookmark entries are arranged according to the ordering obtained from hierarchical clustering. We then use base64 encoding to convert each of the generated icons to a string representation. This string is embedded into the bookmark entry as its page icon. This visualization helps the user differentiate between similar and dissimilar bookmarks.
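A sketch of the encoding step, assuming Pillow for PNG serialization; the paper names neither an image library nor the exact attribute syntax, so the ICON attribute holding a base64 data URI shown here follows the Mozilla bookmark format rather than the prototype's actual output:

    import base64
    from io import BytesIO

    import numpy as np
    from PIL import Image  # assumed library; the paper does not name one

    def icon_attribute(icon):
        """Serialize an RGB icon array (values in [0, 1]) into a page-icon attribute."""
        img = Image.fromarray((icon * 255).astype(np.uint8))
        buf = BytesIO()
        img.save(buf, format="PNG")
        data = base64.b64encode(buf.getvalue()).decode("ascii")
        return 'ICON="data:image/png;base64,%s"' % data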
3. Experimental Results

To test the effectiveness of our methodology, we constructed a contrived but complete dataset of 79 bookmark entries, with each entry falling into one of 9 major categories. Figure 2 shows a screenshot of what an unorganized bookmark listing containing these entries may look like (during text processing, no ordering is maintained, as a result of Python's dictionary implementation). The category names before each bookmark entry in Figures 2 and 4 are only used to assist visualization of the dataset; titles are not used during LSA. Looking up individual bookmarks in such a listing is neither straightforward nor obvious.

For the experimental dataset, we manually extracted the text from each site and placed it into text files. Logarithmic weighting was applied, and the resulting term-document matrix was normalized. Singular value decomposition was then performed, keeping the 11 largest singular values. Once hierarchical clustering was complete, we constructed Intelligent Icons using the 4 most similar documents per icon.

To help visualize the result of LSA and Intelligent Icons, we projected the document-to-document similarity onto a 2D plot using Multi-Dimensional Scaling (Figure 3); the dissimilarity matrix used by Multi-Dimensional Scaling is derived by taking the square root of one minus each element of the document-to-document similarity matrix. We can immediately observe the differentiation between documents of varying topics, both in terms of spatial locality and icon color. The new bookmark file, complete with embedded page icons, is shown in Figure 4. The hierarchical clustering we used was able to accurately place bookmark entries of the same genre or topic together. The page icons for bookmark entries also proved to be valuable indicators of document similarity, as the icon colorings across different categories tend to have high contrast.
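A projection along the lines of Figure 3 could be produced as follows, with the dissimilarity defined exactly as above; scikit-learn's MDS is an assumed tool, as the paper does not say what produced the plot:

    import numpy as np
    from sklearn.manifold import MDS  # assumed tool; the paper does not name one

    def project_2d(sim):
        """2D layout of documents from the dissimilarity sqrt(1 - similarity)."""
        dissim = np.sqrt(np.clip(1.0 - sim, 0.0, None))
        mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
        return mds.fit_transform(dissim)  # one (x, y) point per document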
Figure 2. Sample screenshot of 79 unordered bookmark entries

Figure 3. Using MDS for visualization following LSA and Intelligent Icon generation

Figure 4. Reorganized bookmark entries with embedded page icons

4. Conclusion

We formulated an application aimed at improving bookmark usability by automatically organizing bookmark listings so that similar entries are grouped together. We first used LSA to perform information retrieval and to determine document-to-document similarity. Hierarchical clustering was then performed to group similar documents together, and Intelligent Icons were generated to help users visualize the data. Our experiment, conducted with 79 bookmark entries, demonstrates the effectiveness and overall improvement achieved by using our application process. The organized bookmark entries are easily identifiable by topic and provide a marked contrast to the original, unorganized listing.

References

[1] Eamonn Keogh, Kaushik Chakrabarti, Li Wei, Xiaopeng Xi, Stefano Lonardi. Intelligent Icons: Integrating Lite-Weight Visualization and Data Mining into Microsoft Windows Operating Systems.

[2] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41, pages 391-407, 1990.

[3] Martin F. Porter. An algorithm for suffix stripping. Program, Vol. 14, No. 3, pages 130-137, 1980.

[4] The Perseus Digital Library. Stopwords for the Perseus English Index. http://www.perseus.tufts.edu/texts/engstop.html

[5] Fridolin Wild. The lsa Package. http://cran.r-project.org/doc/packages/lsa.pdf

[6] InfoVis CyberInfrastructure. Latent Semantic Analysis. http://iv.slis.indiana.edu/sw/lsa.html