Organizing Internet Bookmarks using Latent Semantic Analysis and Intelligent Icons

Note: This file is a homework produced by two students for UCR CS235, Spring 06. In order to fully appreciate it, it may be helpful to read Intelligent Icons by Keogh et al. first. Eamonn 6/30/2006

Jin Shieh, Scott Sirowy
June 15, 2006

Abstract

An unorganized bookmark list is a common problem for many internet users. This lack of organization makes looking through entries both time consuming and tedious. We present an application which organizes Mozilla bookmark entries based on the contents of their target websites. We also incorporate Intelligent Icons into bookmark entries for a clear visualization of similarity.

1. Introduction

With the rise of news aggregators and social bookmarks, internet users have a greater means of locating and accessing sites of interest than ever before. Often, due to the overwhelming volume, bookmarks are saved in a haphazard manner, with little thought or organization. This makes looking up a specific bookmark at a later time a tedious and time-consuming task, likely requiring a sequential scan of nearly the entire bookmark listing.

Our solution is an application process capable of organizing a user's bookmark entries in an automatic as well as intuitive fashion. In order to organize bookmark entries, we must have a means of determining similarity between the contents of different websites (in the remaining text, we will refer to websites generically as documents). Through a technique called Latent Semantic Analysis (LSA) [2] we are able to associate each document with a set of concepts. Using this we can then determine the document-to-document similarity. Once document processing has been completed, we generate Intelligent Icons for each document entry to provide users with a convenient visualization aid [1].
Intelligent Icons allow the user to easily identify similar items and, to some extent, the depth of similarity. These generated icons are then encoded into the bookmark file as page icons.

2. Methodology and Considerations

The application process follows a series of intermediate steps. The bookmark file must first be parsed and the text representative of each bookmark entry must be extracted. A term-document matrix is then constructed and additional preprocessing is done to improve accuracy. LSA then takes this term-document matrix and performs singular value decomposition (SVD) for rank lowering. Once this is complete we can use basic matrix operations to compute a document-to-document similarity matrix. Using this similarity information, we then cluster similar documents so they are arranged together. Icon generation and bookmark construction complete the application process. The following subsections elaborate on each of the key phases of the application process as well as any considerations we made during the construction of our application prototype.

2.1 Text Extraction

Individual bookmark entries are first extracted from the Mozilla bookmarks.html file. Presently, we use regular expressions to obtain the title and URL of the target website, though future extensions should include a formal parser which can prevent lossy extraction by saving the complete set of metadata. Each website specified by an entry is then fetched and the relevant text is extracted.(1)

During text extraction, two issues are of some concern: the presence of advertisements, and text in different Unicode mappings. Advertisement text may distort the perceived relationship between documents, and Unicode may not be mapped to the correct text. These two issues warrant additional consideration in future development.

(1) At the present time, extraction is done manually.

2.2 Latent Semantic Analysis

To use LSA, we first change the representation of the documents into that of a term-document matrix. This is simply a large frequency matrix consisting of all possible words (rows) in the set of documents and the number of occurrences, if any, for each document (columns). To improve the accuracy of our results, we preprocess the text during construction of the matrix. The first step of preprocessing is the stemming of words, using Porter's algorithm [3]. This maps a large number of word variations to a single root word. For example, "connections", "connection", "connecting", and "connected" can all be reduced to a single term. Next, a list of common English stop words was used as an exclusionary list [4]. These words, such as "a" and "and", add little or no description and fail to help with the formulation of document concepts.

Following the construction of the term-document matrix, a number of weighting schemes may be applied (tf-idf, log, binary, etc.) [5]. The effectiveness of each is dependent on the nature of the dataset being used. For our documents we found that taking $\log(\mathrm{TermDocument}_{i,j} + 1)$ of each entry in the term-document matrix and then normalizing each document vector (column) resulted in the most effective weighting scheme.

Once the processing of the term-document matrix has been completed, we use the SVD process as described by LSA to construct a lower dimensional abstract semantic space approximating the original term-document matrix [6]. This is done by keeping only the n largest singular values during SVD. The choice of n here is critical in determining the accuracy of the result (too high a value results in overfitting, and too low a value fails to capture an accurate representation of the dataset). While determining a good size for n is an inherently difficult choice, our empirical results indicate that keeping a relatively low number of singular values (11 for 79 documents) is sufficient to generate accurate results. Once SVD has been completed, we can use basic matrix operations to generate term-to-term, term-to-document, or document-to-document similarity matrices (for additional details on LSA and SVD see [2]).

2.3 Hierarchical Clustering

Once we obtain the document-to-document similarity matrix, we use single linkage hierarchical clustering to obtain an ordering where similar items are clustered together. We note that while we do not know the actual number of clusters present in the dataset, knowing it is unnecessary, as we only wish to return the ordering. To do this we first create a singleton cluster for each document, and then proceed to merge the two most similar clusters. This merging step is repeated until a single cluster containing all documents is formed. The ordering is saved during the clustering process and is used for icon generation as well as organization of bookmark entries.

Figure 1. Using a color map for icon generation
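The merging procedure of Section 2.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `single_linkage_order`, the list-of-lists similarity matrix, and the choice to record the ordering by concatenating the member lists of merged clusters are all hypothetical. The paper says only that the ordering is saved during clustering.

```python
def single_linkage_order(sim):
    """Return a document ordering from single-linkage agglomerative clustering.

    sim is a symmetric document-to-document similarity matrix (list of lists)
    in which larger values mean "more similar". Hypothetical helper.
    """
    # Start with one singleton cluster per document.
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None  # (similarity, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: two clusters are as similar as their
                # most similar pair of members.
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        # Assumption: concatenating member lists keeps merged documents
        # adjacent, which yields the ordering once one cluster remains.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters[0] if clusters else []


# Toy matrix: documents 0 and 2 are close, as are documents 1 and 3.
example_sim = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.1],
    [0.1, 0.8, 0.1, 1.0],
]
print(single_linkage_order(example_sim))  # [0, 2, 1, 3]
```

The naive search over all cluster pairs is cubic in the number of documents, which is acceptable at the scale of a personal bookmark list; a larger collection would call for a proper linkage implementation.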
2.4 Icon Generation

As clustering returns an ordering where similar items are placed together, we use this information to generate Intelligent Icons where similar documents are also visually similar. A linear color map is first created to provide a range of varying colors. Each document is then equidistantly mapped, according to the cluster order, onto the color map. The intuition is that more similar documents will have representative colors that are more visually alike than those of dissimilar documents (see Figure 1).

To construct the icon for each document, we first find the document's n most similar neighbors (by performing a lookup in the document-to-document similarity matrix). Recall that each document can now be identified by a unique color, as a result of the color mapping process described above. We then use the representative colors of the n most similar documents to fill in the icon in a left-to-right, top-to-bottom fashion, beginning with the most similar document. Because the choice of n dictates the level of granularity, it should be kept relatively low unless the true number and size of the clusters is known. In a dataset of many small clusters, an excessively high n may cause the representation shown in the icon to be overwhelmed by dissimilar documents.

2.5 Bookmarks.html Construction

In the last phase of the application process, the bookmarks.html file is reconstructed and bookmark entries are arranged according to the ordering obtained from hierarchical clustering. We then use base64 encoding to convert each of the generated icons to a string representation. This string is then embedded into the bookmark entry as its page icon. This visualization helps the user differentiate between similar and dissimilar bookmarks.

3. Experimental Results

To test the effectiveness of our methodology, we constructed a contrived but complete dataset of 79 bookmark entries, with each entry falling in one of 9 major categories. Figure 2 shows a screenshot of what an unorganized bookmark listing containing these entries may look like.(2) Looking up individual bookmarks in such a listing is neither straightforward nor obvious.(3)

For the experimental dataset we manually extracted the text from each site and placed it into text files. Logarithmic weighting was applied and the resulting term-document matrix was normalized. Singular value decomposition was then performed, keeping the 11 largest singular values. Once hierarchical clustering was complete we constructed Intelligent Icons using the 4 most similar documents per icon.

To help visualize the result of LSA and Intelligent Icons, we projected the document-to-document similarity(4) onto a 2D plot using Multi-Dimensional Scaling (Figure 3). We can immediately observe the differentiation between documents of varying topics, both in terms of spatial locality and icon color.

The new bookmark file, complete with embedded page icons, is shown in Figure 4. The hierarchical clustering we used was able to accurately place bookmark entries with the same genre or topic together. The page icons for bookmark entries also proved to be valuable indicators of document similarity, as the icon colorings across different categories tend to have high contrast.

(2) During text processing, no ordering is maintained as a result of Python's dictionary implementation.
(3) The category names before each bookmark entry in Figures 2 and 4 are used only to assist visualization of the dataset. Titles are not used during LSA.
(4) The dissimilarity matrix used by Multi-Dimensional Scaling is derived by taking the square root of one minus each element of the document-to-document similarity matrix.
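Two of the small computations used in this experiment, the lookup of each document's most similar neighbors for its icon and the dissimilarity of footnote 4, can be sketched as follows. This is a hedged illustration, not the authors' code: the function names `most_similar` and `to_dissimilarity` are hypothetical, the similarity matrix is assumed to be a plain list of lists, and the clamp at zero is an added guard against rounding error that the paper does not mention.

```python
import math


def most_similar(sim, doc, n):
    """Indices of the n documents most similar to doc, most similar first.

    Hypothetical helper; the document itself is excluded from its neighbors.
    """
    others = [j for j in range(len(sim)) if j != doc]
    others.sort(key=lambda j: sim[doc][j], reverse=True)
    return others[:n]


def to_dissimilarity(sim):
    """Footnote 4: square root of one minus each similarity entry.

    Assumption: values are clamped at zero so that a similarity that rounds
    slightly above 1 cannot produce a negative argument to sqrt.
    """
    return [[math.sqrt(max(0.0, 1.0 - s)) for s in row] for row in sim]


# Toy similarity matrix for three documents.
example_sim = [
    [1.0, 0.75, 0.0],
    [0.75, 1.0, 0.19],
    [0.0, 0.19, 1.0],
]
print(most_similar(example_sim, 2, 2))        # [1, 0]
print(to_dissimilarity(example_sim)[0][1])    # 0.5
```

In the experiment above n would be 4, and the resulting dissimilarity matrix would be passed to whichever Multi-Dimensional Scaling routine is available.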
Figure 2. Sample screenshot of 79 unordered bookmark entries

Figure 3. Using MDS for visualization following LSA and Intelligent Icon generation

Figure 4. Reorganized bookmark entries with embedded page icons

4. Conclusion

We formulated an application aimed at improving bookmark usability by automatically organizing bookmark listings so that similar entries are grouped together. We first used LSA to perform information retrieval and to determine document-to-document similarity. Hierarchical clustering was then performed to group similar documents together, and Intelligent Icons were generated to help users visualize the data. Our experiment, conducted with 79 bookmark entries, demonstrates the effectiveness and overall improvement achieved by using our application process. The organized bookmark entries are easily identifiable by topic and provide a marked contrast to the original, unorganized listing.

References

[1] Eamonn Keogh, Kaushik Chakrabarti, Li Wei, Xiaopeng Xi, Stefano Lonardi. Intelligent Icons: Integrating Lite-Weight Visualization and Data Mining into Microsoft Windows Operating Systems.
[2] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41.
[3] Martin F. Porter. An algorithm for suffix stripping. Program, Vol. 14, no. 3.
[4] The Perseus Digital Library. Stopwords for the Perseus English Index.
[5] Fridolin Wild. The lsa Package.
[6] InfoVis CyberInfrastructure. Latent Semantic Analysis.
A Text-Mining-based Patent Analysis in Product Innovative Process Liang Yanhong, Tan Runhua Abstract Hebei University of Technology Patent documents contain important technical knowledge and research results.
More informationSearch Results Clustering in Polish: Evaluation of Carrot
Search Results Clustering in Polish: Evaluation of Carrot DAWID WEISS JERZY STEFANOWSKI Institute of Computing Science Poznań University of Technology Introduction search engines tools of everyday use
More informationUnsupervised learning, Clustering CS434
Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,
More informationPlanar Point Location
C.S. 252 Prof. Roberto Tamassia Computational Geometry Sem. II, 1992 1993 Lecture 04 Date: February 15, 1993 Scribe: John Bazik Planar Point Location 1 Introduction In range searching, a set of values,
More informationClustering. Bruno Martins. 1 st Semester 2012/2013
Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts
More informationA Content Vector Model for Text Classification
A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.
More informationWeek 7 Picturing Network. Vahe and Bethany
Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups
More informationModule 5. Function-Oriented Software Design. Version 2 CSE IIT, Kharagpur
Module 5 Function-Oriented Software Design Lesson 12 Structured Design Specific Instructional Objectives At the end of this lesson the student will be able to: Identify the aim of structured design. Explain
More informationCHAPTER 3 ASSOCIATON RULE BASED CLUSTERING
41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationA Novel Method for Activity Place Sensing Based on Behavior Pattern Mining Using Crowdsourcing Trajectory Data
A Novel Method for Activity Place Sensing Based on Behavior Pattern Mining Using Crowdsourcing Trajectory Data Wei Yang 1, Tinghua Ai 1, Wei Lu 1, Tong Zhang 2 1 School of Resource and Environment Sciences,
More informationThe Design and Implementation of an Intelligent Online Recommender System
The Design and Implementation of an Intelligent Online Recommender System Rosario Sotomayor, Joe Carthy and John Dunnion Intelligent Information Retrieval Group Department of Computer Science University
More informationData Clustering Hierarchical Clustering, Density based clustering Grid based clustering
Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms
More informationBrowsing Heterogeneous Document Collections by a Segmentation-free Word Spotting Method
Browsing Heterogeneous Document Collections by a Segmentation-free Word Spotting Method Marçal Rusiñol, David Aldavert, Ricardo Toledo and Josep Lladós Computer Vision Center, Dept. Ciències de la Computació
More informationA Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning
A Metadatabase System for Semantic Image Search by a Mathematical Model of Meaning Yasushi Kiyoki, Takashi Kitagawa and Takanari Hayama Institute of Information Sciences and Electronics University of Tsukuba
More informationCollaborative Filtering Recommender System
International Journal of Emerging Trends in Science and Technology Collaborative Filtering Recommender System Authors Anvitha Hegde 1, Savitha K Shetty 2 1 M.Tech, Dept. of ISE, MSRIT, Bangalore 2 Assistant
More informationA BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK
A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationA Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS)
A Spectral-based Clustering Algorithm for Categorical Data Using Data Summaries (SCCADDS) Eman Abdu eha90@aol.com Graduate Center The City University of New York Douglas Salane dsalane@jjay.cuny.edu Center
More informationCMSC 476/676 Information Retrieval Midterm Exam Spring 2014
CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 Name: You may consult your notes and/or your textbook. This is a 75 minute, in class exam. If there is information missing in any of the question
More informationMachine Learning HW4
Machine Learning HW4 Unsupervised Clustering & Dimensionality Reduction TAs: ml2016ta@gmailcom National Taiwan University November 17, 2016 ML2016 TAs Machine Learning HW4 November 17, 2016 1 / 19 1 Task
More informationBing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web
More informationImproving Suffix Tree Clustering Algorithm for Web Documents
International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal
More informationKeywords: clustering algorithms, unsupervised learning, cluster validity
Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based
More information