Latent Semantic Indexing

Latent Semantic Indexing (thanks to Ian Soboroff)

Issues with the Vector Space Model
- Assumes terms are independent, but some terms are likely to appear together: synonyms, related words, spelling mistakes?
- Terms can have different meanings depending on context.
- The term-document matrix has very high dimensionality: are there really that many important features for each document and term?

Latent Semantic Indexing
Compute the singular value decomposition of the term-document matrix M of weights w_td:

    M (t x d)  =  T (t x r)  Σ (r x r)  D^T (r x d)

- D is a representation of M in r dimensions.
- T is a matrix for transforming new documents.
- The diagonal matrix Σ gives the relative importance of the dimensions.
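As a concrete illustration of the decomposition above, here is a minimal NumPy sketch. The toy matrix, the fake vocabulary, and the choice r = 2 are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Toy term-document matrix M (terms x documents), e.g. raw term counts.
# Both the matrix and the rank r = 2 are made up for illustration.
M = np.array([
    [1., 0., 1., 0.],   # term 1
    [1., 1., 0., 0.],   # term 2
    [0., 1., 1., 1.],   # term 3
    [0., 0., 1., 1.],   # term 4
])

r = 2  # number of LSI dimensions to keep

# Full SVD: M = U diag(s) Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Truncate to r dimensions: T is t x r, Sigma is r x r, D is d x r.
T = U[:, :r]                # the slide's T (term matrix)
Sigma = np.diag(s[:r])      # the slide's Σ (importance of dimensions)
D = Vt[:r, :].T             # the slide's D (document matrix)

# The rank-r product T Σ D^T approximates the original weights w_td.
M_r = T @ Sigma @ D.T
print(np.round(M_r, 2))
```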

LSI Term Matrix T
- The T matrix gives a vector for each term in LSI space.
- For a new document c, c*T gives a new row in D; that is, the new document is folded in to the LSI space.
- LSI is a rotation of the term space: in the original matrix terms are d-dimensional; the new space has lower dimensionality.
- The dimensions are groups of terms that tend to co-occur in the same documents: synonyms, contextually related words, variant endings.

[Three figure slides: example plots of stemmed terms (bound, numb, flow, lay, effect, pressur, mach, body, ratio, wing, solut, jet, method) positioned in the LSI space.]

Singular Values
- Σ gives an ordering to the dimensions.
- The values tend to drop off very quickly.
- Singular values at the tail represent "noise".
- Cutting off low-value dimensions reduces noise and can improve performance.
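The lecture does not prescribe how to choose the cutoff. One common heuristic, sketched below under that assumption, is to keep just enough dimensions to cover a fixed fraction of the squared singular value mass.

```python
import numpy as np

def choose_rank(singular_values, coverage=0.9):
    """Smallest r whose leading singular values cover the given fraction
    of the total squared singular value mass."""
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    return int(np.searchsorted(energy, coverage)) + 1

s = np.array([12.0, 7.5, 1.2, 0.4, 0.1])  # typical fast drop-off
print(choose_rank(s))  # -> 2; the tail dimensions are treated as noise
```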

Truncating Dimensions in LSI
[Figure slide illustrating truncation of the LSI dimensions.]

Document Matrix D
- The D matrix gives the coordinates of the documents in LSI space.
- Document vectors have the same dimensionality as the T vectors.
- We can compute the similarity between a term and a document.
- http://lsi.research.telcordia.com/
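Because rows of T (terms) and rows of D (documents) live in the same r-dimensional space, a term and a document can be compared directly. The sketch below reuses T, Sigma, and D from the NumPy example above and scales both sides by the square root of Σ before taking a cosine; that scaling is one common convention, not something the slide specifies.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def term_doc_similarity(T, Sigma, D, term_idx, doc_idx):
    # Scale both sides by sqrt(Σ); an assumed convention, not from the slide.
    scale = np.sqrt(np.diag(Sigma))
    return cosine(T[term_idx] * scale, D[doc_idx] * scale)

# Using T, Sigma, D from the SVD sketch above:
print(round(term_doc_similarity(T, Sigma, D, term_idx=0, doc_idx=2), 3))
```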

Improved Retrieval with LSI
- New documents and queries are "folded in": multiply the vector by T Σ^-1.
- Compute similarity for ranking as in the VSM: compare queries and documents by dot product.
- Improvements come from the reduction of noise:
  - no need to stem terms (variants will co-occur)
  - no need for a stop list: stop words are used uniformly throughout the collection, so they tend to appear in the first dimension
- No speed or space gains, though.
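Continuing the NumPy sketch from earlier (reusing T, Sigma, and D), folding in a query and ranking the documents might look like the following; the query vector and helper names are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def fold_in(vec, T, Sigma):
    """Project a raw term vector into LSI space by multiplying by T Σ^-1."""
    return vec @ T @ np.linalg.inv(Sigma)

def rank_documents(query_counts, T, Sigma, D):
    q = fold_in(query_counts, T, Sigma)
    # Cosine between the folded-in query and each document row of D.
    scores = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
    return np.argsort(-scores), scores

# Illustrative query over the same 4-term vocabulary as the sketch above.
query = np.array([1.0, 0.0, 1.0, 0.0])
order, scores = rank_documents(query, T, Sigma, D)
print(order, np.round(scores, 3))
```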

LSI in TREC-3
- The LSI space was computed from a sample of the document collection.
- Documents and queries were folded into the LSI space for comparison.
- Improvement in average precision (AP) with LSI: 5%.
- Improvements of up to 20% have been seen in smaller collections.

Other LSI Applications
- Text classification:
  - by topic (dimension reduction is good for clustering)
  - by language (languages have their own stop words)
  - by writing style
- Information filtering
- Cross-language retrieval

N-gram Indexing Recap
- Index all n-character sequences.
- Language-independent, resistant to noisy text, no stemming, easy to do.
- A document becomes an array of n-gram frequencies.
- Example: sliding a window of n = 5 characters over the string "Hello World".
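A small sketch of the character n-gram indexing described above, with n = 5; the helper name and the use of a Counter are my own choices, not the slide's.

```python
from collections import Counter

def char_ngrams(text, n=5):
    """Return frequencies of all n-character sequences in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

print(char_ngrams("Hello World", n=5))
# Counter({'Hello': 1, 'ello ': 1, 'llo W': 1, 'lo Wo': 1,
#          'o Wor': 1, ' Worl': 1, 'World': 1})
```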

Why N-grams?
- N-grams capture pairs of words, bringing out phraseology and word choice.
- LSI using n-grams might cluster documents by writing style and/or author: a lot of what makes up style is word choice and stop-word usage.
- Small experiment: three biblical Hebrew texts (Ecclesiastes, Song of Songs, Book of Daniel), using 3-grams in the original Hebrew.

Conclusion
- LSI can be a useful technique for reducing the dimensionality of an IR problem.
- The reduction can improve effectiveness, and it can find surprising relationships!
- SVD can be expensive to compute on large matrices.
- Available tools for working with LSI: MATLAB or Octave (small data sets only), SMART (an IR system) with SVDPACK.