Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval


Naïve Implementation
- Convert all documents in the collection D to tf-idf-weighted vectors d_j over the keyword vocabulary V.
- Convert the query to a tf-idf-weighted vector q.
- For each d_j in D, compute the score s_j = cosSim(d_j, q).
- Sort the documents by decreasing score and present the top-ranked documents to the user.
- Time complexity: O(|V| |D|). Bad for large V and D! With |V| = 10,000 and |D| = 100,000, |V| |D| = 1,000,000,000.
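To make the cost concrete, here is a minimal sketch of the naïve approach in Python. The dense vector representation, the cosine_similarity helper, and the naive_retrieval function are illustrative assumptions of this example, not part of the original slides.

import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(dj * qj for dj, qj in zip(d, q))
    d_len = math.sqrt(sum(x * x for x in d))
    q_len = math.sqrt(sum(x * x for x in q))
    if d_len == 0 or q_len == 0:
        return 0.0
    return dot / (d_len * q_len)

def naive_retrieval(doc_vectors, query_vector, k=10):
    """Score every document against the query: O(|V| * |D|) work overall."""
    scores = [(doc_id, cosine_similarity(vec, query_vector))
              for doc_id, vec in doc_vectors.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]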

Practical Implementation
- Based on the observation that documents containing none of the query keywords do not affect the final ranking.
- Try to identify only those documents that contain at least one query keyword.
- Achieved through the actual implementation of an inverted index.

Step 1: Preprocessing
- Implement the preprocessing functions for tokenization, stop-word removal, and stemming (a sketch follows below).
- Input: documents, read one by one from the collection.
- Output: tokens to be added to the index (no punctuation, no stop words, stemmed).
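A minimal preprocessing sketch along these lines is given here. The regular-expression tokenizer, the tiny stop-word list, and the crude suffix-stripping stemmer are illustrative placeholders; a real system would use a fuller stop list and a Porter-style stemmer.

import re

# tiny illustrative stop-word list, not a realistic one
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "is", "are"}

def stem(token):
    """Very crude suffix stripping; stands in for a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # tokenization: keep alphanumeric runs
    return [stem(t) for t in tokens if t not in STOP_WORDS]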

Step 2: Indexing
- Build an inverted index, with an entry for each word in the vocabulary.
- Input: tokens obtained from the preprocessing module.
- Output: an inverted index for fast access.

Step 2 (cont'd)
- Many data structures are appropriate for fast access: B-trees, sparse lists, hashtables.
- We need one entry for each word in the vocabulary.
- For each such entry, keep a list of all the documents where the word appears, together with the corresponding frequency (TF).
- For each such entry, also keep the total number of documents in which the word occurs (its DF, from which the IDF is computed). A sketch of this structure follows below.
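A minimal sketch of such an index, assuming the preprocess function from Step 1 and a simple dict-of-dicts postings representation (both are choices of this example, not fixed by the slides):

from collections import defaultdict, Counter

def build_index(docs):
    """docs: mapping doc_id -> raw text.
    Returns (index, doc_freq):
      index[term][doc_id] = term frequency (TF) of `term` in that document
      doc_freq[term]      = number of documents containing `term` (DF)
    """
    index = defaultdict(dict)
    doc_freq = Counter()
    for doc_id, text in docs.items():
        counts = Counter(preprocess(text))        # TFs for this document, computed in one pass
        for term, tf in counts.items():
            index[term][doc_id] = tf
            doc_freq[term] += 1                   # one more document contains this term
    return index, doc_freq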

Step 2 (cont'd)

    Index terms   df    D_j, tf_j
    computer       3    D7, 4
    database       2    D1, 3
    science        4    D2, 4
    system         1    D5, 2

The term entries with their df values form the index file; the (D_j, tf_j) pairs form the postings lists.

Step 2 (cont'd)
- Term frequencies and DF for each token can be computed in one pass.
- Cosine similarity also requires the lengths of the document vectors.
- A second pass (through the document collection or the inverted index) may be needed to compute the document vector lengths.

Step 2 (cont'd)
- Remember that the weight of a token is TF * IDF.
- Therefore, we must wait until the IDFs are known (and hence until all documents are indexed) before document lengths can be determined.
- Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens.
- Do a second pass over all documents: keep a list or hashtable with all document ids and, for each document, determine the length of its vector (see the sketch below).
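A minimal sketch of this second pass over the inverted index, reusing the index and doc_freq structures from the previous sketch and assuming the common idf = log(N / df) variant (the slides only say TF * IDF, so the exact idf formula is an assumption):

import math
from collections import defaultdict

def compute_doc_lengths(index, doc_freq, num_docs):
    """Second pass: length of each document's tf-idf vector,
    i.e. sqrt of the sum over its terms of (tf * idf)^2."""
    sq_lengths = defaultdict(float)
    for term, postings in index.items():
        idf = math.log(num_docs / doc_freq[term])   # assumed idf variant
        for doc_id, tf in postings.items():
            w = tf * idf
            sq_lengths[doc_id] += w * w
    return {doc_id: math.sqrt(s) for doc_id, s in sq_lengths.items()}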

Time Complexity of Indexing
- The complexity of creating the vector and indexing a document of n tokens is O(n), so indexing m such documents is O(m n).
- Computing token IDFs can be done during the same first pass.
- Computing vector lengths is also O(m n).
- The complete process is O(m n), which is also the complexity of just reading in the corpus.

Step 3: Retrieval
- Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words.
- Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one.
- To accumulate a total score for each retrieved document, store the retrieved documents in a hashtable, where the document id is the key and the partial accumulated score is the value (see the sketch below).
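A minimal term-at-a-time sketch of this step, reusing the preprocess, index, doc_freq, and doc_lengths names from the earlier sketches (all of them assumptions of this example rather than anything fixed by the slides):

import math
from collections import defaultdict, Counter

def retrieve(query, index, doc_freq, doc_lengths, num_docs):
    """Accumulate partial dot products term by term; only documents that share
    at least one query term ever enter the hashtable of scores."""
    scores = defaultdict(float)                 # doc_id -> accumulated dot product
    q_tf = Counter(preprocess(query))
    q_sq_len = 0.0
    for term, tf_q in q_tf.items():
        if term not in index:
            continue
        idf = math.log(num_docs / doc_freq[term])
        w_q = tf_q * idf
        q_sq_len += w_q * w_q
        for doc_id, tf_d in index[term].items():
            scores[doc_id] += w_q * (tf_d * idf)
    q_len = math.sqrt(q_sq_len) or 1.0
    # normalize the accumulated dot products to cosine similarities
    return {d: s / (doc_lengths[d] * q_len) for d, s in scores.items()}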

Step 3 (cont'd)
- Input: the query and the inverted index (from Step 2).
- Output: similarity values between the query and the documents.

Step 4: Ranking
- Sort the hashtable of retrieved documents by their cosine similarity values.
- Return the documents in descending order of relevance.
- Input: similarity values between the query and the documents.
- Output: a ranked list of documents in descending order of their relevance.
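The ranking step is then just a sort of the score hashtable produced by the retrieval sketch above (names reused from that example):

def rank(scores, k=10):
    """Sort (doc_id, similarity) pairs by decreasing similarity and return the top k."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:k]

For example, rank(retrieve("computer science databases", index, doc_freq, doc_lengths, num_docs)) would return the top ten (doc_id, score) pairs for that query.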

What Weighting Methods?
- Weights are applied to both document terms and query terms.
- They have a direct impact on the final ranking, on the results, and on the quality of the IR system.
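For concreteness, the weighting used throughout the sketches above is the classic tf-idf scheme; this particular idf form is one common variant among several, not the only reasonable choice:

import math

def tfidf_weight(tf, df, num_docs):
    """w = tf * log(N / df): one standard tf-idf weighting variant."""
    return tf * math.log(num_docs / df)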

Standard Evaluation Measures
Starts with a CONTINGENCY table for each query:

                     retrieved       not retrieved
    relevant            TP               FN          n1 = TP + FN
    not relevant        FP               TN
                     n2 = TP + FP                     N

Here n1 is the number of relevant documents, n2 is the number of retrieved documents, and N is the total number of documents.

Precision and Recall
- Recall: of all the documents that are relevant out there, how many did the IR system retrieve? Recall = TP / (TP + FN)
- Precision: of all the documents that are retrieved by the IR system, how many are relevant? Precision = TP / (TP + FP)
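A minimal sketch of these two set-based measures, computed from a retrieved set and a relevant set of document ids (the function name and signature are illustrative):

def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of document ids."""
    tp = len(retrieved & relevant)                         # relevant documents that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0  # TP / (TP + FP)
    recall = tp / len(relevant) if relevant else 0.0       # TP / (TP + FN)
    return precision, recall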