A Simple Search Engine
Benjamin Roth
CIS LMU

Document Collection for Search Engine
Now that we can represent single documents, let's represent a collection of documents for search. What does such a class for representing a document collection need?
Information to store?
Functionality?

Document Collection for Search Engine
What does a class need for representing a document collection for search?
Information to store:
- The documents themselves, accessible via an id.
- An inverted index: a map from each term to all documents containing that term (for efficiently finding all potentially relevant documents).
- The document frequency for each term (the number of documents in which it occurs), to be used in the similarity computation.
Functionality:
- Read documents (from a directory).
- Return (all) documents that contain (all) terms of a query.
- Reweight token frequencies by tf-idf weighting.
- Compute the cosine similarity of two documents.

Document Collection (Code Skeleton)

class DocumentCollection:
    def __init__(self, term_to_df, term_to_docids, docid_to_doc):
        ...

    @classmethod
    def from_dir(cls, root_dir, file_suffix):
        ...

    @classmethod
    def from_document_list(cls, docs):
        ...

    def docs_with_all_tokens(self, tokens):
        ...

    def tfidf(self, counts):
        ...

    def cosine_similarity(self, doca, docb):
        ...
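A hypothetical usage of this skeleton (the directory path, file suffix, and query terms below are invented for illustration; the method bodies are filled in on the following slides and in the lecture repository):

collection = DocumentCollection.from_dir("data/corpus", ".txt")
matches = collection.docs_with_all_tokens(["vector", "space"])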

Detail: Constructor
Set all the required data fields.

def __init__(self, term_to_df, term_to_docids, docid_to_doc):
    # string to int
    self.term_to_df = term_to_df
    # string to set of string
    self.term_to_docids = term_to_docids
    # string to TextDocument
    self.docid_to_doc = docid_to_doc
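The classmethod bodies are not shown on these slides. As a rough sketch only (not the actual implementation), a builder along the lines of from_document_list could compute the three fields and then call this constructor; TextDocument objects are assumed here to expose an id and a token_counts dictionary, as used later in this lecture:

def build_collection(docs):
    # Sketch: compute document frequencies and the inverted index,
    # then construct a DocumentCollection from them.
    term_to_df = {}
    term_to_docids = {}
    docid_to_doc = {}
    for doc in docs:
        docid_to_doc[doc.id] = doc
        for term in doc.token_counts:
            # each document is counted at most once per term
            term_to_df[term] = term_to_df.get(term, 0) + 1
            term_to_docids.setdefault(term, set()).add(doc.id)
    return DocumentCollection(term_to_df, term_to_docids, docid_to_doc)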

Detail: Get all documents containing all search terms

def docs_with_all_tokens(self, tokens):
    docids_for_each_token = [self.term_to_docids[token]
                             for token in tokens]
    docids = set.intersection(*docids_for_each_token)
    return [self.docid_to_doc[id] for id in docids]

What does docids_for_each_token contain?
What is contained in docids?
How can we get all documents that contain any of the search terms?
Bonus: What could be (roughly) the time complexity of set.intersection(...)?

Detail: Get all documents containing all search terms
What does docids_for_each_token contain?
- A list of sets of document ids (one set for each search term).
What is contained in docids?
- The intersection of the above sets: the ids of those documents that contain all terms.
How can we get all documents that contain any of the search terms?
- Use set union instead of intersection.
Bonus: What could be (roughly) the time complexity of set.intersection(...)?
- A simple algorithm would be: for each document id in any of the sets, check whether it is contained in all of the other sets; if yes, add it to the result set. You can assume that checking set membership and adding to a set take constant time. Complexity: O(nm), where n is the number of search terms and m is the total number of document ids in all sets. A more efficient algorithm would use sorted lists of document ids (posting lists), as sketched below.
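As mentioned in the last point, a more efficient intersection works on sorted posting lists. A minimal sketch of merging two sorted lists of document ids in linear time (illustrative only, not part of the DocumentCollection class):

def intersect_postings(p1, p2):
    # Both inputs are sorted lists of document ids; runs in O(len(p1) + len(p2)).
    result = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# Example: intersect_postings([1, 3, 5, 8], [2, 3, 8, 9]) returns [3, 8].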

Detail: Tf.Idf Weighting

def tfidf(self, counts):
    N = len(self.docid_to_doc)
    return {tok: tf * math.log(N / self.term_to_df[tok])
            for tok, tf in counts.items() if tok in self.term_to_df}

Input (dictionary): term -> count of the term in the document
Output (dictionary): term -> weighted count
Remember the formulas:
- Term frequency is just the number of occurrences of the term (we use the simple, unnormalized version).
- Inverse document frequency: log(N / df_t), where N is the size of the document collection and df_t is the number of documents term t occurs in.
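A small worked example (all numbers made up): with a collection of N = 100 documents, a term occurring in 10 of them gets idf = log(100/10) ≈ 2.30 (natural log), while a term occurring in every document gets idf = log(1) = 0.

import math

# Toy values, invented for illustration.
N = 100
term_to_df = {"engine": 10, "the": 100}
counts = {"engine": 3, "the": 7}

weighted = {tok: tf * math.log(N / term_to_df[tok])
            for tok, tf in counts.items() if tok in term_to_df}
# "engine": 3 * log(10) ≈ 6.91;  "the": 7 * log(1) = 0.0
print(weighted)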

Detail: Cosine Similarity

def cosine_similarity(self, doca, docb):
    weighteda = self.tfidf(doca.token_counts)
    weightedb = self.tfidf(docb.token_counts)
    dotab = dot(weighteda, weightedb)
    norma = math.sqrt(dot(weighteda, weighteda))
    normb = math.sqrt(dot(weightedb, weightedb))
    if norma == 0 or normb == 0:
        return 0.
    else:
        return dotab / (norma * normb)

Input (dictionaries): term frequencies of two documents.
Output: cosine similarity of the tf.idf-weighted document vectors.
What would the dot helper function look like?
What is the meaning of norma and normb?
When can norma or normb be zero?

Detail: Cosine Similarity
What would the dot helper function look like?

def dot(dicta, dictb):
    return sum([dicta.get(tok) * dictb.get(tok, 0)
                for tok in dicta])

What is the meaning of norma and normb?
- The vector norm (l2). It is defined as the square root of the dot product of a vector with itself: ||v||_2 = sqrt(sum_i v_i^2). Intuitively it measures the length of a document, and is high if a document contains many terms.
When can norma or normb be zero?
- When a query only contains out-of-vocabulary words (tfidf(...) filters those words out).
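Putting the helper and the norms together, a toy end-to-end computation (the two weighted vectors are written by hand rather than produced by tfidf; dot is restated so the snippet runs on its own):

import math

def dot(dicta, dictb):
    return sum(dicta[tok] * dictb.get(tok, 0) for tok in dicta)

a = {"search": 2.0, "engine": 1.0}
b = {"search": 1.0, "index": 3.0}

dot_ab = dot(a, b)                 # only "search" overlaps: 2.0 * 1.0 = 2.0
norm_a = math.sqrt(dot(a, a))      # sqrt(5)  ≈ 2.236
norm_b = math.sqrt(dot(b, b))      # sqrt(10) ≈ 3.162
print(dot_ab / (norm_a * norm_b))  # ≈ 0.283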

Putting it all together: Search Engine
Most of the functionality is already contained in the DocumentCollection class. The search engine only has to:
- Preprocess (tokenize) the query.
- Call the respective methods (e.g. docs_with_all_tokens, cosine_similarity).
- Sort the results to put the most similar results first.
- Select some text snippets for displaying to the user.

Search Engine: Code Skeleton

class SearchEngine:
    def __init__(self, doc_collection):
        ...

    def ranked_documents(self, query):
        ...

    def snippets(self, query, document, window=50):
        ...

See the full implementation in the lecture repository.
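Only as an illustration of the steps listed on the previous slide, a rough sketch of what ranked_documents might do; the tokenize helper and the TextDocument constructor used for the query are assumptions, not the code from the repository:

def ranked_documents(self, query):
    # Sketch: tokenize the query, fetch candidate documents,
    # score them by cosine similarity, and sort best-first.
    query_tokens = tokenize(query)          # assumed helper
    query_doc = TextDocument(query)         # assumed constructor
    candidates = self.doc_collection.docs_with_all_tokens(query_tokens)
    scored = [(self.doc_collection.cosine_similarity(query_doc, doc), doc)
              for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]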

Summary
Representing:
- Text documents
- Document collections
- Factory method constructors
Retrieving documents:
- Computing similarity
- ...
Questions?