ΕΠΛ660. Retrieval with the Vector Space Model


Today's question: Typically we want to retrieve the top K docs (in the cosine ranking for the query), not to totally order all docs in the corpus. Can we pick off the docs with the K highest cosines?

Efficient cosine ranking: find the K docs in the corpus nearest to the query, i.e., the K largest query-doc cosines. Efficient ranking means (1) computing a single cosine efficiently and (2) choosing the K largest cosine values efficiently. Can we do this without computing all N cosines? (N = number of documents in the collection.)

Basic cosine score computation
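A minimal Python sketch of this term-at-a-time computation, with one score accumulator per matching document. The `postings` and `doc_norms` layouts are assumptions for illustration, not something prescribed by the slides:

```python
import math
from collections import defaultdict

def cosine_scores(query_terms, postings, doc_norms, N):
    """Term-at-a-time cosine scoring over an inverted index.

    postings: term -> list of (doc_id, tf) pairs (assumed layout)
    doc_norms: doc_id -> Euclidean length of the doc vector
    N: number of documents in the collection
    """
    scores = defaultdict(float)                # accumulators
    for t in query_terms:
        plist = postings.get(t, [])
        if not plist:
            continue
        idf = math.log10(N / len(plist))       # idf weight for term t
        for doc_id, tf in plist:
            # add term t's tf-idf contribution to doc_id's accumulator
            scores[doc_id] += (1 + math.log10(tf)) * idf
    # divide by document length to turn dot products into cosines
    return {d: s / doc_norms[d] for d, s in scores.items()}
```

Each query term is processed in full before the next, so only documents that contain at least one query term ever get an accumulator.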

A faster algorithm for vector space scores

Use a heap for selecting the top K. The heap takes 2N operations (comparisons) to construct, i.e., Θ(N); each of the K winners is then read off in 2 log N steps. For N = 1M and K = 100, this is about 10% of the cost of sorting.

Binary heap: definition. A binary heap is an almost complete binary tree: filled on all levels except the last, where it is filled from left to right. Min-heap ordered: every child is greater than (or equal to) its parent. [Figure: example min-heap]
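A sketch of heap-based top-K selection using Python's standard-library `heapq` (function and variable names are illustrative):

```python
import heapq

def top_k(scores, k):
    """Select the k highest cosine scores without fully sorting.

    heapify is O(N); each winner is popped in O(log N), so the total
    is O(N + k log N) rather than the O(N log N) of a full sort.
    """
    heap = [(-s, d) for d, s in scores.items()]   # max-heap via negated scores
    heapq.heapify(heap)                           # ~2N comparisons
    out = []
    for _ in range(min(k, len(heap))):
        s, d = heapq.heappop(heap)                # ~2 log N comparisons each
        out.append((d, -s))
    return out
```

`heapq.nlargest(k, scores.items(), key=lambda x: x[1])` achieves the same effect in one call; the explicit version above mirrors the construct-then-pop analysis on the slide.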

Efficient cosine ranking: what we're doing, in effect, is solving the K-nearest-neighbors problem for the query vector. In general, we do not know how to do this efficiently for high-dimensional spaces, short of computing all N cosines. Can we avoid that? Yes, but we may occasionally get an answer wrong: a doc not in the top K may creep into the answer (and docs in the top K may get omitted).

Generic approach: find a set A of documents that are contenders, where K ≪ |A| ≪ N. A does not necessarily contain the K top-scoring documents for the query, but it is likely to contain many documents with scores near those of the top K. Return the K top-scoring documents in A. We will see many ideas following this approach.

Index elimination: traverse only the postings of high-idf query terms; the rest don't contribute much to scores. Added benefit: low-idf terms have long postings lists, which are now ignored. Additionally, compute scores only for docs containing all (or most) query terms. Danger: we may end up with fewer than K candidates.
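One way to sketch the stricter variant of index elimination, where candidates must contain every retained high-idf term. The names and data layouts are illustrative, not from the slides:

```python
def candidate_docs(query_terms, postings, idf, idf_threshold):
    """Index elimination: drop low-idf query terms, then intersect the
    remaining postings so every candidate contains all retained terms.

    postings: term -> list of doc_ids (assumed layout)
    idf: term -> idf weight; idf_threshold: cutoff for "high idf"
    """
    kept = [t for t in query_terms if idf.get(t, 0.0) >= idf_threshold]
    if not kept:
        return set()
    docs = set(postings.get(kept[0], []))
    for t in kept[1:]:
        docs &= set(postings.get(t, []))      # require every kept term
    return docs
```

The danger noted on the slide shows up directly here: if the intersection is small, fewer than K candidates survive.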

Champion lists ("fancy lists"): precompute, for each term t, the set of the r docs with the highest weights for t; call this set the champion list for t. For tf-idf weighting, these are the r docs with the highest tf values for t. Given a query q, take the union of the champion lists of the terms in q to create the candidate set A, and restrict the cosine computation to the docs in A. Note: r is fixed at index construction, whereas K is application dependent.
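A minimal sketch of champion-list construction and the query-time union, assuming a `term -> [(doc_id, tf)]` postings layout (an assumption made here for illustration):

```python
import heapq

def build_champion_lists(postings, r):
    """At index time: for each term, keep the r docs with highest tf."""
    return {t: {d for d, tf in heapq.nlargest(r, plist, key=lambda x: x[1])}
            for t, plist in postings.items()}

def candidate_set(query_terms, champions):
    """At query time: A = union of the champion lists of the query terms."""
    A = set()
    for t in query_terms:
        A |= champions.get(t, set())
    return A
```

Note how r is baked in when `build_champion_lists` runs, while K only appears later when ranking the docs in A.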

Static quality scores: in many search engines we have available a quality measure g(d) for each document d. g(d) ∈ [0,1] is query-independent and thus static. E.g., for news stories on the web, g(d) may be derived from the number of favorable reviews. Order postings by g(d): we can still perform all postings merge operations as before, since merging only needs a common ordering, not necessarily by docID.

Postings ordered by g(d): we can still compute complete cosine scores and then get the exact top K. Alternatively, we can perform inexact top-K retrieval in one of several ways: stop the postings traversal for term t when g(d) drops below a threshold; or, for each t, maintain a global champion list of the r documents with the highest values of g(d) + tf-idf(t,d), and treat these as champion lists as before.

High and low champions: for each term, maintain two disjoint postings lists: High, with the champions, and Low, with the rest. In query processing, first try to get K documents from the High postings; if we fail to get K, fall back to the Low postings.
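This two-tier fallback can be sketched as follows; the `high`/`low` dictionaries and the `score` callable are illustrative assumptions:

```python
def query_with_tiers(query_terms, high, low, k, score):
    """Gather candidates from the High postings first; fall back to
    Low only if fewer than k candidates were found.

    high, low: term -> set of doc_ids (disjoint per term)
    score: doc_id -> relevance score used for the final ranking
    """
    docs = set()
    for t in query_terms:
        docs |= high.get(t, set())
    if len(docs) < k:                       # not enough champions: widen
        for t in query_terms:
            docs |= low.get(t, set())
    return sorted(docs, key=score, reverse=True)[:k]
```

When the High tier suffices, the Low postings are never touched, which is where the savings come from.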

Impact ordering: thus far, all postings lists contain docs in the same consistent order; we've used docIDs or g(d) for this. Now we consider a scheme where the postings lists are not all in the same order, yet we still obtain a benefit: fast but inexact cosine computation.

Recall the basic scoring algorithm: it did not require a common ordering across postings lists.

Frequency-ordered postings: order the docs d in the postings of term t by decreasing tf(t,d). This yields different orders for different postings lists, so we must compute scores one term at a time. When traversing the postings for query term t, stop after considering only a prefix of the list. Consider query terms in decreasing order of idf, and use various criteria for early termination.
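A sketch of this term-at-a-time traversal with one possible early-termination criterion (stop a term's list when tf falls below a cutoff). The layout and the specific criterion are illustrative choices, not the only ones the slide allows:

```python
def approx_scores(query_terms, tf_ordered, idf, tf_cutoff):
    """Impact ordering: each postings list is sorted by decreasing tf,
    so once tf drops below tf_cutoff the rest of the list can be skipped.

    tf_ordered: term -> [(doc_id, tf)] sorted by tf descending (assumed)
    """
    scores = {}
    # process query terms in decreasing order of idf, as on the slide
    for t in sorted(query_terms, key=lambda t: -idf.get(t, 0.0)):
        w = idf.get(t, 0.0)
        for doc_id, tf in tf_ordered.get(t, []):
            if tf < tf_cutoff:
                break                       # all remaining tfs are smaller
            scores[doc_id] = scores.get(doc_id, 0.0) + tf * w
    return scores
```

Docs appearing only in the skipped suffixes get no score at all, which is exactly the "fast but inexact" trade-off.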

Cluster pruning: preprocessing. Pick √N docs at random: call these leaders. For each other doc, pre-compute its nearest leader; the docs attached to a leader are its followers. Likely: each leader has ~√N followers.

Cluster pruning: query processing. Given query q, find its nearest leader L, then seek the K nearest docs from among L's followers.
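Both phases can be sketched as below, assuming documents are given as dense vectors keyed by doc id (an assumed representation for illustration):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two non-zero dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def preprocess(docs):
    """Pick ~sqrt(N) random leaders; attach each doc to its nearest leader."""
    leaders = random.sample(list(docs), max(1, int(math.sqrt(len(docs)))))
    followers = {l: [] for l in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda l: cosine(docs[d], docs[l]))
        followers[nearest].append(d)
    return leaders, followers

def query(q, docs, leaders, followers, k):
    """Find q's nearest leader L, then rank only L's followers."""
    L = max(leaders, key=lambda l: cosine(q, docs[l]))
    return sorted(followers[L], key=lambda d: cosine(q, docs[d]),
                  reverse=True)[:k]
```

Only √N leader comparisons plus ~√N follower comparisons are needed per query, instead of N full cosines.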

[Figure: visualization of a query point, the leaders, and their followers]

Why use random sampling? It is fast, and the leaders reflect the data distribution.

General variants: attach each follower to its a = 3 (say) nearest leaders; from the query, find the b = 4 (say) nearest leaders and their followers. We can also recurse on the leader/follower construction.

Exercises: To find the nearest leader in step 1, how many cosine computations do we do? Why did we pick √N leaders in the first place? What is the effect of the constants a, b on the previous slide? Devise an example where this scheme is likely to fail, i.e., where we miss one of the K nearest docs. Is that likely under random sampling?

Putting together a system: we have most of the pieces for a fairly modern search system; two further notions remain: tiered indexes and general machine-learned scoring.

Tiered indexes: a generalization of high and low champions. Try answering the query using only a fraction of the postings for each term; if we fail to get enough matching docs, fall back to the next fraction of the postings, then the next, and so on.

[Figure: example tiered index with three tiers over the terms auto, best, car, insurance and documents Doc1–Doc3]

Query-term proximity: given a query with two or more query terms t1, t2, ..., tk, let w be the width of the smallest window in doc d containing all of t1, t2, ..., tk. E.g., if the doc is "The quality of mercy is not strained", then for the query strained mercy we have w = 4. Intuition: docs in which w is smaller should score higher. But by how much?
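A straightforward (quadratic, for clarity) sketch of computing w over a tokenized document; the function name and signature are illustrative:

```python
def smallest_window(doc_tokens, query_terms):
    """Width, in tokens (inclusive), of the smallest window of
    doc_tokens containing every query term; None if a term is absent."""
    need = set(query_terms)
    if not need <= set(doc_tokens):
        return None
    best = None
    for i, tok in enumerate(doc_tokens):
        if tok not in need:
            continue                    # windows start at a query term
        seen = set()
        for j in range(i, len(doc_tokens)):
            if doc_tokens[j] in need:
                seen.add(doc_tokens[j])
            if seen == need:            # all terms covered: record width
                width = j - i + 1
                best = width if best is None else min(best, width)
                break
    return best
```

On the slide's example, the window "mercy is not strained" gives w = 4.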

General issue: we have many features on which to base our scoring: cosine score, zone scores, proximity w, length(d), g(d), and so on. How do we combine all of these into a single scoring function? Machine-learned scoring.

Putting it all together: [Figure: architecture of a complete search system: document parsing and linguistics feed the indexers, which build zone-and-field indexes, a tiered positional inverted index, and k-gram indexes; a free-text query parser with spell correction handles the user query; scoring and ranking (inexact top-K retrieval with a heap, scoring parameters trained by MLR on a training set) produce the results page]

Vectors and Boolean queries: a vector-space index can be used to answer Boolean queries; the reverse is not true. Why? Vectors and Boolean queries really don't work together very well: vector space queries are a form of evidence accumulation, whereas Boolean retrieval requires a formula for selecting documents through the presence (or absence) of specific keywords.

Vectors and wildcards: how about the query rom* restaurant? Thought: interpret the wildcard component of the query as spawning multiple terms in the vector space. In general, not a good idea.

Interaction: vectors and phrases. Scoring phrases doesn't fit naturally into the vector space world: consider tangerine trees marmalade skies. Positional indexes don't calculate or store tf-idf information for phrases like tangerine trees. Biword indexes treat certain phrases as terms, and for these we can pre-compute tf-idf, although there is a theoretical problem of correlated dimensions. Problem: we cannot expect end-users formulating queries to know which phrases are indexed. We can instead use a positional index to boost or ensure phrase occurrence.