COMP6237 Data Mining: Searching and Ranking. Jonathon Hare, jsh2@ecs.soton.ac.uk. Note: portions of these slides are from those by ChengXiang Zhai at UIUC, https://class.coursera.org/textretrieval-001

Introduction. Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on content-based indexing. Content need not be textual! Automated information retrieval systems are used to reduce what has been called "information overload".

Problem statement: It's all about the user. The user has an information need, expressed as a query. This could be specific: "What is the population of Southampton?" Or vague: "I am going to a conference in Arizona in July. What do I need to know?" The IR system must find documents that are relevant to the user's query. Clearly a classic data mining problem!

Historical Context. 4000 years ago: the table of contents and the index data-structure developed. 1876: the Dewey Decimal System, a classification scheme commonly used in libraries.

Historical Context. 1960s: basic advances in automatic indexing using computers. 1970s: probabilistic and vector-space models; clustering and relevance feedback; large, on-line Boolean IR systems. 1980s: natural language processing and IR; expert systems; commercial off-the-shelf IR systems.

Historical Context. 1990s onwards: dominance of ranking; the Internet! Web-based IR, distributed IR, multimedia IR, statistical language modelling, user feedback. Last ~10 years: really big data, better language modelling, even better ranking.

Text Retrieval

What is text retrieval? A collection of text documents exists; the user gives a query to express an information need; the search engine system returns relevant documents to the user. Known as search technology in industry.

Text retrieval versus database retrieval. Information: unstructured/free text vs. structured data; ambiguous vs. well-defined semantics. Query: ambiguous vs. well-defined semantics; incomplete vs. complete specification. Answers: relevant documents vs. matched records. TR is an empirically defined problem: we can't mathematically prove one method is better than another, so we must rely on empirical evaluation involving users!


Classical Models of Retrieval. Boolean Model (result selection): does the document satisfy the Boolean expression? e.g. winter AND Olympics AND (ski OR snowboard). Vector Space Model (result ranking): how similar is the document to the query? e.g. [(Olympics 2) (winter 2) (ski 1) (snowboard 1)]. Probabilistic Model: what is the probability that the document is generated by the query?

Problems of Result Selection. The classifier is unlikely to be accurate: an over-constrained query yields no relevant documents to return; an under-constrained query yields over-delivery; and it is hard to find the right position between these two extremes. Even if it is accurate, all relevant documents are not equally relevant (relevance isn't a binary attribute!), so prioritisation is needed. Thus, ranking is generally preferred.

Ranking versus selecting results Probability Ranking Principle [Robertson 77]: Returning a ranked list of documents in descending order of probability that a document is relevant to the query is the optimal strategy under the following two assumptions: The utility of a document (to a user) is independent of the utility of any other document A user would browse the results sequentially Do these two assumptions hold?

The Vector-Space Model of Retrieval

The Vector-Space Model (VSM). Conceptually simple: model each document by a vector, and model each query by a vector. Assumption: documents that are close together in space are similar in meaning. Develop similarity measures to rank each document against a query in terms of decreasing similarity. (Figure: documents plotted in a 2D space with axes "Star" and "Diet"; a doc about astronomy and a doc about movie stars lie along the Star axis, while a doc about mammal behaviour lies along the Diet axis.)

VSM Is a Framework. Represent a doc/query by a term vector. Term: a basic concept, e.g. a word or phrase. Each term defines one dimension; N terms define an N-dimensional space. Query vector: q = (x_1, ..., x_N), where x_i ∈ R is the query term weight. Doc vector: d = (y_1, ..., y_N), where y_j ∈ R is the doc term weight. relevance(d|q) ∝ similarity(q,d) = f(q,d)

What VSM Doesn't Say. How to define/select the basic concepts (concepts are assumed to be orthogonal). How to place docs and the query in the space, i.e. how to assign term weights: the term weight in the query indicates the importance of the term; the term weight in the doc indicates how well the term characterises the doc. How to define the similarity measure, f(q,d).

Text processing (feature extraction). Input: "the quick brown fox jumped over the lazy dog". Pipeline: tokenisation & cleaning, stop-word removal, then (optionally) stemming/lemmatisation, giving a bag of words: {brown: 1, dog: 1, fox: 1, jump: 1, lazi: 1, over: 1, quick: 1}.
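As a sketch, this pipeline might look as follows in Python. The stop-word list and the suffix-stripping "stemmer" here are toy stand-ins: a real system would use a curated stop-word list and e.g. the Porter stemmer (which is what maps "lazy" to "lazi" above).

```python
import re

# Toy stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def crude_stem(word):
    # Toy suffix-stripper standing in for a real stemmer; it happens to
    # map "jumped" -> "jump" and "lazy" -> "lazi" like Porter would.
    for suffix in ("ed", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    if word.endswith("y"):
        return word[:-1] + "i"
    return word

def bag_of_words(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenisation & cleaning
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    tokens = [crude_stem(t) for t in tokens]             # stemming (optional)
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

print(bag_of_words("the quick brown fox jumped over the lazy dog"))
```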

Bag of Words Vectors. The lexicon or vocabulary is the set of all (processed) words across all documents known to the system. We can create vectors for each document with as many dimensions as there are words in the lexicon. Each word in the document's bag of words contributes a count to the corresponding element of the vector for that word. In essence, each vector is a histogram of the word occurrences in the respective document. Vectors will have a very high number of dimensions, but will be very sparse.

Aside: language statistics and Zipf's Law. (Figure: term frequency plotted against term rank, ordered by frequency; a few very frequent stop words sit at the head of the curve, with a long tail of rare words.)

Searching the VSM. Example: D1 = 2T1 + 3T2 + 5T3; D2 = 3T1 + 7T2 + T3; Q = 0T1 + 0T2 + 2T3. Is D1 or D2 more similar to Q? How do we measure the degree of similarity: distance? angle? projection? Typically, cosine similarity.
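A quick check of the example with cosine similarity (plain Python, not from the slides):

```python
import math

def cosine(u, v):
    # Cosine of the angle between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors from the slide's example, over dimensions (T1, T2, T3)
d1 = (2, 3, 5)
d2 = (3, 7, 1)
q = (0, 0, 2)

print(cosine(q, d1))  # ≈ 0.811, so D1 is the better match
print(cosine(q, d2))  # ≈ 0.130
```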

Recap: Cosine Similarity. If p and q are both high-dimensional and sparse, then you're going to spend a lot of time multiplying 0 by 0 and adding 0 to the accumulator. The document vector norms can be pre-computed and stored!

Inverted Indexes. A map of words to lists of postings, where a posting is a pair formed by a document ID and the number of times the specific word appeared in that document:

Aardvark  → [doc3:4]
Astronomy → [doc1:2]
Diet      → [doc2:9; doc3:8]
Movie     → [doc2:10]
Star      → [doc1:13; doc2:4]
Telescope → [doc1:15]
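A minimal sketch of such an index as a Python dict; the toy documents are chosen to reproduce the postings above:

```python
# Inverted index: term -> list of (doc_id, term_frequency) postings.
def build_inverted_index(docs):
    index = {}
    for doc_id, tokens in docs.items():
        counts = {}
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        for term, tf in counts.items():
            index.setdefault(term, []).append((doc_id, tf))
    return index

# Toy documents matching the slide's postings
docs = {
    "doc1": ["astronomy"] * 2 + ["star"] * 13 + ["telescope"] * 15,
    "doc2": ["diet"] * 9 + ["movie"] * 10 + ["star"] * 4,
    "doc3": ["aardvark"] * 4 + ["diet"] * 8,
}
index = build_inverted_index(docs)
print(index["star"])  # [('doc1', 13), ('doc2', 4)]
```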

Computing the Cosine Similarity. For each word in the query, look up the relevant postings list and accumulate similarities for only the documents seen in those postings lists; this is much more efficient than fully comparing the vectors.

Query: "Movie Star". Look up the postings for each query term: Movie → [doc2:10]; Star → [doc1:13; doc2:4]. Accumulate the dot products: doc2 gets 10×1 + 4×1 = 14; doc1 gets 13×1 = 13; doc3 never appears in these postings lists and scores 0. Finally, normalise by the document vector lengths: doc2: 14 / 14.04 = 0.997; doc1: 13 / 19.95 = 0.652; doc3: 0.
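The walk-through above, as a term-at-a-time sketch in Python; the postings and the pre-computed document norms come from the toy index:

```python
import math

# Toy inverted index and pre-computed document vector norms
index = {
    "aardvark": [("doc3", 4)],
    "astronomy": [("doc1", 2)],
    "diet": [("doc2", 9), ("doc3", 8)],
    "movie": [("doc2", 10)],
    "star": [("doc1", 13), ("doc2", 4)],
    "telescope": [("doc1", 15)],
}
doc_norms = {
    "doc1": math.sqrt(2**2 + 13**2 + 15**2),  # ≈ 19.95
    "doc2": math.sqrt(9**2 + 10**2 + 4**2),   # ≈ 14.04
    "doc3": math.sqrt(4**2 + 8**2),
}

def score(query_terms, index, doc_norms):
    # Term-at-a-time: touch only documents in the query terms' postings.
    acc = {}
    for term in query_terms:
        for doc_id, tf in index.get(term, []):
            acc[doc_id] = acc.get(doc_id, 0) + tf * 1  # query weight 1
    return {d: s / doc_norms[d] for d, s in acc.items()}

print(score(["movie", "star"], index, doc_norms))
# doc2 ≈ 0.997, doc1 ≈ 0.652; doc3 is never touched
```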

Weighting the vectors. The number of times a term occurs in a document reflects the importance of that term in the document. Intuitions: a term that appears in many documents is not important (e.g. "the", "going", "come"); if a term is frequent in a document and rare across other documents, it is probably important in that document.

Possible weighting schemes. Binary weights: only the presence (1) or absence (0) of a term is recorded in the vector. Raw frequency: the frequency of occurrence of the term in the document is included in the vector. TF-IDF: term frequency (tf) is the frequency count of a term in a document; inverse document frequency (idf) provides high values for rare words and low values for common words. (Note there are many forms of TF-IDF; it isn't one specific scheme.)

Actual scoring functions: baseline. The key bit of the cosine similarity is the dot product between the query and document. Use this as a basis for building a scoring function that incorporates TF and IDF:

f(q,d) = Σ_{i=1..N} x_i · y_i = Σ_{w ∈ q∩d} c(w,q) · c(w,d) · log((M + 1) / df(w))

where c(w,q) is the number of times w appears in q, c(w,d) is the number of times w appears in d, M is the total number of docs in the collection, and df(w) is the document frequency of w (the number of docs containing w).
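A sketch of this baseline scorer in Python; the collection statistics (M, df) and the query/document counts are made-up illustrative values:

```python
import math

# Baseline TF-IDF score: sum over words shared by query and document of
# c(w,q) * c(w,d) * log((M + 1) / df(w))
def tfidf_score(query_counts, doc_counts, df, M):
    score = 0.0
    for w, cwq in query_counts.items():
        if w in doc_counts:
            score += cwq * doc_counts[w] * math.log((M + 1) / df[w])
    return score

M = 1000                                    # hypothetical collection size
df = {"winter": 300, "olympics": 50, "ski": 20}
q = {"winter": 1, "olympics": 1, "ski": 1}
d = {"winter": 2, "olympics": 3, "ski": 1}
print(tfidf_score(q, d, df, M))
```

Note how the rare term "ski" contributes a much larger per-occurrence weight than the common term "winter".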

Better scoring functions. The baseline TF-IDF scoring function has a few problems: scores are skewed if a document contains a word with many occurrences, and what if the document is long? There are lots of other scoring functions out there. BM25 is one of the most popular families of functions, developed as part of the Okapi retrieval system at City University in the 80s.

BM25.

f(q,d) = Σ_{w ∈ q∩d} c(w,q) · [c(w,d) · (k1 + 1)] / [c(w,d) + k1 · (1 - b + b · |d| / avgdl)] · log((M - df(w) + 0.5) / (df(w) + 0.5))

where k1 and b are constants, |d| is the length of doc d, and avgdl is the average doc length in the collection.
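A sketch of BM25 as given above; k1 = 1.2 and b = 0.75 are commonly used defaults, and the collection statistics are again made-up illustrative values:

```python
import math

def bm25_score(query_counts, doc_counts, doc_len, avgdl, df, M,
               k1=1.2, b=0.75):
    score = 0.0
    for w, cwq in query_counts.items():
        if w not in doc_counts:
            continue
        cwd = doc_counts[w]
        # Saturating TF component, normalised by document length
        tf_part = cwd * (k1 + 1) / (cwd + k1 * (1 - b + b * doc_len / avgdl))
        # Robertson/Sparck Jones style IDF component
        idf_part = math.log((M - df[w] + 0.5) / (df[w] + 0.5))
        score += cwq * tf_part * idf_part
    return score

M, avgdl = 1000, 250                      # hypothetical collection stats
df = {"winter": 300, "olympics": 50}
q = {"winter": 1, "olympics": 1}
d = {"winter": 2, "olympics": 3}
print(bm25_score(q, d, doc_len=200, avgdl=avgdl, df=df, M=M))
```

With b > 0, the same term counts score lower in a longer document, addressing the length problem of the baseline.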

Inside a retrieval system

(Figure: retrieval-system architecture. Offline: Documents → Tokeniser & cleaning → document representation → Indexer → Index. Online: the user's Query → Tokeniser & cleaning → query representation → Scorer (against the Index) → results for the user. Feedback and relevance judgements flow from the results back into the system, spanning both the online and offline sides.)

Index. The index comprises several structures: the Document Index (DocID → length), the Inverted Index (TermID → postings), the Meta Index (DocID → metadata), and the Lexicon (TermID → term, df, total tf).

Building an inverted index. It is difficult to build a huge index with limited memory, and memory-based methods are not usable for large collections. Sort-based methods: Step 1: collect local <termid, docid, freq> tuples in a run. Step 2: sort tuples within the run and write them to disk. Step 3: pair-wise merge the runs on disk. Step 4: output the inverted file.

Sort-based Inversion. Parse and count: collect <termid, docid, freq> tuples per document, e.g. <1,1,3>, <2,1,2>, <3,1,1> from doc 1; <1,2,2>, <3,2,2>, <5,2,3> from doc 2; ...; <1,30,3>, <8,30,3> from doc 30. Sort each run by termid (then docid) and write it to disk. Merge-sort the runs into a single sorted sequence: <1,1,3>, <1,2,2>, <1,30,3>, <2,1,2>, <3,1,1>, <3,2,2>, <5,2,3>, <8,30,3>, ... Build the index by grouping into postings lists, e.g. term 1 → 1:3; 2:2; 30:3.
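The steps above, sketched in memory in Python using the tuples from the diagram; the on-disk runs and pair-wise merge are elided, with an in-memory sort standing in for the external merge sort:

```python
# Sort-based inversion: collect (termid, docid, freq) tuples, sort them,
# then group consecutive tuples into postings lists.
def invert(docs):
    tuples = []
    for doc_id, term_counts in docs.items():
        for term_id, freq in term_counts.items():
            tuples.append((term_id, doc_id, freq))
    tuples.sort()  # an external merge sort over disk runs in practice
    index = {}
    for term_id, doc_id, freq in tuples:
        index.setdefault(term_id, []).append((doc_id, freq))
    return index

# Term counts for docs 1, 2 and 30, matching the slide's tuples
docs = {1: {1: 3, 2: 2, 3: 1}, 2: {1: 2, 3: 2, 5: 3}, 30: {1: 3, 8: 3}}
print(invert(docs))  # e.g. term 1 -> [(1, 3), (2, 2), (30, 3)]
```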

Inverted Index Compression. Leverage the skewed distribution of values and use variable-length integer encoding. TF compression: small numbers tend to occur far more frequently than large numbers (c.f. Zipf), so use fewer bits for small (high-frequency) integers at the cost of more bits for large integers. DocID compression: delta-gap (or "d-gap") encoding stores differences, d1, d2-d1, d3-d2, ..., which is feasible because postings lists are accessed sequentially. Methods: oblivious (binary code, unary code, γ-code, δ-code, varint, ...) and list-adaptive (Golomb, Rice, Frame of Reference (FOR), Patched FOR (PFOR), ...).
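A sketch of d-gap plus variable-byte (varint) encoding, one common oblivious scheme: 7 data bits per byte, with the high bit marking the final byte of each number (byte-order and continuation-bit conventions vary between systems):

```python
def vbyte_encode(numbers):
    # 7 data bits per byte, least-significant group first;
    # the high bit is set on the last byte of each number.
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                          # final byte of a number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

doc_ids = [3, 7, 154, 1080]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(gaps)   # 6 bytes instead of four 4-byte integers
restored, total = [], 0
for g in vbyte_decode(encoded):
    total += g                 # undo the d-gaps by a running sum
    restored.append(total)
print(gaps, len(encoded), restored)
```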

Improving search engine ranking

Location-weighting; Phrase and Proximity search. Can we use the positions of the query terms in the document to improve ranking? A document is perhaps more likely to be relevant if the query terms occur nearer the beginning. Positions also allow for exact phrase matching or proximity search (query terms appearing close to each other). For any of these to be tractable, the position of every term in every document needs to be indexed.

Index Payloads. A positional inverted index augments the postings with a list of term offsets from the beginning of the document, e.g. term1 → [doc3:4: 21,28,100,311, ...]. This obviously increases the size of the index dramatically, so compression is very important; typically d-gap+γ or d-gap+FOR encoding is used.
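A sketch of exact phrase matching over a toy positional index: a phrase matches in a document at position p if the i-th phrase term occurs at position p + i for every i.

```python
# Toy positional index: term -> {doc_id -> sorted list of positions}
positional_index = {
    "movie": {"doc2": [5, 40]},
    "star": {"doc1": [3, 17], "doc2": [6, 80]},
}

def phrase_match(terms, index):
    # Restrict to documents containing every term in the phrase
    docs = set(index.get(terms[0], {}))
    for t in terms[1:]:
        docs &= set(index.get(t, {}))
    matches = []
    for d in docs:
        for p in index[terms[0]][d]:
            # Check the terms occur at consecutive positions p, p+1, ...
            if all((p + i) in index[t][d] for i, t in enumerate(terms)):
                matches.append((d, p))
    return matches

print(phrase_match(["movie", "star"], positional_index))  # [('doc2', 5)]
```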

Retrieving hypertext: Using inbound links. Assume we are indexing text documents with hyperlinks (i.e. webpages). Can we use the hyperlinks to improve search? We could perhaps increase the score of documents that have many other documents linking to them, but this is unfortunately prone to manipulation: it is easy to set up new pages with lots of links. Solution: PageRank. Compute the importance of a page from the importance of all the other pages that link to it and the number of links each of those pages has.
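A sketch of PageRank by power iteration over a toy link graph; the damping factor of 0.85 is the usual choice, and the graph here is hypothetical:

```python
def pagerank(links, damping=0.85, iterations=50):
    # links: page -> list of pages it links to (no dangling pages here)
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleport mass, plus each page sharing its rank over its out-links
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            share = damping * rank[p] / len(outgoing)
            for q in outgoing:
                new_rank[q] += share
        rank = new_rank
    return rank

# Toy graph: almost everyone links to A, so A should rank highest
links = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A", "B"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # prints A
```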

Making use of feedback: Click Ranking. Can we make use of user feedback? When a user performs a search, which results do they look at? We could improve ranking by increasing the weighting of documents that are clicked (irrespective of the query), or by learning associations between queries and documents and using these to re-rank documents in response to a specific query.

Current challenges in IR: Diverse ranking

Implicit Search Result Diversification. Most current search systems always assume that a ranked list of results will be generated. Is this optimal? Diversity in search result rankings is needed when users' queries are poorly specified or ambiguous: e.g. what if I perform a web search for "jaguar"? By presenting a diverse range of results covering all possible interpretations of a query, the probability of finding relevant documents is increased. Very much a data-mining problem!

Search for images related to "euro" (using textual metadata). (Figure: the original ranking versus the diversified ranking.)

Summary. Search engines are a key tool for data mining. They can help a user understand what is in a data set by effectively providing targeted access. The techniques required to build a search engine are important: content analysis/feature extraction is of key importance, and the low-level techniques search engines use for scalability and efficiency have applications in other areas. Search engines can provide a base for performing other types of data mining efficiently and effectively.