Information Retrieval
CISC437/637, Lecture #23
Ben Carterette
Copyright Ben Carterette

Text Search
Consider a database consisting of long textual information fields: news articles, patents, web pages, books.
What is the best way to search within this data?
- grep for keywords in it?
- Store it in a DBMS and use SQL queries with LIKE '%keyword1%' AND LIKE '%keyword2%'?
If there's enough data, these approaches are very slow and very inefficient, and hash and B+-tree indexes don't help.
Information Retrieval
Information retrieval (IR) studies systems for indexing and querying large full-text corpora. Google is the most widely known modern example.
Work in IR and DB has mostly been separate:
- IR has roots in library science and information science going back to the 1950s, and today is allied with AI.
- DB is more firmly rooted in algorithms and systems.
In recent years they have begun to intersect via XML, text mining, and data mining.

IR Systems vs. DBMS
Both involve queries that are matched to records (possibly using an index) to retrieve results. After that, many differences:
- IR: relevance semantics; keyword search; unstructured data; read-only (mostly); rank best-matching results.
- DBMS: relational semantics; full SQL query language; structured data; read/write; return full result set.
Common Topics, Different Focus
IR and DB have many things in common, but the focus differs between the two. Both fields study the same list of topics: users, query language, query/record matching, record ranking, building indexes, indexing strategies, query optimization, and concurrency control. The two communities simply emphasize different items on the list.

Relevance and Ranking
In a DBMS, records either match a SQL query or they don't; there is no middle ground. In IR systems, some documents can be better matches than others:
- Matching documents may not be relevant.
- Documents that don't match may be relevant.
Relevance describes the usefulness of a document to a particular user.
[Figure: a query's top-ranked results overlap differently with the relevant results for one user and the relevant results for a different user.]

What Determines Relevance?
Many factors can contribute to a document being relevant to a user:
- Frequency of search terms within the document
- Frequency of search terms in the corpus
- Proximity of search terms in the document
- Popularity of a document among other users
- Popularity of a document among content developers
- User prior knowledge
- User task
The system can only make guesses about relevance; it can't read the user's mind. Guesses are based on computations over factors like those above.
The Bag of Words Model
"Bag of words" refers to a simple representation scheme for text: documents and queries are simply unordered sets of words, with syntactic information (grammar) stripped out.
Two details:
- Very common words ("stopwords") are removed, e.g. the, a, of, in, that, ...
- Words are converted to their stems, e.g. surfs, surfing, surfed -> surf.

Text Indexes
The bag of words model allows a time- and space-efficient indexing model: the inverted index. Instead of storing a relation with documents as records and terms as attributes, store each term with the list of documents it appears in, called an "inverted list" or "posting list". To answer a one-term query, simply retrieve the posting list for that term.
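The two slides above can be sketched together in a few lines of Python. The stopword list and the suffix-stripping rule below are simplified stand-ins (a real system would use a full stopword list and a proper stemmer such as Porter's), and the corpus is invented for illustration:

```python
from collections import defaultdict

# Toy stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "of", "in", "that", "is", "to"}

def bag_of_words(text):
    """Lowercase, drop stopwords, and crudely strip common suffixes (stand-in stemmer)."""
    words = []
    for w in text.lower().split():
        w = w.strip(".,!?")
        if w in STOPWORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        words.append(w)
    return words

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs it appears in (its posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bag_of_words(text):
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

docs = {
    1: "Surfing the web",
    2: "He surfed the big wave",
    3: "The web of pages",
}
index = build_inverted_index(docs)
# A one-term query is just a posting-list lookup:
print(index["surf"])  # docs 1 and 2, since "surfing" and "surfed" both stem to "surf"
print(index["web"])
```

Note that stemming is what makes "surfing" and "surfed" land in the same posting list, which is the point of the bag-of-words normalization.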
Longer Queries
Boolean queries use AND, OR, NOT syntax: term1 AND (term2 OR term3).
- To answer an AND query: get the posting lists for all terms and take their intersection.
- To answer an OR query: take the union of all posting lists.
- To answer an AND NOT query: set subtraction.
- To answer an OR NOT query: the union of term1's list and NOT term2, which will be a very large set; OR NOT is usually not allowed.

Ranking
Querying the inverted index only provides the set of documents that match the query. The next step is to rank them in order of likelihood of relevance to the user. Ranking algorithms are the subject of much study in the field. Most are based on the probability ranking principle, which says that the optimal ordering is in decreasing order of probability of relevance.
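The Boolean operations above map directly onto set operations over posting lists. A minimal sketch, using an invented index with three terms:

```python
def boolean_and(index, *terms):
    """AND: intersect the posting lists of all terms."""
    lists = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)) if lists else []

def boolean_or(index, *terms):
    """OR: union of all posting lists."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))

def boolean_and_not(index, term1, term2):
    """AND NOT: set subtraction."""
    return sorted(set(index.get(term1, [])) - set(index.get(term2, [])))

# Hypothetical posting lists (doc IDs per term):
index = {
    "term1": [1, 2, 4],
    "term2": [2, 3],
    "term3": [4, 5],
}

# term1 AND (term2 OR term3):
inner = boolean_or(index, "term2", "term3")
result = sorted(set(index["term1"]) & set(inner))
print(result)  # [2, 4]
```

Real systems keep posting lists sorted by doc ID so intersections can be done by merging rather than hashing, but sets make the semantics clear.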
Text Statistics and Ranking
Another advantage of inverted files: they can store a lot of information about terms within documents and in the corpus.
For each term, also store:
- Document frequency df, the total number of documents the term appears in.
- Collection term frequency ctf, the total number of times the term appears in all documents.
For each term/document pair, store:
- Term frequency tf, the number of times the term appears in the document.
For each document, store:
- Document length dl, the total number of terms in the document.

Ranking Functions
Use tf, dl, and df to calculate a score for each document; the score indicates the likelihood of relevance. One common approach treats a document D as a vector in V-dimensional space, where V is the vocabulary size. The magnitude along each dimension is set to the weight of the corresponding term, and weights are functions of tf, dl, and df. The query Q is represented as a vector in the same space, and the score S(Q, D) is defined to be the cosine of the angle between the two vectors.
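The vector-space scoring idea can be sketched with a concrete weighting. The tf * log(N/df) ("tf-idf") weight below is one common choice among many, and the three-document corpus is invented for illustration:

```python
import math
from collections import Counter

def tfidf_vector(terms, df, n_docs):
    """Weight each term by tf * idf, where idf = log(N / df)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus, already reduced to bags of (stemmed) words:
docs = {
    1: ["surf", "wave", "surf"],
    2: ["web", "page", "link"],
    3: ["surf", "web"],
}
n_docs = len(docs)
df = Counter(t for terms in docs.values() for t in set(terms))

query = ["surf", "wave"]
q_vec = tfidf_vector(query, df, n_docs)
scores = {d: cosine(q_vec, tfidf_vector(terms, df, n_docs))
          for d, terms in docs.items()}
for d, s in sorted(scores.items(), key=lambda x: -x[1]):
    print(d, round(s, 3))  # doc 1 ranks first: both query terms, "surf" twice
```

Document 1 outscores document 3 because it contains both query terms and a higher tf for "surf"; document 2 shares no terms with the query and scores zero.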
Evaluating Ranking Functions
Different functions produce different rankings of documents. How do we choose among ranking functions? Performance evaluation:
- Judge the relevance of each document in the ranking to a user's information need.
- Calculate a summary measure of performance over the relevance judgments.
Common measures: precision, recall, average precision, DCG.

How Google Works (Kind Of)
Google will never say how they actually work, but they have provided some details in the past. Basic approach:
- Crawl the web constantly.
- Within documents, store some formatting info.
- Store info about links between documents.
- Use links between documents to gauge popularity.
- Store indexes across many cheap servers.
- Process queries in parallel.
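Three of the measures named above can be computed in a few lines. The ranking and relevance judgments below are invented; precision and recall here are taken over the whole returned ranking:

```python
def precision_recall_ap(ranking, relevant):
    """ranking: doc IDs in ranked order; relevant: set of judged-relevant doc IDs."""
    hits = 0
    precisions = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # precision at each relevant document's rank
    precision = hits / len(ranking)          # fraction of retrieved docs that are relevant
    recall = hits / len(relevant)            # fraction of relevant docs that are retrieved
    ap = sum(precisions) / len(relevant)     # average precision (unretrieved relevant docs count 0)
    return precision, recall, ap

# Ranked result list [3, 1, 7, 5]; docs 1, 5, 9 are relevant.
p, r, ap = precision_recall_ap([3, 1, 7, 5], {1, 5, 9})
print(p, r, ap)  # 0.5, 2/3, 1/3
```

Average precision rewards systems that place relevant documents early: the precision values being averaged are measured exactly at the ranks where relevant documents appear.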
From "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin & Page, 1998.

Google's Index
Google uses an inverted index. Each term's posting list is made up of:
- An internal page ID.
- The number of hits on that page.
- For each hit, a fixed-size data structure.
Hit data structure: stores font info, position in document, and some markup information; total size = 2 bytes (16 bits).
Forward barrels store the sequence of hits appearing in a page; inverted barrels store the hits in all pages for each term. The lexicon supports fast access to the inverted barrels.
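The 2-byte hit can be sketched as a bit-packing exercise. The layout below (1 capitalization bit, 3 font-size bits, 12 position bits) is the one the Brin & Page paper describes for "plain" hits; fancy hits (titles, anchors) use a different encoding, and this is only an illustrative sketch:

```python
def pack_hit(cap, font, pos):
    """Pack a plain hit into 16 bits: 1 cap bit, 3 font bits, 12 position bits."""
    assert 0 <= cap <= 1 and 0 <= font < 8 and 0 <= pos < 4096
    return (cap & 1) << 15 | (font & 0b111) << 12 | (pos & 0xFFF)

def unpack_hit(h):
    """Recover (cap, font, pos) from a packed 16-bit hit."""
    return (h >> 15) & 1, (h >> 12) & 0b111, h & 0xFFF

h = pack_hit(1, 3, 500)   # capitalized, font size 3, word position 500
print(h.bit_length() <= 16, unpack_hit(h))
```

The fixed 2-byte size is what makes barrels cheap to scan: every hit occupies the same space, so a posting list is just a page ID, a hit count, and a flat array of 16-bit values.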
PageRank
PageRank is a citation analysis algorithm. Basic idea:
- Pages with many links from high-quality pages are likely to be high-quality.
- High-quality pages are more likely to link to other high-quality pages.
The recursive flavor comes through in the PageRank formula: PR(A) = (1 - d)/N + d * sum over pages T linking to A of PR(T)/C(T), where C(T) is the number of outlinks on page T and d is the damping factor.

PageRank User Model
PageRank is the result of modeling users as "random surfers":
- A user starts randomly on some page in the web.
- They randomly click links, never going back.
- At some point they jump to a new page selected uniformly at random.
PR(A) = the probability that a random surfer is looking at page A. Google uses PageRank to modify an IR score-based ranking.
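The random-surfer model can be computed by power iteration: with probability d the surfer follows a random outlink, with probability 1 - d they jump to a page chosen uniformly at random. A minimal sketch, assuming the conventional damping factor d = 0.85, a tiny invented link graph, and no dangling (outlink-free) pages:

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration over the PageRank recurrence.

    links: dict mapping each page to its list of outlinks.
    Assumes every page has at least one outlink (no dangling pages).
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iters):
        new = {}
        for p in pages:
            # Sum PR(q)/C(q) over all pages q that link to p.
            in_sum = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * in_sum
        pr = new
    return pr

links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pr = pagerank(links)
print({p: round(v, 3) for p, v in sorted(pr.items())})
```

Here C accumulates the most PageRank because both A and B link to it; the scores form a probability distribution (they sum to 1), matching the interpretation of PR(A) as the probability that a random surfer is looking at page A.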