Information Retrieval. Text Search. CISC437/637, Lecture #23 Ben Carterette. Consider a database consisting of long textual information fields

Transcription:

Information Retrieval
CISC437/637, Lecture #23
Ben Carterette
Copyright Ben Carterette

Text Search
Consider a database consisting of long textual information fields:
- News articles, patents, web pages, books, ...
What is the best way to search within this data?
- grep for keywords in it?
- Store it in a DBMS and use SQL queries with LIKE '%keyword1%' AND LIKE '%keyword2%'?
If there's enough data, these approaches are very slow and very inefficient.
- Hash and B+-tree indexes don't help.

Information Retrieval
Information retrieval (IR) studies systems for indexing and querying large full-text corpora.
- Google is the most widely known modern example.
Work in IR and DB has mostly been separate:
- IR has roots in library science and information science going back to the 1950s; today it is allied with AI.
- DB is more firmly rooted in algorithms and systems.
- In recent years they have begun to intersect via XML, text mining, and data mining.

IR Systems vs. DBMS
Both involve queries that are matched to records (possibly using an index) to retrieve results. After that, there are many differences:

IR                            DBMS
Relevance semantics           Relational semantics
Keyword search                Full SQL query language
Unstructured data             Structured data
Read-only (mostly)            Read/write
Rank best-matching results    Return full result set

Common Topics, Different Focus
IR and DB have many things in common, but focus differs between the two:
- DBMS: users, query language, query/record matching, record ranking, building indexes, indexing strategies, query optimization, concurrency control
- IR: users, query language, query/record matching, record ranking, building indexes, indexing strategies, query optimization, concurrency control

Relevance and Ranking
- In a DBMS, records either match a SQL query or they don't; there is no middle ground.
- In IR systems, some documents can be better matches than others.
  - Matching documents may not be relevant.
  - Documents that don't match may be relevant.
- Relevance describes the usefulness of a document to a particular user.

[Figure: a query, its top-ranked results, the relevant results for one user, and the relevant results for a different user]

What Determines Relevance?
Many factors can contribute to a document being relevant to a user:
- Frequency of search terms within the document
- Frequency of search terms in the corpus
- Proximity of search terms in the document
- Popularity of a document among other users
- Popularity of a document among content developers
- User prior knowledge
- User task
The system can only make guesses about relevance; it can't read the user's mind.
- Guesses are based on computations of the above factors.

The Bag of Words Model
"Bag of words" refers to a simple representation scheme for text:
- Documents and queries are simply unordered sets of words.
- Syntactic information (grammar) is stripped out.
Two details:
- Very common words ("stopwords") are removed.
  - E.g. the, a, of, in, that, ...
- Words are converted to their stems.
  - E.g. surfs, surfing, surfed -> surf

Text Indexes
The bag of words model allows a time- and space-efficient indexing model: the inverted index.
- Instead of storing a relation with documents as records and terms as attributes, store each term with the list of documents it appears in.
  - This list is called an inverted list or posting list.
- To answer a one-term query, simply retrieve the posting list for that term.
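The bag of words steps and the inverted index described above can be sketched in a few lines of Python. This is only an illustration: the stopword set is the handful of words listed on the slide, and `crude_stem` is a hypothetical suffix-stripping stand-in for a real stemmer, not an algorithm named in the lecture.

```python
from collections import defaultdict

# Tiny illustrative stopword list (the examples given on the slide).
STOPWORDS = {"the", "a", "of", "in", "that"}

def crude_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    # Lowercase, strip surrounding punctuation, drop stopwords, then stem.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOPWORDS]

def build_inverted_index(docs):
    # Map each term to the sorted list of document IDs it appears in
    # (its posting list).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bag_of_words(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "She surfs the web",
    2: "Surfing in the ocean",
    3: "A database of patents",
}
index = build_inverted_index(docs)
print(index["surf"])  # [1, 2]
```

A one-term query is then a single dictionary lookup, which is why this layout is so much faster than scanning every document.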

Longer Queries
Boolean queries use AND, OR, NOT syntax:
- term1 AND (term2 OR term3)
To answer an AND query:
- Get posting lists for all terms and take the intersection.
To answer an OR query:
- Take the union of all posting lists.
To answer an AND NOT query:
- Set subtraction.
To answer an OR NOT query:
- Union of term1 and NOT term2, which will be a very large set.
- OR NOT is usually not allowed.

Ranking
- Querying the inverted index only provides the set of documents that match the query.
- The next step is to rank them in order of likelihood of relevance to the user.
- Ranking algorithms are the subject of much study in the field.
- Most are based on the probability ranking principle, which says that the optimal ordering is in decreasing order of probability of relevance.
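The Boolean operations on posting lists described above might look like the following sketch. It assumes posting lists are sorted lists of integer document IDs; the function names and the sample `postings` data are made up for illustration.

```python
def intersect(p1, p2):
    """AND: merge two sorted posting lists, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: every document ID that appears in either list."""
    return sorted(set(p1) | set(p2))

def and_not(p1, p2):
    """AND NOT: set subtraction, IDs in p1 that are not in p2."""
    exclude = set(p2)
    return [d for d in p1 if d not in exclude]

# Hypothetical posting lists for three terms.
postings = {"web": [1, 2, 4, 7], "search": [2, 3, 4], "engine": [4, 5, 7]}

# web AND (search OR engine)
result = intersect(postings["web"], union(postings["search"], postings["engine"]))
print(result)  # [2, 4, 7]
```

The merge in `intersect` touches each list entry at most once, so an AND query costs time linear in the lengths of the posting lists rather than in the size of the corpus.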

Text Statistics and Ranking
Another advantage of inverted files: they can store a lot of information about terms within documents and in the corpus.
For each term, also store the following:
- Document frequency df, the total number of documents the term appears in
- Collection term frequency ctf, the total number of times the term appears in all documents
For each term/document pair, store:
- Term frequency tf, the number of times the term appears in the document
For each document, store:
- Document length N, the total number of terms in the document

Ranking Functions
Use tf, N, and df to calculate a score for each document.
- The score indicates the likelihood of relevance.
One common approach treats a document D as a vector in V-dimensional space.
- The magnitude along each dimension is set to the weight of the corresponding term.
- Weights are functions of tf, N, and df.
The query Q is represented as a vector in the same space.
The score S(Q, D) is defined to be the cosine of the angle between the two vectors.
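As a rough sketch of the vector-space scoring described above, the code below weights each term by tf * log(number of documents / df). That particular weighting is an assumption chosen for illustration (the slides only say that weights are functions of tf, N, and df); document length is accounted for here through the cosine normalization. The sample documents are invented.

```python
import math
from collections import Counter

def tfidf_vector(term_counts, df, num_docs):
    # Assumed weighting: tf * log(num_docs / df). Terms unseen in the
    # collection are dropped because they have no df.
    return {
        t: tf * math.log(num_docs / df[t])
        for t, tf in term_counts.items()
        if df.get(t, 0) > 0
    }

def cosine(q, d):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical documents, already reduced to bags of terms.
docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["web", "page", "rank", "rank"],
    "d3": ["database", "index"],
}
df = Counter(t for terms in docs.values() for t in set(terms))
num_docs = len(docs)

doc_vecs = {d: tfidf_vector(Counter(ts), df, num_docs) for d, ts in docs.items()}
query_vec = tfidf_vector(Counter(["page", "rank"]), df, num_docs)

# Rank documents by decreasing score S(Q, D).
ranking = sorted(docs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranking[0])  # d2
```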

Evaluating Ranking Functions
Different functions produce different rankings of documents.
- How do we choose among ranking functions?
Performance evaluation:
- Judge the relevance of each document in the ranking to a user's information need.
- Calculate a summary measure of performance over the relevance judgments.
- Common measures: precision, recall, average precision, DCG

How Google Works (Kind Of)
Google will never say how they actually work, but they have provided some details in the past.
Basic approach:
- Crawl the web constantly.
- Within documents, store some formatting info.
- Store info about links between documents.
- Use links between documents to gauge popularity.
- Store indexes across many cheap servers.
- Process queries in parallel.
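The evaluation measures named above (precision, recall, average precision, DCG) can be computed from a ranking and a set of relevance judgments roughly as follows. The definitions are the commonly used ones; in particular, the log2(rank + 1) discount in `dcg` is one common formulation and is an assumption here, as are the sample ranking and judgments.

```python
import math

def precision_at_k(ranking, relevant, k):
    # Fraction of the top k results that are relevant.
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    # Fraction of all relevant documents found in the top k.
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def average_precision(ranking, relevant):
    # Mean of the precision values at each rank where a relevant
    # document appears; unretrieved relevant documents contribute 0.
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def dcg(gains):
    # Discounted cumulative gain with a log2(rank + 1) discount.
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

ranking = ["d3", "d1", "d7", "d2"]   # hypothetical system output
relevant = {"d1", "d2", "d5"}        # hypothetical relevance judgments

print(precision_at_k(ranking, relevant, 4))  # 0.5
print(recall_at_k(ranking, relevant, 4))
print(average_precision(ranking, relevant))
print(dcg([1 if d in relevant else 0 for d in ranking]))
```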

[Figure: from "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin & Page, 1998]

Google's Index
Google uses an inverted index.
Each term's posting list is made up of:
- An internal page ID
- The number of hits on that page
- For each hit, a fixed-size data structure
Hit data structure:
- Stores font info, position in document, and some markup information
- Total size = 2 bytes (16 bits)
Forward barrels store a sequence of hits appearing in a page.
Inverted barrels store hits in all pages for each term.
The lexicon supports fast access to the inverted barrels.
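To make the 2-byte hit structure concrete, here is a hedged sketch of bit packing. It follows the "plain hit" layout described in the Brin & Page paper (1 capitalization bit, 3 bits of font size, 12 bits of word position); the order of the fields within the 16 bits, the clamping of large positions, and the function names are assumptions made for illustration.

```python
def pack_hit(capitalized, font_size, position):
    # Assumed layout: [1 bit capitalized][3 bits font size][12 bits position].
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp to the 12-bit maximum
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    # Reverse of pack_hit: returns (capitalized, font_size, position).
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

hit = pack_hit(True, 3, 100)
print(hit)              # 45156
print(unpack_hit(hit))  # (True, 3, 100)
```

Keeping every hit at a fixed 16 bits means a posting list can be stored as a flat array with no per-hit pointers or length fields.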

PageRank
PageRank is a citation analysis algorithm.
Basic idea:
- Pages with many links from high-quality pages are likely to be high-quality.
- High-quality pages are more likely to link to other high-quality pages.
The recursive flavor comes through in the PageRank formula.

PageRank User Model
PageRank is the result of modeling users as "random surfers":
- A user starts randomly on some page in the web.
- They randomly click links, never going back.
- At some point they jump to a new page selected uniformly at random.
PR(A) = the probability that a random surfer is looking at page A.
Google uses PageRank to modify an IR score-based ranking.
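The random surfer model above can be simulated by repeatedly redistributing probability along links. The sketch below uses the normalized form PR(A) = (1 - d)/n + d * sum over pages T linking to A of PR(T)/C(T), where C(T) is the number of outgoing links of T. The damping factor d = 0.85, the uniform treatment of pages with no outgoing links, and the tiny example graph are assumptions for illustration, not details from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # surfer starts on a uniformly random page
    for _ in range(iterations):
        # Assumption: a surfer on a page with no outgoing links jumps
        # to a uniformly random page.
        dangling = sum(pr[p] for p in pages if not links.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for t in targets:
                # Following a link: p's probability is split among its targets.
                new[t] += damping * pr[p] / len(targets)
        pr = new
    return pr

# Hypothetical four-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Page C collects links from A, B, and D and ends up ranked highest, while D, which nothing links to, keeps only the share contributed by random jumps.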