Introduction to Information Retrieval
Ethan Phelps-Goodman
Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/

Information Retrieval (IR)
The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent killer app. IR is concerned firstly with retrieving documents relevant to a query, and secondly with retrieving from large sets of documents efficiently.

Why do we need IR?
3 billion documents are indexed in Google. There are 10-20 terabytes of text on the web, and ~1,000 terabytes of information (digital & non-digital) are produced every year (http://www.sims.berkeley.edu/research/projects/how-much-info/). Personal information: 20 emails/day. 200 emails/day?

Lecture Overview
Theoretical models of IR. Data structures for efficient implementation. IR for the web.

Basics of an IR system
What is a document? The bag of words model, same as project 3. Problems?

IR System
[Diagram: a document corpus and a query string feed into the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...] Much more complicated systems are imaginable.
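A minimal sketch of the bag of words model in Python (the tokenizer regex and function name are illustrative, not from the lecture):

    import re
    from collections import Counter

    def bag_of_words(text):
        """Reduce a document to an unordered multiset of lowercase word tokens."""
        return Counter(re.findall(r"[a-z0-9]+", text.lower()))

    print(bag_of_words("See Spot run. Run Spot, run."))
    # Counter({'run': 3, 'spot': 2, 'see': 1})

Word order and punctuation are discarded, which is both the source of the model's simplicity and one of its problems.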

Boolean Model
Query terms and Boolean operators AND, OR, NOT. For example: (cat OR dog) AND (collar OR leash). Pros? Cons?

Vector space model
Think of a document as a vector in a space, with one dimension for each word, so the space has huge dimension. The similarity of two documents (or of a document and a query) can be thought of as the distance between their vectors in the space.

Example: Vector-based representation
Doc1 = "See Spot run." Doc2 = "Run Spot, run." With 3 dimensions (run, spot, and see), Doc1 = {1, 1, 1} and Doc2 = {1, 1, 0}. For now, assume each entry is 1 if the word appears in the document and 0 otherwise.
[Diagram: Doc 1, Doc 2, and Doc 3 plotted as vectors on axes Term 1, Term 2, Term 3]

Similarity
How could we define similarity? Euclidean distance, Manhattan distance, word overlap, and plenty more. The inner product measure is common:

$sim(\vec{x}, \vec{y}) = \sum_i x_i y_i$

Problems with the inner product
This similarity metric just counts the number of words in common with the query. What does this leave out? What went wrong with your document correlator?
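A sketch of the binary vector representation and the inner product measure, reproducing the Doc1/Doc2 example above (helper names are mine):

    def binary_vector(doc_words, vocabulary):
        """1 if the word appears in the document, 0 otherwise."""
        return [1 if term in doc_words else 0 for term in vocabulary]

    def inner_product(x, y):
        """sim(x, y) = sum_i x_i * y_i: counts the words two vectors share."""
        return sum(xi * yi for xi, yi in zip(x, y))

    vocab = ["run", "spot", "see"]
    doc1 = binary_vector({"see", "spot", "run"}, vocab)   # {1, 1, 1}
    doc2 = binary_vector({"run", "spot"}, vocab)          # {1, 1, 0}
    print(inner_product(doc1, doc2))                      # 2 ('run' and 'spot' overlap)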

Weighting continued
How often does a term occur? Weight each term by its frequency in the document. How common is the term in the collection? Weight by inverse document frequency. How big is the document? Divide the inner product by the document size.

Cosine Measure
tf*idf weighting is standard:

$w_i = tf_i \cdot idf_i$, where $tf_i$ is the number of times word $i$ occurs in the document, and $idf_i = \log\left(\frac{\#\text{ documents in collection}}{\#\text{ documents containing word } i}\right)$

The similarity metric is:

$sim(\vec{x}, \vec{y}) = \frac{\sum_i x_i y_i}{|\vec{x}|\,|\vec{y}|}$

Implementation
The model says: given a query, go out and compute the similarity against all document vectors. Problems?

Sparse Vectors
The vocabulary, and therefore the dimensionality of the vectors, can be very large, ~10^4. However, most documents and queries do not contain most words, so the vectors are sparse (i.e., most entries are 0). We need efficient methods for storing and computing with sparse vectors. Ideas?

Inverted Files
An inverted file is just a dictionary ADT that maps from words to the documents containing that word.
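A sketch of tf*idf weighting and the cosine measure, using Python dicts as the sparse vectors suggested above (the names and the toy document frequencies are assumptions, not from the lecture):

    import math
    from collections import Counter

    def tfidf_vector(tokens, df, n_docs):
        """w_i = tf_i * idf_i, with idf_i = log(N / df_i)."""
        tf = Counter(tokens)
        return {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}

    def cosine(x, y):
        """Inner product divided by the lengths of both vectors."""
        dot = sum(w * y.get(t, 0.0) for t, w in x.items())
        norm = (math.sqrt(sum(w * w for w in x.values()))
                * math.sqrt(sum(w * w for w in y.values())))
        return dot / norm if norm else 0.0

    df = {"run": 3, "spot": 2, "see": 8}   # hypothetical dfs in a 10-doc collection
    v1 = tfidf_vector(["see", "spot", "run"], df, n_docs=10)
    v2 = tfidf_vector(["run", "spot", "run"], df, n_docs=10)
    print(round(cosine(v1, v2), 2))        # ~0.94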

Inverted Index

    Index term   df   Postings (D_j, tf_j)
    computer      3   (D7, 4), ...
    database      2   (D1, 3), ...
    science       4   (D2, 4), ...
    system        1   (D5, 2)

The index file stores each term with its document frequency df; each term points to a postings list of (document, term frequency) pairs.

Retrieval with an Inverted Index
Tokens that are not in both the query and the document do not affect cosine similarity: the product of their token weights is zero and does not contribute to the dot product. Usually the query is fairly short, and therefore its vector is extremely sparse. Use the inverted index to find the limited set of documents that contain at least one of the query words (sketched below).

Retrieval Efficiency
Assume that, on average, a query word appears in B documents: for query $Q = q_1 q_2 \ldots q_n$, the postings are $D_{11} \ldots D_{1B},\ D_{21} \ldots D_{2B},\ \ldots,\ D_{n1} \ldots D_{nB}$. Retrieval time is then $O(|Q| \cdot B)$, which is typically much better than naive retrieval that examines all N documents, $O(|V| \cdot N)$, because $|Q| \ll |V|$ and $B \ll N$.

IR for World Wide Web
Lots of challenges:
- Heterogeneous data: many formats, media types, and languages.
- Very little structure known a priori.
- Constantly changing.
- Demanding users: the average query is 2.4 words long, and users expect the desired page to be at the top of the list.
- Huge amount of data: 320 million pages in '98, 800 million pages in '99, 3 billion indexed by Google in 2003. Appears to be growing exponentially!

Handling that much data
A mixture of:
- Duplicate data over many machines to balance load.
- Split the inverted file into sections and assign machines to sections.
- Assign popular sections to larger clusters of machines.

Spiders
How do we collect pages to index? The web is just a graph: each URL is a vertex and each hyperlink is an outgoing edge. So start with some set of known sites, and expand out from there.
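A sketch of an inverted index with (document, term frequency) postings and the retrieval loop described above, scoring only documents that share at least one term with the query (raw tf accumulation stands in for the full tf*idf weights):

    from collections import Counter, defaultdict

    def build_inverted_index(docs):
        """Map each term to a postings list of (doc_id, term frequency) pairs."""
        index = defaultdict(list)
        for doc_id, tokens in docs.items():
            for term, tf in Counter(tokens).items():
                index[term].append((doc_id, tf))
        return index

    def retrieve(query_tokens, index):
        """Touch only the postings of the query terms, never the whole collection."""
        scores = Counter()
        for term in query_tokens:
            for doc_id, tf in index.get(term, []):
                scores[doc_id] += tf
        return scores.most_common()

    docs = {"D1": ["run", "spot", "run"], "D2": ["see", "spot"]}
    print(retrieve(["run", "see"], build_inverted_index(docs)))
    # [('D1', 2), ('D2', 1)]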

Basic crawling algorithm
Start with a set of known sites. While there are pages left in the queue:
- Retrieve a page.
- If the page hasn't been seen yet:
  - Index the page.
  - Extract links from the page and add them to the queue.
(A runnable sketch of this loop appears below.)

Updating index
The index must be continually updated. Which pages to update? Keep track of the popularity of pages and refresh popular pages more often. Update pages that change often: periodically check pages for changes, keep a history of how often each page changes, and refresh more dynamic pages more often.

Standard Web Search Engine Architecture
[Diagram: the user's query goes to the search engine servers, which consult the inverted index and return matching DocIds]

What do people search for on the web?
A sample of query frequencies:
4660 sex
3129 yahoo
2191 internal site admin check from kho
1520 chat
1498 porn
1315 horoscopes
1284 pokemon
1283 SiteScope test
1223 hotmail
1163 games
1151 mp3
1140 weather
1127 www.yahoo.com
1110 maps
1036 yahoo.com
983 ebay
980 recipes

PageRank
Practically any query will return thousands or millions of documents on the web. Users typically don't have a particular page they're looking for; they just want the best page on the topic. "Best" doesn't just mean similar terms: users want sites that are reliable and informative, sites that are authorities on the subject.

Web Popularity Contest
Some pages are authorities and some pages are hubs. Authorities are pointed to by lots of good pages; hubs point to lots of good pages. There are many exact definitions and algorithms. We'll look at a simple version of Google's PageRank algorithm.
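A minimal sketch of the basic crawling algorithm from earlier on this page; fetch, extract_links, and index_page are hypothetical stand-ins for real HTTP, HTML-parsing, and indexing components:

    from collections import deque

    def crawl(seed_urls, fetch, extract_links, index_page, max_pages=1000):
        """BFS over the web graph: each URL is a vertex, each hyperlink an edge."""
        queue = deque(seed_urls)
        seen = set(seed_urls)      # dedupe at enqueue time rather than fetch time
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            page = fetch(url)
            index_page(url, page)
            fetched += 1
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)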

Page Rank
Using the in-degree of a page gives a measure of popularity, but not all links are equal: a link from a highly ranked page counts for more than a link from a lowly ranked page. The rank of a page is based on the sum of the ranks of the pages that point to it:

$rank(p) = c \sum_{b:\, b \rightarrow p} \frac{rank(b)}{outdegree(b)}$

Initial PageRank Idea
The initial page rank equation for page p is:

$R(p) = c \sum_{q:\, q \rightarrow p} \frac{R(q)}{N_q}$

where $N_q$ is the total number of out-links from page q. A page q gives an equal fraction of its authority to all the pages it points to (e.g., p), and c is a normalizing constant set so that the ranks of all pages always sum to 1. We can view this as a process of PageRank flowing from pages to the pages they cite.
[Diagram: an example graph with rank values (0.1, 0.09, 0.08, 0.05, 0.03, ...) flowing along its links]

Initial Algorithm
Iterate the rank-flowing process until convergence. Let S be the total set of pages, and initialize $R(p) = 1/|S|$ for all $p \in S$. Until the ranks do not change (much):
- For each $p \in S$: $R'(p) = \sum_{q:\, q \rightarrow p} \frac{R(q)}{N_q}$
- $c = 1 / \sum_{p \in S} R'(p)$
- For each $p \in S$: $R(p) = c \cdot R'(p)$ (normalize)

Problem with Initial Idea
A group of pages that only point to themselves but are pointed to by other pages acts as a rank sink and absorbs all the rank in the system: rank flows into the cycle and can't get out.

Rank Source
Introduce a rank source E that continually replenishes the rank of each page p by a fixed amount E(p):

$R(p) = c \left( \sum_{q:\, q \rightarrow p} \frac{R(q)}{N_q} + E(p) \right)$
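A sketch of the iterative algorithm above, including the rank source E; the toy graph, the E values, and the fixed iteration cap (in place of an explicit convergence test) are illustrative assumptions:

    def pagerank(out_links, E, iterations=50):
        """Flow rank along links, add the rank source E(p), then renormalize."""
        pages = list(out_links)
        R = {p: 1.0 / len(pages) for p in pages}         # R(p) = 1/|S|
        for _ in range(iterations):
            new_R = {p: E.get(p, 0.0) for p in pages}    # start from the rank source
            for q, targets in out_links.items():
                for p in targets:
                    new_R[p] += R[q] / len(targets)      # q gives R(q)/N_q to each target
            c = 1.0 / sum(new_R.values())                # normalizing constant
            R = {p: c * r for p, r in new_R.items()}     # ranks again sum to 1
        return R

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}    # toy web graph
    print(pagerank(graph, E={p: 0.05 for p in graph}))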

Random Surfer Model
PageRank can be seen as modeling a random surfer that starts on a random page and then, at each point, with probability E(p) randomly jumps to page p, and otherwise randomly follows a link on the current page. R(p) models the probability that this random surfer will be on page p at any given time. The E jumps are needed to prevent the random surfer from getting trapped in web sinks with no outgoing links.

Speed of Convergence
Early experiments on Google used 322 million links. The PageRank algorithm converged (within a small tolerance) in about 52 iterations. The number of iterations required for convergence is empirically O(log n) (where n is the number of links), so the calculation is quite efficient.

Simple Title Search with PageRank
Use simple Boolean search to search web-page titles and rank the retrieved pages by their PageRank. In a sample search for "university", Altavista returned a random set of pages with "university" in the title (it seemed to prefer short URLs), while primitive Google returned the home pages of top universities.

Google Ranking
The complete Google ranking includes (based on university publications prior to commercialization):
- a vector-space similarity component,
- a keyword proximity component,
- an HTML-tag weight component (e.g., title preference), and
- a PageRank component.
The details of current commercial ranking functions are trade secrets.

Google PageRank-Biased Spidering
Use PageRank to direct (focus) a spider on important pages: compute PageRank using the current set of crawled pages, and order the spider's search queue based on the current estimated PageRank.

Link Analysis Conclusions
Link analysis uses information about the structure of the web graph to aid search. It is one of the major innovations in web search, and it is the primary reason for Google's success.
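One way to see the random surfer model concretely is a small Monte Carlo simulation (my illustration, not from the slides): visit frequencies approximate R(p), with a uniform jump probability playing the role of the E source.

    import random

    def random_surfer(out_links, jump_prob=0.15, steps=100_000):
        """Walk the graph; the fraction of time on each page estimates its rank."""
        pages = list(out_links)
        visits = {p: 0 for p in pages}
        page = random.choice(pages)
        for _ in range(steps):
            visits[page] += 1
            links = out_links[page]
            if not links or random.random() < jump_prob:
                page = random.choice(pages)   # E jump: rescues the surfer from sinks
            else:
                page = random.choice(links)   # otherwise follow a random out-link
        return {p: visits[p] / steps for p in pages}

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(random_surfer(graph))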

IR is getting more and more important. There are lots of interesting theoretical questions, lots of interesting engineering questions, and also lots of interesting human-related questions.