Lecture 7 April 30, Prabhakar Raghavan


Lecture 7 April 30, 2001 Prabhakar Raghavan

Finish up web ranking. Peer-to-peer search. Search deployment models: service vs. software; external vs. internal-facing search software. Review of search topics.

Increase weights of terms in titles. Increase weights of terms in <h > tags. Increase weights of terms near the beginning of the doc, its chapters and sections (key phrases).

Example anchor texts: "Tiger image", "Here is a great picture of a tiger", "Cool tiger webpage". The text in the vicinity of a hyperlink is descriptive of the page it points to.

Two uses of anchor text: when indexing a page, also index the anchor text of links pointing to it; and use it to weight links in the hubs/authorities algorithm from the last lecture. Anchor text is usually taken to be a window of 6-8 words around a link anchor.

When indexing a document D, include anchor text from links pointing to D. Example: three pages link to www.ibm.com with the surrounding text "Armonk, NY-based computer giant IBM announced today", "Joe's computer hardware links: Compaq, HP, IBM", and "Big Blue today announced record profits for the quarter".

Can sometimes have unexpected side effects - e.g., evil empire. Can index anchor text with less weight.

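In hub/authority link analysis, can match anchor text to the query, then weight each link. The unweighted updates $h(x) \leftarrow \sum_{x \to y} a(y)$ and $a(x) \leftarrow \sum_{y \to x} h(y)$ become $h(x) = \sum_{x \to y} w(x,y)\, a(y)$ and $a(x) = \sum_{y \to x} w(y,x)\, h(y)$, where $w(x,y)$ is the weight of the link from x to y.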

What is w(x,y)? It should increase with the number of query terms in the anchor text, say 1 + the number of query terms. Example: page x contains "Armonk, NY-based computer giant IBM announced today" with a link to y = www.ibm.com. The weight of this link for the query "computer" is 2.

Recall the basic algorithm: iteratively update all h(x), a(x); after iteration, output the pages with the highest h() scores as top hubs and the highest a() scores as top authorities. Now use weights in the iteration. This raises the scores of pages with heavy links. Do we still have convergence of scores? To what?
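The weighted iteration can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name weighted_hits, the fixed iteration count, and the unit-length normalization after each round are all assumptions made for the example.

```python
from math import sqrt

def weighted_hits(links, iterations=50):
    """links maps (x, y) -> w(x, y) for each hyperlink x -> y."""
    pages = {p for edge in links for p in edge}
    h = {p: 1.0 for p in pages}
    a = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(x) = sum over links y -> x of w(y, x) * h(y)
        new_a = {p: 0.0 for p in pages}
        for (src, dst), w in links.items():
            new_a[dst] += w * h[src]
        # h(x) = sum over links x -> y of w(x, y) * a(y)
        new_h = {p: 0.0 for p in pages}
        for (src, dst), w in links.items():
            new_h[src] += w * new_a[dst]
        # Assumption: rescale to unit Euclidean length so scores stay bounded.
        norm_a = sqrt(sum(v * v for v in new_a.values())) or 1.0
        norm_h = sqrt(sum(v * v for v in new_h.values())) or 1.0
        a = {p: v / norm_a for p, v in new_a.items()}
        h = {p: v / norm_h for p, v in new_h.items()}
    return h, a

# Hypothetical graph: the link x1 -> ibm has anchor text matching one query term.
links = {("x1", "ibm"): 2.0, ("x2", "ibm"): 1.0, ("x2", "hp"): 1.0}
h, a = weighted_hits(links)
```

In this toy graph the heavier link raises both the authority score of its target and the hub score of its source.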

Lots of pages in a site give varying aspects of information on the same topic. Treat portions of web-sites as a single entity for score computations.

Links on a page tend to point to the same topics as neighboring links. Break pages down into pagelets (say separate by tags) and compute a hub/authority score for each pagelet.

Example: "Ron Fagin's links" page with two pagelets. Logic links: Moshe Vardi's logic page; International logic symposium; Paper on modal logic. My favorite football team: The 49ers; Why the Raiders suck; Steve's homepage; The NFL homepage.

The WWW is full of free-spirited opinion, annotation, and authority conferral. Most other forms of hypertext are far more structured: enterprise intranets are regimented and templated, with very little free-form community formation, so web-derived link ranking doesn't quite work.

Powerful new ideas derived from the sociology of web content creation, supplemented by other heuristics. Less useful in intranets. Challenges from dynamic HTML, application servers, and web content management systems.

For each query Q, keep track of which docs in the results are clicked on. On subsequent requests for Q, re-order the docs in the results based on click-throughs. First due to DirectHit (later part of AskJeeves).

Query-doc popularity matrix B: rows are queries q, columns are docs j. $B_{qj}$ = number of times doc j was clicked through on query q. When query q is issued again, order docs by their $B_{qj}$ values.
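A minimal Python sketch of this bookkeeping follows. The names clicks, record_click and reorder are hypothetical, and the matrix is stored sparsely as nested dictionaries rather than as a dense array.

```python
from collections import defaultdict

# clicks[q][j] plays the role of B_qj; missing entries count as zero.
clicks = defaultdict(lambda: defaultdict(int))

def record_click(query, doc):
    clicks[query][doc] += 1

def reorder(query, results):
    # sorted() is stable: docs with equal click counts keep their original order.
    return sorted(results, key=lambda doc: -clicks[query][doc])

record_click("ferrari mondial", "d3")
record_click("ferrari mondial", "d3")
record_click("ferrari mondial", "d2")
```

Here the query string itself is the row key, which is exactly where the "what identifies a query?" issue shows up.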

Issues: weighing/combining text- and click-based scores. What identifies a query? Is Ferrari Mondial the same query as Ferrari mondial or ferrari mondial? Can use heuristics, but they slow down search parsing.

Maintain a term-doc popularity matrix C (as opposed to query-doc popularity), initialized to all zeros. Each column $C_j$ represents a doc j. If doc j is clicked on for query q, update $C_j \leftarrow C_j + \varepsilon q$ (here q is viewed as a vector). On a query q, compute its cosine proximity to $C_j$ for all j. Combine this with the regular text score.
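The update and the cosine lookup can be sketched as follows. This is an illustration under stated assumptions: EPSILON = 0.1, the whitespace tokenization in query_vector, and all function names are made up for the example; combining with the text score is left out.

```python
import math
from collections import defaultdict

EPSILON = 0.1  # assumed step size for each click

# C[j][term]: sparse column of the term-doc popularity matrix for doc j.
C = defaultdict(lambda: defaultdict(float))

def query_vector(query):
    vec = defaultdict(float)
    for term in query.lower().split():
        vec[term] += 1.0
    return vec

def record_click(query, doc):
    # C_j <- C_j + epsilon * q
    for term, weight in query_vector(query).items():
        C[doc][term] += EPSILON * weight

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def popularity(query, doc):
    return cosine(query_vector(query), C.get(doc, {}))

record_click("Ferrari Mondial", "d1")
```

Because the key is a term vector rather than the raw query string, a later query for just "ferrari" still gets some credit for the earlier click.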

Issues: normalization of $C_j$ after updating. Boolean operators. Why did the user click on the doc? Updating: live or batch? All votes count the same. More on this in recommendation systems.

Time spent viewing page: difficult session management; inconclusive modeling so far. Does the user back out of the page? Does the user stop searching? Does the user transact?

No central index. Each node in a network builds and maintains its own index. Each node has servent software. On booting, the servent pings ~4 other hosts and connects to those that respond. It initiates, propagates and serves requests.

Which hosts to connect to: the ones you connected to last time; random hosts you know of; suggestions requested from central (or hierarchical) nameservers. All govern the system's shape and efficiency.

Send your request to your neighbors. They send it to their neighbors, decrementing the time to live (ttl) for the query; the query dies when ttl = 0. Search matches are sent back along the requesting path.
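The flooding scheme can be simulated in Python as below. This is a sketch only: the function name flood, the adjacency-list network, and the per-node contents sets are hypothetical, and dropping duplicate copies of a query via a seen set is an added assumption that the slide does not mention.

```python
def flood(network, contents, start, term, ttl):
    """Return {matching node: path from that node back to the requester}."""
    matches = {}
    seen = {start}
    frontier = [(start, [start])]
    while frontier and ttl > 0:
        next_frontier = []
        for node, path in frontier:
            for neighbor in network[node]:
                if neighbor in seen:
                    continue  # assumption: a node ignores a query it already saw
                seen.add(neighbor)
                new_path = path + [neighbor]
                if term in contents.get(neighbor, ()):
                    # The match travels back along the requesting path.
                    matches[neighbor] = new_path[::-1]
                next_frontier.append((neighbor, new_path))
        frontier = next_frontier
        ttl -= 1  # one hop consumed; the query dies when ttl reaches 0
    return matches

network = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
contents = {"c": {"tiger"}, "d": {"tiger"}}
```

With ttl = 2 the query from a reaches c but dies before reaching d, which illustrates how the ttl bounds both the load and the coverage of a search.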

Fresh content: no waiting for the next weekly indexing. Dynamic content: results could be assembled from a database or other repository, e.g., live pricing/inventory information.

Internet: query interpretation is up to the servent (spamming potential); no co-ordination in the network (fragmentation). Enterprises: security and access control; administration; distributed replication and caching.

Intranet vs. extranet

As a service: public, e.g., web search; or access-protected, e.g., proprietary newsfeeds and content. As software: outward-facing (Walmart, CDNow) or inward-facing within an enterprise.

Pros: ease of maintenance (software as well as indices); can tune to platform. Con: to date, not much proprietary content; owners of valuable content don't hand over custody.

Inward- vs. outward-facing: very different characteristics in corpus sizes, query rates, languages and localization, security, and content management.

Relatively small corpora, typically under 1GB. Sporadic query rates, high peak loads. Fairly dynamic corpus, e.g., item prices in a catalog.

Product database (RDBMS) with product info: prices, descriptions. Search engine: spiders the DB and indexes structured + unstructured product info. Application server: content assembly, personalization. Web server. Back-end inventory RDBMS to complete the transaction.

[Figure: several brokers in front of a set of search servers.]

Partition by documents: each server has a subset of the docs, and each has its own dictionary. The query is sent out to all servers. The broker ensures load-balancing and failover.

Partition by terms: each server has a subset of the lexicon. The query is sent to the server(s) holding the query term(s). Partitioning alphabetically gives easy query dispatch; partitioning by hashing gives a uniform spread. Query optimization is hard. Works best when query terms are uniformly spread across servers.
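The two term-partitioning schemes can be compared with a small Python sketch. Everything here is illustrative: NUM_SERVERS = 4, the function names, and the use of CRC32 as the hash are assumptions, not anything the lecture prescribes.

```python
import zlib

NUM_SERVERS = 4  # assumed cluster size

def server_by_hash(term):
    # crc32 is stable across runs, unlike Python's built-in hash() on strings.
    return zlib.crc32(term.encode("utf-8")) % NUM_SERVERS

def server_by_alphabet(term):
    # Split a-z into NUM_SERVERS contiguous ranges; anything else goes to server 0.
    first = term[0].lower()
    if not "a" <= first <= "z":
        return 0
    return (ord(first) - ord("a")) * NUM_SERVERS // 26

def dispatch(query_terms, partition):
    """Group query terms by the server that holds their postings."""
    servers = {}
    for term in query_terms:
        servers.setdefault(partition(term), []).append(term)
    return servers
```

A broker would call dispatch with either partition function and contact only the servers that appear as keys in the result.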

Search within an intranet. Enterprise portals. An enterprise doesn't have to be a (for-profit) company: government, academe, any collaborative group with proprietary information.

Scale: lots of docs, geographically distributed over a non-uniform WAN. Multiple languages and character sets; locale modules for stemming, thesauri. Multiple document repositories: Lotus, Exchange, Documentum, Filenet. Materialized views of compound documents. Multiple formats: pdf, MS Office, multiple MIME-type attachments.

Each doc has access permissions for groups. A user is authenticated for membership in certain groups; this can change with time. Results of a search should only contain docs the user can view. It is not sufficient to show a doc in the results and then deny the user attempting to access it. Compound docs are made up of pieces, and each piece has its own ACLs.
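Filtering results against group permissions might look like the Python sketch below. The data layout (a list of group sets per doc, one set per piece) and the rule that a compound doc is shown only when every piece is readable are assumptions made for illustration; the slide does not say how partial access should be handled.

```python
def visible_results(results, doc_acl, user_groups):
    """Keep docs for which the user belongs to a permitted group of every piece.

    doc_acl maps doc id -> list of sets of group names, one set per piece.
    Docs with no ACL entry are treated as not viewable (an assumption).
    """
    allowed = []
    for doc in results:
        pieces = doc_acl.get(doc, [])
        if pieces and all(acl & user_groups for acl in pieces):
            allowed.append(doc)
    return allowed

doc_acl = {
    "memo": [{"staff"}],
    "budget": [{"finance"}],
    "report": [{"staff"}, {"finance", "exec"}],  # compound doc with two pieces
}
```

Since group membership can change with time, user_groups would need to be looked up at query time rather than cached with the index.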

Enterprise search inside and outside are quite different, and each is different from a public web search service. Inside-enterprise search is the most fragile: tremendous diversity; flexible, hard-to-administer software vs. expensive customization.

Dictionary of terms. Each term points to a series of postings entries. Postings for a term point to docs containing that term.

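Blocking: store pointers to every kth term in the term string. Need to store term lengths (1 extra byte each). Example term string with length prefixes: ...7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo... Dictionary table columns: Freq., Postings ptr., Term ptr. (example freq. values: 33, 29, 44, 126, 7). For a block of 4 terms: save 9 bytes on 3 pointers, lose 4 bytes on term lengths.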

Store the list of docs containing a term in increasing order of doc id. Brutus: 33,47,154,159,202. Suffices to store gaps: 33,14,107,5,43. Gaps are encoded with far fewer than 20 bits, using γ codes.

Estimate using a crude Zipf law analysis. The most frequent term occurs in n docs, the second most frequent term in n/2 docs, the kth most frequent term in n/k docs, etc. The kth term has n/k gaps of about k each; use ~$2\log_2 k$ bits for each gap using γ codes.

Stemming (Porter's). Case folding. Thesauri and soundex. Spell correction.

Consider a query that is an AND of t terms. The idea: for each of the t terms, get its term-doc incidence from the postings, then AND together. Process in order of increasing freq: start with smallest set, then keep cutting further. This is why we kept freq in dictionary.
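A short Python sketch of this strategy follows. The index layout (term mapped to a sorted postings list, whose length serves as the stored frequency) and the function names are assumptions for the example.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    # Process terms in order of increasing frequency: smallest postings first.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        if not result:
            break  # nothing left to cut down
        result = intersect(result, postings)
    return result

index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [2, 13, 16],
}
```

Starting from the rarest term keeps every intermediate result no larger than the smallest postings list.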

Example postings with skip pointers: 2,4,6,8,10,12,14,16,18,20,22,24,... At query time: as we walk the current candidate list, concurrently walk the inverted file entry; can skip ahead (e.g., 8,21). Skip size: recommend about √(list length).
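The walk with skips can be sketched in Python as below. It is a simplified model: skip pointers are assumed to sit at positions 0, s, 2s, ... of the postings list with s = floor(sqrt(list length)), and they are simulated by index arithmetic rather than stored explicitly.

```python
import math

def intersect_with_skips(candidates, postings):
    """Intersect a sorted candidate list with a sorted postings list."""
    skip = max(1, math.isqrt(len(postings)))  # about sqrt(list length)
    out, j = [], 0
    for doc in candidates:
        while j < len(postings) and postings[j] < doc:
            at_skip_point = j % skip == 0 and j + skip < len(postings)
            if at_skip_point and postings[j + skip] <= doc:
                j += skip  # safe: everything jumped over is smaller than doc
            else:
                j += 1
        if j == len(postings):
            break
        if postings[j] == doc:
            out.append(doc)
    return out

postings = list(range(2, 50, 2))  # 2, 4, 6, ..., 48
```

With candidates 8, 21 and 40, the walk jumps over several postings between 8 and 21 and again on the way to 40 instead of visiting each entry.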

mon*: find all docs containing any word beginning with mon. Solution: index all k-grams occurring in any doc (any sequence of k chars). The query mon* can now be run as $m AND mo AND on. But we'd get a match on moon. Must post-filter these results against the query.
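A Python sketch of the k-gram lookup with post-filtering follows. As a simplifying assumption it maps each k-gram to the vocabulary words containing it (the matching words would then be looked up in the regular index to get docs); k = 2 and the function names are chosen for the example.

```python
from collections import defaultdict

def kgrams(word, k=2):
    padded = "$" + word + "$"  # $ marks the word boundaries
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocabulary, k=2):
    index = defaultdict(set)
    for word in vocabulary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index

def prefix_candidates(index, prefix, k=2):
    """Words containing every k-gram of '$' + prefix (non-empty prefix assumed)."""
    padded = "$" + prefix
    grams = [padded[i:i + k] for i in range(len(padded) - k + 1)]
    return set.intersection(*(index.get(g, set()) for g in grams))

def prefix_query(index, prefix, k=2):
    # Post-filter: the k-gram AND can admit words like 'moon' for 'mon*'.
    return sorted(w for w in prefix_candidates(index, prefix, k) if w.startswith(prefix))

kgram_index = build_kgram_index(["moon", "month", "monday", "lemon", "demon"])
```

For mon*, the candidate set contains moon, month and monday; the post-filter drops moon.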

Search for "to be or not to be". It no longer suffices to store only <term: docs> entries. Instead store, for each term, entries <number of docs containing term; doc1: position1, position2, ...; doc2: position1, position2, ...; etc.>
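A positional index and a phrase lookup can be sketched in Python as follows. The nested-dictionary layout (term to doc to list of positions) and the function names are illustrative assumptions; the doc count per term is simply the number of keys in the inner dictionary.

```python
def build_positional_index(docs):
    """docs maps doc id -> text; returns term -> {doc id -> [positions]}."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

def phrase_docs(index, phrase):
    """Docs in which the terms of the phrase occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    docs = set(index[terms[0]])
    for t in terms[1:]:
        docs &= set(index[t])
    hits = []
    for doc in sorted(docs):
        starts = index[terms[0]][doc]
        # The k-th term must appear exactly k positions after the first term.
        if any(all(p + k in index[t][doc] for k, t in enumerate(terms)) for p in starts):
            hits.append(doc)
    return hits

pos_index = build_positional_index({
    1: "to be or not to be that is the question",
    2: "be not or to be to",
})
```

Doc 2 contains every word of the phrase "not to be", but not at consecutive positions, so only the positions can rule it out.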

Precision: fraction of retrieved docs that are relevant. Recall: fraction of relevant docs that are retrieved. Both can be measured as functions of the number of docs retrieved.
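Both measures as functions of the number of docs retrieved can be computed with a few lines of Python. The function name and the toy ranking are made up for the example.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall after retrieving the top k docs of a ranking."""
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
```

On this toy ranking, precision drops from 1.0 at k = 1 to 0.5 at k = 4, while recall rises from 1/3 to 2/3.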

Parse and build postings entries one doc at a time. To turn this into a term-wise view, we must sort the postings entries by term (then by doc within each term). Work with blocks of postings records; can easily fit a couple into memory. Sort within blocks first, then merge.
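The sort-within-blocks-then-merge idea can be sketched in Python as below. In-memory lists stand in for on-disk blocks here, which is a simplification; the tiny block_size and the function names are assumptions for illustration.

```python
import heapq

def postings_records(docs):
    """Yield (term, doc id) records one doc at a time."""
    for doc_id, text in docs:
        for term in text.lower().split():
            yield (term, doc_id)

def sorted_blocks(records, block_size):
    """Cut the record stream into blocks and sort each block on its own."""
    block = []
    for record in records:
        block.append(record)
        if len(block) == block_size:
            yield sorted(block)
            block = []
    if block:
        yield sorted(block)

def build_index(docs, block_size=4):
    blocks = list(sorted_blocks(postings_records(docs), block_size))
    index = {}
    # heapq.merge streams the sorted blocks in (term, doc id) order.
    for term, doc_id in heapq.merge(*blocks):
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # drop duplicate (term, doc) records
    return index

docs = [(1, "brutus killed caesar"), (2, "caesar praised brutus"), (3, "caesar caesar")]
index = build_index(docs)
```

Only one block needs to be sorted in memory at a time; the merge step then reads the sorted blocks sequentially.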

Inserting a (variable-length) record, e.g., a typical postings entry: maintain a pool of (say) 64KB chunks. The chunk header maintains metadata on the records in the chunk and its free space. Chunk layout: header, free space, then records.

Each doc j can now be viewed as a vector of tf × idf values, one component for each term. So we have a vector space: terms are axes, and docs live in this space. Even with stemming, it may have 10000+ dimensions.

$$w_{ij} = tf_{ij} \times idf_i = tf_{ij} \times \log\frac{n}{n_i}$$
where $tf_{ij}$ = frequency of term i in document j; $n$ = total number of documents; $n_i$ = the number of documents that contain term i; $idf_i = \log(n / n_i)$ = inverse document frequency of term i.

Distance between vectors D1, D2 is captured by the cosine of the angle x between them. Note: this is similarity, not distance. [Figure: document vectors d1 and d2 in the space spanned by term axes t1, t2, t3, with angle x between them.]
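The tf × idf weighting and the cosine measure can be put together in a short Python sketch. It assumes whitespace tokenization and the natural logarithm (the base only rescales all weights, so it does not change cosines); the function names and the sample docs are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse vector per doc with weights tf_ij * log(n / n_i)."""
    n = len(docs)
    tfs = [Counter(text.lower().split()) for text in docs]
    df = Counter(term for tf in tfs for term in tf)  # n_i: docs containing term i
    return [{t: f * math.log(n / df[t]) for t, f in tf.items()} for tf in tfs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = tfidf_vectors(["tiger tiger image", "cool tiger webpage", "football team"])
```

The first two docs share the term tiger and get a positive cosine, while the third shares no term with them and gets 0.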

Key: a user's query can be viewed as a (very) short document. The query becomes a vector in the same space as the docs. Can measure each doc's proximity to it. Natural measure of scores/ranking; no longer Boolean.

Computing individual cosines. Speeding up computations. Avoiding computing cosines to all docs. Dimensionality reduction: random projection, LSI.

[Figure: inference network. Document network: documents d1, d2; terms r1, r2, r3; concepts c1, c2, c3. Query network: query operators (AND/OR/NOT) q1, q2; information need i.]

Structured search - search by restricting on attribute values, as in databases. Unstructured search - search in unstructured files, as in text. Semi-structured search: combine both.

Two basic approaches. (1) A universal, query-independent ordering on all web pages (based on link analysis): of two pages meeting a (text) query, one will always win over the other, regardless of the query. (2) A query-specific ordering on web pages: of two pages meeting a query, the relative ordering may vary from query to query.

For any ergodic Markov chain, there is a unique long-term visit rate for each state: the steady-state distribution. Over a long time period, we visit each state in proportion to this rate. It doesn't matter where we start.
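The claim that the starting point does not matter can be checked numerically with a small Python sketch. The 3-state transition matrix P below is a made-up ergodic chain, and the fixed number of steps is an assumption that happens to be ample for this example.

```python
def steady_state(P, start, steps=200):
    """Repeatedly apply dist <- dist * P, starting from the given distribution."""
    n = len(P)
    dist = list(start)
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

# P[i][j] = probability of moving from state i to state j; each row sums to 1.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

from_first = steady_state(P, [1.0, 0.0, 0.0])
from_last = steady_state(P, [0.0, 0.0, 1.0])
```

Both runs approach the same distribution (0.25, 0.5, 0.25), which is the vector satisfying π = πP for this chain.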

MIR 9.