A Survey on Web Information Retrieval Technologies

1 A Survey on Web Information Retrieval Technologies
Lan Huang, Computer Science Department, State University of New York, Stony Brook
Presented by Kajal Miyan, Michigan State University

2 Overview
- Web Information Retrieval Challenges
- Search Engine Overview and Architecture (Google as Case Study)
- Various Algorithms
- Directories Overview and some Algorithms
- Estimating Size of Web
- Main Challenges
- Google File System
- Sponsored Search
- Discussion

3 Web Information Retrieval Challenges
- Bulk
- Dynamic Internet
- Heterogeneity
- Variety of Languages
- Duplication
- High Linkage
- Ill-formed queries
- Wide Variance in Users
- Specific Behavior

4 The Goal

5 Evaluation
- Precision: the fraction of the retrieved documents that are relevant to the user's information need.
  Precision = |relevant documents AND retrieved documents| / |retrieved documents|
- Recall: the fraction of the documents relevant to the query that are successfully retrieved.
  Recall = |relevant documents AND retrieved documents| / |relevant documents|
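The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the use of document-ID sets are assumptions, not part of the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    # Guard against empty sets so the sketch never divides by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if documents 1 to 4 are relevant and the engine returns documents 3, 4 and 5, precision is 2/3 and recall is 1/2.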

6 Current Goal Precision at Top 0 results & Precision at Top 0 result pages

7 Various SEs
- SEs
- Directories
- News SEs
- Meta Search Engines
- Social Search Engines
- Opinions, Forums, Usenets SEs

8 Architecture (Sherman 2003)
[Figure: search engine architecture. A crawler fetches URLs from the web, an indexer builds the index, and a query server answers a user query ("Eggs?") with a list of results scored by relevance.]

9 Crawling the web
- A crawler must be robust: it should avoid overloading websites and must deal with huge amounts of data.
- It decides in what order to crawl the pages.
- It decides how frequently to revisit pages.
- Rule of thumb: important pages first.

10 Crawler

11 Cho et al. 99 (spread the workload)
- Allocate the URLs to 500 queues
- Allocation is based on the hash of the server name
- Read one URL from each queue at a time
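A minimal sketch of this queueing scheme follows. The function names and the choice of CRC32 as the hash are assumptions made for illustration; the slide only says that URLs are assigned to 500 queues by a hash of the server name and read one per queue in turn.

```python
import zlib
from collections import deque
from urllib.parse import urlsplit

NUM_QUEUES = 500  # queue count taken from the slide


def queue_index(url, num_queues=NUM_QUEUES):
    """Map a URL to a queue using a hash of its server name (CRC32 is an assumption)."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_queues


def allocate(urls, num_queues=NUM_QUEUES):
    """Distribute URLs over the queues; all URLs of one server share a queue."""
    queues = [deque() for _ in range(num_queues)]
    for url in urls:
        queues[queue_index(url, num_queues)].append(url)
    return queues


def round_robin(queues):
    """Yield one URL from each non-empty queue per pass."""
    while any(queues):
        for q in queues:
            if q:
                yield q.popleft()
```

Because every URL of a given server lands in the same queue, consecutive requests to that server are separated by one read from each of the other non-empty queues, which spreads the load.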

12 Inverted Indexes the IR Way: How Are Inverted Indexes Created?

13 Periodically rebuilt, static otherwise. Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Term (Doc #): now (1), is (1), the (1), time (1), for (1), all (1), good (1), men (1), to (1), come (1), to (1), the (1), aid (1), of (1), their (1), country (1), it (2), was (2), a (2), dark (2), and (2), stormy (2), night (2), in (2), the (2), country (2), manor (2), the (2), time (2), was (2), past (2), midnight (2)

14 After all documents have been parsed, the inverted file is sorted alphabetically.
Term (Doc #), sorted: a (2), aid (1), all (1), and (2), come (1), country (1), country (2), dark (2), for (1), good (1), in (2), is (1), it (2), manor (2), men (1), midnight (2), night (2), now (1), of (1), past (2), stormy (2), the (1), the (1), the (2), the (2), their (1), time (1), time (2), to (1), to (1), was (2), was (2)

15 Multiple term entries for a single document are merged. Within-document term frequency information is compiled.
Term (Doc #, Freq): a (2, 1), aid (1, 1), all (1, 1), and (2, 1), come (1, 1), country (1, 1), country (2, 1), dark (2, 1), for (1, 1), good (1, 1), in (2, 1), is (1, 1), it (2, 1), manor (2, 1), men (1, 1), midnight (2, 1), night (2, 1), now (1, 1), of (1, 1), past (2, 1), stormy (2, 1), the (1, 2), the (2, 2), their (1, 1), time (1, 1), time (2, 1), to (1, 2), was (2, 2)

16 How Inverted Files are Created
Finally, the file can be split into:
- a Dictionary (or Lexicon) file
- a Postings file

17 Dictionary/Lexicon and Postings
Term      N docs  Tot Freq  Postings (Doc #, Freq)
a         1       1         (2, 1)
aid       1       1         (1, 1)
all       1       1         (1, 1)
and       1       1         (2, 1)
come      1       1         (1, 1)
country   2       2         (1, 1) (2, 1)
dark      1       1         (2, 1)
for       1       1         (1, 1)
good      1       1         (1, 1)
in        1       1         (2, 1)
is        1       1         (1, 1)
it        1       1         (2, 1)
manor     1       1         (2, 1)
men       1       1         (1, 1)
midnight  1       1         (2, 1)
night     1       1         (2, 1)
now       1       1         (1, 1)
of        1       1         (1, 1)
past      1       1         (2, 1)
stormy    1       1         (2, 1)
the       2       4         (1, 2) (2, 2)
their     1       1         (1, 1)
time      2       2         (1, 1) (2, 1)
to        1       2         (1, 2)
was       1       2         (2, 2)

18 Inverted indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of:
  - document ID
  - frequency of term in doc (optional)
  - position of term in doc (optional)
- These lists can be used to solve Boolean queries:
  - country -> d1, d2
  - manor -> d2
  - country AND manor -> d2
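The construction walked through on the preceding slides (tokenize, sort, merge with frequencies, then answer Boolean queries) can be sketched as follows. This assumes the two example documents are numbered 1 and 2 and uses a simple lowercase letters-only tokenizer; `build_index` and `boolean_and` are illustrative names, not from the slides.

```python
import re
from collections import Counter, defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def build_index(docs):
    """Return {term: {doc_id: within-document frequency}}, terms in sorted order."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        # Parse the document into tokens and count each term once per document.
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for term, freq in counts.items():
            postings[term][doc_id] = freq
    return {term: dict(sorted(p.items())) for term, p in sorted(postings.items())}


def boolean_and(index, *terms):
    """Documents containing every term: intersect the postings lists."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


index = build_index(DOCS)
```

With this index, `boolean_and(index, "country", "manor")` intersects the postings for the two terms and returns only document 2, matching the query example on the slide.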

19 Inverted Indexes for Web Search Engines
- Inverted indexes are still used, even though the web is so huge.
- Some systems partition the indexes across different machines. Each machine handles different parts of the data.
- Other systems duplicate the data across many machines; queries are distributed among the machines.
- Most do a combination of these.

20 Google Architecture

21 Repository
- Contains the full HTML text
- Compressed using zlib (RFC 1950)
- Prefixed by docid, length, and URL
Document Index
- Each entry contains:
  - The current doc status (crawled?)
  - A pointer into the repository (if crawled)
  - A document checksum (using binary search to find the docid)
  - Various statistics
Lexicon
- Kept in memory on a 256M machine
- Currently contains 14 million words
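As a rough illustration of a repository record, the sketch below compresses a page with Python's `zlib` module and prefixes it with a docid, lengths, and the URL. The exact byte layout (`HEADER`) and the function names are assumptions for this example; the slide does not specify the on-disk format.

```python
import struct
import zlib

# Assumed header layout: docid (8 bytes), URL length and compressed length (4 bytes each).
HEADER = struct.Struct(">QII")


def pack_record(docid, url, html):
    """Build one record: header, then the URL, then the zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(docid, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(record):
    """Inverse of pack_record: return (docid, url, html)."""
    docid, url_len, body_len = HEADER.unpack_from(record)
    start = HEADER.size
    url = record[start:start + url_len].decode("utf-8")
    body = record[start + url_len:start + url_len + body_len]
    return docid, url, zlib.decompress(body).decode("utf-8")
```

Storing the lengths in the header lets a reader skip from one record to the next without decompressing the page.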

22 Google's Indexing
- The Indexer converts each doc into a collection of hit lists and puts these into barrels, sorted by docid. It also creates a database of links.
  - Hit: <wordid, position in doc, font info, hit type>
  - Hit type: plain or fancy.
  - Fancy hit: occurs in URL, title, anchor text, or metatag.
  - Optimized representation of hits (2 bytes each).
- The Sorter sorts each barrel by wordid to create the inverted index. It also creates a lexicon file.
  - Lexicon: <wordid, offset into inverted index>
  - The lexicon is mostly cached in memory.
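The sorter step can be sketched as re-sorting docid-ordered entries by wordid and recording, for each wordid, the offset of its first posting. The tuple layout and the name `invert` are assumptions for illustration; the real barrels use a compact binary encoding that this sketch does not attempt to model.

```python
from itertools import groupby


def invert(forward_barrel):
    """forward_barrel: list of (docid, wordid, hits) tuples, sorted by docid.

    Returns (lexicon, postings): lexicon maps wordid to the offset of its first
    posting; postings is a list of (docid, hits) grouped by wordid.
    """
    by_word = sorted(forward_barrel, key=lambda entry: (entry[1], entry[0]))
    postings, lexicon = [], {}
    for wordid, group in groupby(by_word, key=lambda entry: entry[1]):
        lexicon[wordid] = len(postings)  # offset into the inverted index
        postings.extend((docid, hits) for docid, _, hits in group)
    return lexicon, postings
```

Looking up a word is then a lexicon access followed by a sequential read of its postings, which are ordered by docid within each wordid.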

23 Google's Inverted Index
Each barrel contains postings for a range of wordids.
[Figure: the in-memory lexicon, sorted by wordid, holds a #docs count per wordid and points into the postings (inverted barrels, on disk). Within a barrel, postings are sorted by docid; each docid entry stores #hits followed by its hit list. Barrels i and i+1 are shown.]

24 Google crawler
- Maintains its own DNS cache
- Uses asynchronous I/O to manage events
- 4 crawlers
- Both the URLserver and the crawlers are implemented in Python
- Each crawler keeps 300 connections open at once
- >100 pages/s, roughly 600K/s

25 How Is the Relevance of a Page Decided?

26 Content Relevance Phrase matching. Synonyms. URL analysis. Date last updated. Spell checking. Home page detection.

27 HTML Weighting
Class  Name        HTML tags
1)     Plain Text  None of the above
2)     Strong      STRONG, B, EM, I, U
3)     List        DL, OL, UL
4)     Header      H1, H2, H3, H4, H5, H6
5)     Anchor      A
6)     Title       TITLE
Meta tag text is mostly ignored by search engines.
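One way to apply the class table is to tag every term in a page with the class of its enclosing HTML element. The sketch below uses Python's standard `html.parser`. The rule that the highest-numbered enclosing class wins for nested tags is an assumption, as are the names `TAG_CLASS` and `ClassTagger`; the slide gives only the classes, not numeric weights.

```python
from html.parser import HTMLParser

TAG_CLASS = {
    "STRONG": 2, "B": 2, "EM": 2, "I": 2, "U": 2,
    "DL": 3, "OL": 3, "UL": 3,
    "H1": 4, "H2": 4, "H3": 4, "H4": 4, "H5": 4, "H6": 4,
    "A": 5,
    "TITLE": 6,
}


class ClassTagger(HTMLParser):
    """Collect (term, class) pairs; text outside the listed tags is class 1."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.terms = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag.upper())

    def handle_endtag(self, tag):
        tag = tag.upper()
        if tag in self.open_tags:
            # Remove the most recently opened matching tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        # Assumption: with nested tags, the highest class applies.
        cls = max((TAG_CLASS.get(t, 1) for t in self.open_tags), default=1)
        self.terms.extend((word.lower(), cls) for word in data.split())
```

A ranking function could then map each class to a weight of its choosing when scoring term occurrences.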

28 Link-Based Metrics
- A link from A to B can be viewed as a recommendation, a vote or a citation.
- Links can be referential or informational.
- Links affect the ranking of web pages and thus have commercial value.

29 Citation and Linking

30 PageRank - Motivation
- The number of incoming links to a page is a measure of the importance and authority of the page.
- The quality of each recommendation also counts, so a page is more important if the sources of its incoming links are important.

31 The Random Surfer
- Assume the web is a Markov chain.
- Surfers randomly click on links, where the probability of following any given outlink from page A is 1/m, and m is the number of outlinks from A.
- The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.
- Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.

32 PageRank (PR) - Definition
$$PR(W) = \frac{T}{N} + (1 - T)\left(\frac{PR(W_1)}{O(W_1)} + \frac{PR(W_2)}{O(W_2)} + \cdots + \frac{PR(W_n)}{O(W_n)}\right)$$
- W is a web page
- W_i are the web pages that have a link to W
- O(W_i) is the number of outlinks from W_i
- T is the teleportation probability
- N is the size of the web
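The definition can be evaluated by repeated substitution (power iteration), sketched below. The default T = 0.15, the fixed iteration count, and the treatment of pages with no outlinks (their rank is spread evenly over all pages) are assumptions of this sketch and are not stated on the slides.

```python
def pagerank(outlinks, teleport=0.15, iterations=50):
    """outlinks: {page: [pages it links to]}. Returns {page: PR(page)}."""
    pages = sorted(set(outlinks) | {q for targets in outlinks.values() for q in targets})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: teleport / n for p in pages}  # the T/N term
        for p in pages:
            targets = outlinks.get(p, [])
            if targets:
                # Each outlink of p passes on (1 - T) * PR(p) / O(p).
                share = (1 - teleport) * pr[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Assumption: a page without outlinks spreads its rank evenly.
                for q in pages:
                    new[q] += (1 - teleport) * pr[p] / n
        pr = new
    return pr
```

On a three-page example where A links to B and C, B links to C, and C links to A, page C ends up with the highest rank and B with the lowest, and the ranks sum to 1, as expected of visit probabilities.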

33 HITS: Hubs and Authorities
- Hyperlink-Induced Topic Search
- A on the left is an authority
- A on the right is a hub

34 HITS Algorithm: Iterate until Convergence
$$A(p) = \sum_{q \in B,\; q \to p} H(q) \qquad H(p) = \sum_{q \in B,\; p \to q} A(q)$$
- B is the base set
- q and p are web pages in B
- A(p) is the authority score for p
- H(p) is the hub score for p

35 HITS Algorithm Overview
Input:
- Q: a query string
- SE: a text-based search engine
- t: size of the root set
- d: max number of in-links
The top t pages (highest-ranked pages) from the text-based search engine form the root set.
Output: focused subset

36 ( Q = java, SE = AltaVista, t = 3, d = 3)

37 ( Q = java, SE = AltaVista, t = 3, d = 3)

38 Algorithm: An Iterative Algorithm
[authority] weights vector x0 = (1, 1, ..., 1)
[hub] weights vector y0 = (1, 1, ..., 1)
for i = 1, 2, ..., k
    xi = update_authorityw(yi-1)
    yi = update_hubw(xi)
    normalize(xi, yi)
return (xk, yk)
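The iteration above can be written out as a small runnable sketch. The edge-set representation, the name `hits`, and the use of Euclidean (L2) normalization are assumptions; the slides do not say which norm is used.

```python
import math


def hits(links, iterations=20):
    """links: set of (p, q) pairs meaning page p links to page q within the base set.

    Returns (authority, hub) score dictionaries.
    """
    pages = sorted({page for edge in links for page in edge})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages linking to p.
        auth = {p: sum(hub[q] for q, r in links if r == p) for p in pages}
        # Hub score of p: sum of (updated) authority scores of the pages p links to.
        hub = {p: sum(auth[r] for q, r in links if q == p) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```

In a toy graph where h1 links to a and b and h2 links to a, page a receives the highest authority score and h1 the highest hub score.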

50 Applications of HITS Search engine querying (speed is an issue). Finding web communities. Finding related pages. Populating categories in web directories. Citation analysis.

51 HITS Did Not Work Well
- Mutually reinforcing relationships between hosts
- Automatically generated links
- Non-relevant nodes
- Topic drift
- Approach:
  - k edges: 1/k authority weight each
  - l edges: 1/l hub weight each

52 Duplicate Elimination Challenges
- Defining the notion of a replicated collection precisely
- Slight differences between copies
- Efficient algorithms to identify such collections and exploit this knowledge of replication
- Hundreds of millions of pages
- Subgraph isomorphism: NP

53 Reasons for Inability to Detect Duplicates
- Update Frequency
- Different Formats
- Partial Crawls

54 Some Solutions
- IR for textual similarity
- Data mining for clustering, and so on
- Formal definition of similar collections follows:
  - Similar Pages
  - Similar Links

55 Growing Similar Clusters

56 Directories vs. Search Engines
Directories
- Hand-selected sites
- Search over the contents of the descriptions of the pages
- Organized in advance into categories
Search Engines
- All pages in all sites
- Search over the contents of the pages themselves
- Organized in response to a query by relevance rankings or other scores

57 Directories & Categorization
Automatic Categorization - Taper
- A taxonomy-and-path-enhanced-retrieval system
- Given: a hypertext document corpus and a small set of classified documents
- Goal: construct a classifier and apply it to new documents
Manual Categorization - OpenGrid and ODP

60 Good discriminating power: large interclass distance, small intraclass distance

61 Size of the Web
Typical questions:
- Which search engine has the largest coverage?
- How many pages are out there and how many are indexed?
Approach:
- Measure search engine coverage and overlap through random queries
- Allows a third party to measure relative sizes and overlaps of search engines
- Taking two search engines, E1 and E2, we can:
  - Compute their relative sizes
  - Compute the fraction of E1's database indexed by E2
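The overlap idea can be sketched as follows. If f12 is the fraction of pages sampled from E1 that E2 also indexes, and f21 is the reverse, then the overlap is roughly f12 * |E1| and also f21 * |E2|, so |E1| / |E2| is roughly f21 / f12. In this sketch plain Python sets stand in for the engines' indexes and the function names are hypothetical; a real measurement would sample pages through random queries and test containment by querying each engine.

```python
def fraction_indexed(sample_from_a, index_b):
    """Fraction of the pages sampled from engine A that engine B also indexes."""
    sample = list(sample_from_a)
    return sum(url in index_b for url in sample) / len(sample)


def relative_size(sample_1, index_1, sample_2, index_2):
    """Estimate |E1| / |E2| from the two overlap fractions."""
    f12 = fraction_indexed(sample_1, index_2)  # fraction of E1 indexed by E2
    f21 = fraction_indexed(sample_2, index_1)  # fraction of E2 indexed by E1
    return f21 / f12
```

The estimate needs no access to either engine's full index, which is what allows a third party to run it. It is undefined when the samples show no overlap (f12 is zero).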

62 Size Wars
- August 2005: "We index 20 billion documents."
- September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."
- So, who's right?

63 We knew the web was big...
As of 7/25/2008, Google claimed to have found 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

66 Challenges
- Spam
  - Text Spam
  - Link Spam
  - Cloaking
- Content Quality
- Quality Evaluation
- Web Conventions (META tag)
- Duplicate Hosts
- Vaguely Structured Data

67 Google File System
- Performance, Scalability, Reliability, Availability
- Component failures are the norm
- Files are huge
- Most files are mutated by appending at the end rather than overwriting data
- Co-designing APIs with the file system to increase flexibility

68 GFS

69 Sponsored Search
- More than 50% of users visit an SE every few days
- Over 3% of traffic to commercial sites was generated by SEs
- Over 40% of product searches on the web were initiated via SEs
- Pioneered by GoTo, later Overture
- Google

70 Measurement and Pricing
- Cost per mille (CPM)
- Cost per action
- Cost per click (CPC)
- Click through rate = (000*CPM) + CPC
- Yahoo! - Uses CPC
- Google - Uses CTR * CPC
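The difference between the two ranking rules named on the slide can be illustrated with a toy example. The ad records, field names, and figures below are made up for illustration; CTR is taken here as the fraction of impressions that lead to a click.

```python
def rank_by_cpc(ads):
    """Order ads by bid alone (the rule the slide attributes to Yahoo!)."""
    return sorted(ads, key=lambda ad: ad["cpc"], reverse=True)


def rank_by_ctr_times_cpc(ads):
    """Order ads by CTR * CPC (the rule the slide attributes to Google)."""
    return sorted(ads, key=lambda ad: ad["ctr"] * ad["cpc"], reverse=True)


# Hypothetical ads: A bids more per click, B is clicked more often.
ADS = [
    {"name": "A", "cpc": 1.00, "ctr": 0.01},
    {"name": "B", "cpc": 0.50, "ctr": 0.05},
]
```

Under the CPC rule ad A ranks first because it bids more per click. Under CTR * CPC ad B ranks first, since its higher click-through rate yields more expected revenue per impression (0.025 against 0.01).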

71 Discussion
- Large field with great challenges
- Google really lived up to its name: 10^100!
- Highly profitable
- Internet today is a Big Brain...
- Can we contribute?
