
Information Retrieval: Information Retrieval on the Web. Ulf Leser

Content of this Lecture
The Web
Web Crawling
Exploiting Web Structure for IR
A Different Flavor: WebSQL

Much of today's material is from: Chakrabarti, S. (2003). Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers.

Ulf Leser: Information Retrieval, Winter Semester 2014/2015

The World Wide Web
1965: Hypertext: "A File Structure for the Complex, the Changing, and the Indeterminate" (Ted Nelson)
1969: ARPANET
1971: First email
1978: TCP/IP
1989: "Information Management: A Proposal" (Tim Berners-Lee, CERN)
1990: First web browser
1991: WWW poster
1993: Browsers (Mosaic -> Netscape -> Mozilla)
1994: W3C founded
1994: First crawler: the World Wide Web Wanderer
1995-: Search engines such as Excite, Infoseek, AltaVista, Yahoo, ...
1997: HTML 3.2 released (W3C)
1999: HTTP 1.1 released (W3C)
2000: Google, Amazon, eBay, ...
See http://www.w3.org/2004/talks/w3c10-howitallstarted

HTTP: Hypertext Transfer Protocol
Stateless, very simple protocol
Many clients (e.g. browsers, telnet, ...) talk to one server
GET: Request a file (e.g., a web page)
POST: Request a file and transfer a data block
PUT: Send a file to the server (deprecated, see WebDAV)
HEAD: Request file metadata only (e.g. to check freshness)
HTTP 1.1: Send many requests over one TCP connection
Transferring parameters: URL rewriting or POST method
Keeping state: URL rewriting or cookies

Example:
GET /wiki/spezial:search?search=katzen&go=artikel HTTP/1.1
Host: de.wikipedia.org
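As an illustration of the protocol, the slide's example request can be assembled by hand. This is a minimal sketch (the helper name is made up): a request line, the Host header that HTTP/1.1 requires, and the blank line that terminates the header section.

```python
# Sketch: composing a raw HTTP/1.1 GET request as it goes over the wire.
# build_get_request is an illustrative helper, not a library function.

def build_get_request(host: str, path: str) -> str:
    # "Connection: close" opts out of persistent connections
    # for this one-shot request.
    return (
        "GET " + path + " HTTP/1.1\r\n"
        "Host: " + host + "\r\n"
        "Connection: close\r\n"
        "\r\n"  # the empty line ends the header section
    )

request = build_get_request("de.wikipedia.org",
                            "/wiki/spezial:search?search=katzen&go=artikel")
```

Sent over a TCP connection to port 80 of de.wikipedia.org, this text would trigger exactly the response the slide describes.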

HTML: Hypertext Markup Language
Web pages usually are ASCII files with markup
Things change(d): SVG, JavaScript, Web 2.0/AJAX, ...
HTML: strongly influenced by SGML, but much simpler
Focus on layout; no semantic information (if only they had done it differently ...)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>Title of web page</title>
    <!-- more metadata -->
  </head>
  <body>
    Content of web page
  </body>
</html>

Hypertext
Most interesting feature of HTML: links between pages
The concept is old: hypertext
Generally attributed to Bush, V. (1945). "As We May Think". The Atlantic Monthly
Suggests the Memex: a system of storing information linked by pointers in a graph-like structure
Associative browsing
In HTML, links have an anchor:
<a href="05_ir_models.pdf">IR Models</a>: Probabilistic and vector space model
Source: http://www.w3.org

Deep Web
Most of the data on the web is not stored in HTML
Surface web: static web pages = files on a web server
Deep web: accessible only through forms, logins, ...
Most content of databases (many are periodically dumped)
Accessible through CGI scripts, servlets, web services, ...
Crawls only reach the surface web
Plus special contracts for specific information: product catalogues, news, ...
The distinction is not always clear: content management systems create web pages only when they are accessed

It's Huge
Jan 2007: number of hosts estimated at 100-500 million
2005: ca. 11.5 billion indexable web pages (Gulli, Signorini, WWW 2005)
2013: ca. 13 trillion web pages (www.factshunt.com)
Source: http://www.worldwidewebsize.com/

Today: Accesses per Month (as of 2012)
Google: 88 billion per month, i.e. ~3 billion per day; a 12-fold increase over 7 years
Twitter: 19 billion per month
Yahoo: 9.4 billion per month
Bing: 4.1 billion per month
Source: www.searchengineland.com

Search Engines World Wide

Historical Data (world-wide, 2003)

Service     Searches/Day   As Of / Notes
Google      250 million    February 2003 (as reported to me by Google, for queries at both Google sites and its partners)
Overture    167 million    February 2003 (from Piper Jaffray's The Silk Road for Feb. 25, 2003; covers searches at Overture's site and those coming from its partners)
Inktomi     80 million     February 2003 (Inktomi no longer releases public statements on searches but advised me the figure cited is "not inaccurate")
LookSmart   45 million     February 2003 (as reported in interview; includes searches with all LookSmart partners, such as MSN)
FindWhat    33 million     January 2003 (from interview; covers searches at FindWhat and its partners)
Ask Jeeves  20 million     February 2003 (as reported to me by Ask Jeeves, for queries on Ask, Ask UK, Teoma and via partners)
AltaVista   18 million     February 2003 (from Overture press conference on purchase of FAST's web search division)
FAST        12 million     February 2003 (from Overture press conference on purchase of FAST's web search division)

It's not Random
The degree distribution of web pages follows a power law:

  Pr(k) ~ 1 / k^p

where k is the degree of a web page; in the web, p ≈ 2-3
Such graphs are called scale-free
A similar distribution is found in many other graph-type data: social networks, biological networks, electrical wiring, ...
Sources: Flickr; Characteristics of the Web of Spain, 2005; Barabasi et al. 2003

Searching the Web
In some sense, the Web is a single, large corpus
But searching the web is different from traditional IR:
Recall is nothing
Most queries are too short to be discriminative for a corpus of that size
Usual queries generate very many hits: information overload
We never know the whole web: a moving target
Ranking is more important than high precision
Users rarely go to results page 2
Intentional cheating: precision of search badly degraded
Mirrors: the concept of a unique document is not adequate
Much of the content is non-textual
Documents are linked

Content of this Lecture
The Web
Web Crawling
Exploiting Web Structure for IR
A Different Flavor: WebSQL

Searching the Web?
Searching first means gathering, to build an index
Architecture of a search engine:
Crawler: chooses and downloads web pages, writes pages to a repository
Indexer: extracts full text and links, to be put in the crawler queue and to be used for PageRank (later)
Barrels: partial inverted files
Sorter: creates the overall distributed inverted index; updates the lexicon
Searcher: answers queries
Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine, S. Brin, L. Page, Stanford, 1998

Web Crawling
We want to search a constantly changing set of documents
www.archive.org, the Wayback Machine: browse through 85-150 billion web pages archived from 1996 to a few months ago
There is no list of all web pages
Solution:
Start from a given set of seed URLs
Iteratively fetch and scan web pages for outgoing URLs
Put links in a fetch queue, sorted by some magic
Take care not to fetch the same page again and again: relative links, URL rewriting, multiple server names, ...
Repeat forever
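The crawl loop above can be sketched in a few lines. This is a toy version: fetch is a stand-in for real downloading plus link extraction, here backed by a hypothetical in-memory link graph so the sketch is self-contained, and the queue is plain FIFO instead of being "sorted by some magic".

```python
# Minimal sketch of the crawl loop: seed URLs, a fetch queue,
# and a seen-set so no URL is fetched twice.
from collections import deque

def crawl(seeds, fetch, limit=100):
    queue = deque(seeds)
    seen = set(seeds)            # never enqueue the same URL twice
    order = []                   # pages in the order they were crawled
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in fetch(url):  # outgoing links found on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Toy "web": each URL maps to its outgoing links.
web = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
pages = crawl(["a"], lambda u: web.get(u, []))  # -> ["a", "b", "c", "d"]
```

A real crawler replaces fetch with an HTTP client, normalizes URLs before the seen-check, and prioritizes the queue instead of serving it FIFO.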

Architecture of a Web Crawler

Issues
Key trick: parallelize everything
Use multiple DNS servers (and cache resolutions)
Use many, many download threads
Use HTTP 1.1: multiple fetches over one TCP connection
Take care of your bandwidth and of the load on remote servers:
Do not overload a server (DoS attack)
Respect the robot exclusion protocol
Usually, bandwidth and I/O throughput are more severe bottlenecks than CPU consumption

More Issues
Before analyzing a page, check if it is redundant (checksum)
Re-fetching a page is not always bad: pages may have changed
Revisit after a certain period; use the HTTP HEAD command
Individual periods can be adjusted automatically: sites/pages usually have a rather stable update frequency
Crawler traps, Google bombs: pages which are CGI scripts generating an infinite series of different URLs all leading to the same script
Difficult to avoid: overly long URLs, special characters, too many directories, ...
Keep a blacklist of servers
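The redundancy check can be sketched as follows: remember a checksum of each page body and skip pages whose content was already seen under another URL. SHA-1 here stands in for whatever fingerprint a real crawler would use.

```python
# Sketch of content-based duplicate detection via checksums.
import hashlib

seen_checksums = set()

def is_duplicate(page_body: bytes) -> bool:
    # Hash the content, not the URL: mirrors live at different URLs
    # but carry byte-identical bodies.
    digest = hashlib.sha1(page_body).hexdigest()
    if digest in seen_checksums:
        return True
    seen_checksums.add(digest)
    return False
```

A mirror fetched later under a second URL is then recognized by its identical body and not analyzed (or indexed) again.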

Content of this Lecture
The Web
Web Crawlers
Exploiting Web Structure for IR
Prestige in networks
PageRank
HITS
A Different Flavor: WebSQL

Ranking and Prestige
Classical IR ranks documents according to content and query
On the web, many queries generate too many good matches: cancer, daimler, car rental, newspaper, ...
Why not use other features?
Rank documents higher whose author is more famous
Rank documents higher whose publisher is more famous
Rank documents higher that have more references
Rank documents higher that are cited by documents which would themselves be ranked high in searches
Rank documents higher which have a higher prestige
Prestige in social networks: the prestige of a person depends on the prestige of his or her friends

Prestige in a Network
Consider a network of people, where a directed edge (u,v) indicates that person u knows person v
Modeling prestige: a person inherits prestige from all persons who know him/her
Your prestige is high if you are known by many other famous people, not the other way round
Formally: your prestige is the sum of the prestige values of the people that know you
[Figure: example graph with nodes "Lieschen" and "Jagger"]

Computing Prestige
Let E be the adjacency matrix, i.e., E[u,v] = 1 if u knows v
Let p be the vector storing the prestige of all nodes, initialized with some small constants
If we compute p' = E^T * p, then p' is a new prestige vector which considers the prestige of all incoming nodes
[Figure: example graph with 12 nodes and its transposed adjacency matrix E^T]
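One update step p' = E^T * p can be sketched in pure Python on a tiny made-up 3-node network:

```python
# One prestige update p' = E^T * p.  E[u][v] == 1 means u knows v.

def prestige_step(E, p):
    n = len(p)
    # p'[v] = sum over u of E[u][v] * p[u], i.e. (E^T * p)[v]:
    # v collects the prestige of everyone who knows v.
    return [sum(E[u][v] * p[u] for u in range(n)) for v in range(n)]

# 0 knows 1; 1 knows 2; 2 knows 1.  Node 1 has two admirers.
E = [[0, 1, 0],
     [0, 0, 1],
     [0, 1, 0]]
p = [1, 1, 1]              # small constant start values
p = prestige_step(E, p)    # -> [0, 2, 1]
```

Node 0, known by nobody, immediately drops to prestige 0; this is exactly the "social sink" effect discussed on the next slides.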

Iterative Multiplications
Computing p'' = E^T * p' = E^T * E^T * p also considers indirect influences
Computing p''' = E^T * p'' = E^T * E^T * E^T * p also ...
The correct prestige vector fulfills p = E^T * p
Under some circumstances, iteratively multiplying by E^T will make this formula converge
Math comes later

Example
Start with p0 = (1,1,1,...)
Iterate: p_{i+1} = E^T * p_i
p1 = (1,1,1,0,1,3,0,0,0,1,0,5): 6 and 12 are cool
p2 = (3,3,3,0,3,2,0,0,0,5,0,3): to be known by 12 is cool; to be known by 4, 7, 8, ... doesn't help much
Hmm: we punish social sinks (nodes who are not known by anybody) quite hard
[Figure: transposed adjacency matrix E^T of the example graph]

Example 2
Other graph: every node has at least one incoming link
Start with p0 = (1,1,1,...)
Iterate:
p1 = (1,1,1,0,1,3,0,0,0,1,0,5)
p2 = (3,3,3,0,3,2,1,0,0,5,1,3)
p3 = (2,3,2,1,2,8,3,...)
As such, the numbers grow to infinity; this must be repaired
[Figure: example graph with 12 nodes and its transposed adjacency matrix E^T]

Prestige in Hypertext IR (= Web Search)
PageRank uses the number of incoming links
Scores are query-independent and can be pre-computed
Page, L., Brin, S., Motwani, R. and Winograd, T. (1998). "The PageRank Citation Ranking: Bringing Order to the Web". Unpublished manuscript, Stanford University.
HITS distinguishes authorities and hubs with respect to a query
Thus, scores cannot be pre-computed
Kleinberg, J. M. (1998). "Authoritative Sources in a Hyperlinked Environment". ACM-SIAM Symposium on Discrete Algorithms.
Many more suggestions:
The Bharat/Henzinger model ranks down connected pages which are very dissimilar to the query
Clever weights links with respect to the local neighborhood of the link in a page (anchor + context)
ObjectRank and PopRank rank objects (on pages), including different types of relationships

Content of this Lecture
Searching the Web
Search engines on the Web
Exploiting the web structure
Prestige in networks
PageRank
HITS
A different flavor: WebSQL

PageRank Algorithm
Major breakthrough: the ranking of Google was much better than that of other search engines
Before: ranking only with page content and length of the URL (the longer, the more specialized)
The current ranking of Google is a combination of a prestige value and a classical IR score
Computing PageRank for billions of pages requires more tricks than we present here, especially approximation

Random Surfer Model
Another view on prestige
Assume a random surfer S taking all decisions by chance
S starts from a random page, picks and clicks a link on that page at random, and repeats this process forever
At any point in time: what is the probability p(v) of S being on page v?
After arbitrarily many clicks? Starting from an arbitrary web page?

Random Surfer Model: Math
After one click, S is on page v with probability

  p1(v) = Σ_{u: (u,v) ∈ E} E'[u,v] * p0(u)

with |u| = number of links outgoing from u and E'[u,v] = E[u,v] / |u|
The probability of being on v after one step depends on the probability of starting on a page u with a link to v, and the probability of following this link
For all v at once: p1 = E'^T * p0
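A minimal sketch of this math on a small hand-made graph: build the normalized matrix E' (each row of E divided by the page's out-degree, so each row sums to 1 and each column of E'^T sums to 1) and apply one surfer step.

```python
# E'[u][v] = E[u][v] / |u|, then one step p1 = E'^T * p0.

def transition_matrix(E):
    n = len(E)
    out = [sum(row) for row in E]          # out-degree |u| of each page
    return [[E[u][v] / out[u] if out[u] else 0.0
             for v in range(n)] for u in range(n)]

def surfer_step(E1, p):
    n = len(p)
    # p1[v] = sum over u of E'[u][v] * p0[u]
    return [sum(E1[u][v] * p[u] for u in range(n)) for v in range(n)]

E = [[0, 1, 1],   # page 0 links to 1 and 2
     [0, 0, 1],   # page 1 links to 2
     [1, 0, 0]]   # page 2 links to 0
p1 = surfer_step(transition_matrix(E), [1/3, 1/3, 1/3])
```

Because E' is (row-)stochastic, p1 is again a probability distribution: its entries sum to 1.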

Eigenvectors and PageRank
Iteration: p_{i+1} = E^T * p_i
We search the fixpoint: p = E^T * p
Recall: if Mx - λx = 0 for x ≠ 0, then λ is called an eigenvalue of M and x its associated eigenvector
Transformation yields λx = Mx
We are almost there: eigenvectors for eigenvalue λ = 1 solve our problem
But these do not always exist

Perron-Frobenius Theorem
When do eigenvectors for λ = 1 exist?
Let M be a stochastic, quadratic, irreducible, aperiodic matrix:
Quadratic: m = n
Stochastic: M[i,j] >= 0, all column sums are 1
Irreducible: if we interpret M as a graph G, then every node in G can be reached from every other node in G
Aperiodic: there exists an n in N such that for every u,v there is a path of length n between u and v
For such M, the largest eigenvalue is λ = 1
Its corresponding eigenvector x satisfies x = Mx
It can be computed using our iterative approach: the power iteration method

Real Links versus Mathematical Assumptions
1. The sum of the weights in each column equals 1
Not yet achieved: web pages may have no outgoing edge (rank sinks)
2. The matrix E is irreducible
Not yet achieved: the web graph is not at all strongly connected
For instance, no path between 3 and 4
[Figure: example graph with 12 nodes]

Repair
Simple repair: we give every possible link a fixed, very small probability, so there are no more zeros in E'
If E'[u,v] = 0, set E'[u,v] = 1/n, with n ~ total number of pages
This also makes the matrix aperiodic (with path length n = 1)
Normalize such that the sum in each column is 1
Intuitive explanation: random restarts
We allow our surfer S, at each step, to jump with a small probability to any other page instead of following a link
The jump probability is the higher, the fewer outgoing links a page has
Now, every page is reachable from every page
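Putting the pieces together, here is a sketch of PageRank in the usual damping formulation of this repair: with probability d the surfer follows a link, with probability 1-d it restarts at a random page; the mass of sinks is spread over all pages. The three-page graph is a made-up example.

```python
# Sketch of PageRank with random restarts (damping factor d).
# links[u] lists the pages that page u links to.

def pagerank(links, d=0.85, iters=50):
    n = len(links)
    p = [1.0 / n] * n                     # start with the uniform vector
    for _ in range(iters):
        new = [(1.0 - d) / n] * n         # random-restart share for everyone
        for u, outs in enumerate(links):
            if outs:                      # distribute u's mass over its links
                share = d * p[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                         # rank sink: spread mass everywhere
                for v in range(n):
                    new[v] += d * p[u] / n
        p = new
    return p

# 0 -> 1, 2;  1 -> 2;  2 -> 0.  Page 2 collects links from both others.
ranks = pagerank([[1, 2], [2], [0]])
```

Since each iteration redistributes exactly the probability mass it receives, the result stays a probability distribution; page 2 ends up ranked highest, narrowly ahead of page 0 (which receives all of page 2's mass).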

PageRank
We are done!
Iteration is performed until changes become small
We stop before the fixpoint is reached: faster, at the cost of accuracy
The original paper reports that ~50 iterations sufficed for a crawl of 300 million links
p(v): the probability of S being on v, or the prestige of v

Example 1 [Nuer07]
C is very popular
To be known by C (like A) brings more prestige than to be known by A (like B)

Example 2
Average PageRank dropped: sinks consume PageRank mass

Example 3
Repair: every node reachable from every node
Average PageRank again at 1

Example 4
Symmetric link relationships bear identical ranks

Example 5
Home page outperforms children
External links add strong weights

Example 6
Link spamming increases weights (A, B)

Content of this Lecture
Searching the Web
Search engines on the Web
Exploiting the web structure
Prestige in networks
PageRank
HITS
A different flavor: WebSQL

HITS: Hyperlink Induced Topic Search
Two main ideas:
Classify web pages into authorities and hubs
Use a query-dependent subset of the web for ranking
Approach: given a query q
Compute the root set R: all pages matching q (conventional IR)
Expand R by all pages which are connected to any page in R by an outgoing or an incoming link
The heuristic could as well use 2, 3, ... steps
Remove from R all links between pages on the same host: tries to prevent nepotistic and purely navigational links
In the end, we rank sites rather than pages
Assign to each page an authority score and a hub score
Rank pages using a weighted combination of both scores

Hubs and Authorities
Authorities: web pages that contain high-quality, definitive information
Many other web pages link to authorities (break-through articles)
Hubs: pages that try to cover an entire domain
Hubs link to many other pages (survey articles)
Assumption: hubs preferentially link to authorities (to cover the new stuff), and authorities preferentially link to hubs (to explain the old stuff)

But ...
Surveys are the most-cited papers
Most hubs are also authorities
This may be the reason that Google gets away without hubs

Computation
A slightly more complicated model:
Let a be the vector of authority scores of all pages
Let h be the vector of hub scores of all pages
Define:

  a = E^T * h
  h = E * a

The solution can be computed in a similar iterative process as for PageRank
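The iteration can be sketched as follows. Scores are normalized after each round because, as with raw prestige, the values would otherwise grow without bound; the four-page graph is a made-up example where two pure hubs point at two pure authorities.

```python
# Sketch of the HITS iteration: a = E^T * h, h = E * a, then normalize.

def hits(E, iters=50):
    n = len(E)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        # authority score: sum of hub scores of pages linking here (E^T * h)
        a = [sum(E[u][v] * h[u] for u in range(n)) for v in range(n)]
        # hub score: sum of authority scores of pages linked to (E * a)
        h = [sum(E[u][v] * a[v] for v in range(n)) for u in range(n)]
        na = sum(x * x for x in a) ** 0.5 or 1.0   # L2-normalize both vectors
        nh = sum(x * x for x in h) ** 0.5 or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h

# Pages 0 and 1 both link to 2 and 3: 0,1 become hubs, 2,3 authorities.
E = [[0, 0, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(E)
```

Note that, unlike PageRank, this would be run on the query-dependent set R, so it happens at query time.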

Pros and Cons
Contra:
Distinguishing hubs from authorities is somewhat arbitrary and not necessarily a good model for the Web (today)
How should we weight the two scores?
HITS scores cannot be pre-computed; the set R and the status of pages change from query to query
Pro:
The HITS score embodies IR match scores and links, while PageRank requires a separate IR module and has no principled way to combine the scores

Content of this Lecture
Searching the Web
Search engines on the Web
Exploiting the web structure
A different flavor of Web search: WebSQL

Side Note: Web Query Languages
Deficits of search engines:
No way of specifying structural properties of results
All web pages linking to X (my homepage)
All web pages reachable from X in at most k steps
No way of extracting specific parts of a web page
No SELECT title FROM webpage WHERE ...
Idea: structured queries over the web
Model the web as two relations: node (page) and edge (link)
Allow SQL-like queries on these relations; evaluation is done on the web
Various research prototypes: WebLog, WebSQL, Araneus, W3QL, ...

WebSQL
Mendelzon, A. O., Mihaila, G. A. and Milo, T. (1997). "Querying the World Wide Web". Journal on Digital Libraries, 1, 54-67.
Simple model: the web in two relations
page(url, title, text, type, length, modification_date, ...)
anchor(url, href, anchortext)
Could be combined with DOM (XPath) for fine-grained access
Operations:
Projection: post-processing of search results
Selections: pushed to the search engine where possible
Following links: call a crawler (or look up a local crawl)

Example
Find all web pages which contain the word "JAVA" and have an outgoing link in whose anchor text the word "applet" appears; report the target and the anchor text

SELECT y.anchortext, y.href
FROM   Document x SUCH THAT x MENTIONS "JAVA",
       Anchor y SUCH THAT y.url = x.url
WHERE  y.anchortext CONTAINS "applet";

Can be evaluated using a search engine, plus local processing of pages

More Examples

SELECT d.url, d.title
FROM   Document d SUCH THAT "$HOME" -> d OR "$HOME" -> -> d
WHERE  d.title CONTAINS "Database";

Report url and title of pages containing "Database" in the title that are reachable from $HOME in one or two steps (in WebSQL, "->" denotes following a local link, "=>" a non-local link)

SELECT d.title
FROM   Document d SUCH THAT "$HOME" (->)* (=>)* d;

Find the titles of all web pages that are reachable (by first local, then non-local links) from $HOME (calls a crawler)

Self Assessment
How does a web crawler work? What are important bottlenecks?
Name some properties of the IR problem on the web
What is the complexity of PageRank?
For which matrices does the power iteration method converge to the eigenvector for eigenvalue 1? Explain each property
What is the difference between HITS and PageRank?
What are other models of importance in graphs?
Could WebSQL be computed on a local copy of the web? What subsystems would be necessary?