Searching the Web [Arasu 01]

Size: px

Start display at page:

Download "Searching the Web [Arasu 01]"

Clinton Holland
6 years ago
Views:

1 Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web pages containing the keywords E.G.M. Petrakis Searching the Web 1

2 Size of the Web Several studies estimated the size of the web over a billion pages assuming average size of web page 5-10 KB Size of web is tens of terabytes the size of the web has doubled in less than two years this growth rate is expected to continue E.G.M. Petrakis Searching the Web 2

3 Structure of the Web The link structure of the web is like a bow tie Broder et.al connected core: 28% IN loop: 22% pages which can reach the core OUT loop: 22% pages that can be reached from the core remaining nodes are isolated E.G.M. Petrakis Searching the Web 3

4 IR on the Web IR: best suited for small coherent collections digital libraries, books, journal articles, newspapers Web: massive, less coherent, distributed, continuously updating information, not all pages are important IR alone is not sufficient exploit linkage among web pages E.G.M. Petrakis Searching the Web 4

5 Web Search Engine Arasu 01 E.G.M. Petrakis Searching the Web 5

6 Web Crawler Small programs that retrieve pages from the web for later analysis by the indexing module follow links between pages pass URLs to crawler control module to determine which pages to visit next pass retrieved pages into page repository continue visiting pages until local resources are exhausted E.G.M. Petrakis Searching the Web 6

7 Algorithm Starts with an initial set of URLs Repeat until it decides to stop place URLs in a queue prioritize URLs in the queue get next URL and downloads the page Larbin: a crawler for your PC! E.G.M. Petrakis Searching the Web 7

8 Arising Issues What pages to download cannot download all pages visit important pages by prioritizing URLs in queue Refresh pages revisit downloaded pages and detect changes some pages change more frequently than others Minimize load on visited sites consumes resources belonging to others Parallelize processing crawlers run on multiple machines download pages in parallel E.G.M. Petrakis Searching the Web 8

9 Page Refreshing Maintain crawled pages up-to-date uniform refresh policy: revisit all pages at frequency f proportional refresh policy: revisit pages that change more often Difficult to keep track of pages that change often Do not visit such pages very frequently Uniform refresh policy is better E.G.M. Petrakis Searching the Web 9

10 Page Selection Download important pages first Importance measures can be defined in many ways interest driven: textual similarity popularity driven: back-link count of page location driven: endings (.com), fewer slashes... combination of the above or defined based on link analysis (e.g., PageRank) E.G.M. Petrakis Searching the Web 10

11 Crawler Models [Aggrawal 01] Crawl and stop: stop after visiting K pages only M pages have rank less than r (num. of pages that an ideal crawler would produce) performance: M/K Crawl and stop with threshold: visit K pages but also apply an acceptance criterion qualifying or hot pages H performance: H/K E.G.M. Petrakis Searching the Web 11

12 Performance of Crawler Models Arasu 01 E.G.M. Petrakis Searching the Web 12

13 Storage Large database, similar to data base systems but, different functionality less transaction management, logging... The focus is on scalability: large distributed data repository dual access modes: random (specific page), streaming access (set of pages) updates: high rate of modifications (delete or update obsolete pages) E.G.M. Petrakis Searching the Web 13

14 Indexing Indexer: extracts URLs and terms from pages, creates a lookup table of URLs Text index: inverted file, points to pages where a given term occurs Structure index based on links between pages, e.g., backward, forward links per page Utility index created by the collection analysis module, more specialized indexes based on importance, image content etc. E.G.M. Petrakis Searching the Web 14

15 Indexing Index partitioning: distributed index Local inverted file: nodes storing disjoint subsets of pages Each query addresses a separate index Global inverted file: each node indexes part of the terms e.g., [e-f],[f-h] Each query addresses multiple indices in parallel E.G.M. Petrakis Searching the Web 15

16 Web Indexing Architecture Arasu 01 break query or page in multiple runs E.G.M. Petrakis Searching the Web 16

17 Ranking Search Results Traditional IR is not very effective in ranking query results web is huge great variation of content pages that contain the query terms may be irrelevant or unimportant small queries (1-2 terms) return huge answer sets Solution: link analysis E.G.M. Petrakis Searching the Web 17

18 Link Analysis The link structure of the web is important for filtering irrelevant or unimportant pages Main idea: if p q then p, q are related p recommends q Two main approaches: HITS & PageRank for each page compute a value that captures the notion of importance of the page E.G.M. Petrakis Searching the Web 18

19 PageRank [Page 98] Measure of page importance global ranking scheme, computes a single rank to every page Google s page is more important than my page the difference is reflected in the number and quality of pages pointing to each page a page is important if other important pages are pointing to it recursive definition E.G.M. Petrakis Searching the Web 19

20 Simple PageRank Let the pages of the web be 1,2,...m N j : number of forward links B j : number of backward links r i : PageRank of page i The pages that point to i distribute their rank to the pages they point to j B i =3 N j =2 r i j i B i r N j j N i =2 E.G.M. Petrakis Searching the Web 20

21 Example of PageRank r 2 = which is distributed to r 1, r 3 r 3 = 0.286/2 since it has no other incoming links r 1 = the rank it receives from r 5, r 2 and r 3 Σr i = 1 Arasu 01 E.G.M. Petrakis Searching the Web 21

22 Computing PageRank The PageRank of page i is the eigenvector of A T corresponding to eigenvalue 1 a r i r ij j [ r r... 2 r 1 B where 1/ N 0 i i r N m j j ] if r A T r i points to j, i points to otherwise N i pages E.G.M. Petrakis Searching the Web 22

23 Computing PageRank (cont,d) Power iteration method for computing the principal eigenvector of a matrix An arbitrary initial vector is multiplied repeatedly with A T until it converges to the principal eigenvector 1. s = random vector 2. r = A T s 3. if ( r s < e) stop : r is the PageRank vector else {s = r; goto step 2} E.G.M. Petrakis Searching the Web 23

24 Random Surfer Model Definition of PageRank leads itself to an interpretation based on random walks [Motwani,Raghavan 1995] A person who surfs the web by randomly clicking links does a random walk on the web The PageRank of a page is proportional to the frequency with which a random surfer would visit it E.G.M. Petrakis Searching the Web 24

25 Practical PageRank Well defined for strongly connected graphs sink: connected pages with no links pointing outwards nodes not in the sink receive 0 PageRank leak: a page with no outlinks e.g., 5 becomes a leak if 5 4 is removed sinks and leaks cause rank 0 for nodes not in sinks or leaks sink E.G.M. Petrakis Searching the Web 25

26 Leaks and Sinks A random surfer would eventually get stuck in nodes 4 and 5 nodes 1, 2, and 3 would have rank 0 and nodes 4 and 5 would have rank 0.5 By reaching a rank leak he is lost forever a leak causes all the ranks to eventually converge to 0 Our random surfer would eventually reach E.G.M. Petrakis node 4 and will Searching never the be Web seen again! 26

27 Computing PageRank Leaks: Eliminate all nodes with no out degree or assume they have links to the pages that point to them Sinks: Assume arbitrary links to the outside connected graph Compute page rank using Random Surfer model [Page, 1998] Whenever a surfer gets trapped in a sink will jump out to a random page after a number of loops E.G.M. Petrakis Searching the Web 27

28 Practical PageRank Assume that only a fraction d (0 < d < 1) of the rank of a page is distributed among the pages it points to remaining rank is distributed equally among all the m pages on the web r i d j N j r N parameter d dictates how often the surfer gets bored (random surfer model) E.G.M. Petrakis Searching the Web 28 j j 1- d m

29 Google [Brin 98] First prototype at Stanford University IR score is combined with PageRank to determine the rank of an answer given a query, Google computes the IR score (e.g., vector model) of pages containing the query terms Google also uses more text-based techniques to enhance results (e.g., anchor text on links) E.G.M. Petrakis Searching the Web 29

30 Comparison Nodes 4 and 5 now have higher ranks than the other nodes, indicating that surfers will tend to gravitate to 4 and 5. However, the other nodes have nonzero ranks E.G.M. Petrakis Searching the Web 30

31 HITS [Kleinberg 98] HITS is a query dependent ranking technique authority and hub score to each page Authorities: pages which are likely to be relevant to the query Hubs: pages pointing to authorities some hubs may be authorities too used to compute the authorities Good authorities are pointed to by many hubs Good hubs point to many authorities E.G.M. Petrakis Searching the Web 31

32 Hubs The hub pages are not necessarily authorities themselves but point to several authority pages There are two reasons why one might be interested in hubs pages They are used to compute the authority pages They can also be returned to the user in response to a query E.G.M. Petrakis Searching the Web 32

33 Focused Subgraph F HITS does link analysis on F: 1) compute set R of pages relevant to the query (containing the query terms) 2) initially F = R (accept maximum t pages) 3) for each page in p in R include in F all pages that p points to include pages pointing to p (maximum d pages) t, d are parameters, defined by designer E.G.M. Petrakis Searching the Web 33

34 hubs authorities E.G.M. Petrakis Searching the Web 34

35 HITS Link Analysis F = {1, 2... N} pages F i number of pages that i points to B i number of pages pointing to i A adjacency matrix: a ij =1 if i points to j, a ij =0 otherwise B i =3 i F i =2 E.G.M. Petrakis Searching the Web 35

36 Computing HITS For each i compute authority score a i : sum of hub scores of pages pointing to it hub score h i: sum of authority scores of all pages it points to h, a : principal eigenvectors of AA T, A T A a h i j B i h a j a h T i j j Fi E.G.M. Petrakis Searching the Web 36 A T Aa h a h A AA T Aa h

37 Bibliographic Matrix The two matrices A T A and AA T are well known in the field of bibliometrics: A T A is the co-citation matrix of the collection [A T A] i,j : number of pages which jointly point at (cite) pages i and j AA T is the bibliographic coupling matrix of the collection [AA T ] i,j : number of pages jointly referred to (pointed at) by pages i and j E.G.M. Petrakis Searching the Web 37

38 Computing HITS Power iteration (similarly to PageRank) Initialize a i, h i to arbitrary values Repeat until convergence 1. a i = Σ j in Bi h j 2. h i = Σ j in Fi a j 3. At each step normalize each α i, h i by {Σ i a i2 } 1/2, {Σ i h i2 } 1/2 E.G.M. Petrakis Searching the Web 38

39 Example E.G.M. Petrakis Searching the Web 39

40 Multimedia and Link Analysis PageRank and HITS have been investigated in conjunction with text retrieval techniques do the same for images and video PicASHOW extends HITS for images Diogenis is a web system for person images ImageRover, PicToSeek, WebSeek: web systems for images E.G.M. Petrakis Searching the Web 40

41 Data Mining on Web Link analysis in conjunction with text, image data to perform data mining on the web (e.g., Chackrabarti 99) identify communities on the web (e.g., Gibson 98) Define and produce ontologies similar to Yahoo s topic taxonomies find related pages E.G.M. Petrakis Searching the Web 41

42 Image Retrieval on the Web using Link Analysis Queries: keywords Images are described by text surrounding them in Web pages image filename, Alternate text, Page title, Caption Boolean, VSM.. to retrieval Web pages PageRank, HITs to assign higher ranking to images in important Web pages Important Web page does not mean relevant images! 12/7/ rch 42

43 Image Importance Answers: images in Web pages Relevant but not always important From corporate web sites, organizations From individuals and small companies Link analysis: assign higher ranking to answers from important web sites 12/7/ rch 43

44 PicASHOW [Lempel 2002] PageRank and HITS for retrieval using keywords PicASHOW for Web pages with images Main idea: co-cited and co-contained images are likely to be related W: page to page relationships M: page to image relationships E.G.M. Petrakis Searching the Web 44

45 Example 12/7/ rch 45

46 Image Link Analysis Queries are matched against text descriptions Create the focused sub-graph F of the initial answer set Authorities: principal eigenvector of [(W+I)M T ](W+I)M W: page to page relationships in F M: page to image relationships in F Rank answers by authority value 12/7/ rch 46

47 Further Reading A. Arasu et.al. Searching the Web, ACM Trans. on Internet Technology, Vol. 1, No. 1, Aug. 2001, pp M.R.Henzinger, Hyperlink Analysis on the Web, IEEE Internet Computing, Jan/Feb 2001, pp J.Cho, H.-G.Molina, L. Page, Efficient Crawling Through URL Ordering, Proc. 7 th Int. WWW Conf S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pp , 1998 M. Kobayashi, K. Takeda, Information Retrieval on the Web, ACM Comp. Surveys, Vol. 32, No. 2, pp , 2000 C. Aggarwal, F. Al-Garawi, P.S.Yu, Intelligent Crawling on the World Wide Web with Arbitrary Predicates, Proc. of 10 th, Int. WWW Conf J. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46, E.G.M. Petrakis Searching the Web 47

48 L. Page and S. Brin and R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 1998 J.R. Smith, S.-F. Chang, Searching for Images and Videos on the WWW, TR , Center for Image Technology for New Media, Columbia Univ R. Lempel, A. Soffer, PicASHOW: Pictorial Authority Search by Hyperlinks on the Web, ACM Trans. on Info. Systems, Vol. 20, No Y.A. Aslandogan, C.T. Yu, Evaluating Strategies and Systems for Content-Based Indexing of Person Images on the Web, ACM Multimedia, pp , 2000 T. Gevers, A.W.M. Smeulders, The PicToSeek WWW Image Search Engine, IEEE Int. Conf. Mult.Comp. and Syst. pp , 1999 L. Taycher, M.La Cascia, S. Sclaroff, Image Digestion and Relevance Feedback in the ImageRover WWW Search Engine, Proc. Visual 97, 1997 E.G.M. Petrakis Searching the Web 48

49 S. Chackrabarti et.al. Mining the Web s Link Structure, IEEE Computer, Vol. 32, No. 88, 1999, pp S. Gibson, J. Kleinberg, P. Raghavan, Inferring Web Communities from Link Topology, Proc. 9th ACM Conf. on Hypertext and Hypermedia, 1998, pp E.G.M. Petrakis Searching the Web 49

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse