Definition. Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.


Abigail Carter
5 years ago

1 Web Crawlers

2 Definition: Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

3 What is the Web? (another view)
- pages containing (fairly unstructured) text
- images, audio, etc. embedded in pages
- structure defined using HTML (Hypertext Markup Language)
- hyperlinks between pages!
- over 2.9 billion pages
- over 16 billion hyperlinks
- a giant graph!

4 How is the Web organized?
- pages reside in servers
- related pages in sites
- local versus global links
- logical vs. physical structure
[Figure: three Web Servers (Hosts)]

5 How the Web Works: Fetching
- Desktop (with browser) to Web Server: "give me the file /world/index.html"
- Web Server to Desktop: "here is the file: ..."

6 How do we find pages on the Web?
- more than 2.9 billion pages
- more than 16 billion hyperlinks
- plus images, movies, ..., database content
- we need specialized tools for finding pages and information

7 Overview of web search tools
- Major search engines (google, alltheweb, altavista, northernlight, hotbot, excite, go)
- Web directories (yahoo, open directory project)
- Specialized search engines (cora, csindex, achoo, findlaw)
- Local search engines (for one site)
- Meta search engines (beaucoup, allsearchengines, about)
- Personal search assistants (alexa, zapper)
- Comparison shopping agents (mysimon, dealtime, price)
- Image search (ditto, visoo)
- Natural language questions (askjeeves?, northernlight?)
- Database search (completeplanet, direct, invisibleweb)

8 Major search engines

9 Basic structure of a search engine
[Figure: Crawler, disks, indexing, Index; the query "computer" is sent to Search.com, which looks it up in the Index]

10 Ranking
- return best pages first
- term- vs. link-based approaches

11 Example #1: Link-based ranking techniques
- PageRank (Brin & Page/Google): significance of a page depends on significance of those referencing it
- HITS (Kleinberg/IBM): Hubs and Authorities
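The idea that a page's significance depends on the significance of the pages referencing it can be sketched as a simple power iteration. This is only an illustrative sketch, not the production algorithm: the toy link graph, the damping factor of 0.85, the iteration count, and the handling of pages without outgoing links are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively score pages; `links` maps a page to the pages it links to.

    The damping factor and iteration count are assumed values.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its score on, split evenly over its out-links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page with no out-links spreads its score evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and D all reference C; C references A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C is referenced by three pages and ends up with the highest score, and A outranks B and D because its only in-link comes from the highly scored C.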

12 Challenges for search engines:
- coverage (need to cover large part of the web): need to crawl and store massive data sets
- good ranking (in the case of broad queries): smart information retrieval techniques
- freshness (need to update content): frequent recrawling of content
- user load (up to 3000 queries/sec - Google): many queries on massive data
- manipulation (sites want to be listed first): naïve techniques will be exploited quickly

13 Web directories:

14 Topic hierarchy:
everything
- sports: baseball, hockey, soccer, ...
- politics: foreign, domestic
- business
- health
Challenges:
- designing topic hierarchy
- automatic classification: what is this page about?
- Yahoo and Open Directory mostly human-based

15 Specialized search engines:
- be the best on one particular topic
- use domain-specific knowledge
- limited resources: do not crawl the entire web!
- focused crawling techniques
Meta search engines:
- use other search engines to answer questions
- ask the right specialized search engine
- combine results from several large engines
- need to be familiar with thousands of engines

16 Personal Search Assistants: (alexa, zapper)
- embedded into browser
- can suggest related pages
- search by highlighting text
- can use context
- may exploit individual browsing behavior
- may collect and aggregate browsing information
- privacy issues
- crawl the web (alexa), or use existing search engines (zapper)

17 Web Search Information System

18 Web Search Information System
[Figure labels: User Interface (Query and Feedback); Knowledge Base; Crawling; Learning; Ad Hoc Information; Query Processing; Inference Engine; Indexing; Search Engine; Learning; Document Repository; Large Text (Multimedia) Database Tech.; Data (Text) Mining Tech.]

19 Perspective
- information systems
- User Interface
- information retrieval
- AI
- algorithms
- machine learning
- data mining
- databases

20 Search Engine Architecture
[Figure: Crawler, disks, indexing, Index; the query "computer" is sent to Search.com, which looks it up in the Index]

21 Web Crawlers

22 Crawler
[Figure: Crawler, disks]
- starts at set of seed pages
- fetches pages from the web
- parses fetched pages for hyperlinks
- then follows those links (e.g., BFS)
- variations:
  - random walks
  - focused crawling

23 Typical Crawler Architecture
[Figure labels: Internet; Seed List; Crawler; URL DB; Pagefiles; Discovery; Grab; Alias DB; Index Build; Filtered Pagefiles; Anchor Text DB; Connectivity DB; Index; Duplicates DB]

24 Web Crawler
[Figure labels: World Wide Web; Retrieving Module; Processing Module; Formatting Module; URL Listing Module; Database]
- The order of traversing:
  - Breadth-first
  - Depth-first
  - Better pages first
- How frequently the index is updated
Source: Mining the World Wide Web (pages )

25 What is a Crawler?
[Flowchart labels: init; initial urls; get next url; to visit urls; get page; web; extract urls; visited urls; web pages]

26 Simple Crawler Algorithm
Simple-Crawler (S0, D, E)
  Q <- S0
  while Q is not empty
    do u <- DEQUEUE(Q)
       d(u) <- FETCH(u)
       STORE(D, (d(u), u))
       L <- PARSE(d(u))
       for each v in L
         do STORE(E, (u, v))
            if v not in D and v not in Q
              then ENQUEUE(Q, v)
S0 holds the seed URL(s). L is the set of children URLs of u. Q is the queue of URLs to visit. D is the store of visited URLs. E receives each hyperlink (u, v).
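The Simple-Crawler pseudocode can be sketched in Python as follows. To keep the example runnable without network access, FETCH and PARSE are replaced by stand-ins over a small, hypothetical in-memory web (`TOY_WEB`); the URLs and helper names are assumptions for illustration only.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of outgoing links.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
}


def fetch(url):
    """Stand-in for FETCH: return the page (here simply its link list)."""
    return TOY_WEB.get(url, [])


def parse(page):
    """Stand-in for PARSE: extract the child URLs from a fetched page."""
    return list(page)


def simple_crawler(seeds):
    queue = deque(seeds)   # Q: URLs still to visit
    queued = set(seeds)    # mirrors Q for fast membership tests
    pages = {}             # D: visited URL -> stored page
    edges = []             # E: (u, v) hyperlink pairs
    while queue:
        u = queue.popleft()                  # DEQUEUE(Q)
        queued.discard(u)
        page = fetch(u)
        pages[u] = page                      # STORE(D, (d(u), u))
        for v in parse(page):
            edges.append((u, v))             # STORE(E, (u, v))
            if v not in pages and v not in queued:
                queue.append(v)              # ENQUEUE(Q, v)
                queued.add(v)
    return pages, edges


pages, edges = simple_crawler(["a.html"])
print(list(pages))
print(edges)
```

The extra `queued` set is a design choice of this sketch: testing membership in a set is cheaper than scanning the queue each time a link is found.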

27 How Web Search Engines Work: Indexing
- Place seed URLs into a priority queue
- Repeatedly:
  - Select next URL from queue
  - Fetch page
  - Characterize page
  - Store characterization in index
  - Extract links from page
  - Assign priority to each link
  - Add links to queue
[Figure: Queue: edu/ http://familysearch.com/ http://www.semmel.com/]
[Figure: The Web: "Ralph's Web Page: My favorite color is lavender! I collect Beanie Babies! See pictures of my moss garden!"]
[Figure: characterization of doc42: baby beanie collect color]
[Figure: Index: avocado: doc3 doc177; baby: doc3 doc42 doc117; beanie: doc42 doc77 doc]

28 How Web Search Engines Work: Retrieval
- Retrieve query from user
- Characterize query
- Use index to find documents that contain query terms
- Measure similarity between query and each potentially relevant document
- Sort documents by similarity score
- Return documents with highest scores to user
[Figure: query "lavender Beanie Babies", characterized as: baby beanie lavender]
[Figure: Index: avocado: doc3 doc177; baby: doc3 doc42 doc117; beanie: doc42 doc77 doc doc42 doc doc117 doc doc doc3 doc doc193...]
[Figure: Search Results: 1. Ralph's Web Page 2. Ty Homepage 3. Toys R Expensive 4. Caps for Freshmen 5. Bohnanza 6. Ralph's Lavender Page Not Found 8. Hot Men in Tight Shorts]
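The retrieval steps can be sketched on a toy collection. The slide does not say which similarity measure is used, so this sketch assumes the simplest possible one: the sum of the term frequencies of the query terms in each document. The documents, function names, and scoring rule are all hypothetical.

```python
from collections import Counter


def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def search(index, query, top_k=3):
    """Score documents by summed term frequency (an assumed, naive measure)."""
    scores = Counter()
    for term in query.lower().split():           # characterize the query
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf                 # accumulate similarity
    # Sort by descending score; break ties by document ID.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical documents, loosely inspired by the slide's example page.
docs = {
    "doc1": "lavender beanie babies and more beanie babies",
    "doc2": "lavender garden",
    "doc3": "avocado recipes",
}
index = build_index(docs)
results = search(index, "lavender beanie babies")
print(results)
```

Documents that contain none of the query terms never enter `scores`, which mirrors the step of using the index to find only potentially relevant documents.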

29 Crawling Issues
- How to crawl?
  - Quality: Best pages first
  - Efficiency: Avoid duplication (or near duplication)
- How much to crawl? How much to index?
  - Coverage: How big is the Web? How much do we cover?
  - Relative Coverage: How much do competitors have?
- How often to crawl?
  - Freshness: How much has changed? How much has really changed?
- Visit order and the hidden web

30 Visit Order
- Breadth-first: FIFO queue
- Depth-first: LIFO queue
- Best-first: Priority queue
- Random
- Refresh rate
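The first three visit orders differ only in the data structure that holds the URLs waiting to be visited. A minimal sketch, assuming a hypothetical `Frontier` class and made-up URL scores:

```python
import heapq
from collections import deque


class Frontier:
    """Holds URLs waiting to be crawled; `order` selects the visit order."""

    def __init__(self, order):
        if order not in ("bfs", "dfs", "best"):
            raise ValueError("order must be 'bfs', 'dfs' or 'best'")
        self.order = order
        self.items = [] if order == "best" else deque()
        self.counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, score=0.0):
        if self.order == "best":
            # heapq is a min-heap, so negate the score to pop the best first.
            heapq.heappush(self.items, (-score, self.counter, url))
            self.counter += 1
        else:
            self.items.append(url)

    def pop(self):
        if self.order == "bfs":
            return self.items.popleft()       # FIFO queue
        if self.order == "dfs":
            return self.items.pop()           # LIFO queue (stack)
        return heapq.heappop(self.items)[2]   # priority queue

    def __len__(self):
        return len(self.items)


def drain(order):
    """Push three hypothetical URLs with scores and return the pop order."""
    frontier = Frontier(order)
    for url, score in [("a", 0.2), ("b", 0.9), ("c", 0.5)]:
        frontier.push(url, score)
    return [frontier.pop() for _ in range(len(frontier))]


print(drain("bfs"), drain("dfs"), drain("best"))
```

Swapping the frontier leaves the rest of a crawler loop unchanged, which is why the three strategies are usually presented together.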

31 Breadth First Crawlers

32 Breadth First Crawlers
- Use breadth-first search (BFS) algorithm
- Get all links from the starting page, and add them to a queue
- Pick the 1st link from the queue, get all links on the page and add to the queue
- Repeat above step till queue is empty

33 Simple Breadth-First Search Crawler
insert set of initial URLs into a queue Q
while Q is not empty
  currentURL = dequeue(Q)
  download page from currentURL
  for any hyperlink found in the page
    if hyperlink is to a new page
      enqueue hyperlink URL into Q
This will eventually download all pages reachable from the start set.

34 Depth First Crawlers

35 Depth First Crawlers
- Use depth-first search (DFS) algorithm
- Get the 1st link not visited from the start page
- Visit link and get 1st non-visited link
- Repeat above step till no non-visited links remain
- Go to next non-visited link in the previous level and repeat 2nd step

36 Traversal strategies: (why BFS?)
- crawl will quickly spread all over the web
- load-balancing between servers
- in reality, more refined strategies (but still BFSish)
Tools/languages for implementation:
- Scripting languages (Python, Perl)
- Java (performance tuning tricky)
- C/C++ with sockets (low-level)
- available crawling tools (usually not scalable)

37 Focused Crawling
Focused Crawler: selectively seeks out pages that are relevant to a pre-defined set of topics.
- Topics specified by using exemplary documents (not keywords)
- Crawl most relevant links
- Ignore irrelevant parts
- Leads to significant savings in hardware and network resources

38 Web Indexer

39 Index Issues
- How to structure the index
- How to create the index (storage, time)
- How to store the index (storage, compression)
- How to process the index (storage, time)
- How to update the index (storage, time)

40 Inverted File Indexing
An inverted file index contains a list of the terms that appear in the document collection (called a lexicon or vocabulary) and, for each term in the lexicon, a list of pointers to all occurrences of that term in the document collection. This list is called an inverted list.

41 Inverted File Indexing
An inverted file contains:
- Postings (postings file): for each term in the lexicon, a list of pointers to all occurrences of that term in the main text; stored in increasing document ID
- Lexicon: mapping from terms to pointer list

42 Lexicon and Postings File
Lexicon entry: Salmon | 5 | PTR
Postings (via PTR): <5,23> <12,95> <16,22> <21,12> <25,42>
Document 5: The extinction of Atlantic salmon is predicted if actions to preserve stocks are not taken

43 Inverted files
Index information, whether manual or automatic, is stored in an inverted file.
Doc. 1: The cat is on the mat.
Doc. 2: The mat is on the floor.
Term | no. of occurrences | postings
Cat | 1 | 1
Floor | 1 | 2
Mat | 2 | 1,2

44 Structure of Inverted Index
Document-level indexing:
No. | Term | Documents
1 | cold | <2; 1,4>
2 | days | <2; 3,6>
Word-level indexing (each entry is Document ID : position ID):
1 | cold | <2; (1:6), (4:8)>

45 Structure of Inverted Index
- May be a hierarchical set of addresses, e.g. word number within sentence number within paragraph number within chapter number within volume number within document number
- Consider as a vector (d,v,c,p,s,w)

46 Compression of Inverted Indexes
- Uncompressed, maybe % of size of text
- Compression: store differences rather than document numbers
  E.g. (8:3,5,20,21,23,76,77,78) -> (8:3,2,15,1,2,53,1,1)
- Then code differences using global (for all lists) or local (for each list) methods
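Storing differences instead of document numbers can be sketched in a few lines. The function names are hypothetical; the input list is the example from the slide.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace each document number by its difference from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Recover the original document numbers with a running sum."""
    return list(accumulate(gaps))


doc_ids = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(doc_ids)
print(gaps)
print(from_gaps(gaps) == doc_ids)
```

The transformation is only worthwhile because the list is sorted: the gaps are small non-negative numbers, which a later coding step can store in fewer bits than the original document numbers.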

47 Indexing: (Simplified Approach)
doc1: Bob reads a book
doc2: Alice likes Bob
doc3: book
(1) scan through all documents
(2) for every word encountered generate entry (word, doc#, pos):
    (bob, 1, 1), (reads, 1, 2), (a, 1, 3), (book, 1, 4), (alice, 2, 1), (likes, 2, 2), (bob, 2, 3), (book, 3, 1)
(3) sort entries by (word, doc#, pos):
    (a, 1, 3), (alice, 2, 1), (bob, 1, 1), (bob, 2, 3), (book, 1, 4), (book, 3, 1), (likes, 2, 2), (reads, 1, 2)
(4) now transform into final form:
    a: (1, 3)
    Alice: (2, 1)
    Bob: (1, 1), (2, 3)
    book: (1, 4), (3, 1)
    likes: (2, 2)
    reads: (1, 2)
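The four steps of the simplified approach can be sketched directly, using the three example documents. Lower-casing the words and splitting on whitespace are assumptions of this sketch; the slide does not specify a tokenizer.

```python
def build_index(docs):
    """Scan, sort, and group (word, doc#, pos) entries into an inverted index."""
    entries = []
    # Steps 1 and 2: scan every document and emit one entry per word.
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            entries.append((word, doc_id, pos))
    # Step 3: sort by (word, doc#, pos); tuples compare field by field.
    entries.sort()
    # Step 4: group the sorted entries into one postings list per word.
    index = {}
    for word, doc_id, pos in entries:
        index.setdefault(word, []).append((doc_id, pos))
    return entries, index


docs = {1: "Bob reads a book", 2: "Alice likes Bob", 3: "book"}
entries, index = build_index(docs)
for word, postings in index.items():
    print(word, postings)
```

Because the entries are sorted before grouping, every postings list comes out ordered by document number and position without any further work.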

48 Improvements
Document numbers:
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
Encoded as gaps:
arm 4, 15, 10, 69, 45, ...
armada 145, 312, 332, ...
armadillo 678, 1456, 1836, ...
armani 90, 166, 116, 139, ...
- encode sorted runs by their gaps
- significant compression for frequent words!
- less effective if we also store position (adds incompressible lower order bits)
- many highly optimized schemes have been studied (see Witten/Moffat/Bell)
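Gaps only save space once they are stored with a code that uses fewer bits for smaller numbers. The slide does not name a specific code, so the sketch below swaps in variable-byte coding as one simple, well-known example: each number is written in 7-bit groups, and the high bit marks the last byte of a number. The function names are hypothetical.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, high bit on the last byte."""
    out = bytearray()
    for n in numbers:
        groups = [n & 0x7F]          # least significant 7 bits
        n >>= 7
        while n:
            groups.append(n & 0x7F)
            n >>= 7
        # Emit the most significant groups first, without the stop bit.
        for group in reversed(groups[1:]):
            out.append(group)
        out.append(groups[0] | 0x80)  # final byte carries the stop bit
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, current = [], 0
    for byte in data:
        if byte & 0x80:               # stop bit: the number is complete
            numbers.append((current << 7) | (byte & 0x7F))
            current = 0
        else:
            current = (current << 7) | byte
    return numbers


arm_ids = [4, 19, 29, 98, 143]    # document numbers for "arm"
arm_gaps = [4, 15, 10, 69, 45]    # the same list encoded as gaps
print(len(vbyte_encode(arm_ids)), len(vbyte_encode(arm_gaps)))
```

On this very short list the saving is a single byte (143 needs two bytes, while every gap fits in one); the effect grows for frequent words, whose long lists have many small gaps.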

49 Additional issues:
- keep data compressed during index construction
- try to keep index in main memory? (altavista)
- keep important parts in memory? (fancy hits in google)
- use database to store lists? (e.g., Berkeley DB)
Alternative to inverted index:
- signature files (Bloom filters): false positives
- bitmaps
- better to stick with inverted files (Witten/Moffat/Bell)

50 Standard Web Search Engine Architecture
[Figure labels: crawl the web; check for duplicates, store the documents; DocIds; create an inverted index; inverted index; search engine servers; user query; show results to user]

51 How Inverted Files Are Created
- Periodically rebuilt, static otherwise.
- Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
Term / Doc #:
now 1; is 1; the 1; time 1; for 1; all 1; good 1; men 1; to 1; come 1; to 1; the 1; aid 1; of 1; their 1; country 1; it 2; was 2; a 2; dark 2; and 2; stormy 2; night 2; in 2; the 2; country 2; manor 2; the 2; time 2; was 2; past 2; midnight 2

52 How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically.
Before sorting (Term / Doc #):
now 1; is 1; the 1; time 1; for 1; all 1; good 1; men 1; to 1; come 1; to 1; the 1; aid 1; of 1; their 1; country 1; it 2; was 2; a 2; dark 2; and 2; stormy 2; night 2; in 2; the 2; country 2; manor 2; the 2; time 2; was 2; past 2; midnight 2
After sorting (Term / Doc #):
a 2; aid 1; all 1; and 2; come 1; country 1; country 2; dark 2; for 1; good 1; in 2; is 1; it 2; manor 2; men 1; midnight 2; night 2; now 1; of 1; past 2; stormy 2; the 1; the 1; the 2; the 2; their 1; time 1; time 2; to 1; to 1; was 2; was 2

53 How Inverted Files are Created
- Multiple term entries for a single document are merged.
- Within-document term frequency information is compiled.
Before merging (Term / Doc #):
a 2; aid 1; all 1; and 2; come 1; country 1; country 2; dark 2; for 1; good 1; in 2; is 1; it 2; manor 2; men 1; midnight 2; night 2; now 1; of 1; past 2; stormy 2; the 1; the 1; the 2; the 2; their 1; time 1; time 2; to 1; to 1; was 2; was 2
After merging (Term / Doc # / Freq):
a 2 1; aid 1 1; all 1 1; and 2 1; come 1 1; country 1 1; country 2 1; dark 2 1; for 1 1; good 1 1; in 2 1; is 1 1; it 2 1; manor 2 1; men 1 1; midnight 2 1; night 2 1; now 1 1; of 1 1; past 2 1; stormy 2 1; the 1 2; the 2 2; their 1 1; time 1 1; time 2 1; to 1 2; was 2 2

54 How Inverted Files are Created
Finally, the file can be split into:
- A Dictionary or Lexicon file
- A Postings file

55 How Inverted Files are Created
Merged file (Term / Doc # / Freq):
a 2 1; aid 1 1; all 1 1; and 2 1; come 1 1; country 1 1; country 2 1; dark 2 1; for 1 1; good 1 1; in 2 1; is 1 1; it 2 1; manor 2 1; men 1 1; midnight 2 1; night 2 1; now 1 1; of 1 1; past 2 1; stormy 2 1; the 1 2; the 2 2; their 1 1; time 1 1; time 2 1; to 1 2; was 2 2
Dictionary/Lexicon (Term / N docs / Tot Freq):
a 1 1; aid 1 1; all 1 1; and 1 1; come 1 1; country 2 2; dark 1 1; for 1 1; good 1 1; in 1 1; is 1 1; it 1 1; manor 1 1; men 1 1; midnight 1 1; night 1 1; now 1 1; of 1 1; past 1 1; stormy 1 1; the 2 4; their 1 1; time 2 2; to 1 2; was 1 2
Postings (Doc # / Freq)
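The whole construction (parse, sort, merge duplicates into frequencies, split into dictionary and postings) can be sketched on the two example documents. The tokenizer (lower-case, letters only) and the dict-based layout are assumptions of this sketch; a real system would write both parts to files.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lower-case runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def build(docs):
    # Parse each document and merge repeated terms into a frequency.
    per_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Sort (term, doc#, freq) triples alphabetically, then by document.
    triples = sorted(
        (term, doc_id, freq)
        for doc_id, counts in per_doc.items()
        for term, freq in counts.items()
    )
    dictionary = {}  # term -> (N docs, Tot Freq)
    postings = {}    # term -> [(Doc #, Freq), ...]
    for term, doc_id, freq in triples:
        n_docs, total = dictionary.get(term, (0, 0))
        dictionary[term] = (n_docs + 1, total + freq)
        postings.setdefault(term, []).append((doc_id, freq))
    return dictionary, postings


docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}
dictionary, postings = build(docs)
print(dictionary["the"], postings["the"])
print(dictionary["country"], postings["country"])
```

The dictionary holds one small fixed-size record per term, while the postings hold the variable-length lists, which is the motivation for splitting them.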

56 Implementation Based on Inverted Files
Index file (Index terms | df) and first entry of each postings list (Dj, tfj):
computer | 3 | D7, 4
database | 2 | D1, 3
science | 4 | D2, 4
system | 1 | D5, 2

57 Inverted Indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of:
  - document ID
  - frequency of term in doc (optional)
  - position of term in doc (optional)
- These lists can be used to solve Boolean queries:
  country -> d1, d2
  manor -> d2
  country AND manor -> d2
- Also used for statistical ranking algorithms
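Solving a Boolean AND query on two postings lists amounts to intersecting two sorted lists. A minimal sketch, using the slide's `country` and `manor` example and hypothetical function names:

```python
def intersect(a, b):
    """AND: walk two sorted postings lists in step, keeping common IDs."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """OR: every document ID that appears in either list, sorted."""
    return sorted(set(a) | set(b))


index = {"country": [1, 2], "manor": [2]}
print(intersect(index["country"], index["manor"]))
print(union(index["country"], index["manor"]))
```

The merge-style intersection reads each list once from front to back, which is one reason postings are stored in increasing document ID.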

58 Inverted Indexes for Web Search Engines Inverted indexes are still used, even though the web is so huge. Some systems partition the indexes across different machines. Each machine handles different parts of the data. Other systems duplicate the data across many machines; queries are distributed among the machines. Most do a combination of these.
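Partitioning an index across machines can be sketched as follows. This is only an illustration under strong assumptions: each "machine" is a plain dictionary, documents are assigned by `doc_id % n_shards`, and a query is sent to every shard and the partial results merged. None of these choices come from the slide.

```python
def partition(docs, n_shards):
    """Build one small inverted index per shard (assumed: doc_id % n_shards)."""
    shards = [{} for _ in range(n_shards)]
    for doc_id, text in sorted(docs.items()):
        shard = shards[doc_id % n_shards]
        for term in set(text.lower().split()):
            shard.setdefault(term, []).append(doc_id)
    return shards


def search(shards, term):
    """Ask every shard for its matches and merge the partial lists."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(term, []))
    return sorted(hits)


# Hypothetical documents.
docs = {
    1: "the country",
    2: "the country manor",
    3: "dark night",
    4: "manor house",
}
shards = partition(docs, n_shards=2)
print(search(shards, "manor"))
print(search(shards, "country"))
```

Duplicating the data instead would mean giving every machine the full index and routing each query to just one of them; combining both ideas gives groups of replicas, each group covering one partition.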

59 Summary

60 Search Engines
- Search engines are the most popular way to locate information online.
- About 33 million U.S. Internet users query search engines on a typical day.
- More than 80% have used search engines.
- Search engines are measured by coverage and recency.

61 Search Engine Architecture
[Figure labels: WWW; Data Acquisition: Generic Crawler, BFS-Crawler, Focused Crawler; User Interfaces: Admin Interface, User Interface, User Tools; Scalable Server Components: Storage Server, Index Server, Graph Server]

62 Working of a Local Search Engine
[Figure labels: Web Site Documents; Indexer (Gets words, Stores Words); Index; Search Form; User (Sends Query); Search Engine (Looks in Index, Gets Matches, Sends Formatted Results); Results Page; User Selects required page; Retrieved Page; User views Retrieved Page]

63 Indexing
[Figure: disks -> indexing -> inverted index]
- parse & build lexicon & build index
- index very large
- I/O-efficient techniques needed
Inverted index:
aardvark 3452, 11437, ...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
zebra 602, 1189, 3209, ...

64 Indexing
[Figure: disks -> indexing -> inverted index]
- how to build an index
  - in I/O-efficient manner
  - in parallel
  - later
  - ...
- how to compress an index (while building it in situ)
- goal: intermediate size not much larger than final size
Inverted index:
aardvark 3452, 11437, ...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
zebra 602, 1189, 3209, ...

65 Basic concepts and choices:
- lexicon: set of all words encountered; millions in the case of the web
- postings: for each word occurrence, store index of document where it occurs
- also store position in document? (probably yes)
  - increases space for index significantly!
  - allows efficient search for phrases
  - relative positions of words may be important for ranking
- stop words: common words such as is, a, the
- ignore stop words? (maybe better not)
  - saves space in index
  - cannot search for "to be or not to be"
- stemming: runs = run = running (depends on language)

66 Stop Lists
- Lists of words which are dropped from processing
- Few words to hundreds; may include single letters
- E.g. Dialog: AN, AND, BY, FOR, FROM, OF, THE, TO, WITH
- Improve storage efficiency (may be 10 to 50% of text)
- Improve processing efficiency
- May cause problems: "to be or not to be", AT&T, "man of war", "birds of prey"

67 Stemming
- Deals with word variation (morphological variants)
- E.g.: Computer, computers, computing, compute, computed, computational, computationally -> comput
- Use a stemming algorithm for conflation
- Set of rules applied to each word as it is processed
- Simplest: combine singular and plural form
- Examples: Porter, Lovins, Paice

68 Simple Stemmer
- If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
- If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"
- If a word ends in "s" but not "us" or "ss", then "s" -> NULL
(apply only first applicable rule)
e.g. spiders, flies, throes, bees
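The three rules can be sketched as a short function. One point is an assumption of this sketch: a rule counts as "applicable" only when its whole condition holds (suffix present and no excluded ending), so a word blocked by the exceptions of one rule may still be handled by a later rule. The function name is hypothetical.

```python
def simple_stem(word):
    """Apply the first rule whose full condition holds; otherwise keep the word."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"     # "ies" -> "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"     # "es" -> "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]           # drop the final "s"
    return w


for example in ["spiders", "flies", "throes", "bees"]:
    print(example, "->", simple_stem(example))
```

Under this reading, "throes" and "bees" are skipped by the second rule because of their "oes" and "ees" endings and then lose only the final "s" under the third rule.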

69 Impact of Stemmers
- May decrease index file size up to 50%
- Should increase recall at cost of precision
- Studies are equivocal; some improvements found but not marked
- May depend on nature of vocabulary

70 Availability of Stemmers Many on Web, e.g. see mmer/index.html for many encodings of Porter Stemmer Or for encodings of the Lovins Stemmer

71 Querying
- Boolean queries: (zebra AND armadillo) OR armani
- unions/intersections of lists
- look up in the inverted index:
aardvark 3452, 11437, ...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
zebra 602, 1189, 3209, ...

72 Querying and term-based ranking
Recall Boolean queries:
- (zebra AND armadillo) OR armani
- unions/intersections of lists
- look up in the inverted index:
aardvark 3452, 11437, ...
arm 4, 19, 29, 98, 143, ...
armada 145, 457, 789, ...
armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
zebra 602, 1189, 3209, ...

73 Information Retrieval

74 History of IR Systems Role of documentalists Role of database industry Role of researchers in information retrieval systems

75 TYPICAL IR PROBLEM
[Figure labels: Query; Query representation; Document collection; Document Representation; Answer]
- How exact is the representation of the document?
- How exact is the representation of the query?
- How well is query matched to data?
- How relevant is the result to the query?

76 Boolean Information Retrieval

77 Boolean Model
- first online systems in 60s and 70s
- most widely used in commercial IR
- AND, OR, NOT operators
- usually supplemented with proximity operators
- requires an exact match
- based on inverted file

78 Boolean Model
- Based on set theory and Boolean algebra
- Queries are specified as Boolean expressions
- Widely used in commercial IR systems (Dialog, Lexis/Nexis)

79 Boolean Operators AND OR NOT

80 Boolean AND: Information AND Retrieval
[Venn diagram: circles labeled Information and Retrieval]

81 Boolean OR: Cats OR Felines
[Venn diagram: circles labeled Cats and Felines]
