EPL660: Information Retrieval and Search Engines Lab 8

1 EPL660: Information Retrieval and Search Engines Lab 8 Παύλος Αντωνίου Office: B109, ΘΕΕ01 University of Cyprus Department of Computer Science

2 What is Apache Nutch? A production-ready web crawler. Operates at one of three scales: local filesystem (reliable, no network errors, caching is unnecessary), intranet (local/corporate network), or whole web (whole-web crawling is difficult). Nutch can run on a single machine (local mode), but gains much of its strength from running on a Hadoop cluster (deploy mode). It relies on Apache Hadoop data structures, which are great for batch processing. Open source, implemented in Java.

3 Nutch vs Lucene Nutch uses Lucene (through Solr) for indexing. A common question: "Should I use Lucene or Nutch?" Simple answer: use Lucene if you don't need a web crawler, i.e. something to fetch the documents to be indexed. Nutch is a better fit for sites where you don't have direct access to the underlying data, where data comes from disparate sources (multiple domains) and in different document formats: JSON, XML, text, HTML, ...

4 Nutch vs Solr Nutch is a web crawler: it collects web pages and uses Solr for indexing. Solr is a search platform: it does no crawling and doesn't fetch the data, you have to feed it. Solr is perfect if you already have the data to be indexed (in XML, JSON, a database, etc.).
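
To make "you have to feed it" concrete, here is a minimal sketch of pushing a document into Solr over HTTP; the URL and the core name mycore are placeholders for your setup:

  # POST a JSON document to Solr's update handler and commit it immediately
  curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/mycore/update?commit=true' \
    -d '[{"id":"doc1","title":"Hello Solr"}]'

Nutch automates exactly this feeding step for crawled pages via its solrindex job, described later.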

5 Nutch building blocks

6 Nutch Data Nutch data is composed of: crawl/crawldb contains information about all pages (URLs) known to the crawler and their status, such as the last time it visited the page, its fetching status, refresh interval, content checksum, page importance, etc. crawl/linkdb contains, for each URL known to Nutch, a list of other URLs pointing to it (incoming links) and their associated anchor text (from HTML <a href="...">anchor text</a> elements)

7 Nutch Data crawl/segments Segments are directories with the following subdirectories:
crawl_generate names a set of URLs to be fetched
crawl_fetch contains the status of fetching each URL
content contains the raw content retrieved from each URL (for indexing)
parse_text contains the parsed text of each URL
parse_data contains outlinks and metadata parsed from each URL (such as anchor text)
crawl_parse contains the outlink URLs, used to update the crawldb
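
As a quick sanity check after a fetch round, you can list the newest segment and confirm these subdirectories exist (a sketch; the timestamped directory name is whatever Nutch generated on your machine):

  # pick the most recent segment directory and list its contents
  s=`ls -d crawl/segments/2* | tail -1`
  ls $s   # expect: content crawl_fetch crawl_generate crawl_parse parse_data parse_text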

8 Crawling frontier challenge There is no authoritative catalog of web pages, so where do you start crawling from? Crawlers need to discover their own view of the web universe: start from a seed list and follow (walk) some (useful? interesting?) outlinks. There are many dangers in simply wandering around: explosion or collapse of the frontier, and collecting unwanted content (spam, junk, offensive material).

9 Main Nutch workflow (the generate-fetch-parse-update steps repeat): Inject: initial creation of CrawlDB; insert seed URLs into CrawlDB (the initial LinkDB is empty). Generate new shard's fetchlist (from crawldb to crawl/segments/crawl_generate). Fetch raw content. Parse content (discovers outlinks). Update CrawlDB from shards. Update LinkDB from shards. Index shards. Every step is implemented as one (or more) MapReduce job(s). Command-line: bin/nutch inject | generate | fetch | parse | updatedb | invertlinks | index / solrindex
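
Put together as shell commands, one full round of this workflow looks roughly like the following (a sketch assuming the crawl lives under crawl/, seeds under urls/, and a placeholder Solr URL; later slides walk through each command in detail):

  bin/nutch inject crawl/crawldb urls                     # seed the CrawlDB
  bin/nutch generate crawl/crawldb crawl/segments         # write a fetchlist into a new segment
  s=`ls -d crawl/segments/2* | tail -1`                   # newest segment
  bin/nutch fetch $s                                      # fetch raw content
  bin/nutch parse $s                                      # parse, discovering outlinks
  bin/nutch updatedb crawl/crawldb $s                     # fold results back into CrawlDB
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments  # update LinkDB
  bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb $s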

10 Injecting new URLs 1) Specify a list of URLs you want to crawl 2) Use a URL filter 3) Use the injector to add URLs to the crawldb Note: filters, normalizers and plugins allow Nutch to be highly modular, flexible and very customizable throughout the whole process.

11 Generating fetchlists 4) Generate a fetch list from the crawldb 5) Create a segment directory for the generated fetch list

12 Fetching content 6) Fetch segment

13 Content processing 7) Parse the results and update CrawlDB

14 Link inversion 8) Before indexing, invert all links, so that incoming anchor text can be indexed with pages

15 Link Inversion Pages (URLs) have outgoing links (outlinks): I know where I am pointing to. Question: who points to me? I don't know; there is no catalog of pages, and NOBODY knows for sure either! In-degree may indicate the importance of a page, and anchor text provides important semantic info. Answer: invert the outlinks that I know about.

16 Link Inversion as MR job Goal: compute inlinks for all downloaded and parsed pages. Input: each page as a pair <srcurl, ParseData>, where ParseData contains the page's outlinks (desturls). Map: <srcurl, ParseData> → <desturl, Inlinks>, where Inlinks: <srcurl, anchortext>. Reduce: map output pairs <desturl, Inlinks> are grouped by desturl, and the Inlinks are appended in a dedicated Java Writable class. Output: <desturl, list of Inlinks>.
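
The same map-group-reduce idea can be sketched with ordinary shell tools, using sort as the shuffle phase. This is purely illustrative, not Nutch's actual job (which runs over binary segment data): assume a hypothetical outlinks.txt holding one tab-separated "srcurl desturl anchortext" triple per line:

  # map: emit <desturl, (srcurl, anchortext)>; shuffle: sort by desturl;
  # reduce: concatenate all inlinks for each desturl onto one line
  awk -F'\t' '{ print $2 "\t" $1 "\t" $3 }' outlinks.txt \
    | sort -t$'\t' -k1,1 \
    | awk -F'\t' '$1 != prev { if (NR > 1) print ""; printf "%s:", $1; prev = $1 }
                  { printf " (%s, %s)", $2, $3 } END { print "" }'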

17 Page importance - scoring 9) Page importance metadata, computed from the inverted links, is stored in CrawlDB

18 Indexing 10) Using data from all possible sources (crawldb, linkdb, segments), the indexer creates an index and saves it within the Solr directory; for indexing, the Lucene library is used. 11) Users can then search for information regarding the crawled web pages via Solr.

19 Nutch from binary distribution Download the Apache Nutch 1.14 binary package from here. Unzip your binary Nutch package and cd apache-nutch-1.14/. Confirm correct installation by running "bin/nutch". If you see "Permission denied", run "chmod +x bin/nutch".
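
The whole setup fits in a few commands (a sketch; the download URL is an assumption, check the Apache archives for the actual location):

  # download, unpack and smoke-test Nutch 1.14
  wget https://archive.apache.org/dist/nutch/1.14/apache-nutch-1.14-bin.tar.gz
  tar -xzf apache-nutch-1.14-bin.tar.gz
  cd apache-nutch-1.14/
  chmod +x bin/nutch   # only needed if you hit "Permission denied"
  bin/nutch            # should print the list of available commands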

20 Crawl your first website Nutch requires two configuration changes before a website can be crawled: 1. Customize your crawl properties, where, at a minimum, you provide a name for your crawler so that external servers can recognize it 2. Set a seed list of URLs to crawl

21 Customize your crawl properties Default crawl properties live in conf/nutch-default.xml, which mainly remains unchanged. conf/nutch-site.xml serves as a place to add your own custom crawl properties that overwrite conf/nutch-default.xml. Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, within the <configuration> element:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my Nutch Spider</value>
  </property>
</configuration>

22 Crawl your first website: Seed list A URL seed list is a list of websites, one per line, which Nutch will crawl. Create a URL seed list: mkdir -p urls, cd urls, then nano seed.txt to create a text file seed.txt under urls/ (one URL per line for each site you want Nutch to crawl).
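
For example, a seed.txt could look like this (the URL is just a placeholder; list whichever sites you want crawled):

  http://nutch.apache.org/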

23 Configure Regular Expression Filters conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download. Edit the file conf/regex-urlfilter.txt and REPLACE
# accept anything else
+.
WITH
+^http://([a-z0-9]*\.)*nutch.apache.org/
if, for example, you wished to limit the crawl to the nutch.apache.org domain. NOTE: not specifying any domains to include within regex-urlfilter.txt will lead to all domains linking to your seed URLs being crawled as well.

24 Seeding crawldb with a list of URLs The injector adds URLs to the crawldb: bin/nutch inject crawl/crawldb urls STEP 1: FETCHING, PARSING PAGES Generate a fetch list for all pages due to be fetched. The fetch list is placed in a newly created segment directory: bin/nutch generate crawl/crawldb crawl/segments The segment directory is named by the time it is created: s1=`ls -d crawl/segments/2* | tail -1` echo $s1 Run the fetcher on this segment: bin/nutch fetch $s1

25 Seeding crawldb with a list of URLs Parse the entries: bin/nutch parse $s1 When this is complete, we update the crawldb database with the results of the fetch: bin/nutch updatedb crawl/crawldb $s1 After this first fetch, the crawldb database contains both updated entries for all initial pages and new entries that correspond to newly discovered pages linked from the initial set.

26 Seeding crawldb with a list of URLs Now we generate and fetch a new segment containing the top-scoring 1,000 pages: bin/nutch generate crawl/crawldb crawl/segments -topn 1000 s2=`ls -d crawl/segments/2* | tail -1` bin/nutch fetch $s2 bin/nutch parse $s2 bin/nutch updatedb crawl/crawldb $s2 Let's fetch one more round: bin/nutch generate crawl/crawldb crawl/segments -topn 1000 s3=`ls -d crawl/segments/2* | tail -1` bin/nutch fetch $s3 bin/nutch parse $s3 bin/nutch updatedb crawl/crawldb $s3
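
Rather than repeating the block once per round, the same rounds can be scripted with a loop (a sketch under the same directory layout):

  # run three generate-fetch-parse-update rounds, top 1000 pages each
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topn 1000
    s=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $s
    bin/nutch parse $s
    bin/nutch updatedb crawl/crawldb $s
  done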

27 Seeding crawldb with a list of URLs STEP 2: INVERTLINKS Before indexing we first invert all links, so that we may index incoming anchor text with the pages: bin/nutch invertlinks crawl/linkdb -dir crawl/segments STEP 3: INDEXING INTO APACHE SOLR [Nutch-Solr integration needed] Usage: bin/nutch solrindex <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-nocommit] [-deletegone] [-filter] [-normalize] Example: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/<segment>/ -filter -normalize

28 Seeding crawldb with a list of URLs STEP 4: DELETING DUPLICATES Ensure URLs are unique in the index. Usage: bin/nutch solrdedup <solr url> Example: bin/nutch solrdedup http://localhost:8983/solr STEP 5: CLEANING SOLR Scans the crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents. Usage: bin/nutch solrclean <crawldb> <solrurl> Example: bin/nutch solrclean crawl/crawldb/ http://localhost:8983/solr

29 All In One: Using the Crawl Command bin/crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds> -i|--index indexes crawl results into a configured indexer; -D passes a Java property to Nutch calls; Seed Dir is the directory in which to look for a seeds file; Crawl Dir is the directory where the crawl/link/segments dirs are saved; Num Rounds is the number of rounds to run this crawl for. Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr urls/ TestCrawl/ 2

30 Nutch Command Line Options Below are some of the command line options: bin/nutch readdb crawldir/crawldb -stats bin/nutch readdb crawldir/crawldb -dump outdump bin/nutch readdb crawldir/crawldb -topn 2 outreaddbtop bin/nutch readdb crawldir/linkdb -dump outputlinkdb Run bin/nutch with no arguments to see the full list of available commands.
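
Two of these are especially handy while debugging a crawl (a sketch; the segment name is a placeholder and the exact output depends on your crawl):

  bin/nutch readdb crawldir/crawldb -stats                         # per-status totals: db_fetched, db_unfetched, db_gone, ...
  bin/nutch readseg -dump crawldir/segments/<segment> outsegdump   # dump a segment's contents to plain text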

31 Integrate Solr with Nutch Replace Solr's schema.xml with the Nutch-specific schema.xml, then run the Solr index command: bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

32 Checking Your Index
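
With documents indexed, you can verify the index directly over Solr's HTTP API (a sketch; the URL and core name are placeholders for your setup):

  # ask Solr for up to 5 documents matching "nutch", as JSON
  curl 'http://localhost:8983/solr/<core>/select?q=nutch&wt=json&rows=5'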

33 Useful Links
