Focused Crawling with Apache Nutch. ApacheCon North America, Vancouver, 2016

Hello! I am Sujen Shah
Computer Science @ University of Southern California
Research Intern @ NASA Jet Propulsion Laboratory
Member of The ASF and Nutch PMC since 2015
sujen@apache.org /in/sujenshah

Outline
- The Apache Nutch Project
- Architectural Overview
- Focused Crawling
- Domain Discovery
- Evaluation
- Future Additions
- Acknowledgements

Apache Nutch
- Highly extensible and scalable open source web crawler software project
- Hadoop-based ecosystem provides scalability
- Highly modular architecture allows development of custom plugins
- Supports full-text indexing and searching
- Multi-threaded, robust, distributed crawling with configurable politeness
Project website: http://nutch.apache.org/

Nutch History
- 2003: Started by Doug Cutting and Mike Cafarella
- 2005: MapReduce implementation
- 2006: Hadoop spin-off from Nutch
- 2007: Uses MIME type detection from Tika
- 2010: Top Level Project at Apache
- 2012: Nutch 2.x released, offering storage abstraction via Apache Gora
- 2014-2015: REST API, Publisher/Subscriber, JavaScript interaction and content-based Focused Crawling capabilities
Friends of Nutch

Architecture [Diagram courtesy Florian Hartl : http://florianhartl.com/nutch-how-it-works.html]

Architecture: CrawlDb
Stores info for each URL: URL, fetch status, signature, protocol
[Diagram courtesy Florian Hartl: http://florianhartl.com/nutch-how-it-works.html]

Architecture: LinkDb
Stores incoming links to each URL and their associated anchor text.
[Diagram courtesy Florian Hartl: http://florianhartl.com/nutch-how-it-works.html]

Architecture: Segments
Stores the fetch-list, raw page content, parsed content, outlinks and metadata.
[Diagram courtesy Florian Hartl: http://florianhartl.com/nutch-how-it-works.html]


Nutch Workflow
A typical workflow is a sequence of batch operations:
- Inject: populate the crawldb from a seed list
- Generate: select URLs to fetch
- Fetch: fetch URLs from the fetchlist
- Parse: parse content from the fetched URLs
- UpdateDb: update the crawldb
- InvertLinks: build the linkdb
- Index: optional step to index into Solr, Elasticsearch, etc.
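As a rough sketch, the same round can be driven programmatically via Hadoop's ToolRunner. This assumes the Nutch 1.x tool classes (Injector, Generator, Fetcher, ParseSegment, CrawlDb, LinkDb); the paths are examples and the argument order may differ between Nutch versions, so treat it as an illustration rather than a recipe:

```java
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

/** One crawl round driven from Java (sketch; paths are hypothetical examples). */
public class CrawlRound {
  public static void main(String[] args) throws Exception {
    org.apache.hadoop.conf.Configuration conf = NutchConfiguration.create();
    String crawldb = "crawl/crawldb", segments = "crawl/segments", linkdb = "crawl/linkdb";
    // Inject: populate the crawldb from the seed directory "urls"
    ToolRunner.run(conf, new Injector(), new String[] {crawldb, "urls"});
    // Generate: select URLs to fetch; writes a new segment under crawl/segments
    ToolRunner.run(conf, new Generator(), new String[] {crawldb, segments});
    String segment = "crawl/segments/<latest>"; // in practice, pick the newest segment dir
    // Fetch and parse the generated segment
    ToolRunner.run(conf, new Fetcher(), new String[] {segment});
    ToolRunner.run(conf, new ParseSegment(), new String[] {segment});
    // UpdateDb: fold fetch results back into the crawldb
    ToolRunner.run(conf, new CrawlDb(), new String[] {crawldb, segment});
    // InvertLinks: build the linkdb from the segment's outlinks
    ToolRunner.run(conf, new LinkDb(), new String[] {linkdb, segment});
  }
}
```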

Architecture: a few more tools at a glance
- Fetcher: multi-threaded, high throughput; limits load on servers; partitioning by host, IP or domain
- Plugins: on-demand activation, customizable by the developer; examples: URL filters, protocols, parsers, indexers, scoring, etc.
- WebGraph: stores outlinks, inlinks and node scores; iterative link analysis via LinkRank

Nutch Document
Some of the fields that Nutch stores for a URL:
- URL
- Fetch time
- Fetch status
- Protocol used
- Page content
- Score
- Arbitrary metadata
- Inlinks

Crawl Frontier
The crawl frontier is the system that governs the order in which URLs are followed by the crawler. Two important considerations [1]:
- Refresh rate: high-quality pages that change frequently should be prioritized
- Politeness: avoid repeated fetch requests to a host within a short time span
[Diagram: the URL frontier mediates between the open web and the URLs already fetched, ordering by refresh rate, politeness, relevance, etc.]
[1] http://nlp.stanford.edu/ir-book/html/htmledition/the-url-frontier-1.html
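To make the frontier idea concrete, here is a toy in-memory sketch (hypothetical code, not Nutch's actual frontier; requires Java 16+ for records): a priority queue ordered by score, plus a per-host timestamp map that enforces a politeness delay.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Toy crawl frontier: highest-priority URL first, at most one fetch per host per delay window. */
public class ToyFrontier {
  record Candidate(String url, double priority) {}

  private final PriorityQueue<Candidate> queue =
      new PriorityQueue<>((a, b) -> Double.compare(b.priority(), a.priority()));
  private final Map<String, Long> lastFetchPerHost = new HashMap<>();
  private final long politenessDelayMs;

  public ToyFrontier(long politenessDelayMs) { this.politenessDelayMs = politenessDelayMs; }

  public void add(String url, double priority) { queue.add(new Candidate(url, priority)); }

  /** Returns the best URL whose host is polite to fetch now, or null if none is ready. */
  public String next() {
    PriorityQueue<Candidate> deferred = new PriorityQueue<>(queue.comparator());
    String chosen = null;
    long now = System.currentTimeMillis();
    while (!queue.isEmpty()) {
      Candidate c = queue.poll();
      String host = URI.create(c.url()).getHost();
      if (now - lastFetchPerHost.getOrDefault(host, 0L) >= politenessDelayMs) {
        lastFetchPerHost.put(host, now); // host is polite to hit again: take this URL
        chosen = c.url();
        break;
      }
      deferred.add(c); // too soon for this host; defer the candidate
    }
    queue.addAll(deferred); // put deferred candidates back for the next call
    return chosen;
  }
}
```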

Frontier Expansion
Manual expansion, seeding new URLs from:
- Reference websites (Wikipedia, Alexa, etc.)
- Search engines
- Prior knowledge
Automatic discovery, following contextually relevant outlinks:
- Cosine similarity and Naive Bayes plugins
- Control via URL filters and regular expressions
- Scoring, e.g. OPIC scoring

Broad vs. Focused Crawling
Broad crawling:
- Unlimited crawl frontier
- Limited by bandwidth and politeness factors
- Useful for creating an index of the open web
- High recall, low precision
Focused crawling:
- Limits the crawl frontier to relevant sites
- Low resource consumption
- Useful for domain discovery, as it prioritizes based on content relevance

Domain Discovery A Domain, here, is defined as an area of interest for a user. Domain Discovery is the act of exploring a domain of which a user has limited prior knowledge.

Why Domain Discovery...? DARPA MEMEX Program
Current limitations:
- Today's web searches use a centralized, one-size-fits-all approach
- They miss information in the deep web
- Results are not organized or aggregated beyond a list of links
- Successful for commercial purposes, but not for government use cases
Goals:
- Develop next-generation technology for discovery, organization and presentation of domain-specific content
- Extend current search technologies to the deep web
- Improve search interfaces over publicly available content for military, government and commercial enterprises
Program website: http://www.darpa.mil/program/memex

The Web
- Surface web, ~5% (searchable via commercial search engines)
- Deep web (academic records, medical information, legal documents, various databases)
- Dark web (TOR network, illicit activities)

Domain Discovery with Nutch
Previously available tools:
- URL filter plugins: filter based on regular expressions, whitelist/blacklist hosts, filter based on content MIME type
- Scoring links (OPIC scoring)
- Breadth-first or depth-first crawl
Limitations:
- Follows the link structure
- Does not capture content relevance to a domain
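For illustration, a toy filter in the spirit of Nutch's regex-urlfilter (not the plugin's actual code; the rules and the example.org whitelist are hypothetical). Like Nutch URL filter plugins, it returns the URL if accepted and null otherwise:

```java
import java.util.List;
import java.util.regex.Pattern;

/** Toy URL filter in the spirit of Nutch's regex-urlfilter: the first matching rule wins. */
public class ToyUrlFilter {
  record Rule(boolean accept, Pattern pattern) {}

  private final List<Rule> rules = List.of(
      new Rule(false, Pattern.compile("\\.(gif|jpg|png|css|js)$")), // skip non-text resources
      new Rule(false, Pattern.compile("[?*!@=]")),                  // skip dynamic-looking URLs
      new Rule(true,  Pattern.compile("^https?://([a-z0-9.-]+\\.)?example\\.org/")), // whitelisted host
      new Rule(false, Pattern.compile(".")));                       // default: reject everything else

  /** Returns the URL if the first matching rule accepts it, else null. */
  public String filter(String url) {
    for (Rule r : rules) {
      if (r.pattern().matcher(url).find()) {
        return r.accept() ? url : null;
      }
    }
    return null;
  }
}
```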

Domain Discovery with Nutch
To capture content relevance to a domain, two new tools have been introduced:
- Cosine Similarity scoring filter
- Naive Bayes parse filter
Nutch JIRA issues:
https://issues.apache.org/jira/browse/NUTCH-2039
https://issues.apache.org/jira/browse/NUTCH-2038

Cosine Similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them [1]: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the vectors. The smaller the angle, the higher the similarity.
[1] https://en.wikipedia.org/wiki/Cosine_similarity

Cosine Similarity Scoring in Nutch
- Implemented as a scoring filter
- Computed by measuring the angle between two document vectors
- Document vector: a term-frequency vector containing all the terms occurring on a fetched page, e.g. DV = { robots: 51, autonomous: 12, artificial: 23, ... }
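A minimal sketch of the underlying computation in plain Java (not the plugin itself): cosine similarity between two term-frequency maps like the DV above, where a hypothetical gold-standard vector built from the user's keyword list plays the role of the second vector.

```java
import java.util.Map;

/** Cosine similarity between two term-frequency vectors stored as maps. */
public class CosineSimilarity {
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    long dot = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      dot += (long) e.getValue() * b.getOrDefault(e.getKey(), 0); // sum over shared terms
    }
    double normA = Math.sqrt(a.values().stream().mapToDouble(v -> (double) v * v).sum());
    double normB = Math.sqrt(b.values().stream().mapToDouble(v -> (double) v * v).sum());
    return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
  }

  public static void main(String[] args) {
    // Gold standard built from the user's keyword list (hypothetical terms and weights)
    Map<String, Integer> gold = Map.of("robots", 10, "autonomous", 5, "artificial", 8);
    // Document vector of a fetched page, as on the slide
    Map<String, Integer> page = Map.of("robots", 51, "autonomous", 12, "artificial", 23);
    System.out.printf("score = %.3f%n", cosine(page, gold));
  }
}
```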

Cosine Similarity Scoring - Architecture

Cosine Similarity Scoring - Working
Features of the similarity scoring plugin:
- Scores a page based on content relevance
- Leverages a simplistic bag-of-words approach
- Outlinks from relevant parent pages are considered relevant

Iteration 1
- Start with an initial seed; the seed is considered relevant
- The user provides a keyword list for cosine similarity
- All children are given the same priority as the parent in the crawl frontier
Policy: fetch the top 4 URLs in the frontier
[Diagram legend: seed; fetched; unfetched (in the crawl frontier); decreasing order of relevance]

Iteration 2
- Children are fetched by the crawler
- Similarity against the gold standard is computed and scores are assigned
Policy: fetch the top 4 URLs in the frontier
[Same diagram legend as above]

Iterations 3-5
[Diagrams only: under the same policy, the crawler repeatedly fetches the top 4 URLs in the frontier, expanding along the most relevant branches]

Naive Bayes Classifier
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features [1].
Naive Bayes in Nutch:
- Implemented as a parse filter
- Classifies a fetched page as relevant or irrelevant based on a user-provided training dataset
[1] https://en.wikipedia.org/wiki/Naive_Bayes_classifier

Naive Bayes Classifier - Working
- The user provides a set of labeled examples as training data
- A model is created from the training data
- Each page is classified as relevant (positive) or irrelevant (negative)

Naive Bayes Classifier - Working
Features:
- All outlinks from an irrelevant (negative) page are discarded
- All outlinks from a relevant (positive) page are followed
[Diagram: seed crawl scenario]
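For intuition, here is a self-contained multinomial Naive Bayes sketch with Laplace smoothing (plain Java, not the Nutch parse filter; the training texts are made up). It classifies a page's text and applies the outlink policy above:

```java
import java.util.HashMap;
import java.util.Map;

/** Tiny multinomial Naive Bayes for relevant/irrelevant pages (class 0 = relevant). */
public class ToyNaiveBayes {
  private final Map<String, int[]> wordCounts = new HashMap<>(); // term -> {relevant, irrelevant}
  private final int[] docCounts = new int[2];
  private final int[] tokenCounts = new int[2];

  public void train(String text, boolean relevant) {
    int cls = relevant ? 0 : 1;
    docCounts[cls]++;
    for (String w : text.toLowerCase().split("\\W+")) {
      if (w.isEmpty()) continue;
      wordCounts.computeIfAbsent(w, k -> new int[2])[cls]++;
      tokenCounts[cls]++;
    }
  }

  /** True if log P(relevant | text) >= log P(irrelevant | text). */
  public boolean isRelevant(String text) {
    int vocab = wordCounts.size();
    double[] logp = new double[2];
    for (int cls = 0; cls < 2; cls++) {
      logp[cls] = Math.log((double) docCounts[cls] / (docCounts[0] + docCounts[1])); // prior
      for (String w : text.toLowerCase().split("\\W+")) {
        if (w.isEmpty()) continue;
        int count = wordCounts.getOrDefault(w, new int[2])[cls];
        logp[cls] += Math.log((count + 1.0) / (tokenCounts[cls] + vocab)); // Laplace smoothing
      }
    }
    return logp[0] >= logp[1];
  }

  public static void main(String[] args) {
    ToyNaiveBayes nb = new ToyNaiveBayes();
    nb.train("autonomous robots artificial intelligence sensors", true);
    nb.train("celebrity gossip fashion awards red carpet", false);
    String page = "new artificial intelligence for autonomous robots";
    // Follow outlinks only from pages classified as relevant, mirroring the policy above
    System.out.println(nb.isRelevant(page) ? "relevant: follow outlinks" : "irrelevant: discard outlinks");
  }
}
```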

Evaluation
The following process was used to perform domain discovery with the tools discussed earlier:
1. Deploy 3 different Nutch configurations:
   a. Custom regex-filters and default scoring
   b. Cosine similarity scoring activated, with a keyword list
   c. Naive Bayes filter activated, with labeled training data
2. Provide the same seeds to all 3 configurations
3. Run the crawl for 7 iterations
[Thanks to Xu Wu for the evaluations]

Evaluation

Iteration | Regex-filters + seed list | Cosine similarity scoring | Naive Bayes parse filter
          | related   total   rate    | related   total   rate    | related   total   rate
1         |      17      47    36%    |      16      47    34%    |      16      45    36%
2         |     476    1286    37%    |     503    1365    37%    |     519    1293    40%
3         |     169    1334    13%    |     140    1265    11%    |     268    1410    19%
4         |     354    1351    26%    |     388     656    59%    |     528    1628    32%
5         |     704    1553    45%    |    1569    1971    80%    |     445    1466    30%
6         |     267    1587    17%    |    1531    1949    79%    |     173    1567    11%
7         |     354    1715    21%    |    1325    1962    68%    |     433    1795    24%
Total     |    2341    8873    26%    |    5427    9215    59%    |    2388    9204    26%

("related" = domain-related pages fetched)
[Thanks to Xu Wu for the evaluations]

Evaluation [Thanks to Xu Wu for the evaluations]

Analysis
- Page relevance* for the first 3 rounds is almost the same for all methods
- Relevance rises sharply for cosine similarity scoring in later rounds
- Naive Bayes and custom regex-filters perform almost the same
* Page relevance: the true relevance of a fetched page was calculated using MeaningCloud's [1] text classification API.
[1] https://www.meaningcloud.com/

Limitations
A few things to consider:
- The performance of these new focused crawling tools depends on how well the user provides the initial domain-relevant data: keywords/text for the Cosine Similarity filter, labeled text for the Naive Bayes filter
- Currently these tools perform well on textual data; there is no provision for multimedia
- These techniques are good at providing topically relevant content, but may not provide factually relevant content

Future Improvements
Potential additions to focused crawling in Nutch:
- Use the HTML DOM structure to assess relevance
- Iteratively augment the model
- Use Tika's NER parser and GeoParser to extract entities
- Use part-of-speech tagging to capture grammar (context) in a domain

Other cool tools...
Nutch REST API:
- HTTP GET, PUT, POST, ...
- JIRA issues: NUTCH-1931 (REST API), NUTCH-2132 (Pub/Sub)
Nutch Pub/Sub interface:
- JMS implementation
- Real-time visualization

Acknowledgements
Thanks to:
- Andrzej Białecki, Chris Mattmann, Doug Cutting, Julien Nioche, Mike Cafarella, Lewis John McGibbney and Sebastian Nagel for ideas and material from their previous presentations
- All Nutch contributors for their amazing work!
- Florian Hartl for the architecture diagram and blog post
- Xu Wu for the evaluations
- SlidesCarnival for the presentation template

Acknowledgements
A special thanks to:
- My advisor, Dr. Chris Mattmann, for his guidance
- The awesome team at NASA Jet Propulsion Laboratory
- The DARPA MEMEX Program

Thanks! Any questions?
You can find me at: @sujen / sujen@apache.org