Optimizing Apache Nutch For Domain Specific Crawling at Large Scale


Optimizing Apache Nutch For Domain Specific Crawling at Large Scale
Luis A. Lopez, Ruth Duerr, Siri Jodha Singh Khalsa
luis.lopez@nsidc.org, http://github.com/b-cube
IEEE Big Data 2015, Santa Clara, CA

Overview

BCube is a building block of NSF EarthCube. For the past 6 months we've been crawling the Internet trying to gather all possible data that's relevant to the geosciences. Our focus is to discover scientific datasets and web services that may contain geolocated data, mainly structured information (XML, JSON, CSV).

Examples: Greenland aerosol datasets; mass spectrum analysis in geothermal areas.

Understanding the Problem [Focused Crawling]

Big search space with very sparse data distribution:
- Billions of web pages
- Most content is not scientific data
- Scientific data is not well advertised

Hard problems:
- Content duplication
- Semantics
- Robots.txt
- Remote server performance
- Malformed metadata
- Bad web standards implementation
- Cost

Acceptable performance with limited resources:
- Scalable stack
- Handles TB of data
- Distributed processing
- Fault tolerant
- Uses commodity hardware

Solution: a good(-enough) scoring algorithm, plus Apache Nutch!

Previous Work and BCube

Previous work:
- Finding the scoring algorithm that performs 10% better
- Implementing an in-house (not open-sourced) crawler
- Focusing on a specific type of data
- Measuring performance on thousands of pages (sometimes just hundreds)

Our work:
- Understand where the bottlenecks are
- Use an open source project
- Improve fetching times
- Modify the crawler for focused crawls
- Use the cloud to lower operational costs
- Verify what happens at large scale

Scoring

PageRank-like scoring vs. focused crawl scoring (the slide's figure contrasts the two for sites such as NOAA and NASA).
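The distinction above can be sketched in a few lines: PageRank-like scoring divides a page's score among its outlinks regardless of topic, while a focused crawl weights that share by the page's topical relevance, so off-topic branches decay quickly. The function and the `decay` parameter below are illustrative, not the actual BCube scorer.

```python
# Sketch (not the BCube implementation): a focused crawl lets an outlink
# inherit credit in proportion to its parent page's topical relevance.
# With relevance fixed at 1.0 this degenerates to a PageRank/OPIC-like
# equal share of the parent's score.
def focused_outlink_scores(parent_score, relevance, outlinks, decay=0.5):
    """Give each outlink a relevance-weighted share of the parent score."""
    if not outlinks:
        return {}
    share = parent_score * relevance * decay / len(outlinks)
    return {url: share for url in outlinks}

scores = focused_outlink_scores(1.0, 0.8, ["http://noaa.gov/a", "http://nasa.gov/b"])
```

A page judged irrelevant (relevance near 0) passes almost no score along, which is what starves off-topic regions of the frontier.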

BCube Customizations

BCube Plugins

- parse-rawcontent: indexes the unparsed content of a document
- index-links: indexes inlinks and outlinks of a document
- parse-bayes: scores pages using an online naive Bayes classifier
- index-bcube-extras: indexes HTTP responses
- index-bcube-filter: discards documents using MIME types or substring matching
- index-xmlnamespaces: indexes all the namespaces used in XML documents
- parse-tika: fixed a critical bug that blocked us from parsing valid XML files
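To make the parse-bayes idea concrete, here is a minimal online naive Bayes text scorer. This is an illustrative Python sketch, not the plugin itself (the real plugin is a Java Nutch plugin, and the class and method names here are invented).

```python
import math
from collections import defaultdict

class OnlineNaiveBayes:
    """Two-class ("relevant" vs. "other") online naive Bayes sketch."""

    def __init__(self):
        self.counts = {"relevant": defaultdict(int), "other": defaultdict(int)}
        self.totals = {"relevant": 0, "other": 0}
        self.docs = {"relevant": 0, "other": 0}
        self.vocab = set()

    def learn(self, label, tokens):
        # Online update: counts are refreshed as the crawl sees new
        # labeled examples, with no separate retraining pass.
        self.docs[label] += 1
        for t in tokens:
            self.counts[label][t] += 1
            self.totals[label] += 1
            self.vocab.add(t)

    def score(self, tokens):
        # Log-odds of relevance with Laplace smoothing; > 0 means the
        # page looks relevant and its outlinks deserve fetch priority.
        v = len(self.vocab)
        s = math.log((self.docs["relevant"] + 1) / (self.docs["other"] + 1))
        for t in tokens:
            p_rel = (self.counts["relevant"][t] + 1) / (self.totals["relevant"] + v)
            p_oth = (self.counts["other"][t] + 1) / (self.totals["other"] + v)
            s += math.log(p_rel / p_oth)
        return s

nb = OnlineNaiveBayes()
nb.learn("relevant", ["aerosol", "dataset", "netcdf"])
nb.learn("other", ["shopping", "cart", "sale"])
```

In a crawl, a score like this can feed back into link prioritization: pages scoring above a threshold get their outlinks generated first.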

BCube Filtering

Filters in the pipeline: URL regex and suffix filters, Bayes scoring, and MIME-type filters. In the slide's diagram, URLs such as noaa.gov/data.htm and site.org/data.xml pass through to the fetcher and the Nutch DB (and on to Lucene), while URLs like b.edu/unknown.htm, amazon.com, and a.edu/big.iso are discarded.
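A toy version of that filter chain, with made-up patterns standing in for the actual BCube rules: URLs are rejected by suffix and regex before fetching, and fetched documents are then kept only for the structured MIME types the project cares about.

```python
import re

# Example patterns only -- not the actual BCube filter rules.
SUFFIX_BLACKLIST = (".iso", ".hdf", ".zip")
URL_BLACKLIST = [re.compile(r"amazon\.com"), re.compile(r"/unknown\.htm$")]
MIME_WHITELIST = {"application/xml", "text/xml", "application/json", "text/csv"}

def accept_url(url):
    """Pre-fetch filter: cheap suffix and regex checks on the URL alone."""
    if url.lower().endswith(SUFFIX_BLACKLIST):
        return False
    return not any(p.search(url) for p in URL_BLACKLIST)

def accept_document(mime_type):
    """Post-fetch filter: keep only structured-data MIME types."""
    return mime_type in MIME_WHITELIST
```

Rejecting URLs before the fetch is the cheap win; MIME filtering needs the HTTP response, so it runs later in the pipeline.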

Problems...

Performance Degradation

- Crawl-Delay
- Sparse data distribution
- Duplicated content
- Slow servers (the tar pits)
- Variable cluster performance in the cloud
- Idle CPU time

See https://wiki.apache.org/nutch/optimizingcrawls

Good Scoring, Let's Celebrate! ...not so soon: The Tar Pit

Repeatedly indexing relevant documents from the same sites prevented us from discovering new ones. Well-scored documents are still relevant and we should index them, but how often those sites are updated should be taken into account.
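One mitigation for the tar pit is to cap how many URLs per host enter each fetch cycle, so well-scored sites cannot crowd out undiscovered ones (this is the effect of Nutch's generate.max.count). A minimal sketch of the idea; the helper name is made up:

```python
from collections import Counter
from urllib.parse import urlparse

def cap_per_host(urls, max_per_host):
    """Keep at most max_per_host URLs per host, preserving input order
    (assume urls are already sorted by score, best first)."""
    seen = Counter()
    selected = []
    for url in urls:
        host = urlparse(url).netloc
        if seen[host] < max_per_host:
            seen[host] += 1
            selected.append(url)
    return selected
```

The cap trades a little short-term relevance for frontier diversity: the crawl keeps sampling new hosts instead of sinking into one rich site.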

Content Duplication at Large Scale

Documents fetched per domain*:

  fr.climate-data.org  212142
  bn.climate-data.org  209257
  de.climate-data.org  203279
  en.climate-data.org  197716

* The Science of Crawl: Deduplication of Web Content, http://bit.ly/1gg32hh
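The climate-data.org mirrors above serve near-identical pages from many language subdomains. A simple mitigation, sketched here (this is not the deduplication strategy from the cited article), is to hash the normalized page body and keep only the first URL seen per digest:

```python
import hashlib

def dedup(pages):
    """pages: iterable of (url, body) pairs.
    Returns the URLs whose normalized bodies were seen first."""
    seen = set()
    unique = []
    for url, body in pages:
        # Collapse whitespace and case so trivial variants hash the same.
        digest = hashlib.sha1(" ".join(body.split()).lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique
```

Exact hashing only catches byte-level mirrors; near-duplicate detection (e.g. shingling) is needed when mirrors differ by boilerplate, which is why duplication stays a hard problem at scale.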

Improving Performance

- Robots.txt Crawl-Delay: crawl at different speeds per site
- Large files (.iso, .hdf, etc.): filter out file extensions using Nutch's suffix-regex filter
- SSD vs. HDD: use SSD instances on AWS
- Fetching strategy: generate a limited number of links per host and distribute the fetch over as many nodes as possible
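The fetch strategy in the last bullet can be sketched as hashing each hostname to a fixed fetcher node: per-host politeness delays are then enforced in one place while the load spreads across the cluster. The node count and hash choice below are assumptions for illustration:

```python
import zlib
from urllib.parse import urlparse

def partition_by_host(urls, num_nodes):
    """Assign every URL to a fetcher node keyed by its hostname, so all
    requests to one host come from one node (simple politeness)."""
    parts = [[] for _ in range(num_nodes)]
    for url in urls:
        host = urlparse(url).netloc
        parts[zlib.crc32(host.encode()) % num_nodes].append(url)
    return parts
```

Combined with a per-host cap when generating segments, this keeps every node busy instead of leaving CPUs idle behind one slow server.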

Nutch + BCube

Property                    Default Value   BCube Value
crawldb.url.filters         true            false
db.update.max.inlinks       1000            100
db.injector.overwrite       false           true
generate.max.count          -1              1000
fetcher.server.delay        10              2
fetcher.threads.fetch       10              128
fetcher.threads.per.queue   1               2
fetcher.timelimit.mins      -1              45
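These overrides live in Nutch's conf/nutch-site.xml. A fragment showing two of the values from the table might look like this (a sketch of the file format, using the property names as listed on the slide):

```xml
<!-- conf/nutch-site.xml fragment: two of the BCube overrides above -->
<property>
  <name>generate.max.count</name>
  <value>1000</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>128</value>
</property>
```

Values set here override the defaults shipped in nutch-default.xml.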

Conclusions and Future Work

Conclusions:
- There are major issues in focused crawls that can only be reproduced at large scale
- Some issues cannot be addressed by improving the focused crawl alone
- We can implement mitigation techniques effectively to alleviate the problems under our control
- Apache Nutch can scale and be used for focused crawls

Future work:
- Optimize the scoring algorithm using the link graph and content context
- Develop a computationally efficient mechanism for dynamic relevance adjustment
- Automate cost-effective cluster deployments
- Use the latest Selenium plugins in Nutch for specific use cases, and more

References

- BCube at GitHub: https://github.com/b-cube
- Apache Nutch: https://nutch.apache.org/
- Common Crawl Project: https://commoncrawl.org/
- The Science of Crawl: Deduplication of Web Content: http://bit.ly/1gg32hh