Large Crawls of the Web for Linguistic Purposes
SSLMIT, University of Bologna
Birmingham, July 2005

Outline
1. Introduction
2. Selecting seed URLs
3. Crawling: basics; Heritrix; my ongoing crawl
4. Filtering and cleaning
5. Annotation; indexing, etc.; summing up and open issues

The WaCky approach
http://wacky.sslmit.unibo.it
- Current target: 1-billion-token English, German and Italian Web corpora by 2006.
- Use existing open tools; make the tools we develop publicly available.
- Please join us (for other languages as well!)

The basic steps
- Select seed URLs.
- Crawl.
- Filter and clean.
- Linguistic annotation.
- Indexing, etc.

2. Selecting seed URLs

- Send queries for random word combinations to the Google search engine (see the sketch below).
- Start the crawl from the URLs discovered in this way.
- Which random words?
  - Middle-frequency words from a general/newspaper corpus ("public").
  - A basic vocabulary list ("private").
- How random are the URLs collected in this way? Ongoing work with Massimiliano Ciaramita (ISTC, Rome).
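A minimal sketch of this seed-selection step. The search call is a placeholder (the talk used Google; no specific API is shown here), and the query counts are illustrative; keeping one URL per domain mirrors the seed list described later in the talk.

```python
import random
from urllib.parse import urlparse

def search_urls(query, max_hits=10):
    """Placeholder for a search-engine API call; the talk used Google.
    Plug in whatever search backend is available."""
    raise NotImplementedError

def sample_seed_urls(wordlist, n_queries=1000, words_per_query=2):
    """Query random word combinations; keep at most one URL per domain
    (the crawl described later was seeded with 8631 URLs, all from
    different domains)."""
    seen_domains, seeds = set(), []
    for _ in range(n_queries):
        query = " ".join(random.sample(wordlist, words_per_query))
        for url in search_urls(query):
            domain = urlparse(url).netloc
            if domain not in seen_domains:
                seen_domains.add(domain)
                seeds.append(url)
    return seeds

# wordlist would be, e.g., middle-frequency words from a newspaper corpus
```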

3. Crawling

Fetch pages, extract links. Follow links, fetch pages.

Important in a good crawler
- Honoring robots.txt, politeness (see the toy sketch after this list)
- Efficiency, multi-threading, a robust frontier
- Avoiding spider traps
- Control over crawl scope
- Progress monitoring
- Intelligent management of downloaded text
- Works out of the box, with reasonable defaults
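A toy breadth-first crawler that makes the basic loop and the first two desiderata (robots.txt honoring, per-host politeness) concrete. All names and the 1-second delay are illustrative; a real crawler like Heritrix adds scoping, trap avoidance, content-type checks and a persistent frontier on top of this.

```python
import time
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed(url, robots_cache):
    """Honor robots.txt, caching one parser per host."""
    host = urlparse(url).netloc
    if host not in robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("http://%s/robots.txt" % host)
        try:
            rp.read()
        except OSError:
            pass
        robots_cache[host] = rp
    return robots_cache[host].can_fetch("*", url)

def crawl(seeds, max_pages=100, delay=1.0):
    """Breadth-first crawl: fetch pages, extract links, follow links."""
    frontier, seen, robots, last = deque(seeds), set(seeds), {}, {}
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        if not allowed(url, robots):
            continue
        host = urlparse(url).netloc
        wait = delay - (time.time() - last.get(host, 0.0))
        if wait > 0:          # per-host politeness delay
            time.sleep(wait)
        last[host] = time.time()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        fetched += 1
        yield url, html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

# usage: for url, page in crawl(["http://example.org/"]): print(url, len(page))
```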

Heritrix
http://crawler.archive.org/
- Free/open crawler of the Internet Archive
- Very active, supportive community...
- ...that includes linguists and machine-learning experts

The Heritrix WUI [screenshot of the web user interface]

The output of Heritrix
- Documents are distributed across gzipped arc files no larger than 100 MB each.
- Information about retrieved docs (fingerprints, size, path) is stored in arc file headers and in log files.
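A sketch of how such arc files can be read, assuming the classic ARC v1 record layout (a space-separated header line `url ip-address archive-date content-type length` followed by exactly `length` bytes of body), which to my understanding is what Heritrix wrote at the time; the filename in the usage comment is hypothetical.

```python
import gzip

def arc_records(path):
    """Iterate over (header_fields, body) pairs in a gzipped arc file.

    Assumes ARC v1 records: a space-separated header line
    'url ip-address archive-date content-type length' followed by
    'length' bytes of body; records are separated by blank lines, and
    the first record (filedesc://...) describes the file itself.
    """
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break            # end of file
            if not line.strip():
                continue         # blank separator between records
            fields = line.decode("utf-8", "replace").split()
            body = f.read(int(fields[-1]))
            yield fields, body

# for fields, body in arc_records("crawl-00000.arc.gz"):
#     print(fields[0], len(body))
```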

My German crawl
- On a server running RH Fedora Core 3 with 4 GB RAM, dual Xeon 4.3 GHz CPUs, and about 1.1 TB of hard disk space.
- Seeded from random Google queries for SDZ and basic-vocabulary-list terms: 8631 URLs, all from different domains.
- SURT scope: http://(at, and http://(de, plus Tom Emerson's regexp to focus on HTML.
- For most settings, Heritrix defaults.
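SURT ("Sort-friendly URI Reordering Transform") prefixes work by reversing the host name component-wise, so a single prefix covers a whole national domain. A simplified sketch of the transform (the real one also normalizes case, ports, userinfo, etc.):

```python
from urllib.parse import urlparse

def surt(url):
    """Simplified SURT: reverse the host component-by-component, so
    URLs group by domain and a short prefix matches a whole TLD."""
    parts = urlparse(url)
    host = ",".join(reversed(parts.hostname.split(".")))
    return "%s://(%s,)%s" % (parts.scheme, host, parts.path)

print(surt("http://www.example.de/index.html"))
# -> http://(de,example,www,)/index.html
# so the scope prefix "http://(de," matches every .de host
```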

Current status of the crawl
- In about a week: retrieved about 265 GB, about 54 GB of arc files.
- In earlier experiments, 7 GB of arc files yielded about 250M words after cleaning.
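(Extrapolating, on the assumption that the earlier ratio carries over: 54 GB of arc files would correspond to roughly (54 / 7) × 250M ≈ 1.9 billion words after cleaning, comfortably past the 1-billion-token target.)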

4. Filtering and cleaning

- Various forms of filtering and boilerplate stripping
- Near-duplicate identification

Filtering as you crawl...
- Wouldn't it be nice to filter as you crawl? Yes, but:
  - you don't know what you've got until you download it;
  - some pages are bad for the corpus, but good for crawling.
- Promising: the brand-new Heritrix/Rainbow interface.

Filters and boilerplate removal
- Ignore docs smaller than 5 KB or larger than 200 KB.
- Porn stop words (not out of prudery, but because pornographers do funny things with language to fool search engines); see the sketch after this list.
- Boilerplate removal: see next talk.
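A sketch of these two filters. Only the 5 KB / 200 KB bounds come from the talk; the stopword entries and the hit threshold are illustrative placeholders.

```python
# The 5 KB / 200 KB bounds come from the talk; the stopword entries and
# the max_hits threshold below are hypothetical placeholders.
PORN_STOPWORDS = {"stopword1", "stopword2"}   # illustrative entries only

def keep_document(raw_bytes, text, stopwords=PORN_STOPWORDS, max_hits=3):
    """Apply the size filter and the porn-stopword filter."""
    if not 5_000 <= len(raw_bytes) <= 200_000:
        return False
    tokens = text.lower().split()
    return sum(1 for t in tokens if t in stopwords) < max_hits
```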

Language identification
- After boilerplate removal.
- Among the options:
  - Van Noord's TextCat tool: not robust (text is not recognized as German if nouns are not uppercased). Efficiency problems?
  - A small list of function words: in my experiments, fast and effective. A minimum proportion of function words is also good for detecting connected prose (Zipf to our rescue); see the sketch below.
- Non-latin1 languages: recognize both the language and the encoding.
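A sketch of the function-word heuristic. The ten-item word list is an illustration (a real list would be longer) and the threshold is a guess to be tuned on held-out data.

```python
# Ten-item illustration; a real list would be longer. The threshold is
# a placeholder to be tuned on real data.
GERMAN_FUNCTION_WORDS = {
    "der", "die", "das", "und", "in", "von", "zu", "mit", "ist", "nicht",
}

def is_german_connected_prose(text, min_proportion=0.10):
    """Zipf's law guarantees that a handful of top function words covers
    a sizable share of the tokens of any running German text, so a low
    function-word proportion signals either another language or
    non-connected prose (lists, tables, keyword spam)."""
    tokens = text.lower().split()
    if not tokens:
        return False
    hits = sum(1 for t in tokens if t in GERMAN_FUNCTION_WORDS)
    return hits / len(tokens) >= min_proportion
```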

Near-duplicate detection
- A simplified version of the shingling algorithm of Broder, Glassman, Manasse and Zweig (1997), "Syntactic Clustering of the Web", Sixth International World-Wide Web Conference.
- A freely available implementation in Perl and MySQL, written with Eros Zanchetta (SSLMIT).

The shingling algorithm (see the sketch below)
- For each page, randomly sample N n-grams (e.g., 25 pentagrams).
- Look for pages that share at least X of the randomly sampled n-grams (e.g., 5).
- (Important to do boilerplate removal first, or most of your n-grams will be things like "buy click here".)
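A sketch of the algorithm. One detail the slide leaves implicit: for the samples of two near-identical pages to overlap, the "random" sample must be deterministic given the content, so this sketch keeps the k shingles with the smallest hash values (min-wise sampling, in the spirit of Broder et al.); the talk's Perl/MySQL implementation may differ in detail.

```python
import hashlib
from itertools import combinations

def sample_shingles(text, n=5, k=25):
    """Sample k word n-grams per page, deterministically: keep the k
    shingles whose hashes are smallest, so two near-identical pages end
    up with near-identical samples (min-wise sampling)."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    ranked = sorted(shingles, key=lambda s: hashlib.md5(s.encode()).digest())
    return set(ranked[:k])

def near_duplicate_pairs(pages, n=5, k=25, x=5):
    """Return page-id pairs whose samples share at least x shingles."""
    samples = {pid: sample_shingles(text, n, k) for pid, text in pages.items()}
    return [(a, b) for a, b in combinations(samples, 2)
            if len(samples[a] & samples[b]) >= x]
```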

What are near-duplicates, exactly?
- Once boilerplate and small docs are removed, there are not that many near-duplicates.
- Should we really be throwing them away?

5. Annotation; indexing, etc.; summing up and open issues

Annotation
- With standard tools... however, robustness is needed.
- The following all-lowercase German (roughly: "and become aware. an invisible bond connects") wreaks havoc on the TreeTagger tokenizer and tagger: und bewusst werden. ein unsichtbares band verbindet

Indexing, retrieval, interfaces...
- CWB, SketchEngine, Xaira? Lucene? MySQL?

Summing up and open issues
- Building a large corpus by crawling is quite straightforward...
- ...but the devil is in the (terabytes of) details.
- Some (of many) open issues: What language are we sampling from? How large is large enough?