From Web Page Storage to Living Web Archives Thomas Risse

Similar documents
From Web Page Storage to Living Web Archives

Exploiting the Social and Semantic Web for guided Web Archiving

Challenges and Approaches for Web Archive Creation and Usage

Information Retrieval Spring Web retrieval

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

Search Engines. Information Retrieval in Practice

CDL s Web Archiving System

MESH. Multimedia Semantic Syndication for Enhanced News Services. Project Overview

Information Retrieval May 15. Web retrieval

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Anatomy of a Semantic Virus

Large Crawls of the Web for Linguistic Purposes

What Do You Mean It Doesn t Make Sense? Redesigning Finding Aids from the Users Perspective

August 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience

Ranking of ads. Sponsored Search

Crossing the Archival Borders

A B2B Search Engine. Abstract. Motivation. Challenges. Technical Report

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

SciVerse Scopus. Date: 21 Sept, Coen van der Krogt Product Sales Manager Presented by Andrea Kmety

On the Change in Archivability of Websites Over Time

DIGIT.B4 Big Data PoC

INLS : Introduction to Information Retrieval System Design and Implementation. Fall 2008.

Web Archiving Workshop

CS6200 Information Retreival. Crawling. June 10, 2015

CS 425 / ECE 428 Distributed Systems Fall 2015

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.

Robin Wilson Director. Digital Identifiers Metadata Services

Workshop W14 - Audio Gets Smart: Semantic Audio Analysis & Metadata Standards

Exploring archives with probabilistic models: Topic modelling for the European Commission Archives

The Black Magic of Flash SEO

The 6 Most Common Website SEO Mistakes

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Workshop B: Archiving the Web

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

Web Crawling. Advanced methods of Information Retrieval. Gerhard Gossen Gerhard Gossen Web Crawling / 57

Archiving dynamic websites: a considerations framework and a tool comparison

Exam IST 441 Spring 2011

The Role of Language Evolution in Digital Archives

CS47300: Web Information Search and Management

How Co-Occurrence can Complement Semantics?

Conclusions. Chapter Summary of our contributions

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

TEI, METS and ALTO, why we need all of them. Günter Mühlberger University of Innsbruck Digitisation and Digital Preservation

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

16. HTML5, HTML Graphics, & HTML Media 웹프로그래밍 2016 년 1 학기 충남대학교컴퓨터공학과

Workshop on Web Archiving

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Digital Preservation in the UK. David Thomas

Preservation of Web Materials

New Issues in Near-duplicate Detection

Néonaute: mining web archives for linguistic analysis

Search Framework for a Large Digital Records Archive DLF SPRING 2007 April 23-25, 25, 2007 Dyung Le & Quyen Nguyen ERA Systems Engineering National Ar

CS47300: Web Information Search and Management

SEO. Definitions/Acronyms. Definitions/Acronyms

An Approach To Web Content Mining

Search Engines. Charles Severance

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

TECHNICAL TOOL SPECIFICATION

Long-term preservation for INSPIRE: a metadata framework and geo-portal implementation

DATA MINING II - 1DL460. Spring 2014"

EUDAT. Towards a pan-european Collaborative Data Infrastructure

How Does a Search Engine Work? Part 1

Web Science Your Business, too!

How to Crawl the Web. Hector Garcia-Molina Stanford University. Joint work with Junghoo Cho

THE HISTORY & EVOLUTION OF SEARCH

BUbiNG. Massive Crawling for the Masses. Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna

Knowledge Discovery and Data Mining 1 (KU)

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Structured Data on the Web

Overview of Web Mining Techniques and its Application towards Web

Part I: Data Mining Foundations

Search Engine Optimization (SEO) & Your Online Success

Search Engine Technology. Mansooreh Jalalyazdi

Searching the Deep Web

A Novel Architecture of Ontology based Semantic Search Engine

The Functional Extension Parser (FEP) A Document Understanding Platform

Focused Crawling with

Weaving the Web(VTT) of Data

Accessibility and Moodle: Jailbreak your LMS

Focused Crawling with

Final dissemination and exploitation report

Archiving the Web: What can Institutions learn from National and International Web Archiving Initiatives

A network is a group of two or more computers that are connected to share resources and information.

Data Mining Issues. Danushka Bollegala

Design and implementation of an incremental crawler for large scale web. archives

Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population

National Centre for Text Mining NaCTeM. e-science and data mining workshop

Tennessee. Trade & Industrial Course Web Page Design II - Site Designer Standards. A Guide to Web Development Using Adobe Dreamweaver CS3 2009

Searching the Web What is this Page Known for? Luis De Alba

This document is for informational purposes only. PowerMapper Software makes no warranties, express or implied in this document.

Enterprise Manager: Scalable Oracle Management

Tambako the Bixo - a webcrawler toolkit Ken Krugler, Stefan Groschupf

EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES

DATA COLLECTION. Slides by WESLEY WILLETT 13 FEB 2014

Temporal Analysis of Inter-Community User Flows in Online Knowledge-Sharing Networks

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

You got a website. Now what?

verapdf after PREFORMA

Exploiting Semantics Where We Find Them

Transcription:

From Web Page Storage to Living Web Archives Thomas Risse JISC, the DPC and the UK Web Archiving Consortium Workshop British Library, London, 21.7.2009 1

Agenda Web Crawlingtoday& Open Issues LiWA Living Web Archives Project Selected working areas of LiWA Dynamic Pages Handling ofspam Temporal coherence of crawls Archive Interpretability Conclusions and Expected Project Results 2

Current Web Archiving at a Glance Quality Review Access Selection Preparation Discovery Filtering Deep Web Archiving Noise Filtering Dynamic Pages JavaScript, Flash Archivist Link Extractio on Temporal Coherence of Crawls Capture Fetching User Long-term Interpretability Index Storage Multimedia Content 3

LiWA Living Web Archives (EU-IST 216267) Next generation Web Archiving technology for: High Quality Web Archives Long-term Archive usability From Web page storage to Living Web Archives Semantic & Terminology Evolution Temporal Coherence & Consistency Noise and Spam Filtering Improved Capturing Started Feb. 2008 (3 Years) Existing Web Archiving Technology 4

Some LiWAObjectives in more Detail Quality Review Access Selection Preparation Discovery Filtering Deep Web Archiving Noise Filtering Dynamic Pages JavaScript, Flash Archivist Link Extractio on Temporal Coherence of Crawls Capture Fetching User Long-term Interpretability Index Storage Multimedia Content 5

Some LiWAObjectives in more Detail Quality Review Access Selection Preparation Discovery Filtering Deep Web Archiving Noise Filtering Dynamic Pages JavaScript, Flash Archivist Link Extractio on Temporal Coherence of Crawls Capture Fetching User Long-term Interpretability Index Storage Multimedia Content 6

Link Extraction of Dynamic Pages Quality Review Access Selection Preparation Discovery Filtering Deep Web Archiving Noise Filtering Dynamic Pages JavaScript, Flash Archivist Link Extractio on Temporal Coherence of Crawls Capture Fetching User Long-term Interpretability Index Storage Multimedia Content 7

Follow the links... 8 8

Hmm 9 9

Link Extraction of Dynamic Pages The Problem: links don t exist as raw text lying around user interaction and code assemble them Current Approach: guessing by assembling any fragments that look like links into URLs and trying them out Can be very noisy -lots of wrong URL s 10 1 10

Link Extraction of Dynamic Pages Approach pressing the links and see what comes out Execute code in a Javascriptengine Extract links from resulting DOM tree Implementation based on WebKit 11 1 11

Noise Filtering Quality Review Access Selection Preparation Discovery Filtering Deep Web Archiving Noise Filtering Dynamic Pages JavaScript, Flash Archivist Link Extractio on Temporal Coherence of Crawls Capture Fetching User Long-term Interpretability Index Storage Multimedia Content 12

Web spam: for (or against) search engines 13

Web Spam: indexing vs. archiving Primary target: search engines to manipulate ranking As side effect, we also archive spam But very costly if not fought against: traps crawler 10+% sites near 20% HTML pages Reputable 70.0% Unknown 0.4% Alias 0.3% Empty 0.4% Non-existent 7.9% 2004.de crawl courtesy: T. Suel Ad 3.7% Spam 16.5% 14

What can we do? Ideal solution Automatic identification of spam pages Requires Right selection of features to identify spam Development of new features e.g. creation and disappearance of new sites, pages Good training sets Problem: Spam is constantly changing Features need to be adapted Updated training sets are necessary Training set need to be prepared manually 15

Spam Assessment Interface 16

Temporal Coherence of Crawls Quality Review Access Selection Preparation Discovery Filtering Deep Web Archiving Noise Filtering Dynamic Pages JavaScript, Flash Archivist Link Extractio on Temporal Coherence of Crawls Capture Fetching User Long-term Interpretability Index Storage Multimedia Content 17

Temporal Coherence Capturing Web sites as authentic as possible Make a site snapshot at once is not possible Crawlers need to be polite to web sites Slow crawling, maybe with delays Pages are changing during site crawl When Do we have a coherent crawl? 18

Coherence by Example t coherence = [t 2, t 3 ) p 1 p 2 p 3 p 4 t 1 = t s t 2 t 3 t 4 = t e 19/219

Coherence by Example t coherence p 1 p 2 p 3 p 4 t 1 = t s t 2 t 3 t 4 = t e 20/20

Coherence Analysis Technology Creation of automatically generated reports Visualization of coherence defects 21

Long-term Interpretability Quality Review Access Selection Preparation Discovery Filtering Deep Web Archiving Noise Filtering Dynamic Pages JavaScript, Flash Archivist Link Extractio on Temporal Coherence of Crawls Capture Fetching User Long-term Interpretability Index Storage Multimedia Content 22

Motivation Archives store content over long time ranges Content is created latest in the year of archiving Content typically creators use the language of that time 1800 1900 1920 1950 2009 St. Piter Burh 1703 St. Petersburg Petrograd Leningrad St. Petersburg 1703-1914 1914-1924 1924-1991 1991- present 23

Overview Process Find the meaning of a term in one collection at one period in time Detect evolution of terms Step 1: Word Sense Discrimination Step 2: Tracking Evolution 24

Data Sets for Evaluation (1/2) Data Set Requirements Large corpus Fully digitized Long time range Increase probability of terminology evolution Not too domain specific (like the Mesh corpus) Homogeneous language Time annotated Using Web Archives Large digital corpus At most 10 Years old Inhomogeneous with all the noise of the web Not suitable for initial evaluations 25

Data Sets for Evaluation (2/2) Newspaper Archives Fully digitized corpora Controlled language Clear time annotations Süddeutsche Zeitung (ger.) Spans from year 1994-2006 ~ 1.3 Million articles London Times Archive (engl.) Spans from year 1785-1985 ~ 20 Million articles Strategy Year 2: Initial evaluations on well known corpora Year 3: Apply technology to web archives,.gov.uk crawls provided by EA 26

Term. Evol- Conclusions and Future Work Terminology Extraction Find methods that are time independent for Extraction Stop word removal Lemmatization Correcting OCR errors Word Sense Discrimination Other types of clustering Metrics for evaluation Detecting Evolution Methods for comparing clusters and detecting evolution Methods for evaluation Nina Tahmasebi 27

Conclusions and Expected Project Results Improving Web Archiving Technology - Rich Media Capturing - Spam Processing - Archive Coherence - More general scope: Improving Archive Interpretability Selected results will be integrated in - Heritrix Crawler (our test-bed) - Hanzo Archives Crawler Evaluation in two test cases: - Streaming Media WebArchive by Sound & Vision - WebArchivistsWorkbench by European Archive and Nat. Lib. Czech Republic 28

Thank you! More information on http://www.liwa-project.eu/ 29