Challenges and Approaches for Web Archive Creation and Usage

Size: px

Start display at page:

Download "Challenges and Approaches for Web Archive Creation and Usage"

Marjorie Hawkins
6 years ago
Views:

1 Challenges and Approaches for Web Archive Creation and Usage Thomas Risse L3S Research Center Lleida, 29. May 2015 Thomas Risse 25/11/15 1

2 The Web is a core part of our daily life World Wide Web = 50 Bill. Pages + 1 Bill. Users The Web and the Social Web play a crucial role Information and services for all domains Allows contributions by every citizen Giving room for the articulation for a multitude of stakeholders Reflects all types of events, opinions, developments within society, science, politics, environment, business, Thomas Risse 25/11/15 2

3 Are we loosing the past of the web? Gun running from Sudan Attack on Copts Spam Thomas Risse 25/11/15 3

The Web is Changing and Forgetting The Web is a quickly changing, ever growing information space [1] It s growing by >8% per week After 1 year only 40% of the pages are still accessible while 60% of

4 The Web is Changing and Forgetting The Web is a quickly changing, ever growing information space [1] It s growing by >8% per week After 1 year only 40% of the pages are still accessible while 60% of the pages are new A Web Archive as a Collective Memory is a cultural necessity for the future But Archive and Store Everything is not a practical approach [1] A. Ntoulas, J. Cho, and C. Olston. What's new on the web?: the evolution of the web from a search engine perspective. In Proceedings of the 13th international conference on World Wide Web (WWW '04) Thomas Risse 25/11/15 4

Where are we now? A number of tools are available (e.g.

scopes [2] Still a lot manual effort is necessary but only some 100 people are involved world wide [2]

More support is necessary Crawl by Events, Topics and Entities Using the Wisdom of the Crowds for selection and

5 Where are we now? A number of tools are available (e.g. Heritrix) Crawl descriptions are currently lists of URLs >42 world wide Web Archives initiatives with different scopes [2] Still a lot manual effort is necessary but only some 100 people are involved world wide [2] Increasing interest from Digital Humanities, Journalism, etc. More support is necessary Crawl by Events, Topics and Entities Using the Wisdom of the Crowds for selection and appraisal [2] D. Gomes, J. Miranda and M. Costa. A survey on web archiving initiatives. In Proceedings of the 1st International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011) Thomas Risse 25/11/15 5

6 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 6

7 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 7

Temporal Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving

8 Index Storage Link Extraction Fetching Preparation Discovery Filtering Web Archiving and some challenges Content Appraisal Quality Review Selection Content Selection Contributions the European Projects Archivist Temporal Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving Dynamic Pages JavaScript, Flash User Data Volume Temporal Aspects Multimedia Content Thomas Risse 25/11/15 8

Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving

9 Index Storage Link Extraction Fetching Preparation Discovery Filtering Web Archiving and some challenges Content Appraisal Quality Review Selection Content Selection Archivist Temporal Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving Dynamic Pages JavaScript, Flash User Data Volume Temporal Aspects Multimedia Content Thomas Risse 25/11/15 9

10 Dynamic Pages Thomas Risse 25/11/15 10

Follow the links... <a href="http://www.gnu.

11 Follow the links... <a href=" software</a> Thomas Risse 25/11/15 11

$But what about these pages? function() { var fcb_referrers = [ "fcbarcelona.$ $com" ]; var current_language = "en"; var routes = {"ca":"http://www.fcbarcelona.$ cat"}; var navigator_language = navigator.language navigator.

cat"}; var navigator_language = navigator.language navigator.

$:\/\/[^\/]*(" + fcb_referrers.join(" ").replace(/\./g, '\\.$

12 But what about these pages? function() { var fcb_referrers = [ "fcbarcelona.com", "fcbarcelona.cat", "fcbarcelona.es", "fcbarab.com", "fcbarcelona.fr", "fcbarcelona.jp", "fcbarcelona.cn", "fcbarcelona.qq.com" ]; var current_language = "en"; var routes = {"ca":" var navigator_language = navigator.language navigator.userlanguage; var language = (navigator_language "").split("-")[0]; var s = "^https?:\/\/[^\/]*(" + fcb_referrers.join(" ").replace(/\./g, '\\.') + ")"; var fcb_referrer = document.referrer.match(new RegExp(s)); var redirect_url = routes[language]; if (language && language!= current_language &&!fcb_referrer && redirect_url) { window.location = redirect_url; } })(); Thomas Risse 25/11/15 12

13 Endless Pages Automatically reload content while browsing the page Requires execution of JavaScript When can the fetching be stopped? Examples: Facebook, Twitter, Tumblr, etc. Thomas Risse 25/11/15 13

14 Embedded Social Media Embedded Social Media Content JavaScript embeds Dynamically integrated content on the browser side Often standard scripts Possibility to extract parameters Content fetching requires API Crawler Thomas Risse 25/11/15 14

Handling of Dynamic Pages The Problem Embedded

is readable Dynamic creation of links and content

any fragments that look like links into URLs Can be

parameters from program code Applicable for known

Facebook, Twitter Execution of JavaScript Simulate

comes out Execute code in a Javascript engi

WebKit, Firefox Browser) Extract links from

15 Handling of Dynamic Pages The Problem Embedded Programs Adobe Flash impossible to handle JavaScript is readable Dynamic creation of links and content Approaches Guessing of links guessing by assembling any fragments that look like links into URLs Can be very noisy - lots of wrong URL s Extraction of parameters from program code Applicable for known code libraries e.g. Facebook, Twitter Execution of JavaScript Simulate user activities- pressing the links and see what comes out Execute code in a Javascript engine (e.g. WebKit, Firefox Browser) Extract links from resulting DOM tree Status Some implementations exist (e.g. LiWA Service, Browser Monkey) API Crawler exist No integration into standard Web Crawler Thomas Risse 25/11/15 15

16 Basic Crawl Strategies Thomas Risse 25/11/15 16

17 Crawl Strategy thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 17

18 Crawl Strategy: Depth-first thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 18

19 Crawl Strategy: Breadth-first thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 19

20 Termination of a Crawl When is the harvesting of Web Pages complete? Problem: The total amount of pages is unknown Typical Approach Crawler Queue is empty (only for very focused crawls) Termination after number of pages/amount of data Termination after time Most often incomplete crawls depending on the strategy Thomas Risse 25/11/15 20

21 Possible Consequences after stopping the Crawl Depth-first thomas.htm Breadth-first thomas.htm about.htm about.htm joe.htm joe.htm journalists.htm journalists.htm ukraine0105.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_crisis.htm ukraine0205.htm sports.htm ukraine_sports0105.htm sports.htm ukraine_sports0105.htm spain_sports0205.htm Thomas Risse spain_sports0205.htm 25/11/15 21

example: PageRank - Mainly useful to optimize regular crawls - Content based selection - Topics, Events, Entities - Requires semantic crawl specification

22 thomas.htm Alternative Crawl Strategies about.htm journalists.htm joe.htm - Selection by Popularity - Ranking of the waiting queue according to the popularity of the content - Requires knowledge about the Web Graph - For example: PageRank - Mainly useful to optimize regular crawls - Content based selection - Topics, Events, Entities - Requires semantic crawl specification - The selection of the right strategy depends on the crawl intention ukraine_crisis.htm ukraine0105.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Will be discussed afterwards ukraine0205.htm Thomas Risse 25/11/15 22

23 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 23

24 Growing Scientific Interest in Web Archive Content (1/2) Historians - Official Publications (e.g. Government) - Journalistic Resources - Important topics and events with a high media coverage - Multi-cultural or controversial topics - Optimal: continues observation of topics in the Web Social Sciences - Observations of topics and events on major sites are good starting points - Identified Topic - Official publications, journalistic and social media sources - Changes on the topic should be identified - Metadata / Context (e.g. Author, Organizations and their interests, gender, location) - Demographic information about social sites - Provenance: Transparent and detailed documentation of content selection Thomas Risse 25/11/15 24

Growing Scientific Interest in Web Archive Content (2/2) Law - Research is based on official publications and protocols of parliaments or comments released by publishers - Social media (especially

25 Growing Scientific Interest in Web Archive Content (2/2) Law - Research is based on official publications and protocols of parliaments or comments released by publishers - Social media (especially blogs) are increasingly used - Only used as background information - Reason: missing citability and authenticity of resources - Genesis of laws - Used to understand original intention of laws - A democratic system requires a complete documentation of the law genesis. - Currently different degrees of documentation - Official publications: parliament and committee meetings - Public discourse Thomas Risse 25/11/15 25

26 Derived Requirements Topical Dimension - Crawl intention are mainly focused around events and rarely around entities - What is the intention of the researcher? - Easy monitoring by the researcher and possibility to correct Flexible Crawling Strategies - Shallow observation crawls - Focused crawls with prioritization (e.g. PageRank and/or semantics) Social Web Crawling - General interest with different media focus - Integrated with Web crawler to capture the full context Authenticity - See a web page as the user saw the page (e.g. including ads and tweets at that time point) Context and Provenance - Demographics of sites - Documentation of crawl specification and history Thomas Risse 25/11/15 26

27 Topical / Event Crawling thomas.htm about.htm journalists.htm joe.htm Crawl Specification - Terms - Ukraine - Crisis - - Seed List ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 27

28 ARCOMEM Crawling Phases Entities Obama, Romney, Biden, Ryan, Republicans, Democrats, Keywords US Election, CommitToMitt, Teaparty, Budget deficit, Social Media Seedlist Seedlist Offline Processing Online Processing Crawling ARCOMEM Storage SARA for Broadcaster, Parliaments Archive Internet Crawling Cross Crawl Processing Appraisal Selection Thomas Risse 25/11/15 28

29 Architecture Overview Application Cross Crawl Analysis SARA Broadcaster Named Entity Evol. Recog. Twitter Dynamics Parliament Image/Video Analysis Social Web Analysis Offline Processing WARC Files + SOLR Index WARC Export Crawler Cockpit ARCOMEM Storage (HBase, H2RDF) GATE Offline Analysis Consolidation Enrichment GATE Online Analysis Extracted SocialWeb Information Social Web Analysis Online Processing Queue Management Resource Fetching Application-Aware Helper URLs Relevance Analysis & Priorization Resource Selection & Prioritization Crawler Intelligent Crawl Definition Thomas Risse 25/11/15 29

30 Crawler Cockpit Thomas Risse 25/11/15 30

31 Relevance Distribution Time Thomas Risse 25/11/15 31

32 New Strategies lead to Example: German Elections Crawl 2013 Completely user defined crawl (German Broadcasters) Many focused terms and entities in German Many full names e.g. Angela Merkel Few less focused keywords in English 1 st Phase Many English pages with no relation to German Elections 2 nd Phase Refinement of Crawl Specification Focused English terms Last names instead of full names Finding a good cut off point of archive threshold Smaller result set with higher focus Thomas Risse 25/11/15 32

33 new Experiences A good specification of the crawl intention becomes critical Crawl specifications are rather complex compared to seed lists More experiences, observations and analysis are necessary from Users and Developers Room for more sophisticated guidance in the 1 st phase Finding the right cut-off point Depends on the user requirements Highly focused Smaller Archives but higher risk of loosing interesting information Less focused Lower risk of missing information but larger archives Translation into scoring of individual crawls Thomas Risse 25/11/15 33

34 and new Limitiations Social Media and Web Crawling are separate systems Process 1. Crawling of Social Media content 2. Extraction of Links 3. Crawling of Web Pages Static integration of Social Media Uni-directional Path: Social Media Web Content Missing Path: Web Content Social Media In Addition Complex system Required many changes in the Heritrix queue handling Thomas Risse 25/11/15 34

35 Integrated crawling with the L3S icrawl System Crawl Monitor Scheduler Semantic Crawl Description Initial Seedlist Crawl Specification Crawler Web Crawler API Crawler Specification Refinement Crawl Analysis & Enrichment Archive Creation & Cataloguing Web Web Archive Archive Learning the Crawl Specification Provenance Crawl Preparation Crawl Execution Crawl Finalization Apache Nutch based Web Archive crawler (under development) Learning the intention of the crawl Integration of Web and Social Media Crawling Content based monitoring of the crawl process Thomas Risse 25/11/15 35

36 icrawl Wizard Thomas Risse 25/11/15 36

Example for Integrated Crawling Twitter #Ukraine Feed (Medium Page

37 Example for Integrated Crawling Twitter #Ukraine Feed (Medium Page Relevance) (Low Page Relevance) Crawler Queue ID Batch URL Priority UK1 1 UK UK3 x UK4 y (High Page Relevance) Web Link Extracted URL Thomas Risse 25/11/15 37

Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The

38 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 38

39 WayBackMachine Basic tool to access Web Archives Also used in combination with other tools Only URL based access Provides Capture overviews Details about crawl dates Thomas Risse 25/11/15 40

40 Searching the Web Archive Thomas Risse 25/11/15 41

41 Other Ways to Explore Web Archives Thomas Risse 25/11/15 42

Web Archive Access Different user groups with different needs Not the typical Web Search Engine user Explorative Search Web Archive Search Web Search Time Dimension Many versions

42 Web Archive Access Different user groups with different needs Not the typical Web Search Engine user Explorative Search Web Archive Search Web Search Time Dimension Many versions of the same page Comparing versions Large scale analysis across time Information extraction (Entities, Events, Topics, Opinions) Interlinking Visualizations Thomas Risse 25/11/15 43

43 The ARCOMEM Approach to Web Archive Access Thomas Risse 25/11/15 44

44 Number of Documents Historical Search on News Archives* Consider a journalist interested in the history of... Rudolph Giuliani Mayoral Campaign Mayoral Campaign Senate, Cancer, Allegations Mayoral Campaign 9/11 Post politics endeavours Mayor News articles encode history as it happens. Aspects are diverse across time Time windows can be diverse in aspects. *Jaspreet Singh, Avishek Anand; Historical Search on News Archives; L3S Report 2015 Thomas Risse 25/11/15 45

45 Large Scale Data Analytics Example Named Entity Evolution Recognition Thomas Risse 25/11/15 46

46 Language changes over time! Our language is DYNAMIC and changes with our culture, politics, technology, social media, etc. Different spellings over time New words are introduced Words change their meanings In long-term digital archives Problems! St. Petersburg Petrograd Leningrad St. Petersburg St. Petersburg today t Thomas Risse 25/11/15 47

47 Problem: Finding Documents A scholar writing a thesis about Pope Benedikt : I want to know more about Pope Benedikt?? Thomas Risse 25/11/15 48

Named Entity Evolution Joseph Ratzinger Pope Benedict Joseph Aloisius Ratzinger Cardinal

Named Entities (NE): people, places, companies.

Change occurs over short periods of time Small or no concept shift Announced to the public

48 Named Entity Evolution Joseph Ratzinger Pope Benedict Joseph Aloisius Ratzinger Cardinal Ratzinger Cardinal Joseph Ratzinger Pope Benedict XVI Benedict XVI Pope emeritus Benedict XVI Named Entities (NE): people, places, companies... Characteristics of Named Entity Evolution (NEE) Same thing but different terms over time Change occurs over short periods of time Small or no concept shift Announced to the public repeatedly Goal: Find method for named entity evolution recognition independent from external knowledge sources Change Period Thomas Risse 25/11/15 49

Named Entity Evolution Recognizer (NEER) Identifying Change Periods (Burst Detection) Extract Text NLP Processing Context Creation In his latest address to American In his bishops latest address

that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time -- remain a In his latest address to

educators stressed Rome, should Pope that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time

49 Named Entity Evolution Recognizer (NEER) Identifying Change Periods (Burst Detection) Extract Text NLP Processing Context Creation In his latest address to American In his bishops latest address visiting to Rome, American In Pope his bishops Benedict latest address visiting XVI to stressed Rome, American Pope that bishops Benedict Catholic visiting XVI educators stressed Rome, should Pope that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time -- remain a In his latest address to American In his bishops latest address visiting to Rome, American In Pope his Benedict bishops latest address visiting XVI to stressed Rome, American Pope that Benedict bishops Catholic XVI visiting educators stressed Rome, should Pope that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time -- remain a for reminder another true tense to issued the season just faith in of time -- a for reminder another true tense issued to the season just faith in of time -- a commencement for reminder another tense issued addresses. season just in of time No, commencement for the another pope tense did addresses. season not of mention No, commencement the pope Georgetown did addresses. not University mention No, by the name pope Georgetown when did not discussing University mention the by name Catholic Georgetown when campus discussing University culture wars. the by name Catholic when commencement for another reminder tense issued season addresses. commencement for No, another pope tense did season not addr- of just in of time mention esses. commencement No, the Georgetown pope did not addresses. by No, name Georgetown the pope when did not University mention discussing University mention the by name Catholic Georgetown when campus discussing University culture the wars. by Catholic name when campus discussing culture wars. the Catholic campus discussing culture the wars. Catholic campus culture wars. campus culture wars. Barack Obama Senator State Senator Barack Obama Senator-elect Barack Obama Senator Barack Obama Illinois Democrat Vladimir Putin President-elect Vladimir V Putin Minister Vladimir Putin Acting President Vladimir V Putin President Vladimir V Putin Finding Temporal Co-references Filtering 1. Pope Benedict XVI 2. Pope Benedict 3. Benedict XVI 4. Cardinal Ratzinger 5. Pope 6. Benedict Co-References Benedict XVI Joseph Ratzinger Cardinal Ratzinger Evaluation Results Burst detection found total 73% of all change periods High recall for unsupervised method Machine Learning boosts precision Data Set: Thomas Risse 25/11/15 50

Motivation Optimal access to Web archives taking into account the temporal dimension of Web archives structured semantic information available on

Temporal Retrieval and Ranking Collaborative Exploration and Analytics Testbeds Temporal Wikipedia Academic Web Archive Politics on the Web

50 Motivation Optimal access to Web archives taking into account the temporal dimension of Web archives structured semantic information available on the Web social media and network information Objectives Evolution-Aware Entity-Based Enrichment and Indexing Aggregating Social Networks and Streams Temporal Retrieval and Ranking Collaborative Exploration and Analytics Testbeds Temporal Wikipedia Academic Web Archive Politics on the Web ALEXANDRIA Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives More information: Thomas Risse 25/11/15 51

51 Conclusions Web Crawling Mature standard Web crawlers but still many challenges (e.g. Social Media, Dynamics) Manual interventions will always be necessary New user communities with different interests and requirements Additional crawling strategies are necessary Web Archive Access Current approaches far behind the State of the Art Increasing Web Archive interest will lead to increasing user expectations Different usage categories: Explorative search vs. Big Data analysis Legal Aspects are still a big issue Thomas Risse 25/11/15 52

52 Thank You! Dr. Thomas Risse Forschungszentrum L3S Leibniz Universität Hannover Appelstrasse 9a Hannover, Germany Telefon: Telefax: Thomas Risse 25/11/15 53

From Web Page Storage to Living Web Archives Thomas Risse

From Web Page Storage to Living Web Archives Thomas Risse JISC, the DPC and the UK Web Archiving Consortium Workshop British Library, London, 21.7.2009 1 Agenda Web Crawlingtoday& Open Issues LiWA Living