Challenges and Approaches for Web Archive Creation and Usage
|
|
- Marjorie Hawkins
- 6 years ago
- Views:
Transcription
1 Challenges and Approaches for Web Archive Creation and Usage Thomas Risse L3S Research Center Lleida, 29. May 2015 Thomas Risse 25/11/15 1
2 The Web is a core part of our daily life World Wide Web = 50 Bill. Pages + 1 Bill. Users The Web and the Social Web play a crucial role Information and services for all domains Allows contributions by every citizen Giving room for the articulation for a multitude of stakeholders Reflects all types of events, opinions, developments within society, science, politics, environment, business, Thomas Risse 25/11/15 2
3 Are we loosing the past of the web? Gun running from Sudan Attack on Copts Spam Thomas Risse 25/11/15 3
4 The Web is Changing and Forgetting The Web is a quickly changing, ever growing information space [1] It s growing by >8% per week After 1 year only 40% of the pages are still accessible while 60% of the pages are new A Web Archive as a Collective Memory is a cultural necessity for the future But Archive and Store Everything is not a practical approach [1] A. Ntoulas, J. Cho, and C. Olston. What's new on the web?: the evolution of the web from a search engine perspective. In Proceedings of the 13th international conference on World Wide Web (WWW '04) Thomas Risse 25/11/15 4
5 Where are we now? A number of tools are available (e.g. Heritrix) Crawl descriptions are currently lists of URLs >42 world wide Web Archives initiatives with different scopes [2] Still a lot manual effort is necessary but only some 100 people are involved world wide [2] Increasing interest from Digital Humanities, Journalism, etc. More support is necessary Crawl by Events, Topics and Entities Using the Wisdom of the Crowds for selection and appraisal [2] D. Gomes, J. Miranda and M. Costa. A survey on web archiving initiatives. In Proceedings of the 1st International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011) Thomas Risse 25/11/15 5
6 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 6
7 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 7
8 Index Storage Link Extraction Fetching Preparation Discovery Filtering Web Archiving and some challenges Content Appraisal Quality Review Selection Content Selection Contributions the European Projects Archivist Temporal Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving Dynamic Pages JavaScript, Flash User Data Volume Temporal Aspects Multimedia Content Thomas Risse 25/11/15 8
9 Index Storage Link Extraction Fetching Preparation Discovery Filtering Web Archiving and some challenges Content Appraisal Quality Review Selection Content Selection Archivist Temporal Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving Dynamic Pages JavaScript, Flash User Data Volume Temporal Aspects Multimedia Content Thomas Risse 25/11/15 9
10 Dynamic Pages Thomas Risse 25/11/15 10
11 Follow the links... <a href=" software</a> Thomas Risse 25/11/15 11
12 But what about these pages? function() { var fcb_referrers = [ "fcbarcelona.com", "fcbarcelona.cat", "fcbarcelona.es", "fcbarab.com", "fcbarcelona.fr", "fcbarcelona.jp", "fcbarcelona.cn", "fcbarcelona.qq.com" ]; var current_language = "en"; var routes = {"ca":" var navigator_language = navigator.language navigator.userlanguage; var language = (navigator_language "").split("-")[0]; var s = "^https?:\/\/[^\/]*(" + fcb_referrers.join(" ").replace(/\./g, '\\.') + ")"; var fcb_referrer = document.referrer.match(new RegExp(s)); var redirect_url = routes[language]; if (language && language!= current_language &&!fcb_referrer && redirect_url) { window.location = redirect_url; } })(); Thomas Risse 25/11/15 12
13 Endless Pages Automatically reload content while browsing the page Requires execution of JavaScript When can the fetching be stopped? Examples: Facebook, Twitter, Tumblr, etc. Thomas Risse 25/11/15 13
14 Embedded Social Media Embedded Social Media Content JavaScript embeds Dynamically integrated content on the browser side Often standard scripts Possibility to extract parameters Content fetching requires API Crawler Thomas Risse 25/11/15 14
15 Handling of Dynamic Pages The Problem Embedded Programs Adobe Flash impossible to handle JavaScript is readable Dynamic creation of links and content Approaches Guessing of links guessing by assembling any fragments that look like links into URLs Can be very noisy - lots of wrong URL s Extraction of parameters from program code Applicable for known code libraries e.g. Facebook, Twitter Execution of JavaScript Simulate user activities- pressing the links and see what comes out Execute code in a Javascript engine (e.g. WebKit, Firefox Browser) Extract links from resulting DOM tree Status Some implementations exist (e.g. LiWA Service, Browser Monkey) API Crawler exist No integration into standard Web Crawler Thomas Risse 25/11/15 15
16 Basic Crawl Strategies Thomas Risse 25/11/15 16
17 Crawl Strategy thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 17
18 Crawl Strategy: Depth-first thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 18
19 Crawl Strategy: Breadth-first thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 19
20 Termination of a Crawl When is the harvesting of Web Pages complete? Problem: The total amount of pages is unknown Typical Approach Crawler Queue is empty (only for very focused crawls) Termination after number of pages/amount of data Termination after time Most often incomplete crawls depending on the strategy Thomas Risse 25/11/15 20
21 Possible Consequences after stopping the Crawl Depth-first thomas.htm Breadth-first thomas.htm about.htm about.htm joe.htm joe.htm journalists.htm journalists.htm ukraine0105.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_crisis.htm ukraine0205.htm sports.htm ukraine_sports0105.htm sports.htm ukraine_sports0105.htm spain_sports0205.htm Thomas Risse spain_sports0205.htm 25/11/15 21
22 thomas.htm Alternative Crawl Strategies about.htm journalists.htm joe.htm - Selection by Popularity - Ranking of the waiting queue according to the popularity of the content - Requires knowledge about the Web Graph - For example: PageRank - Mainly useful to optimize regular crawls - Content based selection - Topics, Events, Entities - Requires semantic crawl specification - The selection of the right strategy depends on the crawl intention ukraine_crisis.htm ukraine0105.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Will be discussed afterwards ukraine0205.htm Thomas Risse 25/11/15 22
23 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 23
24 Growing Scientific Interest in Web Archive Content (1/2) Historians - Official Publications (e.g. Government) - Journalistic Resources - Important topics and events with a high media coverage - Multi-cultural or controversial topics - Optimal: continues observation of topics in the Web Social Sciences - Observations of topics and events on major sites are good starting points - Identified Topic - Official publications, journalistic and social media sources - Changes on the topic should be identified - Metadata / Context (e.g. Author, Organizations and their interests, gender, location) - Demographic information about social sites - Provenance: Transparent and detailed documentation of content selection Thomas Risse 25/11/15 24
25 Growing Scientific Interest in Web Archive Content (2/2) Law - Research is based on official publications and protocols of parliaments or comments released by publishers - Social media (especially blogs) are increasingly used - Only used as background information - Reason: missing citability and authenticity of resources - Genesis of laws - Used to understand original intention of laws - A democratic system requires a complete documentation of the law genesis. - Currently different degrees of documentation - Official publications: parliament and committee meetings - Public discourse Thomas Risse 25/11/15 25
26 Derived Requirements Topical Dimension - Crawl intention are mainly focused around events and rarely around entities - What is the intention of the researcher? - Easy monitoring by the researcher and possibility to correct Flexible Crawling Strategies - Shallow observation crawls - Focused crawls with prioritization (e.g. PageRank and/or semantics) Social Web Crawling - General interest with different media focus - Integrated with Web crawler to capture the full context Authenticity - See a web page as the user saw the page (e.g. including ads and tweets at that time point) Context and Provenance - Demographics of sites - Documentation of crawl specification and history Thomas Risse 25/11/15 26
27 Topical / Event Crawling thomas.htm about.htm journalists.htm joe.htm Crawl Specification - Terms - Ukraine - Crisis - - Seed List ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 27
28 ARCOMEM Crawling Phases Entities Obama, Romney, Biden, Ryan, Republicans, Democrats, Keywords US Election, CommitToMitt, Teaparty, Budget deficit, Social Media Seedlist Seedlist Offline Processing Online Processing Crawling ARCOMEM Storage SARA for Broadcaster, Parliaments Archive Internet Crawling Cross Crawl Processing Appraisal Selection Thomas Risse 25/11/15 28
29 Architecture Overview Application Cross Crawl Analysis SARA Broadcaster Named Entity Evol. Recog. Twitter Dynamics Parliament Image/Video Analysis Social Web Analysis Offline Processing WARC Files + SOLR Index WARC Export Crawler Cockpit ARCOMEM Storage (HBase, H2RDF) GATE Offline Analysis Consolidation Enrichment GATE Online Analysis Extracted SocialWeb Information Social Web Analysis Online Processing Queue Management Resource Fetching Application-Aware Helper URLs Relevance Analysis & Priorization Resource Selection & Prioritization Crawler Intelligent Crawl Definition Thomas Risse 25/11/15 29
30 Crawler Cockpit Thomas Risse 25/11/15 30
31 Relevance Distribution Time Thomas Risse 25/11/15 31
32 New Strategies lead to Example: German Elections Crawl 2013 Completely user defined crawl (German Broadcasters) Many focused terms and entities in German Many full names e.g. Angela Merkel Few less focused keywords in English 1 st Phase Many English pages with no relation to German Elections 2 nd Phase Refinement of Crawl Specification Focused English terms Last names instead of full names Finding a good cut off point of archive threshold Smaller result set with higher focus Thomas Risse 25/11/15 32
33 new Experiences A good specification of the crawl intention becomes critical Crawl specifications are rather complex compared to seed lists More experiences, observations and analysis are necessary from Users and Developers Room for more sophisticated guidance in the 1 st phase Finding the right cut-off point Depends on the user requirements Highly focused Smaller Archives but higher risk of loosing interesting information Less focused Lower risk of missing information but larger archives Translation into scoring of individual crawls Thomas Risse 25/11/15 33
34 and new Limitiations Social Media and Web Crawling are separate systems Process 1. Crawling of Social Media content 2. Extraction of Links 3. Crawling of Web Pages Static integration of Social Media Uni-directional Path: Social Media Web Content Missing Path: Web Content Social Media In Addition Complex system Required many changes in the Heritrix queue handling Thomas Risse 25/11/15 34
35 Integrated crawling with the L3S icrawl System Crawl Monitor Scheduler Semantic Crawl Description Initial Seedlist Crawl Specification Crawler Web Crawler API Crawler Specification Refinement Crawl Analysis & Enrichment Archive Creation & Cataloguing Web Web Archive Archive Learning the Crawl Specification Provenance Crawl Preparation Crawl Execution Crawl Finalization Apache Nutch based Web Archive crawler (under development) Learning the intention of the crawl Integration of Web and Social Media Crawling Content based monitoring of the crawl process Thomas Risse 25/11/15 35
36 icrawl Wizard Thomas Risse 25/11/15 36
37 Example for Integrated Crawling Twitter #Ukraine Feed (Medium Page Relevance) (Low Page Relevance) Crawler Queue ID Batch URL Priority UK1 1 UK UK3 x UK4 y (High Page Relevance) Web Link Extracted URL Thomas Risse 25/11/15 37
38 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 38
39 WayBackMachine Basic tool to access Web Archives Also used in combination with other tools Only URL based access Provides Capture overviews Details about crawl dates Thomas Risse 25/11/15 40
40 Searching the Web Archive Thomas Risse 25/11/15 41
41 Other Ways to Explore Web Archives Thomas Risse 25/11/15 42
42 Web Archive Access Different user groups with different needs Not the typical Web Search Engine user Explorative Search Web Archive Search Web Search Time Dimension Many versions of the same page Comparing versions Large scale analysis across time Information extraction (Entities, Events, Topics, Opinions) Interlinking Visualizations Thomas Risse 25/11/15 43
43 The ARCOMEM Approach to Web Archive Access Thomas Risse 25/11/15 44
44 Number of Documents Historical Search on News Archives* Consider a journalist interested in the history of... Rudolph Giuliani Mayoral Campaign Mayoral Campaign Senate, Cancer, Allegations Mayoral Campaign 9/11 Post politics endeavours Mayor News articles encode history as it happens. Aspects are diverse across time Time windows can be diverse in aspects. *Jaspreet Singh, Avishek Anand; Historical Search on News Archives; L3S Report 2015 Thomas Risse 25/11/15 45
45 Large Scale Data Analytics Example Named Entity Evolution Recognition Thomas Risse 25/11/15 46
46 Language changes over time! Our language is DYNAMIC and changes with our culture, politics, technology, social media, etc. Different spellings over time New words are introduced Words change their meanings In long-term digital archives Problems! St. Petersburg Petrograd Leningrad St. Petersburg St. Petersburg today t Thomas Risse 25/11/15 47
47 Problem: Finding Documents A scholar writing a thesis about Pope Benedikt : I want to know more about Pope Benedikt?? Thomas Risse 25/11/15 48
48 Named Entity Evolution Joseph Ratzinger Pope Benedict Joseph Aloisius Ratzinger Cardinal Ratzinger Cardinal Joseph Ratzinger Pope Benedict XVI Benedict XVI Pope emeritus Benedict XVI Named Entities (NE): people, places, companies... Characteristics of Named Entity Evolution (NEE) Same thing but different terms over time Change occurs over short periods of time Small or no concept shift Announced to the public repeatedly Goal: Find method for named entity evolution recognition independent from external knowledge sources Change Period Thomas Risse 25/11/15 49
49 Named Entity Evolution Recognizer (NEER) Identifying Change Periods (Burst Detection) Extract Text NLP Processing Context Creation In his latest address to American In his bishops latest address visiting to Rome, American In Pope his bishops Benedict latest address visiting XVI to stressed Rome, American Pope that bishops Benedict Catholic visiting XVI educators stressed Rome, should Pope that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time -- remain a In his latest address to American In his bishops latest address visiting to Rome, American In Pope his Benedict bishops latest address visiting XVI to stressed Rome, American Pope that Benedict bishops Catholic XVI visiting educators stressed Rome, should Pope that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time -- remain a for reminder another true tense to issued the season just faith in of time -- a for reminder another true tense issued to the season just faith in of time -- a commencement for reminder another tense issued addresses. season just in of time No, commencement for the another pope tense did addresses. season not of mention No, commencement the pope Georgetown did addresses. not University mention No, by the name pope Georgetown when did not discussing University mention the by name Catholic Georgetown when campus discussing University culture wars. the by name Catholic when commencement for another reminder tense issued season addresses. commencement for No, another pope tense did season not addr- of just in of time mention esses. commencement No, the Georgetown pope did not addresses. by No, name Georgetown the pope when did not University mention discussing University mention the by name Catholic Georgetown when campus discussing University culture the wars. by Catholic name when campus discussing culture wars. the Catholic campus discussing culture the wars. Catholic campus culture wars. campus culture wars. Barack Obama Senator State Senator Barack Obama Senator-elect Barack Obama Senator Barack Obama Illinois Democrat Vladimir Putin President-elect Vladimir V Putin Minister Vladimir Putin Acting President Vladimir V Putin President Vladimir V Putin Finding Temporal Co-references Filtering 1. Pope Benedict XVI 2. Pope Benedict 3. Benedict XVI 4. Cardinal Ratzinger 5. Pope 6. Benedict Co-References Benedict XVI Joseph Ratzinger Cardinal Ratzinger Evaluation Results Burst detection found total 73% of all change periods High recall for unsupervised method Machine Learning boosts precision Data Set: Thomas Risse 25/11/15 50
50 Motivation Optimal access to Web archives taking into account the temporal dimension of Web archives structured semantic information available on the Web social media and network information Objectives Evolution-Aware Entity-Based Enrichment and Indexing Aggregating Social Networks and Streams Temporal Retrieval and Ranking Collaborative Exploration and Analytics Testbeds Temporal Wikipedia Academic Web Archive Politics on the Web ALEXANDRIA Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives More information: Thomas Risse 25/11/15 51
51 Conclusions Web Crawling Mature standard Web crawlers but still many challenges (e.g. Social Media, Dynamics) Manual interventions will always be necessary New user communities with different interests and requirements Additional crawling strategies are necessary Web Archive Access Current approaches far behind the State of the Art Increasing Web Archive interest will lead to increasing user expectations Different usage categories: Explorative search vs. Big Data analysis Legal Aspects are still a big issue Thomas Risse 25/11/15 52
52 Thank You! Dr. Thomas Risse Forschungszentrum L3S Leibniz Universität Hannover Appelstrasse 9a Hannover, Germany Telefon: Telefax: Thomas Risse 25/11/15 53
From Web Page Storage to Living Web Archives Thomas Risse
From Web Page Storage to Living Web Archives Thomas Risse JISC, the DPC and the UK Web Archiving Consortium Workshop British Library, London, 21.7.2009 1 Agenda Web Crawlingtoday& Open Issues LiWA Living
More informationFrom Web Page Storage to Living Web Archives
From Web Page Storage to Living Web Archives Thomas Risse JISC, the DPC and the UK Web Archiving Consortium Workshop British Library, London, 21.7.2009 1 Agenda Web Crawling today & Open Issues LiWA Living
More informationExploiting the Social and Semantic Web for guided Web Archiving
Exploiting the Social and Semantic Web for guided Web Archiving Thomas Risse 1, Stefan Dietze 1, Wim Peters 2, Katerina Doka 3, Yannis Stavrakas 3, and Pierre Senellart 4 1 L3S Research Center, Hannover,
More informationThe Role of Language Evolution in Digital Archives
The Role of Language Evolution in Digital Archives Thomas Risse, Helge Holzmann (L3S) Nina Tahmasebi, Uni Gothenburg L3S Research Center Web Science @ L3S Preserving, understanding and shaping the Web
More informationAccessing Web Archives
Accessing Web Archives Web Science Course 2017 Helge Holzmann 05/16/2017 Helge Holzmann (holzmann@l3s.de) Not today s topic http://blog.archive.org/2016/09/19/the-internet-archive-turns-20/ 05/16/2017
More informationTechnologies and Tools for Working with Web Archives
Technologies and Tools for Working with Web Archives Helge Holzmann Web Data Engineer @ Internet Archive Researcher @ L3S Research Center, Leibniz Universität Hannover Conference on Preserving Digital
More informationMinghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University
Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang Microsoft Research, Asia School of EECS, Peking University Ordering Policies for Web Crawling Ordering policy To prioritize the URLs in a crawling queue
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationProf. Ahmet Süerdem Istanbul Bilgi University London School of Economics
Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Media Intelligence Business intelligence (BI) Uses data mining techniques and tools for the transformation of raw data into meaningful
More informationCS6200 Information Retreival. Crawling. June 10, 2015
CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on
More informationNDSA Web Archiving Survey
NDSA Web Archiving Survey Introduction In 2011 and 2013, the National Digital Stewardship Alliance (NDSA) conducted surveys of U.S. organizations currently or prospectively engaged in web archiving to
More informationEnhancing applications with Cognitive APIs IBM Corporation
Enhancing applications with Cognitive APIs After you complete this section, you should understand: The Watson Developer Cloud offerings and APIs The benefits of commonly used Cognitive services 2 Watson
More informationWeb Archiving at UTL
Web Archiving at UTL iskills workshops February 2018 Sam-chin Li Reference and Government Information Librarian, UTL Nich Worby Government Information and Statistics Librarian, UTL Agenda What is web archiving
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationEmpowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia
Empowering People with Knowledge the Next Frontier for Web Search Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Important Trends for Web Search Organizing all information Addressing user
More informationA Comparison Between The Performance of Wayback Machines
A Comparison Between The Performance of Wayback Machines Fernando Melo, Daniel Bicho and Daniel Gomes Arquivo.pt - The Portuguese Web Archive {fernando.melo, daniel.bicho, daniel.gomes}@fccn.pt April 19,
More informationAugust 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience
World Library and Information Congress: 71th IFLA General Conference and Council "Libraries - A voyage of discovery" August 14th - 18th 2005, Oslo, Norway Conference Programme: http://www.ifla.org/iv/ifla71/programme.htm
More informationAnnotating Web Archives structure, provenance, and context through archival cataloguing
Annotating Web Archives structure, provenance, and context through archival cataloguing Paul WU Horng-Jyh, PhD Nanyang Technological University hjwu@ntu.edu.sg Paul Wu NDL 19/02/2010 Agenda Background:
More informationDeveloping Focused Crawlers for Genre Specific Search Engines
Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com
More informationCompetitive Intelligence and Web Mining:
Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction
More informationEntity-centric Topic Extraction and Exploration: A Network-based Approach
Entity-centric Topic Extraction and Exploration: A Network-based Approach Andreas Spitz and Michael Gertz March 27, 2018 ECIR 2018, Grenoble Heidelberg University, Germany Database Systems Research Group
More informationTempWeb rd Temporal Web Analytics Workshop
TempWeb 2013 3 rd Temporal Web Analytics Workshop Stuff happens continuously: exploring Web contents with temporal information Omar Alonso Microsoft 13 May 2013 Disclaimer The views, opinions, positions,
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationFILTERING OF URLS USING WEBCRAWLER
FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,
More informationTime-aware Approaches to Information Retrieval
Time-aware Approaches to Information Retrieval Nattiya Kanhabua Department of Computer and Information Science Norwegian University of Science and Technology 24 February 2012 Motivation Searching documents
More informationFocused Crawling with
Focused Crawling with ApacheCon North America Vancouver, 2016 Hello! I am Sujen Shah Computer Science @ University of Southern California Research Intern @ NASA Jet Propulsion Laboratory Member of The
More informationWeb Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson
Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Web Search Basics The Web as a graph
More informationThe University of Jordan. Accreditation & Quality Assurance Center. Curriculum for Doctorate Degree
Accreditation & Quality Assurance Center Curriculum for Doctorate Degree 1. Faculty King Abdullah II School for Information Technology 2. Department Computer Science الدكتوراة في علم الحاسوب (Arabic).3
More informationPID System for eresearch
the European Persistant Identifier Gesellschaft für wissenschaftliche Datenverarbeitung mbh Göttingen Am Fassberg, 37077 Göttingen ulrich.schwardmann@gwdg.de IZA/Gesis/RatSWD-WS Persistent Identifiers
More informationWorkshop on Web Archiving
Workshop on Web Archiving MODULE 1 A: WEB ARCHIVING Niels Brügger Asger Harlung Program, KB 10.40-11.50 Workshop part 1: Web archiving 11.50-12.10 Discussion and a short break before lunch 12.10-13.00
More informationGive me the info I need now Setting the text analysis strategy at Belga News Agency
Give me the info I need now Setting the text analysis strategy at Belga News Agency Tom Wuytack ICT Manager, wut@belga.be Powered by Belga News Agency 2015 A few words on Belga News Agency Each day the
More informationSciVerse Scopus. 1. Scopus introduction and content coverage. 2. Scopus in comparison with Web of Science. 3. Basic functionalities of Scopus
Prepared by: Jawad Sayadi Account Manager, United Kingdom Elsevier BV Radarweg 29 1043 NX Amsterdam The Netherlands J.Sayadi@elsevier.com SciVerse Scopus SciVerse Scopus 1. Scopus introduction and content
More informationDIGIT.B4 Big Data PoC
DIGIT.B4 Big Data PoC DIGIT 01 Social Media D02.01 PoC Requirements Table of contents 1 Introduction... 5 1.1 Context... 5 1.2 Objective... 5 2 Data SOURCES... 6 2.1 Data sources... 6 2.2 Data fields...
More informationSupporting Research Data Collection from YouTube with TubeKit
Supporting Research Data Collection from YouTube with TubeKit Chirag Shah School of Information & Library Science University of North Carolina Chapel Hill NC 27599, USA chirag@unc.edu Abstract We present
More informationWEB LIP6 STÉPHANE GANÇARSKI Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA,N. THOME, M. LAW
WEB ARCHIVING @ LIP6 STÉPHANE GANÇARSKI Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA,N. THOME, M. LAW French ANR project Cartec (ended) European project Scape (ends 9/2014) Web Archives Web is ephemeral
More informationArchives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment
Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Shigeo Sugimoto Research Center for Knowledge Communities Graduate School of Library, Information
More informationNéonaute: mining web archives for linguistic analysis
Néonaute: mining web archives for linguistic analysis Sara Aubry, Bibliothèque nationale de France Emmanuel Cartier, LIPN, University of Paris 13 Peter Stirling, Bibliothèque nationale de France IIPC Web
More informationArchiving and Preserving the Web. Kristine Hanna Internet Archive November 2006
Archiving and Preserving the Web Kristine Hanna Internet Archive November 2006 1 About Internet Archive Non profit founded in 1996 by Brewster Kahle, as an Internet library Provide universal and permanent
More informationSupporting Research Data Collection from YouTube with TubeKit
Supporting Research Data Collection from YouTube with TubeKit Chirag Shah School of Information & Library Science University of North Carolina chirag@unc.edu Abstract We present TubeKit, a query-based
More informationCollection Building on the Web. Basic Algorithm
Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring
More informationSEO: SEARCH ENGINE OPTIMISATION
SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that
More informationAssociate Diploma in Web and Multimedia Development
Associate Diploma in Web and Multimedia Development Program Components CRD Major Support 4% University 8% University (UR) 5 College (CR) 9 Major (MR) 49 College 14% Major Support (MSR) 3 Training (Internship)
More informationPlease note: Only the original curriculum in Danish language has legal validity in matters of discrepancy. CURRICULUM
Please note: Only the original curriculum in Danish language has legal validity in matters of discrepancy. CURRICULUM CURRICULUM OF 1 SEPTEMBER 2008 FOR THE BACHELOR OF ARTS IN INTERNATIONAL BUSINESS COMMUNICATION:
More informationVISUAL SUMMARY ACCESS INTERNET AND WEB. The Internet, the Web, and Electronic Commerce
VISUAL SUMMARY The Internet, the Web, and Electronic Commerce INTERNET AND WEB Internet Launched in 1969 with ARPANET, the Internet consists of the actual physical network. Web Introduced in 1991 at CERN,
More informationCommunity Tools and Best Practices for Harvesting and Preserving At-Risk Web Content ACA 2013
Community Tools and Best Practices for Harvesting and Preserving At-Risk Web Content ACA 2013 Scott Reed, Internet Archive Amanda Wakaruk, University of Alberta Libraries Kelly E. Lau, University of Alberta
More informationFocused Crawling with
Focused Crawling with ApacheCon North America Vancouver, 2016 Hello! I am Sujen Shah Computer Science @ University of Southern California Research Intern @ NASA Jet Propulsion Laboratory Member of The
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationShrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India
International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent
More informationALL-KANSAS NEWS WEBSITE CRITIQUE BOOKLET. This guide is designed to be an educational device SCHOOL NAME: NEW WEBSITE NAME: YEAR: ADVISER:
ÇP ALL-KANSAS NEWS WEBSITE CRITIQUE BOOKLET This guide is designed to be an educational device to improve the quality of your news website program. It is intended to point out positive aspects of your
More informationAnatomy of a Semantic Virus
Anatomy of a Semantic Virus Peyman Nasirifard Digital Enterprise Research Institute National University of Ireland, Galway IDA Business Park, Lower Dangan, Galway, Ireland peyman.nasirifard@deri.org Abstract.
More informationCommunication (COMM) Communication (COMM) 1
Communication (COMM) 1 Communication (COMM) COMM 110. Fundamentals of Public Speaking. 3 Credits. Theory and practice of public speaking with emphasis on content, organization, language, delivery, and
More informationEnhanced retrieval using semantic technologies:
Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008
More informationPreservation of Web Materials
Preservation of Web Materials Julie Dietrich INFO 560 Literature Review 7/20/13 1 Introduction Websites are a communication and informational tool that can be shared and updated across the World Wide Web.
More informationCognos: Crowdsourcing Search for Topic Experts in Microblogs
Cognos: Crowdsourcing Search for Topic Experts in Microblogs Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna Gummadi IIT Kharagpur, India; UFOP, Brazil; MPI-SWS, Germany Topic
More informationMetadata Framework for Resource Discovery
Submitted by: Metadata Strategy Catalytic Initiative 2006-05-01 Page 1 Section 1 Metadata Framework for Resource Discovery Overview We must find new ways to organize and describe our extraordinary information
More informationCAT (CURATOR ARCHIVING TOOL): IMPROVING ACCESS TO WEB ARCHIVES
CAT (CURATOR ARCHIVING TOOL): IMPROVING ACCESS TO WEB ARCHIVES Ciro Llueca, Daniel Cócera Biblioteca de Catalunya (BC) Barcelona, Spain padicat@bnc.cat Natalia Torres, Gerard Suades, Ricard de la Vega
More informationCOUNCIL OF THE EUROPEAN UNION. Brussels, 28 January 2003 (OR. en) 15723/02 TELECOM 78 JAI 307 PESC 593
COUNCIL OF THE EUROPEAN UNION Brussels, 28 January 2003 (OR. en) 15723/02 TELECOM 78 JAI 307 PESC 593 LEGISLATIVE ACTS AND OTHER INSTRUMTS Subject : Council Resolution on a European approach towards a
More informationBPS Suite and the OCEG Capability Model. Mapping the OCEG Capability Model to the BPS Suite s product capability.
BPS Suite and the OCEG Capability Model Mapping the OCEG Capability Model to the BPS Suite s product capability. BPS Contents Introduction... 2 GRC activities... 2 BPS and the Capability Model for GRC...
More informationCDL s Web Archiving System
CDL s Web Archiving System Erik Hetzner UC3, California Digital Library 16 June 2011 Erik Hetzner (UC3, California Digital Library) CDL s Web Archiving System 16 June 2011 1 / 24 Introduction We don t
More informationUNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.
UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index
More informationALOE - A Socially Aware Learning Resource and Metadata Hub
ALOE - A Socially Aware Learning Resource and Metadata Hub Martin Memmel & Rafael Schirru Knowledge Management Department German Research Center for Artificial Intelligence DFKI GmbH, Trippstadter Straße
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More informationMarketing & Back Office Management
Marketing & Back Office Management Menu Management Add, Edit, Delete Menu Gallery Management Add, Edit, Delete Images Banner Management Update the banner image/background image in web ordering Online Data
More informationChapter 3: Google Penguin, Panda, & Hummingbird
Chapter 3: Google Penguin, Panda, & Hummingbird Search engine algorithms are based on a simple premise: searchers want an answer to their queries. For any search, there are hundreds or thousands of sites
More informationPreserving Public Government Information: The 2008 End of Term Crawl Project
Preserving Public Government Information: The 2008 End of Term Crawl Project Abbie Grotke, Library of Congress Mark Phillips, University of North Texas Libraries George Barnum, U.S. Government Printing
More informationGoogle Analytics 101
Copyright GetABusinessMobileApp.com All rights reserved worldwide. YOUR RIGHTS: This book is restricted to your personal use only. It does not come with any other rights. LEGAL DISCLAIMER: This book is
More informationEnriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle
Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle Demetris Kyriacou Learning Societies Lab School of Electronics and Computer Science, University of Southampton
More informationKEY FUNCTIONS. Dynamic content. Conditional s. A/B/X testing. Creation. testing. Video s. Spam. testing. Advanced personalisation
KEY FUNCTIONS Dynamic content Conditional emails Messages sent as part of one campaign will automatically adjust their looks to particular addressees preferences. Depending on the values of the characteristics
More informationVersion 11
The Big Challenges Networked and Electronic Media European Technology Platform The birth of a new sector www.nem-initiative.org Version 11 1. NEM IN THE WORLD The main objective of the Networked and Electronic
More informationANNUAL REPORT Visit us at project.eu Supported by. Mission
Mission ANNUAL REPORT 2011 The Web has proved to be an unprecedented success for facilitating the publication, use and exchange of information, at planetary scale, on virtually every topic, and representing
More informationInteroperability and transparency The European context
JOINING UP GOVERNMENTS EUROPEAN COMMISSION Interoperability and transparency The European context ITAPA 2011, Bratislava Francisco García Morán Director General for Informatics Background 2 3 Every European
More informationUAE National Space Policy Agenda Item 11; LSC April By: Space Policy and Regulations Directory
UAE National Space Policy Agenda Item 11; LSC 2017 06 April 2017 By: Space Policy and Regulations Directory 1 Federal Decree Law No.1 of 2014 establishes the UAE Space Agency UAE Space Agency Objectives
More informationSoftware Requirements Specification for the Names project prototype
Software Requirements Specification for the Names project prototype Prepared for the JISC Names Project by Daniel Needham, Amanda Hill, Alan Danskin & Stephen Andrews April 2008 1 Table of Contents 1.
More informationWeb Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India
Web Crawling Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India - 382 481. Abstract- A web crawler is a relatively simple automated program
More informationQUALITY SEO LINK BUILDING
QUALITY SEO LINK BUILDING Developing Your Online Profile through Quality Links TABLE OF CONTENTS Introduction The Impact Links Have on Your Search Profile 02 Chapter II Evaluating Your Link Profile 03
More informationOverview of Web Mining Techniques and its Application towards Web
Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous
More informationContent Based Smart Crawler For Efficiently Harvesting Deep Web Interface
Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer
More informationMedici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project
Medici for Digital Cultural Heritage Libraries George Tsouloupas, PhD The LinkSCEEM Project Overview of Digital Libraries A Digital Library: "An informal definition of a digital library is a managed collection
More informationArchiving the Web: What can Institutions learn from National and International Web Archiving Initiatives
Archiving the Web: What can Institutions learn from National and International Web Archiving Initiatives Maureen Pennock Michael Day Lizzie Richmond UKOLN University of Bath UKOLN University of Bath University
More informationLegal Deposit of Online Newspapers at the BnF - Clément Oury - IFLA PAC Paris 2012
1 Legal Deposit of Online Newspapers Digital collections in BnF stacks Clément Oury Head of Digital Legal Deposit Bibliothèque nationale de France Summary The issue : ensuring the continuity of BnF heritage
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection
More informationOleksandr Kuzomin, Bohdan Tkachenko
International Journal "Information Technologies Knowledge" Volume 9, Number 2, 2015 131 INTELLECTUAL SEARCH ENGINE OF ADEQUATE INFORMATION IN INTERNET FOR CREATING DATABASES AND KNOWLEDGE BASES Oleksandr
More informationONLINE EVALUATION FOR: Company Name
ONLINE EVALUATION FOR: Company Name Address Phone URL media advertising design P.O. Box 2430 Issaquah, WA 98027 (800) 597-1686 platypuslocal.com SUMMARY A Thank You From Platypus: Thank you for purchasing
More informationEUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture. Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members
EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members Agenda Background information Services Common Data Infrastructure
More informationCrawling the Web. Web Crawling. Main Issues I. Type of crawl
Web Crawling Crawling the Web v Retrieve (for indexing, storage, ) Web pages by using the links found on a page to locate more pages. Must have some starting point 1 2 Type of crawl Web crawl versus crawl
More informationTechnology Brown Bag: Web 2.0
Technology Brown Bag: Web 2.0 Schedule information Event Technology Brown Bag: Web 2.0 When Thursday, May 4, 2006 from 12:00pm to 1:30pm Where Harris 1300 Event details Details Access Contact What is Web
More informationCHAPTER 4 EVALUATION
45 CHAPTER 4 EVALUATION Three degrees of evaluation were performed. The first degree determines how effective the process of conforming tools to this specification is in accomplishing the task of preserving
More informationRole of Social Media and Semantic WEB in Libraries
Role of Social Media and Semantic WEB in Libraries By Dr. Anwar us Saeed Email: anwarussaeed@yahoo.com Layout Plan Where Library streams merge the WEB Recent Evolution of the WEB Social WEB Semantic WEB
More informationExplorations of Netarkivet: Preliminary Findings. Emily Maemura, PhD Candidate Faculty of Information, University of Toronto NetLab Forum May 4, 2018
Explorations of Netarkivet: Preliminary Findings Emily Maemura, PhD Candidate Faculty of Information, University of Toronto NetLab Forum May 4, 2018 Elements of Provenance Maemura, Worby, Milligan, Becker,
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationArchiving dynamic websites: a considerations framework and a tool comparison
Archiving dynamic websites: a considerations framework and a tool comparison Allard Oelen Student number: 2560801 Supervisor: Victor de Boer 30 June 2017 Vrije Universiteit Amsterdam a.b.oelen@student.vu.nl
More informationALEXANDRIA. Temporal Retrieval, Exploration and Analytics in Web Archives. Wolfgang Nejdl. L3S Research Center Hannover, Germany
ALEXANDRIA Temporal Retrieval, Exploration and Analytics in Web Archives Wolfgang Nejdl L3S Research Center Hannover, Germany Looking back: The Austrian Socialist Party and Europe What is missing? ALEXANDRIA
More informationOpen-Source Natural Language Processing and Computational Archival Science
Open-Source Natural Language Processing and Computational Archival Science Kalina Bontcheva University of Sheffield @kbontcheva The University of Sheffield, 1995-2018 This work is licensed under the Creative
More informationBuilding Institutional Repositories: Emerging Challenges
University of Nebraska at Omaha From the SelectedWorks of Yumi Ohira 2014 Building Institutional Repositories: Emerging Challenges Yumi Ohira, University of Nebraska at Omaha Available at: https://works.bepress.com/yumi-ohira/3/
More informationRanking of ads. Sponsored Search
Sponsored Search Ranking of ads Goto model: Rank according to how much advertiser pays Current model: Balance auction price and relevance Irrelevant ads (few click-throughs) Decrease opportunities for
More informationASSIUT UNIVERSITY. Faculty of Computers and Information Department of Information Systems. IS Ph.D. Program. Page 0
ASSIUT UNIVERSITY Faculty of Computers and Information Department of Information Systems Informatiio on Systems PhD Program IS Ph.D. Program Page 0 Assiut University Faculty of Computers & Informationn
More informationWebsite Acecore Technologies JL B.V.:
Privacy policy Acecore Technologies JL B.V. B.V. Acecore Technologies JL B.V. (hereafter: Acecore technologies JL B.V.) focusses on the development and shipping of drones in the creative, industrial and
More informationEssential for Employee Engagement. Frequently Asked Questions
Essential for Employee Engagement The Essential Communications Intranet is a solution that provides clients with a ready to use framework that takes advantage of core SharePoint elements and our years
More informationSUMMON WEB-SCALE DISCOVERY. ADA University Baku 02/04/2014
SUMMON WEB-SCALE DISCOVERY ADA University Baku 02/04/2014 Why an Automated Management Solution is Important Academic Library Expenditures on Purchased and Licensed Content 90% 80% 70% 60% 50% 40% 30% 20%
More informationOnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.
1 OnCrawl Metrics What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. UNLEASH YOUR SEO POTENTIAL Table of content 01 Crawl Analysis 02 Logs Monitoring
More information