Challenges and Approaches for Web Archive Creation and Usage

Size: px
Start display at page:

Download "Challenges and Approaches for Web Archive Creation and Usage"

Transcription

1 Challenges and Approaches for Web Archive Creation and Usage Thomas Risse L3S Research Center Lleida, 29. May 2015 Thomas Risse 25/11/15 1

2 The Web is a core part of our daily life World Wide Web = 50 Bill. Pages + 1 Bill. Users The Web and the Social Web play a crucial role Information and services for all domains Allows contributions by every citizen Giving room for the articulation for a multitude of stakeholders Reflects all types of events, opinions, developments within society, science, politics, environment, business, Thomas Risse 25/11/15 2

3 Are we loosing the past of the web? Gun running from Sudan Attack on Copts Spam Thomas Risse 25/11/15 3

4 The Web is Changing and Forgetting The Web is a quickly changing, ever growing information space [1] It s growing by >8% per week After 1 year only 40% of the pages are still accessible while 60% of the pages are new A Web Archive as a Collective Memory is a cultural necessity for the future But Archive and Store Everything is not a practical approach [1] A. Ntoulas, J. Cho, and C. Olston. What's new on the web?: the evolution of the web from a search engine perspective. In Proceedings of the 13th international conference on World Wide Web (WWW '04) Thomas Risse 25/11/15 4

5 Where are we now? A number of tools are available (e.g. Heritrix) Crawl descriptions are currently lists of URLs >42 world wide Web Archives initiatives with different scopes [2] Still a lot manual effort is necessary but only some 100 people are involved world wide [2] Increasing interest from Digital Humanities, Journalism, etc. More support is necessary Crawl by Events, Topics and Entities Using the Wisdom of the Crowds for selection and appraisal [2] D. Gomes, J. Miranda and M. Costa. A survey on web archiving initiatives. In Proceedings of the 1st International Conference on Theory and Practice of Digital Libraries 2011 (TPDL 2011) Thomas Risse 25/11/15 5

6 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 6

7 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 7

8 Index Storage Link Extraction Fetching Preparation Discovery Filtering Web Archiving and some challenges Content Appraisal Quality Review Selection Content Selection Contributions the European Projects Archivist Temporal Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving Dynamic Pages JavaScript, Flash User Data Volume Temporal Aspects Multimedia Content Thomas Risse 25/11/15 8

9 Index Storage Link Extraction Fetching Preparation Discovery Filtering Web Archiving and some challenges Content Appraisal Quality Review Selection Content Selection Archivist Temporal Coherence of Crawls Social Media API Crawling Noise Filtering Capture Access Deep Web Archiving Dynamic Pages JavaScript, Flash User Data Volume Temporal Aspects Multimedia Content Thomas Risse 25/11/15 9

10 Dynamic Pages Thomas Risse 25/11/15 10

11 Follow the links... <a href=" software</a> Thomas Risse 25/11/15 11

12 But what about these pages? function() { var fcb_referrers = [ "fcbarcelona.com", "fcbarcelona.cat", "fcbarcelona.es", "fcbarab.com", "fcbarcelona.fr", "fcbarcelona.jp", "fcbarcelona.cn", "fcbarcelona.qq.com" ]; var current_language = "en"; var routes = {"ca":" var navigator_language = navigator.language navigator.userlanguage; var language = (navigator_language "").split("-")[0]; var s = "^https?:\/\/[^\/]*(" + fcb_referrers.join(" ").replace(/\./g, '\\.') + ")"; var fcb_referrer = document.referrer.match(new RegExp(s)); var redirect_url = routes[language]; if (language && language!= current_language &&!fcb_referrer && redirect_url) { window.location = redirect_url; } })(); Thomas Risse 25/11/15 12

13 Endless Pages Automatically reload content while browsing the page Requires execution of JavaScript When can the fetching be stopped? Examples: Facebook, Twitter, Tumblr, etc. Thomas Risse 25/11/15 13

14 Embedded Social Media Embedded Social Media Content JavaScript embeds Dynamically integrated content on the browser side Often standard scripts Possibility to extract parameters Content fetching requires API Crawler Thomas Risse 25/11/15 14

15 Handling of Dynamic Pages The Problem Embedded Programs Adobe Flash impossible to handle JavaScript is readable Dynamic creation of links and content Approaches Guessing of links guessing by assembling any fragments that look like links into URLs Can be very noisy - lots of wrong URL s Extraction of parameters from program code Applicable for known code libraries e.g. Facebook, Twitter Execution of JavaScript Simulate user activities- pressing the links and see what comes out Execute code in a Javascript engine (e.g. WebKit, Firefox Browser) Extract links from resulting DOM tree Status Some implementations exist (e.g. LiWA Service, Browser Monkey) API Crawler exist No integration into standard Web Crawler Thomas Risse 25/11/15 15

16 Basic Crawl Strategies Thomas Risse 25/11/15 16

17 Crawl Strategy thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 17

18 Crawl Strategy: Depth-first thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 18

19 Crawl Strategy: Breadth-first thomas.htm about.htm joe.htm journalists.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 19

20 Termination of a Crawl When is the harvesting of Web Pages complete? Problem: The total amount of pages is unknown Typical Approach Crawler Queue is empty (only for very focused crawls) Termination after number of pages/amount of data Termination after time Most often incomplete crawls depending on the strategy Thomas Risse 25/11/15 20

21 Possible Consequences after stopping the Crawl Depth-first thomas.htm Breadth-first thomas.htm about.htm about.htm joe.htm joe.htm journalists.htm journalists.htm ukraine0105.htm ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_crisis.htm ukraine0205.htm sports.htm ukraine_sports0105.htm sports.htm ukraine_sports0105.htm spain_sports0205.htm Thomas Risse spain_sports0205.htm 25/11/15 21

22 thomas.htm Alternative Crawl Strategies about.htm journalists.htm joe.htm - Selection by Popularity - Ranking of the waiting queue according to the popularity of the content - Requires knowledge about the Web Graph - For example: PageRank - Mainly useful to optimize regular crawls - Content based selection - Topics, Events, Entities - Requires semantic crawl specification - The selection of the right strategy depends on the crawl intention ukraine_crisis.htm ukraine0105.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Will be discussed afterwards ukraine0205.htm Thomas Risse 25/11/15 22

23 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 23

24 Growing Scientific Interest in Web Archive Content (1/2) Historians - Official Publications (e.g. Government) - Journalistic Resources - Important topics and events with a high media coverage - Multi-cultural or controversial topics - Optimal: continues observation of topics in the Web Social Sciences - Observations of topics and events on major sites are good starting points - Identified Topic - Official publications, journalistic and social media sources - Changes on the topic should be identified - Metadata / Context (e.g. Author, Organizations and their interests, gender, location) - Demographic information about social sites - Provenance: Transparent and detailed documentation of content selection Thomas Risse 25/11/15 24

25 Growing Scientific Interest in Web Archive Content (2/2) Law - Research is based on official publications and protocols of parliaments or comments released by publishers - Social media (especially blogs) are increasingly used - Only used as background information - Reason: missing citability and authenticity of resources - Genesis of laws - Used to understand original intention of laws - A democratic system requires a complete documentation of the law genesis. - Currently different degrees of documentation - Official publications: parliament and committee meetings - Public discourse Thomas Risse 25/11/15 25

26 Derived Requirements Topical Dimension - Crawl intention are mainly focused around events and rarely around entities - What is the intention of the researcher? - Easy monitoring by the researcher and possibility to correct Flexible Crawling Strategies - Shallow observation crawls - Focused crawls with prioritization (e.g. PageRank and/or semantics) Social Web Crawling - General interest with different media focus - Integrated with Web crawler to capture the full context Authenticity - See a web page as the user saw the page (e.g. including ads and tweets at that time point) Context and Provenance - Demographics of sites - Documentation of crawl specification and history Thomas Risse 25/11/15 26

27 Topical / Event Crawling thomas.htm about.htm journalists.htm joe.htm Crawl Specification - Terms - Ukraine - Crisis - - Seed List ukraine0105.htm ukraine_crisis.htm ukraine0205.htm ukraine_sports0105.htm sports.htm spain_sports0205.htm Thomas Risse 25/11/15 27

28 ARCOMEM Crawling Phases Entities Obama, Romney, Biden, Ryan, Republicans, Democrats, Keywords US Election, CommitToMitt, Teaparty, Budget deficit, Social Media Seedlist Seedlist Offline Processing Online Processing Crawling ARCOMEM Storage SARA for Broadcaster, Parliaments Archive Internet Crawling Cross Crawl Processing Appraisal Selection Thomas Risse 25/11/15 28

29 Architecture Overview Application Cross Crawl Analysis SARA Broadcaster Named Entity Evol. Recog. Twitter Dynamics Parliament Image/Video Analysis Social Web Analysis Offline Processing WARC Files + SOLR Index WARC Export Crawler Cockpit ARCOMEM Storage (HBase, H2RDF) GATE Offline Analysis Consolidation Enrichment GATE Online Analysis Extracted SocialWeb Information Social Web Analysis Online Processing Queue Management Resource Fetching Application-Aware Helper URLs Relevance Analysis & Priorization Resource Selection & Prioritization Crawler Intelligent Crawl Definition Thomas Risse 25/11/15 29

30 Crawler Cockpit Thomas Risse 25/11/15 30

31 Relevance Distribution Time Thomas Risse 25/11/15 31

32 New Strategies lead to Example: German Elections Crawl 2013 Completely user defined crawl (German Broadcasters) Many focused terms and entities in German Many full names e.g. Angela Merkel Few less focused keywords in English 1 st Phase Many English pages with no relation to German Elections 2 nd Phase Refinement of Crawl Specification Focused English terms Last names instead of full names Finding a good cut off point of archive threshold Smaller result set with higher focus Thomas Risse 25/11/15 32

33 new Experiences A good specification of the crawl intention becomes critical Crawl specifications are rather complex compared to seed lists More experiences, observations and analysis are necessary from Users and Developers Room for more sophisticated guidance in the 1 st phase Finding the right cut-off point Depends on the user requirements Highly focused Smaller Archives but higher risk of loosing interesting information Less focused Lower risk of missing information but larger archives Translation into scoring of individual crawls Thomas Risse 25/11/15 33

34 and new Limitiations Social Media and Web Crawling are separate systems Process 1. Crawling of Social Media content 2. Extraction of Links 3. Crawling of Web Pages Static integration of Social Media Uni-directional Path: Social Media Web Content Missing Path: Web Content Social Media In Addition Complex system Required many changes in the Heritrix queue handling Thomas Risse 25/11/15 34

35 Integrated crawling with the L3S icrawl System Crawl Monitor Scheduler Semantic Crawl Description Initial Seedlist Crawl Specification Crawler Web Crawler API Crawler Specification Refinement Crawl Analysis & Enrichment Archive Creation & Cataloguing Web Web Archive Archive Learning the Crawl Specification Provenance Crawl Preparation Crawl Execution Crawl Finalization Apache Nutch based Web Archive crawler (under development) Learning the intention of the crawl Integration of Web and Social Media Crawling Content based monitoring of the crawl process Thomas Risse 25/11/15 35

36 icrawl Wizard Thomas Risse 25/11/15 36

37 Example for Integrated Crawling Twitter #Ukraine Feed (Medium Page Relevance) (Low Page Relevance) Crawler Queue ID Batch URL Priority UK1 1 UK UK3 x UK4 y (High Page Relevance) Web Link Extracted URL Thomas Risse 25/11/15 37

38 Agenda Motivation The Crawl Process and its Challenges Process Overview Dynamic Pages Basic Crawl Strategies Termination of Crawls Next Generation Web Archiving Requirements Topical Crawling The ARCOMEM Approach icrawl - Integrated Crawling Web Archive Access and Usage Examples Current methods The ARCOMEM Approach Time and Topic aware diversification Language Evolution Conclusions Thomas Risse 25/11/15 38

39 WayBackMachine Basic tool to access Web Archives Also used in combination with other tools Only URL based access Provides Capture overviews Details about crawl dates Thomas Risse 25/11/15 40

40 Searching the Web Archive Thomas Risse 25/11/15 41

41 Other Ways to Explore Web Archives Thomas Risse 25/11/15 42

42 Web Archive Access Different user groups with different needs Not the typical Web Search Engine user Explorative Search Web Archive Search Web Search Time Dimension Many versions of the same page Comparing versions Large scale analysis across time Information extraction (Entities, Events, Topics, Opinions) Interlinking Visualizations Thomas Risse 25/11/15 43

43 The ARCOMEM Approach to Web Archive Access Thomas Risse 25/11/15 44

44 Number of Documents Historical Search on News Archives* Consider a journalist interested in the history of... Rudolph Giuliani Mayoral Campaign Mayoral Campaign Senate, Cancer, Allegations Mayoral Campaign 9/11 Post politics endeavours Mayor News articles encode history as it happens. Aspects are diverse across time Time windows can be diverse in aspects. *Jaspreet Singh, Avishek Anand; Historical Search on News Archives; L3S Report 2015 Thomas Risse 25/11/15 45

45 Large Scale Data Analytics Example Named Entity Evolution Recognition Thomas Risse 25/11/15 46

46 Language changes over time! Our language is DYNAMIC and changes with our culture, politics, technology, social media, etc. Different spellings over time New words are introduced Words change their meanings In long-term digital archives Problems! St. Petersburg Petrograd Leningrad St. Petersburg St. Petersburg today t Thomas Risse 25/11/15 47

47 Problem: Finding Documents A scholar writing a thesis about Pope Benedikt : I want to know more about Pope Benedikt?? Thomas Risse 25/11/15 48

48 Named Entity Evolution Joseph Ratzinger Pope Benedict Joseph Aloisius Ratzinger Cardinal Ratzinger Cardinal Joseph Ratzinger Pope Benedict XVI Benedict XVI Pope emeritus Benedict XVI Named Entities (NE): people, places, companies... Characteristics of Named Entity Evolution (NEE) Same thing but different terms over time Change occurs over short periods of time Small or no concept shift Announced to the public repeatedly Goal: Find method for named entity evolution recognition independent from external knowledge sources Change Period Thomas Risse 25/11/15 49

49 Named Entity Evolution Recognizer (NEER) Identifying Change Periods (Burst Detection) Extract Text NLP Processing Context Creation In his latest address to American In his bishops latest address visiting to Rome, American In Pope his bishops Benedict latest address visiting XVI to stressed Rome, American Pope that bishops Benedict Catholic visiting XVI educators stressed Rome, should Pope that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time -- remain a In his latest address to American In his bishops latest address visiting to Rome, American In Pope his Benedict bishops latest address visiting XVI to stressed Rome, American Pope that Benedict bishops Catholic XVI visiting educators stressed Rome, should Pope that Benedict remain Catholic XVI true educators stressed the should faith that -- remain Catholic a reminder true educators issued the just should faith in time -- remain a for reminder another true tense to issued the season just faith in of time -- a for reminder another true tense issued to the season just faith in of time -- a commencement for reminder another tense issued addresses. season just in of time No, commencement for the another pope tense did addresses. season not of mention No, commencement the pope Georgetown did addresses. not University mention No, by the name pope Georgetown when did not discussing University mention the by name Catholic Georgetown when campus discussing University culture wars. the by name Catholic when commencement for another reminder tense issued season addresses. commencement for No, another pope tense did season not addr- of just in of time mention esses. commencement No, the Georgetown pope did not addresses. by No, name Georgetown the pope when did not University mention discussing University mention the by name Catholic Georgetown when campus discussing University culture the wars. by Catholic name when campus discussing culture wars. the Catholic campus discussing culture the wars. Catholic campus culture wars. campus culture wars. Barack Obama Senator State Senator Barack Obama Senator-elect Barack Obama Senator Barack Obama Illinois Democrat Vladimir Putin President-elect Vladimir V Putin Minister Vladimir Putin Acting President Vladimir V Putin President Vladimir V Putin Finding Temporal Co-references Filtering 1. Pope Benedict XVI 2. Pope Benedict 3. Benedict XVI 4. Cardinal Ratzinger 5. Pope 6. Benedict Co-References Benedict XVI Joseph Ratzinger Cardinal Ratzinger Evaluation Results Burst detection found total 73% of all change periods High recall for unsupervised method Machine Learning boosts precision Data Set: Thomas Risse 25/11/15 50

50 Motivation Optimal access to Web archives taking into account the temporal dimension of Web archives structured semantic information available on the Web social media and network information Objectives Evolution-Aware Entity-Based Enrichment and Indexing Aggregating Social Networks and Streams Temporal Retrieval and Ranking Collaborative Exploration and Analytics Testbeds Temporal Wikipedia Academic Web Archive Politics on the Web ALEXANDRIA Foundations for Temporal Retrieval, Exploration and Analytics in Web Archives More information: Thomas Risse 25/11/15 51

51 Conclusions Web Crawling Mature standard Web crawlers but still many challenges (e.g. Social Media, Dynamics) Manual interventions will always be necessary New user communities with different interests and requirements Additional crawling strategies are necessary Web Archive Access Current approaches far behind the State of the Art Increasing Web Archive interest will lead to increasing user expectations Different usage categories: Explorative search vs. Big Data analysis Legal Aspects are still a big issue Thomas Risse 25/11/15 52

52 Thank You! Dr. Thomas Risse Forschungszentrum L3S Leibniz Universität Hannover Appelstrasse 9a Hannover, Germany Telefon: Telefax: Thomas Risse 25/11/15 53

From Web Page Storage to Living Web Archives Thomas Risse

From Web Page Storage to Living Web Archives Thomas Risse From Web Page Storage to Living Web Archives Thomas Risse JISC, the DPC and the UK Web Archiving Consortium Workshop British Library, London, 21.7.2009 1 Agenda Web Crawlingtoday& Open Issues LiWA Living

More information

From Web Page Storage to Living Web Archives

From Web Page Storage to Living Web Archives From Web Page Storage to Living Web Archives Thomas Risse JISC, the DPC and the UK Web Archiving Consortium Workshop British Library, London, 21.7.2009 1 Agenda Web Crawling today & Open Issues LiWA Living

More information

Exploiting the Social and Semantic Web for guided Web Archiving

Exploiting the Social and Semantic Web for guided Web Archiving Exploiting the Social and Semantic Web for guided Web Archiving Thomas Risse 1, Stefan Dietze 1, Wim Peters 2, Katerina Doka 3, Yannis Stavrakas 3, and Pierre Senellart 4 1 L3S Research Center, Hannover,

More information

The Role of Language Evolution in Digital Archives

The Role of Language Evolution in Digital Archives The Role of Language Evolution in Digital Archives Thomas Risse, Helge Holzmann (L3S) Nina Tahmasebi, Uni Gothenburg L3S Research Center Web Science @ L3S Preserving, understanding and shaping the Web

More information

Accessing Web Archives

Accessing Web Archives Accessing Web Archives Web Science Course 2017 Helge Holzmann 05/16/2017 Helge Holzmann (holzmann@l3s.de) Not today s topic http://blog.archive.org/2016/09/19/the-internet-archive-turns-20/ 05/16/2017

More information

Technologies and Tools for Working with Web Archives

Technologies and Tools for Working with Web Archives Technologies and Tools for Working with Web Archives Helge Holzmann Web Data Engineer @ Internet Archive Researcher @ L3S Research Center, Leibniz Universität Hannover Conference on Preserving Digital

More information

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University

Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang. Microsoft Research, Asia School of EECS, Peking University Minghai Liu, Rui Cai, Ming Zhang, and Lei Zhang Microsoft Research, Asia School of EECS, Peking University Ordering Policies for Web Crawling Ordering policy To prioritize the URLs in a crawling queue

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Media Intelligence Business intelligence (BI) Uses data mining techniques and tools for the transformation of raw data into meaningful

More information

CS6200 Information Retreival. Crawling. June 10, 2015

CS6200 Information Retreival. Crawling. June 10, 2015 CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on

More information

NDSA Web Archiving Survey

NDSA Web Archiving Survey NDSA Web Archiving Survey Introduction In 2011 and 2013, the National Digital Stewardship Alliance (NDSA) conducted surveys of U.S. organizations currently or prospectively engaged in web archiving to

More information

Enhancing applications with Cognitive APIs IBM Corporation

Enhancing applications with Cognitive APIs IBM Corporation Enhancing applications with Cognitive APIs After you complete this section, you should understand: The Watson Developer Cloud offerings and APIs The benefits of commonly used Cognitive services 2 Watson

More information

Web Archiving at UTL

Web Archiving at UTL Web Archiving at UTL iskills workshops February 2018 Sam-chin Li Reference and Government Information Librarian, UTL Nich Worby Government Information and Statistics Librarian, UTL Agenda What is web archiving

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Empowering People with Knowledge the Next Frontier for Web Search Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Important Trends for Web Search Organizing all information Addressing user

More information

A Comparison Between The Performance of Wayback Machines

A Comparison Between The Performance of Wayback Machines A Comparison Between The Performance of Wayback Machines Fernando Melo, Daniel Bicho and Daniel Gomes Arquivo.pt - The Portuguese Web Archive {fernando.melo, daniel.bicho, daniel.gomes}@fccn.pt April 19,

More information

August 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience

August 14th - 18th 2005, Oslo, Norway. Web crawling : The Bibliothèque nationale de France experience World Library and Information Congress: 71th IFLA General Conference and Council "Libraries - A voyage of discovery" August 14th - 18th 2005, Oslo, Norway Conference Programme: http://www.ifla.org/iv/ifla71/programme.htm

More information

Annotating Web Archives structure, provenance, and context through archival cataloguing

Annotating Web Archives structure, provenance, and context through archival cataloguing Annotating Web Archives structure, provenance, and context through archival cataloguing Paul WU Horng-Jyh, PhD Nanyang Technological University hjwu@ntu.edu.sg Paul Wu NDL 19/02/2010 Agenda Background:

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Competitive Intelligence and Web Mining:

Competitive Intelligence and Web Mining: Competitive Intelligence and Web Mining: Domain Specific Web Spiders American University in Cairo (AUC) CSCE 590: Seminar1 Report Dr. Ahmed Rafea 2 P age Khalid Magdy Salama 3 P age Table of Contents Introduction

More information

Entity-centric Topic Extraction and Exploration: A Network-based Approach

Entity-centric Topic Extraction and Exploration: A Network-based Approach Entity-centric Topic Extraction and Exploration: A Network-based Approach Andreas Spitz and Michael Gertz March 27, 2018 ECIR 2018, Grenoble Heidelberg University, Germany Database Systems Research Group

More information

TempWeb rd Temporal Web Analytics Workshop

TempWeb rd Temporal Web Analytics Workshop TempWeb 2013 3 rd Temporal Web Analytics Workshop Stuff happens continuously: exploring Web contents with temporal information Omar Alonso Microsoft 13 May 2013 Disclaimer The views, opinions, positions,

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

FILTERING OF URLS USING WEBCRAWLER

FILTERING OF URLS USING WEBCRAWLER FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,

More information

Time-aware Approaches to Information Retrieval

Time-aware Approaches to Information Retrieval Time-aware Approaches to Information Retrieval Nattiya Kanhabua Department of Computer and Information Science Norwegian University of Science and Technology 24 February 2012 Motivation Searching documents

More information

Focused Crawling with

Focused Crawling with Focused Crawling with ApacheCon North America Vancouver, 2016 Hello! I am Sujen Shah Computer Science @ University of Southern California Research Intern @ NASA Jet Propulsion Laboratory Member of The

More information

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Web Search Basics The Web as a graph

More information

The University of Jordan. Accreditation & Quality Assurance Center. Curriculum for Doctorate Degree

The University of Jordan. Accreditation & Quality Assurance Center. Curriculum for Doctorate Degree Accreditation & Quality Assurance Center Curriculum for Doctorate Degree 1. Faculty King Abdullah II School for Information Technology 2. Department Computer Science الدكتوراة في علم الحاسوب (Arabic).3

More information

PID System for eresearch

PID System for eresearch the European Persistant Identifier Gesellschaft für wissenschaftliche Datenverarbeitung mbh Göttingen Am Fassberg, 37077 Göttingen ulrich.schwardmann@gwdg.de IZA/Gesis/RatSWD-WS Persistent Identifiers

More information

Workshop on Web Archiving

Workshop on Web Archiving Workshop on Web Archiving MODULE 1 A: WEB ARCHIVING Niels Brügger Asger Harlung Program, KB 10.40-11.50 Workshop part 1: Web archiving 11.50-12.10 Discussion and a short break before lunch 12.10-13.00

More information

Give me the info I need now Setting the text analysis strategy at Belga News Agency

Give me the info I need now Setting the text analysis strategy at Belga News Agency Give me the info I need now Setting the text analysis strategy at Belga News Agency Tom Wuytack ICT Manager, wut@belga.be Powered by Belga News Agency 2015 A few words on Belga News Agency Each day the

More information

SciVerse Scopus. 1. Scopus introduction and content coverage. 2. Scopus in comparison with Web of Science. 3. Basic functionalities of Scopus

SciVerse Scopus. 1. Scopus introduction and content coverage. 2. Scopus in comparison with Web of Science. 3. Basic functionalities of Scopus Prepared by: Jawad Sayadi Account Manager, United Kingdom Elsevier BV Radarweg 29 1043 NX Amsterdam The Netherlands J.Sayadi@elsevier.com SciVerse Scopus SciVerse Scopus 1. Scopus introduction and content

More information

DIGIT.B4 Big Data PoC

DIGIT.B4 Big Data PoC DIGIT.B4 Big Data PoC DIGIT 01 Social Media D02.01 PoC Requirements Table of contents 1 Introduction... 5 1.1 Context... 5 1.2 Objective... 5 2 Data SOURCES... 6 2.1 Data sources... 6 2.2 Data fields...

More information

Supporting Research Data Collection from YouTube with TubeKit

Supporting Research Data Collection from YouTube with TubeKit Supporting Research Data Collection from YouTube with TubeKit Chirag Shah School of Information & Library Science University of North Carolina Chapel Hill NC 27599, USA chirag@unc.edu Abstract We present

More information

WEB LIP6 STÉPHANE GANÇARSKI Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA,N. THOME, M. LAW

WEB LIP6 STÉPHANE GANÇARSKI Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA,N. THOME, M. LAW WEB ARCHIVING @ LIP6 STÉPHANE GANÇARSKI Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA,N. THOME, M. LAW French ANR project Cartec (ended) European project Scape (ends 9/2014) Web Archives Web is ephemeral

More information

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Shigeo Sugimoto Research Center for Knowledge Communities Graduate School of Library, Information

More information

Néonaute: mining web archives for linguistic analysis

Néonaute: mining web archives for linguistic analysis Néonaute: mining web archives for linguistic analysis Sara Aubry, Bibliothèque nationale de France Emmanuel Cartier, LIPN, University of Paris 13 Peter Stirling, Bibliothèque nationale de France IIPC Web

More information

Archiving and Preserving the Web. Kristine Hanna Internet Archive November 2006

Archiving and Preserving the Web. Kristine Hanna Internet Archive November 2006 Archiving and Preserving the Web Kristine Hanna Internet Archive November 2006 1 About Internet Archive Non profit founded in 1996 by Brewster Kahle, as an Internet library Provide universal and permanent

More information

Supporting Research Data Collection from YouTube with TubeKit

Supporting Research Data Collection from YouTube with TubeKit Supporting Research Data Collection from YouTube with TubeKit Chirag Shah School of Information & Library Science University of North Carolina chirag@unc.edu Abstract We present TubeKit, a query-based

More information

Collection Building on the Web. Basic Algorithm

Collection Building on the Web. Basic Algorithm Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring

More information

SEO: SEARCH ENGINE OPTIMISATION

SEO: SEARCH ENGINE OPTIMISATION SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that

More information

Associate Diploma in Web and Multimedia Development

Associate Diploma in Web and Multimedia Development Associate Diploma in Web and Multimedia Development Program Components CRD Major Support 4% University 8% University (UR) 5 College (CR) 9 Major (MR) 49 College 14% Major Support (MSR) 3 Training (Internship)

More information

Please note: Only the original curriculum in Danish language has legal validity in matters of discrepancy. CURRICULUM

Please note: Only the original curriculum in Danish language has legal validity in matters of discrepancy. CURRICULUM Please note: Only the original curriculum in Danish language has legal validity in matters of discrepancy. CURRICULUM CURRICULUM OF 1 SEPTEMBER 2008 FOR THE BACHELOR OF ARTS IN INTERNATIONAL BUSINESS COMMUNICATION:

More information

VISUAL SUMMARY ACCESS INTERNET AND WEB. The Internet, the Web, and Electronic Commerce

VISUAL SUMMARY ACCESS INTERNET AND WEB. The Internet, the Web, and Electronic Commerce VISUAL SUMMARY The Internet, the Web, and Electronic Commerce INTERNET AND WEB Internet Launched in 1969 with ARPANET, the Internet consists of the actual physical network. Web Introduced in 1991 at CERN,

More information

Community Tools and Best Practices for Harvesting and Preserving At-Risk Web Content ACA 2013

Community Tools and Best Practices for Harvesting and Preserving At-Risk Web Content ACA 2013 Community Tools and Best Practices for Harvesting and Preserving At-Risk Web Content ACA 2013 Scott Reed, Internet Archive Amanda Wakaruk, University of Alberta Libraries Kelly E. Lau, University of Alberta

More information

Focused Crawling with

Focused Crawling with Focused Crawling with ApacheCon North America Vancouver, 2016 Hello! I am Sujen Shah Computer Science @ University of Southern California Research Intern @ NASA Jet Propulsion Laboratory Member of The

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

ALL-KANSAS NEWS WEBSITE CRITIQUE BOOKLET. This guide is designed to be an educational device SCHOOL NAME: NEW WEBSITE NAME: YEAR: ADVISER:

ALL-KANSAS NEWS WEBSITE CRITIQUE BOOKLET. This guide is designed to be an educational device SCHOOL NAME: NEW WEBSITE NAME: YEAR: ADVISER: ÇP ALL-KANSAS NEWS WEBSITE CRITIQUE BOOKLET This guide is designed to be an educational device to improve the quality of your news website program. It is intended to point out positive aspects of your

More information

Anatomy of a Semantic Virus

Anatomy of a Semantic Virus Anatomy of a Semantic Virus Peyman Nasirifard Digital Enterprise Research Institute National University of Ireland, Galway IDA Business Park, Lower Dangan, Galway, Ireland peyman.nasirifard@deri.org Abstract.

More information

Communication (COMM) Communication (COMM) 1

Communication (COMM) Communication (COMM) 1 Communication (COMM) 1 Communication (COMM) COMM 110. Fundamentals of Public Speaking. 3 Credits. Theory and practice of public speaking with emphasis on content, organization, language, delivery, and

More information

Enhanced retrieval using semantic technologies:

Enhanced retrieval using semantic technologies: Enhanced retrieval using semantic technologies: Ontology based retrieval as a new search paradigm? - Considerations based on new projects at the Bavarian State Library Dr. Berthold Gillitzer 28. Mai 2008

More information

Preservation of Web Materials

Preservation of Web Materials Preservation of Web Materials Julie Dietrich INFO 560 Literature Review 7/20/13 1 Introduction Websites are a communication and informational tool that can be shared and updated across the World Wide Web.

More information

Cognos: Crowdsourcing Search for Topic Experts in Microblogs

Cognos: Crowdsourcing Search for Topic Experts in Microblogs Cognos: Crowdsourcing Search for Topic Experts in Microblogs Saptarshi Ghosh, Naveen Sharma, Fabricio Benevenuto, Niloy Ganguly, Krishna Gummadi IIT Kharagpur, India; UFOP, Brazil; MPI-SWS, Germany Topic

More information

Metadata Framework for Resource Discovery

Metadata Framework for Resource Discovery Submitted by: Metadata Strategy Catalytic Initiative 2006-05-01 Page 1 Section 1 Metadata Framework for Resource Discovery Overview We must find new ways to organize and describe our extraordinary information

More information

CAT (CURATOR ARCHIVING TOOL): IMPROVING ACCESS TO WEB ARCHIVES

CAT (CURATOR ARCHIVING TOOL): IMPROVING ACCESS TO WEB ARCHIVES CAT (CURATOR ARCHIVING TOOL): IMPROVING ACCESS TO WEB ARCHIVES Ciro Llueca, Daniel Cócera Biblioteca de Catalunya (BC) Barcelona, Spain padicat@bnc.cat Natalia Torres, Gerard Suades, Ricard de la Vega

More information

COUNCIL OF THE EUROPEAN UNION. Brussels, 28 January 2003 (OR. en) 15723/02 TELECOM 78 JAI 307 PESC 593

COUNCIL OF THE EUROPEAN UNION. Brussels, 28 January 2003 (OR. en) 15723/02 TELECOM 78 JAI 307 PESC 593 COUNCIL OF THE EUROPEAN UNION Brussels, 28 January 2003 (OR. en) 15723/02 TELECOM 78 JAI 307 PESC 593 LEGISLATIVE ACTS AND OTHER INSTRUMTS Subject : Council Resolution on a European approach towards a

More information

BPS Suite and the OCEG Capability Model. Mapping the OCEG Capability Model to the BPS Suite s product capability.

BPS Suite and the OCEG Capability Model. Mapping the OCEG Capability Model to the BPS Suite s product capability. BPS Suite and the OCEG Capability Model Mapping the OCEG Capability Model to the BPS Suite s product capability. BPS Contents Introduction... 2 GRC activities... 2 BPS and the Capability Model for GRC...

More information

CDL s Web Archiving System

CDL s Web Archiving System CDL s Web Archiving System Erik Hetzner UC3, California Digital Library 16 June 2011 Erik Hetzner (UC3, California Digital Library) CDL s Web Archiving System 16 June 2011 1 / 24 Introduction We don t

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

ALOE - A Socially Aware Learning Resource and Metadata Hub

ALOE - A Socially Aware Learning Resource and Metadata Hub ALOE - A Socially Aware Learning Resource and Metadata Hub Martin Memmel & Rafael Schirru Knowledge Management Department German Research Center for Artificial Intelligence DFKI GmbH, Trippstadter Straße

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Marketing & Back Office Management

Marketing & Back Office Management Marketing & Back Office Management Menu Management Add, Edit, Delete Menu Gallery Management Add, Edit, Delete Images Banner Management Update the banner image/background image in web ordering Online Data

More information

Chapter 3: Google Penguin, Panda, & Hummingbird

Chapter 3: Google Penguin, Panda, & Hummingbird Chapter 3: Google Penguin, Panda, & Hummingbird Search engine algorithms are based on a simple premise: searchers want an answer to their queries. For any search, there are hundreds or thousands of sites

More information

Preserving Public Government Information: The 2008 End of Term Crawl Project

Preserving Public Government Information: The 2008 End of Term Crawl Project Preserving Public Government Information: The 2008 End of Term Crawl Project Abbie Grotke, Library of Congress Mark Phillips, University of North Texas Libraries George Barnum, U.S. Government Printing

More information

Google Analytics 101

Google Analytics 101 Copyright GetABusinessMobileApp.com All rights reserved worldwide. YOUR RIGHTS: This book is restricted to your personal use only. It does not come with any other rights. LEGAL DISCLAIMER: This book is

More information

Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle

Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle Enriching Lifelong User Modelling with the Social e- Networking and e-commerce Pieces of the Puzzle Demetris Kyriacou Learning Societies Lab School of Electronics and Computer Science, University of Southampton

More information

KEY FUNCTIONS. Dynamic content. Conditional s. A/B/X testing. Creation. testing. Video s. Spam. testing. Advanced personalisation

KEY FUNCTIONS. Dynamic content. Conditional  s. A/B/X testing. Creation. testing. Video  s. Spam. testing. Advanced personalisation KEY FUNCTIONS Dynamic content Conditional emails Messages sent as part of one campaign will automatically adjust their looks to particular addressees preferences. Depending on the values of the characteristics

More information

Version 11

Version 11 The Big Challenges Networked and Electronic Media European Technology Platform The birth of a new sector www.nem-initiative.org Version 11 1. NEM IN THE WORLD The main objective of the Networked and Electronic

More information

ANNUAL REPORT Visit us at project.eu Supported by. Mission

ANNUAL REPORT Visit us at   project.eu Supported by. Mission Mission ANNUAL REPORT 2011 The Web has proved to be an unprecedented success for facilitating the publication, use and exchange of information, at planetary scale, on virtually every topic, and representing

More information

Interoperability and transparency The European context

Interoperability and transparency The European context JOINING UP GOVERNMENTS EUROPEAN COMMISSION Interoperability and transparency The European context ITAPA 2011, Bratislava Francisco García Morán Director General for Informatics Background 2 3 Every European

More information

UAE National Space Policy Agenda Item 11; LSC April By: Space Policy and Regulations Directory

UAE National Space Policy Agenda Item 11; LSC April By: Space Policy and Regulations Directory UAE National Space Policy Agenda Item 11; LSC 2017 06 April 2017 By: Space Policy and Regulations Directory 1 Federal Decree Law No.1 of 2014 establishes the UAE Space Agency UAE Space Agency Objectives

More information

Software Requirements Specification for the Names project prototype

Software Requirements Specification for the Names project prototype Software Requirements Specification for the Names project prototype Prepared for the JISC Names Project by Daniel Needham, Amanda Hill, Alan Danskin & Stephen Andrews April 2008 1 Table of Contents 1.

More information

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India Web Crawling Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India - 382 481. Abstract- A web crawler is a relatively simple automated program

More information

QUALITY SEO LINK BUILDING

QUALITY SEO LINK BUILDING QUALITY SEO LINK BUILDING Developing Your Online Profile through Quality Links TABLE OF CONTENTS Introduction The Impact Links Have on Your Search Profile 02 Chapter II Evaluating Your Link Profile 03

More information

Overview of Web Mining Techniques and its Application towards Web

Overview of Web Mining Techniques and its Application towards Web Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous

More information

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface

Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Content Based Smart Crawler For Efficiently Harvesting Deep Web Interface Prof. T.P.Aher(ME), Ms.Rupal R.Boob, Ms.Saburi V.Dhole, Ms.Dipika B.Avhad, Ms.Suvarna S.Burkul 1 Assistant Professor, Computer

More information

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project

Medici for Digital Cultural Heritage Libraries. George Tsouloupas, PhD The LinkSCEEM Project Medici for Digital Cultural Heritage Libraries George Tsouloupas, PhD The LinkSCEEM Project Overview of Digital Libraries A Digital Library: "An informal definition of a digital library is a managed collection

More information

Archiving the Web: What can Institutions learn from National and International Web Archiving Initiatives

Archiving the Web: What can Institutions learn from National and International Web Archiving Initiatives Archiving the Web: What can Institutions learn from National and International Web Archiving Initiatives Maureen Pennock Michael Day Lizzie Richmond UKOLN University of Bath UKOLN University of Bath University

More information

Legal Deposit of Online Newspapers at the BnF - Clément Oury - IFLA PAC Paris 2012

Legal Deposit of Online Newspapers at the BnF - Clément Oury - IFLA PAC Paris 2012 1 Legal Deposit of Online Newspapers Digital collections in BnF stacks Clément Oury Head of Digital Legal Deposit Bibliothèque nationale de France Summary The issue : ensuring the continuity of BnF heritage

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection

More information

Oleksandr Kuzomin, Bohdan Tkachenko

Oleksandr Kuzomin, Bohdan Tkachenko International Journal "Information Technologies Knowledge" Volume 9, Number 2, 2015 131 INTELLECTUAL SEARCH ENGINE OF ADEQUATE INFORMATION IN INTERNET FOR CREATING DATABASES AND KNOWLEDGE BASES Oleksandr

More information

ONLINE EVALUATION FOR: Company Name

ONLINE EVALUATION FOR: Company Name ONLINE EVALUATION FOR: Company Name Address Phone URL media advertising design P.O. Box 2430 Issaquah, WA 98027 (800) 597-1686 platypuslocal.com SUMMARY A Thank You From Platypus: Thank you for purchasing

More information

EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture. Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members

EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture. Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members EUDAT Training 2 nd EUDAT Conference, Rome October 28 th Introduction, Vision and Architecture Giuseppe Fiameni CINECA Rob Baxter EPCC EUDAT members Agenda Background information Services Common Data Infrastructure

More information

Crawling the Web. Web Crawling. Main Issues I. Type of crawl

Crawling the Web. Web Crawling. Main Issues I. Type of crawl Web Crawling Crawling the Web v Retrieve (for indexing, storage, ) Web pages by using the links found on a page to locate more pages. Must have some starting point 1 2 Type of crawl Web crawl versus crawl

More information

Technology Brown Bag: Web 2.0

Technology Brown Bag: Web 2.0 Technology Brown Bag: Web 2.0 Schedule information Event Technology Brown Bag: Web 2.0 When Thursday, May 4, 2006 from 12:00pm to 1:30pm Where Harris 1300 Event details Details Access Contact What is Web

More information

CHAPTER 4 EVALUATION

CHAPTER 4 EVALUATION 45 CHAPTER 4 EVALUATION Three degrees of evaluation were performed. The first degree determines how effective the process of conforming tools to this specification is in accomplishing the task of preserving

More information

Role of Social Media and Semantic WEB in Libraries

Role of Social Media and Semantic WEB in Libraries Role of Social Media and Semantic WEB in Libraries By Dr. Anwar us Saeed Email: anwarussaeed@yahoo.com Layout Plan Where Library streams merge the WEB Recent Evolution of the WEB Social WEB Semantic WEB

More information

Explorations of Netarkivet: Preliminary Findings. Emily Maemura, PhD Candidate Faculty of Information, University of Toronto NetLab Forum May 4, 2018

Explorations of Netarkivet: Preliminary Findings. Emily Maemura, PhD Candidate Faculty of Information, University of Toronto NetLab Forum May 4, 2018 Explorations of Netarkivet: Preliminary Findings Emily Maemura, PhD Candidate Faculty of Information, University of Toronto NetLab Forum May 4, 2018 Elements of Provenance Maemura, Worby, Milligan, Becker,

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Archiving dynamic websites: a considerations framework and a tool comparison

Archiving dynamic websites: a considerations framework and a tool comparison Archiving dynamic websites: a considerations framework and a tool comparison Allard Oelen Student number: 2560801 Supervisor: Victor de Boer 30 June 2017 Vrije Universiteit Amsterdam a.b.oelen@student.vu.nl

More information

ALEXANDRIA. Temporal Retrieval, Exploration and Analytics in Web Archives. Wolfgang Nejdl. L3S Research Center Hannover, Germany

ALEXANDRIA. Temporal Retrieval, Exploration and Analytics in Web Archives. Wolfgang Nejdl. L3S Research Center Hannover, Germany ALEXANDRIA Temporal Retrieval, Exploration and Analytics in Web Archives Wolfgang Nejdl L3S Research Center Hannover, Germany Looking back: The Austrian Socialist Party and Europe What is missing? ALEXANDRIA

More information

Open-Source Natural Language Processing and Computational Archival Science

Open-Source Natural Language Processing and Computational Archival Science Open-Source Natural Language Processing and Computational Archival Science Kalina Bontcheva University of Sheffield @kbontcheva The University of Sheffield, 1995-2018 This work is licensed under the Creative

More information

Building Institutional Repositories: Emerging Challenges

Building Institutional Repositories: Emerging Challenges University of Nebraska at Omaha From the SelectedWorks of Yumi Ohira 2014 Building Institutional Repositories: Emerging Challenges Yumi Ohira, University of Nebraska at Omaha Available at: https://works.bepress.com/yumi-ohira/3/

More information

Ranking of ads. Sponsored Search

Ranking of ads. Sponsored Search Sponsored Search Ranking of ads Goto model: Rank according to how much advertiser pays Current model: Balance auction price and relevance Irrelevant ads (few click-throughs) Decrease opportunities for

More information

ASSIUT UNIVERSITY. Faculty of Computers and Information Department of Information Systems. IS Ph.D. Program. Page 0

ASSIUT UNIVERSITY. Faculty of Computers and Information Department of Information Systems. IS Ph.D. Program. Page 0 ASSIUT UNIVERSITY Faculty of Computers and Information Department of Information Systems Informatiio on Systems PhD Program IS Ph.D. Program Page 0 Assiut University Faculty of Computers & Informationn

More information

Website Acecore Technologies JL B.V.:

Website Acecore Technologies JL B.V.: Privacy policy Acecore Technologies JL B.V. B.V. Acecore Technologies JL B.V. (hereafter: Acecore technologies JL B.V.) focusses on the development and shipping of drones in the creative, industrial and

More information

Essential for Employee Engagement. Frequently Asked Questions

Essential for Employee Engagement. Frequently Asked Questions Essential for Employee Engagement The Essential Communications Intranet is a solution that provides clients with a ready to use framework that takes advantage of core SharePoint elements and our years

More information

SUMMON WEB-SCALE DISCOVERY. ADA University Baku 02/04/2014

SUMMON WEB-SCALE DISCOVERY. ADA University Baku 02/04/2014 SUMMON WEB-SCALE DISCOVERY ADA University Baku 02/04/2014 Why an Automated Management Solution is Important Academic Library Expenditures on Purchased and Licensed Content 90% 80% 70% 60% 50% 40% 30% 20%

More information

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for.

OnCrawl Metrics. What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. 1 OnCrawl Metrics What SEO indicators do we analyze for you? Dig into our board of metrics to find the one you are looking for. UNLEASH YOUR SEO POTENTIAL Table of content 01 Crawl Analysis 02 Logs Monitoring

More information