Building a Digital Library Software

Building a Digital Library Software: INVENIO, Part 1
J-Y. Le Meur, Department of Information Technology, CERN
JINR-CERN School on GRID and Information Management Systems, 14 May 2012

Outline 1 2 3 4

A physicist's office at CERN: the "Non-Digital" Library

Specialized Software
Specialist software for running a digital library:
- Content is organized and ready for exchange: support of interoperability protocols
- Metadata and data are preserved for the long term: support of preservation standards
- Submission, editing and curation processes are supported
- Dissemination is organized and controlled
- Combines traditional Library Systems, Document Management SW and Engine SW
- Examples: EPrints, DuraSpace, Greenstone...
Institutional repository software focuses primarily on ingest, preservation and access of locally produced documents, particularly locally produced academic outputs.

Invenio DL SW History
- 1954: CERN laboratory is created
- 1989: Tim Berners-Lee invents the Web
- 1991: HEP SPIRES/arXiv is the first DL on the Web
- 1993: CERN Preprint Server starts as an institutional and disciplinary repository
- 1996: CERN Library Server includes Books and Periodicals, as a hybrid library
- 2000: CERN Document Server serves Multimedia material and restricted notes
- 2002: CDSWare SW released as Open Source
- 2006: CDSWare becomes Invenio; start of I18N collaborations
- 2010: Invenio 1.0 released and adopted world-wide
- 2012: First Invenio User Group Workshop is organized

Project Overview: Invenio DL SW
- Open Source GPL project
- Linux, server side
- Medium to big data repositories
- Flexible at every layer
- Technology: Python (and C and Lisp), MySQL and Apache + mod_wsgi (WSGI: Web Server Gateway Interface)
- Supports library standards: XML-MARC, MARC21, OAI-PMH, OpenURL...
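
To make the WSGI acronym concrete, here is a minimal sketch of a WSGI application in Python, the kind of callable that mod_wsgi hands Apache requests to. The names `application` and `call_app` and the response text are illustrative assumptions, not Invenio code.

```python
def application(environ, start_response):
    """Minimal WSGI callable: the server passes the request environment
    and a start_response function; we return an iterable of bytes."""
    path = environ.get("PATH_INFO", "/")
    body = ("Hello from a WSGI app, you asked for %s\n" % path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(path):
    """Drive the callable by hand, the way a WSGI server would."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    chunks = application(environ, start_response)
    return captured["status"], b"".join(chunks)


if __name__ == "__main__":
    print(call_app("/record/1"))
```

The same `application` callable could be served for local testing with the standard library's `wsgiref.simple_server`, which is why the interface makes the web layer easy to swap.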

Library Standards: exchange, identifiers and preservation
- Exchange protocols: Z39.50 and OAI-PMH between Data and Service providers
- Interoperability: SWORD = Simple Web-service Offering Repository Deposit
- Identifiers: ISBN, DOI, PURL, etc.
- Preservation: METS, PDF/A, OAIS
  - Content description: Metadata Encoding and Transmission Standard (METS)
  - Data formats: PDF/A
  - Supporting system: Open Archival Information System (OAIS) reference model
- Content representation: MARC, DC, XML-MARC
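
As an illustration of the XML-MARC content representation, the sketch below builds a small MARCXML-style record with the standard library. The helper `make_marcxml` and the sample author and title are made up for the example; the XML namespace is omitted for brevity, and only the general `datafield`/`subfield` layout is taken from the MARCXML format.

```python
import xml.etree.ElementTree as ET


def make_marcxml(fields):
    """Build a <record> from (tag, ind1, ind2, [(code, value), ...]) tuples."""
    record = ET.Element("record")
    for tag, ind1, ind2, subfields in fields:
        datafield = ET.SubElement(
            record, "datafield", {"tag": tag, "ind1": ind1, "ind2": ind2})
        for code, value in subfields:
            subfield = ET.SubElement(datafield, "subfield", {"code": code})
            subfield.text = value
    return record


record = make_marcxml([
    ("100", " ", " ", [("a", "Doe, J.")]),           # main author (made up)
    ("245", " ", " ", [("a", "An example title")]),  # title (made up)
])
marcxml = ET.tostring(record, encoding="unicode")
print(marcxml)
```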

De facto Standards: plugin examples
- LibX: Invenio toolbar compatible with LibX: http://libx.org/editions/download.php?edition=4f46cd81
  - Can be integrated with the Internet Explorer and Firefox browsers
  - Integrates with the main digital content websites, including Amazon, Google Scholar, Wikipedia
  - Highlighted text from a web page can be used to directly query an Invenio installation
- Zotero: export DL content to Zotero, a Firefox plugin for compiling CVs, etc.
- Cooliris: browse multimedia content as a 3-dimensional wall (integration with the Cooliris plugin)

Technology Component concepts 3 Tier architecture:

Technology Overview languages

Why Python?
- Easy to read and understand: good for many temporary developers
- Suitable for rapid prototyping: good for an organic-growth software development model
- Write code to throw it away

Why Python? art of ikebana programming (T. Simko)

Why Python? Speeding up Python
- Bytecode-interpreted language: what about speed?
- Cython makes it easy to write C extensions
- Combines the efficiency of C with the high-levelness of Python
- Declare C types on variables and class attributes

Domain logic

Ingestion Modules Overview

Ingestion Modules: Submission by Humans
- The ingestion is performed by humans (authors, secretaries, cataloguers, etc.)
- WebSubmit is a framework that helps collect user data and create MARCXML records, plus other workflow-related goodies.
- "" Strategy.

Ingestion Modules Submission: usual Web Form

Ingestion Modules Submission: unusual Web Form

Ingestion Modules Submission: Behind the Form

Ingestion Modules Submission: interfaces, workflows and functions

Ingestion Modules: Submission by Robots
Ingestion by robots, three use cases:
- Pulling from an OAI-PMH compatible source
- Pulling from a non-compatible source
- Pushing from external systems into the DL

Ingestion Pull: Harvesting from an OAI source
Pulling from an OAI-PMH source:
- OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting
- Data Provider vs Service Provider
- XML (Dublin Core) over HTTP
- Commercial search engines have started using OAI-PMH to acquire/deliver more resources
- Incremental harvesting helps reduce network traffic and other resource usage
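
The incremental harvesting mentioned above can be sketched as follows: a service provider asks a data provider only for records changed since a given date, then follows resumption tokens page by page. The base URL, the sample response and the helper names are assumptions made for the example; only the `ListRecords` verb and the `metadataPrefix`, `from` and `resumptionToken` arguments come from the OAI-PMH protocol itself. No network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, since=None, resumption_token=None):
    """Build a ListRecords request; `since` (YYYY-MM-DD) makes it incremental."""
    if resumption_token:
        # resumptionToken is exclusive: it replaces every argument except verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if since:
            params["from"] = since
    return base_url + "?" + urlencode(params)


def parse_identifiers(response_xml):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(response_xml)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hand-written sample response (hypothetical data provider).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.org/oai2d", since="2012-05-01")
ids, token = parse_identifiers(SAMPLE)
print(url)
print(ids, token)
```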

Ingestion Push: Robot Upload
Pushing records into the Digital Library:
- To POST (the HTTP way) a record
- Insert/replace/modify/delete
- Authorization: checked by IP or API key (à la Twitter)
- Upload files via the special protocol FFT (Invenio)
- Feedback: immediate HTTP error code in case of invalid MARCXML or other request error
Examples at CERN:
- An experiment (CMS) pushes records automatically to the CERN Document Server
- Event recording SW automatically pushes talks held at CERN to CDS
- The CERN Drupal web infrastructure pushes photos and other documents to CDS
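
A rough sketch of the push side, assuming nothing about the real Invenio endpoint: the URL layout `<base_url>/<mode>`, the helper names and the content type header are placeholders chosen for the example. It only shows the shape of the exchange described above: one HTTP POST per record, one of four modes, and an HTTP error code as immediate feedback.

```python
import urllib.error
import urllib.request

MODES = ("insert", "replace", "modify", "delete")


def build_push_request(base_url, marcxml, mode="insert"):
    """Prepare (but do not send) an HTTP POST carrying one MARCXML record.

    The URL layout `<base_url>/<mode>` is a placeholder assumption.
    """
    if mode not in MODES:
        raise ValueError("unknown mode: %s" % mode)
    return urllib.request.Request(
        url="%s/%s" % (base_url.rstrip("/"), mode),
        data=marcxml.encode("utf-8"),
        headers={"Content-Type": "application/marcxml+xml"},
        method="POST",
    )


def push(request):
    """Send the request; a rejected record surfaces as an HTTP error code."""
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Build only; calling push(req) would need a real server.
req = build_push_request("http://example.org/upload/", "<record/>", "replace")
print(req.get_method(), req.full_url)
```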

Which one?
- Example 1: Nikos wants to create an archive of all the blogs about High Energy Physics. E.g. Nikos works at CERN and he must harvest blog posts from Fermilab Quantum Diaries.
- Example 2: Sam wants to create an archive of all the scheduled TV programmes in his nation. E.g. Sam lives in France and he would like to harvest content from the TV Broadcast site Telerama.fr.

The TV DL
Example 2: Sam wants to create an archive of all the scheduled TV programmes in his nation. E.g. Sam lives in France and he would like to harvest content from the TV Broadcast site Telerama.fr.
Good luck Sam!

The TV DL
- Find a common input standard: for TV programmes this is XMLTV
- Map the input to MARCXML
- Wrap the harvesting, conversion and uploading into a Tasklet
That's all!

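The TV DL
Mapping XMLTV to XMLMARC: XML Conversion

    # Reconstructed from the slide: quotes and braces were lost in extraction,
    # so the dict nesting and the replace() arguments below are assumptions.
    # smart_add_field, rec, programme, channel_map, start_time and original_xml
    # come from the surrounding conversion code, which the slide does not show.
    smart_add_field(rec, {
        '245': {'a': programme.get('title'),
                'b': programme.get('sub-title')},                 ## Title
        '260': {'b': channel_map[programme['channel']],
                'c': start_time},                                 ## "Place"
        '269': {'c': programme.get('date')},                      ## Date
        '520': {'a': programme.get('desc')},                      ## Summary
        '037': {'a': u'-'.join([
            programme.get('channel', ''),
            programme.get('start', '').replace(' ', '')])},
        '088': {'a': u'-'.join([
            programme.get('title', [('', '')])[0][0].replace(' ', ''),
            programme.get('episode-num', [('', '')])[0][0]])},
        'FMT': {'g': original_xml, 'f': 'xmltv'},
    })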

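The TV DL
Tasklet programming: main function of the bibtasklet

    import os
    import tempfile

    # Invenio 1.x module layout assumed for this import.
    from invenio.bibtask import write_message, task_low_level_submission

    # grab_xmltv, xmltv2marcxml and tmpdir are defined elsewhere in the
    # tasklet module; the slide does not show them.

    def bst_xmltv2marcxml():
        """Harvest XMLTV data, convert it to MARCXML and schedule the upload."""
        write_message("grabbing XMLTV data")
        xmltvfile = grab_xmltv()
        write_message("xmltv data saved into %s" % xmltvfile)
        # Write the converted records to a temporary .xml file.
        fd, marcxmlfile = tempfile.mkstemp(dir=tmpdir, suffix='.xml')
        os.write(fd, xmltv2marcxml(open(xmltvfile)))
        os.close(fd)
        write_message("derived MARCXML saved into %s" % marcxmlfile)
        # Hand the file to bibupload in insert-or-replace mode.
        task_id = task_low_level_submission("bibupload", "xmltv", "-ir", marcxmlfile)
        write_message("bibupload scheduled with task id %s" % task_id)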

The TV DL Scheduler with pending tasklets:

The TV DL: Push or Pull?

Processing Modules Overview

Processing Modules Example: indexing

Building Indexes: designing a search engine
Performance-driven design assumptions:
- high number of selects, low number of updates
- fast searching, slow indexation
- cache everything cacheable
Search functionality:
- search for words, phrases, regular expressions
- search in any field, authors, titles, etc.
Index design:
- forward indexes: rec1 -> [word1, word8, ...], rec2 -> [word1, word2, ...]
- reverse indexes: word1 -> [rec1, rec2, ...], word2 -> [rec2, rec7, ...]
Zipf's law on word frequency:
- few words occur very often (e.g. "the")
- most words are infrequent (even e.g. "boson")
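
The forward and reverse indexes described above can be sketched in a few lines of Python. This is a toy in-memory version for illustration only: the tokenizer, the function names and the sample records are assumptions, and the real indexes live in the database.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


def build_indexes(records):
    """Return (forward, reverse): recid -> words, and word -> set of recids."""
    forward = {}
    reverse = defaultdict(set)
    for recid, text in records.items():
        words = tokenize(text)
        forward[recid] = words
        for word in words:
            reverse[word].add(recid)
    return forward, reverse


def search(reverse, query):
    """AND-search: intersect the record sets of every query word."""
    sets = [reverse.get(w, set()) for w in tokenize(query)]
    return set.intersection(*sets) if sets else set()


records = {
    1: "The Higgs boson search",
    2: "The search for supersymmetry",
    3: "Boson statistics",
}
forward, reverse = build_indexes(records)
print(search(reverse, "the search"))
print(search(reverse, "boson"))
```

The reverse index answers queries with set intersections only, which fits the "many selects, few updates" assumption; the forward index is mainly useful when a record changes and its old words must be removed.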

Building Indexes: Optimizing
Three important speed factors to consider:
- speed of finding sets (DB Server)
- speed of demarshalling sets (DB <-> Web App Server)
- speed of intersecting sets (Web App Server)
Optimizing data structures:
- data structures tested: sorted (lists, Patricia trees), unsorted (hashed sets, binary vectors)
- fast prototyping: (Python, Common Lisp)
Binary vectors found to be the best compromise:
- typical search time gain: 4.0 sec -> 0.2 sec
- typical indexing time loss: 7 hours -> 4 days
- mostly sparse data modelled via a mostly dense data structure
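
To show why binary vectors make set intersection cheap, here is a small sketch that stores a set of record IDs as the bits of a Python integer, so that intersecting two hit sets is a single bitwise AND. It illustrates the idea only; the function names are made up and the production implementation is not shown on the slide.

```python
def to_bitvector(recids):
    """Encode a set of record IDs as an integer with one bit per record."""
    vector = 0
    for recid in recids:
        vector |= 1 << recid
    return vector


def from_bitvector(vector):
    """Decode the bit vector back into a sorted list of record IDs."""
    recids = []
    recid = 0
    while vector:
        if vector & 1:
            recids.append(recid)
        vector >>= 1
        recid += 1
    return recids


hits_boson = to_bitvector({1, 3, 7})
hits_higgs = to_bitvector({3, 7, 9})

both = from_bitvector(hits_boson & hits_higgs)    # AND of the two hit sets
either = from_bitvector(hits_boson | hits_higgs)  # OR of the two hit sets
print(both, either)
```

The trade-off matches the slide: the vector spends one bit on every possible record ID even when few are set (dense structure for sparse data), in exchange for very fast intersections.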

Processing Modules: Sorting Quickly Very Large Sets
- A processing phase is needed to generate Sorting Buckets
- Example at search time: the first 10 records sorted ascending by title
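
One possible reading of the sorting-bucket idea, sketched under stated assumptions: an offline phase ranks all records by title and assigns each one a bucket number, and at search time only the buckets needed to fill the first page are fully sorted. The function names, the bucket count and the sample titles are invented for the example and may differ from how Invenio actually does it.

```python
def build_buckets(titles, n_buckets):
    """Offline phase: rank all records by title and split the ranking into
    n_buckets contiguous buckets. Returns recid -> bucket number."""
    ordered = sorted(titles, key=lambda recid: titles[recid].lower())
    size = max(1, -(-len(ordered) // n_buckets))  # ceiling division
    return {recid: rank // size for rank, recid in enumerate(ordered)}


def first_sorted(hits, bucket_of, titles, limit=10):
    """Search phase: walk buckets in order and fully sort only the buckets
    needed to fill the first `limit` results."""
    by_bucket = {}
    for recid in hits:
        by_bucket.setdefault(bucket_of[recid], []).append(recid)
    result = []
    for bucket in sorted(by_bucket):
        result.extend(sorted(by_bucket[bucket], key=lambda r: titles[r].lower()))
        if len(result) >= limit:
            break
    return result[:limit]


titles = {1: "Zeta", 2: "alpha", 3: "Gamma", 4: "beta", 5: "Delta", 6: "epsilon"}
bucket_of = build_buckets(titles, n_buckets=3)
print(first_sorted({1, 2, 3, 5}, bucket_of, titles, limit=2))
```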

Processing Modules: Citation-graph based (L. Marian)
- Citation Counts
- Time-dependent Citation Counts
- Link-based
- Time-dependent Link-based
- Link-based with External Citations
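
The first two methods in the list can be illustrated with a short sketch. The slide gives no formulas, so the exponential decay with a half-life used for the time-dependent variant is purely an assumption for the example, as are the function names and the sample citation data.

```python
def citation_counts(citations):
    """citations: list of (citing_recid, cited_recid, year). Plain counts."""
    counts = {}
    for _citing, cited, _year in citations:
        counts[cited] = counts.get(cited, 0) + 1
    return counts


def time_dependent_counts(citations, now, half_life=5.0):
    """Assumed weighting: each citation counts 0.5 ** (age / half_life),
    so recent citations weigh more than old ones."""
    scores = {}
    for _citing, cited, year in citations:
        weight = 0.5 ** ((now - year) / half_life)
        scores[cited] = scores.get(cited, 0.0) + weight
    return scores


# Made-up citation graph: (citing record, cited record, year of citation).
citations = [(2, 1, 2002), (3, 1, 2007), (4, 3, 2012), (5, 3, 2012)]

counts = citation_counts(citations)
scores = time_dependent_counts(citations, now=2012)
print(counts)
print(scores)
```

With plain counts records 1 and 3 tie; with the time-dependent variant record 3 ranks higher because its citations are recent.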

Processing Modules Citation-graph based methods

Processing Modules: D-Rank
D-Rank: Distributed technology
- One method to rule them all
- One method that can aggregate all the existing ranking methods + user feedback
- Readjustment of parameters based on user feedback as a function of Relevance and Quality

Processing Modules Citation-graph based methods Spying the "life" of a record in all queries:
