Building a Software INVENIO, Part 1 J-Y. Le Meur Department of Information Technology CERN JINR-CERN School on GRID and Information Management Systems 14 May 2012
Outline 1 2 3 4
A physicist office at CERN: the "Non-Digital" Library
Specialized Software Specialist software for running a digital library: content is organized and ready for exchange, with support for interoperability protocols; metadata and data are preserved for the long term, with support for preservation standards; submission, editing and curation processes are supported; dissemination is organized and controlled. It combines traditional Library Systems, Document Management SW and Search Engine SW. Examples: EPrints, DuraSpace, Greenstone... Institutional repository software focuses primarily on the ingest, preservation and access of locally produced documents, particularly locally produced academic outputs.
Invenio DL SW History 1954: CERN laboratory is created 1989: Tim Berners-Lee invents the Web 1991: HEP SPIRES/ArXiv is the first DL on the Web 1993: CERN Preprint Server starts as an institutional and disciplinary repository 1996: CERN Library Server includes Books and Periodicals, as a hybrid library 2000: CERN Document Server serves Multimedia material and restricted notes 2002: CDSWare SW released Open Source 2006: CDSWare becomes Invenio; start of I18N collaborations 2010: Invenio 1.0 released and adopted world-wide 2012: First Invenio User Group Workshop is organized
Project Overview Invenio DL SW Open Source GPL project Linux Server side Medium to big data repositories Flexible at every layer Technology: Python (and C and Lisp), MySQL and Apache + mod_wsgi WSGI: Web Server Gateway Interface Supports Library standards: XML-MARC, MARC21, OAI-PMH, OpenURL...
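The slide mentions that Invenio runs under Apache with mod_wsgi. The essence of WSGI is a single callable that receives the request environment and a response-starting function; the sketch below is a minimal illustrative example of that interface, not Invenio code.

```python
# Minimal WSGI application, the kind of callable Apache + mod_wsgi
# expects.  Illustrative sketch only; Invenio's actual entry point
# is more elaborate.
def application(environ, start_response):
    body = b"Hello from a WSGI app"
    start_response("200 OK",
                   [("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body)))])
    # WSGI applications return an iterable of byte strings
    return [body]
```

Any WSGI-compliant server (including the standard library's wsgiref) can host such a callable, which is what makes the web-server layer swappable.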
Library Standards exchange, identifiers and preservation Exchange protocols: Z39.50 and OAI-PMH between Data and Service providers Interoperability: SWORD = Simple Web-service Offering Repository Deposit Identifiers: ISBN, DOI, PURL, etc Preservation: METS (Metadata Encoding and Transmission Standard) for content description, PDF/A for data formats, OAIS (Open Archival Information System) reference model for the supporting system Content representation: MARC, DC, XML-MARC
De facto Standards Plugins examples Compatibility with LibX: Invenio toolbar LibX: http://libx.org/editions/download.php?edition=4f46cd81 Can be integrated with the Internet Explorer and Firefox browsers Integration with the main digital content websites including Amazon, Google Scholar, Wikipedia Highlighted text from a web page can be used to directly query an Invenio installation Zotero: Export DL content to the Zotero Firefox plugin for compiling CVs, etc Cooliris: Browsing multimedia content as a 3-dimensional wall (integration with the Cooliris plugin)
Technology Component concepts 3 Tier architecture:
Technology Overview languages
Why Python? languages Easy to read and understand: good for many temporary developers Suitable for rapid prototyping: good for organic-growth software development model Write code to throw it away
Why Python? art of ikebana programming (T. Simko)
Why Python? Speeding up Python A bytecode-interpreted language: what about speed? Cython makes it possible to write C extensions easily, combining the efficiency of C with the high-levelness of Python: declare C types on variables and class attributes:
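A minimal Cython sketch of the idea (illustrative, not from the slides): once the loop variables carry C type declarations, the compiler can emit a tight C loop instead of interpreting bytecode.

```cython
# Illustrative Cython sketch: cdef declarations on variables let the
# generated C code avoid Python object overhead in the inner loop.
def integrate_x_squared(double a, double b, int n):
    cdef int i
    cdef double dx = (b - a) / n
    cdef double s = 0.0
    for i in range(n):
        s += (a + i * dx) ** 2 * dx   # rectangle-rule integral of x^2
    return s
```

The same function runs unmodified as pure Python (minus the cdef lines); the speed-up comes only from the added type declarations.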
Domain logic
Ingestion Modules Overview
Ingestion Modules Submission by Humans The ingestion is performed by humans (authors, secretaries, cataloguers, etc). WebSubmit is a framework that helps collect user data and create MARCXML records, plus other workflow-related goodies.
Ingestion Modules Submission: usual Web Form
Ingestion Modules Submission: unusual Web Form
Ingestion Modules Submission: Behind the Form
Ingestion Modules Submission: interfaces, workflows and functions
Ingestion Modules Submission by Robots Ingestion by robots, three use cases: Pulling from an OAI-PMH compatible source Pulling from a non-compatible source Pushing from external systems into the DL
Ingestion Pull: Harvesting from OAI source Pulling from an OAI-PMH source OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting Data Provider vs Service Provider XML (Dublin Core) over HTTP Commercial search engines have started using OAI-PMH to acquire/deliver more resources Helps reduce network traffic and other resource usage by doing incremental harvesting
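The incremental harvesting mentioned above rests on two OAI-PMH mechanisms: the `from` argument (only records changed since a date) and the resumption token (pagination of large result sets). A small sketch, assuming a hypothetical base URL; Invenio ships its own harvester, so this only illustrates the protocol mechanics.

```python
# Sketch of OAI-PMH request building and resumption-token handling.
# The base URL below is a hypothetical example.
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records_url(base_url, from_date=None, token=None):
    """Build a ListRecords request.  Per the protocol, a resumption
    token replaces every other argument on follow-up requests."""
    args = {"verb": "ListRecords"}
    if token:
        args["resumptionToken"] = token
    else:
        args["metadataPrefix"] = "oai_dc"
        if from_date:
            args["from"] = from_date   # incremental harvesting
    return base_url + "?" + urllib.parse.urlencode(args)

def resumption_token(response_xml):
    """Extract the resumptionToken, or None on the last page."""
    node = ET.fromstring(response_xml).find(".//%sresumptionToken" % OAI_NS)
    return node.text if node is not None and node.text else None
```

A harvester loops: request, upload the page of records, read the token, repeat until `resumption_token` returns None.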
Ingestion Push: Robot Upload Pushing records into the DL: POST (the HTTP way) a record Insert/replace/modify/delete Authorization: checked by IP or API key (a la Twitter) Upload files via the special FFT protocol (Invenio) Feedback: immediate HTTP error code in case of invalid MARCXML or other request error Examples at CERN: a CERN experiment (CMS) pushes records automatically to the CERN Document Server Event-recording SW pushes talks held at CERN into CDS automatically The CERN Drupal web infrastructure pushes photos and other documents to CDS
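The push side boils down to an authenticated HTTP POST carrying MARCXML. A minimal sketch of building such a request; the endpoint path and header name here are illustrative assumptions, not Invenio's documented API.

```python
# Hedged sketch of a "robot upload" request: POST one MARCXML record
# with an API key.  Endpoint path and header name are assumptions.
import urllib.request

def build_upload_request(base_url, marcxml, api_key):
    """Prepare (but do not send) the upload request, so the caller
    can submit it with urllib.request.urlopen()."""
    return urllib.request.Request(
        base_url + "/batchuploader/robotupload",
        data=marcxml.encode("utf-8"),
        headers={"Content-Type": "application/marcxml+xml",
                 "X-Api-Key": api_key},   # checked server-side, a la Twitter
        method="POST")
```

The server answers immediately with an HTTP error code when the MARCXML is invalid, which is what makes fully automatic pushing (CMS, event recording, Drupal) practical.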
Which one? Example 1: Nikos wants to create an archive of all the blogs about High Energy Physics E.g. Nikos works at CERN and he must harvest blog posts from Fermilab Quantum Diaries Example 2: Sam wants to create an archive of all the scheduled TV programmes in his country E.g. Sam lives in France and he would like to harvest content from the TV broadcast site Telerama.fr
The TV DL Example 2: Sam wants to create an archive of all the scheduled TV programmes in his nation E.g. Sam lives in France and he would like to harvest content from the TV Broadcast site Telerama.fr Good luck Sam!
The TV DL Find a common input standard: for TV programmes this is XMLTV Map the input to MARCXML Wrap the harvesting, conversion and uploading into a Tasklet That's all!
The TV DL Mapping XMLTV to MARCXML: XML Conversion

    smart_add_field(rec,
        {'245': {'a': programme.get('title'),
                 'b': programme.get('sub-title')},          ## Title
         '260': {'b': channel_map[programme['channel']],
                 'c': start_time},                          ## "Place"
         '269': {'c': programme.get('date')},               ## Date
         '520': {'a': programme.get('desc')},               ## Summary
         '037': {'a': u'-'.join([programme.get('channel', ''),
                                 programme.get('start', '').replace(' ', '')])},
         '088': {'a': u'-'.join([programme.get('title', [('', '')])[0][0].replace(' ', ''),
                                 programme.get('episode-num', [('', '')])[0][0]])},
         'FMT': {'g': original_xml,
                 'f': 'xmltv'}})
The TV DL Tasklet programming: Main function: bibtasklet

    def bst_xmltv2marcxml():
        write_message("grabbing XMLTV data")
        xmltvfile = grab_xmltv()
        write_message("xmltv data saved into %s" % xmltvfile)
        fd, marcxmlfile = tempfile.mkstemp(dir=tmpdir, suffix='.xml')
        os.write(fd, xmltv2marcxml(open(xmltvfile)))
        os.close(fd)
        write_message("derived MARCXML saved into %s" % marcxmlfile)
        task_id = task_low_level_submission("bibupload", "xmltv", "-ir", marcxmlfile)
        write_message("bibupload scheduled with task id %s" % task_id)
The TV DL Scheduler with pending tasklets:
The TV DL Push or Pull?
Processing Modules Overview
Processing Modules Example: indexing
Building Indexes designing a search engine performance-driven design assumptions: high number of selects, low number of updates fast searching, slow indexing cache everything cacheable search functionality: search for words, phrases, regular expressions search in any field: authors, titles, etc index design: forward indexes: rec1 → [word1, word8, ...] rec2 → [word1, word2, ...] reverse indexes: word1 → [rec1, rec2, ...] word2 → [rec2, rec7, ...] Zipf's law on word frequency: few words occur very often (e.g. "the") most words are infrequent (even, e.g., "boson")
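The forward/reverse index pair described above can be sketched in a few lines; this is a minimal illustration of the data structure, not Invenio's BibIndex implementation.

```python
# Minimal sketch of forward (rec -> words) and reverse (word -> recs)
# indexes, with AND-search via reverse-index set intersection.
def build_indexes(records):
    """records: {rec_id: text}; returns (forward, reverse)."""
    forward, reverse = {}, {}
    for rec_id, text in records.items():
        words = set(text.lower().split())
        forward[rec_id] = words                         # forward index
        for w in words:
            reverse.setdefault(w, set()).add(rec_id)    # reverse index
    return forward, reverse

def search(reverse, *words):
    """Word search = intersect the hit set of every query word."""
    sets = [reverse.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()
```

The reverse index answers queries fast; the forward index is what makes updating a single record cheap, since it lists exactly which reverse-index entries to touch.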
Building Indexes Optimizing Three important speed factors to consider: speed of finding sets (DB Server) speed of demarshalling sets (DB ↔ Web App Server) speed of intersecting sets (Web App Server) Optimizing data structures data structures tested: sorted (lists, Patricia trees) unsorted (hashed sets, binary vectors) fast prototyping: (Python, Common Lisp) Binary vectors were found to be the best compromise: typical search time gain: 4.0 sec → 0.2 sec typical indexing time loss: 7 hours → 4 days mostly sparse data modelled via a mostly dense data structure
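Why do binary vectors intersect so fast? Each word's hit set becomes a string of bits (bit i set when record i matches), so intersection is a single bitwise AND over dense machine words. A toy sketch using Python's arbitrary-size integers as the bit vectors (Invenio's actual "intbitset" is a C implementation; this only illustrates the principle):

```python
# Toy bit-vector hit sets: Python ints stand in for the dense
# binary vectors; intersection degenerates to one bitwise AND.
def to_bitvector(rec_ids):
    v = 0
    for r in rec_ids:
        v |= 1 << r          # set bit r for record r
    return v

def from_bitvector(v):
    return {i for i in range(v.bit_length()) if (v >> i) & 1}

word1 = to_bitvector({1, 2, 7})      # records containing word1
word2 = to_bitvector({2, 7, 9})      # records containing word2
hits = from_bitvector(word1 & word2) # records containing both
```

This is the "mostly sparse data in a mostly dense structure" trade-off: the vectors cost more to build and store (slower indexing) but make query-time intersection nearly free.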
Processing Modules Sorting Very Large Sets Quickly A processing phase is needed to generate sorting buckets At search time, the first 10 records sorted in ascending order by title:
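A hedged sketch of the sorting-bucket idea as I read it from the slide: rank all records by title once during the processing phase, so that at search time ordering a hit set is a cheap integer-key sort instead of repeated string comparisons. The function names are illustrative, not Invenio's.

```python
# Hedged sketch: precomputed sorting buckets.  Titles are ranked once
# at processing time; search-time sorting reuses the integer ranks.
def build_sort_bucket(titles):
    """titles: {rec_id: title}; returns {rec_id: rank}, rank following
    ascending title order."""
    ordered = sorted(titles, key=lambda r: titles[r].lower())
    return {rec_id: rank for rank, rec_id in enumerate(ordered)}

def first_n_by_title(hitset, bucket, n=10):
    """Sort only the current hit set, by precomputed rank."""
    return sorted(hitset, key=bucket.__getitem__)[:n]
```

Only the (usually small) hit set is sorted per query; the expensive full ordering was paid once, offline.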
Processing Modules Citation-graph based (L. Marian) Citation Counts Time-dependent Citation Counts Link-based Time-dependent Link-based Link-based with External Citations
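The slide only names the ranking families; as one concrete illustration, a time-dependent citation count can down-weight old citations. The exponential decay and the time constant below are my assumptions for the sketch, not the method L. Marian actually used.

```python
# Hedged sketch of a time-dependent citation count: each citation
# contributes exp(-age/tau) instead of 1.  The decay law and tau are
# illustrative assumptions.
import math

def weighted_citations(cite_years, now, tau=5.0):
    """cite_years: publication years of the citing papers."""
    return sum(math.exp(-(now - y) / tau) for y in cite_years)
```

With this weighting, two citations received this year outscore several citations from a decade ago, which is the point of the time-dependent variants over plain counts.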
Processing Modules Citation-graph based methods
Processing Modules D-Rank D-Rank: Distributed technology One method to rule them all: one method that can aggregate all the existing ranking methods + user feedback Readjustment of parameters based on user feedback, as a function of relevance and quality
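In the simplest reading of "one method that can aggregate all the existing ranking methods", the combined score is a weighted sum whose weights user feedback readjusts. The sketch below illustrates only that aggregation shape; the weights, method names and normalisation are my assumptions, not D-Rank's actual parameters.

```python
# Hedged sketch of aggregating several ranking methods into one
# score; everything here is illustrative, not D-Rank internals.
def aggregate_score(scores, weights):
    """scores/weights: {method_name: value}; returns the weighted
    average of the per-method scores."""
    total = sum(weights.values())
    return sum(weights[m] * scores.get(m, 0.0) for m in weights) / total
```

Feedback-driven readjustment would then mean nudging the `weights` toward the methods whose orderings users confirm, without changing the aggregation itself.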
Processing Modules Citation-graph based methods Spying the "life" of a record in all queries: