Content domain. Leonardo Candela, Paolo Manghi, Carlo Meghini. DL.org Autumn School, Athens, 3-8 October 2010 (5th October 2010)


Lecture outline
- What is the Content
- Content domain in the Reference Model
- Content domain interoperability: Issues and Approaches
- Content domain interoperability approaches: DLSs Federations
- Content domain interoperability approaches: Europeana Data Model
- Hands-on Time

What is [the Digital Library] Content

What is [the Digital Library] Content?
Digital Libraries are tools requested to support intellectual activity having no logical, conceptual, physical, temporal or personal borders or barriers on information
- From a content-centric system to a person-centric system
- From static storage and retrieval of information to facilitation of communication, collaboration and other forms of interaction among scientists, researchers or the general public
- From handling mostly centrally located text to synthesising distributed multimedia document collections, sensor data, mobile information and pervasive computing services

From data to wisdom, a.k.a. the DIKW hierarchy
- data = raw facts
- information = processed data, connected data
- knowledge = application of information, appropriate collection of information
- wisdom = processed knowledge
Source: www.systems-thinking.org

Papers today

eScience publications (David De Roure @ OR10)

Content Domain: the Reference Model

Content Domain
- One of the six main concepts characterising the Digital Library universe.
- It represents the various aspects related to the modelling of information managed in the Digital Library universe to serve the information needs of the Actors.
- Encompasses the data and information that the Digital Library handles and makes available to its users
- Encompasses the diverse range of information objects, including such resources as objects, annotations and metadata
- It is composed of a set of information objects organised in collections

Content Domain: the map
2 concepts only!

Information Object
- The main Resource of the Content Domain.
- An Information Object is a Resource identified by a Resource Identifier.
- It must belong to at least one Collection.
- It may have Metadata, Annotations and multiple Editions, Views, Manifestations, which are also represented as Information Objects.
- In addition, it may have Quality Parameters and Policies.

Information Object (cont.)
As an Information Object is a Resource, it inherits all its features:
- has a unique identifier (Resource Identifier), also known as the information object identifier;
- is arranged according to a format (Resource Format), also known as the document model;
- can arbitrarily be composed (<haspart> and <associatedwith>) to capture compound artefacts;
- is characterised by various Quality Parameters, each capturing different object quality facets (<hasquality>);
- is regulated by Policies (<regulatedBy>) governing every aspect of its lifetime; and
- can be described or augmented by Metadata (<hasMetadata>) and Annotations (<hasAnnotation>)

Collection
- A content Resource Set.
- The extension of a collection consists of the Information Objects it contains.
- A Collection may be defined by a membership criterion, which is the intension of the collection.
- A Collection is an Information Object, thus a Resource!
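The two concepts above can be pictured as a small data structure. The following is only a sketch under stated assumptions: the class and field names are invented for illustration and are not part of the Reference Model, and the rule that a criterion, when present, determines the extension is a simplification chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class InformationObject:
    """A Resource of the Content Domain (field names are hypothetical)."""
    identifier: str                    # Resource Identifier
    resource_format: str = "unknown"   # Resource Format (document model)
    parts: list = field(default_factory=list)        # compound artefacts
    metadata: list = field(default_factory=list)     # described by Metadata
    annotations: list = field(default_factory=list)  # augmented by Annotations


@dataclass
class Collection(InformationObject):
    """A Collection is itself an Information Object, hence a Resource."""
    members: list = field(default_factory=list)  # explicitly listed objects
    # Intension: an optional membership criterion over Information Objects.
    criterion: Optional[Callable[[InformationObject], bool]] = None

    def extension(self, universe: list) -> list:
        """Return the objects the collection contains.

        Assumption for this sketch: if a criterion is given, it selects the
        extension from `universe`; otherwise the explicit members are used.
        """
        if self.criterion is not None:
            return [obj for obj in universe if self.criterion(obj)]
        return list(self.members)


pdf = InformationObject("io:1", "application/pdf")
txt = InformationObject("io:2", "text/plain")
pdfs = Collection("col:pdf", criterion=lambda o: o.resource_format == "application/pdf")
selected = pdfs.extension([pdf, txt])
```

Modelling `Collection` as a subclass mirrors the last bullet: whatever applies to an Information Object (identifier, metadata, annotations) applies to a Collection too.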

Content Domain: where is the rest?
- Storage / representation
- Data vs Metadata

Content Domain: where is the rest? (cont.)
- Different levels of abstraction

Do we have enough constructs? (David De Roure @ OR10)

Content Domain Interoperability: Main Issues & Approaches

The Content Interoperability monster
In G. Anthes, "Happy Birthday, RDBMS!", CACM, Vol. 53(5), May 2010:
- integration of heterogeneous data: "A special case that is still really hard is schema mapping, converting data from one format to another," ... "It sounds straightforward, but it's very subtle."
- the "unsolved problem" of querying geographically distributed databases

[Content] Interoperability Framework
"the ability of two or more systems or components to exchange information and to use the information that has been exchanged" (IEEE, 1990)
Rephrasing and elaborating:
- at least two entities, Provider and Consumer
- willing to share a Resource
- to perform a Task (has preconditions)
- through a Communication Channel, involving a protocol and information representation
- across Organizational, Semantic and Technical boundaries of entities

Interoperability Framework: focusing on the three key aspects
- Organizational: deals with business goals and processes of an entity (Provider or Consumer)
- Semantic: deals with the meaning of the exchanged resource and the rest of information
- Technical: deals with the technological solution supporting the operation of the Provider / Consumer as well as the communication between the two
- Dependencies / constraints among the three boundaries, e.g. organisational aspects have to be implemented by the technical aspects
- Implicit / hidden information, e.g. the technical aspects might implement only part of the organisational aspects
- Different approaches for diverse boundaries, e.g. human-centric vs. machine-centric
- Complete solutions involve all of them, e.g. the decision to rely on a certain technology might be useless if it is not complemented by proper organisational aspects

Interoperability Approaches
- Agreement-based Approaches
  - Include Standard-based approaches
  - Infringe autonomy, strong in effectiveness
- Mediator-based Approaches
  - An intermediary service linking Provider and Consumer
  - Strong in autonomy; development and maintenance cost
- Blending & Compound Approaches
  - Mix & compose
  - Compatibility issue among solutions
- Alternative solutions
  - "No solution exists" vs "for each problem there exists at least a solution"

How is an interoperability solution described?
1. Overview: context of the proposed approach, including pointers to a detailed description of it
2. Requirements: conditions under which the solution might be used
3. Results: changes resulting from the usage of the solution
4. Implementation guidelines: how the changes are produced
5. Assessment: qualitative evaluation of the proposed solution

Content Interoperability
[Diagram labels: Scenario; Information Object; Information Object features enabling the task; Enabling approach]

Content Interoperability Issues and Solutions
- IO Features: ID, structure, any attribute (e.g. provenance)
- Approaches
  - Agreement-based: OAI-PMH, OAI-ORE, Dublin Core
  - Mediator-based: metadata mapping, Application Profiles
  - Blending

Metadata Schemas and Application Profiles
- Metadata Schemas
  - 1995: Dublin Core
  - Identifier, Title, Creator, Contributor, Publisher, Subject, Description, Coverage, Format, Type, Date, Relation, Source, Rights, Language
  - cross-domain
  - simplicity vs precision/completeness -> cost
  - not one size fits all
- Application Profiles
  - metadata schema defined by combining elements from existing schemes
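An application profile can be read as a checklist of elements borrowed from existing schemes. The sketch below is illustrative only: the `dc:` names are the fifteen Dublin Core elements listed above, while the `ex:` prefix, the `ex:projectId` element and the profile itself are hypothetical.

```python
# The fifteen Dublin Core elements named on the slide, with a "dc:" prefix.
DUBLIN_CORE = {f"dc:{name}" for name in (
    "identifier", "title", "creator", "contributor", "publisher", "subject",
    "description", "coverage", "format", "type", "date", "relation",
    "source", "rights", "language",
)}

# Hypothetical application profile: a subset of Dublin Core plus one element
# taken from an invented "ex:" scheme. True marks a mandatory element.
PROFILE = {
    "dc:identifier": True,
    "dc:title": True,
    "dc:creator": False,
    "ex:projectId": True,
}


def profile_violations(record: dict) -> list:
    """List the ways a flat record (element name -> value) breaks the profile."""
    problems = [
        f"missing {element}"
        for element, required in PROFILE.items()
        if required and element not in record
    ]
    problems += [f"unexpected {element}" for element in record if element not in PROFILE]
    return problems


good = {"dc:identifier": "id-1", "dc:title": "Interoperability", "ex:projectId": "P-42"}
bad = {"dc:identifier": "id-2", "dc:title": "Untitled", "dc:coverage": "Europe"}
```

Here `profile_violations(good)` is empty, while `bad` is reported as missing the mandatory project identifier and carrying an element the profile does not admit.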

OAI-PMH
- A 2001 simple protocol for metadata harvesting
- metadata centric (?!?!)
- data provider vs service provider
- six verbs (Identify, ListMetadataFormats, ListSets, ListRecords, ListIdentifiers, GetRecord)
- HTTP & XML
- http://www.openarchives.org/pmh
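Because the protocol is just HTTP requests carrying one of six verbs, a harvesting request can be sketched as URL construction. This is a minimal illustration, not a harvester: the base URL is a hypothetical repository endpoint, the helper name is invented, and no request is actually sent. `metadataPrefix` and `oai_dc` are the standard argument and the Dublin Core prefix defined by the protocol.

```python
from urllib.parse import urlencode

# The six verbs listed on the slide.
OAI_PMH_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListRecords", "ListIdentifiers", "GetRecord",
}


def oai_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build the URL of an OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    # The verb travels as an ordinary query parameter next to its arguments.
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Hypothetical data provider endpoint.
BASE_URL = "https://repository.example.org/oai"
identify_url = oai_request_url(BASE_URL, "Identify")
list_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

A service provider would fetch `list_url`, parse the XML response, and repeat the request with the returned resumption token until the list is exhausted.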

Linked Data
- A 2006 (?) set of principles for publishing structured data on the Web
- resource: an item of interest
- URI: global identifier for a resource
- representation: data corresponding to the state of a resource
- information resource: a document containing information
- non-information resource: anything else
- associated description: representation describing a Semantic Web resource
- http://www.linkeddata.org

OAI-ORE
- A 2008 approach for Aggregated Resources
- web principles adherence
- http://www.openarchives.org/ore

Lesson learned
- Open / Publish
  - Comprehensive Information Objects (features) => recall-orientation
  - does not imply free (policies are always there)
  - machine-orientation
- Web-orientation and standards
  - no new / custom protocol
  - no complex protocol
- Information Object semantics & data quality are still the hard problem (but mitigated by shared agreements)

Content Domain: Interoperability Solutions
Interoperability Patterns in Digital Library Systems Federations
Paolo Manghi
http://mc.gunet.gr/live/dlorg.php

Outline
- Digital Library System Federations
- Interoperability issues
  - Data impedance mismatch
  - Structural, semantic and granularity mismatch
- Solution: D-NET Software Toolkit

Digital Library Systems
[Diagram labels: business; output; Consumers; Information Objects & Data model]

Motivations
Digital Library Systems Federations (DLSFs)
- On-line availability of fragmented research outcomes
- Multidisciplinary character of modern research
- Increased speed of research life-cycle, i.e., immediate availability and access to research outcome
- Others

DLSFs
- OAI-PMH archive/libraries/repository federations, e.g., Europeana, OCLC-OAIster, BASE, NARCIS
- Community-oriented data infrastructures, e.g., DRIVER, SAPIR, CLARIN, EFG, HOPE, D4Science

DLSFs and the DL.org interoperability framework
- Providers = Digital Library Systems or Data Providers
- Consumer = Service Provider, a software system specially devised for
  - Collecting input content resources (information objects, e.g., metadata, payloads, compound objects) from a set of data providers
  - Producing, from the input information objects, a uniform information space of output information objects, required by the consumer to perform a given task

DLSFs and the DL.org interoperability framework
- Providers = Digital Library Systems or Data Providers
- Consumer = Service Provider
[Diagram: Data providers (Content resources, IOs & Data model?) feeding a Data consumer (IOs & Data model)]

DLSFs: content interoperability
- Obstacles encountered by a data provider (DLS) willing to offer useful information objects to a service provider to accomplish its task
- Obstacles encountered by a service provider willing to accomplish its task by accessing the information objects of a data provider which it considers useful

DLSFs: content interoperability issues
- Low-level issues: How to exchange objects
  - Identifying common on-the-wire data-exchange practices
- High-level issues: How to harmonize information objects data models
  - Resolve data impedance mismatch problems arising from distinct data models of data and service providers

Low-level issues: How to exchange information objects
- Adoption of XML as lingua franca and of standard data-exchange protocols, e.g., OAI-PMH, OAI-ORE, ODBC
- XML schema for data model
- Data providers implement exporting components: information objects -> XML files
- Service providers implement importing components: XML files -> information objects
- Worth noticing:
  - Equal data models do not mean equal XML schemas
  - Data and service providers may manage information objects as XML files (e.g., native XML DBs)
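The exporting and importing components described above can be sketched with the Python standard library. This is a toy illustration under stated assumptions: information objects are flat dictionaries, and the `Article` element with `Title`, `Authors` and `Date` fields is a hypothetical XML encoding chosen for the example.

```python
import xml.etree.ElementTree as ET


def export_article(article: dict) -> str:
    """Data provider side: information object -> XML text."""
    root = ET.Element("Article")
    for name, value in article.items():
        # One child element per field of the information object.
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def import_article(xml_text: str) -> dict:
    """Service provider side: XML text -> information object."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


article = {
    "Title": "Interoperability patterns",
    "Authors": "Paolo Manghi, Leonardo Candela",
    "Date": "September 2010",
}
xml_text = export_article(article)
round_trip = import_article(xml_text)
```

The round trip only works here because both sides share one XML schema; the high-level issues arise precisely when they do not.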

High-level issues: data impedance mismatch
[Diagram: Data provider (Content Resources, Data model) -> XML export (XML Schema) -> Manipulation -> XML import (XML Schema) -> Service provider (Data model, Task)]
Example:
    <Article>
      <Title>Interoperability patterns</Title>
      <Authors>Paolo Manghi, Leonardo Candela</Authors>
      <Date>September 2010</Date>
    </Article>
Definitions:
- Schema path: Article.Title
- Schema leaf: September 2010
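The two definitions can be made concrete by walking an XML record and listing each schema path together with its leaf. The sketch below reuses the Article example from the slide; the function name and the dotted path notation for nested elements are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

ARTICLE_XML = """
<Article>
  <Title>Interoperability patterns</Title>
  <Authors>Paolo Manghi, Leonardo Candela</Authors>
  <Date>September 2010</Date>
</Article>
"""


def paths_and_leaves(element: ET.Element, prefix: str = "") -> Iterator[tuple]:
    """Yield (schema path, schema leaf) pairs for every childless element."""
    path = f"{prefix}.{element.tag}" if prefix else element.tag
    children = list(element)
    if not children:
        # A childless element carries a leaf: its text content.
        yield path, (element.text or "").strip()
    for child in children:
        yield from paths_and_leaves(child, path)


pairs = list(paths_and_leaves(ET.fromstring(ARTICLE_XML.strip())))
```

Structural mismatches are differences in the first component of each pair; semantic mismatches are differences in the second.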

High-level issues: data impedance mismatch
- Data model impedance mismatch
  - Data and service providers' XML schemas do not match, either structurally (schema paths) or semantically (schema leaves)
- Granularity impedance mismatch
  - XML encodings of information objects at the service provider and data providers adopt different levels of granularity.

Structural heterogeneity (Data model impedance mismatch)
- Loss: data provider Article (Title, Authors, Date) -> service provider Article (Title, Authors)
- Casting: data provider Article (Title, Authors, Date) -> service provider Article (Title, Creators, DateOfCreation)

Semantic heterogeneity (Data model impedance mismatch)
- Formats: data provider Article (Title: Interoperability; Authors: Paolo Manghi, ...; Date: September 2010) -> service provider Article (Title: Interoperability; Authors: Manghi, P., ...; Date: 01-09-2010)
- Derivation / Inference: data provider Article (Title: Interoperability; Authors: Paolo Manghi, ...; Date: September 2010) -> service provider Article (Title: Interoperability; Authors: Paolo Manghi, ...; Date: September 2010; TitleLanguage: EN)

Semantic & Structural heterogeneity (Data model impedance mismatch)
- Data provider: Article (Title: Interoperability; Authors: Paolo Manghi, ...; Date: September 2010)
- Service provider: Article (Title: Interoperability; Creator (Name: Paolo; Surname: Manghi); Creator (Name: Leonardo; Surname: Candela); Date: September 2010)
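The combined case on this slide, where one Authors leaf becomes several Creator elements with separate Name and Surname, can be sketched as follows. This is a deliberately naive illustration: it assumes comma-separated authors written as "Name Surname" and treats the last word as the surname, which real author strings often violate.

```python
def split_authors(authors_leaf: str) -> list:
    """Turn one Authors leaf into a list of Creator records (naive sketch)."""
    creators = []
    for full_name in authors_leaf.split(","):
        full_name = full_name.strip()
        if not full_name:
            continue  # ignore empty pieces, e.g. after a trailing comma
        # Assumption: everything before the last space is the given name.
        name, _, surname = full_name.rpartition(" ")
        creators.append({"Name": name, "Surname": surname})
    return creators


creators = split_authors("Paolo Manghi, Leonardo Candela")
```

The change is structural (one path becomes a repeated sub-tree) and semantic (the leaf values are re-interpreted) at the same time, which is why the slide treats it as a separate case.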

Tackling the data model impedance mismatch: transformation components
[Diagram: Data provider (Content Resources, Data model) -> XML export (XML Schema) -> Transformation: paths & leaves (Structure and Semantics) -> XML import (XML Schema) -> Service provider (Data model, Task)]
Use-cases:
- All data providers have the same XML schema
- Data providers have different XML schemas

Data providers with equal XML schema (Data model impedance mismatch solutions)
- The transformation component considers one mapping from such common XML schema onto the service provider schema
- Output schema leaves (identified by output schema paths) are generated by processing input leaves (identified by schema paths) through transformation functions F
- The complexity of the Fs can be arbitrary:
  - feature extraction functions: taking a URL, downloading the file (e.g., HTML, PDF, JPG) and returning content extracted from it
  - conversion functions: translation from vocabulary to vocabulary
  - transcoding functions: leaf format to leaf format (e.g., date formats)
  - regular expression: generating one leaf from a set of leaves (e.g., generating a person name leaf by concatenating name and surname originally kept in two distinct leaves)
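Three of the function kinds listed above can be sketched as plain Python functions from input leaves to an output leaf. The function names, the sample vocabulary and the chosen date formats are assumptions for illustration; feature extraction is left out because it would require network access.

```python
from datetime import datetime


def transcode_date(leaf: str) -> str:
    """Transcoding: 'September 2010' -> '01-09-2010' (day defaults to 1)."""
    return datetime.strptime(leaf.strip(), "%B %Y").strftime("%d-%m-%Y")


def convert_vocabulary(leaf: str, vocabulary: dict) -> str:
    """Conversion: map a term of one vocabulary onto another, if known."""
    return vocabulary.get(leaf, leaf)


def join_name(name: str, surname: str) -> str:
    """Combination: one leaf generated from a set of leaves."""
    return f"{name.strip()} {surname.strip()}"


# Hypothetical vocabulary mapping, invented for this example.
TYPE_VOCABULARY = {"journal paper": "article"}

date_leaf = transcode_date("September 2010")
type_leaf = convert_vocabulary("journal paper", TYPE_VOCABULARY)
name_leaf = join_name("Paolo", "Manghi")
```

Keeping every F a pure function of its input leaves is what lets a transformation component apply them uniformly, whatever their internal complexity.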

Data providers with different XML schemas (Data model impedance mismatch solutions)
- The transformation component must consider multiple mappings from the diverse input XML schemas onto the XML schema of the service provider
- Simple scenario: pre-determined set of data providers
  - Providing one transformation component, like the one described for the previous scenario, for each set of data providers with the same schema
- Complex scenario: an undetermined number of data providers is expected, possibly bearing different XML schemas
  - Providing general-purpose components, capable of managing (create, remove, update) a set of mappings
  - Mappings are named lists of triples (input paths, F, output path)
  - The component may allow for the addition of new Fs
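A general-purpose component for the complex scenario can be sketched as a registry of named mappings, each a list of (input paths, F, output path) rules. The class and method names are hypothetical, and records are simplified to flat dictionaries from schema path to leaf.

```python
from typing import Callable

# One rule: (input schema paths, transformation function F, output schema path).
Rule = tuple  # (tuple of str, Callable[..., str], str)


class MappingRegistry:
    """Manages (create, update, remove) named mappings and applies them."""

    def __init__(self) -> None:
        self._mappings: dict = {}

    def create(self, name: str, rules: list) -> None:
        if name in self._mappings:
            raise KeyError(f"mapping already exists: {name}")
        self._mappings[name] = list(rules)

    def update(self, name: str, rules: list) -> None:
        if name not in self._mappings:
            raise KeyError(f"unknown mapping: {name}")
        self._mappings[name] = list(rules)

    def remove(self, name: str) -> None:
        del self._mappings[name]

    def apply(self, name: str, record: dict) -> dict:
        """Generate output leaves by running each F over its input leaves."""
        output = {}
        for input_paths, func, output_path in self._mappings[name]:
            output[output_path] = func(*(record[path] for path in input_paths))
        return output


registry = MappingRegistry()
registry.create("provider-a", [
    (("Article.Title",), str.strip, "Article.Title"),
    (("Article.Name", "Article.Surname"), lambda n, s: f"{n} {s}", "Article.Creator"),
])
result = registry.apply("provider-a", {
    "Article.Title": " Interoperability ",
    "Article.Name": "Paolo",
    "Article.Surname": "Manghi",
})
```

Adding a data provider with a new schema then means registering one more named mapping, without touching the component itself.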

Granularity impedance mismatch
- Splitting (1:N): Data provider (Data model) -> XML export (XML Schema) -> Splitting (Granularity) -> Transformation: paths & leaves (Structure and Semantics) -> XML import (XML Schema) -> Service provider (Data model)
- Packaging (N:1): Data provider (Data model) -> XML export (XML Schema) -> Packaging (Granularity) -> Transformation: paths & leaves (Structure and Semantics) -> XML import (XML Schema) -> Service provider (Data model)

Granularity impedance mismatch
- Packaging across data providers (1xM:1): several Data providers (Data model), each with its own XML export (XML Schema) -> Packaging (Granularity) -> Transformation: paths & leaves (Structure and Semantics) -> Service provider (Data model)
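Splitting (1:N) and packaging (N:1) can be sketched on flat records. The example is a simplification under stated assumptions: records are dictionaries, splitting happens on one multi-valued field, packaging groups records sharing a key, and all field names are invented.

```python
def split_record(record: dict, multi_field: str) -> list:
    """Splitting (1:N): one record per value of a multi-valued field."""
    return [{**record, multi_field: value} for value in record[multi_field]]


def pack_records(records: list, key: str, packed_field: str) -> list:
    """Packaging (N:1): group records sharing `key` into compound records."""
    packed = {}
    for record in records:
        group = packed.setdefault(record[key], {key: record[key], packed_field: []})
        # Keep everything except the grouping key inside the compound record.
        group[packed_field].append({k: v for k, v in record.items() if k != key})
    return list(packed.values())


split = split_record({"id": "a1", "file": ["f1.pdf", "f2.jpg"]}, "file")
packed = pack_records(
    [
        {"film": "F1", "reel": "1"},
        {"film": "F1", "reel": "2"},
        {"film": "F2", "reel": "1"},
    ],
    key="film",
    packed_field="parts",
)
```

In the 1xM:1 case the input list would simply be the union of records coming from several data providers before `pack_records` is applied.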

Architecture of interoperability solutions
- Bottom-up federations, e.g., DAREnet-NARCIS
  - Realized by organizations who have control over the set of participating data providers
  - Agree on a common data model and XML schema so that no interoperability issues occur
- Open federations, e.g., the DRIVER repository infrastructure, OpenAIRE system, Europeana
  - Federations attractive to data providers, which are willing to adhere to given data model specifications ("guidelines") in order to join the aggregation
  - Transformation: data providers are responsible for structural interoperability (typically light-weight transformation issues); semantic interoperability is typically the responsibility of the service provider
  - Packaging/splitting not required

Architecture of interoperability solutions
- Community-oriented federations, e.g., the European Film Gateway project
  - Data providers handling the same typology of content invest in the realization of a service provider to enable cross-provider functionality
  - Define a common data model on the service provider
  - Packaging/splitting: if needed, typically occurs at the service provider side
  - Transformation: may occur at the data provider side (before XML export takes place), or data providers are directly involved in the definition of mappings on the service provider
- Top-down federations, e.g., OAIster-OCLC project, BASE search engine
  - Realized by organizations willing to deliver a service provider to offer functionality over data providers whose content is openly reachable
  - Service provider deals with any interoperability issues

D-NET Software Toolkit

D-NET Software Toolkit: general-purpose DLCLs
General-purpose framework for the realization and maintenance of context-specific DLCLs
- Management of information objects of arbitrary data models
- Management of DLSs of several typologies (e.g., OAI, ODBC, FTP)
- Construction of personalized and automated data workflows
- Management of robustness and scalability parameters
- DLSs life-cycle administration tools
- Extensibility with new functionality

D-NET Software Toolkit: the solution
- Service Kits supporting realization of personalized DLSFs by exploiting customizability, extensibility and modularity features
- Service-oriented infrastructure features (autonomicity, distribution and sharing) to support scalable and robust production systems

D-NET: Service Kits
[Diagram: services arranged in areas labelled End User Functionality, Enabling, Data Management and Data Sources]
- Services shown: Authz&Authn Service, Information Service, Web Generic UI Service, Search Service, Recomm. Service, User Profile Service, Community Service, Collection Service, Personalization Service, ResultSet Service, Manager Service, OAI-PMH Publisher Service, Feature Extraction Service, Index Service, Store Service, Browse Service, OAI-PMH Harvester Service, Repositories Validator Service, Graph Database Service, Transformation Service, Compound Object Service, Database Service, XML Import Service, Object Packaging Service, Repository Man Service, Authority File Service, MDStore Service
- Data Sources: FS, FTP, NFS

Modularity, customizability, sharing (and orchestration): DRIVER Project
[Diagram: workflow combining OAI-PMH Harvester, MDStore, Transformator, Index and Search (SRW) services, together with PDF Downloader, Store and FeatureExt]
- Metadata Formats: Dublin Core (OAI-PMH), DRIVER Metadata Format, intermediate extraction format

Modularity, customizability, sharing (and orchestration): EFG Project
[Diagram: workflow combining OAI-PMH Harvester, MDStore, Transformator, Authority File, Object Pack., Index and Search (SRW) services, with a split step and an OAI-PMH export towards Europeana]
- Metadata Formats: Local Schema, EFG schema, ESE (Europeana)

Modularity, customizability, sharing (and orchestration): OpenAIRE Project
[Diagram: workflow combining Harvester (OAI-PMH), Downloader, MDStore, Transformator, Database, Index and Search (SRW) services]
- Metadata Formats:
  - Project and Participants Format (from EC)
  - OpenAIRE Export format: Dublin Core + project ID + License info
  - OpenAIRE Internal Format: OpenAIRE Export format + repo info
  - Search format: Project, Papers, Participant

D-NET's uptake
- DRIVER project: 250 repositories (34 countries), 2,300,000+ items; search.driver.research-infrastructures.eu
- European Film Gateway EC project: 14 archives, 300,000 items, compound object data model; www.europeanfilmgateway.eu
- OpenAIRE EC pilot: harvesting, depositing and statistics of publications and EC project data; www.openaire.eu
- HOPE project: +20 archives, millions of items, compound object data model; www.iisg.nl/news/hope.php
- ScholarLynk R2D2 Project: Microsoft Research Cambridge and D-NET

Experimentation
- Experimentation of deployment of new D-NET repository infrastructures: China, India, Portugal, Belgium, Spain, Slovenia
- Upcoming: Greece and Bulgaria

D-NET Software Toolkit
- Software packages
  - Open Source, Apache License
  - Release v1.0 (production) and v1.2 (beta)
  - Release v2.0 (beta): Enhanced Publication
- Under continuous refinement
- www.d-net.research-infrastructures.eu

Technical Team
- CNR-ISTI: Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
- NKUA: Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Greece
- UNIBI: Universität Bielefeld, Germany
- ICM: Interdisciplinary Centre for Mathematical and Computational Modelling, Uniwersytet Warszawski, Poland

Content Domain: Interoperability Solutions
Europeana Data Model
Carlo Meghini

Content Domain: Hands-on Time

Exercises
- Identify and produce RM Content domain enhancements. Each enhancement should be equipped with a motivation. Enhancements might concern the introduction of new concepts and/or relationships, the revision of existing definitions, as well as exemplars;
- Select one (or more) DL system and describe its content domain by relying on the Reference Model;
- Identify and describe a content-oriented interoperability solution (Overview, Requirements, Results, Implementation Guidelines, Assessment);
- Revise one (or more) Cookbook content-oriented interoperability solutions;
- Work on the Content domain part of the interoperability scenario.

Thank you