Curation module in action - its preliminary findings on VLO metadata quality

Similar documents
CLARIN for Linguists Portal & Searching for Resources. Jan Odijk LOT Summerschool Nijmegen,

Component Metadata Infrastructure Best Practices for CLARIN

Working towards a Metadata Federation of CLARIN and DARIAH-DE

Something will be connected - Semantic mapping from CMDI to Parthenos Entities

1 Overview chart. PIDs: talk with EPIC PIDs: MoU or advice. Assessment wave 3. VLO overhaul CMDI 1.2

1. General requirements

Number game Experience of a European research infrastructure (CLARIN) for the analysis of web traffic

The Virtual Language Observatory!

B2FIND: EUDAT Metadata Service. Daan Broeder, et al. EUDAT Metadata Task Force

ACDH AUSTRIAN CENTRE FOR DIGITAL HUMANITIES

acdh data services full stack

EUDAT. A European Collaborative Data Infrastructure. Daan Broeder The Language Archive MPI for Psycholinguistics CLARIN, DASISH, EUDAT

Bringing Europeana and CLARIN together: Dissemination and exploitation of cultural heritage data in a research infrastructure

PubMan Workshop. This work is licensed under a Creative Commons Attribution 3.0 Germany License

D2.1. Metadata benchmarking and curation module. Document information. Project information. CLARINPLUS-D2.1 (CE ) Davor Ostojić, Matej Ďurčo

Metadata quality assurance for CLARIN

Persistent identifiers, long-term access and the DiVA preservation strategy

FLAT: A CLARIN-compatible repository solution based on Fedora Commons

Axon Fixed Limitations... 1 Known Limitations... 3 Informatica Global Customer Support... 5

1. CONCEPTUAL MODEL 1.1 DOMAIN MODEL 1.2 UML DIAGRAM

Gale Digital Scholar Lab Getting Started Walkthrough Guide

This page provides instructions for adding protocols to the GUDMAP/RBK Data Explorer, based on the Nature Protocols format.

OpenAIRE. Fostering the social and technical links that enable Open Science in Europe and beyond

CORLI. a linguistic consortium for corpus, language and interaction

Kaltura Video Building Block 4.0 for Blackboard 9.x Quick Start Guide. Version: 4.0 for Blackboard 9.x

Component Registry, Browser and Editor Reference Manual

BLOOMBERG VAULT FOR FILES. Administrator s Guide

BEXIS Release Notes

CIS-CAT Pro Dashboard Documentation

How can CLARIN archive and curate my resources?

Cognition User Guide. Phone: Web: Bluesky Interactive ltd American Barns Banbury Road Lighthorne, CV35 0AE

Putting the Archives to Work: Workflow and Metadata-driven Analysis in LTER Science

Feeding the beast: Managing your collections of problems

Digital repositories as research infrastructure: a UK perspective

NetBackup Collection Quick Start Guide

123RF Corporate+ Administrator s Guide

Metadata and DCR. <CMD_Component /> Dieter Van Uytvanck. Max Planck Institute for Psycholinguistics

IBM InfoSphere Information Server Version 8 Release 7. Reporting Guide SC

Just for the record, CMDI should be about semantic interoperability

Best practices in the design, creation and dissemination of speech corpora at The Language Archive

PDS 2010 System Design Report

HP Asset Hub. Fundamentals Training April 2015

D6.2 Report on services and tools. George Bruseker Matej Durco Marc Kemps-Snijders Matteo Lorenzini Maria Theodoridou

How to mark assessments

Applied Data Governance - Part 3

Kaltura Video Package for Moodle 2.x Quick Start Guide. Version: 3.1 for Moodle

How to contribute information to AGRIS

Construction IC User Guide. Analyse Markets.

Sourcing - How to Create a Negotiation

Flexible Design for Simple Digital Library Tools and Services

Informatica PIM. Functional Overview. Version: Date:

Macbook Pro HostEurope CESNET 100%IT TransIP. DE (commercial) CZ UK Xeon E GHz. vcores Mem (GB)

on Air Quality Management and Assessment

Networking European Digital Repositories

Building metadata components

Best Practices for Choosing Content Reporting Tools and Datasources. Andrew Grohe Pentaho Director of Services Delivery, Hitachi Vantara

RECOMIND USER GUIDE 1

Test Automation Integration with Test Management QAComplete

ORAC Match. User Manual for Support Workers. V1.0 Updated 13/11/2017

SymmetryCRM: Outlook Mail Application Tool

ORAC Match. User Manual for Support Workers. V1.0 Updated 13/11/2017

Test Automation Integration with Test Management QAComplete

ecrt Workflow and Basic Information

Performance Evaluation Essentials

The Community Data Portal and the WMO WIS

Networking European Digital Repositories

B2FIND and Metadata Quality

Salary Survey Tool Help Guide

INTEGRATED WORKFLOW COPYEDITOR

User guide for GEM-TREND

EUDAT-B2FIND A FAIR and Interdisciplinary Discovery Portal for Research Data

Report to the IMC EML Data Package Checks and the PASTA Quality Engine July 2012

Interactive Data Submission System (IDSS) Frequently Asked Questions

User s Guide for Suppliers

DBpedia Data Processing and Integration Tasks in UnifiedViews

Oracle Big Data Cloud Service, Oracle Storage Cloud Service, Oracle Database Cloud Service

TemperateReefBase Data Submission

Webinar Annotate data in the EUDAT CDI

Spatial Data on the Web

Metadata implementation for a Business Intelligence environment. Yuriy Verbitskiy William Yeoh Andy Koronios

Demo: Linked Open Statistical Data for the Scottish Government

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Morningstar Direct SM Equity Attribution

Integration of INSPIRE & SDMX data infrastructures for the 2021 population and housing census

Table of Contents Error! Bookmark not defined.

VISITOR SEGMENTATION

IMPROVING YOUR JOURNAL WORKFLOW

Title Version Author(s) Date Status Distribution 1 Introduction 2 OpenSKOS community 2.1 OpenSKOS user stories

Community Edition Getting Started Guide. July 25, 2018

edamis Web Forms for sending data to Eurostat

Created on 3/23/2015 9:47:00 AM

Sourcing Buyer User Guide

MINT METADATA INTEROPERABILITY SERVICES

Exactly User Guide. Contact information. GitHub repository. Download pages for application. Version

About the Edinburgh Pathway Editor:

<Partner Name> <Partner Product> RSA NETWITNESS Intel Feeds Implementation Guide. Symantec DeepSight Intelligence

Monitoring and Evaluation Tool

REST API Documentation Using OpenAPI (Swagger)

Design and Implementation of the Japanese Virtual Observatory (JVO) system Yuji SHIRASAKI National Astronomical Observatory of Japan

A campus wide setup of Question Mark Perception (V2.5) at the Katholieke Universiteit Leuven (Belgium) facing a large scale implementation

Transcription:

Curation module in action - its preliminary findings on VLO metadata quality Davor Ostojić, Go Sugimoto, Matej Ďurčo (Austrian Centre for Digital Humanities) CLARIN Annual Conference 2016, Aix-en-Provence, France 2016-10-28

Outline 1. Why Curation Module? 2. What is Curation Module? 3. How does it work? 4. What can we find out? 5. Into the future

1. Why Curation Module? It s all about VLO (and VLO is about resource discovery) CMDI complexity and flexibility Impact on Metadata Quality VLO problems for resource discovery Trippel, et al. (2014), Kemps-Snijders (2014), ACDH team (2015), King et al. (2015), Odijk (2014, 2015)

What we know by now: How many CLARIN centers? (Center Registry) - 38 How many records in VLO? (VLO search) - 913629 How many records harvested? (CMDI harvester) - 881268 How many (public) profiles / components (CompReg) - 194 / 1207 How many concepts (CCR) - 3160 What structure and stats of CMDI metadata? (SMC browser) What percentage of facets covered? (Odijk 2014, King et al 2015)

2. What is Curation Module? tool for curation, normalisation and quality assessment / benchmarking of CMD records, collections and profiles. automatically and systematically collects statistics about metadata quantity and quality Web Application, RESTful API, and Java library https://clarin.oeaw.ac.at/curate (https://clarin.oeaw.ac.at/vlo)

Use cases 1. Metadata author can validate and assess the quality of a new record 2. Metadata modeler can assess the quality of profiles 3. Repository administrator can assess the quality of his repository 4. Metadata curators can investigate additional aspects of CMDI (beyond VLO facets) 5. Integration in VLO workflow

Initial Curation Module concept: Durco, Matej, and Karlheinz Mörth. 2014. "Towards a DH Knowledge Hub-Step 1: Vocabularies." Clarin Conference 2014.

Curation Module in the long-term Dashboard and its workflow (King et al, 2016)

3. How does it work Your enquiry Pre-processed

Facet section

Issues

Public profiles assessment Profile ID Profile Name Score (0.0-3.0) Facet coverage (0.0-1.0) Percentage of elements with concepts (0.0-1.0) Link to SMC Browser

Collection Assessment Avg. Score (0.0-14.0) Number of records Overall size (bytes) Avg. size per record (bytes) Number of used profiles Avg. number of Resource Proxy Avg. number of XML elements Avg. number of empty XML elements Avg. rate of XML population (0.0-1.0) Avg. facet coverage (0.0-1.0)

Basic integration of the SMC Browser

How we calculate the score - Profile profile is public 1.0 rate of elements annotated with concept [0.. 1] facets coverage [0.. 1] Max 3.0

How we calculate the score - Instance profile's score [0.. 3] file size < 10 Mb 0.0 or 1.0 Header schema is specified 0.0 or 1.0 schema resides in Component Registry 0.0 or 1.0 MdProfile element contains a valid value 0.0 or 1.0 MdCollectionDisplayName is not empty 0.0 or 1.0 MdSelfLink is not empty 0.0 or 1.0 ResProxy rate of resources with MIME type [0.. 1] rate of resources with references [0.. 1] rate of non empty XML elements [0.. 1] rate of accessible URLs [0.. 1] facets coverage [0.. 1] Max 14.0

4. What can we find out? Lot of numbers Assessment criteria -> Scores for profiles, instances & collections esp. Facet coverage of profiles, instances & collections Concept coverage of profiles Statistics on size of collections Problems with ResourceProxies (link checks) Detailed info messages on issues

Public vs. private profiles

Public vs. used profiles public used 194 105 public unused 127 public used private used 67 38

Avg Score

Comparison of facet uncoverage 2015 Nr. Records 631000 ratio Language Code 240183 Collection Resource Type 2016 880973 ratio change 38% 224423 25% -13% 0 0% 0 0% 0% 482935 77% 520290 59% -17% 174170 20% -57% - - Continent 472048 75% - Country 474637 75% 669885 76% 1% Modality 490195 78% 706971 80% 3% Genre 329114 52% 478582 54% 2% Subject 503233 80% 611406 69% -10% Format 62381 10% 37714 4% -6% Organisation 520560 82% 687504 78% -4% Availability 580907 92% 704595 80% -12% National Project 104316 17% 312475 35% 19% Keywords 567347 90% 816140 93% 3%

5. Into the future Upload file/copy&paste function for the assessment API support and/or CSV export (on its way) Automatic email notification to data providers (curation reports) Visualise results in a user friendly way Calibration of scoring Concentrate on the facet-value variability / normalisation Add facet-oriented view VLO Dashboard as integrated environment for VLO curators and administrators for work with CM and other components (harvester, validator, mapping and normalisation, VLO importer etc.) Any other suggestions?

Demo is ready for you to see more stats of the metadata (after the break) Austrian Centre for Digital Humanities (ACDH-OEAW) www.oeaw.ac.at/acdh Davor Ostojic davor.ostojic@oeaw.ac.at Go Sugimoto go.sugimoto@oeaw.ac.at Matej Ďurčo matej.durco@oeaw.ac.at