The Virtual Language Observatory

Dieter Van Uytvanck
CMDI workshop, Nijmegen, 2012-09-13

Overview
- What is the VLO?
- What is behind it, and how does it relate to CMDI?
- How do I get my data in there?
- Demo and exercises

Context sketch
Lots of resources somewhere out there:
- Data collections
- Corpora
- Lexica
- Grammars
- Multimedia recordings
- Software
- Web applications / services
Old-school linguistic resources:
- Books
- Articles
- CD-ROMs
It's like a jungle, sometimes...

VLO: the idea
- Researcher: where do I start?
- Provide a single entry point giving access to all information
- Because of the large amount of data: drill-down paradigm (decrease the search space gradually)
- Multiple ways of exploring:
  - Full-text search
  - Facet browsing
  - Geographic overlay
- Unified interface, links to the original context

VLO?
- Virtual Language Observatory
- http://www.clarin.eu/vlo/
- Several parts:
  - Facet browser (real search)
  - Google Earth overlay (visualization)
  - LRT inventory (ad-hoc, last-resort metadata entry)

Facets?
- A simple way to narrow down the search space, step by step
- Values offered are dynamic: they change with every previous selection made
- Purpose: quickly navigating through a huge amount of metadata

Facets?
- Purpose: quickly navigating through a huge amount of resources
- Also useful for metadata curation
- Not the tool to answer research questions!
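The drill-down idea on these two slides can be illustrated with a minimal sketch. It is not the VLO implementation (which runs on Solr); the records, field names and values below are made up purely for illustration. It shows how the values offered for one facet shrink after a selection on another facet.

```python
from collections import Counter

# Hypothetical toy records; field names and values are invented for this sketch.
RECORDS = [
    {"collection": "Archive A", "continent": "Asia", "genre": "discourse"},
    {"collection": "Archive A", "continent": "Africa", "genre": "narrative"},
    {"collection": "Archive B", "continent": "Asia", "genre": "discourse"},
    {"collection": "Archive B", "continent": "Europe", "genre": "lexicon"},
]


def drill_down(records, selections):
    """Keep only the records that match every selected facet value."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in selections.items())
    ]


def facet_values(records, facet):
    """Count the values still available for one facet in the current result set."""
    return Counter(r[facet] for r in records if facet in r)


# Before any selection, every genre is on offer.
all_genres = facet_values(RECORDS, "genre")

# After selecting continent=Asia, only the genres of the remaining records are offered.
asia = drill_down(RECORDS, {"continent": "Asia"})
asia_genres = facet_values(asia, "genre")
```

Note that the selections are combined with AND only, which matches the "simple search" character of a facet browser.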

VLO Faceted Browser (1)
http://catalog.clarin.eu/ds/vlo

VLO Faceted Browser (2)

VLO Faceted Browser (3)
- The metadata analyzed is in CMDI format
- Metadata sources:
  - CMDI files harvested from CLARIN centres
  - CMDI'fied OLAC records (from CLARIN centres and others)
  - CMDI'fied LRT inventory records
- You can get to resources directly from the search results

Exercises (1)
www.clarin.eu/vlo
Find some resources in the catalogue:
- Corpus Gysseling
- Telephone conversation recordings in Nepal

Exercises (2)
Find some resources in the Endangered Languages archive which are:
- (Spoken) discourse with at least two consultants in Asia
- Or (spoken) discourse with at least two consultants in a face-to-face conversation

Limits
- Inherent limit: simple search
  - No OR combinations possible
  - No sophisticated search operations
- Current limit (to be fixed):
  - Full-text search does not cover all fields, only the ones displayed in the VLO

CMDI architecture
[Diagram labels: ISOcat; component registry & editor; metadata modeler; metadata editor; metadata creator; metadata curator; local metadata repository; OAI-PMH data provider; OAI-PMH service provider; joint metadata repository; metadata catalogue; search & semantic mapping; metadata user; data]

Behind the scenes (1)
- Solr + Lucene
- Tomcat web application
- For parsing the CMDI files: VTD-XML
  - Faster than a SAX parser
  - Still full XPath access
  - Memory-efficient (1.3x to 1.5x the size of the XML document)

Behind the scenes (2)

VLO and ISOcat: natural allies (1)
- The import of metadata files used to be hard-coded
- Now we look at the ISOcat links in the XSDs generated from the CMDI profiles
- Fallback to a hard-coded XPath in case no ISOcat link is found

VLO and ISOcat: natural allies (2)
Import configuration example:

    <facetconcept name="name" allowmultiplevalues="false">
      <concept>http://www.isocat.org/datcat/dc-2544</concept>
      <concept>http://www.isocat.org/datcat/dc-2545</concept>
      <concept>http://purl.org/dc/terms/title</concept>
      <!-- no concept in lrt schema -->
      <pattern>/c:cmd/c:components/c:lrtinventoryresource/c:lrtcommon/c:resourcename/text()</pattern>
    </facetconcept>
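The "concept link first, hard-coded XPath as fallback" rule can be sketched as follows. This is only an illustration of the lookup logic, not the VLO importer itself: the function names, the `profile_links` dictionary (standing in for what would be read from a profile's XSD) and the example XPath for a linked element are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Facet definition as shown on the slide (pattern written on one line).
FACET_CONFIG = """
<facetconcept name="name" allowmultiplevalues="false">
  <concept>http://www.isocat.org/datcat/dc-2544</concept>
  <concept>http://www.isocat.org/datcat/dc-2545</concept>
  <concept>http://purl.org/dc/terms/title</concept>
  <pattern>/c:cmd/c:components/c:lrtinventoryresource/c:lrtcommon/c:resourcename/text()</pattern>
</facetconcept>
"""


def load_facet(config_xml):
    """Read the concept links and the fallback pattern of one facet."""
    root = ET.fromstring(config_xml)
    concepts = [c.text.strip() for c in root.findall("concept")]
    pattern = root.findtext("pattern").strip()
    return concepts, pattern


def xpaths_for_facet(concepts, pattern, profile_links):
    """Pick the XPaths to evaluate for a facet.

    profile_links is a hypothetical dict {xpath: concept link}; in practice
    such a mapping would be derived from the XSD of a CMDI profile.
    """
    linked = [xpath for xpath, link in profile_links.items() if link in concepts]
    # Fall back to the hard-coded pattern when no concept link matches.
    return linked or [pattern]


concepts, pattern = load_facet(FACET_CONFIG)

# Hypothetical profile in which one element carries a matching concept link.
linked_profile = {
    "/c:cmd/c:components/c:example/c:title/text()": "http://purl.org/dc/terms/title",
}
with_link = xpaths_for_facet(concepts, pattern, linked_profile)

# Profile without any concept link: the hard-coded pattern is used instead.
without_link = xpaths_for_facet(concepts, pattern, {})
```

The point of this design is that a new profile needs no importer changes as long as its elements are linked to one of the listed concepts.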

How do I get my metadata in there?
- Provide it as CMDI over OAI-PMH
- If that is not possible:
  - Provide it as OLAC over OAI-PMH
  - Provide it as IMDI over OAI-PMH
- If that is not possible either:
  - Enter it into the LRT inventory: www.clarin.eu/inventory
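To give an idea of what "over OAI-PMH" means for a harvester, here is a small sketch that builds a `ListRecords` request URL. The verb and the `metadataPrefix` and `set` parameters are part of the OAI-PMH protocol; the endpoint URL, the helper name and the prefix value `cmdi` are assumptions for illustration. The prefixes a given repository really offers can be checked with the `ListMetadataFormats` verb.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting by set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; "cmdi" as prefix is an assumption, not taken from the slides.
url = list_records_url("https://repository.example.org/oai", "cmdi")
```

A harvester would fetch this URL and follow the resumption tokens in the responses until all records have been retrieved.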

[Diagram labels: data category definitions in ISOcat (ISOcat.org): Order, Specimen, Habitat; Component registry; XSD files: Profile 1, Profile 2, Profile 3; Metadata Repository; CMDI files: Instance 1, Instance 2, Instance 3; Ingester (XPath = data category); VLO]

Recent Additions
- Links to language information: WALS, Wikipedia, Ethnologue, LinguistList and the VLO
- Descriptions in the record listing
- National Project facet
- Feedback link

Still to come
- A faceted browser is only as good as its data, so curation steps are needed
- More CMDI metadata
- Some more facets, e.g. year
- Human-readable hdl links
- Interface improvements

Questions?
- Ask them now
- Or send a mail to vlw@clarin.eu
More information:
- www.clarin.eu/vlo
- www.clarin.eu/cmdi