The Virtual Language Observatory
  Dieter Van Uytvanck
  CMDI workshop, Nijmegen
  2012-09-13

Overview
  VLO?
  What is behind it? Relation to CMDI?
  How do I get my data in there?
  Demo + exercises

Context sketch
  Lots of resources somewhere out there:
    Data collections
    Corpora
    Lexica
    Grammars
    Multimedia recordings
    Software
    Web applications / services
  Old-school linguistic resources:
    Books
    Articles
    CD-ROMs
  It's like a jungle, sometimes...

VLO: the idea
  Researcher: where do I start?
  Provide a single entry point giving access to all information
  Because of the large amount of data:
    Drill-down paradigm (decrease search space gradually)
  Multiple ways of exploring:
    Full-text search
    Facet browsing
    Geographic overlay
  Unified interface, links to the original context

VLO?
  Virtual Language Observatory
  http://www.clarin.eu/vlo/
  Several parts:
    Facet browser (real search)
    Google Earth overlay (visualization)
    LRT inventory (ad-hoc, last resort metadata entry)

Facets?
  A simple way to narrow down the search space, step by step
  Values offered are dynamic: they change with every previous selection made
  Purpose: quickly navigating through a huge amount of metadata

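To illustrate the "dynamic values" point, here is a minimal in-memory sketch of facet narrowing. It is not how the VLO itself is implemented; the records, field names and helper functions are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"language": "Dutch", "country": "Netherlands", "type": "corpus"},
    {"language": "Dutch", "country": "Belgium", "type": "lexicon"},
    {"language": "Nepali", "country": "Nepal", "type": "corpus"},
    {"language": "Nepali", "country": "Nepal", "type": "recording"},
]


def narrow(records, selections):
    """Keep only the records that match every selected facet value."""
    return [r for r in records
            if all(r.get(facet) == value for facet, value in selections.items())]


def facet_counts(records, facet):
    """Count how often each value of one facet occurs in the given records."""
    return Counter(r[facet] for r in records if facet in r)


# Before any selection, every resource type is offered.
all_types = facet_counts(RECORDS, "type")

# After selecting country = Nepal, only the types still present are offered.
nepal_only = narrow(RECORDS, {"country": "Nepal"})
nepal_types = facet_counts(nepal_only, "type")

print(dict(all_types))
print(dict(nepal_types))
```

Each selection shrinks the record set, and the counts for the remaining facets are recomputed from that smaller set, which is why the offered values change after every step.
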
Facets?
  Purpose: quickly navigating through a huge amount of resources
  Useful too for metadata curation
  Not the tool to answer research questions!

VLO Faceted Browser (1)
  http://catalog.clarin.eu/ds/vlo

VLO Faceted Browser (2)
VLO Faceted Browser (3)
  The metadata analyzed is in CMDI format
  Metadata sources:
    CMDI files harvested from CLARIN centres
    CMDI'fied OLAC records (from CLARIN centres and others)
    CMDI'fied LRT inventory records
  You can get to resources directly from the search results

Exercises (1)
  www.clarin.eu/vlo
  Find some resources in the catalogue:
    Corpus Gysseling
    Telephone conversation recordings in Nepal

Exercises (2)
  Find some resources in the Endangered Languages archive which are:
    (Spoken) discourse with at least two consultants in Asia
    Or (spoken) discourse with at least two consultants in a face-to-face conversation

Limits
  Inherent limit: simple search
    No OR combinations possible
    No sophisticated search operations
  Current limit (to be fixed)
    Full-text search does not cover all fields, only the ones displayed in the VLO

CMDI architecture
  [Diagram. Labels: metadata catalogue; ISOcat; component registry & editor; metadata modeler; metadata user; search & semantic mapping; metadata editor; metadata creator; metadata curator; joint metadata repository; OAI-PMH service provider; local metadata repository; OAI-PMH data provider; data]

Behind the scenes (1)
  Solr + Lucene
  Tomcat web application
  For parsing the CMDI files: VTD-XML
    Faster than a SAX parser
    Still full XPath access
    Memory-efficient (1.3x~1.5x the size of an XML document)

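As a rough illustration of how a faceted search can be expressed against Solr, the sketch below builds a query URL from standard Solr request parameters (q, fq, facet, facet.field, facet.mincount). The core URL, the field names and the helper function are assumptions made for this example; the actual VLO index layout may differ.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL; the real VLO deployment may use another one.
SOLR_SELECT_URL = "http://localhost:8983/solr/vlo/select"


def build_facet_query(text, selections, facet_fields):
    """Build a Solr select URL for a full-text query plus facet selections.

    text         -- free-text query, or None/"" to match all documents
    selections   -- dict of facet field -> chosen value (combined with AND)
    facet_fields -- fields for which Solr should return value counts
    """
    params = [
        ("q", text or "*:*"),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide values that no longer match anything
    ]
    for field in facet_fields:
        params.append(("facet.field", field))
    for field, value in selections.items():
        # Quote the value as a phrase; escape backslashes and double quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", '{}:"{}"'.format(field, escaped)))
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Field names "country" and "language" are hypothetical.
url = build_facet_query("conversation", {"country": "Nepal"},
                        ["language", "country"])
print(url)
```

Every selection becomes its own filter query, and Solr intersects all filter queries, which matches the "no OR combinations" limit of a simple facet browser.
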
Behind the scenes (2)
VLO and ISOcat: natural allies (1)
  The import of metadata files used to be hard-coded
  Now we look at the ISOcat links in the XSDs as generated from the CMDI profiles
  Fallback to hard-coded XPath in case no ISOcat link is found

VLO and ISOcat: natural allies (2)
  Import configuration example:

    <facetconcept name="name" allowmultiplevalues="false">
      <concept>http://www.isocat.org/datcat/dc-2544</concept>
      <concept>http://www.isocat.org/datcat/dc-2545</concept>
      <concept>http://purl.org/dc/terms/title</concept>
      <!-- no concept in lrt schema -->
      <pattern>/c:cmd/c:components/c:lrtinventoryresource/c:lrtcommon/c:resourcename/text()</pattern>
    </facetconcept>

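The lookup order described on these slides (concept links first, hard-coded pattern as fallback) can be sketched as follows. This is an illustration under stated assumptions, not the VLO importer: the concept-to-path table, the sample document and the function names are hypothetical. Python's ElementTree supports only a subset of XPath, so the paths here select elements and the text is read from them, instead of using text().

```python
import xml.etree.ElementTree as ET

# Namespace prefix used in the paths below (assumed for this example).
NS = {"c": "http://www.clarin.eu/cmd/"}

# Hypothetical table an importer might derive from a profile's XSD:
# concept link -> path of the element annotated with that link.
CONCEPT_TO_XPATH = {
    "http://www.isocat.org/datcat/dc-2544": "./c:components/c:corpus/c:title",
}


def resolve_xpaths(concepts, fallback_pattern, concept_to_xpath):
    """Return the paths for all known concepts, or the fallback if none is known."""
    paths = [concept_to_xpath[c] for c in concepts if c in concept_to_xpath]
    return paths or [fallback_pattern]


def extract_values(root, xpaths):
    """Collect the non-empty text of every element matched by the given paths."""
    values = []
    for path in xpaths:
        for element in root.findall(path, NS):
            if element.text and element.text.strip():
                values.append(element.text.strip())
    return values


# Hypothetical metadata record, kept tiny on purpose.
SAMPLE = """<cmd xmlns="http://www.clarin.eu/cmd/">
  <components>
    <corpus><title>Example corpus</title></corpus>
  </components>
</cmd>"""

name_concepts = [
    "http://www.isocat.org/datcat/dc-2544",
    "http://www.isocat.org/datcat/dc-2545",
    "http://purl.org/dc/terms/title",
]
fallback = "./c:components/c:lrtinventoryresource/c:lrtcommon/c:resourcename"

root = ET.fromstring(SAMPLE)
paths = resolve_xpaths(name_concepts, fallback, CONCEPT_TO_XPATH)
names = extract_values(root, paths)
print(paths, names)
```

With an empty concept table, resolve_xpaths returns only the fallback pattern, which mirrors the "no ISOcat link found" case.
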
How do I get my metadata in there?
  Provide it as CMDI over OAI-PMH
  If that is not possible:
    Provide it as OLAC over OAI-PMH
    Provide it as IMDI over OAI-PMH
  If that is not possible either:
    Enter it into the LRT inventory:
    www.clarin.eu/inventory

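The order of preference above can be sketched as a small helper that picks a metadata format and builds an OAI-PMH ListRecords request. ListRecords and metadataPrefix are part of the OAI-PMH protocol; the endpoint URL and the prefix strings "cmdi", "olac" and "imdi" are assumptions here, since each data provider chooses its own prefix names (a ListMetadataFormats request shows which ones it offers).

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute your own data provider's base URL.
BASE_URL = "https://example.org/oai"

# Order of preference from the slide. The prefix strings are assumptions.
PREFERRED_PREFIXES = ["cmdi", "olac", "imdi"]


def choose_prefix(supported):
    """Return the most preferred prefix the provider supports, or None."""
    for prefix in PREFERRED_PREFIXES:
        if prefix in supported:
            return prefix
    return None


def list_records_url(base_url, prefix):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": prefix})
    return base_url + "?" + query


# Example: a provider that offers Dublin Core and OLAC, but no CMDI.
prefix = choose_prefix({"oai_dc", "olac"})
if prefix is None:
    print("No harvestable format: enter the metadata into the LRT inventory")
else:
    print(list_records_url(BASE_URL, prefix))
```
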
[Diagram. Labels: Order; Specimen; Habitat; definitions; ISOcat (ISOcat.org); XSD files (Profile 1, Profile 2, Profile 3); component registry; ingester (XPath = data category); VLO; CMDI files (Instance 1, Instance 2, Instance 3); metadata repository]

Recent additions
  Links to language information: WALS, Wikipedia, Ethnologue, LinguistList and the VLO
  Descriptions in the record listing
  National Project facet
  Feedback link

Still to come
  A faceted browser is only as good as its data, so curation steps are needed
  More CMDI metadata
  Some more facets, e.g. year
  Human-readable hdl links
  Interface improvements

Questions?
  Ask them now
  Or send a mail to vlw@clarin.eu
  More information:
    www.clarin.eu/vlo
    www.clarin.eu/cmdi