Implementation of OpenAIRE Guidelines for CRIS Managers to Finnish VIRTA Publication Information Service

Joonas Nikkanen, CSC - IT Center for Science, Finland - https://orcid.org/0000-0002-5036-6444
Dragan Ivanović, University of Novi Sad, Serbia - https://orcid.org/0000-0002-9942-5521
Hanna-Mari Puuska, CSC - IT Center for Science, Finland - https://orcid.org/0000-0001-5532-9274

euroCRIS Strategic Membership Meeting, November 28, 2018, WUT, Warsaw, Poland

CSC: Finnish research, education, culture and public administration ICT knowledge center
Agenda
  o Background: VIRTA Publication Information Service
  o Context: Finland from OpenAIRE's point of view
  o Case: Implementation of OpenAIRE Guidelines for CRIS Managers
  o Conclusions
VIRTA Publication Information Service
Background: Publication information collection in Finland

The Ministry of Education and Culture collects bibliographic information on scientific publications annually from
  o 14 universities and 5 university hospital districts (since 2011)
  o 23 universities of applied sciences (since 2012)
  o 12 state research institutes (gradually since 2014)

The information is used as a criterion in the performance-based funding model of higher education institutions
  o For universities, 13 % of core funding (about 200 million euros) is allocated via publication points
  o Publication points are calculated based on publication types and their level (evaluated by national scholarly panels - Publication Forum - http://www.julkaisufoorumi.fi/en)

Publication type                               Level 3  Level 2  Level 1  Level 0
Peer-reviewed monograph (C1)                        16       12        4      0.4
Peer-reviewed article in journal (A1-2)              4        3        1      0.1
Peer-reviewed article in book (A3)                   4        3        1      0.1
Peer-reviewed article in proceedings (A4)            4        3        1      0.1
Peer-reviewed edited work (C2)                       4        3        1      0.1

Not-peer-reviewed monographs: 0.4
Not-peer-reviewed articles: 0.1
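The coefficient table above can be read as a simple lookup. The following is a minimal sketch of that lookup, not the Ministry's actual calculation: it assumes that "A1-2" covers the type codes A1 and A2, it uses the hypothetical keys NPR_MONOGRAPH and NPR_ARTICLE for the two not-peer-reviewed rows, and it ignores any further adjustments the real funding model may apply.

```python
from decimal import Decimal
from typing import Iterable, Optional, Tuple

# Coefficients copied from the table above, ordered (Level 3, Level 2, Level 1, Level 0).
_STANDARD = (Decimal("4"), Decimal("3"), Decimal("1"), Decimal("0.1"))
PEER_REVIEWED = {
    "C1": (Decimal("16"), Decimal("12"), Decimal("4"), Decimal("0.4")),
    "A1": _STANDARD,  # assumption: "A1-2" in the table means types A1 and A2
    "A2": _STANDARD,
    "A3": _STANDARD,
    "A4": _STANDARD,
    "C2": _STANDARD,
}

# Hypothetical keys for the two not-peer-reviewed rows (the slide gives no codes).
NOT_PEER_REVIEWED = {
    "NPR_MONOGRAPH": Decimal("0.4"),
    "NPR_ARTICLE": Decimal("0.1"),
}


def publication_points(pub_type: str, level: Optional[int] = None) -> Decimal:
    """Return the coefficient for one publication."""
    if pub_type in NOT_PEER_REVIEWED:
        return NOT_PEER_REVIEWED[pub_type]
    if pub_type not in PEER_REVIEWED:
        raise ValueError(f"unknown publication type: {pub_type}")
    if level not in (0, 1, 2, 3):
        raise ValueError(f"level must be 0-3 for peer-reviewed types, got {level}")
    # The tuple is ordered from Level 3 down to Level 0.
    return PEER_REVIEWED[pub_type][3 - level]


def total_points(pubs: Iterable[Tuple[str, Optional[int]]]) -> Decimal:
    """Sum the coefficients over (publication type, level) pairs."""
    return sum((publication_points(t, lvl) for t, lvl in pubs), Decimal("0"))
```

For example, one Level 2 journal article, one Level 0 monograph and one not-peer-reviewed article add up to 3 + 0.4 + 0.1 points. Decimal is used instead of float so that sums of 0.1 and 0.4 stay exact.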
Background: Use of publication information in Finland

Organizations provide a copy of their publication information to VIRTA, making it a national data warehouse (or data hub) whose publication metadata other services can use
  o In total some 60 000 publications per year: books, journal articles, conference papers, non-scholarly publications, dissertations, artistic publications
  o All scientific fields are covered

The data can be
  o examined on bibliographic level: http://www.juuli.fi/
  o pivoted on statistical level: www.vipunen.fi/en-gb/
  o queried via authenticated REST API (XML, JSON) and OAI-PMH API (Dublin Core, XML)
[Diagram: data flows into and out of VIRTA]

Annual publication data (+60k publications annually)
  o 14 universities, 5 university hospital districts: Pure / Converis / SoleCRIS
  o 23 universities of applied sciences: mostly JUSTUS / some manually
  o 12 state research institutes: varies from Pure to manual

VIRTA Publication Information System (+350k publications)
  o Validating / de-duplicating / REST and OAI-PMH APIs available

Services using the data
  o National funding model
  o Vipunen: statistical portal for analysis
  o Juuli: portal for examining publications
  o Academy of Finland: funding calls and reporting
  o Organisations' systems: for master data purposes
  o JUSTUS: for prefilling co-publications
  o Research Information Hub: wider linking and interoperability
  o OpenAIRE: national aggregator
VIRTA in short

Data sources: Original metadata in local CRISes (Pure, Converis, SoleCRIS) or other publication databases of HEIs, university hospitals and state research institutes.
Data format: XML files (XML-CSV converter provided for small organizations).
Data contents: The data must include required fields and fulfill certain technical criteria.
Data transfer: From organizations via a secure and certified connection, using the SFTP protocol and SSH authentication keys.
Updates: New publications and corrections in local systems can be updated to VIRTA e.g. once a day. The frequency depends on the organization, the minimum being once a year.
Temporal coverage: All data from previous years to present can be transferred. Statistics are compiled once a year.
Data validation: Duplicates, faults and inter-organizational co-publications are identified automatically and in real time. Errors are reported to research organizations both in an online service and in email reports.
Data use and availability: All metadata is synced once per day and can be examined in the JUULI portal: www.juuli.fi. Yearly statistical data is available in the Vipunen portal: www.vipunen.fi. The REST API provides metadata in XML and JSON formats, the OAI-PMH API in XML and Dublin Core.
Finland from OpenAIRE's point of view
Finnish repositories supporting OpenAIRE harvesting

Includes
  o The repositories from 7 universities (+ 2 as sub-repositories)
  o Common repository for universities of applied sciences (Theseus)
  o One research institution (VTT) (+ 4 as sub-repositories)
  o Self-archived (green OA) publications
  o Theses

Missing
  o Repositories from 5 universities + most research institutions
  o Publications not archived in repositories, e.g.
      o Aaltodoc repository 2011-2017: 3049 publications
      o AaltoCRIS 2011-2017: 35 886 publications
      o Aalto in VIRTA 2011-2017: 27 257 publications
Survey for Finnish OpenAIRE providers by Finnish NOAD in 2018

The survey was carried out by the Finnish NOAD, the University of Helsinki

Responses from 7 (out of 9) current repositories that are harvested by OpenAIRE and from 7 repositories that are not

6/7 current providers mentioned the work on data models and on supporting a harvestable API endpoint as the biggest issues regarding OpenAIRE
  o In some cases repositories are not connected to CRIS systems
  o DSpace (used by most repositories) needs work (on publication forms etc.) to be compliant with OpenAIRE specifications
  o Harvested metadata is rather poor after mapping (repositories include richer metadata)

6/7 of the repositories not harvested by OpenAIRE are planning to implement OpenAIRE support
  o Half of them aim to be harvestable in 2018/2019, others have no schedule yet
  o 5/7 mention that the technical implementation and the work on metadata are too resource intensive for them to support the OpenAIRE specifications yet

Summary of results (in Finnish) at: https://blogs.helsinki.fi/openaire2020/2018/08/23/kansallisesta-koordinaatiosta-toivotaan-tukea-julkaisuarkistoille-openaire-kyselyn-tulosten-yhteenveto/
Implementation of OpenAIRE Guidelines for CRIS Managers
Why?

Centralized solution - to save time and resources
  o implementation of, and possible updates to, the OpenAIRE specifications have to be done in one system only

Better quality and more complete metadata for OpenAIRE
  o repositories only include a fraction of all publications in Finnish organizations

Timeline, 2018 Q1 to 2019 Q1, in the order shown on the slide:
  o Planning
  o Presenting to organizations
  o Mapping data models
  o Procedures for CERIF-XML
  o OAI-PMH work
  o OAI-PMH endpoint validating
  o Permissions from organizations
  o OpenAIRE beta
  o First harvests to OpenAIRE?
Guidelines for CRIS Managers

The Guidelines provide orientation for CRIS managers to expose their metadata in a way that is compatible with the OpenAIRE infrastructure. By implementing the Guidelines, CRIS managers support the inclusion, and therefore the reuse, of metadata in their systems within the OpenAIRE infrastructure.

OpenAIRE Guidelines for CRIS Managers version 1.1, released in June 2018
Main prerequisites
1. Metadata representation in CERIF XML
2. OAI-PMH endpoint for harvesting
What to do?
1. Map the VIRTA data model to the CERIF data model
2. Create a procedure for converting data from VIRTA to CERIF-XML, with the necessary customising
3. Validate that the OAI-PMH endpoint returns data as required by the Guidelines
4. Agree with organizations on what information they want to make available for harvest
5. Discuss with OpenAIRE the details of how and when to do the harvest
VIRTA Architecture

[Diagram: processing pipeline and API connections]

Virta Publication Information Service, processing steps:
  o Load XML files to SA (SSIS)
  o Validate data (SQL procedure)
  o Find Jufo_IDs (C#)
  o Find co-publications (C#)
  o Find duplicates (C#)
  o Send validation email to organisation
  o Transfer data to DW (SQL procedure)
  o Create VIRTA-XML (SQL procedure)
  o Create CERIF-XML (SQL procedure)
  o Create Dublin Core (SQL procedure)
  o Update API tables (SQL procedure)
  o Update reports (SQL procedure)

Other components: API (REST, OAI-PMH); PowerShell script

Connected services (HTTP methods labelled on the connections: GET, POST, PUT, DELETE):
  o Vipunen statistical portal - https://vipunen.fi
  o Juuli portal - http://juuli.fi
  o Organisations' systems (CRIS, DW, master data etc.)
  o JUSTUS - https://justus.csc.fi
  o Academy of Finland (funding calls / reporting)
  o OpenAIRE - https://www.openaire.eu
  o Research information hub - https://research.fi

Legend: data flow via db connection; ready to do
Data models

Many similarities between the VIRTA and CERIF data models on publications and on what elements are included

Key differences:
  o publication type classification
  o the case of IDs as a national aggregator
  o open access classification
  o national classifications (e.g. field of science)

Chosen not to be included in mapping:
  o Publication Forum levels (national scholarly panels)
  o artistic publications
Data models

VIRTA - CERIF mapping table available: https://wiki.eduuni.fi/pages/viewpage.action?pageid=80941717

VIRTA data model for reference: https://tietomallit.suomi.fi/model/julkaisu/
CERIF-XML

SQL procedures are already used in VIRTA for multiple purposes

New SQL procedure based on the VIRTA-CERIF mapping

Run the procedure to populate a database table with CERIF-XML data
  o populated only with publications that originate from organizations which have granted permission for the OpenAIRE harvest
Organizational level

Coordination with the Finnish OpenAIRE NOAD and discussions with organizations based on the plans and the data model mapping

Organizations are the registrars of the data - VIRTA only stores a copy
  o written permissions are needed before data is made available to external services or uses
  o organizations were asked
      1) whether OpenAIRE can harvest their data
      2) which publication years can be included in the harvest
      3) whether there are other limitations for the harvest (e.g. publication types)

These restrictions can be implemented in the CERIF-XML procedure as they come in, and are thus reflected in what the OAI-PMH endpoint exposes
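The three questions put to organizations translate naturally into a per-organization filter. In VIRTA this logic lives in an SQL procedure; the Python below is only an illustrative stand-in, and every name in it (HarvestPermission, Publication, is_harvestable, the field names and the example codes) is hypothetical rather than taken from the VIRTA data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class HarvestPermission:
    """One organization's written permission (hypothetical structure)."""
    org_id: str
    first_year: Optional[int] = None            # None means no lower bound
    last_year: Optional[int] = None             # None means no upper bound
    excluded_types: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Publication:
    """Minimal publication record used only for this sketch."""
    pub_id: str
    org_id: str
    year: int
    pub_type: str


def is_harvestable(pub: Publication,
                   permissions: Dict[str, HarvestPermission]) -> bool:
    """Apply questions 1-3: permission given, year allowed, type allowed."""
    perm = permissions.get(pub.org_id)
    if perm is None:
        return False  # 1) no permission on file: never expose
    if perm.first_year is not None and pub.year < perm.first_year:
        return False  # 2) publication year before the permitted range
    if perm.last_year is not None and pub.year > perm.last_year:
        return False  # 2) publication year after the permitted range
    if pub.pub_type in perm.excluded_types:
        return False  # 3) other limitation, here a publication type
    return True
```

Defaulting to "not harvestable" when no permission is on file mirrors the point that the organizations, not VIRTA, are the registrars of the data.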
OAI-PMH

OAI-PMH was already implemented in VIRTA for both Dublin Core and VIRTA-XML metadata
  o used as the basis for implementing the OpenAIRE specifications

New Java implementation for OAI-PMH to support the Guidelines
  o metadata prefix: oai_cerif_openaire
  o extension of supported sets: openaire_cris_publications
  o make sure that data is retrievable when the above prefix and set are requested
  o add description for the Identify request
  o write tests
  o validate the implementation by using the OpenAIRE CRIS validator
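From a harvester's side, the metadata prefix and set named above end up as query parameters of a standard OAI-PMH ListRecords request. The sketch below is a client-side illustration in Python, not the Java implementation described on the slide. The base URL is a placeholder, since the slides do not give the address of the VIRTA endpoint, and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}  # OAI-PMH 2.0 namespace
BASE_URL = "https://example.org/oai"  # placeholder, not the real VIRTA endpoint


def build_request(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **params})}"


def list_records_url(base_url: str, resumption_token: Optional[str] = None) -> str:
    """First page uses prefix and set; later pages use only the token."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        return build_request(base_url, "ListRecords",
                             resumptionToken=resumption_token)
    return build_request(base_url, "ListRecords",
                         metadataPrefix="oai_cerif_openaire",
                         set="openaire_cris_publications")


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumptionToken of a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    element = root.find(".//oai:resumptionToken", OAI_NS)
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()
```

A harvester would fetch list_records_url(BASE_URL), pass the response body to next_token, and repeat with the returned token until it gets None. build_request(BASE_URL, "Identify") gives the request whose response should carry the added description.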
OAI-PMH

Tests run with the local CRIS Guidelines validator against the endpoint

Issues found by running it:
  o element issues (Type, Access, PublishedIn)
  o set issues (Events, OrgUnits)
  o format issues (too long ID, wrong order, missing tags etc.)
  o work in progress
Conclusion
1. The implementation process is highly dependent on the source system architecture / technologies
2. Aiming for high quality metadata means more work and more complex mapping (+ upkeep)
3. Following the Guidelines is straightforward, but further support and best practices would be useful for implementers
Thank you!
Save the date: 21st to 25th of October 2019, Poznan, Poland

ENRESSH Training school on working with national bibliographic databases. Follow-up for the workshop held in Antwerp in September 2018.

More information: ecoom@uantwerp.be
http://blogs.lse.ac.uk/impactofsocialsciences/2018/11/13/towards-more-consistent-transparent-and-multi-purpose-national-bibliographic-databases-for-research-output/
Joonas Nikkanen
Project Manager, Research Information Management and Interoperability
Tel. +358 50 381 80 92
linkedin.com/in/joonas-nikkanen

facebook.com/cscfi
twitter.com/cscfi
youtube.com/cscfi
linkedin.com/company/csc---it-center-for-science
github.com/cscfi

Images: CSC archive and Thinkstock