Fedora and GSearch in a Research Project about Integrated Search Open Repositories 2009

Similar documents
Introduction to Federico 2.0 and Fedora Commons

Ing. José A. Mejía Villar M.Sc. Computing Center of the Alfred Wegener Institute for Polar and Marine Research

Representing and Storing Complex Digital Objects Fedora

Fedora. CS 431 April 17, 2006 Carl Lagoze Cornell University. Acknowledgements: Sandy Payette (Cornell)

Comparing Open Source Digital Library Software

Share.TEC Repository System

Combining Content, Semantic Relationships, and Web Services Fedora


A Repository of Metadata Crosswalks. Jean Godby, Devon Smith, Eric Childress, Jeffrey A. Young OCLC Online Computer Library Center Office of Research

WikiD (Wiki/Data) Jeffrey A. Young OCLC Office of Research code4lib 2006 Oregon State University, Corvallis, Oregon 15 February 2006

Fedora Commons Update

Software Requirements Specification for the Names project prototype

OAI Static Repositories (work area F)

Developing an Institutional Repository Service in Chinese Academy of Sciences

Digital Library Interoperability. Europeana

Fedora Relationships and Information Network Overlays. CS 431 April 19, 2006 Carl Lagoze Cornell University

IMu OAI-PMH Web Service

Nuno Freire National Library of Portugal Lisbon, Portugal

Problem: Solution: No Library contains all the documents in the world. Networking the Libraries

Appendix REPOX User Manual

Digibess: thanks Islandora! Arcidosso Italy March, 20-22, Giancarlo Birello, Anna Perin IT office and Library CNR-Ceris

Metadata Catalogue Issues. Daan Broeder Max-Planck Institute for Psycholinguistics

Using DSpace for Digitized Collections. Lisa Spiro, Marie Wise, Sidney Byrd & Geneva Henry Rice University. Open Repositories 2007 January 23, 2007

Presented by Sandro Zic

D4.8 Report on semantic interoperability with Europeana

OAI-PMH. DRTC Indian Statistical Institute Bangalore

Adding user context to IR test collections

SMART CONNECTOR TECHNOLOGY FOR FEDERATED SEARCH

Developing Seamless Discovery of Scholarly and Trade Journal Resources Via OAI and RSS Chumbe, Santiago Segundo; MacLeod, Roddy

How to Edit/Create Ingest Forms

MuseKnowledge Hybrid Search

(Geo)DCAT-AP Status, Usage, Implementation Guidelines, Extensions

Registry Interchange Format: Collections and Services (RIF-CS) explained

Adding OAI ORE Support to Repository Platforms

XML. Objectives. Duration. Audience. Pre-Requisites

Developing a Test Collection for the Evaluation of Integrated Search Lykke, Marianne; Larsen, Birger; Lund, Haakon; Ingwersen, Peter

OAI (Open Archives Initiative) Suite Version 3.0. Introductory Guide for New Users

Network Information System. NESCent Dryad Subcontract (Year 1) Metacat OAI-PMH Project Plan 25 February Mark Servilla

Outline of the course

European Holocaust Research Infrastructure Theme [INFRA ] GA no Deliverable D19.5

Islandora and Fedora 4; The Atonement v3: The Atonermenter

ECP-2008-DILI EuropeanaConnect. D5.7.1 EOD Connector Documentation and Final Prototype. PP (Restricted to other programme participants)

FLAT: A CLARIN-compatible repository solution based on Fedora Commons

Signed metadata : method and application

Harvester Service Technical and User Guide 5 June 2008

How to Edit/Create Ingest Forms

Institutional repositories: description of VITAL as an example of a Fedora-based digital assets management system.

Joining the BRICKS Network - A Piece of Cake

DSpace Fedora. Eprints Greenstone. Handle System

Digital Libraries: Interoperability

An aggregation system for cultural heritage content

ACDH AUSTRIAN CENTRE FOR DIGITAL HUMANITIES

Richard Marciano Alexandra Chassanoff David Pcolar Bing Zhu Chien-Yi Hu. March 24, 2010

Fedora Commons: Taking on the Challenge of the Next Generation of Scholarly Communication

Content-based image retrieval integrated into Fedora

Building for the Future

RVOT: A Tool For Making Collections OAI-PMH Compliant

Chuck Cartledge, PhD. 25 February 2018

Inter and Intra-Document Contexts Applied in Polyrepresentation

Drupal for Virtual Learning And Higher Education

Development of an Ontology-Based Portal for Digital Archive Services

The Open Archives Initiative in Practice:

ScienceDirect. Multi-interoperable CRIS repository. Ivanović Dragan a *, Ivanović Lidija b, Dimić Surla Bojana c CRIS

Mapping Existing Data Sources into VIVO. Pedro Szekely, Craig Knoblock, Maria Muslea and Shubham Gupta University of Southern California/ISI

Phase 1 RDRDS Metadata

How to Create a Custom Ingest Form

CodeSharing: a simple API for disseminating our TEI encoding. Martin Holmes

GeoNetwork opensource

The Virtual Language Observatory!

IVOA Registry Interfaces Version 0.1

Orbis Cascade Alliance Content Creation & Dissemination Program Digital Collections Service. Enabling OAI & Mapping Fields in Digital Commons

WEB-BASED COLLECTION MANAGEMENT FOR LIBRARIES

2nd Technical Validation Questionnaire - interim results -

Expected and Unexpected Synergies

National Documentation Centre Open access in Cultural Heritage digital content

Open Archives Initiatives Protocol for Metadata Harvesting Practices for the cultural heritage sector

The Open Archives Initiative Protocol for Metadata Harvesting: An Introduction

BPMN Processes for machine-actionable DMPs

Ryen W. White, Peter Bailey, Liwei Chen Microsoft Corporation

Pipe Dreams: Harvesting Local Collections into Primo Using OAI-PMH

OAI-PMH implementation and tools guidelines

CMDI and granularity

On Being a Hub: Some Details behind Providing Metadata for the Digital Public Library of America

Flexible Design for Simple Digital Library Tools and Services

COAR Interoperability Roadmap. Uppsala, May 21, 2012 COAR General Assembly

DISCOVER THE POWER OF VITAL The Solution for Your Digital Collection Management

Representing LEAD Experiments in a FEDORA digital repository

DC Regional Fedora Users Meeting

Open Source Software Packages for E-Resource Management

The EHRI GraphQL API IEEE Big Data Workshop on Computational Archival Science

Metadata and Encoding Standards for Digital Initiatives: An Introduction

Developing ArXivSI to Help Scientists to Explore the Research Papers in ArXiv

The Metadata Challenge:

Web-based workflow software to support book digitization and dissemination. The Mounting Books project

A Case Study in Metadata Harvesting: the NSDL

An Approach To Automatically Generate Digital Library Image Metadata For Semantic And Content- Based Retrieval

The Community Data Portal and the WMO WIS

Metadata Harvesting Framework

Building a Digital Library Software

Part 2: Current State of OAR Interoperability. Towards Repository Interoperability Berlin 10 Workshop 6 November 2012

Transcription:

Fedora and GSearch in a Research Project about Integrated Search Open Repositories 2009 Gert Schmeltz Pedersen DTU Library, Technical Information Center @ DTU, Technical University of Denmark

Overview Introduction Isearch experimental setup GSearch indexing and configuration Fedora+GSearch for Isearch experimental setup Details of harvest, ingest and indexing Integrated search Conclusion

Introduction The Royal School of Library and Information Science[1] in Denmark is performing a research project about integrated search, partly funded by DEFF[2], Denmark's Electronic Research Library. DTU Library provides assistance in the form of a Fedora+GSearch installation. The research project involves empirical investigations following the principles of polyrepresentation in information retrieval[3]: The polyrepresentation principle suggests that cognitively and functionally different representations of information objects may be used in information retrieval to enhance quality of results.. The continuum proposed by Larsen (2004; Ingwersen & Larsen, 2005) showing the structural dimension of the retrieval techniques involved in polyrepresentation is further elaborated by adding a novel second dimension consisting of query structure and modus. The new twodimensional continuum can be seen as a constructive framework for further investigations of polyrepresentative principles in IR. [1] The Royal School of Librarianship, http://db.dk [2] Denmark's Electronic Research Library, http://deff.dk [3] B.Larsen, P.Ingwersen, J.Kekäläinen. The Polyrepresentation Continuum in IR, http://portal.acm.org/citation.cfm?id=1164840

Introduction The Fedora+GSearch installation provides a backend for the empirical investigations, containing an index of metadata records harvested from various sources, thus allowing polyrepresentation of information objects. The installation so far harvests from two types of sources, one using OAI-PMH[4], and one using the SRU protocol[5]. This presentation will focus on the technical challenges involved in the setup and indexing of the various sources, facilitating the integrated search. [4] Protocol for Metadata Harvesting, http://www.openarchives.org/pmh/ [5] Search/Retrieval via Url, http://www.loc.gov/standards/sru/

Isearch experimental setup User information needs, ~50, formulated by domain experts in physics, translated to queries by the Isearch researchers Sets of metadata (and possibly full-texts) characterized by origin, format, etc. Indexes built on subsets of metadata, characterized by origin and index process properties Search results for user information needs Evaluations of search results vs user information needs by domain experts Isearch webapp Receives metadata and full-texts Performs searches Provides search sets for evaluation Stores evaluations for analysis by researchers Allows various configurations of metadata and indexes, for experimentation The harvest and ingest service does Harvest by various protocols, data formats, and data providers Ingest according to configurations Thus, we may have polyrepresentation, in that a publication may have many different representations Each configuration will lead to a different set of search results for the set of user information needs, so that the Isearch researchers can use the domain expert evaluations to conclude on effects of polyrepresentation Fedora with GSearch was selected to assist the tasks in green. Some of the considerations follow

GSearch indexing Clients interact with the fedora webapp for management of fedora objects (foxml records) and access to them Clients interact with the gsearch webapp for indexing, search and browse of foxml records Fedora may be configured to send indexing messages to gsearch, when foxml records are managed The gsearch updateindex operation (green) consists of a transformation of a foxml record into an IndexDocument, which is delivered to the search engine for inclusion in the index The transformation is given in an xslt stylesheet, where each IndexField gets its contents from somewhere in the foxml record, either from object properties, or from inline xml datastreams, or from managed, external or referenced datastreams, which could be full-texts <foxml:digitalobject PID="demo:21"> <foxml:objectproperties> <foxml:property NAME="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" VA <foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="A <foxml:property NAME="info:fedora/fedora-system:def/model#label" VALUE="S <foxml:property NAME="info:fedora/fedora-system:def/model#contentModel" V </foxml:objectproperties> <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE <foxml:datastreamversion ID="DC1.0" LABEL="Dublin Core for the Document <foxml:xmlcontent> <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/oai/2.0/oai_dc/" xmlns:dc=" <dc:title>advanced FO Sample from Apache FOP Distribution</dc:title> <dc:creator>apache Group</dc:creator> </oai_dc:dc> </foxml:xmlcontent> </foxml:datastreamversion> </foxml:datastream> </foxml:digitalobject> xslt transformation <IndexDocument > <IndexField IFname="PID >demo:21 <IndexField IFname="property.type >FedoraObject <IndexField IFname="property.state >Active <IndexField IFname="property.contentModel >FO_TO_PDFDOC <IndexField IFname="dc.title">Advanced FO Sample <IndexField IFname="dc.creator">Apache Group <IndexField index="tokenized" dsid="ds1" IFname="DS1.text"/> </IndexDocument>

GSearch - many-to-many GSearch may be configured to receive foxml records from many fedora repositories and to maintain many indexes, in parallel Foxml records may be indexed by many different stylesheets, and thus give different IndexDocuments for the different indexes

GSearch configuration Configuration example: fedoragsearch.properties - soapbase = http://hostport/fedoragsearch/services - repositorynames = REPOSNAMES - indexnames = INDEXNAMES INDEXNAME/index.properties - operationsimpl = dk.defxws.fgslucene.operationsimpl - defaultqueryfields = dc.description dc.title REPOSNAME/repository.properties - soapbase = http://fedorahostport/fedora/services - fedoraobjectdir = FEDORAOBJECTDIR

Fedora+GSearch for Isearch experimental setup The Isearch webapp is realized by the two webapps and the gsearch config We use the lucene search engine The harvest service is so far realized for two protocols, OAI-PMH and SRU The ingest service is realized by format specific xslt stylesheets and some gluing code. These stylesheets basically wraps the harvested records as datastreams into foxml records The indexing xslt stylesheets are tailored to fetch the contents for the IndexFields from the contents of specific elements in the metadata records (examples follow) Many different gsearch configs may be provided, each will determine a subset of metadata records and characteristics of the resulting indexes, and will thereby lead to a different set of search results for the set of user information needs (run as gfindobjects queries from a client)

Harvest and ingest with OAI-PMH OAI-PMH, Open Archives Initiative Protocol for Metadata Harvesting, is a mechanism for repository interoperability The harvester2 library from OCLC has been downloaded and installed, run example: java cp.:log4j-1.2.12.jar:harvester2.jar:xalan.jar ORG.oclc.oai.harvester2.app.RawWrite -from 2009-01-01 until 2009-04-30 -setspec physics -metadataprefix oai_dc http://export.arxiv.org/oai2

Harvest with OAI-PMH

OAI-PMH record and indexing example <xsl:template match="/foxml:digitalobject/foxml:datastream/foxml:datastreamversion[last()]/foxml:xmlcontent/oai_dc:dc" > <IndexDocument> <xsl:for-each select="dc:title"> <IndexField IFname="TITLE" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> <xsl:for-each select="dc:creator"> <IndexField IFname="CREATOR" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> <xsl:for-each select="dc:subject"> <IndexField IFname="SUBJECT" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> </IndexDocument> </xsl:template> === <IndexDocument> <IndexField IFname="TITLE" index="tokenized" store="yes" termvector="no" boost="1.0"> Using Structural Metadata... <IndexField IFname="CREATOR" index="tokenized" store="yes" termvector="no" boost="1.0"> Dushay, Naomi <IndexField IFname="SUBJECT" index="tokenized" store="yes" termvector="no" boost="1.0"> Digital Libraries </IndexDocument>

Harvest and indexing with SRU SRU Search/Retrieval via URL Example: http://webservice.bibliotek.dk/soeg/?version=1.1 &operation=searchretrieve &query=dc.title=physics &startrecord=1 &maximumrecords=100 &recordschema=marcx &stylesheet=default.xsl &recordpacking=xml

Harvest with SRU

Harvest with SRU

SRU record and indexing example <xsl:template match="/foxml:digitalobject/foxml:datastream/foxml:datastreamversion[last()]/foxml:xmlcontent/mxc:record" > <IndexDocument> <xsl:for-each select="mxc:datafield[@tag=100]/mxc:subfield"> <IndexField IFname="CREATOR" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> <xsl:for-each select="//mxc:datafield[@tag=241]/mxc:subfield"> <IndexField IFname="TITLE" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> <xsl:for-each select="mxc:datafield[@tag=245]/mxc:subfield"> <IndexField IFname="TITLE" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> <xsl:for-each select="mxc:datafield[@tag=652]/mxc:subfield"> <IndexField IFname="SUBJECT" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> <xsl:for-each select="mxc:datafield[@tag=662]/mxc:subfield"> <IndexField IFname="SUBJECT" index="tokenized" store="yes" termvector="no" boost="1.0"> <xsl:value-of select="text()"/> </xsl:for-each> </IndexDocument> </xsl:template> === <IndexDocument> <IndexField IFname="CREATOR" index="tokenized" store="yes" termvector="no" boost="1.0">... <IndexField IFname="TITLE" index="tokenized" store="yes" termvector="no" boost="1.0">... <IndexField IFname="SUBJECT" index="tokenized" store="yes" termvector="no" boost="1.0">... </IndexDocument>

Integrated search Now we may search on all sources in one query Examples: CREATOR:Smith and SUBJECT:oxygen Smith and oxygen and TITLE:process* Queries may be on specific fields and/or on the configured defaultqueryfields Search results may be from all sources indexed in the default index or in a specific index named in the query URL

Conclusion The delivery to Isearch is an example of a non-trivial and unusual application of Fedora+GSearch The facilities for experimentation are key points for Isearch The delivery has yet to prove its value Hopefully, you have been inspired to see new possibilities for your own projects with Fedora+GSearch