Advanced Tooling in MarcEdit TERRY REESE THE OHIO STATE UNIVERSITY

Advanced Tooling in MarcEdit TERRY REESE THE OHIO STATE UNIVERSITY REESET@GMAIL.COM

Data and slides download: http://marcedit.reeset.net/workshops/um_marcedit7.zip
Download, open, and extract, saving to your desktop (or wherever).

MarcEdit 7!
MarcEdit 7 was released over the U.S. Thanksgiving holiday. The release:
1. Has been in development for close to 9 months, with ~20 testers in 7 countries using 4 different MARC flavors providing direct feedback
2. Touched nearly every part of the program; when finished, the release updated a shade under 350,000 lines of code
3. Was tested against almost 20 million records
4. Is the first version of MarcEdit designed with accessibility in mind
5. Is fast (I'm going to show you a few places where)

MarcEdit 7 highlights
Lightweight clustering has been added directly into the program
New way to process XML/JSON data
A new linked data engine, with support for locally defined RDF vocabularies in reconciliation
New task processing
Consolidated Z39.50/SRU client
Added editing functions
New Add/Delete Field tools (deduplication)
Expanded regular expression options
Updated OCLC integrations
Integrated help

Today's topics
Quick overview of MarcEdit 7 changes
Explore MarcEdit 7's new clustering functionality
Working with non-MARC data using known and unknown metadata formats
Explore MarcEdit 7's linked data platform
MarcEdit regular expressions primer
Integration opportunities with Alma or other ILS systems
OCLC Connexion

Let's look at what's new
Welcome to Project Hazel, your friendly (and sometimes helpful) installation agent. Hazel is there to help highlight important options, and to make sure you can work with Unicode data by checking that you have a Unicode font.
Accessibility
MarcEdit 7 includes an improved font/sizing engine for better layout on different screen sizes and resolutions
All images are tagged with text and are accessible via screen readers or the operating system's accessibility tooling
Availability of themes, to allow you to customize windowing and contrast to ease eye strain
Keyboard shortcuts (everywhere)
Sound cues
Window transparency

Let's look at what's new
More international: MarcEdit 7 uses an intelligent machine translation service, providing an interface in close to 26 languages at this point
It's faster: lists have been virtualized (lower overhead), pages load quicker, and tasks have been supercharged
It's leaner, in part because Windows XP support is no longer provided

Let's look at what's new
The program is easier to manage. It has 4 installation modes:
32-bit Administrator and non-Administrator installation modes
64-bit Administrator and non-Administrator installation modes
How do I choose? It depends on your needs: http://marcedit.reeset.net/downloads

Let's talk about task changes: how they worked in MarcEdit 6

Task changes: how they work in MarcEdit 7

So what does that mean to me?
In MarcEdit 6, the optimal task size was ~20 operations or fewer. Once the operation count began to get higher, the time that it would take to process data would become exponentially slower. In MarcEdit 7, that performance line actually goes the other direction. The tool processes records faster, and handles more records per second, the more task actions are completed.
Real-world example
A library in Greece has a task list with over 1000 task actions. They use this task to clean up large portions of their database in one pass. Generally, this means processing ~300,000-500,000 records at a time. In MarcEdit 6, this process would take as many as 10 hours to complete. Using the MarcEdit 7 task processing, this process now takes less than 20 minutes.

But seeing is believing
Let's compare processing using one record, but with a task list that uses north of 100 task actions, in MarcEdit 6 and MarcEdit 7.

Other comparisons
Virtual lists
Loading files
Comparing the Extract Selected Records tooling
Loading large data files

MarcEdit 7 continues growing
Near-term planned additions:
Completion of the updated MarcEdit Mac 3.0 upgrade (to include the new functionality)
New plugins for individual record creation templates
Support for HDT and linked data fragments (this is awesome stuff)
Additional clustering algorithms

Clustering in MarcEdit 7
How people clustered MARC data in the past:
1. Export the fields considered for investigation into a tab-delimited format
2. Import into OpenRefine
3. Cluster the data
4. Make edits
5. Export the delimited data out of OpenRefine
6. Develop a process to merge the changed data back into MARC
If you need to have your data start or end up in MARC, working with OpenRefine can be challenging because there isn't a natural process to move between these two formats.

Clustering in MarcEdit 7
MarcEdit's built-in clustering tools support native grouping and batch editing, and work well on files of a million records or fewer (they can work on larger sets, but the larger the file, the longer the cluster operation takes).

Clustering Options
Clustering algorithms:
Levenshtein Distance
This algorithm is best for people, places, and subjects
This algorithm builds clusters based on the number of positions/characters that differ between words or phrases
This algorithm is generally faster
Composite Coefficient
This algorithm is best for highly variable data where a great deal of fuzziness is desired
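To make the Levenshtein option concrete, here is a minimal Python sketch of edit-distance clustering. It is not MarcEdit's implementation: the greedy grouping strategy, the `max_distance` threshold, and the sample headings are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster(values, max_distance=2):
    """Greedy grouping (an assumption, not MarcEdit's method): a value joins
    the first cluster whose first member is within max_distance edits."""
    clusters = []
    for value in values:
        key = value.casefold().strip()
        for group in clusters:
            if levenshtein(key, group[0].casefold().strip()) <= max_distance:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# Hypothetical heading strings pulled from a set of records
headings = ["Twain, Mark", "Twain, Marc", "Twian, Mark", "Austen, Jane"]
for group in cluster(headings):
    print(group)
```

A small threshold keeps near-identical spellings together while leaving unrelated headings in their own groups; raising it trades precision for fuzziness.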

Clustering Changes
Clustered changes are queued and stacked. Changes happen once all edits have been set.
Clustered changes can be made by group, across groups, or on selected items within a group.

Clustering Enhancements
Things I'm thinking about:
Enabling clustering support to be run on non-MARC data
I'd like to hear your ideas as well

XML Conversions

MarcEdit: crosswalking design
MarcEdit model: So long as a schema has been mapped to MARCXML, any metadata combination can be utilized. This means that no more than two transformations will ever take place.
Example: MODS -> MARCXML -> EAD
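The hub design can be sketched in a few lines of Python. This is only an illustration of the "at most two transformations" idea: the lambda functions are hypothetical stand-ins for real XSLT stylesheets, and the registry names `TO_HUB` and `FROM_HUB` are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

Transform = Callable[[str], str]
HUB = "MARCXML"

# Hypothetical stand-ins for XSLT stylesheets: each schema needs only one
# mapping into the hub format and one mapping out of it.
TO_HUB: Dict[str, Transform] = {
    "MODS": lambda doc: f"marcxml({doc})",
    "EAD": lambda doc: f"marcxml({doc})",
}
FROM_HUB: Dict[str, Transform] = {
    "MODS": lambda doc: f"mods({doc})",
    "EAD": lambda doc: f"ead({doc})",
}


def crosswalk(doc: str, source: str, target: str) -> Tuple[str, List[str]]:
    """Convert doc from source to target, passing through the hub format.
    Returns the converted document and the list of steps taken."""
    steps: List[str] = []
    if source != HUB:
        doc = TO_HUB[source](doc)
        steps.append(f"{source}->{HUB}")
    if target != HUB:
        doc = FROM_HUB[target](doc)
        steps.append(f"{HUB}->{target}")
    return doc, steps


result, steps = crosswalk("record", "MODS", "EAD")
print(steps)
```

With N schemas, this layout needs 2N mappings rather than one mapping for every pair of schemas.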

[Figure: MarcEdit crosswalking model, showing EAD, Dublin Core, FGDC, MODS, and MARC around MARC21XML.]

MarcEdit: Crosswalks for everyone
What's MarcEdit doing? It facilitates the crosswalk by:
1. Performing character translations (between MARC8 and UTF8)
2. Facilitating interaction between binary and XML formats

Setting up Crosswalks

XML Function Wizard
The wizard was created to fill a gap: it enables metadata crosswalking when a user doesn't have a lot of expertise building XSLT or XQuery transformations.

OAI Harvesting
MarcEdit's OAI Harvester can run in two modes:
User initiated
Scheduled
Let's look at both!

OAI Harvesting: User Initiated
Harvesting supports the following verbs:
GetRecord
ListRecords
ResumptionToken
Any metadataPrefix can be accommodated, but by default, the tool has XSLT crosswalks for:
MARCXML
OAIMARC
Dublin Core
MODS
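For readers unfamiliar with the protocol, the requests behind these verbs are plain HTTP GETs with query parameters. The Python sketch below only builds the request URLs (it makes no network calls); the base URL and record identifier are hypothetical, and the helper names are invented for illustration rather than taken from MarcEdit.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None,
                     set_spec: Optional[str] = None) -> str:
    """Build a ListRecords request. In OAI-PMH, resumptionToken is an
    exclusive argument: when present, it is sent with the verb alone."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def get_record_url(base_url: str, identifier: str,
                   metadata_prefix: str = "oai_dc") -> str:
    """Build a GetRecord request for a single item."""
    params = {"verb": "GetRecord", "identifier": identifier,
              "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


BASE = "https://example.org/oai"  # hypothetical repository endpoint
print(list_records_url(BASE))
print(list_records_url(BASE, resumption_token="abc123"))
print(get_record_url(BASE, "oai:example.org:1"))
```

A harvester repeats the ListRecords call, passing back each resumptionToken the server returns, until a response arrives without one.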

OAI Harvesting: Scheduled
Using the scheduler on Windows, cron on Linux, or whatever the equivalent is on macOS, you can create harvesting jobs and schedule them for regular harvest.

Working with Linked Data In MarcEdit

Objects not strings
Probably the biggest reason people talk about linked data is the notion of moving from strings to objects.
[Two slides illustrate the contrast: first Strings, then Objects.]

Objects Not Strings
URIs provide actionable data
Controlled terms can be updated without user intervention (generally)
And URIs can provide access to more information; e.g., a URI to VIAF provides access not just to author information, but to all their related works and collaborators as well.

So why aren't we doing this already?
Great question! We aren't ready.

So why aren't we doing this already?
1. Changing strings to objects is hard and expensive
2. We have some folks, like OCLC, that could be in a position to help us, but our current systems are not set up to use (and in some cases, store) the data
3. Many of our controlled vocabularies are not designed to support reconciliation work, and those that are aren't production ready, or are proprietary

So what can we do right now? A lot.
Many of the large national services are making resources and infrastructure available to enable libraries to begin doing this work
OCLC has been largely supportive, and provides their own tools that output linked data content
We can start lobbying our systems to not just store the data, but make use of the information when provided
We can start the reconciliation process (because this process takes time)

MarcEdit and Linked Data
MarcEdit 6 and 7 include a linked data platform. This is an integration platform that enables MarcEdit to work with various linked data services, and provides a way to build new services around this functionality.
It is designed to support RDF, JSON-LD, SPARQL, and a wide range of library-specific services currently providing one-off access to controlled data.
The framework has been utilized in MarcEdit for the development of a toolset called MARCNext.

MARCNext
These are experimental services that allow catalogers to play with their data and visualize it through the BIBFRAME lens, as well as begin the process of turning strings into objects.

Linked Data Tool
The linked data tool enables reconciliation services
Works from a rules file, which enables users to customize the output provided
MarcEdit 7 provides a rules file optimized for MARC21, but I have rules files being tested for a number of MARC formats (including UNIMARC)
Currently supports the insertion of $0 and $1 into bibliographic and authority data
Includes support for ~25 remote linked data endpoints
Can use local RDF files as locally mounted SPARQL stores
Allows for targeted or automatic processing

Does this currently scale?
I get asked this question because the Library of Congress actively throttles data requests made against their service. So too do many other service providers. They have to; it's a method of self-preservation. When I test-reconciled Ohio State's entire database (~6 million records), I estimated that I would end up making, on the low end, 48,000,000 requests just to the Library of Congress. Over a very short period, that's a lot of requests, and it can overwhelm their services.
However, I work closely with many of the large data providers, and they give MarcEdit some leeway because:
MarcEdit follows some established patterns. In LC's case, they can provide an HTTP status code that lets the application know that their service is under load, and MarcEdit will start slowing down requests for a specified time period.
MarcEdit does its own internal caching, so an item is only retrieved once per reconciliation session. Using this method, I can likely cut the number of requests to a service like LC by a third or more. In fact, the more data that's processed, the faster it goes and the fewer requests it has to make to the source vocabulary.
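The two courtesies described here, backing off when the service signals load and caching results for the session, can be sketched as follows. This is an illustrative Python pattern, not MarcEdit's code: the class name, the choice of 429 and 503 as "busy" status codes, the pause length, and the fake fetch function are all assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Assumption: the service signals load with one of these HTTP status codes.
BUSY_STATUSES = {429, 503}

Fetch = Callable[[str], Tuple[int, Optional[str]]]


class ReconciliationClient:
    """Session-level cache plus slow-down-on-busy behaviour."""

    def __init__(self, fetch: Fetch, sleep: Callable[[float], None] = time.sleep,
                 pause_seconds: float = 5.0, max_retries: int = 3) -> None:
        self._fetch = fetch          # performs the real lookup (injected)
        self._sleep = sleep
        self._pause = pause_seconds
        self._max_retries = max_retries
        self._cache: Dict[str, Optional[str]] = {}
        self.requests_made = 0

    def lookup(self, heading: str) -> Optional[str]:
        if heading in self._cache:   # each heading is fetched once per session
            return self._cache[heading]
        for attempt in range(self._max_retries + 1):
            self.requests_made += 1
            status, uri = self._fetch(heading)
            if status in BUSY_STATUSES:
                # Service is under load: wait longer on each retry.
                self._sleep(self._pause * (attempt + 1))
                continue
            self._cache[heading] = uri if status == 200 else None
            return self._cache[heading]
        return None                  # gave up; not cached, so it can be retried


def fake_fetch(heading: str) -> Tuple[int, Optional[str]]:
    """Stand-in for a real HTTP request; always succeeds."""
    return 200, "http://example.org/authorities/" + heading.replace(" ", "_")


client = ReconciliationClient(fake_fetch)
for h in ["Twain, Mark", "Austen, Jane", "Twain, Mark"]:
    client.lookup(h)
print(client.requests_made)  # 2: the repeated heading is served from the cache
```

Because headings repeat heavily across a catalog, the cache hit rate tends to rise as more records are processed, which is consistent with the observation that larger runs need proportionally fewer requests.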

We can build new services: using linked data tools for heading validation

Validate Headings
How it works:
Working directly with the U.S. Library of Congress, MarcEdit queries the NACO and SACO headings directly, returning information about URIs and variants/changes
MarcEdit then generates a report, automatically corrects headings (when possible), and can generate brief authority records or download the existing authority record

Questions
Again, I would like to hear from you. I've been working with members of the PCC task force looking at how we embed linked data into MARC records (and outside of MARC records), and I've been actively building these tools into MarcEdit (both for research and production). How would you like to work with linked data record sets in your library? What could MarcEdit do to make this easier for you?

MarcEdit Regular Expression Primer

MarcEdit Regular Expression Support
Functions that presently support regular expressions:
Delete Field
Edit Field
Copy Field
Swap Field
Build New Field
Validation
Extract/Delete Selected Records

Expression Scope
Deciding which function to use depends on the scope of data needing to be evaluated:
Add/Delete Field: regular expressions have access to the entire field (from the = to the end of line (EOL))
Edit Subfield: regular expressions have access from the subfield code to the end of the subfield
Edit Field: regular expressions have access to all subfield data, but *not* indicator data
Edit Indicators: access only to indicator data
Copy Field: regular expressions have access to indicator data plus all subfield data
Replace Function: regular expressions have access to all record data

Microsoft's Regular Expression language
Concepts:
Character escapes
Anchors
Character classes
Grouping
Quantifiers
Substitutions
Let's open Regular Expression Language Quick Reference.html or https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

How we use Regular Expressions in MarcEdit
The most important parts of the regular expression language are:
1. Character escapes: \d \r \n \$ \x##
2. Character classes: [] and [^]
3. Grouping elements: ()
4. Anchors: ^ $
5. Quantifiers: * ? + {#}
6. Substitutions: $#

Examples
Looking at regex_example.mrk using the Replace function:
Add a period to the 500 if it is missing
Update the 300 to reflect electronic information
Split the 856 into two fields, breaking on the $u

Example 1
Add a period to the 500 if it is missing.
Find What: (=500..)(.*[^.]$)
Replace With: $1$2.
Explanation:
(=500..) Searches for the 500 field. The two periods stand for any character; they cover the two blank characters that are always part of the mnemonic format. If you want to search for exact indicators, place those values in the expression rather than wildcard periods.
(.*[^.]$) Takes any characters, and matches only a field whose last character isn't a period.
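To try the idea outside MarcEdit, the same find/replace can be approximated with Python's `re` module. This is a sketch, not MarcEdit's engine: MarcEdit uses .NET regular expressions, where substitutions are written `$1`, while Python writes them `\1`. The sample records are invented, and the pattern follows the explanation (the last character must not be a period).

```python
import re

# Invented sample data in mnemonic (.mrk) form; backslashes are blank indicators.
records = "\n".join([
    r"=500  \\$aIncludes index",
    r"=500  \\$aTitle from cover.",
    r"=650  \0$aCats",
])

# ^ and $ anchor each line because of re.MULTILINE.
pattern = re.compile(r"^(=500..)(.*[^.])$", re.MULTILINE)

# \1 and \2 are Python's equivalents of $1 and $2.
fixed = pattern.sub(r"\1\2.", records)
print(fixed)
```

Only the first 500 field gains a period; the second already ends in one, and the 650 never matches the tag.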

Example 2
Add online resource information to the 300 field.
Example:
Change: 300 \\$a 32 p.
To: 300 \\$a1 online resource (32 p.)
Explanation:
(=300..) Searches for the 300 field. The two periods stand for any character; they cover the two blank characters that are always part of the mnemonic format. If you want to search for exact indicators, place those values in the expression rather than wildcard periods.
(?<one>\$a)([^$]*) Captures the $a, and then all data in the subfield until you get to the next subfield (if there is one).
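The slide gives only fragments of the expression, so the complete pattern and replacement below are assumptions assembled for illustration in Python. Note two syntax differences from the .NET engine MarcEdit uses: Python writes named groups as `(?P<one>...)` rather than `(?<one>...)`, and refers to them in the replacement as `\g<one>`.

```python
import re

# Invented sample field; backslashes are blank indicators in mnemonic format.
field = r"=300  \\$a32 p."

# Assumed full pattern: tag plus 4 characters (two spaces and two indicators),
# then the $a code as a named group, then the subfield data up to the next "$".
pattern = re.compile(r"^(=300.{4})(?P<one>\$a)([^$]*)")

# Rebuild the field, wrapping the original extent in parentheses.
updated = pattern.sub(r"\1\g<one>1 online resource (\3)", field)
print(updated)
```

Because `[^$]*` stops at the next subfield delimiter, any subfields after $a are left in place after the inserted closing parenthesis.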

Example 3
Split the 856 into two fields, breaking on the $u.
Find What: (=856.{4})(\$u.*[^$])(\$u.*)
(=856.{4}) Matches the 856 field
(\$u.*[^$]) Matches $u, but stops at the end of the subfield
(\$u.*) Matches the remainder of the field
Replace With: $1$2\n=856  41$3
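The split can be tried in Python with the same find pattern. This is a sketch under assumptions: the URLs are invented, the new field is written with two spaces before the indicators as the mnemonic format expects, and Python's `\1` stands in for .NET's `$1`.

```python
import re

# Invented sample: one 856 field carrying two $u subfields.
field = r"=856  41$uhttp://a.example/1$uhttp://b.example/2"

# Same three groups as the slide: tag and indicators, first $u, remaining $u.
pattern = re.compile(r"(=856.{4})(\$u.*[^$])(\$u.*)")

# re.sub interprets \n in the replacement template as a real newline.
split_fields = pattern.sub(r"\1\2\n=856  41\3", field)
print(split_fields)
```

Because `.*` is greedy, the break happens at the last `$u` in the field, so a field with three or more URLs would need the replacement applied repeatedly.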

lcase/ucase
MarcEdit's regular expression engine includes two extension functions for switching the case of characters: lcase and ucase.
Usage:
Find What: (=450.{4})(\$a.)(.*)
Replace With: $1$2lcase($3)
Example: Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.

Example (lcase)
Find the 500 with all upper case characters and convert the case of all values but the first letter in the sentence to lower case.
Find What: (=500.{4})(\$a.)([A-Z.]*)
Replace With: $1$2lcase($3)
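lcase() is a MarcEdit extension, so other regex engines have no direct equivalent in the replacement string. A rough Python analogue uses a replacement function instead. The sketch below makes two assumptions beyond the slide: a space is added to the character class so multi-word notes match, and the pattern is anchored so only fully upper-case fields are touched.

```python
import re

# Invented sample data in mnemonic form.
records = "\n".join([
    r"=500  \\$aINCLUDES INDEX.",
    r"=500  \\$aTitle from cover.",
])

# Group 2 keeps the subfield code and the first letter; group 3 is the rest.
pattern = re.compile(r"^(=500.{4})(\$a.)([A-Z .]*)$", re.MULTILINE)


def lcase_rest(match: re.Match) -> str:
    """Python stand-in for $1$2lcase($3)."""
    return match.group(1) + match.group(2) + match.group(3).lower()


converted = pattern.sub(lcase_rest, records)
print(converted)
```

The mixed-case field is left alone because its lower-case letters fall outside the character class, so the anchored pattern cannot match it.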

Multi-Field Replacements
By default, MarcEdit handles one field at a time when doing regular expressions. However, when you need to do evaluations against multiple fields, you can do so by adding /m to the end of your replacement in the Replace function in the MarcEditor.
This is a special function added to the MarcEdit regular expression engine.

Example
Using regex_example.mrk: changing videodisc to Blu-ray in the 300 if the 538 is marked as Blu-ray.
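The logic of a multi-field edit, where one field's content decides whether another field changes, can be illustrated in Python by treating each record as a block of lines. This is not how MarcEdit's /m option is implemented; the sample records, the wording "Blu-ray disc", and the helper name `update_record` are assumptions.

```python
import re


def update_record(record: str) -> str:
    """Rewrite the 300 only when a 538 in the same record mentions Blu-ray."""
    has_bluray = re.search(r"^=538.*Blu-ray", record,
                           flags=re.MULTILINE | re.IGNORECASE)
    if not has_bluray:
        return record
    return re.sub(r"^(=300.*?)videodisc", r"\1Blu-ray disc", record,
                  flags=re.MULTILINE)


# Two invented records in mnemonic form, separated by a blank line.
mrk = "\n\n".join([
    "\n".join([r"=300  \\$a1 videodisc (120 min.)", r"=538  \\$aBlu-ray."]),
    "\n".join([r"=300  \\$a1 videodisc (95 min.)", r"=538  \\$aDVD."]),
])

updated = "\n\n".join(update_record(r) for r in mrk.split("\n\n"))
print(updated)
```

Only the first record's 300 changes, since the second record's 538 describes a DVD.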

Multi-Line Example

Placeholder
Are there specific editing tasks that folks are interested in? We can talk about these now.

Questions

Integrations

ILS Integration
ILS integration currently supports direct integration with Koha, Alma, and a local option.
Are other integrations possible? http://blog.reeset.net/archives/2133

Let's talk about Alma integration

How MarcEdit Works with Alma
MarcEdit works through the following API endpoints: https://developers.exlibrisgroup.com/alma/apis/bibs
Because the API is rate limited (i.e., you can only process so many transactions concurrently through the API, and all Alma operations use the API), MarcEdit limits API processes to a single thread. It takes a little longer, but eliminates the possibility that using MarcEdit to automate workflows will bring down your system because the tool is trying to communicate with the system too quickly.
Through this API, MarcEdit can:
Edit holdings data (and holdings records)
Create and update bibliographic data
Extract records, though discovery should be done via Z39.50 or SRU (which is preferred)

Working with OCLC Connexion https://youtu.be/a7cen0gxfcw?list=plrhrsj91nvfscjls91swr5awtffpewmwg

Working with OCLC's Metadata API
MarcEdit can work directly with WorldCat via the Metadata API.

MarcEdit: Batch WorldCat Holdings Management

MarcEdit: Batch Bibliographic Record Upload

More Information
OCLC's Developer Network: http://oclc.org/developer/
OCLC Metadata API documentation: http://oclc.org/developer/services/worldcat-metadata-api
Notes on MarcEdit integration: http://blog.reeset.net/archives/1245
C# OCLC API library: https://github.com/reeset/oclc_api