Million Book Universal Library Project: Manual for Metadata Capture, Digitization, and OCR


Gabrielle V. Michalek, editor. Carnegie Mellon University. May 7, 2003

Table of Contents
Data Production
Getting MARC Records From OCLC
Creating Metadata Using Dublin Core
Minolta PS 7000 QuickScan Software Instructions
ABBYY FineReader 6.0 Instructions

Data Production

Bitonal images with a pixel depth of 1 bit-per-pixel were scanned at a resolution of 600 dots per inch (dpi).
Images stored as "Intel" TIFF (Tagged Image File Format) files, with the header content specified.
The compression algorithm used is ITU (formerly CCITT) Group 4.
TIFF version 5.0 is acceptable. Subject to testing, version 6.0 (or later) may also be acceptable.
Initial-capture system includes dynamic thresholding or a similar feature to capture variability of darkness in the imprint and possibly darker (e.g., foxed) backgrounds from decay. Images should be as readable as the original pages.
"Typical" or "expected" data to be provided for most TIFF tags (normally, the data supplied by software default settings).
A specification for the TIFF header to be produced to include scanner technical information, filename, and other data, but to be in no way a burden on the production service.
Images written in sequential order, with corresponding 8.3 file names, e.g., 00000001.tif as first image in volume sequence and 00000341.tif as 341st image in volume sequence.
Volumes to be provided to Million Book Project by libraries with unique identifiers that conform to 8.3 format; images should be in directories named with corresponding identifier (e.g., akf3435.001 as identifier for volume will result in directory with same name, and 00000001.tif through 0000000N.tif within that directory).
Images and directories (as specified above) to be written by Million Book Project to gold CD-ROM meeting agreed upon specifications, and using ISO9660 format.
Skew to be within a specified range of degrees allowed.

Excerpt from NCF Million Book Proposal
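The naming rules above can be sketched in a few lines of Python. This is only an illustration, not part of the project's tooling: the function names are hypothetical, and the set of characters allowed in an 8.3 name is an assumption, since the specification does not list them. The "Intel" check relies on the fact that a little-endian TIFF file begins with the bytes "II" followed by the number 42.

```python
import re

# Assumption: an 8.3 name is 1-8 letters or digits, a dot, then 1-3 letters
# or digits. The specification above does not list the allowed characters.
_EIGHT_DOT_THREE = re.compile(r"[A-Za-z0-9]{1,8}\.[A-Za-z0-9]{1,3}")


def page_filename(sequence_number):
    """Return the 8.3 TIFF name for the Nth image in a volume (1-based)."""
    if not 1 <= sequence_number <= 99_999_999:
        raise ValueError("sequence number must fit in eight digits")
    return f"{sequence_number:08d}.tif"


def is_8_3_name(name):
    """True if a volume identifier or file name fits the 8.3 pattern."""
    return _EIGHT_DOT_THREE.fullmatch(name) is not None


def is_intel_tiff(header):
    """True if the first four bytes mark a little-endian ("Intel") TIFF."""
    return header[:4] == b"II*\x00"
```

For example, page_filename(341) gives the name used for the 341st image in the specification, and is_8_3_name("akf3435.001") accepts the sample volume identifier.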

Getting MARC Records From OCLC

You receive MARC data from a company called OCLC. OCLC maintains an international database of library holdings. The OCLC product you use to get the MARC record is called Connexion. It can be accessed at: http://connexion.oclc.org/

You will use OCLC's Connexion product to search the OCLC database to determine whether a book has already been catalogued and to export a MARC binary record. To do this, go to the Connexion URL listed above and click on the logon icon. The Authorization number is 110-250-490 and the Password is BOOKS.

Select the General Tab
Select the Admin Tab
Select Export Options
Select the MARC option
Select the Export to File option
That is all you need to fill out in this section.

Next, go to the Cataloging Tab
Go to the Search menu and select WorldCat
You will be presented with a search interface that allows you to search for materials by title, author, etc.
Perform your search

Once you receive a hit on a search, display the full record that you believe matches the item you are looking for, and compare the record to the item in hand to confirm that they match. Once the record is displayed and you are sure they match:

Go to the View menu
Select MARC Text Area
Go to the Action menu
Select Export Record in MARC
Save the record in the metadata file created for each book. Add the extension .mrc to each file. This will create a MARC binary record.

For materials not already cataloged, or materials that cannot be located in OCLC, you should create a Dublin Core record.
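After export, a quick sanity check can catch a file that was saved as text rather than as a MARC binary record. The sketch below is an illustration only and assumes one record per .mrc file; the function name is hypothetical. It relies on two properties of the MARC exchange format: the first five bytes of a record give its total length as decimal digits, and the record ends with the record terminator byte (hexadecimal 1D).

```python
RECORD_TERMINATOR = b"\x1d"  # last byte of every MARC binary record
MINIMUM_LENGTH = 26          # 24-byte leader + field terminator + record terminator


def looks_like_marc(data):
    """Rough check that `data` holds exactly one MARC binary record.

    Assumption: the export wrote a single record to the file.
    """
    if len(data) < MINIMUM_LENGTH:
        return False
    length_field = data[:5]
    if not length_field.isdigit():
        return False
    # The declared record length must match the number of bytes on disk.
    if int(length_field) != len(data):
        return False
    return data.endswith(RECORD_TERMINATOR)
```

A typical use would be to read the saved file in binary mode, for example with open(path, "rb"), and pass its contents to looks_like_marc before filing the record with the book.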

Creating Metadata Using Dublin Core

Materials that have not been catalogued should be catalogued using Dublin Core. Dublin Core is a much simpler element set than MARC. Dublin Core fields represent the lowest common denominator for cataloging any type of library holdings. To read more about Dublin Core go to: http://dublincore.org/documents/dces/

There is a Dublin Core template that will produce an HTML output of a Dublin Core record. It can be accessed here: http://www.lub.lu.se/cgi-bin/nmdc.pl

Dublin Core Metadata Element Set, Version 1.1

The definitions provided here include both the conceptual and representational form of the Dublin Core elements. The Definition attribute captures the semantic concept and the Datatype and Comment attributes capture the data representation. Each Dublin Core definition refers to the resource being described. A resource is defined in [RFC2396] as "anything that has identity". For the purposes of Dublin Core metadata, a resource will typically be an information or service resource, but may be applied more broadly.

Element: Title
Name: Title
Identifier: Title
Definition: A name given to the resource.
Comment: Typically, a Title will be a name by which the resource is formally known.

Element: Creator
Name: Creator
Identifier: Creator
Definition: An entity primarily responsible for making the content of the resource.
Comment: Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.

Element: Subject
Name: Subject and Keywords
Identifier: Subject
Definition: The topic of the content of the resource.
Comment: Typically, a Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.

Element: Description
Name: Description
Identifier: Description
Definition: An account of the content of the resource.
Comment: Description may include but is not limited to: an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content.

Element: Publisher
Name: Publisher
Identifier: Publisher
Definition: An entity responsible for making the resource available.
Comment: Examples of a Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity.

Element: Contributor
Name: Contributor
Identifier: Contributor
Definition: An entity responsible for making contributions to the content of the resource.
Comment: Examples of a Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.

Element: Date
Name: Date
Identifier: Date
Definition: A date associated with an event in the life cycle of the resource.
Comment: Typically, Date will be associated with the creation or availability of the resource. Recommended best practice for encoding the date value is defined in a profile of ISO 8601 [W3CDTF] and follows the YYYY-MM-DD format.

Element: Type
Name: Resource Type
Identifier: Type
Definition: The nature or genre of the content of the resource.
Comment: Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the working draft list of Dublin Core Types [DCT1]). To describe the physical or digital manifestation of the resource, use the FORMAT element.

Element: Format
Name: Format
Identifier: Format
Definition: The physical or digital manifestation of the resource.
Comment: Typically, Format may include the media-type or dimensions of the resource. Format may be used to determine the software, hardware or other equipment needed to display or operate the resource. Examples of dimensions include size and duration. Recommended best practice is to select a value from a controlled vocabulary (for example, the list of Internet Media Types [MIME] defining computer media formats).

Element: Identifier
Name: Resource Identifier
Identifier: Identifier
Definition: An unambiguous reference to the resource within a given context.
Comment: Recommended best practice is to identify the resource by means of a string or number conforming to a formal identification system. Example formal identification systems include the Uniform Resource Identifier (URI) (including the Uniform Resource Locator (URL)), the Digital Object Identifier (DOI) and the International Standard Book Number (ISBN).

Element: Source
Name: Source
Identifier: Source
Definition: A reference to a resource from which the present resource is derived.
Comment: The present resource may be derived from the Source resource in whole or in part. Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system.

Element: Language
Name: Language
Identifier: Language
Definition: A language of the intellectual content of the resource.
Comment: Recommended best practice for the values of the Language element is defined by RFC 1766 [RFC1766], which includes a two-letter Language Code (taken from the ISO 639 standard [ISO639]), followed optionally by a two-letter Country Code (taken from the ISO 3166 standard [ISO3166]). For example, 'en' for English, 'fr' for French, or 'en-uk' for English used in the United Kingdom.

Element: Relation
Name: Relation
Identifier: Relation
Definition: A reference to a related resource.
Comment: Recommended best practice is to reference the resource by means of a string or number conforming to a formal identification system.

Element: Coverage
Name: Coverage
Identifier: Coverage
Definition: The extent or scope of the content of the resource.
Comment: Coverage will typically include spatial location (a place name or geographic coordinates), temporal period (a period label, date, or date range) or jurisdiction (such as a named administrative entity). Recommended best practice is to select a value from a controlled vocabulary (for example, the Thesaurus of Geographic Names [TGN]) and that, where appropriate, named places or time periods be used in preference to numeric identifiers such as sets of coordinates or date ranges.

Element: Rights
Name: Rights Management
Identifier: Rights
Definition: Information about rights held in and over the resource.
Comment: Typically, a Rights element will contain a rights management statement for the resource, or reference a service providing such information. Rights information often encompasses Intellectual Property Rights (IPR), Copyright, and various Property Rights. If the Rights element is absent, no assumptions can be made about the status of these and other rights with respect to the resource.
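The encoding recommendations for Date and Language lend themselves to a simple check before a record is saved. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, the date check accepts only the full YYYY-MM-DD form named above, and the language check tests only the shape of the value (two letters, optionally a hyphen and two more letters), not whether the codes actually appear in ISO 639 or ISO 3166.

```python
import re
from datetime import date

# Shape only: two-letter language code, optional two-letter country code.
_LANGUAGE = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")
_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_dc_date(value):
    """True if `value` is a real calendar date written as YYYY-MM-DD."""
    if _DATE_SHAPE.fullmatch(value) is None:
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2003-02-30
    except ValueError:
        return False
    return True


def is_valid_dc_language(value):
    """True if `value` has the form 'en' or 'en-uk' (case-insensitive)."""
    return _LANGUAGE.fullmatch(value) is not None
```

A value that fails either check is a prompt to look at the item again, not a reason to guess a replacement.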

Dublin Core Template

Complete as many fields as possible without guessing.

<Title>
<Creator>
<Subject>
<Description>
<Publisher>
<Contributor>
<Date>
<Type>
<Format>
<Identifier>
<Source>
<Language>
<Relation>
<Coverage>
<Rights>
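One possible way to fill in the template programmatically is sketched below. It is not the project's tool: the function name is hypothetical, and the closing tags (for example </Title>) are an assumption, because the template above shows only the opening field names. Fields that are empty are left out, in keeping with the instruction to complete fields without guessing.

```python
from xml.sax.saxutils import escape

# The fifteen template fields, in the order given above.
DC_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
]


def render_record(values):
    """Render the known fields of a Dublin Core record, one per line.

    `values` maps element names to text. Empty fields are skipped.
    The closing-tag syntax is an assumption, not taken from the manual.
    """
    unknown = set(values) - set(DC_ELEMENTS)
    if unknown:
        raise KeyError(f"not a Dublin Core element: {sorted(unknown)}")
    lines = []
    for element in DC_ELEMENTS:
        text = values.get(element, "").strip()
        if text:
            # escape() protects &, < and > inside the field text.
            lines.append(f"<{element}>{escape(text)}</{element}>")
    return "\n".join(lines)
```

Keeping the element list in one place means a misspelled field name is reported immediately instead of being silently dropped.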

Minolta PS 7000 QuickScan Scanning Instructions

QuickScan Software Instructions

1. Click on the QuickScan icon on the desktop.

2. Go to the Options menu and click on Scanner Settings. The Mode box should read Black and White. Change the Dots per inch setting to 600 dpi. Change the Page Size to accommodate the size of the book to be scanned. Change brightness from Manual to Automatic.

3. Next, click on the More button for the Special Features settings. Change Scan Mode to Split (Left page then Right page), or to whichever mode you wish to scan in. Make sure that the Frame Masking, Finger Masking and Centering options are selected. Click on the Center-Line Erase option and make sure to select Automatic Detection. Under Scan Method, at the bottom of the screen, make sure to select the Front Panel option. Click OK.

4. Go to the File menu, select the Scan Batch to File option, then Create New Batch.

5. The book to be scanned should already have a file created using the template. Go to the F:\ directory, or to the directory that contains the appropriate folder. In that folder you should find your book folder; click on it. Give your book a file name using the OCLC number or ISBN number. Under the Schema Activation option, choose Use Schema. (Warning: If you do not select this option, your files will not be saved properly.) Then, check the Warn on Overwrite option. Click OK.

5. Select Start Scanning.

6. At this point you can place the book on the scanner and scan the pages using the buttons on the front panel of the scanner.

7. Error Messages. If you get an error message that interrupts the scanning process, it is probably because the book shifted and the scanner needs to readjust. To continue scanning: Under the File menu go to Scan Batch to File and choose the Insert Pages option. A Prepare Scanner window will appear. Select the Start Scanning option. Most likely, you will get another error message, at which time you select OK. Repeat the above procedure. The scanner will work the second time. Thumbnail images of your pages will appear along the left-hand side of the screen. When you go to insert pages, make sure that the blinking cursor bar is to the right of the page that should come immediately before the pages you wish to insert.

8. To open a file that has been previously scanned, go to File and click on Open.

9. Select the name of the file you wish to open by clicking on it. A list of TIF files will appear under the file name. Click on the Select All button to your right in the Open Document window.

10. A list of all the TIF files will appear under Selected Files. Click OK.

11. The following screen will appear. To start scanning, select Start Scanning.

12. Inserting Pages. Go to File, Scan Batch to File, and select Insert Pages.

13. Make sure that your cursor is to the right of the page that you want to come before the inserted pages.

14. Go to File, Scan Batch to File, and select Insert Pages.

16. Click OK.

17. Select Start Scanning.

18. Deleting Pages. Put your cursor to the right of the image you wish to delete. Hold down the left mouse button and drag. The image should now be highlighted in black. Press the Delete key on your keyboard. The image should now be deleted.

19. Using ScandAll software to post-process. (This only works if you are using the ScandAll IP options.) When scanning is complete, you can post-process your scanned pages. Go to the IP menu and choose the Configure option.

20. Highlight the options you wish to select and click Add. Your choices should then be added to the Selected Filters section of the screen. To crop, double-click on Crop1 and select your margins. After you have chosen what you wish to correct, click OK. Go to the IP menu and choose the Run on Document option. Post-processing will now start.
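Because pages can be inserted and deleted during scanning, it may be worth confirming afterwards that the batch still forms an unbroken sequence. The Python sketch below is an illustration only: the function name is hypothetical, and it assumes the batch was saved with the eight-digit names described under Data Production (00000001.tif, 00000002.tif, and so on).

```python
import re

# Assumption: page files are named with eight digits plus ".tif".
_PAGE = re.compile(r"(\d{8})\.tif", re.IGNORECASE)


def missing_pages(filenames):
    """Return the sequence numbers absent between 1 and the highest page found.

    Names that do not look like page files are ignored.
    """
    found = set()
    for name in filenames:
        match = _PAGE.fullmatch(name)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    return [n for n in range(1, max(found) + 1) if n not in found]
```

The list of names can come from os.listdir on the book folder. An empty result means no gaps were found below the highest-numbered page; it cannot tell whether pages are missing from the end of the book.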

ABBYY FineReader 6.0 Instructions

1. Click on the ABBYY FineReader icon.

2. Select Tools, Options.

3. Select Formatting. Select Retain font and font size. Check Keep pictures. Click OK. Select Format Settings.

4. Select RTF/DOC. Check Keep line breaks. Check Retain text color. Click OK.

5. Select HTML. Check Keep line breaks. Check Retain text color. Click OK.

6. Select TXT. Check Keep line breaks. Check Use blank line as paragraph separator. Click OK. Click Close.

7. Select Process, then click on Open & Read.

8. When the Open window appears, select the file that you wish to OCR. In this example, the file was called Operation of Electron Microscope. Next, the OTIFF image file was selected.

9. Highlight all the TIF files you wish to OCR and select Open.

10. The following screen will then appear. The OCR'ed sections will appear to the right of the screen in the Text Page window.

11. To save your results, select Save.

11. After pressing the Save button, the Save Wizard screen will pop up. Choose Save to File in the Choose save type window. Select Retain font and font size. Select Keep pictures. Select All pages. Click OK.

12. Save the OCR'ed images to your file as Text Document. Select All pages. Select Name files as source images. Select Save.

13. To close the file, go to File and select Close Batch.
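Since the text files are saved with the Name files as source images option, each page image should end up with a text file of the same base name. A minimal Python sketch of that cross-check follows; the function name is hypothetical, and the assumption is that only the base name matters (for example 00000001.tif pairs with 00000001.txt), whatever extension FineReader writes.

```python
from pathlib import PurePath


def pages_without_text(image_names, text_names):
    """Return the image files that have no OCR output with the same base name.

    Assumption: OCR output keeps the source image's base name and differs
    only in its extension.
    """
    text_stems = {PurePath(name).stem.lower() for name in text_names}
    return sorted(
        name for name in image_names
        if PurePath(name).stem.lower() not in text_stems
    )
```

Any image name returned here is a page that was scanned but not saved as text, and is a candidate for running through Open & Read again.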
