The Case of the 35 Gigabyte Digital Record: OCR and Digital Workflows

Similar documents
LIMB Processing Release Notes

How to use TRANSKRIBUS a very first manual

GUIDELINES FOR CREATION AND PRESERVATION OF DIGITAL FILES

How to Build a Digital Library

Compound or complex object: a set of files with a hierarchical relationship, associated with a single descriptive metadata record.

Elba Project. Procedures and general norms used in the edition of the electronic book and in its storage in the digital library

Building for the Future

ARTSYL DOCALPHA INSTALLATION GUIDE

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION - QUALITATIVE

Smarter Document Capture

RLG Model Request for Information (RFI) for Digital Imaging Services

Practical Experiences with Ingesting Materials for Long-Term Preservation

SNHU Academic Archive Policies

FineReader Engine Overview & New Features in V10

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION: QUALITATIVE

ICH M8 Expert Working Group. Specification for Submission Formats for ectd v1.1

Scanshare Sales Guide V1.2

International Implementation of Digital Library Software/Platforms 2009 ASIS&T Annual Meeting Vancouver, November 2009

2010 by Microtek International, Inc. All rights reserved.

DjVu Technology Primer

Sustainable File Formats for Electronic Records A Guide for Government Agencies

color bit depth dithered

Text below in italics is directly from the LAMP Digitization Project Principles (

docalpha Installation Guide

Website software. Assessment Criteria The learner can:

ivina BulletScan Manager

Best practices for producing high quality PDF files

SOFTWARE AND MULTIMEDIA. Chapter 6 Created by S. Cox

Adlib PDF Release Notes

HP Records Manager. Kofax Capture Template. Software Version: 8.1. Document Release Date: August 2014

Mitigating Preservation Threats: Standards and Practices in the National Digital Newspaper Program

IMPLICIT RELIGION Guidelines for Contributors March 2007

AVS4YOU Programs Help

Lessons Learned. Implementing Rosetta in the Harold B. Lee Library

How to make a PDF from inside Acrobat

Bits and Bit Patterns

The Swedish National Archives digital preservation. Mats Berggren, IT-department,

Data Storage. Slides derived from those available on the web site of the book: Computer Science: An Overview, 11 th Edition, by J.

Improved automatic restart and failed job recovery 64-bit support for improved memory utilisation

Digital The Harold B. Lee Library

Kurzweil Automater- Mass Conversion of Files into K3000

Advanced Topics in Curricular Accessibility: Strategies for Math and Science Accessibility

NDNP: The Kentucky Edition

ID:webArchive User Manual

File Magic 5 Series. The power to share information PRODUCT OVERVIEW. Revised June 2003

Rockaway Township Library Scanner Instructions Flatbed Scanner: Canon Canoscan 8800

Preserving & Digitizing the Kent Tribune Newspaper. Sandy Halem, Kent Historical Society Jenni Salamon, Ohio History Connection

Web graphics. Introduction

Author Guide. Thieme Medical Publishers Inc. Editorial Department 333 Seventh Avenue New York, New York Important Notes:

Contact Us. ZySCAN Manual. For full contact details, visit the ZyLAB website -

Example 1: Denary = 1. Answer: Binary = (1 * 1) = 1. Example 2: Denary = 3. Answer: Binary = (1 * 1) + (2 * 1) = 3

SobekCM METS Editor Application Guide for Version 1.0.1

Stelae Technologies ... Data Conversion a Walkthrough. «Extracting the Intelligence from Content»

Institutional Repository using DSpace. Yatrik Patel Scientist D (CS)

1.1. Various Stages involved in Digitisation Process

1.6 Graphics Packages

PDF Specification for IEEE Xplore (Part A-Core Requirements)

Alternate Format for STEM

Web-based workflow software to support book digitization and dissemination. The Mounting Books project

Introduction to Omeka.net

Archiving Full Resolution Images

DOCUMENT NAVIGATOR SALES GUIDE ADD NAME. KONICA MINOLTA Document Navigator Sales Guide

Summary of Bird and Simons Best Practices

Digital Libraries on a Shoestring. Walter Nelson RAND Corporation walternelson.com

1 Installing the ZENworks Content Reporting Package

Essential Graphics/Design Concepts for Non-Designers

Tips for Digital File Creation

The DMS provides a web browser, a desktop client and a mobile browser as standard features.

DupScout DUPLICATE FILES FINDER

Version 4.1 September 2017

DAITSS Demo Virtual Machine Quick Start Guide

Chapter 5: The DAITSS Archiving Process

MANAGING DIGITAL PHOTOS AND MEDIA ASSETS: POTENTIAL SOFTWARE SOLUTIONS

Image coding and compression

Digital Preservation Efforts at UNLV Libraries

Java Image Processing Survival Guide. Siegfried Goeschl & Harald Kuhr

City of Bartlett. Request for Information. Document Management Solution

Scan to PC Desktop Professional v9 vs. Scan to PC Desktop SE v9 + SE

Hello, I m Melanie Feltner-Reichert, director of Digital Library Initiatives at the University of Tennessee. My colleague. Linda Phillips, is going

ithenticate User Guide Getting Started Folders Managing your Documents The Similarity Report Settings Account Information

Introduction to Archivists Toolkit Version (update 5)

The Functional Extension Parser (FEP) A Document Understanding Platform

Chapter 1. Data Storage Pearson Addison-Wesley. All rights reserved

This presentation is on issues that span most every digitization project.

Atari Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

Norcom. e-fileplan Electronic Cabinet System

UWDCC Case Study. Supplementing DPOE Workshop 6 June Steven Dast Digital Asset Librarian UW Digital Collections Center

Software for Digital Library

FDA Affiliate s Guide to the FDA User Interface

The Journal of Insect Science

ScholarOne Manuscripts. Author File Upload Guide

FDA Portable Document Format (PDF) Specifications

QuickScan Pro Release Notes. Contents. Version 4.5

****** Release Note for Image Capture Plus ***** Copyright(C) , Panasonic Corporation All rights reserved.

**** Digitization. Pictures are important

Building Collections Using Greenstone

Maennerchor Project Digital Collection

User s Guide [Scan Operations]

DIGITAL RECORDS MANAGEMENT GUIDELINES

IRIScan Executive 2. Install the IRIScan. 2 - Quick Start Guide. For Win 2000 XP users. For Vista users

Transcription:

Florida International University FIU Digital Commons Works of the FIU Libraries FIU Libraries 8-14-2015 The Case of the 35 Gigabyte Digital Record: OCR and Digital Workflows Kelley F. Rowan Florida International University, krowan@fiu.edu Follow this and additional works at: http://digitalcommons.fiu.edu/glworks Part of the Archival Science Commons, Cataloging and Metadata Commons, and the Scholarly Publishing Commons Recommended Citation Rowan, Kelley F., "The Case of the 35 Gigabyte Digital Record: OCR and Digital Workflows" (2015). Works of the FIU Libraries. 31. http://digitalcommons.fiu.edu/glworks/31 This work is brought to you for free and open access by the FIU Libraries at FIU Digital Commons. It has been accepted for inclusion in Works of the FIU Libraries by an authorized administrator of FIU Digital Commons. For more information, please contact dcc@fiu.edu.

OCR and Digital Workflows Kelley Rowan, Digital Archives Librarian Florida International University August 14, 2015

1. Digital Collections Center Workflow 2. OCR the Way it Should Work 3. Case Study how to Make it Work Overview

3 Librarians + Assistant Director 3 part time staff + 1 shared time cataloging librarian Volunteers Digital Collections Center

Photographs, images, maps, Ephemera Microfilm Books, journals, directories Audio, video Materials we digitize

Digital repositories

Scanning ATIZ, flatbeds Processing Photoshop Record creation METS editor Dissemination dpanther, Digital Commons Preservation - FDA Basic Workflow

Scanning

Scanning

Academic Imaging Services 2,800+ pages/hr 100 pages 3-5 GB Scanning

True to the original Cropping Deskewing Image: http://www.codeproject.com/articles/13615/how-to-deskew-an-image Processing

Deskew Manual or auto zoning Color & grayscale 14 languages Input: TIFF, PDF, JPEG, PCX, PDA Output: PDF, HTML, RTF, ASCII, XML 8 errors/2000 characters (v. 40/2000) 6 engines (v. 3 engines) Template & Job Wizard Built-in JBIG lossless image compression means smaller files Runs unattended http://www.primerecognition.com/prime_ocr.htm Processing

Positives: Side by side viewing More than 1 language at a time on a page (over 200 languages) Label images User friendly Automatically detects images, text, charts, graphs, columns Negatives: Less accurate (only 1 OCR engine) Slower Processing

METS an encoding and transmission standard for descriptive, administrative, and structural metadata regarding objects within a digital library. (http://www.loc.gov/standards/mets/) MODS - a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. (http://www.loc.gov/standards/mods/) XML - a markup language that defines a set of rules for encoding documents in a format which is both humanreadable and machine-readable. Record Creation, Dissemination, & Preservation

Or how to turn a 30 minute project into a 5 month project

A recent text with easily recognizable font All font over 6 point All images scanned at 300 dpi 50% brightness level Original scans that are very straight For older or discolored texts, scan in RGB mode to capture all the image data A document that is either all images or all text All text aligned in the same direction A reasonably sized document http://www.library.illinois.edu/dcc/bestpractices/chapter_05_ocr.html

Unprocessed TIFFS 5 very lengthy directories 1,500-1,838 Pages out of order Old typeface Thin pages Faded type Colored pages Text in multiple directions Images & text Inconsistent dpi in scanning (72-300)

Rebecca Bakker - OCR expert! Martin Kass - Thousands of pages processed in photoshop!

could not read file TIFFs & JPEGs Time elapsed since scanning Image: http://www.primerecognition.com/prime_ocr.htm Several people involved in the process

Not enough room on server, moved to desktop OCR highly inaccurate Broken into even 4-8 sections

75 64 426 234 74 505 394 66 Divided by category (page color)

17,000 KB = TIFF image (19.8 GB 35.2 GB = directory) 300 KB after compression (450 MB 540 MB = directory) Back to Photoshop! File > Scripts > Image Processor LZW = universal lossless data compression algorithm END RESULT: 70 KB 240+ KB = JPEG (360 MB 640 MB) = directory

Positives A MUCH smaller file Negatives Loss of quality in images

Before Photoshop filter After

Photoshop curves brightness & contrast reduces file size

Photoshop adding layers increases file size

First pages meant to represent collection. Full resolution kept in order to maintain the original feel of the document. Pages with images converted to greyscale where possible

Each OCR engine includes some level of lexical review to see how recognized characters fit into word context. This may improve recognition results. Directory = abbreviations, proper nouns Each page took longer and longer: 1 minute a page

http://dpanther.fiu.edu/

Digital Collections Center, FIU Libraries Florida International University