
A System to Automatically Index Genealogical Microfilm Titleboards

Samuel James Pinson, Mark Pinson and William Barrett
Department of Computer Science, Brigham Young University

Introduction

Millions of rolls of microfilm contain valuable genealogical information, yet remain largely inaccessible. A titleboard (see Figure 1) contains semi-structured metadata about the genealogical records that follow it on the roll of microfilm. This metadata may include the geographical origin of the records (e.g., city and country), the record type (e.g., birth records, marriage records), and relevant dates. We propose a system to automatically segment, extract, index, and search digitized titleboards. Titleboards, the input to our system, pass through three modules (preprocessing, text recognition, and indexing) to produce an index over a digitized microfilm library. The index is accessed through a graphical user interface via Boolean and regular expressions. The research presented here focuses on a necessary step in the indexing process: method identification, the process of determining whether text is machine print or handwriting.

Figure 1: Typical microfilm titleboards. Titleboards preface a collection of frames on a roll of microfilm and describe the type of information to be found in the succeeding frames. Segmentation and extraction of the content of the titleboards provides information vital to the creation of a meaningful index for the frames that follow, greatly increasing user access to the information contained in those frames.

Preprocessing

The preprocessing stage performs a sequence of steps on grayscale images of microfilm titleboards, including noise removal and locally adaptive thresholding [1]. Connected components, or groups of adjacent black pixels, are then extracted from the binarized images.

Method Identification

Turk and Pentland published a face recognition technique called Eigenfaces [2].
We may consider an N×N image as a point in an N²-dimensional vector space. Eigenfaces relies on discovering "face space," the subspace that best represents the distribution of images of faces in the N²-dimensional space. Images of known faces, as well as images of faces to be recognized, are projected into face space. The minimum Euclidean distance from an unknown face to the known faces determines how the unknown face will be classified. If the minimum distance is above some threshold, then the face is rejected, i.e., not recognized.

The basis of face space consists of the most significant eigenvectors of the covariance matrix of a training set of face images. The covariance matrix is given by

    C = (1/M) Σ_{i=1}^{M} (Γ_i − Ψ)(Γ_i − Ψ)^T,

where Γ_i is the i-th training image and Ψ is the average of all the training images. The projection of a face, Γ, into face space is a vector of weights, one for each dimension of face space. The k-th component of the projection is

    ω_k = u_k^T (Γ − Ψ),

where u_k is the k-th most significant eigenvector of the covariance matrix C.

Muller and Herbst [3] employ this idea in character recognition: images of faces are replaced with images of characters. We build on their work and apply the principle to connected-component-level method identification. That is, for each connected component in an image, we determine whether the connected component is machine print or handwriting. We accomplish this by determining the face space for a large set of machine print characters. Representative machine print characters are then projected into this space. We determine a distance threshold for each representative machine print character based on a user-supplied global requirement for machine print precision and the radial density (see Figure 2) of machine print and handwriting surrounding the connected component. The algorithm is outlined below.

Figure 2: Radial density.

Let
    θ_global : user-supplied global requirement for machine print precision,
    θ_i : distance threshold for the i-th machine print representative,
    r_mp(d) : number of machine print connected components within distance d of the i-th machine print representative,
    r_hw(d) : number of handwriting connected components within distance d of the i-th machine print representative.

1. θ_i = 0
2. while r_mp(θ_i) / [ r_mp(θ_i) + r_hw(θ_i) ] ≥ θ_global, increment θ_i
3. if θ_i > 0, decrement θ_i
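Under stated assumptions, the face-space construction and the threshold-growing loop above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: images are flattened to row vectors, distances are Euclidean in face space, the threshold grows in a fixed step, and the growth is capped at the largest observed distance so the loop terminates (all function names are hypothetical).

```python
import numpy as np

def build_face_space(train, k):
    """Mean image Psi and the k most significant eigenvectors of the
    training set's covariance matrix (the basis of "face space")."""
    psi = train.mean(axis=0)          # Psi: average training image
    A = train - psi                   # centered images, one per row
    # Rows of Vt are eigenvectors of A.T @ A (covariance up to 1/M),
    # ordered by decreasing significance.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return psi, Vt[:k]                # u_1 ... u_k

def project(image, psi, U):
    """Projection weights: omega_k = u_k^T (Gamma - Psi)."""
    return U @ (image - psi)

def grow_threshold(rep_w, mp_ws, hw_ws, theta_global, step=1.0):
    """Grow the distance threshold theta_i for one machine print
    representative until machine print precision within the threshold
    would fall below theta_global, then step back once."""
    d_mp = np.linalg.norm(mp_ws - rep_w, axis=1)  # distances to machine print CCs
    d_hw = np.linalg.norm(hw_ws - rep_w, axis=1)  # distances to handwriting CCs
    d_max = max(d_mp.max(), d_hw.max())           # cap: assumption for termination

    def precision(t):
        mp = np.count_nonzero(d_mp <= t)
        hw = np.count_nonzero(d_hw <= t)
        return 1.0 if mp + hw == 0 else mp / (mp + hw)

    theta = 0.0
    while precision(theta) >= theta_global and theta < d_max:  # step 2
        theta += step
    if theta > 0:                                              # step 3
        theta -= step
    return theta
```

With machine print components at distances 1, 2, and 3 from a representative and handwriting first appearing at distance 5, the threshold stops just before the handwriting component would enter, matching the "just before the precision requirement fails" behavior described below.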

Thus, for each representative machine print connected component, the distance threshold grows from zero and stops just before the global requirement for machine print precision first fails to be satisfied. Connected components that are within the threshold of their nearest machine print representative in face space are classified as machine print; any connected components lying beyond the threshold are deemed handwriting. For example, in the annular diagram of Figure 2, handwriting connected components begin to appear at the third ring. Increasing the threshold further would drop the machine print precision below the global requirement, so the threshold is fixed at the third ring (shown in blue).

Preliminary results on microfilm titleboards are depicted in Figure 3 below.

Figure 3: Preliminary method identification results on microfilm titleboards. The connected components in these titleboards have been classified by our system. Green signifies machine print, while red indicates handwriting.

The confusion matrix over these four titleboards is:

                              Predicted handwriting    Predicted machine print
    Actually handwriting            88.9%                     11.1%
    Actually machine print          16.6%                     83.4%

Titleboards are a particularly noisy type of document to process; the results presented here are therefore below the system's typical performance. We also note that 35.2% of our handwriting false positives were caused by touching machine print characters, so one way to improve performance is to split touching characters.

The upper left image in Figure 3 reveals a difficulty in method identification on titleboards: the writer evidently expended great effort to make his or her handwriting look like machine print. Despite this, our algorithm is able to classify correctly in most cases.

Document Segmentation

The Docstrum [4] layout analysis algorithm groups the connected components into text lines and words. The method identification results presented above are especially encouraging because we make our classification at the connected component level; robust rules may be developed to make classifications at the word, text line, or block level.

Character Identification

Following method identification, the text is passed to the appropriate character recognition engine (either machine print or handwriting).

Index Construction

Recognition errors are inevitable due to poor image acquisition, image degradation, segmentation errors, etc., and this must be considered during index construction. We must decide whether to index possibly incorrect recognition results directly or attempt to correct them first. Lexical context may serve to correct recognition errors, requiring, for example, that indexed terms be present in a predefined dictionary.

Information retrieval systems are typically evaluated by two measures: precision and recall. Precision is the percentage of the retrieved results that are actually relevant. Recall is the percentage of the relevant results that were actually retrieved. Ideally, we would like 100% precision and 100% recall; in practice, we must trade one against the other.
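The two evaluation measures can be stated concretely. A minimal sketch over sets of document identifiers (the function name and data are illustrative, not part of the system):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# 4 results retrieved, 3 of them relevant, out of 6 relevant overall:
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "c", "e", "f", "g"])
# p = 0.75 (3/4 retrieved are relevant), r = 0.5 (3/6 relevant were retrieved)
```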
The nature of our problem, genealogical research, demands high recall at the expense of some precision. Thus, we build an index based directly on the (possibly incorrect) recognition results, rather than trying to force the indexed terms to be dictionary terms.

Querying

Queries take the form of Boolean or regular expressions in our system. However, these queries are expanded using approximate string matching to increase recall. Approximate string matching seeks the closest match for a word based on a set of editing operations. Each operation, such as inserting or deleting a character, has an associated cost, and the query expansion is user-controlled via an adjustable cost threshold. For example, the index might contain the term "buriais," stemming from misrecognizing "burials." The query "burials" is then expanded to include all indexed keywords within some edit distance of "burials."
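A minimal sketch of this expansion, assuming unit costs for each editing operation (the system's actual cost model is user-adjustable, and these function names are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def expand_query(term, index_terms, max_cost=1):
    """Expand a query term to all indexed terms within the cost threshold."""
    return [t for t in index_terms if edit_distance(term, t) <= max_cost]

# "buriais" (misrecognized "burials") is one substitution away, so it
# is swept into the expanded query:
expand_query("burials", ["buriais", "burial", "marriages"], max_cost=1)
# → ["buriais", "burial"]
```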

Future Work

The focus of this paper has been our research in connected-component-level method identification. We have implemented a prototype of the proposed framework, but much work remains before a robust system is available.

This work may also be extended in many ways. Adding metadata to index terms, such as script, language, and meaning, would enhance searching capabilities; for example, we could then search for titleboards with German city names.

Our system makes classifications at the connected component level. Exploiting spatial, stylistic, font, and linguistic context promises to increase robustness and accuracy. For example, we may safely assume that the connected components within a word are either all handwriting or all machine print, and confidence-weighted classifications of the constituent connected components ought to be harmonized with this assumption. Similar reasoning holds for the style and font of connected components within a word. Likewise, linguistic context in the form of lexicons and n-gram frequencies may also be brought to bear if character recognition is performed.

Finally, the system could be improved by specializing the set of representative machine print connected components. Currently, the representative set consists of 5957 machine print connected components in a wide variety of fonts and styles, which is necessary for success over a broad range of text. However, limiting the size of this set would increase the speed of the algorithm and simultaneously decrease the rate of handwriting false positives. In particular, for indexing microfilm titleboards the set of representative fonts could be limited to those present in a random but representative sample of titleboards.

Conclusion

We have proposed a system for automatically indexing genealogical microfilm titleboards. We have implemented a proof-of-concept prototype, and the system remains a work in progress.
However, the proposed system promises to greatly improve genealogical research by creating a searchable index of microfilm titleboards.

References

[1] O. D. Trier and T. Taxt, "Evaluation of binarization methods for document images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 3, pp. 312-315, Mar. 1995.
[2] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '91), pp. 586-591, Jun. 1991.
[3] N. Muller and B. Herbst, "The use of eigenpictures for optical character recognition," Proceedings of the Fourteenth International Conference on Pattern Recognition, vol. 2, pp. 1124-1126, Aug. 1998.
[4] L. O'Gorman, "The document spectrum for page layout analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1162-1173, Nov. 1993.