Groundtruth Image Generation from Electronic Text (Demonstration)

David Doermann and Gang Zi
Laboratory for Language and Media Processing, University of Maryland, College Park MD 21043, USA
{doermann,

Abstract

The problem of generating synthetic data for training and evaluating document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, there is a tremendous need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed an approach that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The Metafile information is parsed to generate zone, line, word, and character groundtruth, including location, font information, and content, in any language supported by Windows. The processing is embedded in a collection of tools for data generation, groundtruthing, degradation, and evaluation. The discussion here focuses on the Groundtruth Generator.

1 Introduction

Generating synthetic document images and symbolic groundtruth files at large scale has become a recent focal point for training algorithms and evaluating the performance of systems [1], [2], [6]. Typically, training and evaluation require the groundtruth data to be keyed in manually from the scanned image, but this is often a prohibitively labor-intensive and error-prone process. Furthermore, it may require domain experts, especially for processing multilingual documents. Here we present a methodology for generating synthetic document images and symbolic groundtruth files automatically by using a custom print driver and metafile information. We give a brief survey of related work, describe the system architecture, and present the main component of our system: the groundtruth generator.
2 Related Work

Using synthetic data has many advantages, including the rapid generation of datasets at low cost, easy control of degradation models and parameters, and convenient testing of the same underlying documents with different corruption methods [1]. Many methods have been proposed for generating synthetic data. In [3], the authors presented an approach for obtaining noise-free document images from DVI (device-independent format) files. However, the requirement of DVI files and LaTeX typesetting limits the practical application in many cases. In [4], the authors present an approach to propagating groundtruth information from an original collection, allowing the groundtruth to be reused, and [5,6] extend previous work on the use of degradation models for data generation.

Choosing suitable corpora plays a crucial part in evaluating a system. A representative corpus should have all the characteristics of the target applications. Although a number of datasets have been created, they are typically not appropriate for all applications; nevertheless, they allow focused evaluation. For example, English technical journals are used in the UW dataset, and magazines, Spanish newspapers, and English and German business letters are used in the UNLV evaluation set. Our approach allows users to supplement traditional groundtruth with images and groundtruth generated from electronic text, formatted in a way that is representative of the domain.

3 System Architecture

The system architecture is shown in Figure 1. Starting from structured electronic files, such as MS Word or HTML files, we import the source into a renderer and generate the noise-free images and the groundtruth files. The system uses MS Windows print drivers, so the document content can be rendered the same way to many different devices. The degraded images can be obtained from the ideal images through a degradation model, or by physically degrading (printing, scanning, faxing, etc.) a hard copy. Finally, the synthetic images and groundtruth files can be used for training and evaluation. One application of our work is to study the downstream effect of degradation on information retrieval (IR) and machine translation (MT) systems.
[Figure 1: Evaluation system architecture. Labeled blocks: structured source (such as Word or HTML), style formatter, renderer, groundtruth generator, TIFF image, symbolic groundtruth, degradation model, machine translation, information retrieval, task evaluation, evaluation results profile.]
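The degradation-model path in the architecture can be illustrated with a small sketch. This is not the model used by the authors: it is a hypothetical pixel-level model in which the probability of flipping a pixel decays with its distance to the nearest pixel of the opposite colour, in the spirit of the distance-based degradation models cited in the references. The parameter names `alpha0`, `alpha`, and `eta`, and the representation of a bilevel image as a list of rows of 0/1 values, are assumptions made for the example.

```python
import math
import random
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def distance_to_opposite(image):
    """City-block distance from each pixel to the nearest pixel of the other colour.

    Pixels in a uniformly coloured image get math.inf.
    """
    h, w = len(image), len(image[0])
    dist = [[math.inf] * w for _ in range(h)]
    queue = deque()
    # Seed the search with boundary pixels: those with a 4-neighbour of the other colour.
    for y in range(h):
        for x in range(w):
            for dy, dx in NEIGHBOURS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and image[ny][nx] != image[y][x]:
                    dist[y][x] = 1
                    queue.append((y, x))
                    break
    # Breadth-first propagation outward from the boundary pixels.
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBOURS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] == math.inf:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist

def degrade(image, alpha0=0.5, alpha=1.0, eta=0.0, seed=0):
    """Flip each pixel with probability alpha0 * exp(-alpha * d^2) + eta (alpha > 0)."""
    rng = random.Random(seed)
    dist = distance_to_opposite(image)
    degraded = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            p_flip = alpha0 * math.exp(-alpha * dist[y][x] ** 2) + eta
            new_row.append(1 - pixel if rng.random() < p_flip else pixel)
        degraded.append(new_row)
    return degraded

# A 3x3 ideal image with a single black pixel in the centre.
ideal = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
noisy = degrade(ideal, alpha0=0.5, alpha=1.0, eta=0.01, seed=42)
```

Because the groundtruth is produced from the ideal image, it remains valid for any degraded copy generated this way, as long as the degradation does not move content geometrically.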

4 Groundtruth Generator

In our system, a groundtruth generator (GTG) is used to obtain the synthetic noise-free images and the symbolic groundtruth files. First, the structured documents, such as HTML or MS Word files, are input to the GTG system. Image files and metafiles are generated via a custom printer driver. By parsing the resulting metafiles, we obtain symbolic and layout information and generate groundtruth in various formats. The synthetic images and layout information are used to create overlaid images, in which the bounding boxes are displayed at the character, word, line, and zone level.

Three kinds of groundtruth files are generated by GTG: Standard, Raw, and Structured.

Standard groundtruth contains basic information about the size of the reference pages, the fonts used in the document, the character set of the content, and zone, line, word, and character information where appropriate. For each zone, we identify the type of zone (Text, Image, or Graphic); for each word, we identify the font; and for each character, we identify the font glyph and Unicode text. An excerpt:

    CONTENT (pixels): (629,264,1863,2279)
    PAGE SIZE (mm): (0,0,213,273)
    RESOLUTION: 301 dpi
    Font 0: Times New Roman, ARABIC_CHARSET
    Font 1: Times New Roman, ANSI_CHARSET
    Font 2: Times New Roman, ANSI_CHARSET
    Font 3: Bold Times New Roman, ARABIC_CHARSET
    Font 4: Bold Times New Roman, ARABIC_CHARSET
    Font 5: Bold Courier New, ARABIC_CHARSET
    ZONE 0: (1248, 2237, 1267, 2279) T
      LINE 0: (1248, 2237, 1267, 2279)
        WORD 0: (1248, 2237, 1267, 2279) 2
          CHAR 0: (1248, 2237, 1267, 2279), 50, 00 32, 50
    ZONE 1: ( 629, 336, 1863, 1205) T
      LINE 0: (1446, 336, 1863, 406)
        WORD 0: (1446, 340, 1554, 406) 5
          CHAR 0: (1446, 340, 1482, 406), 1575, 06 27, 909
          CHAR 1: (1482, 340, 1518, 406), 65194,fe aa, 938
          CHAR 2: (1518, 340, 1554, 406), 65191,fe a7, 935
          CHAR 3: (1554, 340, 1589, 406), 32,00 20, 3
        WORD 1: (1589, 340, 1732, 406) 5

Raw groundtruth files are in Unicode format or in the original coding format. These files contain only the character content and can be used for evaluation. The encoding is identical to the original encoding used to generate the structured document.

Structured groundtruth files include HTML and files. The HTML files can be used to check whether the groundtruth file is the same as the original document by simply viewing the results in a browser. The files are used for data exchange or storage.

Because the groundtruth files are parsed from metafiles, data from any character set can be created as long as TrueType font (TTF) files are used. We have tested our system on dozens of languages, including Arabic, Chinese, Farsi, Japanese, Thai, Hindi, and Korean. From this point of view, our system provides a universal framework for generating groundtruth files for multilingual documents.
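To make the Standard groundtruth excerpt above concrete, here is a minimal parsing sketch. The field interpretation is an assumption inferred from the excerpt, not a documented specification: each CHAR record is read as a bounding box, a decimal Unicode code point, the same code point as two hexadecimal bytes, and a font glyph index. The names `CharRecord`, `parse_chars`, and `word_box` are hypothetical. The sketch cross-checks the decimal and hexadecimal values and rebuilds a word box as the union of its non-space character boxes.

```python
import re
from dataclasses import dataclass

# One CHAR record, e.g. "CHAR 0: (1446, 340, 1482, 406), 1575, 06 27, 909".
CHAR_RE = re.compile(
    r"CHAR\s+(\d+):\s*\(\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\s*\),\s*"
    r"(\d+),\s*([0-9a-fA-F]{2})\s+([0-9a-fA-F]{2}),\s*(\d+)"
)

@dataclass
class CharRecord:
    index: int
    box: tuple       # (left, top, right, bottom) in pixels (assumed order)
    codepoint: int   # Unicode code point (assumed meaning of the decimal field)
    glyph: int       # font glyph index (assumed meaning of the last field)

    @property
    def text(self):
        return chr(self.codepoint)

def parse_chars(text):
    """Extract all CHAR records and verify decimal and hex code points agree."""
    records = []
    for m in CHAR_RE.finditer(text):
        idx, left, top, right, bottom, dec, hi, lo, glyph = m.groups()
        codepoint = int(dec)
        if codepoint != int(hi + lo, 16):
            raise ValueError(f"CHAR {idx}: {dec} does not match hex {hi} {lo}")
        box = (int(left), int(top), int(right), int(bottom))
        records.append(CharRecord(int(idx), box, codepoint, int(glyph)))
    return records

def word_box(records):
    """Union of the bounding boxes of all non-space characters."""
    boxes = [r.box for r in records if not r.text.isspace()]
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

SAMPLE = """
WORD 0: (1446, 340, 1554, 406) 5
CHAR 0: (1446, 340, 1482, 406), 1575, 06 27, 909
CHAR 1: (1482, 340, 1518, 406), 65194,fe aa, 938
CHAR 2: (1518, 340, 1554, 406), 65191,fe a7, 935
CHAR 3: (1554, 340, 1589, 406), 32,00 20, 3
"""

records = parse_chars(SAMPLE)
print([hex(r.codepoint) for r in records])
print(word_box(records))
```

On this excerpt the union of the three non-space character boxes equals the box recorded for WORD 0, which suggests that the trailing space (CHAR 3) is not counted as part of the word extent.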

The synthetic images are noise-free and can be generated at different resolutions. These images can be degraded at the pixel level with a parameterized model, or at the page level with noise templates.

Figure 2: Overlaid images at the character, line, and zone level

For more information about the system, please contact the authors.

References

[1] D. Doermann and S. Yao. Generating Synthetic Data for Text Analysis Systems. SDAIR, pages ,
[2] Tin Kam Ho and Henry S. Baird. Evaluation of Accuracy Using Synthetic Data. SDAIR95, 1995.
[3] T. Kanungo, R.M. Haralick, and I.T. Phillips. Nonlinear local and global document degradation models. IJIST, 5(4):220-30,
[4] T. Kanungo and R.M. Haralick. An automatic closed-loop methodology for generating character groundtruth for scanned documents. PAMI, 21(2): , February
[5] T. Kanungo, R.M. Haralick, H.S. Baird, W. Stuetzle, and D. Madigan. A statistical, nonparametric methodology for document degradation model validation. PAMI, 22(11): , November
[6] T. Kanungo and P. Resnik. The Bible, Truth, and Multilingual Evaluation. SPIE Conference on Document Recognition and Retrieval (VI), pages 86-96, January
[7] D.W. Kim and T. Kanungo. Attributed point matching for automatic groundtruth generation. IJDAR, 5(1):47-66,
[8] T. Kanungo et al. Document Degradation Models: Parameter Estimation and Model Validation. Proc. of IAPR Workshop on Machine Vision and Applications, Kawasaki, Japan, 1994, pp
[9] Esko Ukkonen. Algorithm for Approximate String Matching. Information and Control, vol. 64, pp , 1985.
[10] S.V. Rice et al. The fifth annual test of accuracy. Tech. Rep. TR-96-01, Information Science Research Institute, University of Nevada, Las Vegas, NV, 1996.


More information

Multi-scale Techniques for Document Page Segmentation

Multi-scale Techniques for Document Page Segmentation Multi-scale Techniques for Document Page Segmentation Zhixin Shi and Venu Govindaraju Center of Excellence for Document Analysis and Recognition (CEDAR), State University of New York at Buffalo, Amherst

More information

A Segmentation Free Approach to Arabic and Urdu OCR

A Segmentation Free Approach to Arabic and Urdu OCR A Segmentation Free Approach to Arabic and Urdu OCR Nazly Sabbour 1 and Faisal Shafait 2 1 Department of Computer Science, German University in Cairo (GUC), Cairo, Egypt; 2 German Research Center for Artificial

More information

Table of Contents. Installation Global Office Mini-Tutorial Additional Information... 12

Table of Contents. Installation Global Office Mini-Tutorial Additional Information... 12 TM Table of Contents Installation... 1 Global Office Mini-Tutorial... 5 Additional Information... 12 Installing Global Suite The Global Suite installation program installs both Global Office and Global

More information

Layout Segmentation of Scanned Newspaper Documents

Layout Segmentation of Scanned Newspaper Documents , pp-05-10 Layout Segmentation of Scanned Newspaper Documents A.Bandyopadhyay, A. Ganguly and U.Pal CVPR Unit, Indian Statistical Institute 203 B T Road, Kolkata, India. Abstract: Layout segmentation algorithms

More information

Segmentation of Characters of Devanagari Script Documents

Segmentation of Characters of Devanagari Script Documents WWJMRD 2017; 3(11): 253-257 www.wwjmrd.com International Journal Peer Reviewed Journal Refereed Journal Indexed Journal UGC Approved Journal Impact Factor MJIF: 4.25 e-issn: 2454-6615 Manpreet Kaur Research

More information

LECTURE 6 TEXT PROCESSING

LECTURE 6 TEXT PROCESSING SCIENTIFIC DATA COMPUTING 1 MTAT.08.042 LECTURE 6 TEXT PROCESSING Prepared by: Amnir Hadachi Institute of Computer Science, University of Tartu amnir.hadachi@ut.ee OUTLINE Aims Character Typology OCR systems

More information

PLATYPUS FUNCTIONAL REQUIREMENTS V. 2.02

PLATYPUS FUNCTIONAL REQUIREMENTS V. 2.02 PLATYPUS FUNCTIONAL REQUIREMENTS V. 2.02 TABLE OF CONTENTS Introduction... 2 Input Requirements... 2 Input file... 2 Input File Processing... 2 Commands... 3 Categories of Commands... 4 Formatting Commands...

More information

Character Encodings. Fabian M. Suchanek

Character Encodings. Fabian M. Suchanek Character Encodings Fabian M. Suchanek 22 Semantic IE Reasoning Fact Extraction You are here Instance Extraction singer Entity Disambiguation singer Elvis Entity Recognition Source Selection and Preparation

More information

TUGboat, Volume 37 (2016), No

TUGboat, Volume 37 (2016), No TUGboat, Volume 37 (2016), No. 2 163 MFCONFIG: A METAFONT plug-in module for the Freetype rasterizer Jaeyoung Choi, Sungmin Kim, Hojin Lee and Geunho Jeong Abstract One of METAFONT s advantages is its

More information

Localizing Intellicus. Version: 7.3

Localizing Intellicus. Version: 7.3 Localizing Intellicus Version: 7.3 Copyright 2015 Intellicus Technologies This document and its content is copyrighted material of Intellicus Technologies. The content may not be copied or derived from,

More information

NEW ALGORITHMS FOR SKEWING CORRECTION AND SLANT REMOVAL ON WORD-LEVEL

NEW ALGORITHMS FOR SKEWING CORRECTION AND SLANT REMOVAL ON WORD-LEVEL NEW ALGORITHMS FOR SKEWING CORRECTION AND SLANT REMOVAL ON WORD-LEVEL E.Kavallieratou N.Fakotakis G.Kokkinakis Wire Communication Laboratory, University of Patras, 26500 Patras, ergina@wcl.ee.upatras.gr

More information

ICH M8 Expert Working Group. Specification for Submission Formats for ectd v1.1

ICH M8 Expert Working Group. Specification for Submission Formats for ectd v1.1 INTERNATIONAL COUNCIL FOR HARMONISATION OF TECHNICAL REQUIREMENTS FOR PHARMACEUTICALS FOR HUMAN USE ICH M8 Expert Working Group Specification for Submission Formats for ectd v1.1 November 10, 2016 DOCUMENT

More information

Layout Analysis of Urdu Document Images

Layout Analysis of Urdu Document Images Layout Analysis of Urdu Document Images Faisal Shafait*, Adnan-ul-Hasan, Daniel Keysers*, and Thomas M. Breuel** *Image Understanding and Pattern Recognition (IUPR) research group German Research Center

More information

Part III: Survey of Internet technologies

Part III: Survey of Internet technologies Part III: Survey of Internet technologies Content (e.g., HTML) kinds of objects we re moving around? References (e.g, URLs) how to talk about something not in hand? Protocols (e.g., HTTP) how do things

More information

Scanner Parameter Estimation Using Bilevel Scans of Star Charts

Scanner Parameter Estimation Using Bilevel Scans of Star Charts Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 1-1-2001 Scanner Parameter Estimation Using

More information

Urdu Hindi Transliteration System Help Document V1.2 URL:

Urdu Hindi Transliteration System Help Document V1.2 URL: System Requirements: http://uh.learnpunjabi.org Urdu Hindi Transliteration System Help Document V1.2 URL: http://uh.learnpunjabi.org/ Browser : Internet Explorer 6 or Higher Unicode Font : GIST_UROTNabeel;

More information

One type of these solutions is automatic license plate character recognition (ALPR).

One type of these solutions is automatic license plate character recognition (ALPR). 1.0 Introduction Modelling, Simulation & Computing Laboratory (msclab) A rapid technical growth in the area of computer image processing has increased the need for an efficient and affordable security,

More information

Segmentation of Handwritten Textlines in Presence of Touching Components

Segmentation of Handwritten Textlines in Presence of Touching Components 2011 International Conference on Document Analysis and Recognition Segmentation of Handwritten Textlines in Presence of Touching Components Jayant Kumar Le Kang David Doermann Wael Abd-Almageed Institute

More information

Handwritten Devanagari Character Recognition Model Using Neural Network

Handwritten Devanagari Character Recognition Model Using Neural Network Handwritten Devanagari Character Recognition Model Using Neural Network Gaurav Jaiswal M.Sc. (Computer Science) Department of Computer Science Banaras Hindu University, Varanasi. India gauravjais88@gmail.com

More information

TEXT line segmentation is one of the major components of

TEXT line segmentation is one of the major components of IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 8, AUGUST 2008 1313 Script-Independent Text Line Segmentation in Freestyle Handwritten Documents Yi Li, Student Member, IEEE,

More information

Restoring Chinese Documents Images Based on Text Boundary Lines

Restoring Chinese Documents Images Based on Text Boundary Lines Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Restoring Chinese Documents Images Based on Text Boundary Lines Hong Liu Key Laboratory

More information

Document Recognition and Retrieval. DOCLIB: A Software Library for Document Processing

Document Recognition and Retrieval. DOCLIB: A Software Library for Document Processing Document Recognition and Retrieval DOCLIB: A Software Library for Document Processing Stefan Jaeger, Guangyu Zhu, David Doermann Institute for Advanced Computer Studies Laboratory for Language and Media

More information

A Ground-Truthed Mathematical Character and Symbol Image Database

A Ground-Truthed Mathematical Character and Symbol Image Database A Ground-Truthed Mathematical Character and Symbol Image Database Masakazu Suzuki, Seiichi Uchida and Akihiro Nomura Faculty of Mathematics, Faculty of Information Science and Electrical Engineering, Graduate

More information

WORKSTATION APPLICATION NVIDIA POWERdraft Release Notes. Software Version 15.06

WORKSTATION APPLICATION NVIDIA POWERdraft Release Notes. Software Version 15.06 WORKSTATION APPLICATION NVIDIA POWERdraft Release Notes Software Version 15.06 NVIDIA Corporation DECEMBER 2002 Published by NVIDIA Corporation 2701 San Tomas Expressway Santa Clara, CA 95050 Copyright

More information

CID-Keyed Font Technology Overview

CID-Keyed Font Technology Overview CID-Keyed Font Technology Overview Adobe Developer Support Technical Note #5092 12 September 1994 Adobe Systems Incorporated Adobe Developer Technologies 345 Park Avenue San Jose, CA 95110 http://partners.adobe.com/

More information

Mako is a multi-platform technology for creating,

Mako is a multi-platform technology for creating, 1 Multi-platform technology for prepress, document conversion and manipulation Mako is a multi-platform technology for creating, interrogating, manipulating and visualizing PDF documents, offering precise

More information

VPL-D100 Series Data Projectors. Simulated images VPL-DW125 VPL-DX140 VPL-DX145 VPL-DX120 VPL-DW120

VPL-D100 Series Data Projectors. Simulated images VPL-DW125 VPL-DX140 VPL-DX145 VPL-DX120 VPL-DW120 VPL-D100 Series Data Projectors Simulated images VPL-DW125 VPL-DX140 VPL-DX145 VPL-DX120 VPL-DW120 Sleek Compact Projector with Good TCO and an Energy-efficient Design The VPL-D100 Series delivers convenient

More information

The Processing of Form Documents

The Processing of Form Documents The Processing of Form Documents David S. Doermann and Azriel Rosenfeld Document Processing Group, Center for Automation Research University of Maryland, College Park 20742 email: doermann@cfar.umd.edu,

More information

Determining Document Skew Using Inter-Line Spaces

Determining Document Skew Using Inter-Line Spaces 2011 International Conference on Document Analysis and Recognition Determining Document Skew Using Inter-Line Spaces Boris Epshtein Google Inc. 1 1600 Amphitheatre Parkway, Mountain View, CA borisep@google.com

More information

How to Build a Digital Library

How to Build a Digital Library How to Build a Digital Library Ian H. Witten & David Bainbridge Contents Preface Acknowledgements i iv 1. Orientation: The world of digital libraries 1 One: Supporting human development 1 Two: Pushing

More information

Adaptive Transformation-based Learning for Improving Dictionary Tagging

Adaptive Transformation-based Learning for Improving Dictionary Tagging Adaptive Transformation-based Learning for Improving Dictionary Tagging Burcu Karagol-Ayan, David Doermann, and Amy Weinberg Institute for Advanced Computer Studies (UMIACS) University of Maryland College

More information