Groundtruth Image Generation from Electronic Text (Demonstration)
David Doermann and Gang Zi
Laboratory for Language and Media Processing, University of Maryland, College Park MD 21043, USA
{doermann,

Abstract

The problem of generating synthetic data for training and evaluating document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, there is a tremendous need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed an approach that uses the language support of the MSWindows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The Metafile information is parsed to generate zone, line, word, and character groundtruth, including location, font information and content, in any language supported by Windows. The processing is embedded in a collection of tools for data generation, groundtruthing, degradation and evaluation. The discussion here focuses on the Groundtruth Generator.

1 Introduction

Generating synthetic document images and symbolic groundtruth files on a large scale has become a recent focal point for training algorithms and evaluating the performance of systems [1], [2], [6]. Typically, training and evaluation require the groundtruth data to be keyed in manually from the scanned image, but this is often a prohibitively labor-intensive and error-prone process. Furthermore, it may require domain experts, especially for processing multi-lingual documents. In this paper we present a methodology for generating synthetic document images and symbolic groundtruth files automatically by using a custom print driver and metafile information. We give a brief survey of related work, describe the system architecture, and present the main component of our system: the groundtruth generator.

2 Related work

Using synthetic data has many advantages, including the rapid generation of datasets at low cost, easy control of degradation models and parameters, and convenient testing of the same underlying documents with different corruption methods [1]. Many methods have been proposed to generate synthetic data. In [3], the authors presented an approach to obtain noise-free document images from DVI (device independent format) files. However, the requirement of DVI files and LaTeX typesetting limits the practical application in many cases. In [4], the authors present an approach to propagating groundtruth information from an original collection, allowing the reuse of groundtruth information, while [5,6] extend previous work on the use of degradation models for data generation.

Choosing suitable corpora plays a crucial part in evaluating a system. A representative corpus should have all the characteristics of the target applications. Although a number of datasets have been created, they are typically not appropriate for all applications, but nevertheless allow focused evaluation. For example, English technical journals are used in the UW dataset, and magazines, Spanish newspapers, and English and German business letters are used in the UNLV evaluation set. Our approach allows users to supplement traditional groundtruth with images and groundtruth generated from electronic text, formatted in a way that is representative of the domain.

3 System Architecture

The system architecture is shown in Figure 1. Starting from structured electronic files, such as MSWord or HTML files, we import the source to a renderer and generate the noise-free images and the groundtruth files. The system uses MSWindows print drivers, so the document content can be rendered the same way to many different devices. The degraded images can be obtained from the ideal images through a degradation model, or by physically degrading (printing, scanning, faxing, etc.) a hard copy. Finally, the synthetic images and groundtruth files can be used for training and evaluation. One application of our work is to study the downstream effect of degradation on information retrieval (IR) and machine translation (MT) systems.

Figure 1: Evaluation system architecture. Components shown: structured source (such as Word or HTML), style, rendering formatter, renderer, groundtruth generator, TIFF image, symbolic groundtruth, degradation model, task evaluation (machine translation, information retrieval), evaluation results, profile.
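
The degradation-model step in the architecture takes an ideal image and returns a corrupted one. The sketch below illustrates that interface only. It is not the parameterized model used in the system: as a stand-in, it assumes a bilevel image stored as a list of rows of 0/1 pixels and flips each pixel independently with a fixed probability. The function name `degrade_bitmap` and its parameters are hypothetical.

```python
import random


def degrade_bitmap(bitmap, flip_prob, seed=None):
    """Return a degraded copy of a bilevel image.

    Assumption: `bitmap` is a list of rows, each a list of 0/1 pixels.
    Each pixel is flipped independently with probability `flip_prob`.
    This is a simple stand-in for a parameterized pixel-level model,
    not the model described in the paper.
    """
    rng = random.Random(seed)  # seeded so experiments can be repeated
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in bitmap
    ]


if __name__ == "__main__":
    ideal = [
        [0, 0, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 0, 0],
    ]
    noisy = degrade_bitmap(ideal, flip_prob=0.1, seed=42)
    for row in noisy:
        print("".join("#" if px else "." for px in row))
```

Because the groundtruth is generated from the ideal image, the same groundtruth file remains valid for every degraded copy as long as the degradation does not move the content.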

4 Groundtruth Generator

In our system, a groundtruth generator (GTG) is used to obtain the synthetic noise-free images and the symbolic groundtruth files. First, the structured documents, such as HTML or MSWord files, are input to the GTG system. Image files and metafiles are generated via a custom printer driver. By parsing the resulting metafiles, we obtain symbolic and layout information, and generate groundtruth in various formats. The synthetic images and layout information are used to create overlaid images, where the bounding boxes are displayed at the character, word, line, and zone level.

Three kinds of groundtruth files are generated in GTG: Standard, Raw and Structured.

Standard groundtruth contains basic information about the size of the reference pages, the fonts used in the document, the character set of the content, and zone, line, word and character information where appropriate. For each zone, we identify the type of zone (Text, Image or Graphic); for each word, we identify the font; and for each character, we identify the font glyph and Unicode text. An excerpt:

    CONTENT (pixels): (629,264,1863,2279)
    PAGE SIZE (mm): (0,0,213,273)
    RESOLUTION: 301 dpi
    Font 0: Times New Roman, ARABIC_CHARSET
    Font 1: Times New Roman, ANSI_CHARSET
    Font 2: Times New Roman, ANSI_CHARSET
    Font 3: Bold Times New Roman, ARABIC_CHARSET
    Font 4: Bold Times New Roman, ARABIC_CHARSET
    Font 5: Bold Courier New, ARABIC_CHARSET
    ZONE 0: (1248, 2237, 1267, 2279) T
      LINE 0: (1248, 2237, 1267, 2279)
        WORD 0: (1248, 2237, 1267, 2279) 2
          CHAR 0: (1248, 2237, 1267, 2279), 50, 00 32, 50
    ZONE 1: ( 629, 336, 1863, 1205) T
      LINE 0: (1446, 336, 1863, 406)
        WORD 0: (1446, 340, 1554, 406) 5
          CHAR 0: (1446, 340, 1482, 406), 1575, 06 27, 909
          CHAR 1: (1482, 340, 1518, 406), 65194,fe aa, 938
          CHAR 2: (1518, 340, 1554, 406), 65191,fe a7, 935
          CHAR 3: (1554, 340, 1589, 406), 32,00 20, 3
        WORD 1: (1589, 340, 1732, 406) 5

Raw groundtruth files are in Unicode format or in the original coding format. These files contain only the character content and can be used for evaluation. The encoding is identical to the original encoding used to generate the structured document.

Structured groundtruth files include HTML and files. The HTML files can be used to check whether the groundtruth file matches the original document by simply viewing the results in a browser. The files are used for data exchange or storage.

Because the groundtruth files are parsed from metafiles, data from any character set can be created as long as True Type Font (TTF) files are used. We have tested our system on dozens of languages, including Arabic, Chinese, Farsi, Japanese, Thai, Hindi and Korean. From this point of view, our system provides a universal framework to generate groundtruth files for multi-lingual documents.
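
The CHAR records in the Standard groundtruth excerpt earlier in this section can be read back with a few lines of code. The following is a minimal sketch, not the system's own reader. It assumes, based only on the sample, that each CHAR line carries a bounding box in pixels, the Unicode code point in decimal, the same code point as two hexadecimal bytes, and a font glyph index. The names `CharRecord` and `parse_char_line` are hypothetical.

```python
import re
from dataclasses import dataclass

# Field layout inferred from the sample excerpt (an assumption):
# CHAR <n>: (x0, y0, x1, y1), <code point, decimal>, <hi> <lo> (hex), <glyph>
CHAR_RE = re.compile(
    r"CHAR\s+(\d+):\s*\(\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\s*\),"
    r"\s*(\d+),\s*([0-9a-fA-F]{2})\s+([0-9a-fA-F]{2}),\s*(\d+)"
)


@dataclass
class CharRecord:
    index: int       # position of the character within its word
    box: tuple       # (x0, y0, x1, y1) bounding box in pixels
    codepoint: int   # Unicode code point
    glyph: int       # glyph index in the font (assumed meaning)

    @property
    def text(self):
        return chr(self.codepoint)


def parse_char_line(line):
    """Parse one CHAR line; return None if the line is not a CHAR record."""
    m = CHAR_RE.search(line)
    if m is None:
        return None
    idx, x0, y0, x1, y1, cp = (int(g) for g in m.groups()[:6])
    # The two hex bytes should encode the same code point as the decimal field.
    if int(m.group(7) + m.group(8), 16) != cp:
        raise ValueError(f"decimal and hex code points disagree: {line!r}")
    return CharRecord(idx, (x0, y0, x1, y1), cp, int(m.group(9)))


if __name__ == "__main__":
    sample = [
        "WORD 0: (1446, 340, 1554, 406) 5",
        "CHAR 0: (1446, 340, 1482, 406), 1575, 06 27, 909",
        "CHAR 1: (1482, 340, 1518, 406), 65194,fe aa, 938",
    ]
    for line in sample:
        rec = parse_char_line(line)
        if rec is not None:
            print(rec.index, rec.box, hex(rec.codepoint), rec.glyph)
```

Checking the decimal field against the hex bytes is a cheap consistency test that catches truncated or misaligned records before they reach an evaluation tool.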

The synthetic images are noise-free and can be generated at different resolutions. These images can be degraded at the pixel level with a parameterized model, or at the page level with noise templates.

Figure 2: Overlaid images at the character, line, and zone level

For more information about the system, please contact the authors.

References

[1] D. Doermann and S. Yao. Generating Synthetic Data for Text Analysis Systems. SDAIR.
[2] Tin Kam Ho and Henry S. Baird. Evaluation of Accuracy Using Synthetic Data. SDAIR95, 1995.
[3] T. Kanungo, R.M. Haralick, and I.T. Phillips. Nonlinear local and global document degradation models. IJIST, 5(4):220-30.
[4] T. Kanungo and R.M. Haralick. An automatic closed-loop methodology for generating character groundtruth for scanned documents. PAMI, 21(2), February.
[5] T. Kanungo, R.M. Haralick, H.S. Baird, W. Stuetzle, and D. Madigan. A statistical, nonparametric methodology for document degradation model validation. PAMI, 22(11), November.
[6] T. Kanungo and P. Resnik. The Bible, Truth, and Multilingual Evaluation. SPIE Conference on Document Recognition and Retrieval (VI), pages 86-96, January.
[7] D.W. Kim and T. Kanungo. Attributed point matching for automatic groundtruth generation. IJDAR, 5(1):47-66.
[8] T. Kanungo et al. Document Degradation Models: Parameter Estimation and Model Validation. Proc. of IAPR Workshop on Machine Vision and Applications, Kawasaki, Japan, 1994.
[9] Esko Ukkonen. Algorithm for Approximate String Matching. Information and Control, vol. 64, 1985.
[10] S.V. Rice et al. The fifth annual test of accuracy. Tech. Rep. TR-96-01, Information Science Research Institute, University of Nevada, Las Vegas, NV, 1996.
Handwritten Devanagari Character Recognition Model Using Neural Network Gaurav Jaiswal M.Sc. (Computer Science) Department of Computer Science Banaras Hindu University, Varanasi. India gauravjais88@gmail.com
More informationTEXT line segmentation is one of the major components of
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 8, AUGUST 2008 1313 Script-Independent Text Line Segmentation in Freestyle Handwritten Documents Yi Li, Student Member, IEEE,
More informationRestoring Chinese Documents Images Based on Text Boundary Lines
Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Restoring Chinese Documents Images Based on Text Boundary Lines Hong Liu Key Laboratory
More informationDocument Recognition and Retrieval. DOCLIB: A Software Library for Document Processing
Document Recognition and Retrieval DOCLIB: A Software Library for Document Processing Stefan Jaeger, Guangyu Zhu, David Doermann Institute for Advanced Computer Studies Laboratory for Language and Media
More informationA Ground-Truthed Mathematical Character and Symbol Image Database
A Ground-Truthed Mathematical Character and Symbol Image Database Masakazu Suzuki, Seiichi Uchida and Akihiro Nomura Faculty of Mathematics, Faculty of Information Science and Electrical Engineering, Graduate
More informationWORKSTATION APPLICATION NVIDIA POWERdraft Release Notes. Software Version 15.06
WORKSTATION APPLICATION NVIDIA POWERdraft Release Notes Software Version 15.06 NVIDIA Corporation DECEMBER 2002 Published by NVIDIA Corporation 2701 San Tomas Expressway Santa Clara, CA 95050 Copyright
More informationCID-Keyed Font Technology Overview
CID-Keyed Font Technology Overview Adobe Developer Support Technical Note #5092 12 September 1994 Adobe Systems Incorporated Adobe Developer Technologies 345 Park Avenue San Jose, CA 95110 http://partners.adobe.com/
More informationMako is a multi-platform technology for creating,
1 Multi-platform technology for prepress, document conversion and manipulation Mako is a multi-platform technology for creating, interrogating, manipulating and visualizing PDF documents, offering precise
More informationVPL-D100 Series Data Projectors. Simulated images VPL-DW125 VPL-DX140 VPL-DX145 VPL-DX120 VPL-DW120
VPL-D100 Series Data Projectors Simulated images VPL-DW125 VPL-DX140 VPL-DX145 VPL-DX120 VPL-DW120 Sleek Compact Projector with Good TCO and an Energy-efficient Design The VPL-D100 Series delivers convenient
More informationThe Processing of Form Documents
The Processing of Form Documents David S. Doermann and Azriel Rosenfeld Document Processing Group, Center for Automation Research University of Maryland, College Park 20742 email: doermann@cfar.umd.edu,
More informationDetermining Document Skew Using Inter-Line Spaces
2011 International Conference on Document Analysis and Recognition Determining Document Skew Using Inter-Line Spaces Boris Epshtein Google Inc. 1 1600 Amphitheatre Parkway, Mountain View, CA borisep@google.com
More informationHow to Build a Digital Library
How to Build a Digital Library Ian H. Witten & David Bainbridge Contents Preface Acknowledgements i iv 1. Orientation: The world of digital libraries 1 One: Supporting human development 1 Two: Pushing
More informationAdaptive Transformation-based Learning for Improving Dictionary Tagging
Adaptive Transformation-based Learning for Improving Dictionary Tagging Burcu Karagol-Ayan, David Doermann, and Amy Weinberg Institute for Advanced Computer Studies (UMIACS) University of Maryland College
More information