Groundtruth Image Generation from Electronic Text (Demonstration)

David Doermann and Gang Zi
Laboratory for Language and Media Processing, University of Maryland, College Park MD 21043, USA
{doermann,

Abstract

The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, there is a tremendous need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed an approach that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character groundtruth, including location, font information, and content, in any language supported by Windows. The processing is embedded in a collection of tools for data generation, groundtruthing, degradation, and evaluation. The discussion here focuses on the Groundtruth Generator.

1 Introduction

Generating synthetic document images and symbolic groundtruth files at large scale has recently become a focal point for training algorithms and evaluating the performance of systems [1], [2], [6]. Typically, training and evaluation require the groundtruth data to be keyed in manually from the scanned image, which is often a prohibitively labor-intensive and error-prone process. Furthermore, it may require domain experts, especially for multilingual documents. Here we present a methodology for generating synthetic document images and symbolic groundtruth files automatically by using a custom print driver and metafile information. We give a brief survey of related work, describe the system architecture, and present the main component of our system: the groundtruth generator.
2 Related Work

Using synthetic data has many advantages, including the rapid generation of datasets at low cost, easy control of degradation models and parameters, and convenient testing of the same underlying documents with different corruption methods [1]. Many methods have been proposed for generating synthetic data. In [3], the authors presented an approach for obtaining noise-free document images from DVI (device-independent format) files. However, the requirement of DVI files and LaTeX typesetting limits the practical application in many cases. In [4], the authors present an approach to propagating groundtruth information from an original collection, allowing the reuse of groundtruth information, and [5,6] extend previous work on the use of degradation models for data generation.

Choosing suitable corpora plays a crucial part in evaluating a system. A representative corpus should have all the characteristics of the target applications. Although a number of datasets have been created, none is appropriate for all applications; each nevertheless allows focused evaluation. For example, English technical journals are used in the UW dataset, while magazines, Spanish newspapers, and English and German business letters are used in the UNLV evaluation set. Our approach allows users to supplement traditional groundtruth with images and groundtruth generated from electronic text, formatted in a way that is representative of the domain.

3 System Architecture

The system architecture is shown in Figure 1. Starting from structured electronic files, such as MS Word or HTML files, we import the source into a renderer and generate the noise-free images and the groundtruth files. The system uses MS Windows print drivers, so the document content can be rendered the same way to many different devices. The degraded images can be obtained from the ideal images through a degradation model, or by physically degrading a hard copy (printing, scanning, faxing, etc.). Finally, the synthetic images and groundtruth files can be used for training and evaluation. One application of our work is to study the downstream effect of degradation on information retrieval (IR) and machine translation (MT) systems.
Figure 1: Evaluation system architecture. (Block labels in the diagram: structured source such as Word or HTML; style; rendering formatter; renderer; groundtruth generator; TIFF image; symbolic groundtruth; degradation model; degradation and evaluation; machine translation and information retrieval task evaluation; evaluation results profile.)
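The degradation-model step in the architecture can be illustrated with a small sketch. The Python below is not the authors' implementation. It assumes a binary image stored as nested lists (1 for ink, 0 for background) and flips each pixel with probability `alpha0 * exp(-alpha * d^2) + eta`, where `d` is the city-block distance to the nearest pixel of the opposite color, loosely in the spirit of the distance-based degradation models cited in the references. The function names, parameter names, and default values are all hypothetical.

```python
import math
import random
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def distance_to_opposite(img):
    """City-block distance from each pixel to the nearest pixel of the other color.

    Every entry is None if the image is a single uniform color.
    """
    h, w = len(img), len(img[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    # Pixels touching the other color are at distance 1 and seed the search.
    for y in range(h):
        for x in range(w):
            for dy, dx in NEIGHBORS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and img[ny][nx] != img[y][x]:
                    dist[y][x] = 1
                    queue.append((y, x))
                    break
    # Breadth-first search spreads the distance into the interior of each region.
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def degrade(img, alpha0=1.0, alpha=1.0, eta=0.0, seed=0):
    """Return a noisy copy of a binary image (hypothetical parameterization).

    Pixels near an ink/background edge flip most often; eta adds a
    distance-independent flip probability.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    dist = distance_to_opposite(img)
    out = []
    for y, row in enumerate(img):
        new_row = []
        for x, pixel in enumerate(row):
            d = dist[y][x]
            p = eta if d is None else alpha0 * math.exp(-alpha * d * d) + eta
            new_row.append(1 - pixel if rng.random() < p else pixel)
        out.append(new_row)
    return out
```

Calling `degrade(ideal_image, alpha0=0.5, alpha=1.0, eta=0.01, seed=42)` on the same noise-free image with different parameters or seeds yields differently corrupted copies that all share one groundtruth, which is the property that makes synthetic data convenient for evaluation.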

4 Groundtruth Generator

In our system, a groundtruth generator (GTG) is used to obtain the synthetic noise-free images and the symbolic groundtruth files. First, the structured documents, such as HTML or MS Word files, are input to the GTG system. Image files and metafiles are generated via a custom printer driver. By parsing the resulting metafiles, we obtain symbolic and layout information and generate groundtruth in various formats. The synthetic images and layout information are used to create overlaid images, in which the bounding boxes are displayed at the character, word, line, and zone levels.

Three kinds of groundtruth files are generated in GTG: Standard, Raw, and Structured.

Standard groundtruth contains basic information about the size of the reference pages, the fonts used in the document, the character set of the content, and zone, line, word, and character information where appropriate. For each zone, we identify the type of zone (Text, Image, or Graphic); for each word, we identify the font; and for each character, we identify the font glyph and Unicode text.

    CONTENT (pixels): (629,264,1863,2279)
    PAGE SIZE (mm): (0,0,213,273)
    RESOLUTION: 301 dpi
    Font 0: Times New Roman, ARABIC_CHARSET
    Font 1: Times New Roman, ANSI_CHARSET
    Font 2: Times New Roman, ANSI_CHARSET
    Font 3: Bold Times New Roman, ARABIC_CHARSET
    Font 4: Bold Times New Roman, ARABIC_CHARSET
    Font 5: Bold Courier New, ARABIC_CHARSET
    ZONE 0: (1248, 2237, 1267, 2279) T
    LINE 0: (1248, 2237, 1267, 2279)
    WORD 0: (1248, 2237, 1267, 2279) 2
    CHAR 0: (1248, 2237, 1267, 2279), 50, 00 32, 50
    ZONE 1: ( 629, 336, 1863, 1205) T
    LINE 0: (1446, 336, 1863, 406)
    WORD 0: (1446, 340, 1554, 406) 5
    CHAR 0: (1446, 340, 1482, 406), 1575, 06 27, 909
    CHAR 1: (1482, 340, 1518, 406), 65194,fe aa, 938
    CHAR 2: (1518, 340, 1554, 406), 65191,fe a7, 935
    CHAR 3: (1554, 340, 1589, 406), 32,00 20, 3
    WORD 1: (1589, 340, 1732, 406) 5

Raw groundtruth files are in Unicode format or in the original coding format. These files contain only the character content and can be used for evaluation. The encoding is identical to the original encoding used to generate the structured document.

Structured groundtruth files include HTML and files. The HTML files can be used to check whether the groundtruth matches the original document by simply viewing the results in a browser. The files are used for data exchange or storage.

Because the groundtruth files are parsed from metafiles, data from any character set can be created as long as TrueType Font (TTF) files are used. We have tested our system on dozens of languages, including Arabic, Chinese, Farsi, Japanese, Thai, Hindi, and Korean. From this point of view, our system provides a universal framework for generating groundtruth files for multilingual documents.
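To make the Standard groundtruth layout concrete, the following is a minimal Python sketch of a reader for the ZONE/LINE/WORD/CHAR records shown in the sample listing. It is not part of the GTG tools. The record grammar is inferred from that one sample, and the names `Node`, `parse_groundtruth`, and `word_text` are hypothetical. It also assumes that the first field after a CHAR bounding box is the decimal Unicode code point, which agrees with the hexadecimal bytes next to it in the sample (for example, 1575 and `06 27`). Header lines such as CONTENT and Font are skipped.

```python
import re
from dataclasses import dataclass, field

# Bounding box "(x1, y1, x2, y2)"; the sample pads some numbers with spaces.
BOX = r"\(\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\s*\)"
RECORD = re.compile(r"^(ZONE|LINE|WORD|CHAR)\s+(\d+):\s*" + BOX + r"\s*,?\s*(.*)$")
LEVELS = {"ZONE": 0, "LINE": 1, "WORD": 2, "CHAR": 3}


@dataclass
class Node:
    kind: str
    index: int
    box: tuple
    extra: str = ""  # whatever follows the box (zone type, char fields, ...)
    children: list = field(default_factory=list)


def parse_groundtruth(text):
    """Build a zone -> line -> word -> char tree from record lines."""
    zones = []
    stack = []  # currently open ancestors, outermost first
    for raw in text.splitlines():
        m = RECORD.match(raw.strip())
        if not m:
            continue  # header or unrecognized line: ignored in this sketch
        kind, index = m.group(1), int(m.group(2))
        box = tuple(int(v) for v in m.group(3, 4, 5, 6))
        node = Node(kind, index, box, m.group(7).strip())
        depth = LEVELS[kind]
        if depth > len(stack):
            raise ValueError(f"{kind} {index} has no enclosing record")
        del stack[depth:]  # close any deeper or sibling records
        (stack[-1].children if stack else zones).append(node)
        stack.append(node)
    return zones


def word_text(word):
    """Decode a WORD node, assuming each CHAR's first field is a code point."""
    return "".join(chr(int(c.extra.split(",")[0])) for c in word.children)


SAMPLE = """\
ZONE 1: ( 629, 336, 1863, 1205) T
LINE 0: (1446, 336, 1863, 406)
WORD 0: (1446, 340, 1554, 406) 5
CHAR 0: (1446, 340, 1482, 406), 1575, 06 27, 909
CHAR 1: (1482, 340, 1518, 406), 65194,fe aa, 938
CHAR 2: (1518, 340, 1554, 406), 65191,fe a7, 935
"""

if __name__ == "__main__":
    word = parse_groundtruth(SAMPLE)[0].children[0].children[0]
    print([hex(ord(ch)) for ch in word_text(word)])  # ['0x627', '0xfeaa', '0xfea7']
```

Keeping the trailing fields as an uninterpreted string avoids committing to meanings the listing does not spell out; only the code-point field is decoded, and that interpretation is itself an assumption.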

The synthetic images are noise-free and can be generated at different resolutions. They can be degraded at the pixel level with a parameterized model, or at the page level with noise templates.

Figure 2: Overlaid images at the character, line, and zone levels

For more information about the system, please contact the authors.

References

[1] D. Doermann and S. Yao. Generating Synthetic Data for Text Analysis Systems. SDAIR, pages ,
[2] T.K. Ho and H.S. Baird. Evaluation of Accuracy Using Synthetic Data. SDAIR95, 1995.
[3] T. Kanungo, R.M. Haralick, and I.T. Phillips. Nonlinear local and global document degradation models. IJIST, 5(4):220-30,
[4] T. Kanungo and R.M. Haralick. An automatic closed-loop methodology for generating character groundtruth for scanned documents. PAMI, 21(2): , February
[5] T. Kanungo, R.M. Haralick, H.S. Baird, W. Stuetzle, and D. Madigan. A statistical, nonparametric methodology for document degradation model validation. PAMI, 22(11): , November
[6] T. Kanungo and P. Resnik. The Bible, Truth, and Multilingual Evaluation. SPIE Conference on Document Recognition and Retrieval (VI), pages 86-96, January
[7] D.W. Kim and T. Kanungo. Attributed point matching for automatic groundtruth generation. IJDAR, 5(1):47-66,
[8] T. Kanungo et al. Document Degradation Models: Parameter Estimation and Model Validation. Proc. of IAPR Workshop on Machine Vision and Applications, Kawasaki, Japan, 1994, pp
[9] E. Ukkonen. Algorithm for Approximate String Matching. Information and Control, vol. 64, pp , 1985.
[10] S.V. Rice et al. The fifth annual test of accuracy. Tech. Rep. TR-96-01, Information Science Research Institute, University of Nevada, Las Vegas, NV, 1996.
