The Processing of Form Documents


David S. Doermann and Azriel Rosenfeld
Document Processing Group, Center for Automation Research
University of Maryland, College Park

Abstract

In this paper we present an overview of our approach to the generic modeling and processing of known forms. Our system provides a methodology by which models are generated from regions in the document based on their usage. We propose automatic extraction of an optimal set of features to be used for registration, and show how specialized detectors can be designed for each feature based on its position, orientation and width properties. Registration of the form with the model is accomplished using probing to establish correspondence. We detect and isolate form components which are corrupted by markings, interpret the intersections, and use the properties of the non-form markings to reconstruct the strokes through the intersections. The feasibility of these ideas is demonstrated through an implementation of key components of our system.

1 Introduction

The machine understanding of form documents is a problem which is essential to the advancement of office automation. Many systems have been developed, each with different goals and assumptions [2, 9, 6, 10]. Our goal is to provide a processing environment for known forms which allows greater flexibility in the types and numbers of forms which are processed. By separating the form from the filled-in information, the filled-in components can be processed separately, and the data can be stored in a compressed format. We assume that the operator has been given a copy of the pre-designed form, so that design is not an issue, and that a homogeneous batch of forms will be processed. The task is to extract the hand or machine printed markings from such a batch of completed instances of the form and pass the recovered information to a storage or OCR system.
We address the modeling, pre-processing and text extraction stages of the system.

2 Form Modeling

There are three levels of abstraction at which a form can be modeled, depending on the a priori knowledge about the domain. We may analyze a form document 1) given no specific information about the format or application of the form, but only general heuristic knowledge about the form domain, 2) given the class of forms to which the document belongs, but not the exact form layout, or 3) given a detailed model of the original form or a set of specific models for a small group of forms. These cases fall into the general categories of unknown forms (e.g. general document processing), known classes of forms (e.g. checks and invoices), and known forms (e.g. report forms, taxes and surveys). We have demonstrated our approach on the class of known forms.

Most form processing systems are designed or can be tuned for a specific known form or small set of forms. For applications which require the mass processing of these forms, we may assume that the format of the original form is known a priori. The fact that we know the form layout suggests that such systems will benefit from a top-down component, whereas the class of unknown forms typically requires a primarily bottom-up approach.

The processing requirements depend on the model knowledge and the functionality of the form. In general, the model of a form document may include 1) contextual knowledge about generic document domains, such as constraints on the size variations between text and graphic components, locations of key regions of interest, and relationships between known components; 2) abstract representations of lines, text, and regions, such as graphs; 3) bitmap or image representations of form components which are traditionally difficult to model, such as logos; and/or 4) generative models, such as form languages or form grammars. In addition, models for form images may include a characterization of noise in the imaging process as well as defects found in typical documents.

2.1 A Simplified Model

We begin by providing an approach which allows direct modeling of typical office documents. We define a set of basic constructs which allows us to model a large number of simple forms. The goal is to define a model for the original form in such a way as to be able to symbolically subtract the original form from a scanned copy of a completed document. The model contains three primary components: line segments, form regions, and landmark features.

Line segments are often used as guides for the form user and are defined in the model by two endpoints and a constant width. More complex graphic constructs such as boxes can be constructed from combinations of line segments.

Regions are used to define areas on the form which are to be considered as a single unit. A region is classified as: 1) modeled: filled-in information within the region is of interest, but the presence of the original form components requires us to have a model to separate form from non-form information; 2) non-modeled: the assumption can be made that any marking within the region is inconsequential, whether it contains filled-in information or not; or 3) data: the system recovers the interior of the region, and either ignores any markings which extend past its boundaries (limited data region) or analyzes the exterior if markings extend outside of the region (extended data region). The non-modeled region type also acts as a catch-all for form components which we currently do not model. Since the forms we are dealing with are known, it is not necessary to examine all form components in detail for interaction with filled-in information. Rather, we will use a focus of attention (Section 4) to concentrate on areas which are believed to contain non-form information. Our current model space limits regions to be upright rectangles.
Regions can therefore be defined by their lower-left and upper-right coordinates. In the case of modeled regions, an image or bitmap of the region is included.

Landmark features describe line segment endpoints and the intersections of line segments with themselves and with region components. Line segment landmarks occur in 13 basic configurations: endpoints, t-junctions, and corners, each in four orientations, and one crossing configuration. Similarly, a line segment may intersect a region from any of four directions. These relationships are invariant to translation, rotation and scale, and are useful for solving the alignment problem and for analysis of stroke/form interactions. Our current implementation relies primarily on the existence of line landmarks extracted from horizontal and vertical line segments to perform alignment, but the techniques described in the next section can be extended to use arbitrary regions for alignment at the expense of increased computational cost.

2.2 The Form Modeler

A prototype automated modeling system is being developed using the KBVision system [1] to aid the user in the modeling of form documents. The current system takes a scanned version of the document as input and uses classical image processing techniques to extract character regions, line segments, and landmark points and assign them appropriate classifications. An interactive modeler is then used to refine the derived approximation of the form model. The scanning and image processing stages are necessary for forms which are not designed on-line. Ideally, if the form were created on-line, a form model would be produced at the time the form is defined, or a model could be derived from a CAD-based description of the form. The form model is extracted and stored in a file with attributes defining the location and size of each component and landmark. The model components are later organized spatially into a quadtree data structure, as described briefly below.
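
The three model components of Section 2.1 can be illustrated with a small data-structure sketch. This is not the file format or representation used by the form modeler, which the paper does not specify; all class, field and function names below are hypothetical, and the bitmap is assumed to be a plain list of pixel rows. The sketch only encodes what the text states: segments have two endpoints and a constant width, regions are upright rectangles with one of the listed classifications, and line landmarks come in 13 configurations.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class RegionKind(Enum):
    MODELED = "modeled"
    NON_MODELED = "non-modeled"
    LIMITED_DATA = "limited data"
    EXTENDED_DATA = "extended data"


class LandmarkKind(Enum):
    ENDPOINT = "endpoint"
    T_JUNCTION = "t-junction"
    CORNER = "corner"
    CROSSING = "crossing"


@dataclass(frozen=True)
class LineSegment:
    start: Point
    end: Point
    width: float  # constant along the segment


@dataclass(frozen=True)
class Region:
    lower_left: Point
    upper_right: Point
    kind: RegionKind
    # Only modeled regions carry a bitmap (assumed here to be rows of pixels).
    bitmap: Optional[List[List[int]]] = None


@dataclass(frozen=True)
class Landmark:
    position: Point
    kind: LandmarkKind
    orientation: Optional[int] = None  # 0..3; None for the crossing


@dataclass
class FormModel:
    segments: List[LineSegment] = field(default_factory=list)
    regions: List[Region] = field(default_factory=list)
    landmarks: List[Landmark] = field(default_factory=list)


def line_landmark_configurations():
    """Enumerate the line landmark configurations described in the text."""
    oriented = (LandmarkKind.ENDPOINT, LandmarkKind.T_JUNCTION, LandmarkKind.CORNER)
    configs = [(kind, o) for kind in oriented for o in range(4)]
    configs.append((LandmarkKind.CROSSING, None))
    return configs


def box_segments(lower_left: Point, upper_right: Point, width: float):
    """Build an upright box from four line segments."""
    (x0, y0), (x1, y1) = lower_left, upper_right
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    return [LineSegment(corners[i], corners[(i + 1) % 4], width) for i in range(4)]
```

Three oriented landmark kinds in four orientations plus the single crossing give the 13 configurations, and `box_segments` shows how a box reduces to a combination of line segments.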

3 Alignment

An approach is proposed in which an optimal basis set of landmark features is extracted directly from the model, and the locations of these points are used to invert the transformation and perform a coarse alignment. The alignment is then verified and refined using higher order features. The advantage of automatic feature extraction is that it not only eliminates the human factor in modeling, but also quantifies the process of detecting which features are more or less likely to be confused with each other.

Our approach to skew detection and correction is based on geometric probing [7]. A convolution probe can be defined to be more robust to different types of distortions, such as rotation. By passing the probe over an area bounded by the limits on the translation and rotation, we can approximate the locations of candidate feature points. When we have obtained a set of possible feature points, we apply the constraints from the model to reduce the set of possible transformations. Other landmarks can then be verified and a fine scale alignment obtained.

Our current research is centered around developing algorithms for automatic selection of an optimal set of features from the model, automatic definition of detectors, extracting and constraining features for coarse alignment, and adjusting the fine-scale alignment.

4 Form Delineation

Simply subtracting the form model from the image is not sufficient because of possible interaction of the filled-in information with the form and the possibility of single- or sub-pixel shifts. To ensure that we capture all non-form information (marks) in the image, we perform a symbolic subtraction of the model on a component by component basis. Non-modeled regions are discarded immediately without consideration of their internal pixels. Line segments and modeled regions are interpreted using detailed analysis of the pixels around their boundaries. To isolate line segment areas or modeled regions for analysis, we project onto the image plane the rectangle which corresponds to the boundary of the region or, for a line segment, to the area defined by the segment endpoints and the width.

4.1 Detection of Anomalies

Intensity and gradient information is used to determine if a model component is isolated in the image or if it is interfered with by noise or markings on the form. By examining the image at the location of the boundary of each model component, we detect anomalies at the points where non-form features interfere with the form components. If boundary pixels are found to be corrupt, we hypothesize a corrupted feature and attempt to recover the conditions or markings in the document which gave rise to the corruption. If the boundary pixels are not corrupt, the entire model feature can be removed from the document image, and the remaining markings on the page are the filled-in data. There are several situations which give rise to anomalies.
Small dents or bumps may be detected with a local analysis of the pixels in the neighborhood of the anomaly. We set a lower bound on the expected size of valid page markings, and delay analysis of features which are smaller than this threshold. The existence of such features is preserved for possible analysis with higher level context. For example, a small bump in the boundary may be the result of a decimal point touching a form line.

Figure 1: A small region, anomalies and the computed MBR.

Larger, non-stroke-like features which were added to a form (e.g. stickers, coffee stains, scratched-out regions, etc.) also appear as anomalies. For features which are much larger than the width of the line segment, precise representation of the interference may not be necessary. We assume that the feature is continuous across the line segment; the line segment is removed and the region patched.

A final case is the intersection of a line segment with a stroke-like marking. Since stroke-like markings are presumably more common, and represent the data which we are trying to recover, we must take special care to reconstruct as accurately as possible the stroke which gave rise to the interference.

4.2 Focus of Attention

In order to evaluate the intersection of the model and the form information, the region which requires analysis must be defined. The initial region is defined in such a way that it is almost certainly large enough for properties of the offending strokes to be measured. This region is then modified to take into account the geometry of the form model and the locations of nearby detected anomalies. Based on the expected locations of markings on the completed form, we are able to define a reasonable bound on the size of the region required for analysis. If it is found that this region is too small, it can be extended. The region is on the order of 25 times the expected stroke width or model segment width (Figure 1a).
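
The boundary check of Section 4.1 and the initial analysis region of Section 4.2 can be sketched as follows. This is a simplified illustration, not the system's detector: it assumes a binary image (1 = ink) rather than the intensity and gradient information the paper uses, handles only a horizontal model line, and treats a run of ink on the rows immediately outside the line as an anomaly. The function names, the `min_size` threshold and the square region shape are assumptions; the factor of 25 follows the order of magnitude given in the text.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binary image: 1 = ink, 0 = background


def boundary_runs(image: Image, row: int, x0: int, x1: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) column runs of ink on one boundary row."""
    runs, start = [], None
    for x in range(x0, x1 + 1):
        ink = 0 <= row < len(image) and image[row][x] == 1
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, x1))
    return runs


def detect_anomalies(image: Image, x0: int, x1: int,
                     y_top: int, y_bottom: int, min_size: int):
    """Inspect the rows just outside a horizontal line spanning rows y_top..y_bottom.

    Returns (to_analyze, deferred): runs at least min_size wide are analyzed
    now; smaller ones are kept for later analysis with higher-level context.
    """
    to_analyze, deferred = [], []
    for side, row in (("above", y_top - 1), ("below", y_bottom + 1)):
        for start, end in boundary_runs(image, row, x0, x1):
            target = to_analyze if end - start + 1 >= min_size else deferred
            target.append((side, start, end))
    return to_analyze, deferred


def initial_region(cx: int, cy: int, width: int,
                   n_rows: int, n_cols: int, factor: int = 25):
    """Square region of side about factor * width around (cx, cy), clipped to the image."""
    half = (factor * width) // 2
    return (max(0, cx - half), max(0, cy - half),
            min(n_cols - 1, cx + half), min(n_rows - 1, cy + half))
```

A run that appears on both sides of the line at the same columns suggests a crossing stroke, while a run on one side only corresponds to the one-sided case in which the region of analysis can be limited to that side.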

To avoid unnecessary or repeated analysis of the same region, the region can be modified in two ways: it can be expanded to include additional detected anomalies, and it can be constrained by regions which lack anomalies (Figure 1b). If the region boundary is only corrupted on one side, the region of analysis is limited to that side. The computation of the Maximum Bounding Region (MBR) of an anomaly point is described more fully in [3]. The final result is a region (MBR) which surrounds a given corrupted point, but does not engulf additional uncorrupted form components. A quadtree data structure can be used to efficiently index into the space of line segment and region boundaries [8] and to implement the sweeping algorithms.

5 Stroke/Form Interaction

Our approach to the problem of interfering contours is based on the detection, analysis and detailed representation of the stroke-like and non-stroke-like regions in the document image [5]. The process involves two parts: an interpretation of the intersecting region and a reconstruction of the strokes or line segments which formed it. An interpretation of a region is derived from the local configuration of stroke segments and properties of the strokes themselves such as curvature, width and intensity. By treating the strokes as features and retaining a more complete representation of the document, we can use criteria and clues for interpretation which are not available with traditional approaches to document processing.

5.1 The Stroke Recovery Platform

The framework we use to address the interpretation and reconstruction problems is based on the concept of a stroke recovery platform described in [4, 3]. The platform provides a hierarchical representation of the stroke-like features in a document that extends from the pixel level up through an attributed stroke graph which represents the relationships between strokes and non-stroke features such as endpoints and intersections. The platform attempts to provide a complete representation which links higher level abstract representations with the pixels and other local features. Unfortunately, in many situations a complete interpretation based on low-level information either may not be possible or cannot be obtained with the desired confidence.
In such cases, feedback from higher-level application-dependent modules is necessary and the platform can be amended dynamically.

5.2 Interpretation and Reconstruction

Our first step is to construct a partial stroke platform for the region of the document image defined by the MBR in Section 4 and identify portions of the stroke graph which uniquely correspond to the given form component. The platform provides us with a representation which contains, most importantly, a set of cross-section groupings which exhibit stroke-like properties (hypothesized stroke segments), regions which are classified as possible junctions or endpoints, and the underlying contours or contour fragments of the stroke segments, junctions, endpoints and unclassified features in the image (Figure 2). The stroke graph supports top-down access to the pixels through the strokes, junctions, cross-sections and retinotopic information.

Figure 2: Several components of the platform which result from the processing of a small document region.

Figure 3: The candidate anchor points derived from the cross section endpoints.

We then identify those portions of the image which correspond to interacting features. Since the properties of the form components such as position, width and orientation are known a priori, the stroke graph can be examined and the features identified. In a more general application, we may identify line segments based on the regularity and size of the cross-sections comprising the hypothesized stroke segments. In either case, if the segment intersects another feature, we will observe a node in the stroke graph corresponding to the intersection. If the intersection occurs over an extended region, the affected portions of the stroke graph will have cross-section widths which are inconsistent with the rest of the form component.
In Figure 2a, for example, the top-center stroke segment is bounded by two apparent junctions and has cross-sections of significantly greater width than the corresponding left or right end segments. Since these changes contradict the consistency assumptions, such a situation will be examined for possible interpretation as resulting from an interaction of features.
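
The width-consistency test described above can be sketched in a few lines. The sketch assumes the cross-sections of a hypothesized line segment are already available as a sequence of widths ordered along the segment, and that a fixed tolerance around the known model width is an acceptable consistency criterion; the function name and the tolerance value are hypothetical, and the platform's actual criteria are not given in the paper.

```python
from typing import List, Sequence, Tuple


def inconsistent_runs(widths: Sequence[float], model_width: float,
                      tolerance: float = 1.0) -> List[Tuple[int, int]]:
    """Return inclusive index runs whose width differs from the model by more than tolerance."""
    runs, start = [], None
    for i, w in enumerate(widths):
        off_model = abs(w - model_width) > tolerance
        if off_model and start is None:
            start = i
        elif not off_model and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(widths) - 1))
    return runs
```

Each returned run marks a stretch of the form component that is a candidate for interpretation as a feature interaction; the cross-sections immediately before and after a run are the natural places to look for uncorrupted portions of the line.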


Once we have an approximate delineation of a form component, we begin the reconstruction. As stated earlier, this is based on properties of the portions of the document that do not involve feature interactions. We identify anchor points which are used to connect the reconstructed feature segments to known feature segments. We then identify a set of candidate anchor point pairs from the cross-sections at the ends of the affected segments (Figure 3). Since the form component is of known dimensions, we generate (or recall as part of our a priori knowledge) a cross-section representation of the model line segment and register it with the representation given by the platform. From this correspondence, we can easily identify the isolated line segment features which are uncorrupted and refine the registration if necessary. We then classify contours between the anchor points of the hypothesized stroke segments according to whether or not they fit the model.

Bounded portions of the contours can be described as follows. A visible contour is a boundary representation of a stroke or region that is derivable from areas of high gradient activity in the image. An occluded contour is a boundary of a stroke or line segment which is obscured or otherwise distorted by another stroke or line segment. A contour is said to be stable if it corresponds to an uncorrupted portion of the stroke and is itself free from distortion caused by noise in the intensity image. A contour is unstable if its location or orientation may be corrupted by neighboring strokes. The platform can thus be annotated to reflect line-segment, non-line-segment and possible-combination cross sections, contours and stroke graph components.

The occluded stroke segment contours are then reconstructed from the remaining visible contours and anchor points. For the intersection in this example, the visible stroke contour is connected to the remaining part of the stroke segment, so we can assume that it is part of the same stroke.
We use the properties of the unoccluded stroke segments to reconstruct the occluded contour and delineate the region which corresponds to the occluded stroke. Figure 4 shows the results of the cross section computation and reconstruction from the intersection with a known line. Once this analysis has been performed for each MBR in the form document, and the markings reconstructed, the remaining markings on the page are passed to an interpreter.

Figure 4: Reconstruction of an e touching a line.

References

[1] Amerinex Artificial Intelligence, Inc. KBVision system.
[2] R. Casey, D. Ferguson, K. Mohiuddin, and E. Walach. Intelligent forms processing system. Machine Vision and Applications, 5.
[3] D. S. Doermann. Document Image Understanding: Integrating Recovery and Interpretation. PhD thesis, University of Maryland, College Park.
[4] D. S. Doermann and A. Rosenfeld. Recovery of temporal information from static images of handwriting. Technical Report CAR-TR-595, Center for Automation Research, University of Maryland, College Park, Maryland. To appear in the International Journal of Computer Vision.
[5] D. S. Doermann and A. Rosenfeld. The interpretation and reconstruction of interfering strokes. In Proceedings of the International Workshop on Frontiers in Handwriting Recognition, pages 41-50.
[6] G. Maderlechner. Symbolic subtraction from fixed formatted graphics and text from filled in forms. In Machine Vision and Applications.
[7] K. Romanik. Approximate testing theory. Technical report, University of Maryland.
[8] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA.
[9] S. Liebowitz Taylor, R. Fritzson, and J. A. Pastor. Extraction of data from preprinted forms. Machine Vision and Applications, 5, 1992.
[10] D. Wang and S. N. Srihari. Analysis of form images. In Proceedings of the International Conference on Document Analysis and Recognition.
