
An Accurate and Efficient System for Segmenting Machine-Printed Text

Yi Lu, Beverly Haist, Laurel Harmon, John Trenkle and Robert Vogt
Environmental Research Institute of Michigan
P. O. Box 134001, Ann Arbor, MI 48113-4001

ABSTRACT

In this paper we present a system that segments machine-printed text of various fonts, styles and sizes for the Roman alphabet and numerals. The system consists of four major computational modules: multiline processing, single line processing, structural analysis, and merged-character splitting. The multiline process is designed to segment fixed-pitch text and is especially effective for broken and dot matrix characters. The process is based on the inherent property of fixed-pitch fonts that gaps must occur at fixed intervals in lines of text. The single line processing, structural analysis and splitting modules are designed for segmenting proportional-pitch text. The single line processing consists of procedures such as vertical-projection-based segmentation and the grouping of broken characters. The structural analysis is designed to segment kerned characters and punctuation marks, and also to group broken characters and characters with more than one component. The splitting module is designed to split touching and merged characters in proportional and/or serif fonts. The proposed segmentation system is evaluated using actual address block images from the U.S. mail stream.

1. Introduction

Character segmentation is a technique which partitions images of lines or words into individual characters. In most OCR systems, character recognition is performed on individual characters. Character segmentation is therefore fundamental to recognition approaches which rely on isolated characters. It is a critical step because incorrectly segmented characters are not likely to be correctly recognized. Character segmentation is all too often ignored in the research community, yet broken characters and touching characters are responsible for the majority of errors in automatic reading of both machine-printed and hand-printed text [2].

The complexity of character segmentation stems from the wide variety of fonts, rapidly expanding text styles, and address image characteristics such as poor-quality printing, including sparse dot matrix printers. It is desirable to develop segmentation techniques which are style-independent [1]. However, it is also reasonable to develop methods which are specialized for handling broad categories of character segmentation situations, such as those listed below, in order of increasing difficulty:

1. Uniformly spaced characters, i.e. fixed-pitch fonts.
2. Well-separated and unbroken characters, not uniformly spaced.
3. Broken characters.
4. Touching characters.
5. Broken and touching characters.
6. Broken and italic characters.
7. Touching and italic characters.
8. Script characters.

In this paper, we present a system that segments machine-printed text of various fonts. It is designed to segment text falling into the first five categories listed above. The segmentation system has the following distinct features:

Characterization. The segmentation system is designed to exploit any available information about the image, print, and font characteristics. Processes at earlier stages of the system are less computationally complex than those at the later stages. The early stages are designed to segment the bulk of address block images accurately and efficiently. Processes at later stages require more computational time and are designed to process only the more difficult cases.

Prediction. Font-dependent statistics are collected and estimated during the segmentation processes. They are used to predict the locations of break points between characters, and to group broken characters. Statistics are calculated throughout the segmentation process for the character intervals, widths, and gaps, as illustrated in Figure 1.

Validation. The control structure of the segmentation system is dictated by the success of the resulting segments.

The output segments generated by the various segmentation modules are further validated by decision criteria. If segments pass the decision criteria, they are sent to the recognition system; otherwise, the line image is sent to the other modules in the segmentation system.

Section 2 gives an overall view of the system and its individual process modules. The proposed segmentation system has been evaluated using actual address block images from the U.S. mail stream. Section 3 presents the test results of the segmentation system and an error analysis. The test results will show that the segmentation system is highly efficient, accurate, and robust.

2. ERIM Segmentation System

The segmentation system encompasses four major computational modules: multiline processing, single-line processing, structural analysis, and model-based splitting. The system flowchart is shown in Figure 2. The multiline process is designed to segment fixed-pitch text and is especially effective for broken characters and characters in dot matrix style. The process is based on the inherent property of fixed-pitch fonts that gaps must occur at fixed intervals in lines of text. It is highly efficient since it operates on groups of line images. The single line processing, structural analysis and splitting modules are designed for segmenting proportional-pitch text. Single-line processing segments characters based on single line vertical projections combined with statistical analysis. It has two merging processes: sequential merging, and grouping of broken characters based on estimated character widths. The structural analysis module is based on the analysis of neighboring components in a candidate region for segmentation. The analysis considers the occupancy ratio, the size of the components, their relative locations, the distance between neighboring components, and the locations of the components with respect to the baseline and the top of the corresponding line image in the text. The structural analysis is successful in segmenting kerned characters and punctuation, and also in grouping broken characters and characters with more than one component. The splitting module is designed to split touching and merged characters in proportional or serif fonts. It uses a peak-to-valley function to find possible splitting locations. A statistical analysis based on the character width, height, and ex-height of the line then follows to find the optimal splitting locations.
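To make the role of the Figure 1 measurements concrete, the sketch below shows one plausible way to extract candidate regions from a single-line vertical projection and to accumulate the interval, width, and gap statistics described above. It is a minimal illustration only; the array layout (a binary NumPy image with text pixels equal to 1) and all function names are our assumptions, not part of the published system.

```python
import numpy as np

def vertical_projection(line_img):
    """Column-wise count of text pixels for a binary line image (text = 1)."""
    return line_img.sum(axis=0)

def projection_regions(proj):
    """Return (start, end) column pairs for maximal runs of non-empty columns."""
    regions, start = [], None
    for x, v in enumerate(proj):
        if v > 0 and start is None:
            start = x
        elif v == 0 and start is not None:
            regions.append((start, x - 1))
            start = None
    if start is not None:
        regions.append((start, len(proj) - 1))
    return regions

def region_statistics(regions):
    """Mean/standard deviation of character widths, gaps, and intervals (Figure 1)."""
    widths = [e - s + 1 for s, e in regions]
    gaps = [regions[i + 1][0] - regions[i][1] - 1 for i in range(len(regions) - 1)]
    intervals = [regions[i + 1][0] - regions[i][0] for i in range(len(regions) - 1)]

    def stat(xs):
        return (float(np.mean(xs)), float(np.std(xs))) if xs else (0.0, 0.0)

    return {"width": stat(widths), "gap": stat(gaps), "interval": stat(intervals)}
```

A multiline projection in the spirit of Section 2.1 could then be approximated by summing the single-line projections of lines grouped by similar ex-height before calling projection_regions.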

2.1 Multiline Process

Two algorithms were developed at the multiline process level: multiline vertical projection and gap periodicity detection. We assume that the address block (AB) image has already been segmented into lines. Both modules are aimed at segmenting address images with uniform spacing. Our initial analysis of 642 address block images found that 67.5% have uniform spacing, and a further 2.3% have a logo or a different font, with the remaining lines possessing uniformly spaced characters (see Figure 3). The motivation of the multiline process is to use the cumulative vertical projection to overcome errors or uncertainties in the single line projection. (Figure 6, below, is a good example.) The single line projection shows gaps within characters, but the multiline projection shows only the gaps between characters.

2.1.1 Multiline vertical projection

In order to simultaneously segment lines of the same font, the multiline vertical projection algorithm begins by grouping lines with similar ex-heights (see Figure 4 for the definitions). For each group of lines, a multiline projection is formed from the single-line vertical projections. Then, statistics over the multiline projection are computed up to the end of the second longest line. This ensures that most of the regions are made up of more than one line, which helps to estimate the statistics more accurately. A decision procedure is invoked to determine whether the multiline group will be accepted or sent to further segmentation processes. The decision procedure checks the character width, gap, and interval statistics. If the statistics are within certain ranges, merging and splitting procedures are invoked to segment the last part of the longest line. The merging process uses the interval, width, and gap statistics to combine small neighboring regions. The splitting process splits any region that is wider than a maximum width, which is determined dynamically. If the variances are outside of the acceptable range but the group is still deemed uniform, a merging or splitting process based on the estimated sum of the character width and gap is applied to regions of the group vertical projection. If the statistics indicate the text is not uniformly spaced, a merging process based on the estimated character width and gap is applied to the entire multiline projection in an attempt to group broken characters. The statistics are recomputed after the merging and splitting processes.

The decision module is applied again. If the multiline group is accepted at this level, then the segmentation is applied to all of the lines in the group; otherwise, each line image in the group is sent to the single line process. Figure 3 shows an example of multiline groups in an address block image. The multiline process successfully segmented all groups. The multiline segmentation module is especially effective for broken characters and characters in dot matrix style. Figures 5 and 6 show that multiline segmentation can overcome difficulties encountered in a single line projection. The image in Figure 5 is in dot matrix style, and illustrates the difference between multiline and single line processing. Figure 6 shows an image containing broken characters, and its multiline projection. The multiline process successfully segmented the image based on the multiline projection.

2.1.2 GPD algorithm

The gap periodicity detection (GPD) algorithm exploits the fact that gaps must appear at periodic intervals in a fixed-pitch font, but not in a proportionally pitched font. First, the approach uses the vertical projection of the address block (or line of text) to determine the best single offset and pitch combination for the address block. The pitch estimate is computed from the mean gap length and the mean run length of the inverted vertical projection. If the variance in the estimates for these two values is above an empirically determined threshold, then the AB (or group) is passed on to other segmentation modules. If the estimated pitch is much greater than the mean line height, then the AB image is also passed on to the other modules. Based on the estimated pitch, the algorithm cycles through potential combinations of pitch and offset, which are evaluated by a measure of the goodness of the resulting cuts. The distance from the beginning of the inverted vertical projection to the first cut is the current offset, and the distance between successive cuts is the current pitch. Figure 7 illustrates a dot matrix image successfully segmented by the GPD algorithm. Figure 8 shows an image that contains proportionally pitched text. The GPD detects that the text is not fixed pitch and passes this image on to other methods for character segmentation.
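The following sketch illustrates the kind of pitch/offset search the GPD description implies: estimate a pitch from the gap and run statistics of the inverted vertical projection, then score candidate (offset, pitch) pairs by how little ink the resulting cut lines cross. The scoring function, jitter range, and all names are our own simplifications, not the published algorithm.

```python
import numpy as np

def run_lengths(mask):
    """Lengths of maximal runs of True values in a 1-D boolean array."""
    runs, count = [], 0
    for v in mask:
        if v:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs

def gpd_segment(proj, max_pitch_jitter=2):
    """Score (offset, pitch) candidates for a fixed-pitch line; lower score is better."""
    proj = np.asarray(proj, dtype=float)
    gaps = run_lengths(proj == 0)            # inverted projection: gaps between characters
    runs = run_lengths(proj > 0)             # ink runs (character bodies)
    if not gaps or not runs:
        return None                          # nothing to estimate; defer to other modules
    pitch0 = int(round(np.mean(gaps) + np.mean(runs)))
    best = None
    for pitch in range(max(2, pitch0 - max_pitch_jitter), pitch0 + max_pitch_jitter + 1):
        for offset in range(pitch):
            cuts = np.arange(offset, len(proj), pitch)
            score = proj[cuts].sum()         # total ink crossed by the cut lines
            if best is None or score < best[0]:
                best = (score, offset, pitch)
    return best                              # (score, offset, pitch) of the best candidate
```

A real implementation would also apply the rejection tests described above (high variance in the gap and run estimates, or an estimated pitch much larger than the line height) before accepting the best candidate.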

2.2 Single-Line Process

This module is designed to segment characters that are well-separated but which cannot be grouped with other lines on the basis of ex-height. It consists of an estimation procedure, a sequential merging procedure, a grouping procedure based on the estimated character width, and a decision procedure. The Single Line process uses the vertical projection of a single line image. The first step is to segment the line into regions based on the vertical projection, as illustrated in Figure 9. Statistics are then calculated for the character intervals, widths, and gaps over these regions.

Although the majority of dot matrix images are processed at the multiline level, a few of them can be passed to the single line processing level due to noise and/or poor printing quality. An estimation procedure is designed to estimate whether a line image is in dot matrix style. The decision is based on the number of region widths and gap widths that are smaller than a threshold. If a line image is estimated to be in dot matrix style, then a sequential merging procedure is invoked. The sequential merging begins by combining the neighboring regions with the smallest gap. Statistics are then recomputed and the next smallest gap is used to merge neighboring regions. This process continues until the statistics are within an acceptable range.

If the line image was estimated to be in proportionally-spaced style and the statistics are not within an acceptable range, then a merging procedure is invoked to merge small neighboring regions with small gaps. It uses the estimated character width and gap to merge the neighboring regions. If the neighboring regions are small and their total width after being combined is less than a maximum threshold, the regions are candidates to be merged. The maximum threshold is computed dynamically. Figure 10 shows an example generated by this merging process. The characters O, L, u, D and v are composed of merged regions. As part of the merging process, a character analysis routine was incorporated to avoid merging pairs of narrow characters such as ll, ti, ri, il and ff. The routine analyzes neighboring regions to determine whether they represent two separate characters. The analysis is based on the relative heights of the regions and the occupancy ratio. If the conditions for separate characters are met, the two regions are not joined.
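One possible reading of the sequential merging step described above is sketched below: repeatedly merge the pair of neighboring regions separated by the smallest gap, rechecking the statistics after each merge, until the region widths look acceptably uniform. The acceptance test used here is deliberately crude and purely illustrative; the thresholds and names are assumptions, not the system's actual criteria.

```python
def widths_and_gaps(regions):
    """Character widths and inter-region gaps for a list of (start, end) regions."""
    widths = [e - s + 1 for s, e in regions]
    gaps = [regions[i + 1][0] - regions[i][1] - 1 for i in range(len(regions) - 1)]
    return widths, gaps

def statistics_acceptable(regions, max_rel_spread=0.7):
    """Crude acceptance test: region widths are reasonably uniform."""
    widths, _ = widths_and_gaps(regions)
    mean = sum(widths) / len(widths)
    spread = max(widths) - min(widths)
    return mean > 0 and spread / mean <= max_rel_spread

def sequential_merge(regions):
    """Merge the neighbors separated by the smallest gap until widths look uniform."""
    regions = list(regions)
    while len(regions) > 1 and not statistics_acceptable(regions):
        # find the pair of neighboring regions with the smallest gap and fuse them
        i = min(range(len(regions) - 1),
                key=lambda k: regions[k + 1][0] - regions[k][1])
        regions[i:i + 2] = [(regions[i][0], regions[i + 1][1])]
    return regions
```

The width-based merging procedure described above additionally bounds merged regions by a dynamically computed maximum width; that check is omitted in this sketch.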

The statistics are recomputed after the merging process and a decision module is invoked. The decision module checks the character width, gap, and interval statistics to determine whether they are within an acceptable range. If the segmentation result is accepted, the output is the segmented characters. If it is rejected, the regions are sent to the Structural Analysis module.

2.3 Structural Analysis

The regions that are rejected by the Single Line Process may contain more than one character, either overlapping or touching. Structural Analysis is designed to segment overlapping (kerned) characters (VA), characters in italic fonts (IN), and punctuation marks (P.). However, it must also avoid separating the multiple components of a single character, as found in two-stroke characters (i, j) and in broken characters. This module is based on the analysis of neighboring components. Two computational steps are needed: generalized connected components, and structural analysis of neighboring components.

The generalized connected components process computes the different components within a region. The process is based on a classic connected components method with certain modifications: it optimizes the search by using runs, and it allows for a 2-pixel gap in both the x and y directions. Structural analysis of neighboring components consists of two major procedures: kerned character analysis and punctuation detection. The combination of these two procedures groups broken characters and two-stroke characters, and separates kerned characters and punctuation marks from their neighboring characters.

Kerned character analysis is designed to group broken or two-stroke characters and to separate kerned characters. The font illustrated in Figure 11 is kerned; the characters TA and PA are overlapping. Kerned character analysis can easily separate these regions. To accomplish this, every pair of neighboring components is examined for vertical overlap and possible grouping. If the degree of overlap exceeds a threshold, the components are candidates to be joined. Because punctuation marks are frequently overlapped by the preceding character, a punctuation detection procedure was developed.
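As an illustration of the neighboring-component analysis, the sketch below tests a pair of component bounding boxes for overlap and uses that to decide whether the pair should be grouped (a broken or two-stroke character) or kept separate (a kerned pair). We interpret the overlap test as the overlap of the components' column extents; that interpretation, the 0.5 threshold, and the helper names are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Bounding box of a connected component: columns [x0, x1], rows [y0, y1]."""
    x0: int
    x1: int
    y0: int
    y1: int

def column_overlap_ratio(a, b):
    """Shared column extent of two boxes, relative to the narrower box."""
    shared = min(a.x1, b.x1) - max(a.x0, b.x0) + 1
    narrower = min(a.x1 - a.x0, b.x1 - b.x0) + 1
    return max(0, shared) / narrower

def should_group(a, b, overlap_threshold=0.5):
    """Group components whose column extents overlap heavily, e.g. the dot and stem of an i."""
    return column_overlap_ratio(a, b) >= overlap_threshold
```

Under this reading, a kerned pair such as TA shares relatively few columns and stays split, while the two strokes of an i, or the fragments of a broken character, overlap almost completely and are grouped.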

The punctuation detection procedure determines whether the second of two neighboring components is a punctuation mark. The analysis is based on the occupancy ratio, the size of the components, their relative locations, the distance between the neighboring components, and the locations of the components with respect to the baseline and the top of the line. If the second component is determined to be a punctuation mark, then it is not joined, as depicted in Figure 12 (see W. and P.). The punctuation marks detected by this procedure are the period, comma, colon and semicolon.

2.4 Splitting Merged Characters

All of the previously described modules at the single line level were designed to segment characters that are not touching. The final module addresses the last problem of character segmentation: splitting touching or merged characters. The Splitting module is aimed at splitting touching/merged characters in uniform, proportional and/or serif fonts. The module is based on the analysis of a region's vertical projection, its peak-to-valley function, and the estimated character width.

The key issue in the splitting module is determining where to split the large regions. The decision is based on the estimated character width and a generalized differential technique. The generalized differential technique is a modified version of the second derivative method proposed by Kahan and Pavlidis [3]. The vertical projection of the region is mapped to a second projection, called the peak-to-valley ratio, by applying the generalized differential technique. The peak-to-valley function is illustrated in Figure 13. Sharp minima in the vertical projection are represented as maxima in the peak-to-valley ratio (shown in Figure 14). These maxima are then considered as potential break points between the merged characters.

Splitting locations are determined from both the maxima in the peak-to-valley function and the estimated character width. The process first determines an allowable splitting region based on the estimated character width. The position of the maximum within the splitting region is the splitting location. Then the widths of the two subregions separated by the splitting location are examined. If a subregion has an acceptable size, it is considered to be a character and accepted. If the width of a subregion is large enough to contain more than one character, the search for a splitting location in the subregion is performed again. If no maxima exist within the splitting regions, the process rejects the region. Figure 15 illustrates splitting locations found within splitting regions. Two splitting regions (SR) are located in the peak-to-valley function. S1 and S2 are the first and second splitting locations found within the two splitting regions.
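The sketch below gives one plausible form of this splitting step: derive a peak-to-valley style ratio from the vertical projection, restrict attention to a splitting window around the estimated character width, and take the strongest maximum in that window as the cut. The exact ratio used by the system (the modified second-derivative technique after Kahan and Pavlidis [3]) is not reproduced here; the formula, window size, and names below are our assumptions.

```python
import numpy as np

def peak_to_valley(proj, radius=3):
    """Ratio of surrounding peaks to the local value; sharp valleys become maxima."""
    proj = np.asarray(proj, dtype=float)
    ratio = np.zeros_like(proj)
    for x in range(radius, len(proj) - radius):
        left_peak = proj[x - radius:x].max()
        right_peak = proj[x + 1:x + radius + 1].max()
        ratio[x] = (left_peak + right_peak) / (2.0 * max(proj[x], 1.0))
    return ratio

def split_region(proj, est_char_width, window=0.25):
    """Return cut columns for a region suspected to hold several merged characters."""
    proj = np.asarray(proj, dtype=float)
    ratio = peak_to_valley(proj)
    cuts, start = [], 0
    while len(proj) - start > 1.5 * est_char_width:        # room for more than one character
        lo = start + int((1.0 - window) * est_char_width)   # allowable splitting region
        hi = min(start + int((1.0 + window) * est_char_width), len(proj) - 1)
        if lo >= hi:
            break
        cut = lo + int(np.argmax(ratio[lo:hi]))              # strongest valley in the window
        if ratio[cut] <= 1.0:                                # no usable maximum: give up
            break
        cuts.append(cut)
        start = cut
    return cuts
```

After each cut, the widths of the resulting subregions would be validated against the character-width statistics as described above; oversized subregions are searched again and regions with no usable maximum are rejected.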

3. System Performance

This section describes an evaluation of the character segmentation system presented above. The system was trained on a data set of 1000 gray scale address block images, and tested on a data set of 492 gray scale address block images. Input to the segmentation system consisted of line and field (word) boundaries and the binarized images. The output of the character segmentation system is a state-labeled image in which pixels corresponding to individual segments in the same field are given unique labels. These state-labeled images were then used in scoring and error analysis. For the handful of images in which the line or field boundaries were found to be in serious error, the segmentation algorithms were rerun after corrections. As part of the error analysis, all errors in the test set were reviewed and categorized.

We used an interactive software tool to score the segmentation results. This software used an X-Window graphical user interface to present the scorer with the state-labeled image produced by the character segmentation system. Any field with a disparity between the number of segments and the number of characters in the corresponding word in the truth file is automatically flagged on the screen. Other types of error are detected by the scorer. The scorer reviews each segmentation error and records the number of occurrences of each type of error using buttons in the interface designed for this purpose. Once the review is complete, the results are saved to a file. The scoring criteria are based on the following definitions:

Correct: a segment contains a complete character without any extraneous components.
Oversegment: a segment contains a partial character.
Undersegment: a segment contains more than one character.
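To show how per-field tallies of these outcomes could be turned into the percentages reported in Table 1, the fragment below flags fields where the segment count disagrees with the truth file, as the scoring interface does automatically, and converts character-level tallies into percentages. Because the three percentage rows in Table 1 sum to roughly 100% in each column, the sketch assumes each character is tallied into exactly one category; that assumption, and the data layout, are ours.

```python
def flag_disparities(fields):
    """fields: iterable of (field_id, n_segments, n_truth_chars); return ids needing review."""
    return [fid for fid, n_seg, n_truth in fields if n_seg != n_truth]

def score_summary(counts):
    """counts: per-character tallies keyed by 'correct', 'overseg', 'underseg'."""
    total = sum(counts.values())
    return {label: 100.0 * value / total for label, value in counts.items()}
```

Under the assumption above, score_summary produces the percentage breakdown in the style of Table 1 for each field type.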

Figure 16 illustrates the definitions of oversegment and undersegment: (a) is undersegmented and is scored as 2 errors in the undersegment count; (b) is oversegmented and is scored as 1 error in the oversegment count; and (c) is a combination of under- and oversegmentation, scored as 1 undersegment error and 1 oversegment error.

Table 1 presents the segmentation results on both the training and test sets. In the training set, there were 88443 characters. Of these, 54303 were from Alpha fields, 18323 from mixed fields, 9530 from ZIP fields, and 6287 were non-ZIP numerals. The test set contained 27142 characters: 16993 from Alpha fields, 5119 from mixed fields, 3060 from ZIP Code fields, and 1970 non-ZIP numerals. The results from the two sets are comparable: 97.53% correct segments on the training set and 97.44% correct segments on the test set. This indicates that the segmentation system is quite robust. A further test, in which characters with crossing lines and underlines were eliminated from the test set, achieved 98.6% correct segmentation. The segmentation results on the test set at the address-block level are plotted in Figure 17. The bottom curve shows the percentage of address blocks correctly segmented, and the top curve is the cumulative function.

The results of the segmentation system are summarized as follows:

Correctly segmented characters: 97.53% (training), 97.44% (test)
After eliminating characters with crossing lines: 98.6% (test)
Correct segmentation per address block: 81.22% of ABs segmented 100% correctly; 93.26% of ABs segmented over 90% correctly

Address block images segmented less than 70% correctly were all caused by crossing lines. All images with errors in the test set were examined and categorized. Two images were segmented under 70% correctly: one had merged italic characters and one had touching script characters. Three images were segmented between 70% and 80% correctly, one with touching scripts and two with merged characters. Images segmented between 80% and 99% correctly contained errors mostly due to poor image quality, combinations of broken and touching characters, and characters in proportional and serif fonts.

4. Conclusion

We have described a character segmentation system developed for machine-printed addresses. It has multiple levels of processing. The processes at the front of the system are efficient and accurate, aimed at segmenting relatively simple cases. The later processes are more complicated, aimed at segmenting proportional fonts, broken characters and merged characters. The system uses statistical and structural analysis combined with multiline and single line vertical projections, peak-to-valley functions, the estimated character pitch, the line image height, and the ex-height. We have also presented the results of a system test and error analysis. Most of the errors are caused by underlines and crossing lines, and by merged characters in italic style, which were not in the domain of the system. The test results show the segmentation system is efficient, effective and robust at segmenting machine-printed line images into characters.

5. Acknowledgments

This work was sponsored by the Office of Advanced Technology (OAT), United States Postal Service, under Task Order 104230-86-H-0042. All imagery used in the work was provided by the United States Postal Service.

6. REFERENCES

[1] D. G. Elliman and I. T. Lancaster, "A Review of Segmentation and Contextual Analysis Techniques for Text Recognition," Pattern Recognition, Vol. 23, pp. 337-346, 1990.
[2] R. L. Hoffman and J. W. McCullough, "Segmentation Methods for Recognition of Machine-Printed Characters," IBM Journal of Research and Development, pp. 153-165, March 1971.
[3] S. Kahan and T. Pavlidis, "On the Recognition of Printed Characters of Any Font and Size," IEEE Trans. Patt. Anal. Machine Intell., Vol. PAMI-9, pp. 274-287, March 1987.
[4] Y. Tsuji and K. Asai, "Adaptive Character Segmentation Method Based on Minimum Variance Criterion," Systems and Computers in Japan, Vol. 17, No. 7, 1986.

Figure 1. Measurements for which statistics are accumulated: character interval, width, and gap.

Figure 2. Segmentation system flowchart.

Figure 3. The last two lines possess uniformly spaced characters.

Figure 4. Ex-height, baseline, depth and rise are estimated from a full line image.

Figure 5. Multiline segmentation of an image in dot matrix style: (a) multiline processing successfully segments broken text; (b) the multiline projection shows gaps only between characters; (c) the vertical projections of individual lines have gaps within characters.

Figure 6. Multiline segmentation of broken characters.

Figure 7. GPD applied to a fixed-pitch dot matrix AB.

Figure 8. GPD applied to a proportional-pitch AB.

Figure 9. Broken characters may be oversegmented using a single line vertical projection.

Figure 10. Merging of regions based on the single line projection recovers broken characters (compare with Figure 9).

Figure 11. Structural analysis module used to segment TA and PA.

Figure 12. Detection of punctuation results in the separation of P. and W.

Figure 13. The peak-to-valley function emphasizes local minima in the vertical projection.

Figure 14. Potential breakpoints derived from the peak-to-valley function.

Figure 15. Splitting regions (SR) and splitting locations.

Figure 16. Examples of segmentation scoring: (a) undersegmentation; (b) oversegmentation; (c) a combination of under- and oversegmentation.

Table 1. Segmentation Results on the Phase I Test.

Training Set            Alpha     Mixed     Zip       Non-Zip   Combined
N                       54303     18323     9530      6287      88443
ERIM Measures
% Correct               97.32     97.2      98.39     99.01     97.53
% OverSeg               0.61      0.63      0.25      0.17      0.54
% UnderSeg              2.07      2.17      1.35      0.81      1.92

Test Set                Alpha     Mixed     Zip       Non-Zip   Combined
N                       16993     5119      3060      1970      27142
ERIM Measures
% Correct               97.13     97.44     98.24     98.88     97.44
% OverSeg               0.68      0.51      0.16      0.05      0.54
% UnderSeg              2.19      2.05      1.6       1.07      2.02

Figure 17. Segmentation results on a per-address-block basis (cumulative percentage of address blocks versus percentage of the AB correctly segmented).