Representations and Metrics for Off-Line Handwriting Segmentation


Thomas M. Breuel
PARC, Palo Alto, CA, USA
tbreuel@parc.com

Abstract

Segmentation is a key step in many off-line handwriting recognition systems but, to date, there are almost no ground truth segmentation databases and no widely accepted and formally defined metrics for segmentation performance. This paper proposes a representation of segmentations and presegmentations in terms of color images. Such representations allow convenient interchange of ground truth and hypothesized segmentations in the form of standard image formats. The paper formally defines the notions of oversegmentation and undersegmentation in terms of the maximal bipartite match between corresponding pixels. It also defines a number of metrics that quantify the frequency and extent of events in handwriting like kerning, splitting, and merging of characters. It is hoped that these metrics and representations will find wider use in the community and serve as a basis for creating standard training and test databases of segmentation data.

1 Introduction

In many approaches to off-line, connected handwriting recognition, a distinct segmentation step plays a crucial role (for reviews, see [5, 6, 7]). Generally, such systems apply a segmentation algorithm to a cleaned-up image of handwritten text and obtain a number of character hypotheses. The character hypotheses are then individually classified, and the classification results are integrated into an overall interpretation of the input.

As also observed in [1], there are no widely used metrics to compare and evaluate segmentation methods, and segmentation methods are usually discussed in terms of their overall effect on system performance. Unfortunately, such an approach makes comparisons among different segmentation methods implemented by different authors difficult. Also absent are databases of ground truth for off-line handwriting segmentation.
Such ground truth is useful both in the evaluation of segmentation algorithms and in the training of adaptive segmentation algorithms [8, 1]. This paper first describes how segmentation ground truth and presegmentations can be represented using pixel-based representations. Such representations are convenient both because they admit easy interchange using standard image formats, and because they allow us to give precise definitions to the notions of oversegmentation and undersegmentation (missed segmentations), as well as the geometric accuracy of approximately correct segmentations.

2 Segmentations as Images

2.1 Ground Truth Data

A correct segmentation of the image of some handwritten string partitions the image into disjoint subsets S_i of foreground pixels.(1) Each of these disjoint subsets represents the pixels belonging to one character.

Definition. A pixel-based representation of a segmentation {S_1, ..., S_n} is an image in which each pixel is assigned as its value the index of the subset S_i that the pixel is a member of.

In practice, it is convenient to implement pixel-based representations of segmentations as 24-bit RGB color images. This gives us 2^24 potential labels, enough to represent any segmentation one is likely to encounter in practice. Furthermore, we can save or exchange the segmentation information using any lossless color image format, like PNG (Portable Network Graphics) or PPM (Portable PixMap). To make this practical, there are a few additional conventions we need. If we consider the (r, g, b) triples of a 24-bit color image as hexadecimal values, the current software makes the following assignments:

0x000000 This pixel value represents the page background.

(1) A small number of OCR systems, such as the DID system [4], also partition the background pixels as part of a segmentation. The techniques described in this paper carry over to that case.
However, because of space limitations and for concreteness, we will carry out most of the discussion in terms of segmentation algorithms that segment foreground pixels only.

Figure 1. Ground truth segmentation of a handwritten string, represented using a color image.

Figure 2. (a) Cuts (dashed) that generate the base segmentation, (b) the segmentation hypothesis graph, representing all hypothesized segmentations of the input string, (c) the color image representation of the presegmentation.

0x000001-0x00ffff Foreground pixels carrying segmentation information.

0xffffff This pixel value represents a pixel that cannot be assigned unambiguously to a single segment or belongs to non-text page components. We will refer to this value symbolically as AMB.

0x800000-0x80ffff Pixel values to represent a segmentation of the page background (future use).

all other values Reserved for future use.

There is no requirement that pixel values for foreground or background pixels are allocated sequentially.

2.2 Segmentation Hypotheses

A complete representation of all the hypothesized segmentations of the image of a handwritten input string usually takes the form of a hypothesis graph (for recent uses and reviews, see [2, 5, 6, 7]): a directed, acyclic graph whose nodes are character hypotheses and whose edges are adjacency relationships. Each path through the hypothesis graph represents one possible segmentation of the input, and it partitions the foreground (ink) pixels of the handwritten input into disjoint subsets. However, different paths through the hypothesis graph represent alternative segmentations that usually do not permit a single, consistent assignment of colors to character segments. That is, there exists no natural way of representing an arbitrary hypothesis graph as a coloring of the foreground pixels of the image being segmented. Fortunately, most segmentation methods for segmenting images of handwritten text use similar techniques for constructing the hypothesis graph.
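To make the path/segmentation correspondence concrete, here is a minimal sketch (ours, not the paper's implementation) of a hypothesis graph as an adjacency dictionary: nodes are character hypotheses, edges are adjacency relations, and every source-to-sink path is one candidate segmentation. The node names and the example graph are hypothetical.

```python
# Sketch of a segmentation hypothesis graph: each source-to-sink path
# enumerates one candidate segmentation of the input string.
def paths(graph, node, sinks):
    """Enumerate all paths from `node` to any node in `sinks`."""
    if node in sinks:
        return [[node]]
    return [[node] + rest
            for succ in graph.get(node, [])
            for rest in paths(graph, succ, sinks)]

# Hypothetical graph: the ink can be read as "a"+"b"+"c" or "ab"+"c".
graph = {"start": ["a", "ab"], "a": ["b"], "b": ["c"], "ab": ["c"]}
segmentations = [p[1:] for p in paths(graph, "start", {"c"})]
# each resulting list of hypotheses partitions the foreground pixels
```

Each path corresponds to one way of grouping the presegmentation pieces into characters, which is why no single coloring of the image can represent the whole graph.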
In a first step, a number of cut points or cut paths (collectively referred to as cuts) through the input image are determined, using either hardcoded rules or adaptively trained methods. These cuts partition the input image into a larger number of disjoint subsets of foreground pixels. We will refer to this as the presegmentation (e.g., [9]). In a second step, adjacent (or, at least, nearby) subsets of foreground pixels are grouped together into character hypotheses, and the character hypotheses are arranged in a hypothesis graph using the constraint that the foreground pixels of different character hypotheses must be non-overlapping or mostly non-overlapping.

Unlike the hypothesis graph, the presegmentation does (under simple assumptions) have an equivalent representation as an image, analogous to that of the ground truth representation. We can therefore use such a representation in the evaluation of the quality of the cuts determined by a segmentation algorithm. In most real off-line handwriting recognition systems, identifying cuts reliably appears to be the major problem; once good cuts are identified, the construction of a hypothesis graph is usually simple and depends on only a few parameters [2, 3].

2.3 Generation of Ground Truth Data

In the previous sections, we have seen how both ground truth segmentations and presegmentations can be represented as images. This leaves the question of how we can generate such data. The simplest way of generating segmentation ground truth data is with a standard painting program, like The Gimp, Corel Paint, or Adobe Photoshop. The ability to generate ground truth segmentation data easily using widely available tools is one advantage of using color images to represent segmentation information. The procedure is as follows. First, the binarized image of the handwritten input is loaded and converted to RGB color. Then, the background (paper) is masked using the intelligent mask tool.
Now, the user can use a paint tool, pick different colors

from a palette, and conveniently paint the individual characters with broad strokes. Finally, the resulting ground truth is saved using a lossless color image format.

Figure 3. Characters aligned and segmented automatically using a handwriting recognition system. The clean separation of individual characters shows that automatic segmentations can form a reasonable basis for the creation of ground truth.

This is useful for generating small amounts of ground truth data for quick verification or analysis. Using simple scripting tools available in these programs, or by embedding these programs as components in a dedicated user interface, it is also possible to automate the process significantly and obtain a tool that is nearly as good as a dedicated tool for creating ground truth segmentation data.

To create larger amounts of segmentation data, an automatic or semi-automatic process is desirable. Fortunately, several handwriting recognition systems already perform fairly reliable segmentation as part of their recognition process. We can use these systems to generate candidate segmentations and quickly verify by inspection, for each field, whether the segmentation is acceptable. If it is, no further intervention is required. If it is not, the field represents a difficult case, and the segmentation can be touched up or recreated manually from scratch.

2.4 Automatic Generation of Multi-Character Fields

A third means for creating segmentation ground truth is the construction of multi-character images from isolated characters. The idea is as follows. Assume that we are given a collection of character images. We also assume that all images are the same height (possibly padded with background pixels at the top and bottom) and that the baseline of each character image is at a constant offset from the bottom of each image. Consider now the first two of these images. We can shift these images closer together horizontally so that they overlap.
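This shifting step can be sketched for a pair of same-height binary character masks. The function and its `shift` parameter are illustrative, not the paper's implementation; overlapping foreground pixels are marked with a stand-in for the AMB label from Section 2.1.

```python
import numpy as np

AMB = -1  # stand-in for the ambiguous-pixel label

def composite(a, b, shift):
    """Place mask `b` after mask `a`; negative `shift` kerns them together.
    Assumes moderate kerning, i.e. shift > -a.shape[1]."""
    h, wa = a.shape
    wb = b.shape[1]
    x0 = wa + shift                     # left edge of the second character
    field = np.zeros((h, max(wa, x0 + wb)), dtype=np.int32)
    field[:, :wa][a] = 1                # pixels from the first character
    region = field[:, x0:x0 + wb]
    region[b & (region != 0)] = AMB     # overlapping foreground -> ambiguous
    region[b & (region == 0)] = 2       # pixels from the second character
    return field
```

Repeating this for each consecutive pair of character images, with `shift` drawn at random, yields fields with kerned, touching, or overlapping characters and the corresponding labeled ground truth.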
At some point, the foreground pixels from the character in one image will touch the foreground pixels from the character in the other image. If we continue the horizontal motion beyond that point, foreground pixels from the two characters will overlap. Prior to that, for some character pairs, there may be a range of horizontal displacements where the characters are kerned but not yet touching.

Figure 4. Automatic generation of text fields with touching characters. The images in (a) show different parameter settings for maxkern and overlap. The images in (b) show text fields generated using default parameter settings, based on the NIST-3 database of digits.

We can repeat this process for all consecutive pairs of images and thereby arrive at a single text field composed of handwritten characters in which characters are kerned, touching, or overlapping in known ways. Furthermore, we can keep track of the sources of these pixels and label them using different colors, giving us ground truth in the format described above (overlapping foreground pixels are labeled AMB). For additional variability, we can introduce random displacements of the baseline, as well as variable amounts of overlap.

This process is particularly useful for generating hard test cases for digit and touching hand-printed character recognition. Cursive handwriting, of course, requires a different generation process. Depending on the parameter values chosen for this process (e.g., the default values mentioned above) and the database of isolated characters used, the resulting images of handwritten text can look fairly natural. Allowing large amounts of kerning and y-jitter results in images that are very challenging to segment, although they are often still recognizable and plausible.

3 Characterization of Ground Truth Data

From experience with recognition algorithms, there are several categories of problems that commonly occur.
First, characters that are not touching are generally easy to segment based on connected component analysis. However, if the input data contains a large number of broken characters (characters represented by multiple connected components), segmentation becomes more difficult again, because the segmentation algorithm is forced to consider character hypotheses that group together separate connected components. Kerning, where the vertical (or diagonal, if the text is slanted) projections of two characters overlap, causes problems for segmentation methods that attempt to separate

characters using straight lines. Kerning becomes an even harder problem if the kerned characters touch, making separation using connected component analysis impossible. The larger the amount of kerning, i.e., the overlap in the projection profiles, the harder the segmentation problems generally become.

Figure 5. Common difficulties encountered when trying to segment images of handwritten text: (a) simple case, (b) touching connected components, (c) kerning, (d) kerning and touching, (e) broken-up characters. The occurrence of these events is quantified using the metrics described in the paper.

Figure 6. User interface for the hand segmentation of input fields.

Definition. We define the following parameters for the characterization of images of off-line handwriting, given ground truth:

touching fraction The average number of characters corresponding to each connected component.

split fraction The average number of connected components corresponding to each character.

#kerned, non-touching The number of pairs of characters whose projection profiles overlap, where the characters are not touching.

#kerned, touching The number of pairs of characters whose projection profiles overlap, where the characters are touching.

avg. kerning The average amount of kerning over all kerned character pairs.

max. kerning The maximum amount of kerning among all kerned character pairs.

The values involving kerning are computed at all possible slants, and the values corresponding to the slant with the minimum average kerning are reported. To characterize a whole database of images of handwritten text, these numbers are computed for each input field in the database and quartiles are reported.

4 Evaluation of Segmentation Hypotheses

Let us now turn to the question of how to compare the quality of a hypothesized segmentation against a ground truth segmentation.
That is, we are given two segmentations in image form, the hypothesized segmentation and the ground truth. The images representing these segmentations should have the same dimensions, and for each corresponding pair of pixels in the two images, either both pixels are zero (belong to the background) or both are non-zero (belong to the foreground of some character hypothesis). Based on these pixel correspondences, we can compute a bipartite graph, which we will refer to as the pixel correspondence graph.

Definition. The pixel correspondence graph of two pixel-based representations A and B of segmentations is a weighted bipartite graph. The left and right node sets N_A and N_B are indexed by the distinct values that pixels in A and B assume, respectively. For each pair of corresponding pixel values A_ij and B_ij, there is an edge between the corresponding nodes; the weight of the edge is its multiplicity.

The weight of the edge between two nodes therefore represents the number of foreground pixels in the intersection of the regions covered by the two character hypotheses. Edges going to the node representing the AMB pixel value in the ground truth image are removed from further analysis. If the hypothesized segmentation agrees perfectly with the ground truth segmentation (up to AMB pixels), then this bipartite graph is a perfect matching; that is, each node on either side of the graph has exactly one edge. If there are differences between the two segmentations, then the bipartite graph will not be a perfect matching. Instead, each node representing a character hypothesis in the hypothesized segmentation may have multiple outgoing edges, and each node representing a character hypothesis in the ground truth may have multiple incoming edges.
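The construction of the pixel correspondence graph can be sketched as a single pass over the corresponding pixel pairs (our illustration, not the paper's C++ code; integer labels stand in for the RGB values):

```python
import numpy as np

# Sketch: build the pixel correspondence graph from two label images of
# the same shape (0 = background, other values = character labels).
# Returns {(label_in_A, label_in_B): multiplicity}, i.e. the edge weights.
def correspondence_graph(A, B):
    fg = (A != 0) & (B != 0)                     # foreground in both images
    pairs = np.stack([A[fg], B[fg]], axis=1)     # one (a, b) row per pixel
    edges, counts = np.unique(pairs, axis=0, return_counts=True)
    return {(int(a), int(b)): int(c) for (a, b), c in zip(edges, counts)}
```

In a full implementation, edges incident on the AMB label in the ground truth image would be dropped before further analysis, as described above.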

                            CEDAR BU   NIST Database 12
connected-components           4.6        13.5
connected-splits               1.4         1.8
segmentation-components        5.7        12.8
segmentation-splits            0.16        0.74
kerned-pairs                   0.89        3.8
avg-kerning                    1.1         1.5
max-kerning                    3.1         4.5
slant-for-min-avg              0.39        0.23

Table 1. Evaluation of 195 fields from the CEDAR BU database of ZIP codes (left) and 185 fields from the NIST Database 12 of handwritten responses on US Census forms (right).

groundtruth-components        12.8
segmentation-components       17.3
oversegmented-comps            2.1
undersegmented-comps           1.0
total-oversegmentation         2.4
total-undersegmentation        1.4
frac-oversegmented-fields      0.79
frac-undersegmented-fields     0.52

Table 2. Evaluation of a simple segmentation algorithm on 185 fields from the NIST Database 12. See the text for a discussion.

For each node on either side of the bipartite graph, we can compute the fraction, or percentage, of pixels overlapping with each of its corresponding nodes. For example, if a character in the ground truth is evenly split between two character hypotheses in the hypothesized segmentation, we would compute two fractions of 50% each for that node in the ground truth. This is an example of oversegmentation: a ground truth character has been split when it should not have been. Conversely, if a character hypothesis in the hypothesized segmentation is evenly split between two characters in the ground truth, we would compute two fractions of 50% each for that node in the hypothesized segmentation. This is an example of undersegmentation: a ground truth character has not been split when it should have been. When oversegmentation or undersegmentation is present in a recognition result (as opposed to a presegmentation, see below), it indicates a failure of the system to identify one or more of the characters. As a result, the whole image can likely not be recognized correctly.
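These fractions can be read directly off the correspondence-graph edge weights. A sketch, with a hypothetical helper taking edges as {(gt_label, hyp_label): pixel_count}:

```python
# Sketch: fractional overlaps for each ground-truth node, given the edge
# weights of the pixel correspondence graph as {(gt, hyp): pixel_count}.
def gt_fractions(edges):
    totals = {}
    for (g, _), w in edges.items():
        totals[g] = totals.get(g, 0) + w         # pixels per ground-truth char
    return {(g, h): w / totals[g] for (g, h), w in edges.items()}

# Ground-truth character 1 split evenly between two hypotheses (10, 11):
fracs = gt_fractions({(1, 10): 50, (1, 11): 50, (2, 12): 100})
# fracs[(1, 10)] == 0.5 and fracs[(1, 11)] == 0.5 -> oversegmentation
```

The symmetric computation over hypothesis nodes yields the fractions used to detect undersegmentation.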
Using these definitions, we could simply define average oversegmentation and undersegmentation in terms of the average number of edges entering and leaving nodes of the bipartite graph. Unfortunately, things are not quite that simple. Real segmentation systems not only show gross failures to split, but they also show slight differences around the edges of characters relative to the ground truth. Conceptually, these are neither oversegmentation nor undersegmentation, but slight geometric inaccuracies.

How can we proceed? As long as the bulk of each character image in the ground truth corresponds to the bulk of a character image in the segmentation hypothesis, and vice versa, there is no oversegmentation or undersegmentation. But when a significant fraction of the pixels of any character hypothesis is missing, then we have either oversegmentation or undersegmentation. To formalize this notion, we consider the maximal bipartite matching for the bipartite graph we have computed previously. The maximal bipartite matching represents the most optimistic way in which we can put the two segmentations into correspondence. We therefore arrive at the following definition:

Definition. Let G be the pixel correspondence graph of the pixel-based representations of the segmentation hypothesis S and the segmentation ground truth T. Let M be the maximal weighted bipartite matching of G. The number of oversegmented characters at threshold θ is the number of nodes corresponding to T having an associated edge in M whose weight is below the threshold θ. The number of undersegmented characters at threshold θ is the number of nodes corresponding to S having an associated edge in M whose weight is below the threshold θ.
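These counts can be sketched using fractional matched weights, as in the 90% threshold discussed below. This is our illustration, not the paper's implementation: scipy's assignment solver stands in for the maximal-weight bipartite matching, and the edge-dict format {(gt_label, hyp_label): pixel_count} is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def seg_counts(edges, theta=0.9):
    """Over-/undersegmented character counts at (fractional) threshold theta."""
    gts = sorted({g for g, _ in edges})
    hyps = sorted({h for _, h in edges})
    W = np.zeros((len(gts), len(hyps)))
    for (g, h), w in edges.items():
        W[gts.index(g), hyps.index(h)] = w
    rows, cols = linear_sum_assignment(W, maximize=True)  # maximal matching
    matched_w = W[rows, cols]
    gt_size = W.sum(axis=1)[rows]    # pixels of each matched ground-truth char
    hyp_size = W.sum(axis=0)[cols]   # pixels of each matched hypothesis char
    oversegmented = int(np.sum(matched_w / gt_size < theta))
    undersegmented = int(np.sum(matched_w / hyp_size < theta))
    return oversegmented, undersegmented
```

For instance, a ground-truth character split 50/50 between two hypotheses retains only half its pixels in its matched edge and is counted as oversegmented at θ = 0.9.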
We can then define the degree of oversegmentation as the number of ground truth nodes whose edge in the maximal bipartite matching has a fractional weight of less than some threshold θ, and the degree of undersegmentation correspondingly for the segmentation hypothesis nodes. The choice of the threshold θ itself depends on the sensitivity of the subsequent isolated character recognizer to variations in character shapes, and represents a parameter that ties the performance of the segmentation algorithm to the overall performance of the recognition system. For reporting segmentation performance, we can choose multiple threshold values, although in practice a threshold of θ = 90% appears to be a good choice (the distribution of fractional edge weights is bimodal for good segmentation algorithms).

For the ground truth characters that we have not characterized as oversegmented, it is useful to measure how accurately the segmentation algorithm represents their shape. We can capture this by computing the average number of pixels per ground truth character that are not represented by the character's edge in the maximal bipartite matching.

5 Evaluation of Presegmentations

Above, we discussed the notion of a presegmentation as the basis for the construction of a segmentation hypothesis graph. It would be useful to be able to evaluate the quality of a presegmentation in order to predict how well the corresponding hypothesis graph can represent the possible segmentations of the handwritten input. We can apply the methods described in the previous section directly, substituting the image of the presegmentation for the image of the hypothesized segmentation. The definitions of oversegmentation and undersegmentation carry over with their

usual meanings [9, 8, 1].

6 Experiments

All the methods described in this paper have been implemented in C++, and they have been used during the development of the handwriting recognition system described in [2]. Figure 6 shows a graphical application (implemented with the cross-platform wxWindows toolkit) that allows quick and accurate manual segmentation of training and test data. The application functions as a paint program in which the background is masked. With minimal experience, it is possible to label fields at approximately one character per second.

Table 1 shows a summary of the evaluation of 195 binary fields from the CEDAR BU database (obtained by automatic thresholding) and 185 fields from the NIST Database 12 of handwritten responses on US Census forms (obtained by automatic forms removal). (The programs output a great deal of additional information, including information about which specific fields contain difficult cases.) These results give us a good idea of the difficulty of the two databases. They show, for example, that kerning occurs four times more frequently in the NIST database than in the CEDAR database. Furthermore, the NIST database also contains more than four times as many characters that are split between multiple connected components. A more detailed analysis of this data lets us make predictions about the limits on achievable recognition performance.

Table 2 shows the evaluation of a simple segmentation algorithm on the NIST database. An analysis of these results shows that the algorithm is limited to a recognition rate of 48%. It can likely be improved greatly by more aggressive segmentation, resulting in more oversegmentation but also reducing undersegmentation. Examining the specific fields determined to be undersegmented by the evaluation method yields further information about which particular fields this segmentation algorithm has problems with.
7 Discussion

This paper has described a number of techniques for the evaluation and characterization of databases for off-line handwriting recognition, segmentation hypotheses, and presegmentations. These techniques address the following issues:

- The representation and interchange of segmentation ground truth and segmentation results in an easily implementable format.
- The identification and quantification of common difficulties encountered in off-line handwriting recognition databases.
- The measurement of undersegmentation, oversegmentation, and geometric precision in both final segmentations and presegmentations.
- The automatic generation of test cases and ground truth from isolated character databases.

In the author's experience, they provide useful insights into the performance and failure modes of handwriting recognition systems. Data, metrics, and results corresponding to those experiments will be described elsewhere. A more widespread adoption of any such methods and metrics would require creating and publishing significant amounts of ground truth for widely used databases, as well as validating and correlating the proposed metrics against the performance of additional real-world recognition systems. Such efforts first require some agreement in the community about these techniques. The author hopes that this contribution will catalyze discussions at the workshop that may lead to such community efforts.

References

[1] M. Blumenstein and B. Verma. Analysis of segmentation performance on the CEDAR benchmark database. In International Conference on Document Analysis and Recognition, pages 1142-1146, 2001.

[2] T. Breuel. Recognition of handwritten responses on US census forms. In Proceedings of the International Association for Pattern Recognition Workshop (Document Analysis Systems), pages 237-264, 1995.

[3] T. Breuel. Segmentation of handprinted letter strings using a dynamic programming algorithm.
In Proceedings of the Sixth International Conference on Document Analysis and Recognition, pages 821-826, 2001.

[4] G. E. Kopec and P. A. Chou. Document image decoding using Markov source models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6):602-617, June 1994.

[5] Y. Lu and M. Shridhar. Character segmentation in handwritten words: an overview. Pattern Recognition, 29(1):77-96, 1996.

[6] T. Steinherz, E. Rivlin, and N. Intrator. Offline cursive script word recognition: a survey. International Journal on Document Analysis and Recognition, 2:90-110, 1999.

[7] A. Vinciarelli. A survey on off-line cursive word recognition. Pattern Recognition, 2002.

[8] X. Xiao and G. Leedham. Knowledge-based English cursive script segmentation. Pattern Recognition Letters, 21:945-954, 2000.

[9] B. Yanikoglu and P. A. Sandon. Segmentation of off-line cursive handwriting using linear programming. Pattern Recognition, 31(12):1825-1833, 1998.