A Labeling Approach for Mixed Document Blocks

A. Belaïd and O. T. Akindele
Crin-Cnrs/Inria-Lorraine, Bâtiment LORIA, Campus Scientifique, B.P. 239, 54506 Vandœuvre-lès-Nancy Cedex, France

Abstract

A block image labeling method is presented. It assumes neither that the blocks to be treated are already segmented nor that they contain homogeneous data. It is based on connected component analysis and labels a block's contents as small letter text, medium letter text, large letter text, graphics or photographs, giving the percentage of each of these components with respect to the surface area it occupies. It uses a recursive algorithm that makes it possible to improve on the result of segmentation. The performance of the method is reported.

1 Introduction

Block classification, or labeling, is an important and useful step in the document image recognition process. In this step, document image blocks extracted during the segmentation process are classified into different categories such as text, graphics, photographs, etc., depending on their contents. The labels given to blocks help in determining the type of treatment to be applied to each block during the analysis and understanding stage.

There are two major approaches to block classification. In the first approach, it is always assumed that blocks contain homogeneous data. This is the case for blocks found in composite documents such as scientific journals, newspapers, etc. Moreover, the segmentation methods employed use global spatial properties of regions to determine their frontiers, without taking their contents into account. Each block is classified into the closest medium satisfying certain properties. These properties correspond mostly to statistical and textural features extracted from the block image.
Among the methods in this approach, we can cite [2], which uses a feature space partitioning technique to label newspaper image blocks, using the regularity, abundance and width of spaces to classify a block as a small letter, medium letter, large letter, graphics or photograph block. We can also cite [3], which uses block size, mean black pixel run length, density and eccentricity to classify blocks extracted with RLSA into text, graphics, halftone, horizontal line or vertical line blocks, exploiting the fact that text lines have an approximately constant and small height.

In the second approach, it is assumed that a block contains a mixture of text and non-text (generally, text and graphics), as in technical documents, tables, forms, etc. In this case, the methods employed separate text strings from non-text in the block. Some of these methods use connected component analysis to perform the text separation, as in [1], where a Hough transform based algorithm is applied to group collinear connected components of similar size into logical text strings. Others are based on neighborhood line density, which lends itself to the extraction of graphics.

In this paper, we describe a new labeling method that is able to locate and identify each type of data in a mixture of media in the same block. The method gives more detailed information than the previous methods and can be used to improve on the results of the segmentation. It gives the precise location of each medium in the block as well as its percentage with respect to the surface area it occupies.

2 Principle

This method classifies a block by giving the proportion of each of the following categories: small text, medium text, large text, graphics and photograph. It is based on the analysis of connected components (cc's): for each set of cc's, it studies the classes of spaces between them, as well as their sizes and regularity. The analysis is done in three steps.
In the first step, cc's are merged into sets of approximately aligned cc's. For example, a text line can be partitioned into three sets of cc's: the first for accents and apostrophes, the second for letters and the third for punctuation. In this manner, two successive text lines are never merged, and large connected components are easily isolated. The cc's in each set are analyzed individually if they are few, or globally otherwise. In the global analysis, the widths of the cc's as well as the spaces between them are studied. If there are more than three types of spaces, the analysis is recursively applied to the two sets of cc's on either side of the largest space (this allows the separation of two columns, for example). If there is a cc whose width is much larger than those of the rest, it is separated and analyzed apart. If there is only one class of spaces and the regularity of the spaces is very strong, the cc set is taken as graphics; otherwise, it is considered as text. In the individual cc analysis, certain characteristics, such as density, height/width ratio, the percentage of horizontal black segments whose lengths are equal to the cc's width, etc., are extracted to determine the type of the cc.

In the second step, the sets obtained in the previous step are globally analyzed with respect to their neighbors in order either to correct the errors of the previous classification or to merge similar sets into bigger ones.

The last step is concerned with the calculation of the percentage of each category in the block.

3 Different Steps

The document is deskewed if its skew angle is large enough to disturb horizontal alignment. After the extraction of cc's and the elimination of those considered as noise (i.e. those whose number of black pixels or surface area is less than an a priori fixed threshold), we proceed to merge them into bigger entities. The connected components are represented by the coordinates of the top left and bottom right corners of their circumscribing rectangles, say $[(x_1, y_1), (x_2, y_2)]$. They are extracted in ascending order of $y_2$. For equal $y_2$, they are obtained in ascending order of their $x_1$.

3.1 Fusion of Connected Components into Sets

Two cc's are merged into the same set/line when they are approximately aligned, i.e.
if the y-coordinates of their top left corners are not too far from each other, and likewise for the y-coordinates of their bottom right corners. For two cc's with circumscribing rectangles $[(x_1, y_1), (x_2, y_2)]$ and $[(x'_1, y'_1), (x'_2, y'_2)]$, the closeness of these coordinates is determined with the following rule:

$$
|y_1 - y'_1| \le \frac{\max[(y_2 - y_1),\ (y'_2 - y'_1)]}{2}
\quad \& \quad
|y_2 - y'_2| \le \frac{\max[(y_2 - y_1),\ (y'_2 - y'_1)]}{2}
$$

It is to be noted that a line can be formed by cc's whose abscissas are far apart. With this method, it is possible to extract several line portions from a text line, and to separate line portions that might otherwise be connected (above or below) to another line of text or graphics.

3.2 Fusion of Sets into Lines

The line portions so formed are then merged into larger sets to obtain real text lines and to discard those that are not horizontally aligned. This is to avoid merging either the line of an underlined text with the text, or two successive text lines. The fusion is performed if the circumscribing rectangles are very close in either the horizontal or vertical direction, or have a non-empty intersection, or even overlap. This fusion of lines improves the results of the previous fusion (fusion of cc's).

3.3 Line Classification

The classification of the formed lines is based on some coefficients extracted from the constituting cc's (such as size, density and the percentage of the black segments whose width is approximately equal to that of the cc), as well as on the homogeneity of the spaces separating them. It is performed in two manners depending on the number of cc's in the line. When there is only one cc, it is passed through a series of filters to determine its type. Otherwise, the line is either cut into smaller sets with respect to the homogeneity of the spaces and sizes of its cc's, or classified globally. The classification algorithm is given below.
3.3.1 Case of many cc's

    /* LHavg  : average height of the cc's in the line
       LWavg  : average width of the cc's in the line
       MIHslt : minimum height of small letter text
       MAHslt : maximum height of small letter text
       MIHmlt : minimum height of medium letter text
       MAHmlt : maximum height of medium letter text
       MIHllt : minimum height of large letter text
       MAHllt : maximum height of large letter text */

    if LHavg < MIHslt                    /* very small average height of cc's */
    then line_type = graphics
    else
        calculate NBsc                   /* number of space classes */
        if NBsc >= 3 then                /* non-regular spaces between cc's */
            cut the line into two at the largest space;
            recall the classification on each sub_line
        else                             /* regular or more or less regular spaces */
            if width of largest cc >= 4 LWavg then   /* a cc different from the others */
                cut the line around the largest cc (on the right and on the left);
                recall the classification on the sub_lines and the largest cc
            else                         /* cc's of regular sizes and spaces */
                if NBsc = 1 & LHavg >= MIHslt then
                    classify each cc individually;
                    line_type = type of the majority
                else text:
                    small  if MIHslt <= LHavg <= MAHslt
                    medium if MIHmlt <= LHavg <= MAHmlt
                    large  if MIHllt <= LHavg <= MAHllt

3.3.2 Case of a single cc

In this case, the cc is passed through a series of filters, on the basis of attributes extracted from it, until its type is obtained. In all, there are sixteen filters, which are applied in order. Many thresholds are used in these filters, but they are determined beforehand during a learning stage on many kinds of documents, which ensures their stability. The filters are given below.

    F1   if density < minimum density of photograph then graphics
    F2   if No. of segments (whose width ≠ that of cc) < a certain threshold
         then if vertically extended black block (1, I) then text else graphics
    F3   if low density and extended block then graphics
    F4   if eccentricity is between that of text and photograph and high density
         then if the height is large then photograph else text
    F5   if eccentricity > high threshold of that of photograph then graphics
    F6   if eccentricity < low threshold of that of photograph then graphics
    F7   if height < that of text then graphics
    F8   if height < that of photograph and density > that of photograph then graphics
    F9   if height > that of photograph and density > that of photograph then graphics
    F10  if average number of segments per line > number of segments in a text letter
         then if density > that of graphics then photograph else graphics
    F11  if the difference between the No. of segments per line and its average is large
         then if density > that of graphics then photograph else graphics
    F12  if No. of segment length classes <= that of a graphics line
         and average of No. of segments per line is equal to that of a graphics line
         then graphics
    F13  if No. of segment length classes > that of a letter
         then if density > that of graphics then photograph else graphics
    F14  if length of segments is very irregular
         then if density > that of graphics then photograph else graphics
    F15  if low eccentricity and density >= that of letter then photograph
    F16  if many lines with irregular segment lengths
         then if density > that of graphics then photograph else graphics
    F17  else text

3.4 Error Detection and Particular Cases

The classification of the lines may contain some imperfections. Therefore, we try to detect and correct any error. This is done in two phases. Firstly, inconsistencies at the level of cc's are located and resolved. Secondly, inconsistencies at the line level, or particular cases, are located and resolved.

3.4.1 Overlapping Connected Components

Photographs and graphics are often fragmented when passed through a scanner, and some of their fragments are confused with text. In order to reconstitute these kinds of patterns, we locate and study the cc's that overlap with them. The correction algorithm is given below.

    foreach c of type photograph (P) or graphics (G) do
        foreach c' ≠ c : area_of(c') < area_of(c) and area_of(c ∩ c') > area_of(c')/2 do
            if type_of(c) = P and type_of(c') ≠ G then type_of(c') := P
            if type_of(c) = G and type_of(c') = P then type_of(c') := G
        done
    done

3.4.2 Particular Cases

In this phase, we compare each line with its neighboring lines to determine whether we have a particular case. A particular case can be: accents, apostrophes, dots on i and j, broken characters, or part of a graphic mislabeled as photograph. It is also necessary to give a uniform label to text lines whose letters, recognized individually, can have different sizes.

Text Line with different sized letters

When a text line contains a mixture of small, medium and large letters, the line is given the label of the components that occupy the largest surface area.
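
The two correction rules described above, the relabeling of overlapping components (Section 3.4.1) and the labeling of a text line with different sized letters by its dominant surface area, can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the `Component` class, the label strings, the function names and the use of circumscribing rectangles to measure areas and intersections are all assumptions made for the example, and the overlap fraction of one half is assumed as well, since the printed threshold is not fully legible in the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical cc record: circumscribing rectangle plus a label."""
    x1: int
    y1: int
    x2: int
    y2: int
    label: str  # "small", "medium", "large", "graphics" or "photograph"

    def area(self) -> int:
        return max(0, self.x2 - self.x1) * max(0, self.y2 - self.y1)


def intersection_area(a: Component, b: Component) -> int:
    """Area shared by the two circumscribing rectangles (0 if disjoint)."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0, w) * max(0, h)


def correct_overlaps(components, ratio=0.5):
    """Relabel small cc's that lie mostly inside a photograph or graphics cc.

    ratio is the assumed fraction of the smaller cc that must be covered.
    Labels are updated in place.
    """
    for c in components:
        if c.label not in ("photograph", "graphics"):
            continue
        for other in components:
            if other is c or other.area() >= c.area():
                continue
            if intersection_area(c, other) <= ratio * other.area():
                continue
            if c.label == "photograph" and other.label != "graphics":
                other.label = "photograph"
            elif c.label == "graphics" and other.label == "photograph":
                other.label = "graphics"


def dominant_label(components):
    """Label whose components occupy the largest total surface area."""
    area_by_label = defaultdict(int)
    for c in components:
        area_by_label[c.label] += c.area()
    return max(area_by_label, key=area_by_label.get)
```

For instance, a small fragment labeled as text that lies inside a photograph's rectangle is relabeled as photograph, while a text component elsewhere on the page keeps its label.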
Misclassified Medium and large letter text

A medium or large text line can be cut horizontally or vertically, or may contain some points, apostrophes or punctuation. In the case of the horizontal cut, we examine two lines that are horizontal neighbors, while in the other case, we examine two lines that are vertical neighbors. In the first case, if there is a line to the left or to the right of a medium or large text line, we merge it with the text line if its height is much less than that of the text line and it does not contain any component whose label is large text. This case corresponds to large characters either cut on top or containing accents or dots. In the second case, if there is a line above or below a medium or large text line, we merge it with the text line if its height is much less than that of the text line and it does not contain any component whose label is large text or medium text. This case corresponds to large characters cut either on top or at the bottom.

Graphics Classified as photographs

If a line contains a mixture of graphics and photographs and the surface area of the photographs is less than a certain threshold (20% of the surface area of the line), we change the label of the line components to graphics. This is due to the fact that certain graphics may contain dense components that can be confused with photographs.

3.5 Calculation of Percentages

We have chosen to use the percentage of the surface area of each type of component. However, we want to give text a percentage close to the one we would have given visually. (The human eye often surrounds a text zone with an invisible rectangle and considers the interline spaces as an integral part of the text.) Therefore, when a text line is obtained, we do not consider the total of the surface areas of its cc's, but the area of its circumscribing rectangle. Furthermore, when a rectangle circumscribing a cc is enclosed in a rectangle circumscribing another cc, the area of the former is subtracted from that of the latter (for example, a surrounded title).

4 Experiments and Results

The method has been tested on about 10 blocks for each class, chosen from scientific journals such as those of IEEE, IBM, ACM, etc., and from technical reports. When the page images are not segmented beforehand, the method can be used as a means of separating the different media in the page.

We observed that Small Letter Text blocks are very rare. This reinforces the idea that the notion of Small Letter Text is very subjective. In fact, each font has a particular size for small letters. Therefore, in a multifont document, Small Letter blocks are always confused with Medium Letter blocks.

            MT     LT     Gr     Ph
    MT     94%     0%     6%     0%
    LT      0%    99%     0%     1%
    Gr      7%     0%    93%     0%
    Ph      0%     5%     5%    90%

Table 1: Classification results (rows: actual class; columns: assigned label), where MT stands for Medium Text, LT for Large Text, Gr for Graphics and Ph for Photographs.

Table 1 shows the results of the classification. We can observe that 6% of Medium Text blocks are classified as Graphics blocks. This is due to underlined text where the letters touch the line, and also to erect and isolated letters such as l and i. The 1% of Large Text confused with Photographs is a result of isolated, dense and very large letters. The 5% of Photographs labeled as Large Text and the 5% labeled as Graphics result from cuts in photographs during scanning, or from the fact that some photographs contain white streams. Some Graphics blocks were labeled as Medium Text blocks (7%); this is due to the fact that graphics are not usually well connected and may contain small forms that are mistaken for letters (for example, graphics representing chemical structures). Examples of the results obtained are given in Figure 1.

Even though the method gives satisfactory results, it has its own limits. Joined letters are usually labeled as graphics because of their eccentricities. Also, fragmented photographs are labeled either as text or as graphics. This results from the lack of contextual rules to assemble fragments of the same medium.

5 Conclusion

The method we present gives satisfactory results on all tested document images. It is general in that it can locate and identify any medium in a document.
It can also be used to separate text from non-text in technical documents. The algorithm employed tolerates a reasonable skew of the document images. The manner in which the results are given is very useful in document analysis and treatment because it permits one to focus on one type of medium and to determine the type of treatment to be applied.

References

[1] L. A. Fletcher and R. Kasturi, A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. PAMI, 10(6): 910-918, 1988.
[2] D. Wang and S. N. Srihari, Classification of Newspaper Image Blocks Using Texture Analysis. CVGIP, 47: 327-352, 1989.
[3] K. Y. Wong, R. G. Casey and F. M. Wahl, Document Analysis System. IBM Journal of Research and Development, 26(6): 647-656, 1982.

[Figure 1, panels (a) to (e). Percentages printed in the figure, with some digits missing in the source: small text .9 %, medium text 3. %, large text 1. %, photograph 43.7 %, graphics 0.0 %.]

Figure 1: Labeling results for a composite document. (a) Original image, (b) photograph part (43.7%), (c) small text (35%) and (d) large text part (1.%).
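
Percentages such as those reported in Figure 1 follow the surface-area rule of Section 3.5. The sketch below is one possible reading of that rule, not the authors' code: the function names and the `(rect, label)` input format are hypothetical, each text line is assumed to be already represented by its circumscribing rectangle, percentages are assumed to be normalized by the total labeled area, and at most one level of rectangle nesting is assumed.

```python
from collections import defaultdict


def rect_area(rect):
    """Area of a rectangle given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    return max(0, x2 - x1) * max(0, y2 - y1)


def encloses(outer, inner):
    """True if inner lies entirely within outer and the two differ."""
    return (outer != inner
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def label_percentages(regions):
    """Percentage of labeled surface area per label.

    regions: list of (rect, label) pairs. For text, rect is the
    circumscribing rectangle of the whole line, so that interline
    and intercharacter spaces count as text.
    """
    areas = [float(rect_area(rect)) for rect, _ in regions]
    # Subtract each enclosed rectangle from the rectangle enclosing it
    # (assumes a single level of nesting, e.g. a surrounded title).
    for i, (outer, _) in enumerate(regions):
        for j, (inner, _) in enumerate(regions):
            if i != j and encloses(outer, inner):
                areas[i] -= rect_area(inner)
    totals = defaultdict(float)
    for (_, label), area in zip(regions, areas):
        totals[label] += max(area, 0.0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * a / grand_total for label, a in totals.items()}
```

With a graphics frame enclosing a large-text title, the title's rectangle is counted once as text and removed from the area credited to graphics.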