A Labeling Approach for Mixed Document Blocks. A. Belaïd and O. T. Akindele. Crin-Cnrs/Inria-Lorraine, Bâtiment LORIA, Campus Scientifique, B.P. 239, 54506 Vandœuvre-lès-Nancy Cedex, France.


Abstract

A block image labeling method is presented. It does not assume that the blocks to be treated are already segmented, nor that they contain homogeneous data. It is based on connected component analysis and labels a block's contents as small letter text, medium letter text, large letter text, graphics or photographs, giving the percentage of each of these components with respect to the surface area it occupies. It uses a recursive algorithm that makes it possible to improve on the result of segmentation. The performance of the method is given.

1 Introduction

Block classification, or labeling, is an important and useful step in the document image recognition process. In this step, document image blocks extracted during the segmentation process are classified into different categories, such as text, graphics or photographs, depending on their contents. The labels given to blocks help in determining the type of treatment to be applied to each block during the analysis and understanding stage.

There are two major approaches to block classification. In the first approach, it is always assumed that blocks contain homogeneous data. This is the case for blocks found in composite documents such as scientific journals and newspapers. Moreover, the segmentation methods employed use global spatial properties of regions to determine their frontiers, without taking their contents into account. Each block is classified into the closest medium satisfying certain properties. These properties correspond mostly to statistical and textural features extracted from the block image.
Among the methods in this approach, we can cite [2], which uses a feature space partitioning technique to label newspaper image blocks, using the regularity, abundance and width of spaces to classify a block as a small letter, medium letter, large letter, graphics or photograph block. We can also cite [3], which uses block size, mean black pixel run length, density and eccentricity to classify blocks extracted with RLSA into text, graphics, halftone, horizontal line or vertical line blocks, exploiting the fact that text lines have an approximately constant and small height.

In the second approach, it is assumed that a block contains a mixture of text and non-text (generally text and graphics), as in technical documents, tables and forms. In this case, the methods employed separate text strings from non-text in the block. Some of these methods use connected component analysis to perform the text separation, as in [1], where a Hough transform based algorithm is applied to group collinear connected components of similar size into logical text strings. Others are based on neighborhood line density, which lends itself to the extraction of graphics.

In this paper, we describe a new labeling method that is able to locate and identify each type of data in a mixture of media in the same block. The method gives more detailed information than the previous methods, and it can be used to improve on the results of the segmentation. It gives the precise location of each medium in the block as well as its percentage with respect to the surface area it occupies.

2 Principle

This method classifies a block by giving the proportion of each of the following categories: small text, medium text, large text, graphics and photograph. It is based on the analysis of connected components (cc's): for each set of cc's, it studies the classes of spaces between them, as well as their sizes and regularity. The analysis is done in three steps.
In the first step, cc's are merged into sets of approximately aligned cc's. For example, a text line can be partitioned into three sets of cc's: the first for accents and apostrophes, the second for letters and the third for punctuation. In this manner, two successive text lines are never merged, and large connected components are easily isolated. The cc's in each set are analyzed individually if they are few, or globally otherwise. In the global analysis, the widths of the cc's and the spaces between them are studied. If there are more than three types of spaces, the analysis is recursively applied to the two sets of cc's on either side of the largest space (this allows the separation of two columns, for example). If there is a cc whose width is much larger than those of the rest, it is separated and analyzed apart. If there is only one class of spaces and the regularity of the spaces is very strong, the cc set is taken as graphics; otherwise, it is considered as text. In the individual cc analysis, certain characteristics, such as the density, the height/width ratio and the percentage of horizontal black segments whose lengths are equal to the cc's width, are extracted to determine the type of the cc.

In the second step, the sets obtained in the previous step are globally analyzed with respect to their neighbors, in order either to correct the errors of the previous classification or to merge similar sets into bigger ones.

The last step is concerned with the calculation of the percentage of each category in the block.

3 Different Steps

The document is deskewed if its skew angle is large enough to disturb horizontal alignment. After extracting the cc's and eliminating those considered as noise (i.e. those whose number of black pixels or surface area is less than an a priori fixed threshold), we merge them into bigger entities. The connected components are represented by the coordinates of the top left and bottom right corners of their circumscribing rectangles, say $[(x_1, y_1), (x_2, y_2)]$. They are extracted in ascending order of y. For equal y, they are obtained in ascending order of their $x_1$.

3.1 Fusion of Connected Components into Sets

Two cc's are merged into the same set/line when they are approximately aligned, i.e. if the y-coordinates of their top left corners are not too far from each other, and likewise for the y-coordinates of their bottom right corners. Closeness is judged relative to the larger of the two heights: for two components $[(x_1, y_1), (x_2, y_2)]$ and $[(x'_1, y'_1), (x'_2, y'_2)]$, both $|y_1 - y'_1|$ and $|y_2 - y'_2|$ must stay within a fixed fraction of $\max[(y_2 - y_1), (y'_2 - y'_1)]$.

Note that a line can be formed by cc's whose abscissas are far apart. With this method, it is possible to extract several line portions from a text line, and to separate line portions that are likely to be connected (above or below) to another line of text or graphics.

3.2 Fusion of Sets into Lines

The line portions so formed are then merged into larger sets to obtain real text lines and to discard those that are not horizontally aligned. This avoids merging the line of an underlined text with the text itself, or merging two successive text lines. The fusion is performed if the circumscribing rectangles are very close in either the horizontal or the vertical direction, or have a non-empty intersection, or overlap. This fusion of lines improves the results of the previous fusion (fusion of cc's).

3.3 Line Classification

The classification of the formed lines is based on coefficients extracted from the constituent cc's (such as size, density and the percentage of black segments whose width is approximately equal to that of the cc), as well as on the homogeneity of the spaces separating them. It is performed in two ways, depending on the number of cc's in the line. When there is only one cc, it is passed through a series of filters to determine its type. Otherwise, the line is either cut into smaller sets with respect to the homogeneity of the spaces and sizes of its cc's, or classified globally. The classification algorithm for each case is given in Sections 3.3.1 and 3.3.2.
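As an illustration of the fusion rule of Section 3.1, the alignment test can be sketched as follows. This is a minimal sketch, not the authors' implementation: the names `Box`, `aligned` and `group_into_sets` are hypothetical, the `tolerance` fraction of 0.5 is an assumed value, sorting on `y1` is one reading of the extraction order, and comparing each component with the first member of a set is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Circumscribing rectangle of a connected component (hypothetical type)."""
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def height(self) -> int:
        return self.y2 - self.y1


def aligned(a: Box, b: Box, tolerance: float = 0.5) -> bool:
    """Top and bottom edges must both be close, relative to the taller box.

    `tolerance` is an assumed fraction, not a value taken from the paper.
    """
    limit = tolerance * max(a.height, b.height)
    return abs(a.y1 - b.y1) <= limit and abs(a.y2 - b.y2) <= limit


def group_into_sets(boxes, tolerance: float = 0.5):
    """Greedily assign each box to the first set whose first member it aligns with."""
    sets = []
    # Process components top to bottom, then left to right.
    for box in sorted(boxes, key=lambda b: (b.y1, b.x1)):
        for members in sets:
            if aligned(members[0], box, tolerance):
                members.append(box)
                break
        else:
            sets.append([box])
    return sets


if __name__ == "__main__":
    letters = [Box(0, 10, 12, 30), Box(15, 11, 27, 29)]
    accent = Box(4, 2, 8, 6)
    next_line = Box(0, 50, 12, 70)
    for members in group_into_sets(letters + [accent, next_line]):
        print(members)
```

With these inputs the accent, the two letters and the component of the following line end up in three separate sets, which mirrors the paper's remark that accents form their own set and that successive text lines are never merged.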
3.3.1 Case of many cc's

/* LHavg  : average height of the cc's in the line
   LWavg  : average width of the cc's in the line
   MIHslt : minimum height of small letter text
   MAHslt : maximum height of small letter text
   MIHmlt : minimum height of medium letter text
   MAHmlt : maximum height of medium letter text
   MIHllt : minimum height of large letter text
   MAHllt : maximum height of large letter text */

if LHavg < MIHslt then                  /* very small average height of cc's */
    line_type = graphics
else
    calculate NBsc                      /* number of space_classes */
    if NBsc ≥ 3 then                    /* non regular spaces between cc's */
        cut the line into two at the largest space;
        recall the classification on each sub_line
    else                                /* regular & more or less regular spaces */
        if largest cc ≥ 4 LWavg then    /* a cc different from the others */
            cut the line around the largest cc (on the right and on the left);
            recall the classification on the sub_lines and the largest cc
        else                            /* cc's of regular sizes and spaces */
            if NBsc = 1 & LHavg ≥ MIHslt then
                classify each cc individually;
                line_type = type of the majority
            else text:
                small  if MIHslt ≤ LHavg ≤ MAHslt
                medium if MIHmlt ≤ LHavg ≤ MAHmlt
                large  if MIHllt ≤ LHavg ≤ MAHllt

3.3.2 Case of a single cc

In this case, the cc is passed through a series of filters, on the basis of attributes extracted from it, until its type is obtained. In all, there are sixteen filters, which are applied in order. Many thresholds are used in these filters, but they are determined beforehand during a learning stage on many kinds of documents, which ensures their stability. The filters are given below.

F1  if density < minimum density of photograph then graphics
F2  if No. of segments (whose width ≠ that of cc) < a certain threshold then if vertically extended black block (1, I) then text else graphics
F3  if low density and extended block then graphics
F4  if eccentricity is between that of text and photograph and high density then if the height is large then photograph else text
F5  if eccentricity > high threshold of that of photograph then graphics
F6  if eccentricity < low threshold of that of photograph then graphics
F7  if height < that of text then graphics
F8  if height < that of photograph and density > that of photograph then graphics
F9  if height > that of photograph and density > that of photograph then graphics
F10 if average number of segments per line > number of segments in a text letter then if density > that of graphics then photograph else graphics
F11 if (No. of segments per line - average No. of segments per line) is large then if density > that of graphics then photograph else graphics
F12 if No. of segment length classes ≤ that of a graphics line and average No. of segments per line is equal to that of a graphics line then graphics
F13 if No. of segment length classes > that of a letter then if density > that of graphics then photograph else graphics
F14 if length of segments is very irregular then if density > that of graphics then photograph else graphics
F15 if low eccentricity and density ≥ that of letter then photograph
F16 if many lines with irregular segment lengths then if density > that of graphics then photograph else graphics
F17 else text

3.4 Error Detection and Particular Cases

The classification of the lines may contain some imperfections. We therefore try to detect and correct any error. This is done in two phases. First, incoherences at the level of cc's are located and resolved. Second, incoherences at the line level, or particular cases, are located and resolved.

3.4.1 Overlapping Connected Components

Photographs and graphics are often fragmented when passed through a scanner, and some of their fragments are confused with text. In order to reconstitute these kinds of patterns, we locate and study the cc's that overlap with them. The correction algorithm is given below.

foreach c of type photograph (P) or graphics (G) do
    foreach c' ≠ c such that area_of(c') < area_of(c) and area_of(c ∩ c') exceeds a fixed fraction of area_of(c') do
        if type_of(c) = P and type_of(c') ≠ G then type_of(c') := P
        if type_of(c) = G and type_of(c') = P then type_of(c') := G
    done
done

3.4.2 Particular Cases

In this phase, we compare each line with its neighboring lines to determine whether we have a particular case. A particular case can be: accents, apostrophes, dots on i and j, broken characters, or part of a graphic mislabeled as a photograph. It is also necessary to make uniform those text lines whose letters, individually recognized, can have different sizes.

Text line with different sized letters. When a text line contains a mixture of small, medium and large letters, the line is given the label of the components that occupy the largest surface area.
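The overlap correction of Section 3.4.1 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: `Component`, `correct_overlaps` and the label strings "P", "G" and "T" are hypothetical names, and `min_overlap` (the fraction of the smaller component that must be covered) is an assumed parameter with an assumed default of 0.5.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Bounding box plus current label: "P" photograph, "G" graphics, "T" text."""
    x1: int
    y1: int
    x2: int
    y2: int
    label: str


def area(c: Component) -> int:
    return max(0, c.x2 - c.x1) * max(0, c.y2 - c.y1)


def intersection_area(a: Component, b: Component) -> int:
    width = min(a.x2, b.x2) - max(a.x1, b.x1)
    height = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0, width) * max(0, height)


def correct_overlaps(components, min_overlap: float = 0.5):
    """Relabel small components that lie mostly inside a photograph or graphic.

    Labels are updated in place, so a fragment relabeled early can itself
    act as a photograph or graphic later in the same pass.
    """
    for c in components:
        if c.label not in ("P", "G"):
            continue
        for other in components:
            if other is c or area(other) >= area(c):
                continue
            if intersection_area(c, other) <= min_overlap * area(other):
                continue
            if c.label == "P" and other.label != "G":
                other.label = "P"   # fragment absorbed by the photograph
            elif c.label == "G" and other.label == "P":
                other.label = "G"   # dense fragment inside a graphic
    return components


if __name__ == "__main__":
    photo = Component(0, 0, 100, 100, "P")
    fragment = Component(10, 10, 20, 20, "T")
    distant = Component(200, 200, 210, 210, "T")
    correct_overlaps([photo, fragment, distant])
    print(fragment.label, distant.label)
```

Exposing the overlap fraction as a parameter keeps the sketch independent of any particular threshold; the two relabeling rules themselves follow the correction algorithm given in Section 3.4.1.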
Misclassified medium and large letter text. A medium or large text line can be cut horizontally or vertically, or may contain some points, apostrophes or punctuation. In the case of the horizontal cut, we examine two lines that are horizontal neighbors, while in the other case, we examine two lines that are vertical neighbors. In the first case, if there is a line to the left or to the right of a medium or large text line, we merge it with the text line if its height is much less than that of the text line and it does not contain any component whose label is large text. This case corresponds to large characters either cut on top or containing accents or dots. In the second case, if there is a line above or below a medium or large text line, we merge it with the text line if its height is much less than that of the text line and it does not contain any component whose label is large text or medium text. This case corresponds to large characters cut either on top or at the bottom.

Graphics classified as photographs. If a line contains a mixture of graphics and photographs and the surface area of the photographs is less than a certain threshold (0% of the surface area of the line), we change the label of the line components to graphics. This is because certain graphics may contain dense components that can be confused with photographs.

3.5 Calculation of Percentages

We have chosen to use the percentage of the surface area of each type of component. However, we want text to receive a percentage close to the one a human observer would assign. (The human eye often surrounds a text zone with an invisible rectangle and considers the interline spaces an integral part of the text.) Therefore, when a text line is obtained, we do not consider the total of the surface areas of its cc's, but the area of its circumscribing rectangle. Furthermore, when a rectangle circumscribing a cc is enclosed in a rectangle circumscribing another cc, the area of the former is subtracted from that of the latter (for example, a surrounded title).

4 Experiments and Results

The method has been tested on about 10 blocks for each class, chosen from scientific journals (IEEE, IBM, ACM, etc.) and technical reports. When the page images are not segmented beforehand, the method can be used as a means of separating the different media in the page.

We observed that Small Letter Text blocks are very rare. This reinforces the idea that the notion of Small Letter Text is very subjective. In fact, each font has a particular size for small letters. Therefore, in a multifont document, Small Letter blocks are always confused with Medium Letter blocks.

Table 1 shows the results of the classification. We can observe that 6% of Medium Text blocks are classified as Graphics blocks. This is due to underlined text where the letters touch the line, and also to upright, isolated letters such as l and i. The 1% of Large Text confused with Photographs is a result of isolated and dense very large letters. The 5% of Photographs labeled as either Large Text or Graphics resulted from cuts in photographs during scanning, or from the fact that some photographs contain white streaks. Some Graphics blocks were labeled as Medium Text blocks (7%); this is because graphics are not usually well connected and may contain small forms that are mistaken for letters (for example, graphics representing chemical structures). Examples of the results obtained are given in Figure 1.

        MT     LT     Gr     Ph
MT     94%     0%     6%     0%
LT      0%    99%     0%     1%
Gr      7%     0%    93%     0%
Ph      0%     5%     5%    90%

Table 1: Classification results, where MT stands for Medium Text, LT for Large Text, Gr for Graphics and Ph for Photographs. Rows give the actual class and columns the assigned label.

Even though the method gives satisfactory results, it has its limits. Joined letters are usually labeled as graphics because of their eccentricities. Fragmented photographs are also labeled either as text or as graphics. This results from the lack of contextual rules for assembling fragments of the same medium.

5 Conclusion

The method we present gives satisfactory results on all tested document images. It is general in that it can locate and identify any medium in a document.
It can also be used to separate text from non-text in technical documents. The algorithm employed tolerates a reasonable orientation of the document images. The manner in which the results are given is very useful in document analysis and treatment, because it permits one to focus on a type of medium and determine the type of treatment to be applied.

References

[1] L. A. Fletcher and R. Kasturi, A Robust Algorithm for Text String Separation from Mixed Text/Graphics Images. PAMI, 10(6): 910-918, 1988.
[2] D. Wang and S. N. Srihari, Classification of Newspaper Image Blocks Using Texture Analysis. CVGIP, 47: 327-352, 1989.
[3] K. Y. Wong, R. G. Casey and F. M. Wahl, Document Analysis System. IBM Journal of Research and Development, 26(6): 647-656, 1982.

Figure 1: Labeling results for a composite document. (a) Original image, (b) photograph part (43.7%), (c) small text (35%) and (d) large text part (1.%).
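The area accounting described in Section 3.5 can be sketched as follows. This is only a sketch under stated assumptions: `category_percentages` and the region labels are hypothetical names, each text entry is assumed to be the circumscribing rectangle of a whole text line, percentages are taken over the total labeled area, and enclosure is assumed to be at most one level deep.

```python
def rect_area(rect):
    x1, y1, x2, y2 = rect
    return max(0, x2 - x1) * max(0, y2 - y1)


def encloses(outer, inner):
    """True when `inner` lies entirely within a different rectangle `outer`."""
    return (
        outer != inner
        and outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def category_percentages(regions):
    """regions: list of (label, (x1, y1, x2, y2)) tuples.

    An enclosed rectangle's area is subtracted from the rectangle that
    encloses it, as for a title surrounded by a frame.
    """
    areas = []
    for i, (_, rect) in enumerate(regions):
        own = rect_area(rect)
        for j, (_, other) in enumerate(regions):
            if i != j and encloses(rect, other):
                own -= rect_area(other)
        areas.append(max(own, 0))

    total = sum(areas)
    result = {}
    for (label, _), own in zip(regions, areas):
        share = 100.0 * own / total if total else 0.0
        result[label] = result.get(label, 0.0) + share
    return result


if __name__ == "__main__":
    regions = [
        ("graphics", (0, 0, 10, 10)),      # frame around a title
        ("large_text", (2, 2, 6, 6)),      # the surrounded title line
        ("photograph", (20, 0, 30, 10)),
    ]
    print(category_percentages(regions))
```

Using line rectangles rather than summed component areas for text is what lets interline and inter-letter space count toward the text share, as the section argues a human observer would.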