LOCAL SKEW CORRECTION IN DOCUMENTS


International Journal of Pattern Recognition and Artificial Intelligence, Vol. 22, No. 4 (2008) 691-710
© World Scientific Publishing Company

P. SARAGIOTIS and N. PAPAMARKOS
Image Processing and Multimedia Laboratory, Department of Electrical & Computer Engineering, Democritus University of Thrace, Xanthi, Greece
papamark@ee.duth.gr

In this paper we propose a technique for detecting and correcting the skew of text areas in a document. The documents we work with may contain several areas of text with different skew angles. First, a text localization procedure is applied based on connected components analysis. Specifically, the connected components of the document are extracted and filtered according to their size and geometric characteristics. Next, the candidate characters are grouped using a nearest neighbor approach to form words, and based on these words text lines of any skew are constructed. Then, the top-line and baseline of each text line are estimated using linear regression. Text lines in nearby locations that have similar skew angles are grown to form text areas. For each text area a local skew angle is estimated, and the text areas are then skew corrected independently to horizontal or vertical orientation. The technique has been extensively tested on a variety of document images, and its accuracy and robustness are compared with other existing techniques.

Keywords: Skew correction; text area localization; connected component analysis; linear regression; optical character recognition.

1. Introduction

Optical Character Recognition (OCR) has become an increasingly important technology in office automation software, and many commercial document analysis systems are available to end users. Document layout analysis is used in such systems to improve their ability to deal with complicated document layouts and diverse scripts.
An efficient and accurate method for determining document image skew is essential, since it simplifies layout analysis and improves character recognition. Most document analysis systems require skew detection before the images are forwarded to the subsequent layout analysis and character recognition stages. Document skew is a distortion that often occurs during scanning or copying of a document, or as a design feature in the document's layout. It mainly concerns the orientation of text lines, where zero skew occurs when the lines are horizontal or vertical, depending on the language and page layout. Skew estimation and correction are therefore significant preprocessing stages of document restoration that precede the actual document analysis. Inaccurate deskewing will significantly deteriorate the subsequent processing stages and may lead to incorrect layout analysis, erroneous word or character segmentation and misrecognition. The overall performance of a document analysis system is thereby severely decreased by the skew. In addition, automatic skew detection and correction have practical value in improving the visual output of facsimile machines and duplicating machines. Ideally, a skewed input could be automatically corrected so that the machine produces an output that is more pleasant to read.

In general, there can be three types of skew within a page: a global skew, when all text areas have the same orientation; a multiple skew, when certain text areas have a different slant than the others; and a nonuniform text line skew, when the orientation fluctuates within a line, e.g. a line is bent at one or both of its ends, or a line has a wave-like shape. In this work we focus on documents with multiple skew, which is also the novelty of the proposed technique.

1.1. Related work

A number of methods have previously been proposed for identifying document image skew angles. The main methods proposed in the literature may be categorized into the following groups: methods based on (a) projection profile analysis, (b) nearest-neighbor clustering, (c) Hough transform, (d) cross-correlation, and (e) morphological transforms. A survey was reported by Hull [8], and an extended review is given by Okun et al. [13] Gatos et al. [6] used an interline cross-correlation for two or more vertical lines located at a fixed distance d for skew estimation. The cross-correlation function is computed for the entire image to obtain the document's skew angle. This can, however, be time consuming, and the presence of graphics degrades the accuracy. Lu and Tan [10] followed a nearest-neighbor chain based approach, developing a skew estimation method with high accuracy and language-independent capability.
Their approach detects only a dominant skew for the document. Cao et al. [2] introduced a technique based on straight line fitting and the concept of the eigen-point to detect the skew of a document. Yuan and Tan [16] estimated the skew of a document by calculating the slopes of the virtual lines that pass through the connected components in an image and applying a special convolution to the resulting histogram. Shivakumara and Kumar [14] introduced another boundary growing approach that estimates a single skew angle for a document. Chou et al. [5] detected a dominant skew in documents using piecewise covering of document objects by parallelograms.

Most of the above methods have some inherent weakness. Some of them, especially the earlier ones, are tailor-made algorithms that are applicable only to a particular document layout. As a result, they may fail to estimate skew angles of documents containing complicated layouts with multiple font styles and sizes, arbitrary text orientation and script, or a high proportion of nontext regions such as graphics and tables. Other techniques consider the expected skew angle to be quantized or restricted to a specific range. Moreover, all of the methods referenced so far estimate a global skew angle of the document and fail to recognize multiple local skew angles associated with different document areas.

Messelodi and Modena [12] showed that projection profiles in combination with a clustering procedure based on simple heuristics may overcome the problem of the limited angle range. Although this method can detect multiple skew angles and small interline spacing, it was only tested on small-sized ([...] pixels) images of book covers containing a few text lines. Kapoor et al. [9] used the Radon transform and projection profiles to detect the skew of individual words only, and they did not include any results of their algorithm on a full document. Finally, Yuan and Tan [17] used the convex hull of each connected component to group character areas in a document and estimated each area's local skew angle. However, their paper contains no example that demonstrates the applicability of this technique to local skew correction.

2. Proposed Technique

The proposed technique aims at correcting the local skew angles in documents that contain several text areas set at different slopes. Moreover, the technique must be robust enough to handle a great variety of printed documents, including book and magazine covers, spreadsheets and documents with regular layout in any language. It assumes that the text is meant to be oriented either vertically or horizontally. It is also assumed that the document can contain anything from several text area slopes down to a single one. To achieve this goal, a bottom-up approach is applied, which is better suited to the specific problem. The technique is applicable to gray-scale images. In the preprocessing stage, a binary version of the document image is obtained.
Then, the connected components of the document are extracted using a simple serial labeling algorithm and their bounding rectangles are constructed. A filtering procedure is applied to the connected components to discard nontext elements according to their geometrical characteristics and to mark the connected components that are candidate characters. These candidate characters are grouped using a nearest neighbor approach to form words. The words are grouped again, based on a rough slope calculation, to form lines of text. Using linear regression on the edge pixels of the connected components' bounding rectangles, a set of straight lines is estimated for each text line, representing its top and bottom boundaries. Text lines in nearby locations that have similar skew angles are grown to form text areas, and the slope of each area is defined according to the slope of the text boundary lines. The connected components that were filtered out or failed to form words, but lie inside a text area, are considered part of that text area. Finally, each text area is skew corrected to horizontal or vertical orientation. The proposed technique is able to avoid cases of overlapping text areas. The final result of the technique is a single binary image ready to be processed by the layout analysis module of an OCR system. An analysis of the main stages of the proposed technique is given below.

2.1. Preprocessing

The proposed technique is applied to gray-scale document images that have a resolution high enough to keep their elements separated. For example, the scanning resolution of a book cover could be as low as 50 dpi, but a densely written document must be scanned at 300 dpi. The document is filtered with a Gaussian filter to remove noise and strengthen the connectivity of its elements. After filtering, the document is binarized using the technique of Gatos et al. [7]

2.2. Connected component analysis

After preprocessing, the index color of the binary image that represents text is chosen, and a simple serial algorithm is used to label the elements of the document based on 4-neighborhood connectivity. Each document image line is scanned looking for text index pixels. When one is found, its upper and left pixels are checked for labels. If they carry the same label, the current pixel gets that label (a pixel with no labeled neighbor starts a new label). Otherwise, the equivalence of the two labels is recorded in an equivalence array. A second scan of the document image lines is then performed, and equivalent labels are replaced with a single one. For each connected element a bounding rectangle (BR) is constructed. We have no knowledge of the nature of the connected element; it could be a character, connected characters or even an image from the document. These bounding rectangles form the skeleton for all further analysis of the page. The position and dimensions of each bounding rectangle are recorded.
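The two-pass labeling described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `label_components`, the list-of-rows image representation (1 for a text pixel, 0 for background) and the union-find form of the equivalence array are assumptions made for the example.

```python
def label_components(image):
    """Two-pass labeling with 4-neighborhood connectivity.

    image: list of rows, each a list of 0/1 values (1 = text index pixel).
    Returns (labels, boxes), where boxes maps a label to its bounding
    rectangle as (x_min, y_min, x_max, y_max).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # equivalence array; index 0 is reserved for background

    def find(i):
        # Follow the equivalence chain to its representative label.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # First pass: assign provisional labels from the upper and left pixels.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if not up and not left:
                parent.append(len(parent))      # start a new label
                labels[y][x] = len(parent) - 1
            elif up and left:
                root_up, root_left = find(up), find(left)
                low, high = min(root_up, root_left), max(root_up, root_left)
                parent[high] = low              # record the equivalence
                labels[y][x] = low
            else:
                labels[y][x] = up or left

    # Second pass: replace equivalent labels and grow bounding rectangles.
    boxes = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = root
                x0, y0, x1, y1 = boxes.get(root, (x, y, x, y))
                boxes[root] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return labels, boxes
```

For a U-shaped stroke next to an isolated pixel, the two arms of the U first receive different labels and are merged in the second pass, so two bounding rectangles are returned.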
The text index pixels that touch the bounding rectangle are the bounding rectangle's edge points.

Fig. 1. Bounding rectangle with its edge points marked.

2.3. Filtering

At this stage, each connected component accompanied by its bounding rectangle represents a region of text index pixels without any further knowledge of its contents. We assume that the set of connected components contains all text components mixed with several nontext ones. We try to identify the nontext or unreadable text components by analyzing their geometrical structure, and we discard them. [3, 18] We specify some size and geometrical filters in order to discard components which are likely to correspond to nontextual objects. The selection of filters is based on the analysis of some internal features, i.e. features inherent to the single connected component and possibly to its neighborhood. The internal features used to eliminate connected components that have a low probability of being text objects are:

Area. The area of a component is defined as the number of its bounding rectangle pixels divided by the scanning resolution of the document. Connected components with area less than 2 mm² are suppressed in the proposed technique, as we suppose they correspond to noise. Punctuation marks are also removed at this stage. This fact is acknowledged and used later in the processing. Connected components with area larger than 100 mm² are also removed, as they probably represent large nontext objects.

Density. It is defined as the ratio between the component area and the area of its bounding box. It permits the detection of sparsely filled or compact connected components. In the proposed technique we remove connected components with density less than 10%, as we suppose they are line art, or greater than 70%, as they are probably images or frames.

Width to Height Ratio. The width to height ratio permits the detection of long or narrow components. It is a filter that depends on the orientation of the text: the width to height ratio in horizontally aligned text corresponds to the height to width ratio in vertically aligned text. In the proposed technique, we prevent connected components whose bounding box has a width to height ratio of more than 150% or less than 30% from forming horizontal text lines, because the proportions of most text characters lie between those limits.

Fig. 2. Filtered connected components.

With all three filter rules used by the proposed technique, there is a small probability of classifying text connected components as nontext. This possibility is acknowledged, and connected components that have been removed are included again in the text area growing step of the proposed technique. The connected components that remain after the filtering stage are considered candidate characters that will form words and text lines.

2.4. Word grouping

At this stage of the proposed technique, the candidate characters are grouped to form words aligned either horizontally or vertically. The grouping is based on the Euclidean distance between the candidate characters' bounding rectangles. [15] First we group the horizontally aligned candidate characters. Let $WH$ be an ordered group of connected components:

$$WH = \{C_1, C_2, \ldots, C_n\} \qquad (1)$$

and let $WH_{averagewidth}$ be the average width of its members' bounding rectangles:

$$WH_{averagewidth} = \frac{\sum_{i=1}^{n} C_i^{Width}}{n}. \qquad (2)$$

For each candidate character $C_l$ that is not a member of any ordered group, or is the first member of an ordered group $WH_k$, the Euclidean distance $d_{ln} = \lVert C_l^{ml} - C_n^{mr} \rVert$ between its middle left BR edge point and the middle right BR edge point of $C_n$ is calculated. If this distance is less than the maximum of $WH_{averagewidth}$ and $C_l^{Width}$ multiplied by a factor $f$, chosen to be equal to 1.0,

$$d_{ln} < \max(WH_{averagewidth}, C_l^{Width}) \cdot f \qquad (3)$$

then $C_l$ is a candidate for the next character of the ordered group $WH$.
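The filter thresholds of the filtering stage and the acceptance test of Eq. (3) can be sketched together as follows. This is a hedged illustration only. The function names are hypothetical, the area is converted to mm² with a factor of (25.4 / dpi)² per pixel (the text only says the pixel count is "divided by the scanning resolution"), and the density is taken as the number of text pixels over the bounding rectangle area; both readings are assumptions.

```python
MM_PER_INCH = 25.4


def is_candidate_character(box_width, box_height, pixel_count, dpi):
    """Apply the area, density and width-to-height filters to one component.

    box_width, box_height: bounding rectangle size in pixels.
    pixel_count: number of text index pixels in the component (assumed
    to be the "component area" used by the density filter).
    """
    mm_per_pixel = MM_PER_INCH / dpi
    # Assumption: bounding rectangle area expressed in square millimetres.
    area_mm2 = box_width * box_height * mm_per_pixel ** 2
    density = pixel_count / (box_width * box_height)
    aspect = box_width / box_height  # for horizontally aligned text
    return (
        2.0 <= area_mm2 <= 100.0      # not noise, not a large object
        and 0.10 <= density <= 0.70   # not line art, not an image or frame
        and 0.30 <= aspect <= 1.50    # typical character proportions
    )


def accepts_next(group_widths, candidate_width, distance, f=1.0):
    """Eq. (3): may the candidate character extend the ordered group?

    group_widths: bounding rectangle widths of the current group members.
    distance: Euclidean distance between the facing middle edge points.
    """
    average_width = sum(group_widths) / len(group_widths)  # Eq. (2)
    return distance < max(average_width, candidate_width) * f
```

For vertically aligned text the same predicate could be reused with the width and height arguments swapped, in line with the remark that the ratio filter depends on the text orientation.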
A list with all candidate members $C_l$ of $WH$ is formed, and the $C_l$ with the smallest Euclidean distance $d_{ln}$ becomes the next character of the ordered group $WH$. If the chosen candidate character $C_l$ is the first member of an ordered group $WH_k$, then the two groups are merged to create the new group $WH$:

$$WH = WH \cup WH_k. \qquad (4)$$

This procedure is repeated until no next candidate character can be found for any ordered group $WH$. The algorithm ensures that all possible grouping has been done in the fewest possible repetitions.

To group the vertically aligned candidate characters in ordered groups $WV$, we use the same iterative algorithm, considering the widths in the vertical direction and calculating and comparing the Euclidean distance $d_{ln} = \lVert C_l^{mb} - C_n^{mt} \rVert$ between the middle top BR edge point of $C_n$ and the middle bottom BR edge point of $C_l$.

Fig. 3. Construction of the horizontal and vertical words using the distance between the connected components.

Horizontal or vertical groups with fewer than five members are discarded. This may give the impression that these words will not be aligned in the result of the technique, but that is not the case. It has been observed that small groups create problems in the text line identification stage. Furthermore, such small words are most often part of a larger text area with the same alignment, so the proposed technique includes them in the text area growing stage.

The next step of the word grouping stage resolves the ambiguity of a candidate character $C_l$ that belongs to both a horizontal word $WH$ and a vertical word $WV$. This is achieved efficiently with a simple rule: if $WV$ has more members than $WH$, then the word $WH$ is destroyed and $C_l$ remains a member of $WV$; otherwise the word $WV$ is destroyed.

2.5. Text line identification

To identify text lines in a document we must merge sets of identified words and candidate characters that failed to form words in the previous steps. These words and candidate characters are separated by large gaps (otherwise they would have been merged in the previous stage of the proposed technique). The first step is to calculate a rough angle $\vartheta_{WH}$ for each horizontal word (and $\vartheta_{WV}$ for each vertical word) as the average slope between the central points of the BR edges of consecutive characters $C_i$:

$$\vartheta_{WH} = \frac{\sum_{i=1}^{n-1} \tan\left(C_i^{rm} C_{i+1}^{lm}\right)}{n}. \qquad (5)$$
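A rough word angle in the spirit of Eq. (5) can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name is hypothetical, each segment joining the right-middle edge point of one character to the left-middle edge point of the next is converted to an angle with `atan2`, and the average is taken over the n − 1 segments, whereas Eq. (5) as printed divides by n.

```python
import math


def rough_word_angle(boxes):
    """Average inclination, in degrees, of a horizontally ordered word.

    boxes: bounding rectangles (x_min, y_min, x_max, y_max) of the member
    characters, ordered left to right; at least two are required.
    Image coordinates are assumed, with y growing downward.
    """
    angles = []
    for (_, ay0, ax1, ay1), (bx0, by0, _, by1) in zip(boxes, boxes[1:]):
        right_mid = (ax1, (ay0 + ay1) / 2.0)  # middle of the right BR edge
        left_mid = (bx0, (by0 + by1) / 2.0)   # middle of the next left edge
        angles.append(math.atan2(left_mid[1] - right_mid[1],
                                 left_mid[0] - right_mid[0]))
    # Assumption: mean over the n - 1 segments between consecutive members.
    return math.degrees(sum(angles) / len(angles))
```

Because the word grouping stage discards groups with fewer than five members, the two-box minimum is always met for words that reach this step.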

Fig. 4. Construction of the horizontal and vertical words using the distance between the connected components.

Then a line is extended from each side of a word, with length $WH_{averagewidth} \cdot f_{word}$ and angle $\vartheta$, for the horizontal words first. The angle $\vartheta$ is measured from the x-axis and the right side of the word. At the left side of the word the angle is $\vartheta$. This line may cross the vertical edges of several candidate characters. The candidate character $C_l$ that is crossed by the line at the smallest distance $d_{ln}$ becomes the last or the first character, depending on the line, of the ordered group $WH$. Again, if the chosen candidate character $C_l$ is the first or last member of an ordered group $WH_k$, and the roughly calculated angles of the two groups, $\vartheta$ and $\vartheta_k$, differ by less than a predefined threshold $\vartheta_{rd}$, then the two groups are merged to create the new group $WH$. The value of $\vartheta_{rd}$ has been chosen to ensure both that errors in the rough calculation of the angle will not prevent two adjacent words from forming a new group, and that unaligned words will not be grouped. The value used is 20°.

Next, the same step is repeated for the vertically aligned words. The angle $\vartheta_{WV}$ is measured from the y-axis and the top side of the word. At the bottom side of the word the angle is $\vartheta_{WV}$. At the end of this stage, the text lines have been identified and are represented as ordered groups of characters $WH$ and $WV$.

2.6. Text line skew calculation

In order to determine the skew angle of a text line, we first estimate its lower and upper base lines. This approach is similar to the procedure used by Marti and Bunke. [11] Specifically, the positions of the bottom edge pixels of the characters $C_i$ of each identified horizontal text line $WH$ are used to construct the set $P_1$ of pixels. The set $P_2$ of pixels is constructed from the positions of the top edge pixels. The sets $P_1$, $P_2$ of pixels approximate the lower and upper contours of the text line.
Formally, they can be represented as follows:

$$P_1 = \{p_i = (x_i, y_i) \mid \text{bottom edge of } C_i\} \qquad (6)$$

$$P_2 = \{p_i = (x_i, y_i) \mid \text{top edge of } C_i\}. \qquad (7)$$

Assume that each set $P$ of $P_1$, $P_2$ has $k$ entries. A linear regression can be applied to this set of points. The goal of skew detection is to find the parameters $a$ and $b$ of a straight line expressed by the following equation:

$$y = ax + b. \qquad (8)$$

For this purpose the mean values of the two variables $x$ and $y$ have to be computed:

$$\mu_x = \frac{1}{k} \sum_{i=1}^{k} x_i, \qquad \mu_y = \frac{1}{k} \sum_{i=1}^{k} y_i. \qquad (9)$$

Then, the line parameters $a$ and $b$ can be obtained using the two following formulas:

$$a = \frac{\sum_{i=1}^{k} x_i y_i - k \mu_x \mu_y}{\sum_{i=1}^{k} x_i^2 - k \mu_x^2}, \qquad b = \mu_y - a \mu_x. \qquad (10)$$

Linear regression minimizes the error between the line and the given set of points. A problem with this approximation is, however, that outliers and punctuation marks in the set $P$ may influence the result and disturb the calculated line. The punctuation marks have already been removed by filtering. For the task of baseline estimation, descender characters can be regarded as outliers. The same holds for capital characters in a line of lowercase characters, or ascender characters, in top line estimation. To reduce their influence on the regression line, the summed square error between the line and the set $P$ is computed by:

$$e = \sum_{i=1}^{k} (a x_i + b - y_i)^2. \qquad (11)$$

If the total error $e$ is larger than a predefined threshold $t_e$, the point $p_i$ with the largest contribution to the sum is eliminated from the set $P$. This procedure is repeated until the error $e$ is smaller than the threshold $t_e$. The procedure is also stopped if a maximum percentage $t_{max}$ of the points has been removed from the set. This is an indication of a poor result, and it is taken into account in the area growing stage.

To estimate the top line and baseline of the vertically aligned text lines we use the sets of pixels $P_1$, $P_2$:

$$P_1 = \{p_i = (x_i, y_i) \mid \text{right edge of } C_i\} \qquad (12)$$

$$P_2 = \{p_i = (x_i, y_i) \mid \text{left edge of } C_i\}. \qquad (13)$$

The skew angle of the text line is $\vartheta = \tan^{-1}(a)$, where $a$ is selected from Table 1. This table is constructed with the presumption that $P_1$ is the baseline of the text line. The result of this stage can be seen in Fig. 4.
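The regression of Eqs. (8) to (11), together with the iterative removal of the worst point, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the values passed for the threshold $t_e$ (`error_threshold`) and the limit $t_{max}$ (`max_removed_fraction`) are placeholders, since the text does not give them. The x coordinates are assumed not to be all equal.

```python
def fit_line(points):
    """Least-squares line y = a*x + b through (x, y) points, Eqs. (9)-(10)."""
    k = len(points)
    mean_x = sum(x for x, _ in points) / k
    mean_y = sum(y for _, y in points) / k
    numerator = sum(x * y for x, y in points) - k * mean_x * mean_y
    denominator = sum(x * x for x, _ in points) - k * mean_x ** 2
    a = numerator / denominator  # assumes the x values are not all equal
    return a, mean_y - a * mean_x


def robust_baseline(points, error_threshold, max_removed_fraction):
    """Fit a line, dropping the worst point while the error is too large.

    Returns (a, b, accepted); accepted is False when the removal limit is
    reached first, which corresponds to the poor-fit indication.
    """
    remaining = list(points)
    max_removed = int(len(remaining) * max_removed_fraction)
    removed = 0
    while True:
        a, b = fit_line(remaining)
        squared = [(a * x + b - y) ** 2 for x, y in remaining]  # Eq. (11)
        if sum(squared) < error_threshold:
            return a, b, True
        if removed >= max_removed or len(remaining) <= 2:
            return a, b, False
        # Eliminate the point with the largest contribution to the error.
        remaining.pop(squared.index(max(squared)))
        removed += 1
```

For a baseline with a single descender, one removal is usually enough: the descender has the largest residual, and the refit on the remaining points follows the baseline. The skew angle then follows as the arctangent of the returned slope `a`.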

Table 1. Selection of the value a used to calculate the skew angle of a text line, based on the regression result of the set of fitted lines.

Accepted result on    Selected a
P_1                   a_1
P_2                   a_2
none                  a_1

Fig. 5. Fitting of a pair of lines in horizontally and vertically aligned text. In the zoomed part of the image, edge pixels can be identified, as well as the poor fit indication.

2.7. Text area growing

The next stage of the proposed technique is the construction of the text areas. The seeds of this procedure are the text lines and their estimated skew. A text area is created for each identified text line. The text area is a rectangle rotated by an angle $\vartheta$ from the x-axis, containing all the text index pixels of the characters $C_i$ that are members of the text line, as shown in Fig. 6. This is done efficiently by using only the edge pixels of the characters' bounding boxes shown in Fig. 1.

Fig. 6. Construction of a text area from a single identified text line.

Two text areas $A_j$, $A_k$ are grown into an area $A_l = A_j \cup A_k$ when the following two conditions hold:

(a) There is no candidate character $C_i$ that is a member of another area $A_m$ and is contained in the resulting rectangle of area $A_l$.

(b) The rotation angles $\vartheta_j$, $\vartheta_k$ of the two areas' rectangles differ by at most a few degrees, $\vartheta_d$. The chosen value for $\vartheta_d$ is 5°, a value estimated from the experiments that allows most adjacent text areas with the same skew to be joined. It has also been observed that adjacent text areas that are aligned differently in a document have skew angles that differ by many times the selected value.

The rectangle of the resulting area $A_l$ contains all the text index pixels of the characters $C_i$ that are members of both areas, and its skew angle $\vartheta_l$ is calculated as a weighted average:

$$\vartheta_l = \frac{\vartheta_j m_j + \vartheta_k m_k}{m_j + m_k} \qquad (14)$$

where $m_j$, $m_k$ are the member counts of characters $C_i$ for each text area. If the rotation angle $\vartheta_k$ of text area $A_k$ is, from the previous stage, the result of two poorly fitted lines, then the rotation angle $\vartheta_l$ of the resulting text area $A_l$ is set to $\vartheta_j$. With this measure, the proposed technique eliminates the skew error that results from poorly fitted regression lines. The step of area growth by joining two areas is repeated until no more areas can be joined.

The next step of this stage is the inclusion into text areas of the candidate characters that failed to form words and of the connected components that have been filtered out. The candidate characters $C_l$ that failed to form words are added to their nearest text area. Text areas are not ordered groups like words and text lines, so the addition of a member is not difficult. The text area's rectangle is grown to include the text index pixels of the newly added member. The rotation angle does not change. The connected components that have been filtered out, i.e. punctuation marks and nontext elements, whose bounding box overlaps a text area are considered parts of that area and will be rotated by the text area's rotation angle $\vartheta$. Connected components outside the text areas are considered noise or frames.
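The angle test and the weighted average of Eq. (14) can be sketched as follows. The sketch covers only the angular part of the growing step: the first condition (no character of a third area inside the merged rectangle) is omitted, the function names are hypothetical, and the behaviour when both areas carry a poor-fit flag is an assumption, since the text describes only the case where one of them does.

```python
def angles_compatible(theta_j, theta_k, theta_d=5.0):
    """Second merge condition: skew angles differ by at most theta_d degrees."""
    return abs(theta_j - theta_k) <= theta_d


def merged_skew(theta_j, m_j, theta_k, m_k, poor_j=False, poor_k=False):
    """Skew angle of the area obtained by joining areas j and k.

    m_j, m_k: number of member characters in each area.
    poor_j, poor_k: True when the area's angle came from poorly fitted lines.
    """
    if poor_k and not poor_j:
        return theta_j  # ignore the unreliable angle of area k
    if poor_j and not poor_k:
        return theta_k  # symmetric case
    # Assumption: with both or neither flagged, fall back to Eq. (14).
    return (theta_j * m_j + theta_k * m_k) / (m_j + m_k)
```

For example, joining a 30-character area at 10° with a 10-character area at 12° gives 10.5°, so the larger area dominates the merged angle.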
At the end of this stage, the text of the document has been localized and its skew has been measured.

2.8. Text area rotation

The last stage of the proposed technique is the rotation of the identified text areas to a vertical or horizontal orientation according to the obtained skew angles. To avoid overlapping target rectangles, the local text areas are moved left as needed. The pixels of each text area are projected onto the target rectangle using bilinear interpolation, which minimizes the artifacts produced by simply rotating binary images. The result of this stage is a single-layer binary image ready to be processed by the layout analysis module of an OCR system.
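Rotation with bilinear interpolation can be sketched as follows. This is a simplified illustration, not the authors' procedure: it rotates a whole small image about its centre instead of projecting one text area onto a target rectangle, the function names are hypothetical, and the 0.5 threshold used to return to a binary image is an assumption.

```python
import math


def rotate_bilinear(image, angle_deg, background=0.0):
    """Rotate a gray-level image about its centre by angle_deg.

    image: list of rows of numbers. Each target pixel is mapped back to
    its source position (inverse mapping) and sampled bilinearly.
    """
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[background] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Inverse rotation gives the source position of this target pixel.
            sx = cos_a * dx + sin_a * dy + cx
            sy = -sin_a * dx + cos_a * dy + cy
            if not (0 <= sx <= width - 1 and 0 <= sy <= height - 1):
                continue  # source falls outside the image: keep background
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            fx, fy = sx - x0, sy - y0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            out[y][x] = top * (1 - fy) + bottom * fy
    return out


def to_binary(image, threshold=0.5):
    """Threshold the interpolated image back to 0/1 (threshold is assumed)."""
    return [[1 if value >= threshold else 0 for value in row] for row in image]
```

To deskew a text area whose measured skew is $\vartheta$, the call would use the opposite angle, for example `to_binary(rotate_bilinear(area, -theta))`, with the sign depending on the coordinate convention in use.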

3. Experimental Results

The proposed technique has been tested on a great variety of complex document images having global or multiple skews. The documents were scanned at resolutions ranging from 50 to 300 dpi. The documents tested come from magazine and book pages, spreadsheets, book covers and advertisements. Some of the results, focusing on the ability of the proposed technique to handle different types of documents with global or multiple skews, can be seen in Fig. 7. In Fig. 8, we present the result of our technique on a complex book cover with local text areas rotated by different angles.

Fig. 7. Experimental result. (a) Skewed magazine page, (b) book cover scanned with 200 dpi and 50 dpi (in centre), (c) advertisement with multiple skewed text areas, (d) skewed paragraph in Japanese and (e) spreadsheet with multiple skewed text areas.

Fig. 8. Experimental results for a complex book cover. (a) Original image. (b) Document image after local skew correction.

To test the accuracy of the proposed technique we executed two experiments: one with the well-known University of Washington set of documents, which contains mostly documents with a single slope, and one with artificially created documents that contain text paragraphs skewed at multiple angles.

For the first experiment we used the full set of 979 documents that constitute the University of Washington English Document Image Database (UW-I). These documents are skewed with a mean average angle of [...]. We skewed each document by 5°, 15°, 30° and 40° and estimated the skew angle using our technique. Then we calculated the absolute detection error, which is defined as the absolute difference between the detected skew angle and the given ground-truth, and its standard deviation. We have also calculated the correlation between the detected skew angle and the given ground-truth. From the results, which are presented in Table 2, we conclude that the proposed technique is robust in handling all possible rotation angles without any variation in its accuracy. Also, the technique is shown to handle all the different types of documents that constitute the UW-I database, as indicated by the calculated standard deviation. Furthermore, the estimated skew angle is strongly correlated with the given ground-truth.

Table 2. Results indicating the robustness of the proposed technique in handling all possible skew angles.

Rotation (degrees)    Detection Error    Standard Deviation    Correlation
5                     [...]              [...]                 [...]
15                    [...]147           0.211                 0.88
30                    0.179              0.305                 0.79
40                    0.339              0.726                 0.48

For the second experiment, we scanned ten document paragraphs at 200 dpi, taken from magazine columns, newspapers and scientific publications. Examples of the paragraphs are shown in Fig. 9. A great effort has been made in aligning the paragraphs correctly. Then we rotated each paragraph by different angles and constructed, for each paragraph, a document that contained all of its rotated versions. Two of the documents are shown in Fig. 10. The results for the absolute detection error and its standard deviation using the proposed technique are reported in Table 3. These results show that the proposed technique can accurately handle documents with text areas rotated by different, multiple skew angles.

Fig. 9. Example document paragraphs used in the second experiment for evaluating the robustness of the proposed technique in handling text areas skewed in multiple angles in a single document.

Fig. 10. Example documents with the rotated paragraphs used in the second experiment.
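The evaluation statistics used in both experiments can be sketched as follows. The function name is hypothetical, and the choice of the population standard deviation is an assumption, since the text does not say which estimator was used; the correlation is the ordinary Pearson coefficient, written out so that the sketch does not depend on a recent Python version.

```python
import math
import statistics


def skew_metrics(detected, ground_truth):
    """Mean absolute detection error, its standard deviation, and the
    Pearson correlation between detected and ground-truth angles.

    Both inputs are equal-length sequences of angles in degrees and must
    each contain at least two distinct values.
    """
    errors = [abs(d - g) for d, g in zip(detected, ground_truth)]
    mean_error = statistics.mean(errors)
    std_error = statistics.pstdev(errors)  # assumption: population form

    mean_d = statistics.mean(detected)
    mean_g = statistics.mean(ground_truth)
    covariance = sum((d - mean_d) * (g - mean_g)
                     for d, g in zip(detected, ground_truth))
    spread_d = sum((d - mean_d) ** 2 for d in detected)
    spread_g = sum((g - mean_g) ** 2 for g in ground_truth)
    correlation = covariance / math.sqrt(spread_d * spread_g)
    return mean_error, std_error, correlation
```

Calling `skew_metrics(detected_angles, true_angles)` on the angles of one rotation setting would give one row of a table laid out like Table 2.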

Table 3. Results indicating the robustness of the proposed technique in handling text areas skewed in multiple angles in a single document.

Rotation    Detection Error    Standard Deviation
[...]

The mean absolute detection error for the documents in this experiment is similar to that of the experiment with the documents that contained globally skewed text.

The experiments were run on an Intel Core2 CPU running at 2.13 GHz. The processing time of a document depends on its complexity, as the technique uses connected component analysis. For the set of documents used in the first experiment that were skewed by 5°, the average processing time was 11.2 sec. The histogram of the processing time for that set is shown in Fig. 11.

Fig. 11. Histogram of the processing time for the set of documents used in the first experiment.

The proposed skew correction technique gives high performance results with documents having well separated characters. To achieve this, it is preferable to use a powerful binarization technique. As mentioned above, our technique includes the binarization technique of Gatos et al. [7] Experimental results with degraded documents and documents with text on complex backgrounds are shown in Fig. 12.

706 P. Saragiotis & N. Papamarkos (a) (b) (c) (d) Fig. 12.

Comparisons We compare our results with three published techniques using the same UW-I database by Chen and

17 Chen s method was based on a recursive morphological transform on a down-sampled image with a regression

Fig. 12. Two experimental results: (a and c) the original document images, (b and d) the corrected final documents.

4. Comparisons

We compare our results with three published techniques evaluated on the same UW-I database: those of Chen and Haralick (the creators of the UW-I databases) (Ref. 4), Bloomberg et al. (Ref. 1), and Yuan and Tan (Ref. 17). Chen's method was based on a recursive morphological transform on a down-sampled image, with a regression method for parameter fitting. Bloomberg's method was projection-profile based and counted pixels along varying scanning lines. Yuan used the convex hulls of the individual components of a document and then extracted the rotation angle from an edge-slope histogram. Yuan provided numerical results for his own work and converted the charts published by Chen and Bloomberg to numerical values. All the experiments used the full set of 979 samples from UW-I against the provided ground truth. Table 4 shows the accumulated percentage of samples versus the absolute detection error of the participating methods.

Table 4. Performance comparison using the 979 real document images in UW-I. Shaded rows are the best performances from Chen, Bloomberg, Yuan and the proposed technique.

                           Absolute Error (Degrees)
Method                      …    0.1     …     …     …     …
Chen (2×3, manual)               89%   93%   97%   99%   99%
Chen (2×2, manual)               75%   88%   93%   95%   97%
Chen (2×3, auto)                 55%   78%   89%   93%   95%
Chen (2×2, auto)                 24%   43%   61%   74%   83%
Bloomberg (2× reduction)         44%   75%   93%   98%   99%
Bloomberg (8× reduction)         37%   64%   82%   90%   95%
Yuan Convex Hull            4%   61%   86%   95%   98%   99%
Proposed Technique         23%   63%   86%   95%   97%   98%

Sources: Figs. 1 and 3 in Ref. 4; Figs. 3 and 6 in Ref. 1; Table 1 in Ref. 17. Chart digitization uncertainty: ±0.5%.

As shown in Table 4, within 0.1° of absolute error, the best performance is achieved by Chen's technique (manual mode, 2×3 structuring element) at 86%, followed by the proposed technique at 63%, then that of Yuan at 61%, Chen's in auto mode at 55%, and that of Bloomberg (quarter-sized) at 44%. All the participants are able to detect about 98% of the samples within 0.5° in their best parameter settings. Figure 13 shows the best performances of the participants listed. Chen's method achieved its best results when its machine-learning algorithm used the same set of samples for both training and testing, which is not an appropriate evaluation procedure.

Fig. 13. Comparison using the test suite UW-I. See Table 4 for the numerical values and the original sources.

For further comparison, we selected the 20 images from the UW binary image library that contain documents with multiple skews. These are scanned two-page documents from a book, each showing a full page and a segment of a second page. We imported these documents into a commercial OCR application and noticed that the segment of the second page could not be identified as a text area, resulting in a loss of recognizable text. Our technique recognized the two differently skewed areas and rotated them, and the result was once again imported into the commercial OCR application. In most cases, the OCR application then recognized the text area correctly and obtained the included text. The results of this experiment are shown in Table 5.

Table 5. Improving an OCR application's text segmentation of 20 UW-I documents by correcting the rotation of differently skewed text areas.

Improved Result    Partly Improved Result    Failed to Improve
18                 1 (I049)                  1 (A001)

5. Conclusions

In this paper, a new technique is proposed for skew correction in documents with several differently skewed text areas. The proposed approach is based on a text localization technique and on a skew angle estimation algorithm applied locally in each identified text area. The technique has proven to be robust in handling a variety of documents, such as magazine and book pages, spreadsheets, book covers and advertisements, in multiple languages and at arbitrary rotation angles. The main contribution of the proposed technique is its ability to correct the skew in documents with several differently skewed text areas.
Its novelties are the vertical and horizontal grouping of candidate characters, which improves the ability to handle a great range of skew angles; the calculation of two skew angles for each identified text line, which improves the technique's accuracy; and the area growing technique, which allows handling of unclassified connected components and punctuation marks.

References

1. D. S. Bloomberg, G. E. Kopec and L. Dasari, Measuring document image skew and orientation, Document Recognition II, Proc. SPIE, San Jose, CA, Vol (6–7 February 1995), pp
2. Y. Cao, S. Wang and H. Li, Skew detection and correction in document images based on straight-line fitting, Patt. Recogn. Lett. 24(12) (2003)
3. W. Y. Chen and S. Y. Chen, Adaptive page segmentation for color technical journal's cover images, Imag. Vis. Comput. 16 (1998)
4. S. Chen and R. M. Haralick, An automatic algorithm for text skew estimation in document images using recursive morphological transforms, Proc. IEEE Int. Conf. Image Processing, Austin, TX (13–16 November 1994), pp
5. C.-H. Chou, S.-Y. Chu and F. Chang, Estimation of skew angles for scanned documents based on piecewise covering by parallelograms, Patt. Recogn. 40(2) (2007)
6. B. Gatos, N. Papamarkos and C. Chamzas, Skew detection and text line position determination in digitized documents, Patt. Recogn. 30(9) (1997)
7. B. Gatos, I. Pratikakis and S. J. Perantonis, Adaptive degraded document image binarization, Patt. Recogn. 39(3) (2006)
8. J. J. Hull, Document image skew detection: Survey and annotated bibliography, in Document Analysis Systems II, eds. J. J. Hull and S. L. Taylor (World Scientific, 1998), pp
9. R. Kapoor, D. Bagai and T. S. Kamal, A new algorithm for skew detection and correction, Patt. Recogn. Lett. 25(11) (2004)
10. Y. Lu and C. L. Tan, A nearest-neighbor chain based approach to skew estimation in document images, Patt. Recogn. Lett. 24(14) (2003)
11. U.-V. Marti and H. Bunke, Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system, Int. J. Patt. Recogn. Artif. Intell. 15(1) (2000)
12. S. Messelodi and C. M. Modena, Automatic identification and skew estimation of text lines in real scene images, Patt. Recogn. 32(5) (1999)
13. O. Okun, M. Pietikäinen and J. Sauvola, Document skew estimation without angle range restriction, Int. J. Docum. Anal. Recogn. (1999)
14. P. Shivakumara and G. H. Kumar, A novel boundary growing approach for accurate skew estimation of binary document images, Patt. Recogn. Lett. 27(7) (2006)
15. C. Strouthopoulos, N. Papamarkos and C. Chamzas, Identification of text-only areas in mixed type documents, Engin. Appl. Artif. Intell. 10(4) (1997)
16. B. Yuan and C. L. Tan, Fiducial line based skew estimation, Patt. Recogn. 38(12) (2005)
17. B. Yuan and C. L. Tan, Convex hull based skew estimation, Patt. Recogn. 40(2) (2007)
18. Y. Zhong, K. Karu and A. K. Jain, Locating text in complex color images, Patt. Recogn. 28(10) (1995)


Panagiotis Saragiotis received his diploma in 1996 and his master's degree in 2004, both in electrical and computer engineering, from Democritus University of Thrace, Greece. He is currently a research and teaching assistant and is studying towards the Ph.D. degree at the Department of Electrical and Computer Engineering, Democritus University of Thrace. His research interests include document analysis and restoration.

Nikos Papamarkos received his diploma degree in electrical and mechanical engineering from the University of Thessaloniki, Greece, in 1979 and the Ph.D. in electrical engineering in 1986 from the Democritus University of Thrace, Greece. From 1987 to 1990 Dr. Papamarkos was a Lecturer, from 1990 to 1996 Assistant Professor, and then Associate Professor at the Democritus University of Thrace, where he is currently Professor. In 1987 and 1992 he also served as a Visiting Research Associate at the Georgia Institute of Technology, USA. His current research interests are in digital signal processing, image processing, pattern recognition, neural networks and computer vision. Professor Nikos Papamarkos is a Senior Member of IEEE.


Part-Based Skew Estimation for Mathematical Expressions Soma Shiraishi, Yaokai Feng, and Seiichi Uchida shiraishi@human.ait.kyushu-u.ac.jp {fengyk,uchida}@ait.kyushu-u.ac.jp Abstract We propose a novel method for the skew estimation on text images containing