Chapter 5

SCENE TEXT BINARIZATION AND RECOGNITION

5.1 BACKGROUND

The previous chapter discussed the detection of text lines from scene images using a run-length based method, along with the elimination of false positives. The next step is to recognize the detected text; in other words, the recognition stage takes a text line as input. In this chapter, a new method called the Adaptive Histogram based Method (AHM) is proposed for binarizing text lines. Existing OCR engines, namely ABBYY and Tesseract (Google), are then used to recognize the text at word and character level rather than training dedicated classifiers. For each text line, the method first segments words using boundary growing and then segments characters from words with the same boundary-growing technique. Finally, the proposed method is compared with Otsu's well-known global thresholding technique for binarization.

5.2 REVIEW OF EXISTING METHODS

There are plenty of binarization methods in document analysis but few in scene text analysis. In this work, we review binarization methods from both areas. Global and local thresholding techniques are quite popular in document analysis; several improvements over them have been proposed recently, and the same ideas have been extended to scene text binarization. Otsu's method is a parameterless global thresholding method. It assumes the presence of two distributions (one for the text and another for the background) and calculates a threshold value so as to minimize the within-class variance, or equivalently maximize the between-class variance, of the two distributions (Otsu 1979, Ye et al. 2001). The two-distribution limit of Otsu's method was removed by Ye et al. (2001), where the degradation modes on the histogram of the image are discarded one by one by recursively applying Otsu's method until only one mode remains. In another work, the global restriction was removed by Farrahi Moghaddam et al. (2010a), who introduced an adaptive method that uses the same concept as Otsu's method but on local patches. A measure based on the global Otsu threshold was used in that work to reveal the non-text regions that have only one class of pixels. Among adaptive threshold binarization approaches, Sauvola's method (Sauvola et al. 2000) is one of the best known. In this method, the threshold value, inspired by Niblack's method (Niblack 1985, Trier and Jain 1995), has been modified in order to capture open non-text regions (Sauvola et al. 2000). The threshold has two parameters to set and estimate. One of the state-of-the-art binarization methods was introduced by Gatos et al. (2006). In this method, a rough binarization of the document image is obtained first (usually using Sauvola's method), and a rough background is then estimated. In the next step, local threshold values are calculated based on the estimated background and some parameters. These threshold values are used to compute the final binarization, which is post-processed to remove noise. A binarization method proposed by Su, Lu, and Tan (2010) placed first in the DIBCO'09 binarization contest (Gatos et al. 2010).
The method consists of four steps: (i) background extraction by polynomial fitting on the rows; (ii) stroke edge detection using Otsu's method on gradient information; (iii) local thresholding by averaging the detected edge pixels within a local neighborhood window; and, finally, (iv) post-processing of the result. The method placed second at DIBCO'09 was proposed by Fabrizio and Marcotegui (Gatos et al. 2010). It is based on the toggle-mapping morphological operator of Fabrizio et al. (2009). To avoid the salt-and-pepper noise associated with toggle mapping, they exclude from the analysis the pixels whose erosion and dilation are too close. Pixels are then classified as text, background, or uncertain, and the uncertain pixels are assigned to text or background according to the class of their boundary. The method placed third at DIBCO'09 was proposed by Rivest-Hénault, Farrahi Moghaddam, and Cheriet (Gatos et al. 2010, Rivest et al. 2011). This method uses the level set framework to locate the boundaries of text strokes and binarize a document image (Rivest et al. 2011). Like the others, it consists of several steps: (i) initialization using a stroke map (SM) (Farrahi Moghaddam et al. 2009); (ii) correction of the SM using the level set framework in erosion mode and local linear models; and, finally, (iii) a second round of level set operations, this time with a stroke gray-level force, which provides the final text regions as the interior regions of the level set function.

Although document images may suffer from severe and variable degradation, it may be assumed that some regions can reliably be labeled as true text or background. This hypothesis is the foundation of many learning-based methods, which start from a rough estimate of the text and background regions and then attempt to learn their behavior in order to classify the regions that fall in the confusion interval. For example, simple thresholding has been used to identify the text and background classes by Don (2001); a noise model is then built and used to adjust the threshold value. Su et al. (2010) present a framework which uses any binarization method to identify three classes, namely text, background, and uncertain pixels, and then reclassifies the uncertain pixels using a classifier trained on the text and background classes. In this method (Su et al. 2010), image contrast, defined from the local maxima and minima, is used instead of the image gradient to detect high-contrast pixels; image contrast is less sensitive to uneven illumination.
Then, the document is segmented using a local threshold estimated from the image contrast. Another method (Perret et al. 2010), an extension of the component tree based on flat zones to hyperconnections, defines the tree by a special order on the hyperconnections and allows non-flat nodes. Its steps are as follows: (i) removal of the background using a hypercomponent tree; (ii) adaptive thresholding based on the values of the image edges, which are detected using the Sobel operator with Otsu thresholding; and, finally, (iii) post-processing. Thanks to the grid-based modelling introduced by Farrahi Moghaddam et al. (2010b), the computational cost of Sauvola's method can be reduced significantly. This enables the multiscale grid-based Sauvola method (Farrahi Moghaddam et al. 2010b), which captures text pixels at high scales and tracks them down to lower scales in order to avoid strongly interfering patterns. In that work, a similar multiscale approach was combined with the AdOtsu method to improve its performance. The adaptive Otsu formula of Bernsen (1986) was the first successful attempt to make Otsu's method adaptive. However, it has some limitations. The main drawback is the presence of the parameter R in the formula, which pushes it away from Otsu's method toward other parameter-based methods such as Sauvola's. A constant value such as 0.1 can be used for R, but this puts an upper limit on the performance of the method. Moreover, learning the parameters from the document image itself is a challenge for all adaptive methods, requiring a thorough understanding of document images. The second limitation is the global Otsu threshold itself: although it is used to stabilize the method and identify the most probable background regions, it limits performance because the global threshold can be completely independent of the local behaviour of text and background. The concept of background estimation has been used in many works (Gatos et al. 2004; Farrahi Moghaddam and Cheriet 2010b; Farrahi Moghaddam and Cheriet 2008; Lettner et al. 2010). For example, Gatos et al. (2004) estimate an approximate background by interpolating the pixel values assigned to background according to a rough binarization on a patch the size of two characters. In another work, Lu and Tan (2007) obtain an estimate of the background using polynomial surface smoothing. It is worth noting that, in contrast to the other methods, this does not look for the accurate value of the background but rather an approximation of the average background.
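The polynomial background estimation mentioned above (Lu and Tan 2007) can be illustrated with a simplified sketch; the function name, the cubic degree, and the row-wise fitting are illustrative assumptions rather than the authors' exact formulation:

```python
import numpy as np

def estimate_background(gray, degree=3):
    """Estimate a smooth background by fitting a low-order polynomial
    to each row of a grayscale image (values in 0..255).
    A simplified stand-in for polynomial surface smoothing."""
    h, w = gray.shape
    xs = np.arange(w)
    bg = np.empty((h, w), dtype=float)
    for r in range(h):
        coeffs = np.polyfit(xs, gray[r].astype(float), degree)
        bg[r] = np.polyval(coeffs, xs)
    return bg
```

Subtracting the estimated background before thresholding flattens slow illumination gradients, which is the property these background-estimation methods exploit.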

Halabi et al. (2009) use a method similar to that of Gatos et al. (2004). In that work, a window swell filter is used to recover disconnected weak strokes; in contrast, we will use a multiscale approach to preserve weak strokes. The method is especially successful for high-intensity document images with degraded backgrounds. Reza Farrahi Moghaddam and Mohamed Cheriet (2012) proposed AdOtsu, an adaptive and parameterless generalization of Otsu's method for document image binarization. Adaptive binarization methods play a central role in document image processing. In their work, adaptiveness is obtained by combining grid-based modelling with an estimated background map, while parameterless behavior is achieved by automatically estimating document parameters such as the average stroke width and the average line height. The method is extended using a multiscale framework and has been applied on various datasets, including the DIBCO'09 dataset, with promising results.

It is observed from the above methods that document OCR engines do not work for camera-based natural scene images, because binarization fails to handle non-uniform backgrounds and non-uniform illumination. Therefore, a poor character recognition rate (67%) is reported for the ICDAR-2003 competition data (Neumann and Matas 2011). This shows that, despite the high contrast of camera images, the best accuracy reported so far is 67% (Chen and Odobez 2005). It is noted that the character recognition rate varies from 0% to 45% (Chen and Odobez 2005) if OCR is applied directly to natural scene images. The experimental results of baseline methods such as Niblack (1986) and Sauvola et al. (1997) show that thresholding techniques give poor accuracy for scene images. It is reported by He et al. (2005) that the performance of these thresholding techniques is not consistent, because the character recognition rate changes as the application and dataset change. Ntirogiannis et al. (2011) proposed a binarization method based on baseline and stroke-width extraction to obtain the body of the text, followed by convex hull analysis with adaptive thresholding to obtain the final text information.

However, this method focuses on artificial text, where pixels have uniform color, rather than on both artificial and scene text, where pixel colors are not uniform. An automatic binarization method for color text areas in images and video, based on a convolutional neural network, was proposed by Saidane and Garcia (2007); its performance depends on the number of training samples. Edge-based binarization for video text images was proposed by Zhou et al. (2010) to improve the video character recognition rate. This method takes the Canny edge map of the input image as input and proposes a modified flood-fill algorithm to close small gaps on the contours. It works well for small gaps but not for large ones. In addition, its primary focus is graphics text in large fonts rather than both graphics and scene text. Recently, Sangheeta Roy et al. (2012) proposed a Wavelet-Gradient-Fusion method for video/image text binarization. In that work, horizontal, vertical, and diagonal information obtained from the wavelet and the gradient of text-line images is fused to enhance the text information. K-means with k=2 is applied on row-wise and column-wise pixels separately to extract candidate text information. Next, connected component analysis merges some sub-components based on a nearest-neighbor criterion. The foreground (text) and background (non-text) are separated based on the observation that the color values at the edge pixels of a component are larger than the color values of the pixels inside the component. Finally, Google Tesseract OCR is used to validate the results, which are compared against baseline thresholding techniques to show that the method is superior in terms of recognition rate on 236 video and 258 ICDAR 2003 text lines.
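The k-means step in Roy et al.'s method can be illustrated with a plain k=2 clustering of gray values; clustering the whole patch at once, rather than row-wise and column-wise separately as they do, is a deliberate simplification, and the function name is illustrative:

```python
import numpy as np

def kmeans2_binarize(gray, iters=20):
    """Split grayscale values into two clusters (text / background)
    with a plain k-means, k=2. The darker cluster is taken as text."""
    vals = gray.astype(float).ravel()
    c0, c1 = vals.min(), vals.max()          # initial centroids at the extremes
    for _ in range(iters):
        mask = np.abs(vals - c0) <= np.abs(vals - c1)
        if mask.all() or not mask.any():
            break                             # degenerate single-cluster case
        c0, c1 = vals[mask].mean(), vals[~mask].mean()
    dark, light = min(c0, c1), max(c0, c1)
    g = gray.astype(float)
    fg = np.abs(g - dark) <= np.abs(g - light)
    return fg.astype(np.uint8) * 255
```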
From the above discussion, it is found that no existing method gives a perfect solution to the binarization and recognition of scene text images. Hence, we propose a new method, the adaptive histogram based method, to overcome the problems of the existing methods.
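Since Otsu's method serves as the comparison baseline throughout the rest of this chapter, a minimal sketch of the threshold computation may be useful; this is an illustrative implementation, not the code used in the experiments:

```python
import numpy as np

def otsu_threshold(gray):
    """Parameterless global threshold: maximize the between-class
    variance over the 256-bin grayscale histogram (Otsu 1979)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0.0
    for t in range(256):
        w0 += hist[t]                 # weight of the "below" class
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        m0 = sum0 / w0                # mean of the "below" class
        m1 = (sum_all - sum0) / (total - w0)
        var = w0 * (total - w0) * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t                     # dark text: foreground where gray <= best_t
```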

5.3 PROPOSED METHODOLOGY

In order to recognize the text lines detected by the text detection methods, we either need to build our own classifier or use available ones. In this work, we choose the second option and use existing OCR engines rather than developing our own. OCR engines accept only binary images for recognition, and separating the foreground (text) from the background (non-text) of a scene text line is challenging due to degradations, loss of information, and distortions. Therefore, we propose a method based on gray-scale information and Otsu thresholding. For the gray image of the input, we perform a sliding-window operation over the text line and, for each window, plot a histogram with pixel values on the X axis and the number of pixels on the Y axis. We then choose the pixels that form the highest peak in the histogram as text pixels and display them as white pixels. We also test Otsu in the same windowed fashion to obtain text pixels. Once a text line image is binarized, we modify the boundary growing proposed earlier for multi-oriented text detection to segment words and characters. The spaces between words and between characters have been studied to fix a dynamic threshold for segmentation. The segmented words and characters are passed to the binarization methods separately, and the binarization results are then passed to ABBYY OCR and Tesseract OCR to recognize the characters.

5.3.1 Word and character segmentation

We modify boundary growing by studying the number of iterations between words and characters while it merges words and characters to extract text lines. First, we segment the words from the text lines, and then we send the segmented words to the same boundary growing to segment characters. The main intuition for segmenting words and characters is that the space between words is larger than the space between characters, and we use this clue to fix a dynamic threshold for segmentation.
Sample segmentation results are shown in Table 5.1, where the method segments words and characters correctly for the input text line image.
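The boundary-growing details are given in the previous chapter; the space-based dynamic threshold idea can be illustrated with a simplified vertical-projection sketch, where the function names and the mean-gap threshold are illustrative assumptions rather than the thesis implementation:

```python
import numpy as np

def gap_segments(binary_line):
    """Columns with no foreground pixels form gaps; each gap is
    returned as a half-open column interval [start, end)."""
    col_has_text = binary_line.any(axis=0)
    gaps, start = [], None
    for x, has in enumerate(col_has_text):
        if not has and start is None:
            start = x
        elif has and start is not None:
            gaps.append((start, x))
            start = None
    return gaps                       # a trailing empty run is ignored

def classify_gaps(gaps):
    """Dynamic threshold: gaps wider than the mean gap width are taken
    as word breaks, the rest as character breaks."""
    widths = [e - s for s, e in gaps]
    if not widths:
        return [], []
    thr = sum(widths) / len(widths)
    words = [g for g, w in zip(gaps, widths) if w > thr]
    chars = [g for g, w in zip(gaps, widths) if w <= thr]
    return words, chars
```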

Table 5.1. Sample results of word and character segmentation (columns: Input text line | Word segmentation | Character segmentation; images not reproduced here)

5.3.2 Adaptive histogram based method for binarization

We observe from the pixel values in text lines detected by the text detection method that the text pixels within each character component have nearly uniform color compared with the whole word. This observation leads us to propose the Adaptive Histogram based Method (AHM). For each text line image, we compute the height (H) of the text block and take it as the height of the window; the same value is taken as the width, giving a square window. We then move this square as a sliding window over the text line image. For each window position, we plot a histogram with gray values on the X axis and the number of pixels on the Y axis and choose its highest peak. All pixels belonging to the highest peak are displayed as white pixels in a separate image. This process continues until the end of the text line. Sample results of Otsu are shown in Table 5.2, where Otsu is tested on the whole image, without the sliding window, to assess its effectiveness. Table 5.2 shows that Otsu on the whole image does not give good binary results because of the complex backgrounds in the images. Therefore, we conclude that the text lines detected by the text detection method are necessary to reduce the effect of complex backgrounds, as shown in Table 5.3, where we apply Otsu on each sliding window exactly as in the proposed AHM; we call this the Adaptive Otsu Method (AOM). Otsu without the sliding window over the text line image is simply called the Otsu method. Table 5.3 shows that the results of AOM are better than those in Table 5.2 and that, comparing AOM with the proposed AHM, the results given by the proposed method are better still. It is also observed from Table 5.3 that Otsu on the whole text line does not give good results compared with AOM and AHM. This is confirmed by the recognition results, shown in double quotation marks, where the proposed AHM gives better binarization results for scene text line images than Otsu and AOM. In summary, Otsu on the whole text line without the sliding operation is not good, while the adaptive AOM and AHM are good for scene text line image binarization.

Table 5.2. Sample results of Otsu on the whole image (columns: Input image | Otsu on whole image; images not reproduced here)
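The AHM procedure described above can be sketched as follows; the non-overlapping window step and the exact-peak membership test are simplifying assumptions (in practice a tolerance around the peak gray value may be needed). Replacing the peak selection with a per-window Otsu threshold gives the AOM variant:

```python
import numpy as np

def ahm_binarize(gray_line):
    """Adaptive Histogram based Method (sketch): slide an H x H window
    (H = text-line height) across the line; in each window, pixels whose
    gray value falls in the tallest histogram bin are marked white,
    following the chapter's assumption that the dominant peak within a
    text-line window corresponds to text."""
    h, w = gray_line.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for x0 in range(0, w, h):                  # non-overlapping steps (assumed)
        win = gray_line[:, x0:x0 + h]
        hist = np.bincount(win.ravel(), minlength=256)
        peak = hist.argmax()                   # dominant gray value here
        out[:, x0:x0 + h][win == peak] = 255
    return out
```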

Table 5.3. Sample results of the proposed AHM compared with AOM (recognized text shown in double quotes; input text-line images not reproduced here)

Input text | Otsu | Adaptive Otsu Method (AOM) | Proposed Adaptive Histogram based Method (AHM)
(line 1)   | "flst.city HOSPITAL"          | "flst.city HOSPITAL"          | "BELFAST CITY HOSPITAL"
(line 2)   | "1.QIAN A1RIH0RCE"            | "1.QIAN A1RIH0RCE"            | "INDIAN AIR FORG LOUNGE-1"
(line 3)   | "WO ENTRY] irorl CANDIDATES"  | "WO ENTRY] irorl CANDIDATES"  | "NO ENTRY FOR CANDIDATES"

5.4 EXPERIMENTAL RESULTS

We consider different datasets to show that the proposed method is capable of handling diverse situations. The proposed method is tested on 312 High-resolution Camera Images (HCI), 230 Low-resolution Mobile camera Images (LMI), and 210 images from the standard ICDAR-2003 competition data. In total, the proposed method is tested on 752 images to show that it is superior to existing methods. For all three datasets, we test Otsu on whole text lines, Otsu on sliding windows (AOM), and the proposed histogram on sliding windows (AHM) to study their effectiveness. Otsu and AOM are considered as existing methods for the comparative study, since Otsu is a well-known method for document binarization. The character recognition rate is used as the evaluation measure. The results of the binarization methods are sent to both ABBYY and Tesseract OCR to obtain recognition results; the recognized strings are shown in double quotation marks in Table 5.3. We have also conducted experiments on words and characters to test the character recognition rate with both OCR engines. These experiments show that the character recognition rate improves when a segmented word is given as input, because its background complexity is lower than that of a full text line.

5.4.1 Experiments on high-resolution camera images (HCI)

Table 5.4 shows sample results of the proposed and existing methods on the HCI data, where one can notice that the proposed AHM gives better results than Otsu and AOM. The quantitative results are reported in Table 5.5, where the character recognition rates given by ABBYY and Tesseract OCR for the proposed AHM are better than those of Otsu and AOM at text line, word, and character level.
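The chapter does not spell out how the character recognition rate is computed; a common edit-distance-based definition, adopted here purely as an assumption, can be computed as:

```python
def character_recognition_rate(gt, rec):
    """Assumed CRR definition, common in OCR evaluation:
    1 - edit_distance(ground_truth, recognized) / len(ground_truth),
    floored at zero."""
    m, n = len(gt), len(rec)
    # Classic dynamic-programming Levenshtein distance
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if gt[i - 1] == rec[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return max(0.0, 1.0 - d[m][n] / m) if m else 0.0
```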
ABBYY OCR is more advanced than Tesseract and gives better results than Tesseract OCR in all cases in our experiments. However, it is observed from Table 5.5 that the character recognition rate at character level is lower than at word and line level, in contrast to our discussion on word and character segmentation. This is because, when the sliding-window operation is applied to a segmented character, the methods fail to select the global parameter for Otsu and the highest peak for the proposed method correctly, even though the background complexity is lower than that of words. This is not the case for words: both OCR engines give a higher character recognition rate at word level than at text line level. Table 5.5 also reports Otsu on the whole image, which gives the worst accuracy of all the methods, including the proposed one. Overall, we can infer that the proposed method works best for scene text recognition at word level.

Table 5.4. Sample results on high-resolution camera images (recognized text shown in double quotes; input text-line images not reproduced here)

Input text | Otsu | AOM | Proposed AHM
(line 1)   | "1 NO ENTRY ror, CANDIDATES" | "1 NO ENTRY ror, CANDIDATES" | "NO ENTRY FOR CANDIDATES"
(line 2)   | "INDIAN AIR FORG"            | "JNDIAN AIR FOR@"            | "JNDIAN AIR FOR@"
(line 3)   | "NO SHOKINO"                 | "NO SHOKINO"                 | "NO SMOKING"

Table 5.5. Character recognition rate (%) for the HCI data

                    ABBYY OCR                          Tesseract OCR (Google)
Methods   Image    Text     Word     Character    Image    Text     Word     Character
AHM       -        89.32    91.15    88.00        -        85.43    86.34    84.54
AOM       -        42.80    39.35    39.02        -        38.52    39.00    37.06
Otsu      42.40    45.69    44.64    43.55        37.67    40.22    42.14    40.55

5.4.2 Experiments on low-resolution mobile camera images (LMI)

The objective of this experiment is to show that the proposed method, which works for high-resolution images, also works well for low-resolution text images. Sample results of the proposed and existing methods are shown in Table 5.6, where one can see that the proposed method gives better results than the existing methods. The quantitative results at image, line, word, and character level, given by both ABBYY and Tesseract OCR, are shown in Table 5.7, where the proposed method gives better character recognition rates than the existing methods at all levels, with the best rate at word level. The reasons for the poor accuracy of the existing methods are the same as those discussed in the previous section.

Table 5.6. Sample results for low-resolution mobile camera images (recognized text shown in double quotes; input text-line images not reproduced here)

Input text | Otsu | AOM | Proposed AHM
(line 1)   | "INFORMATION TECHNOLOG\ SECTION" | "INFORMATION TECHNOLOG\ SECTION" | "INFORMATION TECHNOLOGY SECTION"
(line 2)   | "CONFIDEV7I\L ROOM NOEVTRY"      | "CONFIDEV7I\L ROOM NOEVTRY"      | "CONFIDENTIAL ROOM NO ENTRY"
(line 3)   | "LIBRflRV NOTICE BOARD"          | "LIBRflRV NOTICE BOARD"          | "LIBRAR! NOTICE BOARD"

Table 5.7. Character recognition rate (%) for the LMI data

                    ABBYY OCR                          Tesseract OCR (Google)
Methods   Image    Text     Word     Character    Image    Text     Word     Character
AHM       -        86.20    87.23    85.92        -        82.49    83.72    81.66
AOM       -        43.00    47.16    43.84        -        39.19    37.04    38.43
Otsu      36.84    42.59    46.34    42.61        28.64    38.29    36.51    37.61

5.4.3 Experiments on ICDAR 2003 data

This publicly available dataset is a benchmark for scene text detection. Our method is tested on it to show that the proposed method is also suitable for this challenging data, which exhibits complex backgrounds, non-uniform illumination, and other unfavourable characteristics of scene text. Sample results of the proposed and existing methods are shown in Table 5.8, where one can see that the proposed method gives better results than the existing methods. The quantitative results at image, line, word, and character level, given by both ABBYY and Tesseract OCR, are shown in Table 5.9, where the proposed method gives better character recognition rates than the existing methods at all levels, with the best rate at word level. The reasons for the poor accuracy of the existing methods are the same as those discussed in Section 5.4.1.

Table 5.8. Sample results for ICDAR-2003 competition data (recognized text shown in double quotes; input text-line images not reproduced here)

Input text | Otsu | AOM | Proposed AHM
(line 1)   | "$ ' HARWICH"      | "$ ' HARWICH"        | "HARWICH COURT HOUSE"
(line 2)   | "FLATS 61to69"     | "FLATS 61to69"       | "FLATS 61to69"
(line 3)   | "APPLICATION FORM" | "APPLICATION FOEM,"  | "APPLICATION FORM,"

Table 5.9. Character recognition rate (%) for the ICDAR 2003 data

                    ABBYY OCR                          Tesseract OCR (Google)
Methods   Image    Text     Word     Character    Image    Text     Word     Character
AHM       -        89.06    90.08    88.62        -        82.36    85.32    81.11
AOM       -        39.54    40.42    37.39        -        32.54    38.72    34.06
Otsu      43.08    42.32    43.71    40.05        36.81    37.32    38.01    35.54

5.5 CONCLUSIONS

This chapter has presented a new binarization method for scene text recognition. The method exploits the color information of character components, based on the observation that the pixels within each component have nearly uniform values. With this intuition, we proposed an adaptive histogram based method that selects uniform color values by performing a sliding-window operation over text lines, words, and characters. This simple idea works better than the well-known Otsu method and the adaptive Otsu method. We also modified the boundary-growing method to segment words and characters from text-line images based on the number of iterations during growing; this segmentation works for multi-oriented text lines as well. Experimental results show that the character recognition rate at word level improves over the rate at text line level, but not at character level.