TEXT DETECTION SYSTEM FOR THE BLIND


Marcin Pazio, Maciej Niedźwiecki, Ryszard Kowalik, Jacek Lebiedź
Faculty of Electronics, Telecommunications and Computer Science, Department of Automatic Control, Gdańsk University of Technology
Narutowicza 11/12, Gdańsk, Poland
E-mails: {mapa, maciekn, kowalik, jacekl}@eti.pg.gda.pl

ABSTRACT

A system capable of localizing and reading aloud text embedded in natural scene images can be very helpful for blind and visually impaired persons: by providing information useful in everyday life, it increases their confidence and autonomy. Even though the currently available optical character recognition (OCR) programs are fast and accurate, most of them fail to recognize text embedded in natural scene images. The goal of the algorithm described in this paper is to localize text-like image regions and pre-process them in a way that makes OCR work more reliably. The approach is based on color image segmentation and segment shape analysis. Preliminary tests have shown that the proposed algorithm offers a satisfactory detection rate and is fairly robust to typical text distortions, such as slant, tilt and bend.

1. INTRODUCTION

Advances in modern information technology and digital signal processing make it possible to design devices and computer applications that help visually impaired persons. With increasing life expectancy, the demand for such equipment grows steadily. GPS-based navigation systems for blind pedestrians, such as the system recently developed in Gdańsk [1], [2], support blind persons when they go outside without assistance. Experiments performed in urban environments by blind volunteers show that, in addition to providing path planning and on-line guidance, such portable navigation devices can play a very important role in exploring the surrounding environment [3].
When combined with an electronic map and its embedded city database, the navigation unit allows blind users to learn about their whereabouts, e.g. to get information about the surrounding shops, service points, bus and tram stops, offices and institutions. It turns out that this information-support aspect of navigation systems is very much appreciated by blind persons.

Portable text reading devices are useful extensions of navigation systems, as they can provide valuable pieces of information that are usually not contained in electronic databases. Street information boards, shop signboards, text signs beside office entrances and traffic signs with text content (such as stop signs) are often the source of important or helpful text messages that can be acquired and read aloud to a blind person by a camera-based text reading system.

Since commercially available OCR systems are designed to process typical clean documents, scanned under good illumination conditions, they usually fail to read text messages embedded in natural scene images, where the text background is textured or heavily cluttered, text fonts may differ in size, orientation and alignment, lighting conditions are uneven, etc. Therefore, to use OCR software successfully, one should localize text areas first [4].

Text localization in natural scenes has been the subject of many recent studies [4], [5], [6], [7], [8], [9], [10], [11], [12]. The problem is usually solved in three steps. First, the image is searched for text-like areas.
The most commonly used approaches involve:

(1) Color-based image segmentation, followed by shape analysis of feasible segments [6], [10]
(2) Local analysis of the image edge maps [6], [10]
(3) Application of morphological operations (such as top-hat transformations) [6], [7]
(4) Local analysis of the spatial frequency image structure (using Gabor filters, for example) [5], [9]

The classifiers, usually combining many different local features, range from very simple OR-type combiners, through time-efficient cascade detection systems, to fairly complex decision networks optimized using machine learning methods [12].

In the second step, the detected text areas are combined into larger blocks or strings. This task is usually performed using connected component analysis based on different similarity criteria, such as similarity of size, texture and color, spatial closeness, alignment, etc. Finally, the bounding boxes of the text strings are generated and the annotated image is passed on to the OCR software for higher-level processing.

2. SUMMARY OF THE TEXT-DETECTION ALGORITHM

One of our main concerns when starting the text detection project was how to guarantee a sufficient degree of robustness of the text localization algorithm to typical text distortions, such as slant, tilt and bend. In the case considered, text distortions are practically unavoidable. First, images of natural scenes are projections from 3-D space, and therefore many text elements are subject to perspective distortions. Second, blind users are not very good at keeping the camera horizontally aligned, which introduces an additional degree of slant even if perspective distortions are absent. Therefore, even though we have focused on localization of horizontally aligned text strings (which constitute the majority of text messages embedded in natural scene images), we had to assume, right from the beginning, that our system should cope with a high degree of text slant, up to about ±45°.

2007 EURASIP 272

The proposed text detection algorithm combines the classical approaches mentioned in Section 1 with some new techniques. It consists of three steps: color image segmentation, feature-based segment filtration and hierarchical segment clustering. All steps are described in some detail in the subsequent sections.

3. COLOR IMAGE SEGMENTATION

Most text messages encountered in the outdoor environment are written with colored letters on a background that assures adequate contrast. Usually the letters have the same color within the text. Therefore, information about colors may be very useful in spotting text regions. The segmentation method we use tries to preserve as much of the color content of the original image as possible.

Image segmentation is, in general, a time-consuming operation, especially when it is required to arrive at a reasonable number of properly shaped segments. For the OCR to work properly, the shapes of segments should preserve the shapes of letters. The proposed solution is a compromise between processing speed and accuracy. The segmentation algorithm was designed so as to preserve the basic properties of letters, such as their height, width, general shape and color. The shape of the letter-like segments can be corrected later, at the second stage of processing called resegmentation.

The applied segmentation algorithm is of the region growth type. A brief summary of the segmentation procedure is given below.
(1) Convert the original image to the Lab color model.
(2) Create the edge image based on the L component of the original image.
(3) Equalize histograms of the L, a and b components using 5-bit coding.
(4) Create the 3-D Lab histogram of the image, neglecting all edge pixels.
(5) Find the color corresponding to the maximum of the histogram.
(6) Take any pixel corresponding to the histogram maximum and regard it as a seed pixel. Grow the segment around the seed pixel; when finished, correct (decrease) the corresponding histogram values.
(7) If all values of the corrected histogram are zero, stop; otherwise go to step 5.

In the first step the image is converted to the Lab color space. This model of color coding allows one to choose between lightness and color as properties of the processed image. The second step creates an auxiliary edge image based on the L component of the Lab model.

Histogram equalization, performed in the third step, significantly reduces the size of the 3-D Lab histogram while preserving image readability. A typical 24-bit RGB color image of a natural scene contains less than […] different colors from the palette of $2^{8 \cdot 3} = 16\,777\,216$ possible ones. The 3 × 5 bit coding is sufficient to preserve important image details; at the same time it reduces the size of the histogram, created in the fourth step, to $2^{3 \cdot 5} = 32\,768$ values.

The result of the fourth step is the 3-D histogram matrix. To avoid creating segments that consist of edge pixels, all such pixels are excluded from the histogram. The fifth step finds the maximum of the histogram. The L, a and b coordinates of this maximum determine the color of the seed pixels. The iterative procedure performed in the sixth step grows segments around the seed pixels.
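Steps (3) to (5) of the procedure above can be illustrated with a short NumPy sketch. It is only one plausible reading of the text, not the authors' implementation: the rank-based equalization, the function names and the boolean `edge_mask` input are assumptions introduced here, and the Lab conversion and edge detection of steps (1) and (2) are taken as already done.

```python
import numpy as np


def equalize_to_5_bits(channel):
    """Histogram-equalize one channel and code it with 5 bits (levels 0..31).

    Assumption: equalization maps each value through the empirical CDF.
    """
    flat = np.asarray(channel).ravel()
    _, inverse, counts = np.unique(flat, return_inverse=True, return_counts=True)
    cdf = np.cumsum(counts) / flat.size          # values in (0, 1]
    levels = np.minimum((cdf * 32).astype(int), 31)
    return levels[inverse].reshape(np.shape(channel))


def lab_histogram(L5, a5, b5, edge_mask):
    """Build the 32 x 32 x 32 Lab histogram, skipping pixels marked as edges."""
    hist = np.zeros((32, 32, 32), dtype=np.int64)
    keep = ~edge_mask
    # np.add.at accumulates correctly when the same bin occurs many times.
    np.add.at(hist, (L5[keep], a5[keep], b5[keep]), 1)
    return hist


def seed_color(hist):
    """Return the (L, a, b) bin indices of the histogram maximum."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(hist), hist.shape))
```

In the full loop, the segment grown around a seed of this color would decrease the corresponding histogram bins, and the search for the maximum would repeat until the histogram is empty.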
The weighted distance in the Lab space is used as the region homogeneity measure; namely, a new pixel is added to the existing segment if

$$\delta = w_1 |L_a - L_s| + w_2 \left( |a_a - a_s| + |b_a - b_s| \right) \le \delta_{\max}$$

where $L_a, a_a, b_a$ and $L_s, a_s, b_s$ denote the Lab color coordinates of the analyzed pixel and the seed pixel, respectively. The weights $w_1$ and $w_2$ (tuned experimentally) were introduced to make the analysis less sensitive to effects caused by uneven lighting. Each pixel added to the existing segment modifies (decreases) the corresponding histogram value, unless the pixel lies on an edge. The last, seventh step checks the stopping condition; when the stopping criterion is fulfilled, the segmentation process ends.

4. FEATURE-BASED SEGMENT FILTRATION

The goal of feature-based segment filtration is to reduce the number of candidate segments considered further. Several geometrical properties of segments are analyzed. Segments that do not have a letter-like shape are marked as not letters. The features we use for such a preliminary screening are:

- the absolute segment size
- the relative segment height
- the height to width ratio
- the bounding rectangle fill ratio versus the height to width ratio

4.1 Absolute segment size

The absolute segment size test allows us to eliminate very small and very large segments. Small segments, containing less than 10 pixels, are usually segmentation artifacts, or they correspond to characters that are too small to be properly recognized by OCR. Large segments, with size comparable to the image size, usually correspond to non-text objects (e.g. architectural elements or the sky).

4.2 Relative segment height

The relative segment height test amounts to checking whether the relative segment height $\alpha$ obeys the following inequality

$$\alpha = \frac{S}{w} \ge \alpha_{\max}$$

where $S$ denotes the segment area (in pixels), $w$ denotes the width of the bounding rectangle and $\alpha_{\max}$ is the experimentally determined threshold. This test allows us to eliminate small squares and vertical elements that are too small to be considered as letters or digits.

4.3 Height to width ratio

The height to width ratio test checks whether the height to width ratio $\beta$ obeys

$$\beta = \frac{h}{w} \le \beta_{\max}$$

where $h$ and $w$ denote the height and width of the bounding rectangle, respectively. This test allows us to eliminate segments that are too narrow and too high. In particular, it eliminates high, vertical segments (usually corresponding to architectural details) which can easily be confused with the letter I.

4.4 Bounding rectangle fill ratio versus the height to width ratio

The bounding rectangle fill ratio versus the height to width ratio test is based on the intuitive assumption that only vertically positioned, I-shaped alphanumeric characters are close in shape to rectangles. Therefore most of the vertical rectangle-like shapes may be eliminated. In real images this relation is less trivial. Figure 1 shows typical values of the two ratios observed for the letter-shaped segments (each point corresponds to one segment). The segments with relatively high values of the fill ratio correspond to fragments of letters, usually obtained as a result of oversegmentation. In most cases oversegmentation is caused by uneven lighting conditions. The test is a graphical one: all segments that fall into the shaded areas in Figure 1 are rejected as not letters.

Fig. 1. The bounding rectangle fill ratio versus the height to width ratio test.

Fig. 2. Formation of elementary text clusters.

5. CONNECTED COMPONENT ANALYSIS

In order to localize large text-like image areas we use a hierarchical clustering procedure.
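Before the clustering details, the first three screening tests of Section 4 can be summarized in a minimal sketch. The `Segment` class, the function name and all threshold defaults except the 10-pixel minimum are hypothetical placeholders, the direction of the relative height test follows the description of what that test eliminates, and the graphical fill-ratio test of Section 4.4 is left out because it depends on the shaded regions of Figure 1.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    area: int    # S: number of pixels in the segment
    width: int   # w: width of the bounding rectangle
    height: int  # h: height of the bounding rectangle


def passes_screening(seg, image_area, min_area=10, max_area_fraction=0.25,
                     alpha_thr=4.0, beta_max=5.0):
    """Return True if the segment survives the tests of Sections 4.1 to 4.3.

    Only min_area=10 comes from the text; the other defaults are placeholders.
    """
    # 4.1 Absolute size: drop tiny artifacts and regions comparable to the image.
    if seg.area < min_area or seg.area > max_area_fraction * image_area:
        return False
    # 4.2 Relative height alpha = S / w: reject segments that are too small
    # (the direction of this comparison is an assumption).
    if seg.area / seg.width < alpha_thr:
        return False
    # 4.3 Height to width ratio beta = h / w must not exceed beta_max.
    if seg.height / seg.width > beta_max:
        return False
    return True
```

For example, a 10 x 20 pixel segment with 120 pixels passes with these placeholder thresholds, while a 2 x 40 pixel vertical bar is rejected by the height to width test.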
First, the analysis is carried out at the segment level; its goal is to combine isolated letter-like segments into larger text-like structures, further referred to as elementary text clusters. At the second stage an iterative procedure is used to combine elementary clusters into larger text structures, called text chains.

5.1 Formation of elementary text clusters

Three types of criteria, based on horizontal alignment, relative height and color similarity, are used to create elementary clusters. The search starts by defining the left and right square neighborhoods of the analyzed segment (such as segment A in Figure 2). The height and vertical location of such neighborhoods are identical to the height and vertical location of the analyzed segment. When the left or right neighbor has already been found, only one neighborhood is examined. Each segment with a center of gravity located inside the left or right neighborhood is examined to check whether it is similar to the analyzed segment. Such a selection of candidate neighboring segments guarantees a reasonable tolerance to text distortions.

First, the inter-segment relative height index $\gamma$ is computed and compared with the upper and lower thresholds (determined experimentally)

$$\gamma_{\min} \le \gamma = \frac{h_a - h_c}{h_a} \le \gamma_{\max}$$

where $h_a$ denotes the height of the analyzed segment and $h_c$ the height of the compared segment. Second, the color similarity is examined using the weighted distance measure test

$$\delta = w_1 |\bar{L}_a - \bar{L}_c| + w_2 \left( |\bar{a}_a - \bar{a}_c| + |\bar{b}_a - \bar{b}_c| \right) \le \delta_{\max}$$

where $\bar{L}_a, \bar{a}_a, \bar{b}_a$ and $\bar{L}_c, \bar{a}_c, \bar{b}_c$ denote the mean Lab color coordinates of the analyzed segment and the compared segment, respectively.
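The two similarity tests above can be written as a small sketch. It assumes that the relative height index is the height difference normalized by the height of the analyzed segment; the function names and all threshold and weight defaults are hypothetical, since the text only says that they were determined experimentally.

```python
def relative_height_index(h_a, h_c):
    """gamma: height difference relative to the analyzed segment height h_a."""
    return (h_a - h_c) / h_a


def color_distance(lab_a, lab_c, w1, w2):
    """Weighted distance between two mean Lab colors given as (L, a, b) tuples."""
    L_a, a_a, b_a = lab_a
    L_c, a_c, b_c = lab_c
    return w1 * abs(L_a - L_c) + w2 * (abs(a_a - a_c) + abs(b_a - b_c))


def is_similar(h_a, lab_a, h_c, lab_c, gamma_min=-0.5, gamma_max=0.5,
               w1=0.5, w2=1.0, delta_max=20.0):
    """Apply the height test first, then the color test (placeholder thresholds)."""
    gamma = relative_height_index(h_a, h_c)
    if not gamma_min <= gamma <= gamma_max:
        return False
    return color_distance(lab_a, lab_c, w1, w2) <= delta_max
```

Choosing `w1` smaller than `w2` down-weights lightness differences, which is one way to reflect the stated aim of reducing sensitivity to uneven lighting.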

From all segments that fulfill the similarity criteria described above, we choose the one that is closest to the analyzed segment; as a measure of closeness of two segments we use the distance between their centers of gravity. Only one segment in each neighborhood (if any) can be chosen.

For each segment belonging to the elementary text cluster, the following coefficients are evaluated; they characterize the slopes of the line segments connecting the center of gravity of the analyzed segment with the centers of gravity of its left and right neighbors (see Figure 2)

$$m_l = \frac{x - x_l}{d_l}, \quad n_l = \frac{y - y_l}{d_l}, \qquad m_r = \frac{x - x_r}{d_r}, \quad n_r = \frac{y - y_r}{d_r}$$

$$d_l = \sqrt{(x - x_l)^2 + (y - y_l)^2}, \qquad d_r = \sqrt{(x - x_r)^2 + (y - y_r)^2}$$

where $(x, y)$, $(x_l, y_l)$ and $(x_r, y_r)$ are the centers of gravity of the analyzed segment and of its left and right neighbors. Finally, for all segments that have both the left and right neighbors (as in Figure 3), the mean slope coefficients are computed

$$m = \frac{m_l + m_r}{2}, \qquad n = \frac{n_l + n_r}{2}$$

Otherwise we set $m = m_l$, $n = n_l$ (if only the left neighbor exists) or $m = m_r$, $n = n_r$ (if only the right neighbor exists).

5.2 Formation of text chains

At the second stage of clustering, elementary clusters are grouped into text chains. Two elementary clusters can be considered for merging if their horizontal and vertical distances do not exceed certain limits. The horizontal and vertical distances between two clusters are defined as the horizontal and vertical distances between the centers of gravity of their closest (in the Euclidean sense) boundary segments, i.e. leftmost or rightmost segments. It is required that the inter-cluster distance, measured in both directions, be smaller than half of the average height of each of the merged clusters. The average cluster height is calculated as the arithmetic mean of the heights of its component segments.

In the case of the three clusters shown in Figure 3, in order to check whether the first cluster can be merged with the second cluster or the third one, one should check the distances between the centers of gravity of the segments denoted as A, B and C.
Since the vertical distance between segments A and C exceeds the threshold defined above, the third cluster must be rejected. In contrast, since both the horizontal and vertical distances between segments A and B are sufficiently small, the second cluster can be considered as a match for the first cluster.

Every pair of clusters that passes the distance test described above is subject to a more detailed consistency check, based on the following weighted similarity measure

$$D(k,l) = c_1 D_{Lab}(k,l) + c_2 D_h(k,l) + c_3 D_{mn}(k,l)$$

where $k$ and $l$ are the numbers of the compared clusters,

$$D_{Lab}(k,l) = |\bar{L}(k) - \bar{L}(l)| + |\bar{a}(k) - \bar{a}(l)| + |\bar{b}(k) - \bar{b}(l)|$$

denotes the average (unweighted) distance in the Lab color space, based on comparison of the average Lab components determined for both clusters,

$$D_h(k,l) = |\bar{h}(k) - \bar{h}(l)|$$

denotes the absolute difference between the average cluster heights $\bar{h}(k)$ and $\bar{h}(l)$, and finally

$$D_{mn}(k,l) = |\bar{m}(k) - \bar{m}(l)| + |\bar{n}(k) - \bar{n}(l)|$$

is the cluster slope similarity measure, based on comparison of the average $m$ and $n$ coefficients determined for each cluster.

Fig. 3. Formation of text chains.

6. PRELIMINARY EXPERIMENTAL RESULTS AND FUTURE RESEARCH

The proposed text detection algorithm was tested on a large number of photographs taken in the urban environment. Additionally, it was cross-checked using the benchmark database prepared for the ICDAR 2003 Robust Reading Competition. Since our system is still under development, we focused on qualitative aspects of its performance. The two text-detection examples shown in Figures 4 and 5 confirm that one of our main objectives, high tolerance to text slant, was actually attained.
To further increase the robustness of the system and its detection efficiency, we continue research in several directions, which include:

- resegmentation: to improve text readability, the detected text areas can be processed again using more sophisticated segmentation algorithms
- text rectification: OCR programs have problems reading heavily slanted text; slant elimination can be performed easily using the segment and cluster slope parameters described in Section 5.1
- detection of large isolated text signs: the current version of the program rejects all isolated segments, even if they are letter- or digit-shaped; while in the case of small segments this is usually a good strategy, for large segments it may result in losing important pieces of information, such as single-digit bus and street numbers

Fig. 4. Original image and the results of text detection.

Fig. 5. Original image and the results of text detection.

REFERENCES

[1] S. Ceranka and M. Niedźwiecki, Application of particle filtering in navigation system for the blind, 7th International Symposium on Signal Processing and its Applications, Paris, France, 2003.
[2] S. Ceranka, GPS and dead reckoning based localization system for the blind, PhD Thesis (in Polish), Gdańsk University of Technology, 2006.
[3] T. Strotthe et al., Mobility of blind and elderly people interacting with computers, National Institute for the Blind, report on the MOBIC project.
[4] V. Gaudissart, S. Ferreira, C. Thillou and B. Gosselin, SYPOLE: a mobile assistant for the blind, Proc. 13th European Signal Processing Conference, EUSIPCO 2005, Antalya, Turkey, 2005.
[5] S. Ferreira, C. Thillou and B. Gosselin, From picture to speech: an innovative application for embedded environment, Proc. of the 14th ProRISC Workshop on Circuits, Systems and Signal Processing, ProRISC 03.
[6] N. Ezaki, M. Bulacu and L. Schomaker, Text detection from natural scene images: towards a system for visually impaired persons, Proc. 17th IEEE International Conference on Pattern Recognition, ICPR 04, 2004.
[7] N. Ezaki, K. Kiyota, B.T. Minh, M. Bulacu and L. Schomaker, Improved text-detection methods for a camera-based text reading system for blind persons, Proc. 8th International Conference on Document Analysis and Recognition, ICDAR 05, 2005.
[8] E.D. Haritaoglu and I. Haritaoglu, Real time image enhancement and segmentation for sign/text detection, Proc. IEEE International Conference on Image Processing, ICIP 03, 2003, pp. II993-II996.
[9] J. Zhang, X. Chen, A. Hanneman, J. Yang and A. Waibel, A robust approach for recognition of text embedded in natural scenes, Proc. IEEE International Conference on Pattern Recognition, ICPR 02, 2002.
[10] J. Gao and J. Yang, An adaptive algorithm for text detection from natural scenes, Proc. IEEE Computer Society CVPR 01, 2001, pp. II84-II89.
[11] X. Chen and A.L. Yuille, Detecting and reading text in natural scenes, Proc. IEEE Computer Society CVPR 04, 2004, pp. II366-II387.
[12] X. Chen and A.L. Yuille, A time-efficient cascade for real-time object recognition with applications for the visually impaired, Proc. IEEE Computer Society CVPR 05, 2005.
