Text Block Detection and Segmentation for Mobile Robot Vision System Applications


Proc. of Int. Conf. on Multimedia Processing, Communication and Info. Tech., MPCIT
DOI: 03.AETS Association of Computer Electronics and Electrical Engineers, 2013

Too Boaz Kipyego and Prabhakar C. J.
Department of Computer Science, Kuvempu University, Karnataka, India

Abstract

We propose a technique to detect and segment text blocks in natural scenes based on a stereo disparity map. A survey of the literature reveals that existing techniques for scene text extraction operate on an image of the scene captured by a monocular camera. For a robot to make decisions based on semantic information such as text in an image, monocular techniques cannot be employed, because the robot is fitted with two identical stereo cameras. We therefore propose a technique to detect and extract scene text blocks from stereo images. The main application is to enable a robot to detect and segment text blocks in a scene from stereo images of that scene, which in turn enables it to recognize the text written on boards. The proposed technique comprises three major phases: estimation of the disparity map from the stereo images; detection of candidate planar surfaces in disparity space using the gradient derivative; and segmentation of candidate text blocks by mapping the connected component analysis of the homograph image onto the detected candidate planes. The experiments are carried out on our own dataset, which consists of stereo images captured in an outdoor environment. The results are evaluated for text detection using recall, precision and f-measure, and indicate a clear improvement in areas with complex background, where conventional methods fail.

Index Terms: disparity map, scene text, text detection, homograph image, stereo images.

I. INTRODUCTION

The extraction of semantic information from an image is essential for a mobile robot to make high-level decisions while navigating. Natural images contain semantic information such as text in various languages written on sign or advertisement boards, and a mobile robot can base high-level decisions on the meaning of that text. For a robot to recognize the text contained in natural scene images, the first step is to detect and extract the text from the image. In recent years, the automatic detection of text in natural images has gained increasing attention owing to its wide range of applications, such as content-based multimedia indexing and OCR; text embedded in digital images is considered an important aspect of overall image understanding. Text in natural scene images usually carries useful summarized information about the scene, and if image objects can be extracted accurately in real time, vision systems can be designed that aid the navigation of moving robots or of blind people [1]. Nonetheless, extracting text from natural scene images raises many challenging issues, and considerable effort has been devoted to addressing them.

The approaches [2]-[5] developed for natural scene text detection and extraction are based solely on images captured with a monocular camera. These researchers have attained high accuracy for text detection and extraction in natural scenes. However, these techniques cannot be adapted for robot applications and, to our knowledge, no one has attempted to develop a text detection and extraction technique meant for mobile robot applications. Therefore, in this paper, we propose a technique for building a text detection and segmentation vision system for a mobile robot.

Outdoor scenes contain sign or advertisement boards, walls, sidewalks, roads, roofs and other objects such as vehicles that can appear planar when viewed from a distance, and the detection and segmentation of such surfaces has drawn considerable research. Many researchers have used stereo disparity [6], [7] to design vision systems with which a mobile robot detects these objects from stereo images. With this in mind, a broader line of research aims to equip robots with stereo-camera applications that compute a disparity map for reconstructing 3D structure from 2D stereo images. These applications require accurate labelling of the scene [8] to support high-level decisions based on the semantic information in the image, and are mostly applied to mobile robot localization.

A building facade labelling model is proposed by Delmerico et al. [9], who introduce detection, segmentation and parameter estimation to identify individual facades for the localization and guidance of a robot. They sample and cluster candidate planes with Random Sample Consensus (RANSAC), using local normal estimates calculated by Principal Component Analysis (PCA) to inform the planar model. Han et al. [10] presented an algorithm for real-time object segmentation of a noisy disparity map obtained with a stereo matching algorithm. Corso et al. [8] presented a plane tracking algorithm that iteratively maintains a least-squares approximation of the plane parameters with sub-pixel accuracy based on stereo images.
An extension of the Boosting on Multilevel Aggregates (BMA) method that incorporates features based on stereo images, for building facade detection on mobile stereo vision platforms, has been proposed by Delmerico et al. [11]. Their method extends BMA to work with the disparity map and its associated features. Konolige et al. [12] used stereo images to integrate appearance and disparity information for object avoidance, and used AdaBoost to learn colour and geometry models for ideal routes of travel along the ground. The stereo information was used to detect the ground plane and distinguish it from obstacles, but not to classify or label the objects. Luo et al. [13] used an algebraic constraint on planar surfaces to correct disparity; they relied on the assumption that all urban scenes consist of planes, so that geometric properties could be used to improve occluded regions and poor disparity calculations. Li et al. [14] proposed an AdaBoost template to recognize human upper body pose from disparity images for natural human-robot interaction, with the advantage of performing both classification and segmentation. Walk et al. [15] incorporated object-specific features into a combination of classifiers for the detection of pedestrians with bounding boxes.

We propose a method for the detection and segmentation of candidate text blocks based on the disparity map, to be incorporated into vision systems that enable a robot to navigate with precision by identifying the names of places and finding surrounding information. Figure 1 is a workflow diagram of the proposed technique, showing the major steps followed to detect and segment text blocks from stereo images. We exploit the property that planar surfaces have constant gradient [11] to identify planar regions.
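The constant-gradient property can be checked numerically. The following is a minimal sketch, not the authors' MATLAB implementation: it filters a synthetic disparity map with the derivative of a 1-D Gaussian (the filter described in Section II) and measures how much the gradient varies. The kernel radius, the value of sigma, the synthetic surfaces and the helper names (gaussian_derivative_kernel, x_gradient, y_gradient, spread) are all illustrative assumptions.

```python
import math

def gaussian_derivative_kernel(sigma=1.0, radius=3):
    """Sample dG/dx = -x / (sqrt(2*pi) * sigma**3) * exp(-x**2 / (2*sigma**2))."""
    norm = math.sqrt(2.0 * math.pi) * sigma ** 3
    return [-x / norm * math.exp(-x * x / (2.0 * sigma * sigma))
            for x in range(-radius, radius + 1)]

def x_gradient(disparity, kernel):
    """Convolve every row with the kernel, keeping only fully supported pixels."""
    r = len(kernel) // 2
    return [[sum(kernel[k + r] * row[x - k] for k in range(-r, r + 1))
             for x in range(r, len(row) - r)]
            for row in disparity]

def y_gradient(disparity, kernel):
    """Same filter along columns: transpose, filter the rows, transpose back."""
    columns = [list(col) for col in zip(*disparity)]
    return [list(row) for row in zip(*x_gradient(columns, kernel))]

def spread(gradient):
    """Max minus min of a gradient image; a value near zero means constant gradient."""
    values = [v for row in gradient for v in row]
    return max(values) - min(values)

kernel = gaussian_derivative_kernel()
# Assumed test surfaces: a tilted plane and a surface that is curved along x.
plane = [[0.5 * x + 0.2 * y + 10.0 for x in range(12)] for y in range(10)]
curved = [[0.1 * x * x for x in range(12)] for y in range(10)]

print("plane :", spread(x_gradient(plane, kernel)), spread(y_gradient(plane, kernel)))
print("curved:", spread(x_gradient(curved, kernel)))
```

On the plane the spread is zero up to rounding error in both directions, while on the curved surface the x-gradient grows with x, which is the cue used to tell planar from non-planar regions.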
The disparity map is computed to help separate planar surfaces from non-planar ones. This is done by computing the directional gradient of the disparity map: planar surfaces have a constant gradient in both the vertical and horizontal directions, while non-planar surfaces do not have this property. The technique employs three major phases. First, depth map: we generate the disparity map using a region-based stereo matching algorithm with global error energy minimization. Second, plane detection: we use gradient images of the disparity map to detect and identify planar surfaces. Third, segmentation: we estimate a labelled field based on the detected planes and the connected component analysis result of the homograph image.

The paper is organized as follows. Sections II, III and IV present the proposed work in detail. Section V provides the evaluation metrics and the experimental results, and Section VI concludes the paper with challenges and an outlook on future work.

II. CANDIDATE PLANE DETECTION

We obtain stereo images using two identically configured cameras placed side by side horizontally, 10 cm apart. We estimate the disparity map from the captured stereo images using a region-based stereo matching algorithm with global error energy minimization, although our aim is not a full 3D reconstruction, since we assume that outdoor scene text is contained in one or more plane-like surfaces. Planar surfaces exhibit characteristic properties when viewed from non-verged stereo cameras.
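The idea behind disparity estimation by error energy minimization can be illustrated with a much-simplified sketch. It is not the region-based algorithm used in the paper: it assumes rectified grey-level images, a single pass of window averaging, and a winner-take-all choice per pixel. The function names and the radius and max_disparity parameters are hypothetical.

```python
def box_filter(image, radius):
    """Mean filter over a (2*radius+1)^2 window, clipped at the image border."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            out[y][x] = sum(image[j][i] for j in ys for i in xs) / (len(ys) * len(xs))
    return out

def disparity_map(left, right, max_disparity, radius=1):
    """Per pixel, pick the shift d with the lowest window-averaged squared error."""
    h, w = len(left), len(left[0])
    best = [[float("inf")] * w for _ in range(h)]
    disparity = [[0] * w for _ in range(h)]
    for d in range(max_disparity + 1):
        # Error energy of matching right[y][x] against left[y][x + d];
        # shifts that fall outside the left image get infinite cost.
        energy = [[(left[y][x + d] - right[y][x]) ** 2 if x + d < w else float("inf")
                   for x in range(w)] for y in range(h)]
        energy = box_filter(energy, radius)
        for y in range(h):
            for x in range(w):
                if energy[y][x] < best[y][x]:
                    best[y][x] = energy[y][x]
                    disparity[y][x] = d
    return disparity

# Synthetic check (assumed data): the right view is the left view shifted by 2 pixels.
left = [[x * x + y for x in range(12)] for y in range(6)]
right = [[left[y][x + 2] if x + 2 < 12 else 0 for x in range(12)] for y in range(6)]
result = disparity_map(left, right, max_disparity=4)
print(result[3][:9])
```

Away from the right-hand border, where the shifted window leaves the image, the recovered disparity equals the 2-pixel shift used to build the synthetic pair.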

Based on the facade features of Delmerico et al. [11], we extract planar surfaces from the disparity map of the outdoor scene images. The disparity-specific features are intended to help discriminate between planar and non-planar pixels. By measuring the uniformity of the disparity gradient across an aggregate, we can separate the candidate planar surface, which may contain a text block, from the background scene, using the property that planar surfaces have constant gradient [11] in disparity space. We compute the x-gradient image of the disparity map by filtering with the directional derivative of a 1-D Gaussian distribution in the x-direction (similarly for y):

\frac{\partial G}{\partial x} = \frac{-x}{\sqrt{2\pi}\,\sigma^{3}} \exp\left(-\frac{x^{2}}{2\sigma^{2}}\right)    (1)

where \sigma is the standard deviation of the Gaussian.

Figure 1. Proposed system workflow: left stereo image, right stereo image, disparity map, homograph image, x-derivative, y-derivative, CC analysis, planar image, mapping, text/non-text plane classification.

Figure 2. (a) Left image (b) Right image (c) Filtered disparity image (d) Gradient map.

III. EXTRACTING CONNECTED COMPONENTS

We use the original input stereo images to generate a projective homograph image using Random Sample Consensus (RANSAC) [16]. Figure 3(a) shows the homograph image generated for the pair of stereo images shown in Figure 2. We adopted the method in [17] to extract connected components from the generated projective homograph image, to achieve robust extraction of candidate text blocks. Binarization is applied to a small colour image region, and a search is then carried out in its neighbouring areas. Image binarization with a seed colour is conducted in the RGB colour space to classify the area into regions whose colours are similar to the seed and regions with different colours. The binarization method can effectively separate candidate scene text blocks from a complex background when the text pixels have similar RGB colour values that are distinguishable from the background.
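The seed-colour binarization and component extraction described above can be sketched as follows. This is an illustration only and does not reproduce the details of the method in [17]: the Euclidean colour distance, the tolerance tol, the 4-connectivity and the function names are assumptions made for the example.

```python
from collections import deque

def binarize_by_seed(image, seed, tol):
    """True where an RGB pixel lies within Euclidean distance tol of the seed colour."""
    return [[sum((c - s) ** 2 for c, s in zip(pixel, seed)) <= tol * tol
             for pixel in row]
            for row in image]

def connected_components(mask):
    """4-connected labelling by breadth-first search.

    Returns (labels, count); labels holds 0 for background and 1..count otherwise.
    """
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and labels[y][x] == 0:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

# Assumed toy image: two reddish blobs on a white background.
W, R, R2 = (255, 255, 255), (200, 10, 10), (190, 20, 15)
image = [
    [R, R2, W, W, W, W],
    [R, W,  W, W, R, R],
    [W, W,  W, W, R2, W],
]
mask = binarize_by_seed(image, seed=(200, 10, 10), tol=30)
labels, count = connected_components(mask)
print(count)
for row in labels:
    print(row)
```

The slightly different shade R2 falls within the tolerance of the seed, so each blob is extracted as a single component even though its colour is not perfectly uniform.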

Furthermore, it tends to extract a text block region as a single component even when the text colour varies smoothly owing to light reflection or uneven illumination.

Figure 3. (a) Homograph image (b) Connected component analysis (c) Binarization.

IV. SEGMENTATION OF CANDIDATE TEXT BLOCK

From Figure 2(d), it is observed that the gradient map of the disparity space shows the candidate planar surfaces, which may or may not contain text. The connected component analysis presented in Section III yields a decomposition of the scene into components. Some of these components belong to the background, and some have the planar property. We keep the planar components, because we assume that text is contained in planar surfaces, and discard the non-planar components by mapping the extracted components onto the gradient map of the disparity map (Figure 2(d)). Figure 4(a) shows the mapping of the connected component analysis result onto the gradient map. The mapping detects the candidate planar surfaces. Based on the locations of the detected candidate planes, the candidate planar surfaces are segmented in the homograph image; Figure 4(b) shows the result. An image may contain more than one candidate planar surface. Therefore, the segmented planes are further classified into text and non-text planes to detect the planar surfaces that contain a text area. Figure 4(c) shows the detected textured planar surface.

Figure 4. (a) Mapping (b) Segmentation (c) Textured image.

V. EXPERIMENTAL RESULTS

We performed an experiment using our own dataset, collected from outdoor scenes.
To the best of our knowledge, this is the first work on text detection and segmentation based on a disparity map computed from stereo images. No prior research has used this technique, and thus no benchmark stereo dataset for text detection is available. Our dataset consists of two pairs (Dataset#1, Dataset#2) of RGB stereo images taken with two cameras that have the same calibration and are horizontally aligned, about 10 cm apart. All the images have a resolution of 360x200 and all were used for testing. The images focus only on outdoor scenes in which sign boards containing text are present, and were taken with the cameras positioned perpendicular to the object of interest. On these images we achieved outstanding results, since the complex background evident in them consists mainly of trees and other non-planar objects. The experimental results obtained for Dataset#1 using the proposed method are shown in Figures 2, 3 and 4. The experimental results obtained for Dataset#2 are shown in Figure 5.
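The mapping step of Section IV, whose output appears in the mapping and segmentation panels of the result figures, can be sketched as follows. The sketch assumes that the gradient map has already been thresholded into a boolean planar mask and that a component is kept when most of its pixels are planar. The min_fraction threshold and the function names are hypothetical and are not taken from the paper.

```python
def keep_planar_components(labels, planar, min_fraction=0.8):
    """Return the labels of components whose pixels are mostly marked planar."""
    total, hits = {}, {}
    for label_row, planar_row in zip(labels, planar):
        for label, is_planar in zip(label_row, planar_row):
            if label:
                total[label] = total.get(label, 0) + 1
                hits[label] = hits.get(label, 0) + (1 if is_planar else 0)
    return {label for label in total if hits[label] / total[label] >= min_fraction}

def bounding_box(labels, label):
    """Inclusive (x_min, y_min, x_max, y_max) of one labelled component."""
    points = [(x, y) for y, row in enumerate(labels)
              for x, value in enumerate(row) if value == label]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

# Assumed toy input: component 1 lies on a planar region, component 2 mostly does not.
labels = [
    [1, 1, 0, 2],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
]
planar = [
    [True,  True,  False, False],
    [True,  True,  False, True],
    [False, False, False, False],
]
kept = keep_planar_components(labels, planar)
print(kept, [bounding_box(labels, label) for label in sorted(kept)])
```

The bounding boxes of the kept components give the candidate planar regions that would then be passed to the text/non-text classification.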


A. Evaluation Metrics

We adopted two metrics to evaluate the experimental results of the proposed method for text block detection and segmentation: Precision and Recall. Precision is the fraction of detections that are positives, whereas Recall is the fraction of positives that are detected rather than missed. Ground truth is obtained by marking by hand the bounding box that surrounds the entire text block in the stereo image datasets. Given the marked ground truth and the result detected by the algorithm, Recall and Precision can be calculated automatically. The precision and recall rates are computed from the area ratio r of the bounding boxes of the ground truth and of the result of our algorithm, as shown in Figure 6.

Figure 5. Results for Dataset#2: (a) Left image (b) Right image (c) Filtered disparity image (d) Gradient map (e) Homograph image (f) Connected component analysis (g) Binarization (h) Mapping (i) Segmentation (j) Textured image.

Figure 6. Illustration of the overlap of a ground truth box and a detected bounding box.

TABLE I. RESULTS ON RECALL, PRECISION AND F-MEASURE

Dataset      Precision   Recall   f-measure
Dataset#1
Dataset#2

B. Experimental Results and Discussion

We conducted experiments on the types of outdoor scene images shown above. Table I shows the recall, precision and f-measure for these outdoor images. The experimental results show that our algorithm has excellent performance in recall (R), precision (P) and f-measure. Among our experiments, the worst result was due to the domination of one colour over the others. The proposed method was implemented in MATLAB. However, it performed poorly with respect to computation time on the outdoor stereo scene images, on a PC with a 2.93 GHz Core 2 Duo processor and 256 MB of memory.
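The area-based precision, recall and f-measure of Section V-A can be sketched as below. The paper does not give the exact formula for the area ratio r, so this example assumes the common convention: precision is the overlap area divided by the detected area, recall is the overlap area divided by the ground truth area, and f-measure is their harmonic mean. Boxes are (x0, y0, x1, y1) tuples and all function names are illustrative.

```python
def area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def intersection(a, b):
    """Overlap box of a and b; it has zero area when the boxes do not overlap."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def evaluate(ground_truth, detected):
    """Return (precision, recall, f_measure) from bounding-box areas."""
    overlap = area(intersection(ground_truth, detected))
    precision = overlap / area(detected) if area(detected) else 0.0
    recall = overlap / area(ground_truth) if area(ground_truth) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Assumed example boxes: the detection covers half of the ground truth.
ground_truth = (0, 0, 10, 10)
print(evaluate(ground_truth, (5, 0, 15, 10)))    # half overlap in both directions
print(evaluate(ground_truth, (2, 2, 6, 6)))      # detection inside the ground truth
print(evaluate(ground_truth, (20, 20, 30, 30)))  # no overlap
```

A detection that lies entirely inside the ground truth box scores full precision but low recall, which is why both rates are reported together with the f-measure.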
VI. CONCLUSIONS

We have presented a method to localize and segment text blocks from stereo images for mobile robot vision system applications. We computed the directional 1-D gradient derivative in both the x and y directions from the disparity space. Since planar surfaces have constant gradients, we select the regions that satisfy our target features and later classify them as textured or non-textured planes. We used only our own dataset, obtained with two cameras of the same configuration, horizontally aligned approximately 10 cm apart. Our test data was taken with the cameras positioned perpendicular to the object of interest, which makes plane features easier to detect because they lie at the same level in the disparity depth map. We achieve the best results on complex backgrounds, which are successfully removed owing to their depth levels and the fact that their gradient derivative is highly inconsistent. The main advantage of this technique is that it can be used to build a vision system application for a moving robot equipped with stereo cameras. The classification result, the textured planes, is the input to the extraction algorithm that extracts text from the localized text blocks. In future work we propose to use standardized data covering all types of text orientation, to improve the detection of candidate text blocks in scene images, and to reduce computation time by incorporating faster algorithms.

REFERENCES

[1] Byun, H. R., Roh, M. C., Kim, K. C., Choi, Y. W., & Lee, S. W. (2002). Scene text extraction in complex images. In Document Analysis Systems V. Springer Berlin Heidelberg.
[2] N. Ezaki, M. Bulacu, and L. Schomaker. Text Detection from Natural Scene Images: Towards a System for Visually Impaired Persons. In International Conference on Pattern Recognition.
[3] B. Gatos, I. Pratikakis, K. Kepene, and S. Perantonis. Text detection in indoor/outdoor scene images. In Proc. First Workshop of Camera-based Document Analysis and Recognition.
[4] K. Kim, H. Byun, Y. Song, Y. Choi, S. Chi, K. Kim, and Y. Chung. Scene Text Extraction in Natural Scene Images Using Hierarchical Feature Combining and Verification. In Proceedings of the 17th International Conference on Pattern Recognition, volume 2.
[5] J. Park, H. Yoon, and G. Lee. Automatic Segmentation of Natural Scene Images Based on Chromatic and Achromatic Components. Lecture Notes in Computer Science, 4418:482, 2007.
[6] K. Okada, S. Kagami, M. Inaba, and H. Inoue. Plane segment finder: algorithm, implementation and applications. In IEEE International Conference on Robotics and Automation, volume 2.
[7] E. Trucco, F. Isgro, and F. Bracchi. Plane detection in disparity space. In International Conference on Visual Information Engineering, pages 73-76, 2003.
[8] Corso, Jason, Darius Burschka, and Gregory Hager. "Direct plane tracking in stereo images for mobile navigation." Robotics and Automation, Proceedings. ICRA'03. IEEE International Conference on. Vol. 1. IEEE, 2003.
[9] Delmerico, Jeffrey A., Philip David, and Jason J. Corso. "Building facade detection, segmentation, and parameter estimation for mobile robot localization and guidance." Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE.
[10] Han, Dongil, et al. "Real-time object segmentation using disparity map of stereo matching." Applied Mathematics and Computation (2008).
[11] Delmerico, Jeffrey A., Jason J. Corso, and Philip David. "Boosting with stereo features for building facade detection on mobile platforms." Image Processing Workshop (WNYIPW), 2010 Western New York. IEEE, 2010.
[12] Konolige, K., Agrawal, M., Bolles, R. C., Cowan, C., Fischler, M., & Gerkey, B. (2008, January). Outdoor mapping and navigation using stereo vision. In Experimental Robotics. Springer Berlin Heidelberg.
[13] Luo, W., and H. Maitre. "Using surface model to correct and fit disparity data in stereo vision." Pattern Recognition, Proceedings., 10th International Conference on. Vol. 1. IEEE.
[14] Li, Liyuan, et al. "Human upper body pose recognition using adaboost template for natural human robot interaction." Computer and Robot Vision (CRV), 2010 Canadian Conference on. IEEE.
[15] D. Doermann, J. Liang, and H. Li. Progress in camera based document image analysis. In Document Analysis and Recognition, Proceedings. Seventh International Conference on, 2003.
[16] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, vol. 24, no. 6.
[17] Kim, Egyul, SeongHun Lee, and JinHyung Kim. "Scene text extraction using focus of mobile camera." Document Analysis and Recognition, ICDAR'09. 10th International Conference on. IEEE.
