Scenery Character Detection with Environmental Context

Size: px

Start display at page:

Download "Scenery Character Detection with Environmental Context"

Elizabeth Manning
5 years ago
Views:

1 cenery haracter Detection with Environmental ontext Yasuhiro Kunishige, Feng Yaokai, and eiichi Uchida Kyushu University, Fukuoka, 9-9, Japan Abstract For scenery detection, we introduce environmental context, which is modeled by scene components, such as sky and building. Environmental context is expected to regulate the probability of existence at a specific region in a scenery image. For example, if a region looks like a part of a building, the region has a higher probability than another region like a part of the sky. In this paper, environmental context is represented by state-of-the-art texture and color s and utilized in two different ways. Through experimental results, it was clearly shown that the environmental context has an effect of improving detection accuracy. Keywords-scenery detection, environmental context,, I. INTRODUTION The main contribution of this paper is to show the usefulness of environmental context, which is modeled by scene components, such as sky, ground, building, etc., for scenery detection. onsider a small region (a block or a connected component ()) in a scenery image. If the region (and its surroundings) look like a part of the sky, we can say that a existence probability at this region is low. This is because s never exist in the sky. This simple example indicates that environmental context is useful for detecting scenery s. Past trials on scenery detection (e.g., [] []) have never utilized environmental context. They tried to employ various s just for representing areas. Unfortunately, the past trials suffer from many false detections due to complicated non- areas. One naive strategy to suppress the false detections is to apply more strict conditions to the detectors; however, this strategy will result in the situation that various s (such as decorated s) are overlooked. To summarize, there is a severe trade-off when we try to detect s only by s. Our trials will utilize not only s for representing s but also s for representing environmental context, in order to relax the trade-off. Among several methodologies of utilizing environmental context in scenery detection, we will examine two methods, later called Method and Method. The former is a rather straightforward method where environmental context s are concatenated with s and then a /non- discrimination is performed. The latter is a more elaborated method where the likelihoods of all scene component categories (such as sky ) are first calculated and the discrimination is performed using the likelihoods as s. Experimental results will show the better performance of those methods over the simple method without environmental context, which is called Method later. The remainder of this paper is organized as follows. After a brief description of the preprocessing step in ection II, the extraction of not only s but also environmental s is described in ection III. Then, the three /non- discrimination methods using those s are described in ection IV. In ection V, the performance of detection are evaluated qualitatively and quantitatively through experiments using real ground-truthed scenery images. Finally, ection VI draws our conclusion and future work. II. PREPROEING In this paper, scenery detection is formulated as a /non- discrimination problem at each connected component (). Accordingly, the first-step is the decomposition of the entire image into non-overlapping s. is a better unit for our detection problem than a fixed-size block because the size of a (and environmental context) varies largely in the target scenery image and any fixed-size block cannot deal with this large variation. For the decomposition, trinarization by Niblack s method [] is used. Its two thresholds are determined at each pixel by local mean and standard deviation values. Precisely speaking, we use an improved version [], [] of the Niblack s original method. This improvement is effective to increase the robustness against noises and shadings around s. Note that very small s (less than pixel size) are always treated as non-s because they are unreadable even if they are actually s. III. FEATURE EXTRATION A. Extraction of haracter Features Many researcher have proposed their own s which are expected to be suitable for evaluating the likelihood that a target (or block, or pixel) is a. In this paper, by referring very recent trials [], [], the following twelve s (-) are extracted for each : the ratio between areas of the and the entire image, the ratio between width of the and the entire image, the aspect ratio of the, contour roughness, the number of holes, the ratio between area and squared contour length of the. the ratio between areas of the and its bounding

method (dim) scenery image trinarization connected component Figure. (dim) environ. context (99dim) extraction +envion. method (dim) method score (dim) discrimination Overview of the proposed method.

detection result box, mean of stroke width, standard deviation of stroke width, the ratio between areas of a dilated and the entire image, the ratio between contour lengths of original and the

For this purpose, we introduce 99 environmental context s of each. Among them, s are texton s [] of the target. Texton s have been recently proposed for describing textures.

2 method (dim) scenery image trinarization connected component Figure. (dim) environ. context (99dim) extraction +envion. method (dim) method score (dim) discrimination Overview of the proposed method. There are three different methods to be compared. detection result box, mean of stroke width, standard deviation of stroke width, the ratio between areas of a dilated and the entire image, the ratio between contour lengths of original and the dilated s, and edge contrast. B. Extraction of Environmental ontext Features Again, the main purpose is to introduce environmental context into the detection problem. For this purpose, we introduce 99 environmental context s of each. Among them, s are texton s [] of the target. Texton s have been recently proposed for describing textures. In [9], [], it was proved that they can provide a good performance of environmental objects, such as sky, road, green, etc. Texton s are based on responses from 9 Gaussian filters, Laplacian of Gaussian (LoG) filter, and Gaussian derivative filters. Then for each response the mean, standard derivation, Kurtosis, and skewness within the are calculated. ( = (9++).) Hereafter, the s from the Gaussian filters, the LoG filter, and the Gaussian derivative filters are denoted as E-, E-, E-, respectively. It is important to note that on the above calculation, -pixel margin is added around the. By this margin, we can incorporate environmental context around the target. The remaining s are employed to represent non-texture istics of environmental context. pecifically, s are L*a*b* color s, s are positional s, and is an area. The color s are extracted as the mean, standard derivation, Kurtosis, and skewness for each of L* (E-), a* (E9-9), and b* (E9-9) values of the target. The positional s (E9,9) are the x and y-coordinates of gravity point of the. The area (E99) is the number of pixels of. Note that the positional s are employed because they can represent an environmental context []; for example, sky will be generally located in an upper part of the image. IV. HARATER/NON-HARATER DIRIMINATION WITH ENVIRONMENTAL ONTEXT A. Random Forest The three discrimination methods (Methods ) commonly employ []. As shown in Fig., Test sample Random subset tree Training data tree K lass lass Voting Freq. Figure. Random. Random sampling Random subset K lass is an ensemble of K classifiers. Each classifier is a decision tree and trained by a subset of whole training data. During the training of each decision tree, effective s are selected automatically. The final classification result is determined by the majority voting of K decision results. In this paper, K and the depth of each tree are fixed at and, respectively. There are several merits of for our detection task. First, because of its sampling strategy and ensemble framework, is robust to outliers in training data. ince there are very special shapes in scenery images, overfitting will become serious without this robustness. econd, its selection mechanism is suitable for the task because we do not know useful s in advance. Third, we can investigate the importance of each on the discrimination by. (In ection V, we will observe the importance of each introduced in ection III.) Other more general merits are its fast computation and multi-class recognition ability. B. Three Discrimination Methodologies Method uses only the s (-). A is trained with those s, that is, in the -dimensional space. If votes for are more than votes for non- among K = votes, the target is detected as a. Detection results by Method will indicate an up-to-date accuracy of scenery detection, where environmental context is not considered. In other words, we can consider Method

Detection results by Method will indicate whether the environmental context s are useful. Method is the most elaborated for utilizing environmental contexts more explicitly.

3 Table I DETETION RATE. PARENTHEIZED VALUE ARE PIXEL-WIE DETETION RATE. Recall Precision F-value Method... (.) (.) (.) Method... (.) (.) (.9) Method...9 (.) (.9) (.) Figure. Ground-truth of scene component category. better better as one of the best past trials. Method uses not only the s (-) but also the environmental context s (E-99). A is trained in -dimensional space. Detection results by Method will indicate whether the environmental context s are useful. Method is the most elaborated for utilizing environmental contexts more explicitly. The likelihoods of several scene component categories are first calculated and then used as values for the discrimination. Precisely, Method is organized in the following two-step manner. ) As the first step, we calculate seven score s, which are comprised of one score and six scene component score s. The former is derived as the number of votes for in the of Method. The latter six s are derived by another trained to classify the into one of six scene component categories, sky, green, sign(board), ground, building, and others. For example, the sky score is derived as the number of votes for the sky category in the. ) As the second step, the final /non- discrimination is done by a single and the above seven score s. V. EXPERIMENTAL REULT A. Dataset A scenery image dataset were prepared for our experiments. Using Google Image earch TM, top photo images (each of which contains some s and has a size around ) were first collected. The keywords used in the search were park and sign. Those images were then decomposed into a training dataset ( images) and a test dataset ( images). For each image, a ground-truth was attached manually. pecifically, for each, label or non- label was assigned by a human operator. For a which contains both and non- regions, non label was assigned. For Method, the scene component category was also assigned to the. Figure shows an example of the ground-truth of environmental context. Precision[%] Method Method Method Recall[%] (a) -wise evaluation Figure. B. Quantitative Evaluation Method Method Method Recall[%] (b) Pixel-wise evaluation Detection accuracy. Table I and Fig. show the detection accuracy of the three methods. Figure is an RO curve showing the change of recall and precision according to a parameter k ( k K = ) of the ; if the number of votes for (against non- ) exceeds k, the target is detected as a region. The evaluation has been done in two ways; -wise evaluation and pixel-wise evaluation. The results by the -wise evaluation clearly show the superiority of Methods and over Method. Again, Method is one of the best conventional detection methods, where the environmental context is not used. Thus, the superiority confirms that the environmental context is very useful to scenery detection. Methods and have no significant difference by wise evaluation. This fact is interesting at the following point; Method uses -dimensional score s and its final decision totally relies on these seven s. This indicates that score s, the likelihoods of the main scene components, are very compact and sufficient representations of the environment for scenery detection. The pixel-wise evaluation showed that Method is better than Method ; this indicates that Method failed at some larger s. A typical example of this failure will be shown in ection V-. This is also the reason why Method could not outperform Method by the pixel-wise evaluation as shown in Fig. (b). Figure shows an importance for each at the /non- discrimination by. The importance was evaluated by the Out of Bag method, where the to be evaluated is replaced by another

4 (a) Method (b) Method Figure. E 9 E 9 E 99 E 9-9 E 9-9 E - E - E - E - E 9- E - E - E - E - E 9- E - E - E - E - E 9- E - E - E - E - E 9- E - E - 9 Importance (x-) Importance (x-) 9 Importance (x-) (c) Method Importance of each on discrimination by. (a) (b) (c) (d) Original Method Figure. Method Method Detection results. (a) (b) Figure. Visualization of score value. From left to right, original image, and,, sky, green, sign(board), ground, and building s.

5 and the accuracy degradation by the replacement is measured. If a larger degradation is observed, the is more important. Figure (a) shows that no has very low importance and the ratio between areas of the and its bounding box () has the top importance among the s. A more interesting fact is that as shown in Fig. (b), both of the s and the environmental context s are important in Method. This fact supports that Method utilizes the environmental context s for better detection accuracy than Method. Figure (c) shows that Method utilizes all of the seven score s and especially, green () and signboard () s are important for scenery detection.. Qualitative Evaluation Figure shows detection results by Methods on four scenery images. In Fig. (a) and (b), Method produced many false detections around sky and green regions. Those false detections decrease drastically by using environmental context. Figure (a) is a visualization of the score of Fig. (b). The false detections by Method are found in the sky region (Fig. (b)) and the region has high sky values (Fig. (a)). This fact indicates that the high sky score values could suppress the false detection and thus environmental context is useful to improve detection performance. Figure (c) is an example that environmental context could not improve the result. In this scenery image, thin grasses are overlapped on a signboard. Accordingly, most s of s are severely broken and far different from s of normal s. In the present framework, if a of a is broken, it is difficult to recognize it as a correctly, even with environmental context s. Figure (d) is an example showing a difference between Methods and. Both methods could successfully remove false detections in the green region, Method, however, wrongly detect the of a signboard region as a (large). One reason of this false detection is existence of the edges caused by the s on the signboard. Method was badly affected by these edges. In contrast, as shown in Fig. (b), Method could give larger signboard values correctly for this signboard region and finally give a correct detection result as a signboard. VI. ONLUION Environmental context, which is modeled by scene components, was utilized for scenery detection. The environmental context was represented by many s, which are mainly of texture s and color s, and utilized in two different methods. One method utilized the environmental context s directly in /non discrimination. The other utilized them as scores showing the likelihoods with respect to environmental components, such as sky and green. Both methods could show clear superiority over the detection method without environmental s in experimental results. Future work will focus on the following points. (i) A better segmentation method is necessary since it affects detection performance severely. (ii) The environmental context s of s around the target should be utilized. (iii) Another kind of context s, such as geometric context [] (which can estimate flat areas) and visual saliency [], will be useful because areas with higher flatness and/or saliency will have a higher probability of existing s. AKNOWLEDGMENT The authors greatly thank Prof. K. Kise and Prof. M. Iwamura (Osaka Pref. Univ., Japan) and Prof.. Omachi (Tohoku Univ., Japan) for their precious comments. This research was partially supported by JT, RET. REFERENE [] X. hen, and A. L. Yuille, Detecting and reading text in natural scenes, VPR,. []. Lucas, et al., IDAR robust reading competitions: entries, results, and future directions, IJDAR,. [] K. Zhu, F. Qi, R. Jiang, and L. Xu, Automatic detection and segmentation in natural scene images, J. Zhejiang University, cience A, vol., No.,. [] L. Xu, H. Nagayoshi, and H. ako, Kanji detection from complex real scene images based on properties, DA,. [] H. Goto, Redefining the DT-based for scene text detection, IJDAR,. [] W. Niblack, An Introduction to Digital Image Processing, Prentice Hall, 9. [] L. L. Winger, J. A. Robinson, and M. E. Jernigan, Lowcomplexity extraction in low-contrast scene images, IJPRAI,. [] T. Leung, and J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, IJV,. [9] J. hotton, J. Winn,. Rother, and A. riminisi, TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation, EV,. [] Y. Kang, K. Yamaguchi, T. Naito, and Y. Ninomiya, Road scene labeling using fm module and D bag of texton, IV-dRR, 9. [] L. Breiman, Random s, Machine Learning,. [] D. Hoiem, A. A. Efros, and M. Hebert, Geometric context from a single image, IV,. [] L. Itti,. Koch, and E. Niebur, A model of saliency-based visual attention for rapid scene analysis, PAMI, 99.

Conspicuous Character Patterns

Conspicuous Character Patterns Seiichi Uchida Kyushu Univ., Japan Ryoji Hattori Masakazu Iwamura Kyushu Univ., Japan Osaka Pref. Univ., Japan Koichi Kise Osaka Pref. Univ., Japan Shinichiro Omachi Tohoku