
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 23, NO. 10, OCTOBER 2013

Gradient Vector Flow and Grouping-based Method for Arbitrarily Oriented Scene Text Detection in Video Images

Palaiahnakote Shivakumara, Trung Quy Phan, Shijian Lu, and Chew Lim Tan, Senior Member, IEEE

Abstract: Text detection in videos is challenging due to the low resolution and complex background of video. In addition, the arbitrary orientation of scene text lines in video makes the problem more complex and challenging. This paper presents a new method that extracts text lines of any orientation based on gradient vector flow (GVF) and neighbor component grouping. The GVF of edge pixels in the Sobel edge map of the input frame is explored to identify the dominant edge pixels that represent text components. The method extracts the edge components corresponding to dominant pixels in the Sobel edge map, which we call the text candidates (TC) of the text lines. We propose two grouping schemes. The first finds nearest neighbors based on geometrical properties of the TC to group broken segments and neighboring characters, which results in word patches. The end and junction points of the skeletons of the word patches are then used to eliminate false positives, which yields the candidate text components (CTC). The second scheme uses the direction and size of the CTC to extract neighboring CTC and to restore missing CTC, which enables arbitrarily oriented text line detection in video frames. Experimental results on different datasets, including arbitrarily oriented text data, nonhorizontal and horizontal text data, Hua's data, and ICDAR-03 data (camera images), show that the proposed method outperforms existing methods in terms of recall, precision, and f-measure.

Index Terms: Arbitrarily oriented text detection, candidate text components (CTC), dominant text pixel, gradient vector flow (GVF), text candidates (TC), text components.

Manuscript received July 23, 2012; revised November 30, 2012 and January 25, 2013; accepted February 20, 2013. Date of publication March 28, 2013; date of current version September 28, 2013. This research is supported in part by the A*STAR Grant (WBS no. R252-s). This paper was recommended by Associate Editor S. Battiato. P. Shivakumara is with the Multimedia Unit, Department of Computer Systems and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia (e-mail: hudempsk@yahoo.com). C. L. Tan and T. Q. Phan are with the Department of Computer Science, School of Computing, National University of Singapore, Singapore (e-mail: phanquyt@comp.nus.edu.sg; tancl@comp.nus.edu.sg). S. Lu is with the Department of Computer Vision and Image Understanding, Institute for Infocomm Research (I2R), Singapore (e-mail: slu@i2r.a-star.edu.sg). Color versions of one or more of the figures in this paper are available online. © 2013 IEEE.

I. Introduction

TEXT detection and recognition is a hot topic for researchers in the fields of image processing, pattern recognition, and multimedia. It draws the attention of the content-based image retrieval (CBIR) community because, when text is available in the video, it helps to fill the semantic gap between low-level and high-level features to some extent [1]–[4]. In addition, text detection and recognition can be used to retrieve exciting and semantic events from sports video [5]–[7]. Therefore, text detection and extraction is essential for improving the performance of retrieval systems in real-world applications.
Video contains two types of text: scene text and graphics text. Scene text is part of the image captured by the camera. Examples of scene text include street signs, billboards, text on trucks, and writing on shirts. The nature of scene text is therefore unpredictable compared to graphics text, which is more structured and closely related to the subject. Nevertheless, scene text can be used to uniquely identify objects in sports events, navigate Google Maps, and assist visually impaired people. Since the nature of scene text is unpredictable, it poses many challenges. Among these, arbitrary orientation is the most challenging, as it is not as easy as processing straight text lines. Several methods have been developed for text detection and extraction that achieve reasonable accuracy for natural scene text (camera images) [8]–[13] as well as multi-oriented text [11]. However, most of these methods use a classifier and a large number of training samples to improve text detection accuracy. To tackle the multi-orientation problem, the methods use connected component analysis. For instance, the stroke width transform based method for text detection in scene images by Epshtein et al. [8] works well for connected components that preserve their shapes. Pan et al. [9] proposed a hybrid approach for text detection in natural scene images based on a conditional random field, which involves connected component analysis to label the text candidates. Since the images are of high contrast, connected component based features with classifier training work well and achieve good accuracy. However, the same methods cannot be used directly for text detection in video because of its low contrast and complex background, which cause disconnections, loss of shape, and similar artifacts. In this case, choosing a classifier and geometrical features of the components is not easy, so these methods are not suitable for video text detection.

Plenty of methods have been proposed over the last decade for text detection in video based on connected components [14], [15], texture [16]–[19], and edge and gradient features [20]–[25]. Connected component based methods are good for caption text and uniform color text, but not for text lines with multi-colored characters or text on clutter background. Texture based methods treat the appearance of text as a special texture. These methods handle complex backgrounds to some extent, but at a high computational cost, due to the large number of features and large number of training samples needed to classify text and nontext pixels. The performance of these methods therefore depends on the classifier in use and the number of training samples chosen for text and nontext. A method using edge and texture features without a classifier was proposed by Liu et al. [26] for text detection, but it requires a large number of features to discriminate text from nontext pixels. Sets of texture features without a classifier were also proposed by Shivakumara et al. [27], [28] for accurate text detection in video frames. Although these methods work well for a variety of frames, they require more processing time due to the large number of features, and their scope is limited to horizontal text. Combinations of edge and gradient features offer better text detection accuracy and efficiency than texture based methods. For example, text detection using gradients and statistical analysis of intensity values was proposed by Wong and Chen [21]; this method suffers from grouping text and nontext components together. Color information is used along with edge information for text detection by Cai et al. [22]; this method works well for caption text, but its performance degrades when the font size varies. In general, edge and gradient based methods produce more false positives due to the heuristics used for text and nontext pixel classification.

To the best of our knowledge, none of the methods discussed above properly addresses arbitrarily oriented text detection in video. The reason is that arbitrarily oriented text generally comes from scene text, which poses many more problems than graphics text. Zhou et al. [29] proposed a method for detecting both horizontal and vertical text lines in video using multiple stage verification and effective connected component analysis. This method is good for caption text but not for other text, and its orientation handling is limited to horizontal and vertical. Shivakumara et al. [30] addressed the multi-orientation issue based on the Laplacian and skeletonization. This method gives low accuracy because the skeleton based analysis is not good enough to classify simple and complex components in the presence of clutter background; it is also computationally expensive. Recently, a method [31] based on a Bayesian classifier and boundary growing was proposed to improve accuracy for multi-oriented text detection in video. However, the boundary growing used in that work performs well only when there is sufficient space between text lines; otherwise it includes nontext as text components. Therefore, that method handles only nonhorizontal straight text lines rather than arbitrarily oriented ones, where the space between text lines is often limited. Arbitrary text detection is proposed in [32] using gradient directional features and region growing. This method requires classifying images as horizontal or nonhorizontal, and it fails when an image contains multi-oriented text.
Therefore, it is not effective for arbitrary text detection. Thus, arbitrarily oriented text detection in video is still a challenging and interesting problem. Hence, in this paper, we propose the use of gradient vector flow (GVF) for identifying text components in a novel way. We were motivated by the work in [33], which identifies object boundaries using GVF and shows that the GVF field can move into concave boundaries without sacrificing boundary pixels. This property helps in detecting both high and low contrast text pixels, unlike the gradient in [32], which detects only high contrast text pixels; this is essential for improving the accuracy of video text detection at any orientation.

II. Proposed Methodology

In this paper, we explore GVF for identifying dominant text pixels using the Sobel edge map of the input image for arbitrary text detection in video. We prefer Sobel over other edge operators such as Canny because Sobel gives fine detail for text and less detail for nontext, while Canny produces many erratic background edges along with the fine details of text. Next, the edge components in the Sobel edge map corresponding to dominant pixels are extracted; we call them text candidates (TC). This operation gives representatives for each text line. To tackle arbitrary orientation, we propose a new two-stage grouping criterion for the TC. The first stage grows the perimeter of each TC to identify its nearest neighbors based on the size and angle of the TC and groups them, which gives text components. Before proceeding to the second stage of grouping, we apply skeleton analysis to the text components given by the first stage to eliminate false text components based on junction points. We name this output candidate text components (CTC). In the second stage, we use the tails of the CTC to identify the direction of the text, and the method grows along the identified direction to find the nearest neighbor CTC, which outputs the final result of arbitrarily oriented text detection in video. To the best of our knowledge, this is the first work addressing arbitrarily oriented text detection in video with promising accuracy using GVF information.

A. GVF for Dominant Text Pixel Selection

The GVF is the vector field that minimizes the energy functional defined in (1) [33]

$$\varepsilon = \iint \mu \left( u_x^2 + u_y^2 + v_x^2 + v_y^2 \right) + \left| \nabla f \right|^2 \left| \mathbf{g} - \nabla f \right|^2 \, dx \, dy \tag{1}$$

where g(x, y) = (u(x, y), v(x, y)) is the GVF field and f(x, y) is the edge map of the input image. GVF was used in [33] for object boundary detection, where it was shown to be better than the traditional gradient and the snake. It is also noted in [33] that there are two problems with the traditional gradient operation: 1) the gradient vectors generally have large magnitudes only in the immediate vicinity of the edges, and 2) in homogeneous regions, where pixel values are nearly constant, ∇f is nearly zero.
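For illustration, here is a minimal numerical sketch of the GVF diffusion that minimizes (1), in the spirit of Xu and Prince [33]. This is not the authors' implementation; the parameter values (mu, time step, iteration count) are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import laplace, sobel

def gvf(f, mu=0.2, iterations=80, dt=0.5):
    """Iteratively solve the GVF diffusion equations of Xu and Prince [33].

    f  : 2-D edge map of the frame, floats in [0, 1].
    mu : regularization weight from (1); larger values give smoother fields.
    Returns the GVF field g = (u, v).
    """
    fx = sobel(f, axis=1)        # df/dx
    fy = sobel(f, axis=0)        # df/dy
    b = fx ** 2 + fy ** 2        # |grad f|^2, the data-term weight in (1)
    u, v = fx.copy(), fy.copy()  # initialize the field with the gradient
    for _ in range(iterations):
        # Gradient descent on (1): smoothness (Laplacian) term plus a
        # data-fidelity term that keeps g close to grad f near edges.
        u += dt * (mu * laplace(u) - b * (u - fx))
        v += dt * (mu * laplace(v) - b * (v - fy))
    return u, v
```

The field (u, v) computed this way is what the arrows in Fig. 1(b) visualize: near edges it follows the gradient, while diffusion carries it into homogeneous regions and toward boundary concavities.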

Fig. 1. Dominant point selection based on GVF. (a) Input. (b) GVF. (c) Dominant text pixels. (d) Dominant pixels on input frame.

The GVF is an extension of the gradient that propagates the gradient map farther away from the edges and into homogeneous regions using a computational diffusion process. The inherent competition of the diffusion process creates vectors that point into boundary concavities, which is a special property of the GVF. In summary, GVF propagates gradient information, i.e., magnitude and direction, into homogeneous regions. In other words, GVF helps in detecting multiple forces at corner points of object contours. This cue allows us to use multiple forces at corner points of edge components in the Sobel edge map of the input video frame to identify those points as dominant pixels. Dominant pixel selection removes most of the background information, which simplifies the problem of classifying text and nontext pixels, and it retains text information irrespective of the orientation of the text in the video. This is the great advantage of dominant pixel selection by GVF information. It is illustrated in Fig. 1, where (a) is the input and (b) is the GVF for all pixels of the image in Fig. 1(a). It is observed from Fig. 1(b) that there are dense forces at the corners of contours and at the curved boundaries of text components, as text components are generally more cursive than nontext components. Therefore, for each pixel, we count how many forces point to it (based on the GVF arrows). A pixel is classified as a dominant text pixel if it attracts at least four GVF forces. The threshold of four was determined by an experiment counting between one and five GVF forces over 100 test samples randomly selected from our database, with quantitative results reported in Table I. Table I shows that for two GVF arrows, the f-measure is low and the misdetection rate is high compared to three arrows, because more nontext (background) pixels are represented by two arrows; for three arrows, the f-measure is low and the misdetection rate is high compared to four arrows for the same reason. On the other hand, for four arrows, the f-measure is high and the misdetection rate is low compared to five arrows; this shows that five arrows lose text pixels, which increases the misdetection rate. It is also observed from Table I that five arrows give high precision and low recall compared to four arrows, indicating that five arrows lose dominant pixels representing true text pixels as well as nontext pixels. Therefore, we infer that four GVF arrows are better than the other choices for identifying dominant text pixels, which represent true text pixels and few nontext pixels. In addition, our objective at this stage in choosing four arrows is to remove as many nontext pixels as possible, even though a few dominant pixels representing text are eliminated, because the proposed grouping schemes (Sections II-C and II-D) have the ability to restore missing text information. Therefore, losing a few dominant text pixels of the characters in a text line does not much affect the overall performance of the method. Dominant text pixel selection is illustrated in Fig. 1(c) for the frame shown in Fig. 1(a); it removes almost all nontext components. Fig. 1(d) shows the dominant text pixels overlaid on the input frame. One can notice from Fig. 1(d) that each text component has dominant pixels. In this way, dominant text pixel selection facilitates arbitrarily oriented text detection.

As an example, we choose the character image "a" from the input frame shown in Fig. 1(a). It is reproduced in Fig. 2(a) to illustrate how GVF information helps in selecting dominant text pixels. To show the GVF arrows for the character image in Fig. 2(a), we obtain the Sobel edge map shown in Fig. 2(b) and the GVF arrows on the Sobel edge map shown in Fig. 2(c). From Fig. 2(c), it is clear that all the GVF arrows point toward the inner contour of the character "a". This is because of the low contrast in the background and the high contrast at the inner boundary of the character. Thus, from Fig. 2(d), we observe that corner points and cursive text pixels on the contour attract more GVF arrows than non-corner points and nontext pixels. For instance, for a text pixel on the inner contour of the character "a" in Fig. 2(a), the corresponding GVF is marked by the oval in the middle of Fig. 2(d); the oval area shows that a greater number of GVF forces point toward that text pixel. Similarly, for a nontext pixel at the top left corner of the character in Fig. 2(a), the corresponding GVF, marked by the top left oval in Fig. 2(d), shows that a smaller number of GVF forces point toward that pixel. For the same two text and nontext pixels, we show the GVF arrows in their 3 × 3 neighborhoods. Darker arrows in Fig. 3(a) and (b) are those that point to the middle pixel (the pixel of interest); lighter arrows are attracted elsewhere. In Fig. 3(a), the middle pixel attracts four arrows and is hence classified as a corner point (dominant text pixel), while the pixel in Fig. 3(b) attracts only one arrow and is classified as a nontext pixel. We also test pixels that attract two and three GVF arrows, as shown in Fig. 3(c) and (d), and Fig. 3(e) and (f), respectively. One can see that the dominant pixels (DP) in Fig. 3(d) and (f), corresponding to the GVF (in red) in Fig. 3(c) and (e), represent not only text pixels but also nontext (background) pixels. On the other hand, Fig. 3(g) and (h) show that the pixels selected by four GVF arrows are real candidate text pixels, because these pixels indeed represent only text pixels, as shown in Fig. 3(h) for the GVF (in red) in Fig. 3(g).
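To make the four-arrow rule concrete, the sketch below counts, for every pixel, how many neighboring GVF arrows point at it. Quantizing each arrow to its nearest compass direction is our assumption about how "pointing to" is decided; the paper illustrates the count visually on 3 × 3 neighborhoods.

```python
import numpy as np

def dominant_pixels(u, v, edge_map, min_arrows=4):
    """Mark Sobel edge pixels attracting >= min_arrows GVF arrows (Sec. II-A)."""
    h, w = edge_map.shape
    angle = np.arctan2(v, u)
    # Quantize each pixel's arrow to one of 8 compass directions (dy, dx).
    octant = np.round(angle / (np.pi / 4)).astype(int) % 8
    dirs = [(0, 1), (1, 1), (1, 0), (1, -1),
            (0, -1), (-1, -1), (-1, 0), (-1, 1)]
    counts = np.zeros((h, w), dtype=int)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dy, dx = dirs[octant[y, x]]
            counts[y + dy, x + dx] += 1  # this pixel's arrow lands there
    # Dominant text pixels: edge pixels attracting at least four arrows.
    return (counts >= min_arrows) & (edge_map > 0)
```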

Fig. 2. Magnified GVF for corner and non-corner pixels marked by oval shapes. (a) Character chosen from Fig. 1(a). (b) Sobel edge map. (c) GVF overlaid on Sobel edge map. (d) GVF for the character image shown in (a).

Fig. 3. Illustration of the selection of dominant text pixels (DP) with GVF arrows. (a) GVF arrows at a text pixel. (b) GVF arrows at a nontext pixel. (c) Two GVF. (d) DP. (e) Three GVF. (f) DP. (g) Four GVF. (h) DP.

Fig. 4. Four GVF for the characters O and I to identify DP. (a) Four GVF. (b) DP. (c) Four GVF. (d) DP.

TABLE I. Experiments on 100 Random Samples Chosen From Different Databases for Choosing the Number of GVF Arrows. Columns: GVF Arrows, R, P, F, MDR.

In addition, Fig. 4 shows that the four-arrow selection identifies dominant pixels well [Fig. 4(b) and (d)] for characters like O and I [Fig. 4(a) and (c)], which have no corners but do have extreme points. Thus, it confirms that four GVF arrows work well for any character.

B. Text Candidates Selection

Fig. 5. Text candidates selection based on dominant pixels. (a) Sobel edge map. (b) Text candidates.

We use the result of dominant pixel selection shown in Fig. 1(c) for text candidate selection. For each dominant pixel in Fig. 1(c), the method extracts the edge component from the Sobel edge map shown in Fig. 5(a) that corresponds to that dominant pixel. We call these extracted edge components text candidates, as shown in Fig. 5(b). Fig. 5(b) shows that this operation extracts almost all text components with few false positives. The extracted text candidates are then used in the next section to restore the complete text information with the help of the Sobel edge map.
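A minimal sketch of this step, under the assumption that "edge components" are 8-connected components of the Sobel edge map; components containing at least one dominant pixel are kept as text candidates. The function and variable names are ours.

```python
import numpy as np
from scipy.ndimage import label

def text_candidates(sobel_edges, dominant):
    """Keep the edge components of the Sobel map that contain at least
    one dominant pixel; these are the text candidates (TC) of Sec. II-B."""
    labels, _ = label(sobel_edges > 0, structure=np.ones((3, 3)))
    # Component ids that own at least one dominant pixel.
    keep = np.unique(labels[dominant & (labels > 0)])
    return np.isin(labels, keep) & (sobel_edges > 0)
```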

C. First Grouping for Candidate Text Components

For each text candidate shown in Fig. 5(b), the method finds its perimeter and allows the perimeter to grow for five iterations, pixel by pixel, in the direction of the text line in the Sobel edge map of the input frame, in order to group neighboring text candidates. The perimeter is defined as the contour of the text candidate. The method computes the minor axis of the perimeter of the text candidate and uses the length of the minor axis as the radius for expanding the perimeter. At every iteration, the method traverses the expanded perimeter to look for a text pixel (white pixel) of a neighboring text candidate in the text line. The objective of this step is to merge segments of character components and neighboring characters to form a word. This process merges text candidates that lie in close proximity, within five iterations of perimeter growth. The value five was determined empirically by studying the space between text candidates; a five-pixel tolerance is acceptable because it is lower than the space between characters. As a result, we get two groups of text candidates, namely the current group and the neighbor group. The method then verifies the following size and angle properties of the text candidate groups before merging them. Generally, the major axes of the character components in a text line have almost the same length, and the angle differences between character components are almost the same.

Size condition:

$$\frac{\mathrm{medianlength}(g)}{3} < \mathrm{length}(c) < 3 \cdot \mathrm{medianlength}(g)$$

where length(·) is the length of the major axis of a text candidate group and medianlength(·) is the median of the major-axis lengths of all the text candidates in the group so far.

Angle condition:

$$g = g_{\mathrm{prev}} \cup \{c_{\mathrm{last}}\}, \qquad g_{\mathrm{next}} = g \cup \{c\}$$
$$\theta_1 = |\mathrm{angle}(g) - \mathrm{angle}(g_{\mathrm{prev}})|, \qquad \theta_2 = |\mathrm{angle}(g) - \mathrm{angle}(g_{\mathrm{next}})|$$

where g is the current group, c_last is the text candidate group that was last added to g, and c is the new text candidate group being considered for addition to g. It follows that g_prev and g_next are the group immediately before the current one and the group including the candidate, respectively. angle(·) returns the orientation of the major axis of a group, computed by PCA. The angle condition is |θ1 − θ2| ≤ θ1_min, and it is checked only when g has at least four components. We fix θ1_min at 5° because, in arbitrarily oriented text, each character has a slightly different orientation that follows the orientation of the text line; the 5° tolerance accommodates this small variation. If a text candidate group passes these two conditions, we merge the neighbor group with the current group to obtain candidate text components (word patches). The two conditions fail when a large angle difference arises between two words, due to clutter background, during grouping.

Fig. 6. Illustration of candidate text component selection. (a) g. (b) c. (c) g_prev. (d) c_last. (e) g_next.

The process is illustrated in Fig. 6, where (a)–(e) show g, c, g_prev, c_last, and g_next, respectively, chosen from Fig. 5(b). For these groups, θ1 = 5.33° and θ2 = 4.02°, and length(c) and medianlength(g) satisfy the size condition, so the conditions are met and c is merged into g, as shown in Fig. 6(e). In this way, the method groups broken segments and neighboring characters to obtain candidate text components.

Fig. 7. Word patch extraction. (a) First grouping. (b) Staircase effect. (c) Skeleton. (d) End and junction points. (e) Candidate text components after false positive elimination.

Fig. 8. Illustration of word grouping. (a) w_1. (b) w_2. (c) t_1. (d) t_2. (e) t_12.

The final grouping results for the text candidates in Fig. 5(b) are shown in Fig. 7(a), where different colors represent the different groups formed, and the staircase effect in Fig. 7(b) illustrates the grouping mechanism. This process repeats until no unvisited text candidates remain. The grouping essentially yields word patches by grouping character components.
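The size and angle tests can be sketched as follows. Computing a group's major-axis length and orientation by PCA of its pixel coordinates follows the description above; the helper names and the axis-length scaling are our assumptions.

```python
import numpy as np

def axis_length_and_angle(points):
    """Major-axis length and orientation (degrees) of a component via PCA.
    points: (N, 2) array of (y, x) pixel coordinates."""
    centered = points - points.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(centered.T))  # ascending eigenvalues
    major = vecs[:, -1]                              # major-axis direction
    length = 4.0 * np.sqrt(vals[-1])                 # ~2 sigma on each side
    return length, np.degrees(np.arctan2(major[0], major[1]))

def may_merge(len_c, median_len_g, theta1, theta2, theta_min1=5.0):
    """Size and angle conditions of Sec. II-C for merging candidate c into g."""
    size_ok = median_len_g / 3.0 < len_c < 3.0 * median_len_g
    angle_ok = abs(theta1 - theta2) <= theta_min1   # checked once |g| >= 4
    return size_ok and angle_ok
```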
It is observed from Fig. 7(b) that there are false text candidate groups. To eliminate them, we check the skeleton of each group, as shown in Fig. 7(c), and count the number of junction points, shown in Fig. 7(d). If intersection(skeleton(g)) > 0, the group is declared a false text candidate group and is not retained, where skeleton(·) returns the skeleton of a group and intersection(·) returns the set of intersection (junction) points. The final results after removing false text candidate groups can be seen in Fig. 7(e). However, one false text candidate group still remains.
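A sketch of the junction test, assuming scikit-image's skeletonization; treating a junction as a skeleton pixel with three or more skeleton neighbors is our reading of intersection(·).

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import skeletonize

def has_junction(group_mask):
    """True if the group's skeleton contains a junction point (Sec. II-C);
    such groups are rejected as false text candidates."""
    skel = skeletonize(group_mask > 0)
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbors = convolve(skel.astype(int), kernel, mode="constant")
    return bool(np.any(skel & (neighbors >= 3)))
```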

D. Second Grouping for Text Line Detection

The first grouping described above produces word patches by grouping character components. For each word patch, the second grouping finds the two tail ends using the major axis of the word patch. The method takes the text candidates at both tail ends of the word and grows their perimeters along the direction of the major axis for a few iterations to find neighboring word patches. The number of iterations is determined from experiments on the space between words and characters. While growing the perimeter pixel by pixel, the method looks for white pixels of neighboring word patches; the Sobel edge map of the input frame is used for the growing and for finding neighboring word patches. Two word patches are grouped based on their angle properties. Let t1 and t2 be the right tail end of the first word patch and the left tail end of the second word patch, respectively:

$$t_1 = \mathrm{tail}(w_1, c_1), \quad t_2 = \mathrm{tail}(w_2, c_2), \quad t_{12} = t_1 \cup t_2$$
$$\theta_1 = |\mathrm{angle}(t_1) - \mathrm{angle}(t_{12})|, \qquad \theta_2 = |\mathrm{angle}(t_2) - \mathrm{angle}(t_{12})|$$

where w1 is the current word patch, c1 is the text candidate being used for growing, and c2 is the text candidate of the neighboring word patch w2 to which the growth connects. The idea is to check that the tail angles of the two words are compatible with each other. tail(w, c) returns up to three text candidates immediately connected to c in w, and t12 is the joint tail formed from t1 and t2. The angle condition is θ1 ≤ θ2_min ∧ θ2 ≤ θ2_min, and it is checked only if both t1 and t2 contain three components. If a word patch passes this condition, it is merged into the current word. We set θ2_min to 25° to accommodate the orientation difference between words in a text line; a small orientation difference between words is expected because the input is arbitrarily oriented text. The 25° tolerance does not much affect the grouping process because there is usually enough space between text lines. The grouping of word patches chosen from Fig. 7(e) is illustrated in Fig. 8, where (a)–(e) represent w1, w2, t1, t2, and t12, respectively. Suppose we are considering whether to merge w1 and w2: here θ1 = 20.87° and θ2 = 20.68°, so the condition is satisfied and w1 and w2 are merged, as shown in red in Fig. 9(a). This process repeats until no unvisited words remain. The output of the second grouping is shown in Fig. 9(a), where the staircase effect with different colors shows how the words are grouped; the final result in Fig. 9(b) shows the curved text line extracted, along with one false positive.

Fig. 9. Arbitrary text extraction. (a) Second grouping. (b) Text line detection.

E. False Positive Removal

Sometimes false positives are merged with the text lines (as in the above case), which makes them difficult to remove. In other cases, however, the false positives stand alone, and we propose the following rules to remove them. Rules that eliminate such false positives based on geometrical properties of the text block are common practice in text detection [14]–[32] for improving accuracy, and we adopt similar rules in this paper.

Fig. 10. Illustration of false positive elimination. (a) Input. (b) Before false positive removal. (c) Area for false positive removal. (d) Density for false positive removal.

False positive check: a block w is declared a false positive and removed if area(w) < 200 or edge_density(w) < 0.05, where

$$\mathrm{edge\_density}(w) = \frac{\mathrm{edge\_length}(\mathrm{sobel}(w))}{\mathrm{area}(w)}$$

sobel(·) returns the Sobel edge map, and edge_length(·) returns the total length of all edges in the edge map.
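The stand-alone check can be sketched as below, with the thresholds stated above (200 for area, 0.05 for edge density). Approximating the total edge length by the count of Sobel edge pixels after binarization, and the 0.1 binarization threshold, are our assumptions.

```python
import numpy as np
from skimage.filters import sobel

def is_false_positive(block, area_thresh=200, density_thresh=0.05):
    """Stand-alone false positive test of Sec. II-E for a grayscale block."""
    area = block.shape[0] * block.shape[1]
    edges = sobel(block) > 0.1      # binarization threshold is an assumption
    edge_length = int(edges.sum())  # edge pixel count approximates edge length
    density = edge_length / float(area)
    return area < area_thresh or density < density_thresh
```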
Fig. 10(a) shows the input, (b) shows the result before false positive elimination, (c) shows the result of false positive elimination using the area of the text block, and (d) shows the result of false positive elimination using the edge density of the text block.

III. Experimental Results

We created our own dataset for evaluating the proposed method, along with standard datasets such as Hua's data of 45 video frames [34]. Our dataset includes 142 arbitrarily oriented text frames (almost all scene text frames), 220 nonhorizontal text frames (176 scene text frames and 44 graphics text frames), 800 horizontal text frames (160 Chinese text frames, 155 scene text frames, and 485 English graphics text frames), and the publicly available Hua's data of 45 frames (12 scene text frames and 33 graphics text frames). We also tested our method on the ICDAR-03 competition dataset [35] of 251 camera images (all scene text images) to check its effectiveness on camera-based images. In total, 1207 (142 + 220 + 800 + 45) video frames and 251 camera images are used for experimentation. To compare the results of the proposed method with existing methods, we consider seven popular existing methods: the Bayesian and boundary growing based method [31], the Laplacian and skeleton based method [30], the Fourier-RGB based method [28], and those presented in [21], [22], [26], and [29].

Fig. 11. Sample results for arbitrarily oriented text detection. (a) Input. (b) Proposed. (c) Bayesian. (d) Laplacian. (e) Zhou et al. (f) Fourier-RGB. (g) Liu et al. (h) Wong and Chen. (i) Cai et al.

The main reason for considering these existing methods is that, like our proposed method, they work with fewer constraints and handle complex backgrounds without a classifier and training. We evaluate the performance of the proposed method at the text line level, which is a common granularity level in the literature [17]–[25], rather than at the word or character level, because we do not consider text recognition in this paper. The following categories are defined for each block detected by a text detection method. Truly detected block (TDB): a detected block that contains at least one true character; thus, a TDB may or may not fully enclose a text line. Falsely detected block (FDB): a detected block that does not contain text. Text block with missing data (MDB): a detected block that misses more than 20% of the characters of a text line (MDB is a subset of TDB). The percentage is chosen according to [30], [31], in which a text block is considered correctly detected if it overlaps at least 80% of the pixels of the ground-truth block. We manually count the actual number of text blocks (ATB) in the images, which serves as the ground truth for evaluation. The performance measures are defined as follows: recall (R) = TDB / ATB; precision (P) = TDB / (TDB + FDB); f-measure (F) = (2 × P × R) / (P + R); misdetection rate (MDR) = MDB / TDB. In addition, we measure the average processing time (APT), in seconds, of each method in our experiments.
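The measures defined above translate directly into code; this sketch assumes the block counts have already been tallied over an image set and that the denominators are nonzero.

```python
def evaluation_measures(tdb, fdb, mdb, atb):
    """Line-level measures of Sec. III: recall, precision, f-measure,
    and misdetection rate, from truly/falsely detected, missing-data,
    and actual text block counts."""
    recall = tdb / atb
    precision = tdb / (tdb + fdb)
    f_measure = 2 * precision * recall / (precision + recall)
    mdr = mdb / tdb
    return recall, precision, f_measure, mdr
```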
A. Experiment on Video Text Data

To show the effectiveness of the proposed method over the existing methods, we assemble the 142 arbitrary images with the 800 horizontal and 220 nonhorizontal images to form a representative and varied set of general video data, on which we calculate the performance measures, namely recall, precision, f-measure, and misdetection rate. The quantitative results of the proposed and existing methods for the 1162 images (142 + 800 + 220) are reported in Table II. We highlight sample arbitrary, nonhorizontal, and horizontal images for discussion in Figs. 11, 12, and 13, respectively.

Fig. 12. Sample results for nonhorizontal text detection. (a) Input. (b) Proposed. (c) Bayesian. (d) Laplacian. (e) Zhou et al. (f) Fourier-RGB. (g) Liu et al. (h) Wong and Chen. (i) Cai et al.

For a curved, circle-shaped text line such as the one shown in Fig. 11(a), the proposed method extracts the text lines with one false positive, while the existing methods fail to detect the curved text line properly. The main reason is that the existing methods were developed for horizontal and nonhorizontal text line detection, not for arbitrary text detection. It is observed from Fig. 12 that, for the input frame with different orientations and a complex background shown in Fig. 12(a), the proposed method detects almost all the text with a few misdetections, as shown in Fig. 12(b). The Bayesian method does not fix the bounding boxes properly, as shown in Fig. 12(c), and the Laplacian method detects two text lines but loses one, as shown in Fig. 12(d), due to the complex background of the frame. Zhou et al.'s method fails to detect the text, as shown in Fig. 12(e), as it is limited to horizontal and vertical caption text, not scene text and multi-oriented text. It is also observed from Fig. 12 that the Fourier-RGB, Liu et al.'s, Wong and Chen's, and Cai et al.'s methods fail to detect the text lines, because these methods were developed for horizontal rather than nonhorizontal text detection.

Fig. 13. Sample results for horizontal text detection. (a) Input. (b) Proposed. (c) Bayesian. (d) Laplacian. (e) Zhou et al. (f) Fourier-RGB. (g) Liu et al. (h) Wong and Chen. (i) Cai et al.

Sample experimental results of both the proposed and existing methods for horizontal text detection are shown in Fig. 13, where the input image in Fig. 13(a) has a complex background with horizontal text. It is noticed from Fig. 13 that the proposed method and the Bayesian, Laplacian, Fourier-RGB, and Cai et al.'s methods detect almost all the text lines, while the other methods miss text lines. The Bayesian method does not fix the bounding boxes properly and gives more false positives due to the problem of boundary growing. The Fourier-RGB method detects the text properly. Among the remaining methods, Zhou et al.'s method misses a few text lines, Liu et al.'s method misses a few words in addition to producing false positives, and Wong and Chen's and Cai et al.'s methods do not fix the bounding boxes properly for the text lines.

TABLE II. Performance on Arbitrary + Nonhorizontal + Horizontal Data (142 + 800 + 220 = 1162). Columns: Methods, R, P, F, MDR, APT (sec); rows: Proposed Method, Bayesian [31], Laplacian [30], Zhou et al. [29], Fourier-RGB [28], Liu et al. [26], Wong and Chen [21], Cai et al. [22].

These observations on the sample images show that the proposed method detects arbitrary, nonhorizontal, and horizontal text well compared to the existing methods, and the quantitative results reported in Table II show that the proposed method also outperforms the existing methods in terms of recall, precision, f-measure, and misdetection rate. However, the APT of the proposed method is longer than that of most of the existing methods, except the Fourier-RGB and Liu et al.'s methods, as shown in Table II and in the subsequent experiments (Tables III and IV). The higher APT is attributed to the GVF computation and the grouping, which incur a higher computational cost; it is this GVF process that enables the proposed method to deal with arbitrarily oriented text lines. Our previous methods, namely the Bayesian and Laplacian methods, give lower accuracy than the proposed method according to Table II, because they were developed for nonhorizontal and horizontal text detection rather than arbitrarily oriented text detection. As a result, the boundary growing and skeleton based schemes proposed in the Bayesian and Laplacian methods, respectively, for handling multi-oriented problems fail on arbitrary text. Zhou et al.'s method works well only for vertical and horizontal caption text, not for arbitrary orientations and scene text, and hence gives poor accuracy. Since Liu et al.'s, Wong and Chen's, and Cai et al.'s methods were developed for horizontal text detection, they give poor accuracy compared to the proposed method.

B. Experiment on Independent Data (Hua's Data)

Fig. 14. Sample results for Hua's data. (a) Input. (b) Proposed. (c) Bayesian. (d) Laplacian. (e) Zhou et al. (f) Fourier-RGB. (g) Liu et al. (h) Wong and Chen. (i) Cai et al.

We use a small publicly available dataset of 45 video frames [34], namely Hua's dataset, to evaluate the performance of the proposed method in comparison with the existing methods. We included this set in our experiments because it serves as an independent test set in addition to our own dataset of the preceding section. However, we caution that this set contains only horizontal text and hence does not exercise the entire spectrum of text detection capability, from horizontal and nonhorizontal to arbitrary orientation. Fig. 14 shows sample results for the proposed and existing methods, where (a) is an input frame containing both large and small font text, and (b)–(i) are the results of the proposed and existing methods, respectively. It is observed from Fig. 14 that the proposed method detects both text lines in the input frame, while the Bayesian method does not detect all the text and the Laplacian method fails to detect the complete text lines, rendering their outputs as either misdetections or false positives.

TABLE III. Performance with Hua's Data. Columns: Methods, R, P, F, MDR, APT (sec); rows: Proposed Method, Bayesian [31], Laplacian [30], Zhou et al. [29], Fourier-RGB [28], Liu et al. [26], Wong and Chen [21], Cai et al. [22].

Their misdetection rates are therefore high compared to the proposed method, as shown in Table III. The Fourier-RGB method detects the text properly and hence gives good recall. The other existing methods fail to detect the text lines in the input frame due to the font variation. From Table III, it can be concluded that the proposed method and our earlier methods [30], [31] outperform the other existing methods in terms of recall, precision, f-measure, and misdetection rate. We note that the Bayesian method [31] and the Laplacian method [30] achieve better f-measure than the proposed method. However, as cautioned earlier, Hua's dataset does not contain arbitrarily oriented text, so the Bayesian and Laplacian methods have the advantage of not being tested on arbitrary text lines. If Hua's dataset had contained arbitrarily oriented text lines, the Bayesian and Laplacian methods would have shown poorer f-measures, as in Table II.

C. Experiment on ICDAR-03 Data (Camera Images)

We added another independent test set in this experiment, as in the preceding section. The objective of this experiment is to show that the proposed method, which works well for low resolution video frames, also works well for high resolution camera images. This dataset is publicly available [35] as the ICDAR-03 competition data for text detection in natural scene images. We show sample results of the proposed and existing methods in Fig. 15, where (a) is a sample input frame and (b)–(i) show the results of the proposed and existing methods, respectively. It is observed from Fig. 15 that the proposed method, the Fourier-RGB method, and Cai et al.'s method work well for the input frame, but the other methods, including our earlier Bayesian and Laplacian methods, fail to detect the text lines properly. The results reported in Table IV show that the proposed method is better in terms of recall, f-measure, and misdetection rate than the Bayesian, Laplacian, and Fourier-RGB methods. This is because, for high contrast and high resolution images, the classification schemes proposed in the Bayesian and Laplacian methods and the dynamic threshold used in Fourier-RGB fail to classify text and nontext pixels properly. The proposed method and our earlier methods are better than the other existing methods in terms of recall, precision, and f-measure, but in terms of misdetection rate, Wong and Chen's method is better, according to the results reported in Table IV. However, Wong and Chen's method is the worst in recall, precision, and f-measure compared to the proposed method. This experiment shows that the proposed method performs well even on high resolution, high contrast images.

Fig. 15. Sample results for scene text detection (ICDAR-2003 data). (a) Input. (b) Proposed. (c) Bayesian. (d) Laplacian. (e) Zhou et al. (f) Fourier-RGB. (g) Liu et al. (h) Wong and Chen. (i) Cai et al.

TABLE IV. Line Level Performance on ICDAR-03 Data. Columns: Methods, R, P, F, MDR, APT (sec); rows: Proposed Method, Bayesian [31], Laplacian [30], Zhou et al. [29], Fourier-RGB [28], Liu et al. [26], Wong and Chen [21], Cai et al. [22].
We also conduct experiments on the ICDAR data using the ICDAR 2003 measures for our proposed method; the results are reported in Table V. Since our primary goal is to detect text in video, we develop and evaluate the method at the line level, as is common practice in the video text detection literature [14]–[32]. To calculate recall, precision, and f-measure according to ICDAR 2003, we modify the method to fix a bounding box for each word in the image based on the space between words and characters. Table V shows that the proposed method does not achieve better accuracy than the best method (Hinnerk Becker), but it stands in third position among the methods compared. The lower accuracy is due to the problem of word segmentation, the difficulty of fixing tight bounding boxes, and the strict measures. In addition, the method does not exploit the advantages of high resolution images in the way the participating methods do, which use connected component analysis for text detection and grouping; hence, the proposed method misses some true text blocks. The results of the participating methods reported in Table V are taken from ICDAR 2005 [35] for comparison with the proposed method.

TABLE V. Word Level Performance on ICDAR 2003 Data. Columns: Methods, R, P, F; rows: Proposed Method, Hinnerk Becker [35], Alex Chen [35], Qiang Zhu [35], Jisoo Kim [35], Nobuo Ezaki [35].

IV. Conclusion and Future Work

In this paper, we explored GVF information for the first time for text detection in video, by selecting dominant text pixels and text candidates with the help of the Sobel edge map. Dominant text pixel selection helps to remove nontext information in the complex background of video frames. Text candidate selection and the first grouping scheme ensure that text pixels are not missed. The second grouping scheme tackles the problems created by arbitrarily oriented text, achieving better accuracy for text detection in video. Experimental results on a variety of datasets, namely arbitrarily oriented data, nonhorizontal data, horizontal data, Hua's data, and ICDAR-03 data, showed that the proposed method works well for text detection irrespective of contrast, orientation, background, script, font, and font size. However, the proposed method may not give good accuracy for horizontal text lines with little spacing between them. To overcome this problem, we plan to develop another method that can detect text lines without considering their spacing, using an alternative grouping criterion.

Acknowledgment

The authors would like to thank the editor and the reviewers for their constructive comments and suggestions that helped in improving the quality of this paper.

References

[1] N. Sharma, U. Pal, and M. Blumenstein, "Recent advances in video based document processing: A review," in Proc. DAS, 2012.
[2] J. Zhang and R. Kasturi, "Extraction of text objects in video documents: Recent progress," in Proc. DAS, 2008.
[3] K. Jung, K. I. Kim, and A. K. Jain, "Text information extraction in images and video: A survey," Pattern Recognit., vol. 37, pp. 977–997, 2004.
[4] D. Crandall and R. Kasturi, "Robust detection of stylized text events in digital video," in Proc. ICDAR, 2001.
[5] D. Zhang and S. F. Chang, "Event detection in baseball video using superimposed caption recognition," in Proc. ACM MM, 2002.
[6] C. Xu, J. Wang, K. Wan, Y. Li, and L. Duan, "Live sports event detection based on broadcast video and web-casting text," in Proc. ACM MM, 2006.
[7] W. Wu, X. Chen, and J. Yang, "Incremental detection of text on road signs from video with applications to a driving assistant system," in Proc. ACM MM, 2004.
[8] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in Proc. CVPR, 2010.
[9] Y. F. Pan, X. Hou, and C. L. Liu, "A hybrid approach to detect and localize texts in natural scene images," IEEE Trans. Image Process., vol. 20, no. 3, Mar. 2011.
[10] X. Chen, J. Yang, J. Zhang, and A. Waibel, "Automatic detection and recognition of signs from natural scenes," IEEE Trans. Image Process., vol. 13, no. 1, Jan. 2004.
[11] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in Proc. CVPR, 2012.
[12] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in Proc. CVPR, 2012.
[13] T. Q. Phan, P. Shivakumara, and C. L. Tan, "Detecting text in the real world," in Proc. ACM MM, 2012.
[14] A. K. Jain and B. Yu, "Automatic text location in images and video frames," Pattern Recognit., vol. 31, no. 12, Dec. 1998.
[15] V. Y. Mariano and R. Kasturi, "Locating uniform-colored text in video frames," in Proc. ICPR, 2000.
[16] H. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video," IEEE Trans. Image Process., vol. 9, no. 1, Jan. 2000.
[17] Y. Zhong, H. Zhang, and A. K. Jain, "Automatic caption localization in compressed video," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 4, Apr. 2000.
[18] K. I. Kim, K. Jung, and J. H. Kim, "Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, Dec. 2003.
[19] V. Wu, R. Manmatha, and E. M. Riseman, "TextFinder: An automatic system to detect and recognize text in images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 11, Nov. 1999.
[20] R. Lienhart and A. Wernicke, "Localizing and segmenting text in images and videos," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 4, Apr. 2002.
[21] E. K. Wong and M. Chen, "A new robust algorithm for video text extraction," Pattern Recognit., vol. 36, no. 6, Jun. 2003.
[22] M. Cai, J. Song, and M. R. Lyu, "A new approach for video text detection," in Proc. ICIP, 2002.
[23] A. Jamil, I. Siddiqi, F. Arif, and A. Raza, "Edge-based features for localization of artificial Urdu text in video images," in Proc. ICDAR, 2011.
[24] M. Anthimopoulos, B. Gatos, and I. Pratikakis, "A two-stage scheme for text detection in video images," Image Vision Comput., vol. 28, Mar. 2010.
[25] X. Peng, H. Cao, R. Prasad, and P. Natarajan, "Text extraction from video using conditional random fields," in Proc. ICDAR, 2011.
[26] C. Liu, C. Wang, and R. Dai, "Text detection in images based on unsupervised classification of edge-based features," in Proc. ICDAR, 2005.
[27] P. Shivakumara, W. Huang, C. L. Tan, and T. Q. Phan, "Accurate video text detection through classification of low and high contrast images," Pattern Recognit., vol. 43, no. 6, Jun. 2010.
[28] P. Shivakumara, T. Q. Phan, and C. L. Tan, "New Fourier-statistical features in RGB space for video text detection," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 11, Nov. 2010.
[29] J. Zhou, L. Xu, B. Xiao, and R. Dai, "A robust system for text extraction in video," in Proc. ICMV, 2007.
[30] P. Shivakumara, T. Q. Phan, and C. L. Tan, "A Laplacian approach to multi-oriented text detection in video," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, Feb. 2011.
[31] P. Shivakumara, R. P. Sreedhar, T. Q. Phan, S. Lu, and C. L. Tan, "Multioriented video scene text detection through Bayesian classification and boundary growing," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 8, Aug. 2012.
[32] N. Sharma, P. Shivakumara, U. Pal, M. Blumenstein, and C. L. Tan, "A new method for arbitrarily-oriented text detection in video," in Proc. DAS, 2012.
[33] C. Xu and J. L. Prince, "Snakes, shapes, and gradient vector flow," IEEE Trans. Image Process., vol. 7, no. 3, pp. 359–369, Mar. 1998.
[34] X. S. Hua, L. Wenyin, and H. J. Zhang, "An automatic performance evaluation protocol for video text detection algorithms," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, Apr. 2004.
[35] S. M. Lucas, "ICDAR 2005 text locating competition results," in Proc. ICDAR, 2005.

Palaiahnakote Shivakumara received the B.Sc., M.Sc., M.Sc. Tech. (by research), and Ph.D. degrees in computer science from the University of Mysore, Mysore, Karnataka, India, in 1995, 1999, 2001, and 2005, respectively. He is currently a Visiting Senior Lecturer in the Department of Computer Systems and Information Technology, University of Malaya, Kuala Lumpur, Malaysia. From 1999 to 2005, he was a Project Associate at the Department of Studies in Computer Science, University of Mysore, where he conducted research on document image analysis, including document image mosaicing, character recognition, skew detection, face detection, and face recognition. From 2005 to 2007, he was a Research Fellow in image processing and multimedia at the Department of Computer Science, School of Computing, National University of Singapore (NUS), Singapore. He also worked for six months as a Research Consultant on image classification at Nanyang Technological University, Singapore. He was subsequently a Research Fellow working on video text extraction and recognition at NUS from 2008. His current research interests include image processing and pattern recognition, including text extraction from video, and document image processing. Dr. Shivakumara has published more than 100 research papers in national and international conferences and journals. He has been a reviewer for several conferences and journals.

Trung Quy Phan received the B.Sc. degree in computer science from the School of Computing, National University of Singapore, Singapore, where he is currently pursuing the Ph.D. degree. His current research interests include image and video analysis.

Shijian Lu received the Ph.D. degree in electrical and computer engineering from the National University of Singapore, Singapore. He is currently a Senior Research Fellow at the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore. His current research interests include document image analysis and medical image analysis. He has published over 40 peer-reviewed journal and conference papers. Dr. Lu is a member of the International Association for Pattern Recognition.

Chew Lim Tan (SM'85) received the B.Sc. (Hons.) degree in physics from the University of Singapore, Singapore, in 1971, the M.Sc. degree in radiation studies from the University of Surrey, Surrey, U.K., in 1973, and the Ph.D. degree in computer science from the University of Virginia, Charlottesville, VA, USA. He is currently a Professor at the Department of Computer Science, School of Computing, National University of Singapore. His current research interests include document image analysis and text and natural language processing. He has published more than 400 research publications in these areas. Dr. Tan is an associate editor of Pattern Recognition and ACM Transactions on Asian Language Information Processing, and an editorial board member of the International Journal on Document Analysis and Recognition. He is a Fellow and Member of the Governing Board of the International Association for Pattern Recognition.


Text Extraction from Natural Scene Images and Conversion to Audio in Smart Phone Applications Text Extraction from Natural Scene Images and Conversion to Audio in Smart Phone Applications M. Prabaharan 1, K. Radha 2 M.E Student, Department of Computer Science and Engineering, Muthayammal Engineering

More information

Text Area Detection from Video Frames

Text Area Detection from Video Frames Text Area Detection from Video Frames 1 Text Area Detection from Video Frames Xiangrong Chen, Hongjiang Zhang Microsoft Research China chxr@yahoo.com, hjzhang@microsoft.com Abstract. Text area detection

More information

Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features

Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features 1 Kum Sharanamma, 2 Krishnapriya Sharma 1,2 SIR MVIT Abstract- To describe the image features the Local binary pattern (LBP)

More information

An Adaptive Threshold LBP Algorithm for Face Recognition

An Adaptive Threshold LBP Algorithm for Face Recognition An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent

More information

Image Resizing Based on Gradient Vector Flow Analysis

Image Resizing Based on Gradient Vector Flow Analysis Image Resizing Based on Gradient Vector Flow Analysis Sebastiano Battiato battiato@dmi.unict.it Giovanni Puglisi puglisi@dmi.unict.it Giovanni Maria Farinella gfarinellao@dmi.unict.it Daniele Ravì rav@dmi.unict.it

More information

Extraction Characters from Scene Image based on Shape Properties and Geometric Features

Extraction Characters from Scene Image based on Shape Properties and Geometric Features Extraction Characters from Scene Image based on Shape Properties and Geometric Features Abdel-Rahiem A. Hashem Mathematics Department Faculty of science Assiut University, Egypt University of Malaya, Malaysia

More information

TEVI: Text Extraction for Video Indexing

TEVI: Text Extraction for Video Indexing TEVI: Text Extraction for Video Indexing Hichem KARRAY, Mohamed SALAH, Adel M. ALIMI REGIM: Research Group on Intelligent Machines, EIS, University of Sfax, Tunisia hichem.karray@ieee.org mohamed_salah@laposte.net

More information

An Automatic Timestamp Replanting Algorithm for Panorama Video Surveillance *

An Automatic Timestamp Replanting Algorithm for Panorama Video Surveillance * An Automatic Timestamp Replanting Algorithm for Panorama Video Surveillance * Xinguo Yu, Wu Song, Jun Cheng, Bo Qiu, and Bin He National Engineering Research Center for E-Learning, Central China Normal

More information

Image Text Extraction and Recognition using Hybrid Approach of Region Based and Connected Component Methods

Image Text Extraction and Recognition using Hybrid Approach of Region Based and Connected Component Methods Image Text Extraction and Recognition using Hybrid Approach of Region Based and Connected Component Methods Ms. N. Geetha 1 Assistant Professor Department of Computer Applications Vellalar College for

More information

Edge-based Features for Localization of Artificial Urdu Text in Video Images

Edge-based Features for Localization of Artificial Urdu Text in Video Images 2011 International Conference on Document Analysis and Recognition Edge-based Features for Localization of Artificial Urdu Text in Video Images Akhtar Jamil Imran Siddiqi Fahim Arif Ahsen Raza Department

More information

Image retrieval based on bag of images

Image retrieval based on bag of images University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 Image retrieval based on bag of images Jun Zhang University of Wollongong

More information

Automatically Algorithm for Physician s Handwritten Segmentation on Prescription

Automatically Algorithm for Physician s Handwritten Segmentation on Prescription Automatically Algorithm for Physician s Handwritten Segmentation on Prescription Narumol Chumuang 1 and Mahasak Ketcham 2 Department of Information Technology, Faculty of Information Technology, King Mongkut's

More information

Separation of Overlapping Text from Graphics

Separation of Overlapping Text from Graphics Separation of Overlapping Text from Graphics Ruini Cao, Chew Lim Tan School of Computing, National University of Singapore 3 Science Drive 2, Singapore 117543 Email: {caorn, tancl}@comp.nus.edu.sg Abstract

More information

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE K. Kaviya Selvi 1 and R. S. Sabeenian 2 1 Department of Electronics and Communication Engineering, Communication Systems, Sona College

More information

OTCYMIST: Otsu-Canny Minimal Spanning Tree for Born-Digital Images

OTCYMIST: Otsu-Canny Minimal Spanning Tree for Born-Digital Images OTCYMIST: Otsu-Canny Minimal Spanning Tree for Born-Digital Images Deepak Kumar and A G Ramakrishnan Medical Intelligence and Language Engineering Laboratory Department of Electrical Engineering, Indian

More information

CONTENT ADAPTIVE SCREEN IMAGE SCALING

CONTENT ADAPTIVE SCREEN IMAGE SCALING CONTENT ADAPTIVE SCREEN IMAGE SCALING Yao Zhai (*), Qifei Wang, Yan Lu, Shipeng Li University of Science and Technology of China, Hefei, Anhui, 37, China Microsoft Research, Beijing, 8, China ABSTRACT

More information

LEVERAGING SURROUNDING CONTEXT FOR SCENE TEXT DETECTION

LEVERAGING SURROUNDING CONTEXT FOR SCENE TEXT DETECTION LEVERAGING SURROUNDING CONTEXT FOR SCENE TEXT DETECTION Yao Li 1, Chunhua Shen 1, Wenjing Jia 2, Anton van den Hengel 1 1 The University of Adelaide, Australia 2 University of Technology, Sydney, Australia

More information

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

Text Information Extraction And Analysis From Images Using Digital Image Processing Techniques

Text Information Extraction And Analysis From Images Using Digital Image Processing Techniques Text Information Extraction And Analysis From Images Using Digital Image Processing Techniques Partha Sarathi Giri Department of Electronics and Communication, M.E.M.S, Balasore, Odisha Abstract Text data

More information

Layout Segmentation of Scanned Newspaper Documents

Layout Segmentation of Scanned Newspaper Documents , pp-05-10 Layout Segmentation of Scanned Newspaper Documents A.Bandyopadhyay, A. Ganguly and U.Pal CVPR Unit, Indian Statistical Institute 203 B T Road, Kolkata, India. Abstract: Layout segmentation algorithms

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

ABSTRACT 1. INTRODUCTION 2. RELATED WORK

ABSTRACT 1. INTRODUCTION 2. RELATED WORK Improving text recognition by distinguishing scene and overlay text Bernhard Quehl, Haojin Yang, Harald Sack Hasso Plattner Institute, Potsdam, Germany Email: {bernhard.quehl, haojin.yang, harald.sack}@hpi.de

More information

Recognition of Multiple Characters in a Scene Image Using Arrangement of Local Features

Recognition of Multiple Characters in a Scene Image Using Arrangement of Local Features 2011 International Conference on Document Analysis and Recognition Recognition of Multiple Characters in a Scene Image Using Arrangement of Local Features Masakazu Iwamura, Takuya Kobayashi, and Koichi

More information

Text Enhancement with Asymmetric Filter for Video OCR. Datong Chen, Kim Shearer and Hervé Bourlard

Text Enhancement with Asymmetric Filter for Video OCR. Datong Chen, Kim Shearer and Hervé Bourlard Text Enhancement with Asymmetric Filter for Video OCR Datong Chen, Kim Shearer and Hervé Bourlard Dalle Molle Institute for Perceptual Artificial Intelligence Rue du Simplon 4 1920 Martigny, Switzerland

More information

Binarization of Color Character Strings in Scene Images Using K-means Clustering and Support Vector Machines

Binarization of Color Character Strings in Scene Images Using K-means Clustering and Support Vector Machines 2011 International Conference on Document Analysis and Recognition Binarization of Color Character Strings in Scene Images Using K-means Clustering and Support Vector Machines Toru Wakahara Kohei Kita

More information

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c

A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b and Guichi Liu2, c 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2015) A Background Modeling Approach Based on Visual Background Extractor Taotao Liu1, a, Lin Qi2, b

More information

Restoring Warped Document Image Based on Text Line Correction

Restoring Warped Document Image Based on Text Line Correction Restoring Warped Document Image Based on Text Line Correction * Dep. of Electrical Engineering Tamkang University, New Taipei, Taiwan, R.O.C *Correspondending Author: hsieh@ee.tku.edu.tw Abstract Document

More information

Conspicuous Character Patterns

Conspicuous Character Patterns Conspicuous Character Patterns Seiichi Uchida Kyushu Univ., Japan Ryoji Hattori Masakazu Iwamura Kyushu Univ., Japan Osaka Pref. Univ., Japan Koichi Kise Osaka Pref. Univ., Japan Shinichiro Omachi Tohoku

More information

Color Image Segmentation

Color Image Segmentation Color Image Segmentation Yining Deng, B. S. Manjunath and Hyundoo Shin* Department of Electrical and Computer Engineering University of California, Santa Barbara, CA 93106-9560 *Samsung Electronics Inc.

More information

Texture Segmentation by Windowed Projection

Texture Segmentation by Windowed Projection Texture Segmentation by Windowed Projection 1, 2 Fan-Chen Tseng, 2 Ching-Chi Hsu, 2 Chiou-Shann Fuh 1 Department of Electronic Engineering National I-Lan Institute of Technology e-mail : fctseng@ccmail.ilantech.edu.tw

More information

A Study on Similarity Computations in Template Matching Technique for Identity Verification

A Study on Similarity Computations in Template Matching Technique for Identity Verification A Study on Similarity Computations in Template Matching Technique for Identity Verification Lam, S. K., Yeong, C. Y., Yew, C. T., Chai, W. S., Suandi, S. A. Intelligent Biometric Group, School of Electrical

More information

A New Algorithm for Detecting Text Line in Handwritten Documents

A New Algorithm for Detecting Text Line in Handwritten Documents A New Algorithm for Detecting Text Line in Handwritten Documents Yi Li 1, Yefeng Zheng 2, David Doermann 1, and Stefan Jaeger 1 1 Laboratory for Language and Media Processing Institute for Advanced Computer

More information

Toward Part-based Document Image Decoding

Toward Part-based Document Image Decoding 2012 10th IAPR International Workshop on Document Analysis Systems Toward Part-based Document Image Decoding Wang Song, Seiichi Uchida Kyushu University, Fukuoka, Japan wangsong@human.ait.kyushu-u.ac.jp,

More information

Text Detection and Extraction from Natural Scene: A Survey Tajinder Kaur 1 Post-Graduation, Department CE, Punjabi University, Patiala, Punjab India

Text Detection and Extraction from Natural Scene: A Survey Tajinder Kaur 1 Post-Graduation, Department CE, Punjabi University, Patiala, Punjab India Volume 3, Issue 3, March 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com ISSN:

More information

EDGE BASED REGION GROWING

EDGE BASED REGION GROWING EDGE BASED REGION GROWING Rupinder Singh, Jarnail Singh Preetkamal Sharma, Sudhir Sharma Abstract Image segmentation is a decomposition of scene into its components. It is a key step in image analysis.

More information

Pattern Recognition 46 (2013) Contents lists available at SciVerse ScienceDirect. Pattern Recognition

Pattern Recognition 46 (2013) Contents lists available at SciVerse ScienceDirect. Pattern Recognition Pattern Recognition 46 (2013) 131 140 Contents lists available at SciVerse ScienceDirect Pattern Recognition journal homepage: www.elsevier.com/locate/pr A novel ring radius transform for video character

More information

Image retrieval based on region shape similarity

Image retrieval based on region shape similarity Image retrieval based on region shape similarity Cheng Chang Liu Wenyin Hongjiang Zhang Microsoft Research China, 49 Zhichun Road, Beijing 8, China {wyliu, hjzhang}@microsoft.com ABSTRACT This paper presents

More information

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou

Express Letters. A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation. Jianhua Lu and Ming L. Liou IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 7, NO. 2, APRIL 1997 429 Express Letters A Simple and Efficient Search Algorithm for Block-Matching Motion Estimation Jianhua Lu and

More information

Logical Templates for Feature Extraction in Fingerprint Images

Logical Templates for Feature Extraction in Fingerprint Images Logical Templates for Feature Extraction in Fingerprint Images Bir Bhanu, Michael Boshra and Xuejun Tan Center for Research in Intelligent Systems University of Califomia, Riverside, CA 9252 1, USA Email:

More information

Effects Of Shadow On Canny Edge Detection through a camera

Effects Of Shadow On Canny Edge Detection through a camera 1523 Effects Of Shadow On Canny Edge Detection through a camera Srajit Mehrotra Shadow causes errors in computer vision as it is difficult to detect objects that are under the influence of shadows. Shadow

More information

AViTExt: Automatic Video Text Extraction

AViTExt: Automatic Video Text Extraction AViTExt: Automatic Video Text Extraction A new Approach for video content indexing Application Baseem Bouaziz systems and Advanced Computing Bassem.bouazizgfsegs.rnu.tn Tarek Zlitni systems and Advanced

More information

Text Separation from Graphics by Analyzing Stroke Width Variety in Persian City Maps

Text Separation from Graphics by Analyzing Stroke Width Variety in Persian City Maps Text Separation from Graphics by Analyzing Stroke Width Variety in Persian City Maps Ali Ghafari-Beranghar Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran,

More information

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial

More information

LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES

LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES LOCALIZATION OF OVERLAID TEXT BASED ON NOISE INCONSISTENCIES Too Kipyego Boaz and Prabhakar C. J. Department of Computer Science, Kuvempu University, India ABSTRACT In this paper, we present a novel technique

More information

Recognition-based Segmentation of Nom Characters from Body Text Regions of Stele Images Using Area Voronoi Diagram

Recognition-based Segmentation of Nom Characters from Body Text Regions of Stele Images Using Area Voronoi Diagram Author manuscript, published in "International Conference on Computer Analysis of Images and Patterns - CAIP'2009 5702 (2009) 205-212" DOI : 10.1007/978-3-642-03767-2 Recognition-based Segmentation of

More information

Extracting Layers and Recognizing Features for Automatic Map Understanding. Yao-Yi Chiang

Extracting Layers and Recognizing Features for Automatic Map Understanding. Yao-Yi Chiang Extracting Layers and Recognizing Features for Automatic Map Understanding Yao-Yi Chiang 0 Outline Introduction/ Problem Motivation Map Processing Overview Map Decomposition Feature Recognition Discussion

More information

Translation Symmetry Detection: A Repetitive Pattern Analysis Approach

Translation Symmetry Detection: A Repetitive Pattern Analysis Approach 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops Translation Symmetry Detection: A Repetitive Pattern Analysis Approach Yunliang Cai and George Baciu GAMA Lab, Department of Computing

More information

Text Detection from Natural Image using MSER and BOW

Text Detection from Natural Image using MSER and BOW International Journal of Emerging Engineering Research and Technology Volume 3, Issue 11, November 2015, PP 152-156 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) Text Detection from Natural Image using

More information

Snakes reparameterization for noisy images segmentation and targets tracking

Snakes reparameterization for noisy images segmentation and targets tracking Snakes reparameterization for noisy images segmentation and targets tracking Idrissi Sidi Yassine, Samir Belfkih. Lycée Tawfik Elhakim Zawiya de Noaceur, route de Marrakech, Casablanca, maroc. Laboratoire

More information

Mobile Camera Based Text Detection and Translation

Mobile Camera Based Text Detection and Translation Mobile Camera Based Text Detection and Translation Derek Ma Qiuhau Lin Tong Zhang Department of Electrical EngineeringDepartment of Electrical EngineeringDepartment of Mechanical Engineering Email: derekxm@stanford.edu

More information

Multi-script Text Extraction from Natural Scenes

Multi-script Text Extraction from Natural Scenes Multi-script Text Extraction from Natural Scenes Lluís Gómez and Dimosthenis Karatzas Computer Vision Center Universitat Autònoma de Barcelona Email: {lgomez,dimos}@cvc.uab.es Abstract Scene text extraction

More information

Automatic Shadow Removal by Illuminance in HSV Color Space

Automatic Shadow Removal by Illuminance in HSV Color Space Computer Science and Information Technology 3(3): 70-75, 2015 DOI: 10.13189/csit.2015.030303 http://www.hrpub.org Automatic Shadow Removal by Illuminance in HSV Color Space Wenbo Huang 1, KyoungYeon Kim

More information

Connected Component Clustering Based Text Detection with Structure Based Partition and Grouping

Connected Component Clustering Based Text Detection with Structure Based Partition and Grouping IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 5, Ver. III (Sep Oct. 2014), PP 50-56 Connected Component Clustering Based Text Detection with Structure

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Text Localization and Extraction in Natural Scene Images

Text Localization and Extraction in Natural Scene Images Text Localization and Extraction in Natural Scene Images Miss Nikita her M.E. Student, MET BKC IOE, University of Pune, Nasik, India. e-mail: nikitaaher@gmail.com bstract Content based image analysis methods

More information

Bus Detection and recognition for visually impaired people

Bus Detection and recognition for visually impaired people Bus Detection and recognition for visually impaired people Hangrong Pan, Chucai Yi, and Yingli Tian The City College of New York The Graduate Center The City University of New York MAP4VIP Outline Motivation

More information

Color-Texture Segmentation of Medical Images Based on Local Contrast Information

Color-Texture Segmentation of Medical Images Based on Local Contrast Information Color-Texture Segmentation of Medical Images Based on Local Contrast Information Yu-Chou Chang Department of ECEn, Brigham Young University, Provo, Utah, 84602 USA ycchang@et.byu.edu Dah-Jye Lee Department

More information

Scene Text Recognition in Mobile Application using K-Mean Clustering and Support Vector Machine

Scene Text Recognition in Mobile Application using K-Mean Clustering and Support Vector Machine ISSN: 2278 1323 All Rights Reserved 2015 IJARCET 2492 Scene Text Recognition in Mobile Application using K-Mean Clustering and Support Vector Machine Priyanka N Guttedar, Pushpalata S Abstract In natural

More information

Iterative Removing Salt and Pepper Noise based on Neighbourhood Information

Iterative Removing Salt and Pepper Noise based on Neighbourhood Information Iterative Removing Salt and Pepper Noise based on Neighbourhood Information Liu Chun College of Computer Science and Information Technology Daqing Normal University Daqing, China Sun Bishen Twenty-seventh

More information

HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION. Gengjian Xue, Li Song, Jun Sun, Meng Wu

HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION. Gengjian Xue, Li Song, Jun Sun, Meng Wu HYBRID CENTER-SYMMETRIC LOCAL PATTERN FOR DYNAMIC BACKGROUND SUBTRACTION Gengjian Xue, Li Song, Jun Sun, Meng Wu Institute of Image Communication and Information Processing, Shanghai Jiao Tong University,

More information

N.Priya. Keywords Compass mask, Threshold, Morphological Operators, Statistical Measures, Text extraction

N.Priya. Keywords Compass mask, Threshold, Morphological Operators, Statistical Measures, Text extraction Volume, Issue 8, August ISSN: 77 8X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Combined Edge-Based Text

More information

Hidden Loop Recovery for Handwriting Recognition

Hidden Loop Recovery for Handwriting Recognition Hidden Loop Recovery for Handwriting Recognition David Doermann Institute of Advanced Computer Studies, University of Maryland, College Park, USA E-mail: doermann@cfar.umd.edu Nathan Intrator School of

More information

An ICA based Approach for Complex Color Scene Text Binarization

An ICA based Approach for Complex Color Scene Text Binarization An ICA based Approach for Complex Color Scene Text Binarization Siddharth Kherada IIIT-Hyderabad, India siddharth.kherada@research.iiit.ac.in Anoop M. Namboodiri IIIT-Hyderabad, India anoop@iiit.ac.in

More information

CAP 6412 Advanced Computer Vision

CAP 6412 Advanced Computer Vision CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 21st, 2016 Today Administrivia Free parameters in an approach, model, or algorithm? Egocentric videos by Aisha

More information

Defect Inspection of Liquid-Crystal-Display (LCD) Panels in Repetitive Pattern Images Using 2D Fourier Image Reconstruction

Defect Inspection of Liquid-Crystal-Display (LCD) Panels in Repetitive Pattern Images Using 2D Fourier Image Reconstruction Defect Inspection of Liquid-Crystal-Display (LCD) Panels in Repetitive Pattern Images Using D Fourier Image Reconstruction Du-Ming Tsai, and Yan-Hsin Tseng Department of Industrial Engineering and Management

More information

A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition

A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition A Feature based on Encoding the Relative Position of a Point in the Character for Online Handwritten Character Recognition Dinesh Mandalapu, Sridhar Murali Krishna HP Laboratories India HPL-2007-109 July

More information

THE description and representation of the shape of an object

THE description and representation of the shape of an object Enhancement of Shape Description and Representation by Slope Ali Salem Bin Samma and Rosalina Abdul Salam Abstract Representation and description of object shapes by the slopes of their contours or borders

More information

An Objective Evaluation Methodology for Handwritten Image Document Binarization Techniques

An Objective Evaluation Methodology for Handwritten Image Document Binarization Techniques An Objective Evaluation Methodology for Handwritten Image Document Binarization Techniques K. Ntirogiannis, B. Gatos and I. Pratikakis Computational Intelligence Laboratory, Institute of Informatics and

More information

SOME stereo image-matching methods require a user-selected

SOME stereo image-matching methods require a user-selected IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 3, NO. 2, APRIL 2006 207 Seed Point Selection Method for Triangle Constrained Image Matching Propagation Qing Zhu, Bo Wu, and Zhi-Xiang Xu Abstract In order

More information

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE 5359 Gaurav Hansda 1000721849 gaurav.hansda@mavs.uta.edu Outline Introduction to H.264 Current algorithms for

More information

An Efficient Single Chord-based Accumulation Technique (SCA) to Detect More Reliable Corners

An Efficient Single Chord-based Accumulation Technique (SCA) to Detect More Reliable Corners An Efficient Single Chord-based Accumulation Technique (SCA) to Detect More Reliable Corners Mohammad Asiful Hossain, Abdul Kawsar Tushar, and Shofiullah Babor Computer Science and Engineering Department,

More information

Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction

Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Ham Rara, Shireen Elhabian, Asem Ali University of Louisville Louisville, KY {hmrara01,syelha01,amali003}@louisville.edu Mike Miller,

More information

arxiv: v1 [cs.cv] 23 Apr 2016

arxiv: v1 [cs.cv] 23 Apr 2016 Text Flow: A Unified Text Detection System in Natural Scene Images Shangxuan Tian1, Yifeng Pan2, Chang Huang2, Shijian Lu3, Kai Yu2, and Chew Lim Tan1 arxiv:1604.06877v1 [cs.cv] 23 Apr 2016 1 School of

More information

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds 9 1th International Conference on Document Analysis and Recognition Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds Weihan Sun, Koichi Kise Graduate School

More information

A System to Retrieve Text/Symbols from Color Maps using Connected Component and Skeleton Analysis

A System to Retrieve Text/Symbols from Color Maps using Connected Component and Skeleton Analysis A System to Retrieve Text/Symbols from Color Maps using Connected Component and Skeleton Analysis Partha Pratim Roy 1, Eduard Vazquez 1, Josep Lladós 1, Ramon Baldrich 1, and Umapada Pal 2 1 Computer Vision

More information

Shape Descriptor using Polar Plot for Shape Recognition.

Shape Descriptor using Polar Plot for Shape Recognition. Shape Descriptor using Polar Plot for Shape Recognition. Brijesh Pillai ECE Graduate Student, Clemson University bpillai@clemson.edu Abstract : This paper presents my work on computing shape models that

More information

The Application of Image Processing to Solve Occlusion Issue in Object Tracking

The Application of Image Processing to Solve Occlusion Issue in Object Tracking The Application of Image Processing to Solve Occlusion Issue in Object Tracking Yun Zhe Cheong 1 and Wei Jen Chew 1* 1 School of Engineering, Taylor s University, 47500 Subang Jaya, Selangor, Malaysia.

More information

Automatic Texture Segmentation for Texture-based Image Retrieval

Automatic Texture Segmentation for Texture-based Image Retrieval Automatic Texture Segmentation for Texture-based Image Retrieval Ying Liu, Xiaofang Zhou School of ITEE, The University of Queensland, Queensland, 4072, Australia liuy@itee.uq.edu.au, zxf@itee.uq.edu.au

More information