Procedia Computer Science 96 (2016) 1409–1417
20th International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2016, 5-7 September 2016, York, United Kingdom

A multi-stage method for Chinese text detection in news videos

Yaqi Wang, Liangrui Peng, Shengjin Wang
Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China

Abstract

With the rapid increase of on-line video resources, there is an urgent demand for text detection and recognition technologies to build content-based video indexing and retrieval systems. Chinese news video texts contain highly condensed and rich information, but the low resolution of videos on the Internet and the complexity of Chinese character structures make text detection challenging. In this paper, we present a multi-stage scheme for Chinese news video text detection. We propose an improved Stroke Width Transform (SWT) method that incorporates a text color consistency constraint for candidate text block generation. We then use a divide-and-conquer strategy to partition candidate text blocks into three sub-spaces according to their geometric shape and size. For each sub-space, a neural network is designed to classify the candidates into text or non-text blocks. Finally, the text blocks are merged into text lines based on stroke width, color and other heuristic information. Experimental results on a self-collected Chinese news video dataset and the ICDAR 2013 dataset show that the proposed method is effective at detecting both news video captions and scene texts.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of KES International.
Corresponding author. Tel.: +86-10-62773710 ext. 308; fax: +86-10-62773710 ext. 313. E-mail address: {wangyaqi,plr,wsj}@ocrserv.ee.tsinghua.edu.cn
1877-0509 © 2016 Published by Elsevier B.V. doi:10.1016/j.procs.2016.08.186

Keywords: Chinese text detection, News video, Stroke Width Transform, Neural network

1. Introduction

Content-based video indexing and retrieval systems are in high demand because of the rapid proliferation of video resources on the Internet. News video, a common type of video, contains a large amount of meaningful information, including captions and scene texts in video images. Research on text detection and recognition is therefore essential to building content-based news video indexing and retrieval systems. However, video text detection is difficult because of complex backgrounds, varied text formats and styles, and the low resolution of on-line videos. Some examples from the China Central Television (CCTV) news video Internet resources are shown in Fig. 1. Applying conventional Optical Character Recognition (OCR) systems directly to these news video samples does not give satisfactory results. New, effective text detection technologies are needed to overcome these difficulties.
Text detection methods for video and images can be mainly classified into two categories: feature-based methods and machine learning-based methods. Feature-based methods make use of attributes of text regions, such as local intensities and edge densities, and extract proper feature descriptors to identify text regions [1]. According to the type of features used, feature-based methods can be further divided into three groups: connected component-based methods [1,2,3,4,5], edge-based methods [6,7] and texture-based methods [8,9,10]. The Stroke Width Transform (SWT) [7] is an edge-based method that uses stroke edge information under the assumption that characters in the same text line have uniform stroke width. It detects various kinds of text robustly but tends to produce more false alarms.

Fig. 1. Examples of captions and background texts in news video.

Machine learning-based methods adopt classifiers such as the Support Vector Machine (SVM) [11], neural networks [12,13] or Bayesian classifiers [14] to distinguish text blocks from non-text blocks. Sun et al. [15] proposed a method for text and non-text classification based on shallow neural networks. Deep learning methods [16] can also be adopted.

There are two major types of text in Chinese news video: captions and scene texts. The main challenges are the low quality of captions in on-line video images and the unconstrained scene text in complex backgrounds. In this paper, we aim to develop a reliable multi-stage scheme for Chinese text detection in news videos. A color-enhanced SWT method is used to generate candidate text blocks. A set of neural networks is then designed to filter out non-text blocks, which further decreases the false alarm rate. Finally, the text blocks are merged into text lines in the post-processing stage.

The rest of this paper is organized as follows.
Section 2 presents the principle of the proposed Chinese text detection scheme. Experimental results are reported in Section 3. The final section is the conclusion.

2. Method

The proposed text detection algorithm includes three main stages: (1) candidate generation, (2) candidate filtering, and (3) text line generation, as shown in Fig. 2.

2.1. Candidate generation

We adopt SWT [7] to extract connected components. SWT can be seen as a local operator that assigns to each pixel the width of the stroke most likely to contain that pixel. The main steps of SWT computation are Canny edge detection and ray shooting. The result of the SWT operator is a stroke width map with the same size as the original image. Every element of the stroke width map is initialized to a default background value.

To recover strokes, the first step is to use the Canny edge detector [17] to find edge pixels and their gradient directions in the original image, as shown in Fig. 3. Then we shoot a ray from each edge pixel $p_i$ along its gradient direction $d_p$, as shown by the red arrow in Fig. 4. If another edge pixel $q_i$ is found whose gradient direction $d_q$ is roughly opposite to $d_p$, all the elements of the output stroke width map covered by the segment $[p_i, q_i]$ are assigned the same stroke width value $\|p_i - q_i\|$. The ray is discarded if no suitable opposite edge pixel is found.

After the first pass, all the edge pixels have been considered, but some values in the stroke width map are incorrect in complex situations, as shown by the blue arrow in Fig. 4. To handle such cases, a second pass is applied to all elements of the stroke width map whose values are above the mean value $m$, in order to replace
them with the mean value $m$. Examples of stroke width maps are shown in Fig. 5, and the connected components generated from the stroke width maps are shown in Fig. 6.

Fig. 2. System flowchart.
Fig. 3. Examples of Canny edge detection results.
Fig. 4. Explanation of the SWT operator.

Some problems remain in the extracted text connected components: for example, one character may be split into two or more connected components, or several characters may be merged into one connected component. To reduce these problems, we add a color consistency constraint. In the ray shooting process, the current
ray's color $C_r$ is defined as the median of the R, G, B values of all the pixels in the original image along the ray direction. If the color value of a pixel $q_i$ along the current ray satisfies $\|C_{q_i} - C_r\| < \lambda_c$, where $\lambda_c$ decreases linearly from 200 to 100 as the number of pixels in the current ray increases, the pixel $q_i$ is considered an edge pixel on the current stroke. With the color consistency constraint, the connected components obtained from the stroke width maps are improved, as shown in Fig. 7.

Fig. 5. Examples of the stroke width maps.
Fig. 6. The generated connected components from the stroke width maps.
Fig. 7. Before and after using color consistency constraint for SWT.

After generating the connected components, some flexible rules are employed to filter out the non-text blocks [7]. The parameters of each rule can be estimated on the training set of samples. The rules and their thresholds are summarized below.

1. Component size. Components that are too large or too small are ignored. The height of a component should be less than 1/3 and greater than 1/100 of the image's height.
2. Stroke width variance. The variance of the stroke width within each connected component should be less than half of the average stroke width.
3. Stroke color variance. The variance of the stroke color within each connected component should be less than half of the average stroke color.
4. Aspect ratio. The ratio of height to width of a text block's bounding box is limited to between 0.1 and 10.
5. Overlap of bounding boxes. The bounding box of a text block should not include more than two other text blocks. This eliminates connected components that may surround text, such as sign frames.
6. Area ratio. The ratio of the number of stroke pixels to the total pixels of the connected component should be greater than 0.25.
The thresholds of the above rules are set to avoid missing text blocks. The remaining connected components are text block candidates for the next stage of processing.
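
The six rules above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Component` record and its field names are hypothetical, stroke color is treated as one scalar value per pixel, and "total pixels of the connected component" in rule 6 is read as the bounding-box area.

```python
from dataclasses import dataclass
from statistics import mean, pvariance
from typing import List


@dataclass
class Component:
    """Hypothetical record for one connected component (names are illustrative)."""
    width: int                  # bounding-box width in pixels
    height: int                 # bounding-box height in pixels
    stroke_widths: List[float]  # SWT value of each stroke pixel
    stroke_colors: List[float]  # one scalar color value per stroke pixel (assumed)
    enclosed: int = 0           # number of other candidates inside the bounding box


def passes_rules(c: Component, image_height: int) -> bool:
    """Return True if the component survives rules 1-6 as listed above."""
    # Rule 1: height between 1/100 and 1/3 of the image height.
    if not (image_height / 100 < c.height < image_height / 3):
        return False
    # Rule 2: stroke width variance below half of the average stroke width.
    if pvariance(c.stroke_widths) >= 0.5 * mean(c.stroke_widths):
        return False
    # Rule 3: stroke color variance below half of the average stroke color.
    if pvariance(c.stroke_colors) >= 0.5 * mean(c.stroke_colors):
        return False
    # Rule 4: height-to-width ratio of the bounding box within [0.1, 10].
    if not (0.1 <= c.height / c.width <= 10):
        return False
    # Rule 5: the bounding box may contain at most two other candidates.
    if c.enclosed > 2:
        return False
    # Rule 6: stroke pixels must cover more than 25% of the bounding box
    # (assumption: the bounding-box area is the denominator).
    return len(c.stroke_widths) / (c.width * c.height) > 0.25
```

For example, a 20×24 component in a 480-pixel-high frame with 200 stroke pixels of width 3 or 4 and uniform color passes all six rules, while the same component with three enclosed candidates is rejected by rule 5.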
2.2. Text and non-text classification

Because the text block candidates vary widely in shape and size, we use a divide-and-conquer strategy. A text candidate $C_i$ is assigned to one of three sub-spaces, namely Radical, Single Character and Multi-Characters, according to its height $h_i$, width $w_i$ and aspect ratio $Ar_i$, as shown in Fig. 8. The detailed classification rules are defined in Table 1.

Fig. 8. Sub-spaces for text candidates.

Table 1. Empirical criteria for text candidate subspace partition (aspect ratio $Ar_i$, width $w_i$, height $h_i$).

Sub-space           Criterion
Radical             ($Ar_i \le TH_{Ar1}$ and $w_i \le TH$) or ($Ar_i \ge TH_{Ar2}$ and $h_i \le TH$)
Single Character    $TH_{Ar1} \le Ar_i \le TH_{Ar2}$
Multi-Characters    ($Ar_i \le TH_{Ar1}$ or $Ar_i \ge TH_{Ar2}$), and $w_i > TH$ and $h_i > TH$

In our experiment, we set $TH = 12$, $TH_{Ar1} = 0.6$ and $TH_{Ar2} = 1.5$ empirically. In addition, if the aspect ratio of a Multi-Characters block exceeds a threshold of 3.5, the block is segmented into several blocks according to its vertical projection histogram. All the candidates in the different sub-spaces are then normalized. The normalized size is 12×28 or 28×12 pixels for Radical, 28×28 pixels for Single Character, and 18×45 (height × width) pixels for Multi-Characters.

For each sub-space, a neural network is trained to classify the candidates as text or non-text. We use a Multilayer Perceptron (MLP) structure with two hidden layers. The numbers of nodes in each layer of the three networks are shown in Table 2.

Table 2. Numbers of nodes in each layer of the specific neural networks.

Layer              Radical    Single Character    Multi-Characters
Input layer        432        784                 810
Hidden layer I     256        400                 576
Hidden layer II    128        200                 288
Output layer       2          2                   2

Examples of candidate filtering are shown in Fig. 9.
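
The sub-space partition and the network sizes can be sketched as follows. This is an illustrative sketch, not the authors' code. It assumes that the aspect ratio is width divided by height, which the text does not state; under that reading the three criteria of Table 1 cover every candidate, because a narrow block wider than 12 pixels or a wide block taller than 12 pixels necessarily exceeds the threshold in both dimensions. Boundary cases are resolved in favor of Single Character, and the function names are hypothetical. The layer sizes are copied from Table 2 as printed; note that the Radical input size of 432 differs from 12 × 28 = 336.

```python
TH = 12        # size threshold in pixels, from the text
TH_AR1 = 0.6   # lower aspect-ratio threshold, from the text
TH_AR2 = 1.5   # upper aspect-ratio threshold, from the text

# Layer sizes (input, hidden I, hidden II, output) as printed in Table 2.
LAYERS = {
    "radical": [432, 256, 128, 2],
    "single": [784, 400, 200, 2],
    "multi": [810, 576, 288, 2],
}


def subspace(width: int, height: int) -> str:
    """Assign a candidate to a sub-space; aspect ratio = width / height (assumed)."""
    ar = width / height
    if TH_AR1 <= ar <= TH_AR2:
        return "single"
    # Narrow-and-thin or wide-and-short blocks are treated as radicals.
    if (ar < TH_AR1 and width <= TH) or (ar > TH_AR2 and height <= TH):
        return "radical"
    # Everything else has both width and height above TH.
    return "multi"


def parameter_count(name: str) -> int:
    """Weights plus biases of a fully connected network with the given layer sizes."""
    sizes = LAYERS[name]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

Calling `subspace(20, 20)` returns `"single"`, while `subspace(8, 24)` returns `"radical"`. Assuming fully connected layers with biases, `parameter_count` gives a rough sense of how small the three networks are.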
Fig. 9. Examples of text block candidate filtering; yellow, green and red bounding boxes correspond to Radical, Single Character and Multi-Characters, respectively.

As the Convolutional Neural Network (CNN) is one of the emerging deep learning techniques, it is also feasible to adopt a CNN for candidate filtering. The CNN model structure [16] that we compare with the MLP has two convolutional layers with 16 and 20 filters respectively, and two fully connected layers. The comparison results are presented in Section 3.

2.3. Text line generation

The remaining components are further grouped into text lines at this stage. Characters appearing in news video frames usually group into lines. Each pair of neighboring candidates is considered first. Whether two neighboring candidates can be linked into a text line depends on how likely they are to belong to the same line [18]. Two candidates in a text line should have similar block height and width, stroke width and stroke color, and the distance between them should not exceed a threshold, e.g., 2 times the sum of their widths. The process is repeated until no more candidates can be merged into a line. The remaining isolated candidates are treated as noise. The merged text lines are shown in Fig. 10.

Fig. 10. Examples of text line generation results.

3. Experimental Results

3.1. Dataset

To evaluate the performance of the proposed algorithm, we built a dataset that contains 1,693 video frames selected from 183 clips of CCTV news videos available on the Internet. Each frame is 640 × 480 pixels. According to the position of the texts and the way they appear, text regions are classified manually into several categories, including captions, superimposed text passages, scene text and text in graphs. All the collected samples are manually labeled with text bounding boxes.
In order to compare the proposed method with existing methods, we also evaluate our method on the public dataset of ICDAR 2013 robust reading competition Challenge 2 [19], which has been widely used as one of the standard
benchmarks for text detection in natural images. The dataset contains 229 images in the training set and 233 images in the testing set. The quantitative evaluation uses Precision, Recall and F-measure, which are computed by comparing the ground truth boxes with the detected boxes [2].

3.2. Experiments on text block candidate generation

We compare the performance of our color-enhanced SWT method for text candidate generation with the original SWT method. The results are listed in Table 3. On the one hand, the color consistency constraint effectively reduces the number of rays shot from stroke edges into background regions. On the other hand, some pairs of stroke edges of a character whose gradient directions are not strictly opposite to each other can still be matched because they share the same stroke color.

Table 3. Comparison of SWT and Color-enhanced SWT.

Methods               Precision (%)    Recall (%)    F-measure (%)
SWT                   53.2             92.1          67.4
Color-enhanced SWT    62.1             94.4          74.9

We further evaluate the performance of text block candidate generation for different text types in Chinese news videos. Results are shown in Table 4.

Table 4. Text candidates detection results of Color-enhanced SWT on different text types in CCTV news video dataset.

Text type                     Testing set    Number of Chinese characters    Precision (%)    Recall (%)    F-measure (%)
Caption                       74             1,108                           62.4             93.2          73.5
Superimposed text passages    12             935                             77.5             95.5          85.6
Scene text                    32             276                             51.8             88.9          65.5
Text in graph                 36             314                             61.7             93.1          74.2
All types                     123            2,633                           62.1             94.4          74.9

3.3. Experiments on text and non-text classification

The MLP networks for the different candidate sub-spaces are trained and tested on the CCTV news video dataset. 723 images from the dataset are used to train the networks. After candidate generation by color-enhanced SWT, all the text candidates are manually labeled as Radical, Single Character or Multi-Characters to train each network.
Table 5 shows the sizes of the training and testing sets and the classification results for each type. To validate the performance of the shallow MLP network, we build a CNN for the Single Character sub-space and compare its accuracy with that of the MLP. The comparison results are shown in Table 6. The classification accuracy of the CNN is only slightly higher than that of the shallow MLP network. Considering the time spent on network training and the speed of classification, the shallow MLP network is more practical and is therefore adopted in our method.
Table 5. Results of text and non-text candidates classification.

Type of candidates    Blocks in the training set    Text blocks in the testing set    Acc. (%)    Non-text blocks in the testing set    Acc. (%)
Radical               4,256                         315                               93.1        518                                   94.2
Single Character      6,012                         610                               98.1        552                                   97.7
Multi-Characters      1,435                         124                               91.3        102                                   90.9

Table 6. Comparison of deep and shallow neural networks.

Architecture                                   Acc. for text blocks (%)    Acc. for non-text blocks (%)    Speed of classification (ms/image)
Shallow neural network (2 hidden layer MLP)    98.1                        97.7                            4
Deep neural network (4 layer CNN)              98.4                        98.1                            13 (without GPU)

3.4. Experiments on ICDAR2013 dataset

The dataset is from ICDAR 2013 robust reading competition Challenge 2, Task 2.1 Text Localization. Another set of MLP classifiers is trained on the ICDAR 2013 training set. Example results of our method on this dataset are shown in Fig. 11. The quantitative comparison of different methods is shown in Table 7. Our method achieves the highest recall among the compared methods (72.42%).

Fig. 11. Examples of text line generation results on ICDAR 2013 dataset.

Table 7. Comparison of results on ICDAR 2013 dataset.

Method               Precision (%)    Recall (%)    F-measure (%)
Our method           81.38            72.42         74.46
USTB_TexStar [19]    88.47            66.45         75.89
I2R_NUS_FAR [19]     75.08            69.00         71.91
4. Conclusion

In this paper, a novel approach for Chinese text detection in news videos is proposed. We extend the SWT method by considering color information for text candidate generation, and then design a set of shallow MLP networks to classify candidates into text and non-text blocks before text line generation. Experimental results show that the proposed method is effective for Chinese text detection in news videos. In future work, more context information obtained by integrating text recognition will be used to improve the performance of the text detection method for Chinese news video. We will also try to apply deep learning based methods with GPU acceleration.

Acknowledgements

This research is funded by 973 National Basic Research Program of China (No. 2014CB340506) and National Natural Science Foundation of China (No. 61573028).

References

1. Phan, T.Q., Shivakumara, P., Lu, T., Tan, C.L.. Recognition of video text through temporal integration. In: Proceedings of the 12th International Conference on Document Analysis and Recognition. 2013, p. 589–593.
2. Zhong, Y., Karu, K., Jain, A.K.. Locating text in complex color images. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition; vol. 1. 1995, p. 146–149.
3. Jain, A.K., Yu, B.. Automatic text location in images and video frames. Pattern Recognition 1998;31(12):2055–2076.
4. Kim, H.K.. Efficient automatic text location method and content-based indexing and structuring of video database. Journal of Visual Communication and Image Representation 1996;7(4):336–344.
5. Liu, Y., Ikenaga, T.. A contour-based robust algorithm for text detection in color images. IEICE Transactions on Information and Systems 2006;89(3):1221–1230.
6. Wu, V., Manmatha, R., Riseman, E.M.. Textfinder: An automatic system to detect and recognize text in images.
IEEE Transactions on Pattern Analysis & Machine Intelligence 1999;(11):1224–1229.
7. Epshtein, B., Ofek, E., Wexler, Y.. Detecting text in natural scenes with stroke width transform. In: Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition. 2010, p. 2963–2970.
8. Kim, K.I., Jung, K., Kim, J.H.. Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003;25(12):1631–1639.
9. Wang, K., Babenko, B., Belongie, S.. End-to-end scene text recognition. In: Proceedings of the 13th International Conference on Computer Vision. 2011, p. 1457–1464.
10. Hanif, S.M., Prevost, L.. Text detection and localization in complex scene images using constrained adaboost algorithm. In: Proceedings of the 10th International Conference on Document Analysis and Recognition. 2009, p. 1–5.
11. Cortes, C., Vapnik, V.. Support-vector networks. Machine Learning 1995;20(3):273–297.
12. Zhang, X., Sun, F.. Pulse coupled neural network edge-based algorithm for image text locating. Tsinghua Science & Technology 2011;16(1):22–30.
13. Huang, W., Qiao, Y., Tang, X.. Robust scene text detection with convolution neural network induced mser trees. In: Proceedings of the European Conference on Computer Vision. 2014, p. 497–511.
14. Shivakumara, P., Sreedhar, R.P., Phan, T.Q., Lu, S., Tan, C.L.. Multioriented video scene text detection through Bayesian classification and boundary growing. IEEE Transactions on Circuits and Systems for Video Technology 2012;22(8):1227–1235.
15. Sun, L., Huo, Q., Jia, W., Chen, K.. Robust text detection in natural scene images by generalized color-enhanced contrasting extremal region and neural networks. In: Proceedings of the 22nd International Conference on Pattern Recognition. 2014, p. 2715–2720.
16. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.. Gradient-based learning applied to document recognition.
Proceedings of the IEEE 1998;86(11):2278–2324.
17. Canny, J.. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 1986;(6):679–698.
18. Chen, X., Yuille, A.L.. Detecting and reading text in natural scenes. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; vol. 2. 2004, p. II-366.
19. Karatzas, D., Shafait, F., Uchida, S., et al. ICDAR 2013 robust reading competition. In: Proceedings of the 12th International Conference on Document Analysis and Recognition. 2013, p. 1484–1493.