TEVI: Text Extraction for Video Indexing


Hichem KARRAY, Mohamed SALAH, Adel M. ALIMI
REGIM: Research Group on Intelligent Machines, ENIS, University of Sfax, Tunisia
hichem.karray@ieee.org, mohamed_salah@laposte.net, adel.alimi@ieee.org

Abstract

Efficient indexing and retrieval of digital video is an important aspect of video databases. One powerful index for retrieval is the text appearing in the frames, since it enables content-based browsing. In this paper, we describe a system for detecting and extracting text appearing in video frames. A supervised learning method based on color and edge information is used to detect text regions; afterwards, an unsupervised clustering for text segmentation and binarization is applied using color information. Experimental results demonstrate that the proposed approach is robust to font size, font color, background complexity and language.

Key-words: video OCR, multi-frame integration, text detection, localization, segmentation, neural networks, fuzzy C-means.

I. Introduction

Video continues to be handled as a basic (non-decomposable) object in multimedia documents. Its contents are rarely made explicit, and it is often very difficult to classify them or to extract any knowledge from them. In many applications, such as content-based indexing and retrieval, we need to reach the internal structure of the video and to lay out or handle data of finer granularity, such as text or visual objects. Classification and annotation are usually carried out manually according to a list of keywords chosen by the user. This technique is tiresome, and the automation of the indexing process is of great interest. The extraction of relevant information such as text can provide additional data regarding the semantic contents of these videos. Nevertheless, text detection and recognition encounter several problems.
Although the text is often relatively well contrasted with its environment, it may be superimposed on a heterogeneous and complex background. Moreover, the text itself can be multicoloured and heterogeneous. These characteristics make its extraction difficult. In this paper, we propose an approach to automatically localize, segment and binarize text appearing in video frames. We first apply a new multiple frame integration (MFI) method to minimize the variation of the background of the video frames. Second, a supervised learning method based on color and edge information is used to efficiently detect text regions. Third, an unsupervised clustering for text segmentation and binarization is applied using color information.

II. Text Extraction from Video

Works on text extraction may generally be grouped into four categories: connected component methods [1][3][10], texture classification methods [5][12], edge detection methods [9][13][2][8][11], and correlation based methods [6][4]. The connected component methods detect text by extracting connected components of monotonous colours that obey certain size, shape, and spatial alignment constraints. The texture-based methods treat the text region as a special type of texture and employ conventional texture classification methods to extract

the text. Edge detection methods have been increasingly used for caption extraction due to the rich edge concentration in characters. The correlation based methods are those that use some kind of correlation to decide whether a pixel belongs to a character or not. All the methods mentioned above either do not use temporal information or use it only as a complementary tool. In this paper we present a new approach in which we combine color and edges to extract text.

III. Proposed Method

TEVI is composed of three steps. First, we eliminate from the video frames the columns and rows of pixels which do not contain text. Second, we localize text in the remaining columns and rows. Finally, we extract the text from the frame.

1. Pixel filtering

In our approach, we assume that text must persist for a given duration to be readable, so the temporal aspect plays a key role in the text extraction process. We work on windows of frames; the length of every window is one second, and we operate only on the first and the last frame of every window. We then perform a correlation analysis between the first and the last frame by computing a correlation coefficient on the pixels of the rows (respectively the columns) of these two frames. This coefficient is computed as follows:

r_E = \frac{\sum_i \big(E_{Fst}(i) - \bar{E}_{Fst}\big)\big(E_{Lst}(i) - \bar{E}_{Lst}\big)}{\sqrt{\sum_i \big(E_{Fst}(i) - \bar{E}_{Fst}\big)^2 \; \sum_i \big(E_{Lst}(i) - \bar{E}_{Lst}\big)^2}}    (1)

where E(i) indicates the gray level of pixel i in the cluster of rows (or of columns) E, the subscripts Fst and Lst denote the first and last frame of the window, and \bar{E} is the mean gray level of the cluster. In this step, only the rows and columns containing correlated pixels, which may be text pixels, are kept (Figure 1).

Figure 1: Row and column filtering

2. Text detection and localization

From the remaining row and column clusters, we try to detect and localize text pixels. Every window is represented by two frames: one is the middle frame of the window filtered along the rows, and the other is the middle frame filtered along the columns. For every frame we perform two operations.
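As an illustration, the row and column filtering based on the correlation coefficient of Eq. (1) can be sketched as follows. This is a minimal sketch: the correlation threshold of 0.8 is an assumption, since the paper does not state the value it uses.

```python
import numpy as np

def keep_correlated_rows(first_frame, last_frame, threshold=0.8):
    """Keep the rows whose gray levels remain correlated between the
    first and last frame of a one-second window (Eq. 1).
    The 0.8 threshold is an assumed value, not taken from the paper."""
    fst = np.asarray(first_frame, dtype=float)
    lst = np.asarray(last_frame, dtype=float)
    kept = []
    for r in range(fst.shape[0]):
        a = fst[r] - fst[r].mean()          # centered gray levels, first frame
        b = lst[r] - lst[r].mean()          # centered gray levels, last frame
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        r_e = (a * b).sum() / denom if denom > 0 else 0.0
        if r_e >= threshold:
            kept.append(r)
    return kept
```

Columns are filtered the same way by applying the function to the transposed frames; the surviving rows and columns delimit the candidate text zones.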
First, we perform a transformation from the RGB space to the HSV space. Second, we generate an edge picture using Sobel filters. For every cluster of these frames, we formulate a vector composed of ten features: five representing the HSV image and five representing the edge picture. These features are computed as follows:

f_1(E, I) = M(E, I) / M(I)    (2)

f_2(E, I) = \mu_2(E, I) / \mu_2(I)    (3)

f_3(E, I) = \mu_3(E, I) / \mu_3(I)    (4)

f_4(E, I) = c_sup(E, I) / c_sup(I)    (5)

f_5(E, I) = c_inf(E, I) / c_inf(I)    (6)

E represents a row cluster or a column cluster, and I represents the HSV image or the edge picture.

M(E, I) is the mean of the pixel values of cluster E in picture I, with N the number of pixels in the cluster:

M(E, I) = \frac{1}{N} \sum_{i} E_I(i)    (7)

\mu_2(E, I) is the second order moment:

\mu_2(E, I) = \frac{1}{N} \sum_{i} \big(E_I(i) - M(E, I)\big)^2    (8)

\mu_3(E, I) is the third order moment:

\mu_3(E, I) = \frac{1}{N} \sum_{i} \big(E_I(i) - M(E, I)\big)^3    (9)

c_sup(E, I) is the maximum value of the confidence interval:

c_sup(E, I) = M(E, I) + \frac{t_\alpha \sqrt{\mu_2(E, I)}}{\sqrt{N}}    (10)

c_inf(E, I) is the minimum value of the confidence interval:

c_inf(E, I) = M(E, I) - \frac{t_\alpha \sqrt{\mu_2(E, I)}}{\sqrt{N}}    (11)

The generated vectors are presented to a trained neural network with 3 hidden nodes and one output node. The results of the classification are two images: an image containing the rows considered as text rows and an image containing the columns considered as text columns.
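The cluster features above can be sketched as follows. The 1/N normalization of the moments and the value t_alpha = 1.96 are assumptions of this sketch, since the paper leaves them implicit.

```python
import numpy as np

T_ALPHA = 1.96  # assumed confidence quantile; not given in the paper

def cluster_stats(values):
    """Mean, 2nd and 3rd central moments, and the confidence-interval
    bounds c_sup / c_inf (Eqs. 7-11) for one cluster of pixel values."""
    v = np.asarray(values, dtype=float)
    m = v.mean()
    mu2 = ((v - m) ** 2).mean()
    mu3 = ((v - m) ** 3).mean()
    half = T_ALPHA * np.sqrt(mu2) / np.sqrt(v.size)
    return m, mu2, mu3, m + half, m - half

def feature_vector(cluster_pixels, image_pixels):
    """Five ratio features (Eqs. 2-6) of a row/column cluster against
    the whole picture. Called once on the HSV values and once on the
    Sobel edge values, it builds the 10-dimensional input vector."""
    c = cluster_stats(cluster_pixels)
    g = cluster_stats(image_pixels)
    return [ci / gi if gi != 0 else 0.0 for ci, gi in zip(c, g)]
```

Concatenating the two five-element outputs (HSV picture, edge picture) yields the ten features fed to the neural network.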

Finally, we merge the results of the two images to generate an image containing the zones of text (Figure 2).

Figure 2: Text localization through neural networks

3. Text segmentation

After localizing text in the frame, the following step consists in segmenting and binarizing it. First, we compute the gray-level image. Second, for each pixel p in the text area, we create a vector composed of two features, the standard deviation and the entropy of its 8-neighborhood, which are computed as follows:

std(p) = \sqrt{\frac{1}{8} \sum_{i} \sum_{j} \big(f(i, j) - \bar{f}\big)^2}    (12)

ent(p) = -\sum_{i} \sum_{j} f(i, j) \log f(i, j)    (13)

where f(i, j) indicates the normalized gray level of the pixel at position (i, j) and \bar{f} is the mean gray level of the neighborhood. Third, we run the fuzzy C-means clustering algorithm to classify the pixels into a text cluster and a background cluster. Finally, we binarize the text image by marking text pixels in black (Figure 3). This technique is motivated by two observations: first, text usually has a distinctive texture; second, the borders of text characters result in high-contrast edges. The extracted text is then recognized by an OCR module.

(a) Original frame  (b) Segmented text frame
Figure 3: Example of text binarization

IV. Experimental results

In order to evaluate the proposed automatic text extraction solution, we used a varied database composed of different sources including TV news, commercials, sports and movies

at a resolution of 352x288, with a total of 60 minutes of video containing graphical text in multiple fonts and sizes. For evaluating the text detection performance, the precision and recall metrics have been used:

Recall = CCD / CGT        Precision = CCD / TCD

where CCD is the number of characters correctly detected by the algorithm, CGT is the number of ground-truth characters, and TCD is the total number of characters detected by the algorithm. The results are shown in Table 1. We notice that our approach is more efficient than the method of J. Gllavata et al. [7]. In fact, our approach is more robust to various font sizes, font styles, contrast levels and background complexities, because it uses both color features and edge features to differentiate text pixels from background pixels, and because it is based on a neural network trained on different types of text styles.

TABLE 1: EVALUATION RESULTS OF TEXT LOCALIZATION

                           Recall    Precision
Our approach               96%       93%
J. Gllavata et al. [7]     90%       87%

V. Conclusion

In this paper, we have proposed an approach to automatically localize and segment text appearing in video. The encouraging results show that the proposed method is robust to various font sizes, font styles, contrast levels and background complexities. For this reason, we have integrated our text extraction system into a global video indexing system (Figure 4).

[Figure 4: Video indexation system; offline feature and text extraction over the video database, online query]

The indexing system integrates many features, among which text takes an important place. An offline step is performed on the video database to extract the text; this task is achieved by the TEVI system.

REFERENCES

[1] A. K. Jain and B. Yu, "Automatic text location in images and video frames", Pattern Recognition, Vol. 31, No. 12, pp. 2055-2076, 1998.

[2] C. Garcia and X. Apostolidis, "Text detection and segmentation in complex color images", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 4, pp. 2326-2329, 2000.
[3] C. M. Lee and A. Kankanhalli, "Automatic extraction of characters in complex scene images", International Journal of Pattern Recognition and Artificial Intelligence, Vol. 9, No. 1, pp. 67-82, 1995.
[4] E. K. Wong and M. Chen, "A robust algorithm for text extraction in color video", Proc. IEEE International Conference on Multimedia and Expo, pp. 797-800, 2000.
[5] H. P. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video", IEEE Trans. on Image Processing, Vol. 9, No. 1, pp. 147-156, 2000.
[6] H. Karray and A. M. Alimi, "Detection and extraction of the text in a video sequence", Proc. IEEE International Conference on Electronics, Circuits and Systems (ICECS 2005), pp. 474-478, 2005.
[7] J. Gllavata, R. Ewerth and B. Freisleben, "A text detection, localization and segmentation system for OCR in images", Proc. IEEE Sixth International Symposium on Multimedia Software Engineering, 2004.
[8] L. Agnihotri, N. Dimitrova and M. Soletic, "Multi-layered videotext extraction method", IEEE International Conference on Multimedia and Expo (ICME), Lausanne, Switzerland, August 26-29, 2002.
[9] L. Agnihotri and N. Dimitrova, "Text detection for video analysis", Workshop on Content-Based Access to Image and Video Libraries, in conjunction with CVPR, Colorado, June 1999.
[10] R. Lienhart and F. Stuber, "Automatic text recognition in digital videos", Proc. SPIE Image and Video Processing IV, Vol. 2666, pp. 180-188, 1996.
[11] X.-S. Hua, X.-R. Chen et al., "Automatic location of text in video frames", Intl. Workshop on Multimedia Information Retrieval (MIR 2001, in conjunction with ACM Multimedia 2001), 2001.
[12] V. Wu, R. Manmatha, and E. M. Riseman, "TextFinder: an automatic system to detect and recognize text in images", IEEE Trans. PAMI, Vol. 21, No. 11, pp. 1224-1229, Nov. 1999.
[13] X. Gao and X.
Tang, "Automatic news video caption extraction and recognition", Proc. Intelligent Data Engineering and Automated Learning (IDEAL 2000), pp. 425-430, Hong Kong, Dec. 2000.
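As a closing illustration of the segmentation step (Section III.3), a minimal two-cluster fuzzy C-means over the per-pixel [std, entropy] vectors could look like the sketch below. The fuzzifier m = 2, the iteration count and the deterministic center initialization are assumptions of this sketch, since the paper does not specify these parameters.

```python
import numpy as np

def fuzzy_cmeans_2(features, n_iter=50, m=2.0):
    """Two-cluster fuzzy C-means over per-pixel feature vectors.
    Assumed parameters (not given in the paper): fuzzifier m = 2,
    50 iterations, centers seeded on the extreme points."""
    x = np.asarray(features, dtype=float)
    # deterministic initialization: the points with the smallest and
    # largest norm seed the two cluster centers
    norms = np.linalg.norm(x, axis=1)
    centers = np.stack([x[norms.argmin()], x[norms.argmax()]])
    p = 2.0 / (m - 1.0)
    for _ in range(n_iter):
        # distances of every pixel to both centers (clamped to avoid /0)
        d = np.maximum(np.linalg.norm(x[None] - centers[:, None], axis=2), 1e-12)
        inv = d ** -p
        u = inv / inv.sum(axis=0)          # fuzzy memberships, sum to 1 per pixel
        w = u ** m
        centers = (w @ x) / w.sum(axis=1, keepdims=True)
    return u.argmax(axis=0)                # hard label (0 or 1) per pixel
```

The pixels of the cluster identified as text are then marked black to produce the binary image passed to the OCR module.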