TEXT DETECTION AND RECOGNITION FROM IMAGES OF NATURAL SCENE


Arti A. Gawade and R. V. Dagade
Department of Computer Engineering, MMCOE, Savitribai Phule Pune University, India

ABSTRACT: Devanagari is one of the most popular scripts in India, and detecting and recognizing Devanagari text in scene images is an extremely challenging task. Text detection here exploits the characteristics of the Devanagari script. Scene images include street signs, shop names, product advertisements, posters on streets, etc.; such images are prone to multiple sources of noise, which makes text detection and segmentation difficult. The proposed system consists of a four-step process: preprocessing, text localization, text detection and text recognition. The system is primarily based on two characteristics of Devanagari text: (i) variation in stroke width among the text components of the script, and (ii) the existence of a headline along with a few vertical downward strokes connecting to it. The proposed approach separates background and text using Otsu's threshold selection method. A scanline method detects the headline of Devanagari text, and adjacency measures are applied to identify the text regions. A methodology to segment the Devanagari words extracted from scene images into characters is also presented, and distance measures are used to recognize the characters. The proposed approach has been simulated on a repository of 500 images taken from roads, and the results are encouraging.

Keywords: Text detection in natural scenes, Text extraction, Segmentation, Text recognition, Scene images, Devanagari.

[1] INTRODUCTION

Detection of text in images of natural scenes has considerable application potential. However, related studies are primarily restricted to English and a few other scripts of developed countries.
Surveys of existing methods for the detection, localization and extraction of text embedded in images of natural scenes are available in the literature. In the Indian context, an image of a natural outdoor scene often contains text in one or more Indian scripts. Devanagari is the most widely used of these scripts, with around 500 million users, so studies on the detection of Devanagari text in scene images are important. Scene images are usually captured by cameras. Compared with images acquired by image scanners, camera images pose harder text-extraction problems, such as uneven lighting, lower resolution, complex backgrounds, and blurred edges. The challenges are as follows:

Size: the range of font-size variation can be wide.

Alignment: scene text is often aligned in many directions and may have geometric distortions.

Scene complexity: natural environments contain numerous man-made objects, such as buildings, symbols and paintings, with structures and appearances similar to text.

Uneven lighting: when capturing images in the wild, uneven lighting is common due to illumination conditions and the uneven response of sensing devices. It introduces color distortion and degrades visual features, and consequently causes false detection, segmentation and recognition results.

Blurring and degradation: with flexible working conditions and focus-free cameras, defocusing and blurring of text images occur.

Aspect ratios: text appears at different aspect ratios, so detection requires a search over location, scale and length, which introduces high computational complexity.

Distortion: perspective distortion occurs when the optical axis of the camera is not perpendicular to the text plane. Text boundaries lose their rectangular shape and characters distort, degrading the performance of recognition models trained on undistorted samples.

Fonts: characters in italic and script fonts may overlap each other, making segmentation difficult.

The proposed algorithm works in the following steps. The image is converted to grayscale by averaging the R, G and B values of the color image, and Gaussian blurring is applied to reduce noise and stray pixels. A threshold is then set to separate the foreground and background of the image; for this we use Otsu's global thresholding. To identify text line segments we use the horizontal scanline technique: each horizontal scan line of the image is processed to identify potential text line segments. A text line segment is a continuous, one-pixel-thick segment on a scan line that contains text pixels. Localization and detection of the text area are then carried out using frequency-count clustering. Localized text is indicated by red and green lines drawn above and below it, and detected text is shown by bounding rectangles. After detecting the text area, we crop that region of interest for the recognition phase. On the cropped images the system again performs thresholding, and smoothing with a median filter to reduce noise. The Stentiford thinning method is then applied to thin the text, and the headline is detected using the scanline method. This detected headline must be removed for proper segmentation of a word into single characters. For recognition, we create a training dataset and generate templates using feature extraction; at the final stage, template matching is performed to recognize the text. Script-specific characteristics are subsequently used to confirm the presence of a headline in candidate text regions. Figure 1.1 shows some examples of text detection in natural scene images.
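The preprocessing and localization steps above (grayscale averaging, Otsu's global threshold, and per-row transition counting on horizontal scan lines) can be sketched as follows. This is a minimal NumPy illustration under our own function names, not the authors' implementation.

```python
import numpy as np

def to_grayscale(rgb):
    """Grayscale by averaging R, G and B, as in the paper: gs = (R + G + B) / 3."""
    return rgb.mean(axis=2)

def otsu_threshold(gray):
    """Otsu's global threshold: pick t maximizing the between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = hist.sum()
    cum_w = np.cumsum(hist)                      # pixel count at intensity <= t
    cum_mean = np.cumsum(hist * np.arange(256))  # intensity mass at <= t
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0, w1 = cum_w[t], total - cum_w[t]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mean[t] / w0
        mu1 = (cum_mean[-1] - cum_mean[t]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2         # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def frequency_counts(binary):
    """Horizontal scanline step: per row, count foreground/background transitions."""
    return np.count_nonzero(binary[:, :-1] != binary[:, 1:], axis=1)

def text_rows(binary, minfcthreshold=4):
    """Rows whose transition count exceeds the threshold are candidate text rows."""
    return np.flatnonzero(frequency_counts(binary) > minfcthreshold)
```

Consecutive candidate rows would then be clustered (the paper additionally requires a minimum cluster height) to obtain the red/green boundary lines and the bounding rectangles.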

Fig. 1.1. Examples of scene text detection: (a) and (c) original images; (b) and (d) detected text shown by red rectangles.

[2] DEVANAGARI SCRIPT CHARACTERISTICS

There are 50 basic characters in the Devanagari alphabet, consisting of consonants, vowels and conjuncts. Two or more consonants, or one vowel and one or two consonants, combine to form compound characters. Most Devanagari characters have a horizontal line at their upper part, called the headline. In continuous text, the characters in a word often get connected through this headline, although in some words not all characters are connected. A Devanagari text line has three distinct horizontal zones, shown in Fig. 2.1: the portion above the headline is the upper zone; the portion below it, but above an imaginary line called the base line, is the middle zone; and the part below the base line is the lower zone. Devanagari is written from left to right and has no upper- or lower-case letters.

Fig. 2.1. Three zones of Devanagari text.

[3] LITERATURE SURVEY

A survey of traditional methods for the detection, localization and extraction of text in images of natural scenes can be found in [8]. Two well-known categories of existing methods are connected component (CC) based and texture based algorithms. A CC-based method first segments an image into a set of CCs and then classifies each CC as text or non-text. CC-based algorithms are simple, but often fail to be robust. Texture-based methods are based on the assumption that text in images has dissimilar textural properties

Sr. No. / Author's Name / Title of Paper / Data Set Used / Methods / Results (%)

1. Prakriti Banik, Ujjwal Bhattacharya, Swapan K. Parui, "Segmentation of Bangla Words in Scene Images" (December 16-19, 2012). Dataset: 260 scene images, 2460 word images, 507 numerals. Methods: K-means clustering and Otsu's threshold selection. Results: 2282 words (92.8%), 11430 characters (92.33%), 405 numerals (96.64%).

2. Roy Chowdhury, Bhattacharya and Parui, "Text Detection of Two Major Indian Scripts in Natural Scene Images" (2011). Dataset: 120 images taken from Indian roads. Methods: Euclidean distance transform and probabilistic Hough line transform. Results: recall (r) = 0.74, precision (p) = 0.72.

3. Bhattacharya, U., Parui, S. K., Mondal, S., "Devanagari and Bangla Text Extraction from Natural Scene Images" (2009). Dataset: 100 test images acquired by camera. Methods: morphological operations; connected component method. Results: precision (p) = 0.69, recall (r) = 0.71.

4. Epshtein, B., Ofek, E., Wexler, Y., "Detecting Text in Natural Scenes with Stroke Width Transform" (31 October, 2011). Dataset: ICDAR dataset with 258 training images and 251 test images. Methods: Stroke Width Transform. Results: precision (p) = 0.59, recall (r) = 0.64.

5. Vipin Narang, Sujoy Roy, O. V. R. Murthy, "Devanagari Character Recognition in Scene Images" (2013). Dataset: machine-printed or handwritten Devanagari characters. Methods: part-based model. Results: dataset DSHND-30K = 42.33%, dataset DSMP-28K = 56.10%.

6. Sezer Karaoglu, Basura Fernando, Alain Trémeau, "A Novel Algorithm for Text Detection and Localization in Natural Scene Images" (2013). Dataset: ICDAR 2003 test dataset with 249 images of various resolutions, taken both indoors and outdoors. Methods: morphological operations; Random Forest classifier; merging algorithm for further processing. Results: recall (r) = 0.90, precision (p) = 0.94.

compared to the other non-text regions. A few authors have studied combinations of these two categories of methods. Among early works, Zhong et al. [12] detected text in images of compact discs, book covers and traffic scenes in two steps: rough locations of text lines were obtained first, and then the text components in those lines were extracted using color segmentation. Wu et al. [13] proposed a texture segmentation method to generate candidate text regions: a set of feature components is computed for each pixel and clustered using the K-means algorithm. Jung et al. [14] employed a multi-layer perceptron classifier to distinguish text from non-text pixels; a sliding window scans the whole image and serves as input to a neural network, producing a probability map whose high-probability areas are taken as candidate text regions. In [15], Li et al. extracted features from a wavelet decomposition of the grayscale image and used a neural network classifier to label small windows as text or non-text. Gllavata et al. [16] considered wavelet-transform-based texture analysis for text detection, using the K-means algorithm to cluster text and non-text regions. Saoi et al. [17] used a similar but enhanced method for detecting text in scene images, applying the wavelet transform separately to the R, G and B channels of the input color image. Ezaki, Bulacu and Schomaker [18] studied morphological operations for recognizing connected text components in images: a disk filter computes the difference between the closing and the opening of the image, and the filtered images are binarized to extract connected components. A mathematical-morphology-based algorithm is thus used to extract text from scene images. The authors of [19] worked on a modified morphological filter to improve extraction accuracy; in the absence of a single suitable threshold value, it divides input images into different clusters based on the size of the text.
In [3], a novel part-based method is proposed for recognizing Devanagari characters. It is computationally demanding, particularly in the K-means clustering stage, unlike nearest-neighbor or SVM classifiers, where the class prediction for a test character requires only a few comparisons with the class centers or the support vectors. In [7], a scene text detection algorithm based on two machine-learning classifiers is described: one generates candidate word regions and the other filters out non-text parts of scene images. Connected components (CCs) are extracted using the maximally stable extremal region algorithm and clustered to generate candidate regions. An AdaBoost classifier trained on pairwise relations then determines the adjacency relationships used to cluster the CCs, after which candidate word regions are normalized and each region is classified as text or non-text. Several other methods for text extraction from real scenes have been proposed; methods based on adaptive binarization tolerate shading in images [10], but are not suitable for images with complex backgrounds. Detection of text in natural scene images has considerable application potential, yet related studies are primarily restricted to English and a few other scripts of developed countries. Two surveys of existing methods for the detection, localization and extraction of text from natural scene images can be found in [8] and [14]. In the Indian context, Devanagari is the most popular script, so studies on the detection of Devanagari text in scene images are important. In a recent study, Bhattacharya et al.
[4] proposed a method based on morphological operations for the extraction of text from scene images.

Table 3.1. Analysis of various text detection methods and their results.
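To make the surveyed CC-based pipeline concrete, the closing-minus-opening filter in the style of Ezaki et al. [18], followed by connected-component labelling, can be sketched as below. This is an illustrative NumPy reimplementation under our own names, not code from any of the cited papers.

```python
import numpy as np

def _morph(img, k, op):
    """Grayscale erosion (op=np.min) or dilation (op=np.max) with a k x k window."""
    pad = k // 2
    p = np.pad(img, pad, mode='edge')
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = op(p[y:y + k, x:x + k])
    return out

def closing_minus_opening(gray, k=3):
    """Closing keeps thin bright strokes while opening removes them, so the
    difference responds strongly on character-like structures."""
    closed = _morph(_morph(gray, k, np.max), k, np.min)   # dilate then erode
    opened = _morph(_morph(gray, k, np.min), k, np.max)   # erode then dilate
    return closed - opened

def connected_components(binary):
    """8-connected component labelling by flood fill; returns label map and count."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    n = 0
    for sy in range(h):
        for sx in range(w):
            if binary[sy, sx] and labels[sy, sx] == 0:
                n += 1
                stack = [(sy, sx)]
                labels[sy, sx] = n
                while stack:
                    y, x = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny, nx] and labels[ny, nx] == 0):
                                labels[ny, nx] = n
                                stack.append((ny, nx))
    return labels, n
```

Binarizing the filter response and labelling it yields the candidate CCs that a subsequent text/non-text classifier would accept or reject.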

Based on the literature survey, there are various methods for Devanagari text detection in natural scene images, with different results on different datasets. The proposed system uses the scanline method and distance measures for text detection and recognition.

[4] PROPOSED SYSTEM

Fig. 4.1. System architecture.

A. Preprocessing of Images

1. Grayscale conversion: The input to the system is a 24-bit color scene image, with separate 8-bit values for each of the red, green and blue channels. Since working directly on color images is difficult, we first convert the image to grayscale by combining the red, green and blue values of each pixel into a single gray value.

Algorithm steps:
1. Get the red, green and blue values of a pixel.
2. Combine them into a single gray value: gs = (Red + Green + Blue) / 3.
3. Replace the original red, green and blue values with the new gray value.
4. Repaint the image.

2. Thresholding: Thresholding is an image processing technique for converting a grayscale or color image to a binary image based on a threshold value. If a pixel's intensity is less than the threshold, the corresponding pixel in the resulting image is set to black; otherwise it is set to white.

Algorithm steps:
1. Initialize fgcount = 0 and bgcount = 0.
2. Take the grayscale image as input to the algorithm.
3. Scan the grayscale image horizontally and vertically, i.e. over the height and width of the image.

4. Get the value of the pixel, p[y][x].
5. Set the threshold value Th to 128.
6. If gs < Th, increment the background count (bgcount++) and set that pixel to 0 (p2[y][x] = 0).
7. Else, increment the foreground count (fgcount++) and set that pixel to 1 (p2[y][x] = 1).
8. End.
9. The output is the binarized image.

B. Text Localization and Detection

The objective of text localization is to localize text components precisely and to group them into candidate text regions with as little background as possible. The input to this step may be a complex image containing various non-text objects. The task of text localization is to locate and circumscribe text occurrences in all kinds of multimedia data with tight rectangular boxes; each so-called text bounding box should circumscribe only a single text line. The proposed method uses the horizontal scanline and frequency-count clustering to localize and detect text areas in scene images. First, each horizontal scan line of the image is processed to identify potential text line segments. A text line segment is a continuous, one-pixel-thick segment on a scan line that contains text pixels. Typically a text segment cuts across a character string and contains interleaving groups of text pixels and background pixels; its end points should lie just outside the first and last characters of the string.

Algorithm steps:
1. Initialize minfcthreshold = 4 and minheight = 20.
2. Allocate the frequency-count array: fc[] = new int[h].
3. For y from 0 to h - 1 (for (int y = 0; y < h; y++)), set fc[y] = 0.
4. Scan rightwards from the middle of the image (for (int x = w / 2; x < w - 100; x++)): if inpixels[y][x] != inpixels[y][x + 1], increment the frequency count, fc[y]++.
5. Scan leftwards from the middle (for (int x = w / 2; x > 100; x--)): if inpixels[y][x] != inpixels[y][x + 1], increment the frequency count, fc[y]++.

Detecting clusters:
1. Initialize starty = -1 and endy = -1.
2. If fc[y] > minfcthreshold and starty == -1, assign starty = y and endy = y.
3. If starty != -1 and (endy - starty) > minheight, add a rectangle: set the color to red and draw a line (0, starty - 1, w, starty - 1); set the color to green and draw a line (0, endy + 1, w, endy + 1); then reset starty = endy = -1.

Filtering the ROI from the image:
1. To select the ROI, check whether rectangles.size() == 0; if so, show the error message "No Text Localized To Auto Crop!" and set chop = false.

2. Else, to filter rectangles, check whether mr.midy() < 100 or mr.midy() > h - 100; if so, set chop = true.
3. The rectangle is removed.

Scanning the image to get height and width and segmenting it into sub-images:
1. Initialize index = 0.
2. For each rectangle (for (MyRectangle mr : rectangles)), three scans are required: 0 - center, 1 - mid left, 2 - mid right.
3. Search for the value of startX to detect the boundary: initialize int startx = 0; scan from startx = w/2 (center), from startx = w/4 (mid left), and from startx = 3*w/4 (mid right).
4. Detect the first black line on the left.
5. Detect the first black line on the right.
6. Detect the black line on top.
7. Detect the black line on the bottom.
8. Using these boundary values, select the rectangle.
9. If the rectangle is not already present, verify that it has enough width and height, then crop it and save the sub-image.

C. Text Recognition

1. Thinning: To thin the Devanagari text we use the Stentiford algorithm, which can be stated as follows:
1. Find a pixel location where the pixels in the image match those in a template. With this template, all pixels along the top of the image are removed, moving from left to right and from top to bottom. If the central pixel is not an endpoint and has connectivity number 1, mark it for deletion.

Fig. 4.2. Templates for matching.

2. Endpoint pixel: a pixel is an endpoint if it is connected to just one other pixel, i.e. a black pixel with only one black neighbor among its eight possible neighbors.
3. Connectivity number: a measure of how many objects are connected to a particular pixel. It is computed as Cn = Σ_{k∈S} (Nk - Nk · Nk+1 · Nk+2), with S = {1, 3, 5, 7}, where Nk is the value of the k-th of the eight neighbors of the analyzed pixel: N0 is the center pixel, N1 is the pixel to the right of the center, and the rest are numbered counterclockwise around the center.
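The endpoint test and the connectivity number with S = {1, 3, 5, 7} used by the thinning step can be sketched as follows. This is a Python illustration of the standard Yokoi-style formula; the helper names are ours, not the authors' code.

```python
import numpy as np

# Neighbour order: N1 is the pixel to the right of the centre, then
# counterclockwise (N1..N8); indices wrap around modulo 8.
_OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def neighbours(img, y, x):
    """Binary values N1..N8 of the 8-neighbourhood; out-of-image pixels count as 0."""
    h, w = img.shape
    vals = []
    for dy, dx in _OFFSETS:
        ny, nx = y + dy, x + dx
        vals.append(int(img[ny, nx]) if 0 <= ny < h and 0 <= nx < w else 0)
    return vals

def connectivity_number(img, y, x):
    """Cn = sum over k in {1,3,5,7} of (Nk - Nk*N(k+1)*N(k+2)): the number of
    distinct objects the centre pixel connects."""
    n = neighbours(img, y, x)
    total = 0
    for k in (1, 3, 5, 7):
        a, b, c = n[k - 1], n[k % 8], n[(k + 1) % 8]
        total += a - a * b * c
    return total

def is_endpoint(img, y, x):
    """A black pixel with exactly one black neighbour is an endpoint."""
    return sum(neighbours(img, y, x)) == 1
```

A thinning pass would mark a foreground pixel for deletion only when it matches the current template, is not an endpoint, and has connectivity number 1.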

2. Headline removal and segmentation: This function detects and removes the headline from a detected word so that the word can be segmented into characters. The headline is detected with the shirorekha detection algorithm, using the scanline technique. For segmentation, the image is scanned after headline removal, connected white pixels are clustered, and white space is checked for: whenever a white space occurs, a rectangle is drawn around the character. The features of each character are then extracted and stored in a template.

3. Template matching: Here we generate templates and perform template matching for recognition. Template matching, or matrix matching, is one of the most common classification methods; individual image pixels are used as features. Classification is performed by comparing the input character with a set of templates from each character class. Each comparison yields a similarity count between the input character and the template: the count increases when a pixel in the observed character is identical to the corresponding pixel in the template image, and decreases when the pixels differ. All templates are compared with the input character image in this way, and the character is recognized as the class whose template has the maximum similarity count.

Algorithm steps:
1. Initialize the variable count = 0.
2. Select the first word segment, which is the output of the previous step.
3. If no rectangle is found, display the message "no text to recognize".
4. If a rectangle is found, compare the generated template p[y][x] with a template t[y][x] stored in the training dataset.
5. If the pixel values match (p[y][x] == t[y][x]), increment count by 1.
6. Else, decrement count by 1.
7. Store the count value in the result array.
8. Take the next values of p[y][x] and t[y][x].
9. Repeat steps 1 to 7 to match all templates of the training dataset.
10. Compare the values in the result array to select the maximum count.
11. Repeat this for all word segments.
12. Select the template with the largest count value and show the recognized output.

[5] RESULTS AND DATASETS

The results of the system are based on the rate of character detection and recognition from natural scene images, expressed in terms of precision and recall. We summarize the results of our simulation on 500 sample images using two quantities: precision (p), the number of correctly detected Devanagari words divided by the total number of detections, and recall (r), the number of correctly detected Devanagari words divided by the

total number of Devanagari words in the sample images.

Fig. 5.1. Graph of results.

Figure 5.1 shows the graph of results. Class 1 and Class 2 are two datasets of scene images consisting of society name plates, roadside signboards, etc. For Class 1, precision is 0.944 and recall is 0.85; for Class 2, precision is 0.958 and recall is 0.92. The overall precision and recall are 0.9513 and 0.885 respectively.

Datasets: For the Devanagari text detection and recognition system we use camera-captured images with a resolution of 600x480. We maintain a dataset of 500 images that includes roadside boards, signboards, society name plates, direction boards, etc.; these boards contain printed text. The images also exhibit effects such as shadow, sunlight and embossing, and may contain logos, arrows and other creative designs.

Figure 5.2: Sample dataset.

[6] CONCLUSION

In this project we proposed a new approach for detecting and recognizing Devanagari text in natural scene images using horizontal and vertical scanlines, frequency-count clustering and distance measures. The proposed system uses several preprocessing techniques; the Stentiford algorithm thins the image, and segmentation is used to generate the templates from which image features are extracted. Finally, template matching with distance measures is used for recognition. We obtained encouraging results with the proposed system.
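As a final illustration, the similarity-count template matching described in the recognition step and the precision/recall computation of the results section can be sketched as below. The labels and arrays are hypothetical examples, not the system's actual templates.

```python
import numpy as np

def similarity_count(char_img, template):
    """Pixel-wise vote: +1 where the binarized character matches the template,
    -1 where it differs, as in the template-matching algorithm."""
    assert char_img.shape == template.shape
    matches = int(np.count_nonzero(char_img == template))
    return matches - (char_img.size - matches)

def recognize(char_img, templates):
    """Return the label whose template has the maximum similarity count."""
    return max(templates, key=lambda label: similarity_count(char_img, templates[label]))

def precision_recall(correct, detections, total_words):
    """Precision = correct / total detections; recall = correct / ground-truth words."""
    return correct / detections, correct / total_words
```

For example, with correct = 90 detected words out of 100 detections and 120 ground-truth words, precision is 0.9 and recall is 0.75.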

REFERENCES

[1] A. R. Chowdhury, U. Bhattacharya, and S. K. Parui, "Text detection of two major Indian scripts in natural scene images," Proc. of CBDAR 2011, pp. 73-78, 2011.
[2] P. Banik, U. Bhattacharya, and S. K. Parui, "Segmentation of Bangla words in scene images," Proc. ICVGIP '12, December 16-19, 2012, Mumbai, India.
[3] V. Narang, S. Roy, and O. V. R. Murthy, "Devanagari character recognition in scene images," Proc. 12th International Conference on Document Analysis and Recognition, IEEE, pp. 902-906, 2013.
[4] U. Bhattacharya, S. K. Parui, and S. Mondal, "Devanagari and Bangla text extraction from natural scene images," Proc. of Int. Conf. on Document Analysis and Recognition, pp. 171-175, 2009.
[11] S. Kumar and A. Perrault, "Text detection on Nokia N900 using stroke width transform," available at http://www.cs.cornell.edu/courses/cs4670/2010fa/projects/final/results/group of arp86 sk2357/writeup.pdf (last accessed 31 October, 2011).
[12] Y. Zhong, K. Karu, and A. K. Jain, "Locating text in complex color images," 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 146-149, 1995.
[13] V. Wu, R. Manmatha, and E. M. Riseman, "TextFinder: an automatic system to detect and recognize text in images," IEEE Transactions on PAMI, vol. 21, pp. 1224-1228, 1999.
[14] K. Jung, K. I. Kim, T. Kurata, M. Kourogi, and J. H. Han, "Text scanner with text detection technology on image sequences," Proc. 16th International Conference on Pattern Recognition (ICPR), vol. 3, pp. 473-476, 2002.
[15] H. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video," IEEE Trans. Image Processing, vol. 9, no. 1, pp. 147-167, 2000.
[16] J. Gllavata, R. Ewerth, and B. Freisleben, "Text detection in images based on unsupervised classification of high frequency wavelet coefficients," Proc. 17th Int. Conf. on Pattern Recognition (ICPR), vol. 1, pp. 425-428, 2004.
[17] T. Saoi, H. Goto, and H. Kobayashi, "Text detection in color scene images based on unsupervised clustering of multichannel wavelet features," Proc. 8th Int. Conf. on Document Analysis and Recognition (ICDAR), pp. 690-694, 2005.
[18] N. Ezaki, M. Bulacu, and L. Schomaker, "Text detection from natural scene images: towards a system for visually impaired persons," Proc. 17th Int. Conf. on Pattern Recognition, vol. II, pp. 683-686, 2004.
[19] M. Shorif Uddin, M. Sultana, T. Rahman, and U. S. Busra, "Extraction of texts from a scene image," Proc. ICIEV 2012, IEEE, 2012.