Information Extraction from Signboards


By Anil Kumar Meena (2014CS10210)
Under Prof. M. Balakrishnan
Guided by Anupam Sobti and Rajesh Kedia

A report submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology

Computer Science Dept., IIT Delhi, India

Information Extraction from Signboards: Arrow Detection and OCR

Abstract

This is the age of information. The future is here, but unfortunately it does not reach everyone at the same pace. While cloud services provide scalable, state-of-the-art solutions to vision problems, they are not accessible to everyone. Even today, less than 30% of the Indian population has access to the internet, and even that is a highly generous estimate once double counting is accounted for. This project is an attempt to OCR signboards locally on a Raspberry Pi and provide results comparable to those of popular cloud services.

Acknowledgement

I am using this opportunity to express my gratitude to Prof. M. Balakrishnan and everyone else in the MAVI team who supported me throughout the course of this project. I am thankful for their inspiring guidance, invaluable constructive criticism and friendly advice during the project work. I am sincerely grateful to them for sharing their truthful and illuminating views on a number of issues related to the project. I express my warm thanks to Mr. Anupam Sobti and Mr. Chetan Arora for their guidance.

Thank you,
Anil Kumar Meena

Contents

1. Introduction
   1.1. Signboard detection(prior work)
   1.2. Perspective transform
   1.3. Rescaling
2. Text Extraction
   2.1. Binarization
   2.2. Tesseract parameters
   2.3. Spell fixer
   2.4. Results
3. Arrow Detection
   3.1. Procedure
   3.2. Results
4. Time Analysis
   4.1. Arrow detection
   4.2. Text extraction
5. Final thoughts

1. Introduction

Information extraction from signboards consists of two separate components: directional information from arrows and textual data from the text on the board. This project handles the two separately after some initial pre-processing of the image, which consists of:

1. Signboard detection
2. Perspective transform
3. Rescaling

1.1. Signboard detection(prior work)

My predecessors on this project built a novel method for detecting signboards in images: it detects white and blue regions and then finds the smallest bounding rectangle. I will not go deeper into its working, as that is beyond the scope of this report. The results are shown in Table 1.

Type            Number  Percentage
True Positive           %
True Negative           %
False Positive  4       1.6%
False Negative          %

Table 1: Signboards detection results

1.2. Perspective transform

Since the viewer can be looking at the signboard from different angles, the images sent to Tesseract can be rotated, which often leads to incorrect results. So I modified the signboard (SB) detection algorithm to account for such cases and warp the image before exporting it. See Figure 1 for reference.

1.3. Rescaling

Tesseract is known to perform poorly on low-DPI images, so as a rule of thumb it is generally better to upscale images before passing them on to the OCR.
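The warp and upscale steps can be illustrated with a small pure-Python sketch. It is not the project's code: it assumes the four signboard corners are already known and ordered top-left, top-right, bottom-right, bottom-left, and the helper names (`target_rectangle`, `homography`, `warp_point`) and the `min_height` parameter are hypothetical. In practice OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective` would do the same job on whole images.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def target_rectangle(width: float, height: float, min_height: float) -> List[Point]:
    """Corners of the upright output rectangle, upscaled (never downscaled)
    so that it is at least min_height pixels tall (an assumed rescaling rule)."""
    scale = max(1.0, min_height / height)
    w, h = width * scale, height * scale
    return [(0.0, 0.0), (w, 0.0), (w, h), (0.0, h)]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """3x3 perspective matrix mapping four src corners onto four dst corners."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1), and similarly for v
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h: List[List[float]], p: Point) -> Point:
    """Apply the perspective matrix to a single point."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Example: a skewed signboard quad warped onto an upright, upscaled rectangle.
corners = [(10.0, 20.0), (200.0, 40.0), (190.0, 150.0), (20.0, 130.0)]
rect = target_rectangle(width=180, height=110, min_height=220)
matrix = homography(corners, rect)
```

To resample an actual image one would normally invert the matrix and look up a source pixel for every output pixel, which is what a library warp routine does internally.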

Figure 1: Old bounding box(pink) and new one(red)

2. Text Extraction

Tesseract is arguably the best open source OCR engine out there. However, it was developed at HP between 1985 and 1994, and since it was released as open source in 2005, with Google sponsoring its development from 2006, there has not been much progress. In fact, its original page segmentation and the other image processing algorithms used before passing the image to the core OCR were never made public and are owned by HP to date. This hole, left when a complete piece of software was released as incomplete open source, was never filled completely even after years of open source contributions. This is why Tesseract requires a high level of customization and pre-processing of images to provide readable results.

2.1. Binarization

Tesseract takes single-channel grayscale images, which means we need to binarize our natural images. For binarization, we employ text binarization based on the paper by Kasar[1]. Briefly, the method employs an edge-based connected component approach and automatically determines a threshold for each component. It has several advantages over existing binarization methods. Firstly, the method is applicable to images having text of widely varying degrees of exposure, which global binarization methods usually do not handle. In addition, the method automatically computes the threshold for binarization and the logic for inverting the output from the image data, and does not require any input parameter. Since we do not even have to provide a kernel size, as in the case of local binarization, and all parameters are automatically computed according to each component's individual requirement, this method is particularly useful for our

use case as it allows signboard images taken from varying distances and with varying exposures to be binarized accordingly.

2.2. Tesseract parameters

Although Tesseract is trained for English and Hindi out of the box, it has to work universally, so it carries far too many characters, dictionary bigrams and more. We therefore reduce the overall possibilities for Tesseract so that it provides better results.

First, we reduce the scope of Tesseract to a much smaller set of characters. This can be done through configurations, explicit alterations to the trained data shipped with Tesseract, or custom training. Configurations are a good way to fine-tune Tesseract, but in some cases Tesseract does not give them enough power. Although custom training sounds great, the trained data shipped with Tesseract is very good and works well in our use case once we make the necessary alterations.

Secondly, we use an exhaustive custom dictionary of words and provide it as the dictionary in the Tesseract trained data. Then we set a penalty for non-dictionary words. However, even if the penalty is set to the maximum, the result will not always come from the dictionary: Tesseract only uses the dictionary as a hint. This is why we add a spell fixer after getting results from Tesseract.

As a note to whoever might continue this project, page segmentation is a major problem, as described at the beginning of this section. So we parse the document in sparse text and OSD mode to find as much text as possible rather than forcing a layout. Also, the order of languages is important when calling Tesseract.

    tesseract --tessdata-dir ./ ./testing/bilingual.jpg ./testing/bilingual-enghin -l eng+hin    (1)
    tesseract --tessdata-dir ./ ./testing/bilingual.jpg ./testing/bilingual-enghin -l hin+eng    (2)

Commands 1 and 2 can give entirely different results. Currently we use Hindi as the primary language even when our aim is to get English only, since we get better results that way.
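As a rough illustration of how the options discussed in this section combine on the command line, the sketch below builds a Tesseract invocation in Python. It is an assumption-laden example, not the project's script: the function name and its defaults are hypothetical, the page segmentation mode 12 is Tesseract's "sparse text with OSD" mode, and `tessedit_char_whitelist` and `language_model_penalty_non_dict_word` are Tesseract config variables whose effect depends on the Tesseract version and engine in use. The report does not state which values were used, so none are filled in by default.

```python
import subprocess
from typing import List, Optional, Sequence


def build_tesseract_command(
    image: str,
    output_base: str,
    languages: Sequence[str] = ("hin", "eng"),  # primary language first
    psm: int = 12,                              # sparse text with OSD
    tessdata_dir: Optional[str] = None,
    char_whitelist: Optional[str] = None,       # assumed: restrict characters
    non_dict_penalty: Optional[float] = None,   # assumed: penalise non-dictionary words
) -> List[str]:
    """Return the argument list for a Tesseract call (nothing is executed)."""
    cmd = ["tesseract"]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += [image, output_base, "-l", "+".join(languages)]
    # Tesseract 4 and later spell this option --psm; 3.x releases used -psm.
    cmd += ["--psm", str(psm)]
    if char_whitelist is not None:
        cmd += ["-c", f"tessedit_char_whitelist={char_whitelist}"]
    if non_dict_penalty is not None:
        cmd += ["-c", f"language_model_penalty_non_dict_word={non_dict_penalty}"]
    return cmd


def run_tesseract(cmd: List[str]) -> None:
    """Run the command; requires Tesseract to be installed."""
    subprocess.run(cmd, check=True)


# Command (2) from the text, with Hindi as the primary language.
command = build_tesseract_command(
    "./testing/bilingual.jpg", "./testing/bilingual-enghin", tessdata_dir="./"
)
```

Keeping the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.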

2.3. Spell fixer

After Tesseract finishes, we take the text and run a spell fixing script which uses the Jaro distance to measure the similarity between each word and the words in the dictionary. The dictionary word with maximum similarity replaces the word in the final result.

2.4. Results

In Table 2, we can see the OCR results for 4 types of images, showing the improvements brought by the addition of the perspective transform and the exhaustive dictionary (this includes both the dictionary in Tesseract and the spell fixer).

Type          Complete           w/d                w/p
              English  Hindi     English  Hindi     English  Hindi
Skew(165)     69.28%   53.14%    48.15%   29.92%    57.32%   40.85%
Glare(41)     34.42%   22.56%    25.48%   13.84%    29.31%   18.31%
Shadow(162)   71.22%   53.83%    48.94%   31.54%    65.61%   47.42%
Blur(27)      23.96%   13.35%    20.61%   11.30%    17.30%   8.01%
Full Dataset  70.18%   52.44%    47.85%   30.19%    60.04%   43.81%

Table 2: OCR results: complete, without dictionary (w/d), and without perspective transform (w/p)

3. Arrow Detection

This module deals with detecting arrows on signboards and figuring out their orientations. There are multiple possible types of arrows in our dataset.

3.1. Procedure

Listed below are the steps followed to get the results.

1. Template Matching
2. Thresholding
3. Dilation and erosion
4. Edge detection
5. Hough transform
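Returning to the spell fixer of Section 2.3, a minimal sketch of Jaro-based correction could look like the following. It is an illustration rather than the project's script: the function names are hypothetical, and the optional `min_similarity` cut-off is an assumption added to show how low-confidence words could be rejected instead of replaced.

```python
from typing import Iterable, Optional


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def fix_word(word: str, dictionary: Iterable[str],
             min_similarity: float = 0.0) -> Optional[str]:
    """Return the most similar dictionary word, or None if the best
    similarity falls below min_similarity (or the dictionary is empty)."""
    best_word, best_score = None, -1.0
    for candidate in dictionary:
        score = jaro(word, candidate)
        if score > best_score:
            best_word, best_score = candidate, score
    if best_word is None or best_score < min_similarity:
        return None
    return best_word


# Example: an OCR misread of the last character is mapped back to the dictionary.
station_words = ["DELHI", "EXIT", "PLATFORM"]
corrected = fix_word("DELH1", station_words)
```

With `min_similarity` left at 0.0 every word is replaced by its nearest dictionary entry, which matches the behaviour described in Section 2.3; a higher value trades recall for fewer wrong replacements.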


(a) Match template (b) Erode and dilate (c) Canny edge detect

Figure 2: Steps in arrow detection

Firstly, we match the image against templates with iterative scaling to account for the unknown size of the arrow. This also cuts down the area on which the rest of the processing has to be done. Secondly, we erode and dilate in case the arrow is not a single connected component after binarization. This also helps denoise the image, and since noise can lead to false positives, it is a necessary measure. Thirdly, we run Canny edge detection on the arrow, from which we get Hough lines. Lastly, we check the rho and theta values of these lines against our known values to find the orientation of the arrow.

Type            Number  Percentage
True Positive           %
True Negative   38      95%
False Positive          %
False Negative  2       5%

Table 3: Arrow detection results

(a) Blur (b) Glare

Figure 3: Failure cases
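The denoising effect of the erode and dilate step can be seen in a tiny pure-Python sketch on a binary grid. This is only an illustration under stated assumptions: the 3x3 square kernel and the function names are not from the report, pixels outside the image are treated as background, and a real pipeline would use OpenCV's `cv2.erode` and `cv2.dilate` instead. Erosion followed by dilation removes small specks; joining a broken arrow would need the dilation to come first.

```python
from typing import List

Grid = List[List[int]]

# Offsets of a 3x3 square structuring element (an assumed kernel size).
NEIGHBOURS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _pixel(img: Grid, y: int, x: int) -> int:
    """Pixel value, with everything outside the image treated as background."""
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        return img[y][x]
    return 0


def erode(img: Grid) -> Grid:
    """A pixel stays set only if its whole 3x3 neighbourhood is set."""
    return [[int(all(_pixel(img, y + dy, x + dx) for dy, dx in NEIGHBOURS))
             for x in range(len(img[0]))] for y in range(len(img))]


def dilate(img: Grid) -> Grid:
    """A pixel becomes set if any pixel in its 3x3 neighbourhood is set."""
    return [[int(any(_pixel(img, y + dy, x + dx) for dy, dx in NEIGHBOURS))
             for x in range(len(img[0]))] for y in range(len(img))]


# A solid 3x3 blob plus one isolated noise pixel at row 5, column 5.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = dilate(erode(noisy))  # the blob survives, the noise pixel is gone
```

The same two helpers applied in the opposite order, `erode(dilate(img))`, would close small gaps between nearby components.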

3.2. Results

As shown in Table 3, the procedure works fairly well in most cases, but it often fails on blurry images: after erosion and dilation the figure is a distorted blob rather than an arrow, so the Hough transform finds too many or too few lines. With glare, binarization often ends up marking a bigger connected component of a larger and distorted shape as white.

4. Time Analysis

The results have been calculated on an Intel Core i5-4210U at 1.70 GHz with 8 GB of memory and 16 GB of swap, which remained unused during the entire process. The GPU was disabled during all tests.

4.1. Arrow detection

The time taken in arrow detection is shown in Figure 4. As we can see, although it seems to stay under a second for most cases, for over 368 images it takes more than 2 seconds, which is a considerable amount of time. This is because we have multiple templates and orientations of arrows, and iterative scaling makes the time grow sharply as the input image size increases.

4.2. Text extraction

Text extraction takes time in 4 stages, namely:

1. SB detection
2. Pre-processing
3. OCR
4. Spell fixing

As we can see in Figure 5, SB detection and spell fixing do not contribute significantly to the total time for text extraction. OCR and pre-processing, on the other hand, can take up to 4.5 and 3.5 seconds respectively.

Figure 4: Time taken in arrow detection

Figure 5: Time taken for text

5. Final thoughts

On a good note, as we realized during the open house, even though the accuracy might be low, in real practice we do not have to give a result every time. We can simply set a threshold on the similarity index below which we do not give results, and keep capturing frames.

But there are also shortcomings. For example, iterative scaling for template matching takes over 8 seconds at times, even on laptops. In real applications, where the number of arrow types would only increase, the time taken would go even higher.

[1] T. Kasar, J. Kumar and A. G. Ramakrishnan, Font and background color independent text binarization, IISc Bangalore.
