
Chapter 2

OCR System: A Literature Survey

2.1 History of machine recognition of scripts

The overwhelming volume of paper-based data in corporations and offices challenges their ability to manage documents and records. Computers, working faster and more efficiently than human operators, can be used to perform many of the tasks required for efficient document and content management. Computers understand alphanumeric characters as ASCII codes typed on a keyboard, where each character or letter represents a recognizable code. However, computers cannot distinguish characters and words in scanned images of paper documents. Therefore, where alphanumeric information must be retrieved from scanned images such as commercial or government documents, tax returns, passport applications and credit card applications, the characters must first be converted to their ASCII equivalents before they can be recognized as readable text. An optical character recognition (OCR) system allows us to convert a document into electronic text, which can then be edited and searched. It is performed off-line after the writing or printing has been completed, as opposed to on-line recognition, where the computer recognizes the characters as they are written. For these systems to effectively recognize hand-printed or machine-printed forms, individual characters must be well separated. This is the reason why most typical administrative forms require people to enter data into neatly spaced boxes and force spaces between letters entered on a form. Without the use of these boxes, conventional technologies reject fields if people do not follow the structure when filling out forms, resulting in a significant overhead in administration cost. Optical character recognition for English has become one of the most successful applications of technology in pattern recognition and artificial intelligence. OCR is the machine replication of human reading and has been the subject of intensive research for more than five decades.

To understand the evolution of OCR systems and their challenges, and to appreciate the present state of OCRs, a brief historical survey is in order. Depending on their versatility, robustness and efficiency, commercial OCR systems may be divided into the following four generations [Line, 1993; Pal & Chaudhuri, 2004]. It is to be noted that this categorization refers specifically to OCRs for the English language.

First generation OCR systems

Character recognition originated as early as 1870, when Carey invented the retina scanner, an image transmission system using photocells; it was later used as an aid to the visually handicapped by the Russian scientist Tyurin in 1900. However, the first generation machines appeared in the beginning of the 1960s with the development of digital computers. This was the first time OCR was realized as a data processing application for the business world [Mantas, 1986]. The first generation machines are characterized by the constrained letter shapes which the OCRs can read. These symbols were specially designed for machine reading, and they did not even look natural. The first commercialized OCR of this generation was the IBM 1418, which was designed to read a special IBM font, 407. The recognition method was template matching, which compares the character image with a library of prototype images for each character of each font.

Second generation OCR systems

The next generation of machines was able to recognize regular machine-printed and hand-printed characters. The character set was limited to numerals and a few letters and symbols. Such machines appeared from the middle of the 1960s to the early 1970s. The first automatic letter-sorting machine for postal code numbers, from Toshiba, was developed during this period. The methods were based on the structural analysis approach. Significant efforts at standardization were also made in this period. An American standard OCR character set, the OCR-A font (Figure 2.1), was defined; it was designed to facilitate optical recognition while still being readable to humans. A European font, OCR-B (Figure 2.2), was also designed.

Figure 2.1 OCR-A font

Figure 2.2 OCR-B font

Third generation OCR systems

For the third generation of OCR systems, the challenges were documents of poor quality and large printed and hand-written character sets. Low cost and high performance were also important objectives. Commercial OCR systems with such capabilities appeared during the decade 1975 to 1985.

OCRs Today (Fourth generation OCR systems)

The fourth generation can be characterized by the OCR of complex documents intermixing text, graphics, tables and mathematical symbols, unconstrained handwritten characters, color documents, low-quality noisy documents, etc. Among the commercial products, postal address readers and reading aids for the blind are available in the market. Nowadays, there is much motivation to provide computerized document analysis systems. OCR contributes to this progress by providing techniques to convert large volumes of data automatically. A large number of papers and patents advertise recognition rates as high as 99.99%; this gives the impression that automation problems have been solved. Although OCR is widely used at present, its accuracy today is still far from that of a seven-year-old child, let alone a moderately skilled typist [Nagy, Nartker & Rice, 2000].

Failures of some real applications show that performance problems still exist on composite and degraded documents (i.e., noisy characters, tilt, mixing of fonts, etc.) and that there is still room for progress. Various methods have been proposed to increase the accuracy of optical character recognizers. In fact, at various research laboratories, the challenge is to develop robust methods that remove as much as possible the typographical and noise restrictions while maintaining rates similar to those provided by limited-font commercial machines [Belaid, 1997]. Thus, current active research areas in OCR include handwriting recognition, and also the printed, typewritten versions of non-Roman scripts (especially those with a very large number of characters).

2.2 Components of an OCR System

Before we present a survey of the various approaches used in the literature for recognizing fonts and characters, a brief introduction to the general OCR techniques is given. The objective of OCR software is to recognize the text and then convert it to an editable form. Thus, developing computer algorithms to identify the characters in the text is the principal task of OCR. A document is first scanned by an optical scanner, which produces an image of it that is not editable. Optical character recognition involves translation of this text image into editable character codes such as ASCII. Any OCR implementation consists of a number of preprocessing steps followed by the actual recognition, as shown in Figure 2.3.

Figure 2.3 OCR process
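
As a rough illustration of this structure, the sketch below strings the stages together as placeholder functions. The stage boundaries follow Figure 2.3, but the function names and their trivial bodies are assumptions made purely for illustration, not an implementation from any system discussed here.

```python
# Illustrative skeleton of the OCR pipeline in Figure 2.3; every stage is a placeholder.
import numpy as np

def binarize(gray):                  # gray-scale page -> binary image (foreground = 1)
    return (gray < gray.mean()).astype(np.uint8)

def correct_skew(binary):            # placeholder: estimate and remove tilt
    return binary

def segment_characters(binary):      # placeholder: yield one sub-image per character
    return [binary]

def recognize_character(char_img):   # placeholder: map a character image to a symbol
    return "?"

def ocr(gray_page):
    """Preprocessing stages followed by recognition."""
    binary = correct_skew(binarize(gray_page))
    return "".join(recognize_character(c) for c in segment_characters(binary))
```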

The number and types of preprocessing algorithms employed on the scanned image depend on many factors, such as the age of the document, paper quality, resolution of the scanned image, amount of skew in the image, format and layout of the images and text, the kind of script used, and also the type of characters: printed or handwritten [Anbumani & Subramanian, 2000]. After preprocessing, the recognition stage identifies individual characters and converts them into editable text. Figure 2.4 depicts these steps, and they are described in the following section.

Figure 2.4 Steps in an OCR

Preprocessing

Typical preprocessing includes the following stages: binarization, noise removal, skew detection and correction, line segmentation, word segmentation, character segmentation, and thinning.

Binarization

A printed document is first scanned and converted into a gray scale image. Binarization is a technique by which gray scale images are converted to binary images. In any image analysis or enhancement problem, it is essential to separate the objects of interest from the rest. In a document image this usually involves separating the pixels forming the printed text or diagrams (foreground) from the pixels representing the blank paper (background). The goal is to remove only the background, by setting it to white, and to leave the foreground unchanged. Thus, binarization separates the foreground and background information. This separation of text from background is a prerequisite to subsequent operations such as segmentation and labeling [Bunke & Wang, 1997]. Figure 2.5 shows a gray image (a) and a binary image (b) of a newspaper document.

Figure 2.5 Document image binarization: (a) gray image, (b) binary image

The most common method for binarization is to select a proper intensity threshold for the image, convert all intensity values above the threshold to one intensity value, and convert all intensity values below the threshold to the other chosen intensity. Thresholding (binarization) methods can be classified into two categories: global and adaptive thresholding. Global methods apply one threshold value to the entire image. Local or adaptive thresholding methods apply different threshold values to different regions of the image [Wu & Amin, 2003]; the threshold value is determined by the neighborhood of the pixel to which the thresholding is being applied. Among the global techniques, Otsu's thresholding technique [Otsu, 1979] has been cited as an efficient and frequently used technique [Leedham et al., 2002]. Niblack's method [Niblack, 1986] and Sauvola's method [Sauvola & Pietikäinen, 2000] are among the most well known approaches for adaptive thresholding.
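
To make the global/adaptive distinction concrete, the sketch below applies Otsu's global threshold and Sauvola's local threshold to the same gray-scale page. It is only a minimal illustration: it assumes scikit-image is available, and the window size, the k parameter and the file name are placeholders.

```python
# Minimal sketch of global (Otsu) versus adaptive (Sauvola) binarization,
# assuming scikit-image is installed; parameter values are illustrative only.
from skimage import io
from skimage.filters import threshold_otsu, threshold_sauvola

gray = io.imread("page.png", as_gray=True)            # hypothetical scanned page

# Global thresholding: a single threshold for the whole image.
binary_global = gray > threshold_otsu(gray)           # True = background (white)

# Adaptive thresholding: one threshold per pixel, computed from its neighborhood.
local_t = threshold_sauvola(gray, window_size=25, k=0.2)
binary_local = gray > local_t
```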

Noise Removal

Scanned documents often contain noise that arises due to the printer, scanner, print quality, age of the document, etc. Therefore, it is necessary to filter this noise before we process the image. A commonly used approach is to pass the image through a low-pass filter and use the result for later processing. The objective in the design of a noise filter is that it should remove as much of the noise as possible while retaining the entire signal [Kasturi et al., 2002].

Skew detection and correction

When a document is fed to the scanner, either mechanically or by a human operator, a few degrees of tilt (skew) is unavoidable. The skew angle is the angle that the lines of text in the digital image make with the horizontal direction. Figure 2.6 shows an image with skew.

Figure 2.6 An image with skew

A number of methods have been proposed in the literature for identifying document image skew angles. They can mainly be categorized into the following groups [Lu & Tan, 2003]: (i) methods based on projection profile analysis, (ii) methods based on nearest-neighbor clustering, (iii) methods based on the Hough transform, (iv) methods based on cross-correlation, and (v) methods based on morphological transforms. Surveys of different skew correction techniques can be found in [Hull, 1998; Chaudhuri & Pal, 1997].

Most existing approaches use the Hough transform or enhanced versions of it [Srihari & Govindaraj, 1989; Pal & Chaudhuri, 1996; Amin & Fischer, 2000]. The Hough transform detects straight lines in an image. The algorithm transforms each of the edge pixels in the image space into a curve in a parametric space. The peak in the Hough space represents the dominant line and hence its skew. The major drawback of this method is that it is computationally expensive, and it is difficult to choose a peak in the Hough space when the text becomes sparse [Shivakumara et al., 2003].

Approaches based on correlation [Yan, 1993] use the cross-correlation between vertical lines of the image taken a fixed distance apart. They are based on the observation that the correlation is maximized when one line is shifted relative to the other such that the character baseline levels of the two lines coincide; this shift reflects the skew of the document. The algorithm can be seen as a special case of the Hough transform. It is accurate for skew angles less than ±10 degrees.

Postl [Postl, 1986] proposed a method in which the horizontal projection profile is calculated at a number of different angles. A projection profile is a one-dimensional array with a number of locations equal to the number of rows in an image. Each location in the projection profile stores a count of the number of black pixels in the corresponding row of the image. This histogram has the maximum amplitude and frequency when the text in the image is skewed at zero degrees, since the number of co-linear black pixels is maximized in this condition [Hull, 1998]. To determine the skew angle of a document, the projection profile is computed at a number of angles, and for each angle the difference between peak and trough heights is measured. The maximum difference corresponds to the best alignment with the text line direction. This, in turn, determines the skew angle. The image is then rotated according to the detected skew angle.
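
A minimal sketch of the projection-profile idea is given below: the page is rotated over a range of candidate angles and the angle whose horizontal projection is most sharply peaked is kept. The variance of the row sums is used here as a simple stand-in for the peak-to-trough criterion described above, and the angle range and step are arbitrary choices.

```python
# Simplified projection-profile skew estimation; the variance of the row sums is
# used as a stand-in for the peak/trough difference described in the text.
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, angles=np.arange(-10.0, 10.5, 0.5)):
    """binary: 2-D array with text pixels = 1; returns the estimated skew in degrees."""
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)            # horizontal projection (row sums)
        score = profile.var()                    # sharper peaks -> larger variance
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Deskewing then amounts to rotating by the negative of the estimate:
# deskewed = rotate(binary, -estimate_skew(binary), reshape=False, order=0)
```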

The nearest-neighbor method described by Hashizume et al. [Hashizume et al., 1986] finds the nearest neighbor of every connected component; the direction vectors of all nearest-neighbor pairs are accumulated in a histogram, and the histogram peak is found to obtain the skew angle. Since only one nearest-neighbor connection is made for each component, connections with noisy sub-parts of characters reduce the accuracy of the method. Approaches using morphological operations [Chen & Haralick, 1994] remove ascenders and descenders and merge adjacent characters to produce one connected component for each line of text. A line is then fit to the pixels in each component using a least-squares technique. A histogram of the angles of the detected lines is constructed for the document, and a search procedure is applied to the histogram to determine the skew of the document. In methods based on the Fourier transform [Postl, 1986], the direction for which the density of the Fourier space is largest gives the skew angle. The Fourier method is computationally expensive for large images.

Line, word, and character segmentation

Once the document image is binarized and skew corrected, the actual text content must be extracted. Commonly used segmentation algorithms in document image analysis are connected component labeling, X-Y tree decomposition, run-length smearing, and the Hough transform [Bunke & Wang, 1997].

In connected component labeling, each connected component in the binary document image is assigned a distinct label. A connected component algorithm scans an image and groups its pixels into components based on pixel connectivity, i.e., all pixels in a connected component share similar pixel intensity values and are in some way connected to each other. Once all groups have been determined, each pixel is labeled according to the component (group) it was assigned to. The labels are natural numbers running from 1 to the total number of connected components in the image [Rosenfeld & Kak, 1976].
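
The sketch below shows connected component labeling on a tiny binary array, assuming SciPy is available; the 3x3 structuring element gives 8-connectivity.

```python
# Minimal sketch of connected component labeling with 8-connectivity,
# assuming SciPy is available; 'binary' holds foreground pixels as 1.
import numpy as np
from scipy.ndimage import label, find_objects

binary = np.array([[0, 1, 1, 0, 0],
                   [0, 1, 0, 0, 1],
                   [0, 0, 0, 1, 1]], dtype=np.uint8)

eight_connectivity = np.ones((3, 3), dtype=int)      # diagonal neighbors count too
labels, count = label(binary, structure=eight_connectivity)
# 'labels' contains 0 for background and 1..count for the components.
boxes = find_objects(labels)                         # one bounding box (slice pair) per component
print(count, labels)
```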

The X-Y tree decomposition [Nagy & Seth, 1984] uses the horizontal projection profile of the document image to extract the lines of the document. If the lines are well separated, the horizontal projection will have separated peaks and valleys, which serve as the separators of the text lines. These valleys are easily detected and used to determine the locations of the boundaries between lines (Figure 2.7). Similarly, gaps in the vertical projection of a line image are used to extract the words in a line, as well as to extract individual characters from a word. Overlapping, adjacent characters in a word (called kerned characters) cannot be segmented using zero-valued valleys of the vertical projection profile; special techniques or heuristics have to be employed to solve this problem.

Figure 2.7 An image and its horizontal projection profile

The run-length smearing algorithm (RLSA) [Wong, Casey & Wahl, 1982] first detects all white runs (sequences of background pixels) in each line. It then converts those runs whose length is shorter than a predefined threshold T into black runs. To obtain a segmentation, RLSA is applied line by line first and then column by column, yielding two distinct bitmaps; these are then combined by a logical AND operation. Hough-transform based methods, described in an earlier section, do not require connected or even nearby edge points; they work successfully even when different objects are connected to each other.
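
A simplified horizontal smearing pass is sketched below; the vertical pass and the AND combination follow the same pattern. The threshold value and the 0/1 foreground convention are assumptions made for illustration.

```python
# Simplified horizontal run-length smearing (RLSA) pass; foreground = 1,
# background = 0, and the threshold T is illustrative.
import numpy as np

def rlsa_horizontal(binary, T=20):
    """Fill background runs shorter than T pixels between foreground pixels, row by row."""
    out = binary.copy()
    for row in out:                                   # each 'row' is a view into 'out'
        ones = np.flatnonzero(row)                    # columns of foreground pixels
        for left, right in zip(ones[:-1], ones[1:]):
            if 0 < right - left - 1 < T:              # short background gap
                row[left:right] = 1                   # smear it to black
    return out

# Full RLSA (as described in the text): combine a row-wise and a column-wise pass.
# smeared = rlsa_horizontal(img, T_row) & rlsa_horizontal(img.T, T_col).T
```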

Thinning

Thinning, or skeletonization, is a process by which a one-pixel-wide representation (the skeleton) of an object is obtained while preserving the connectedness of the object and its end points [Gonzalez & Woods, 2002]. The purpose of thinning is to reduce the image components to their essential information so that further analysis and recognition are facilitated. For instance, a letter can be handwritten with different pens, giving different stroke thicknesses, but the information presented is the same. Thinning thus enables easier subsequent detection of pertinent features. The letter e is shown in Figure 2.8 before and after thinning. A number of thinning algorithms have been proposed and are in use; the most common is the classical Hilditch algorithm [Hilditch, 1969] and its variants. For recognizing large graphical objects with filled regions, which are often found in logos, region boundary detection is useful; but for small regions, such as those corresponding to individual characters, neither thinning nor boundary detection is performed, and the entire pixel array representing the region is forwarded to the subsequent stage of analysis [Kasturi et al., 2002].

Figure 2.8 An image before and after thinning
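
The sketch below thins a small synthetic stroke using scikit-image's skeletonize routine; this is an assumption about available tooling made for illustration, not the Hilditch algorithm mentioned above.

```python
# Minimal thinning sketch using scikit-image's skeletonize; the input stroke is synthetic.
import numpy as np
from skimage.morphology import skeletonize

stroke = np.zeros((9, 9), dtype=bool)
stroke[2:7, 3:6] = True                  # a 3-pixel-wide vertical bar
skeleton = skeletonize(stroke)           # one-pixel-wide, connectivity preserved
print(skeleton.astype(int))
```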

Recognition

In character recognition, the character symbols of a language are transformed into symbolic representations such as ASCII or Unicode. The basic problem is to assign the digitized character to its symbolic class. This is done in two steps: (i) feature extraction and selection, and (ii) classification, as shown in Figure 2.9.

Figure 2.9 Recognition process

Feature extraction and selection

The heart of any optical character recognition system is the formation of the feature vector to be used in the recognition stage. Feature extraction can be considered as finding a set of parameters (features) that define the shape of the underlying character as precisely and uniquely as possible. The term feature selection refers to algorithms that select the best subset of the input feature set. Methods that create new features based on transformations, or combinations, of the original features are called feature extraction algorithms. However, the terms feature selection and feature extraction are used interchangeably in the literature [Jain, Duin & Mao, 2000]. The features are to be selected in such a way that they help in discriminating between characters. The selection of a feature extraction method is probably the single most important factor in achieving high recognition performance [Trier, Jain & Taxt, 1996]. A large number of feature extraction methods are reported in the literature, but the methods selected depend on the given application. There is no universally accepted set of feature vectors in document image understanding. Features that capture topological and geometrical shape information are the most desirable ones. Features that capture the spatial distribution of the black (text) pixels are also very important [Bunke & Wang, 1997]. Hence, two types of approaches are defined as follows:

i. Structural approach

In structural approaches, features that describe the geometric and topological structures of a symbol are extracted. Structural features may be defined in terms of character strokes, character holes, end points, intersections between lines, loops, and other character attributes such as concavities. Compared to other techniques, structural analysis gives features with high tolerance to noise and style variations. However, these features are only moderately tolerant to rotation and translation [Line, 1993]. Structural approaches utilize structural features and decision rules to classify characters. The classifier is expected to recognize the natural variants of a character but discriminate between similar-looking characters such as O and Q, c and e, etc.

ii. Statistical approach

In the statistical approach, a pattern is represented as a vector: an ordered, fixed-length list of numeric features [Jain, Duin & Mao, 2000]. Many samples of a pattern are used for collecting statistics. This phase is known as the training phase, and its objective is to expose the system to the natural variants of a character. The recognition process then uses these statistics for identifying an unknown character. Features derived from the statistical distribution of points include the number of holes, geometrical moments, black-to-white crossing counts, width, height, and aspect ratio. Representation of a character image by the statistical distribution of points takes care of style variations to a large extent.

Template matching is also one of the most common and oldest classification methods. In template matching, individual image pixels are used as features. Classification is performed by comparing an input character image with a set of templates (or prototypes) from each character class. The template which matches the unknown most closely provides the recognition. The technique is simple and easy to implement and has been used in many commercial OCR machines. However, it is sensitive to noise and style variations and has no way of handling rotated characters.
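
A toy version of template matching is sketched below: each class is represented by a single prototype bitmap, and the unknown image is assigned to the prototype with the largest number of matching pixels. Real systems use many prototypes per class and more robust similarity measures; this is purely illustrative.

```python
# Toy template-matching classifier: one prototype bitmap per class, pixel-agreement score.
import numpy as np

def classify_by_template(unknown, templates):
    """unknown: 2-D binary array; templates: dict mapping label -> array of the same shape."""
    best_label, best_score = None, -1
    for label, prototype in templates.items():
        score = int(np.sum(unknown == prototype))   # number of agreeing pixels
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```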

The statistical and structural approaches both have their advantages and disadvantages. Statistical features are more tolerant to noise than structural descriptions, provided the sample space over which training has been performed is representative and realistic. The variation due to font or writing style can be more easily abstracted in structural descriptions. In a hybrid approach, the two approaches are combined at appropriate stages for representing characters and for classifying unknown characters. Segmented character images are first analyzed to detect structural features such as straight lines, curves, and significant points along the curves. For regions corresponding to individual characters or graphical symbols, local features such as aspect ratio, compactness (ratio of area to the square of the perimeter), asymmetry, black pixel density, contour smoothness, number of loops, number of line crossings, line end points, etc. are used. Feature extraction techniques are evaluated based on: (a) robustness against noise, distortion, size, and style/font variation, (b) speed of recognition, (c) complexity of implementation, and (d) how independent the feature set is, i.e., whether it requires any supplementary techniques [Line, 1993].
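
The sketch below computes three of the local features just listed (aspect ratio, black pixel density, and compactness) for a single segmented character. The perimeter estimate is deliberately crude, and the whole function is an illustration rather than a feature extractor from any cited work.

```python
# A few of the local features listed above, computed for one binary character image.
import numpy as np

def simple_features(char):
    """char: 2-D binary array of a segmented character (foreground = 1)."""
    h, w = char.shape
    area = int(char.sum())
    # Crude perimeter: foreground pixels with at least one background 4-neighbor.
    padded = np.pad(char.astype(np.int32), 1)
    neighbor_count = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                      padded[1:-1, :-2] + padded[1:-1, 2:])
    perimeter = int(np.sum((char == 1) & (neighbor_count < 4)))
    return {
        "aspect_ratio": w / h,
        "black_density": area / (h * w),
        "compactness": area / (perimeter ** 2) if perimeter else 0.0,
    }
```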

Classification

The classification stage in an OCR process assigns labels to character images based on the features extracted and the relationships among the features [Kasturi et al., 2002]. In simple terms, it is this part of the OCR which finally recognizes individual characters and outputs them in machine-editable form. A number of approaches are possible for the design of the classifier, and the choice often depends on which classifier is available, or best known, to the designer. Decision-theoretic methods are used when the description of the character can be numerically represented in a feature vector. The principal approaches to decision-theoretic recognition are minimum-distance classifiers, statistical classifiers and neural networks. Jain, Duin & Mao [Jain, Duin & Mao, 2000] identified the following two main categories.

1. The simplest approach is based on matching, i.e., identifying the class whose description lies at the nearest distance. Matching covers the group of techniques based on similarity measures, where the distance between the feature vector describing the extracted character and the description of each class is calculated. Different measures may be used, but the most common is the Euclidean distance. This minimum-distance classifier works well when the classes are well separated, that is, when the distance between the class means is sufficiently large compared to the spread of each class (a minimal sketch of such a classifier is given below). When the entire character is used as input to the classification and no features are extracted (template matching), a correlation approach is used: the distance between the character image and prototype images representing each character class is computed. A special type of classifier is the decision tree [Breiman et al., 1984], which is trained by an iterative selection of the individual features that are most salient at each node of the tree. Another category constructs decision boundaries directly by optimizing a certain error criterion. An example is Fisher's linear discriminant [Fisher, 1936; Chernoff & Moses, 1959; Fukunaga, 1990], which minimizes the mean squared error (MSE) between the classifier output and the desired labels. Neural networks are another category. In a back-propagation neural network, the network is composed of several layers of interconnected elements. A feature vector enters the network at the input layer; each element of a layer computes a weighted sum of its inputs and transforms it into an output by a nonlinear function. During training, the weights at each connection are adjusted until the desired output is obtained. A problem of neural networks in OCR may be their limited predictability and generality, while an advantage is their adaptive nature [Line, 1993]. Other approaches include single-layer and multilayer perceptrons [Raudys, 1998] and support vector machines [Vapnik, 1995].
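
A minimal sketch of the first, matching-based category is given below: class means are estimated from training vectors and an unknown feature vector is assigned to the nearest mean under the Euclidean distance. The data layout is an assumption made for illustration.

```python
# Minimum-distance (nearest class mean) classification with Euclidean distance.
import numpy as np

def train_class_means(features, labels):
    """features: (n_samples, n_features) array; labels: length-n_samples sequence."""
    labels = np.asarray(labels)
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify_min_distance(x, class_means):
    """Assign feature vector x to the class whose mean is nearest."""
    return min(class_means, key=lambda c: np.linalg.norm(x - class_means[c]))
```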

2. The second main concept used in classifier design is based on probabilistic approaches such as the Bayes decision rule and the maximum likelihood rule [Bayes, 1763; Chow, 1957; Fukunaga, 1990]. The principle is to use a classification scheme that is optimal in the sense that, on average, it gives the lowest probability of making classification errors. A classifier that minimizes the total average loss is called a Bayes classifier. Given an unknown symbol described by its feature vector, the probability that the symbol belongs to class c is computed for all classes c = 1...N. The symbol is then assigned to the class which gives the maximum probability. For this scheme to be optimal, the probability density functions of the symbols of each class must be known, along with the probability of occurrence of each class.

Having dealt with the general techniques of character recognition, we now consider the importance of font recognition.

2.3 Font Consideration

Optical character recognition systems deal with the recognition of printed or handwritten characters. Printed characters may have various fonts and sizes. Typographically, a font is a particular instantiation of a typeface design, often in a particular size, weight and style [Rubinstein, 1988]. A font family (for example, Arial) includes several styles (plain, italic, bold); the angular slope given to vertical strokes results in the italic variation of the same font. The Arial font family in italic, bold and plain styles is shown in Figure 2.10.

This is Arial Text in Italic. This is Arial Text in Bold. This is Plain Arial Text.

Figure 2.10 Arial font family in italic, bold and plain styles

A type font family can be identified through its stylistic renderings of elements, weights, transformation variations and sizes. For example, Times Roman has thin, smooth serifs at the ends of vertical strokes, so this stylistic feature will be observed in all its family members. Different styles of two font families are shown below for the character A.

A A A ---> Plain, Italic, and Bold styles in the Times New Roman font family.

A A A ---> Plain, Italic, and Bold styles in the Arial font family.

A font face (for example, Arial Regular) is a complete set of a single style, in all sizes, as shown in Figure 2.11.

This is Arial font face in different sizes.

Figure 2.11 Arial font face

Font is an important factor for both character recognition and script identification. Many attempts have been made to construct OCRs with reasonable accuracy, but only for limited fonts. The performance of these OCRs is expected to be good as long as the same font size and type are maintained. Since this requirement is not practical, poor results are often obtained: the recognition accuracy often drops significantly when a document printed in a different font is encountered. A document reader must cope with many sources of variation, notably the font and size of the text. In commercial devices, the multi-font aspect was for a long time neglected for the benefit of speed and accuracy, and substitution solutions were proposed. At first, the solution was to work with customized fonts such as OCR-A and OCR-B. The accuracy was quite good even on degraded images, on the condition that the font was carefully selected. However, recognition scores drop rapidly when fonts or sizes are changed. An optical character recognition system that works efficiently on documents that contain any font is therefore highly desirable.

2.3.1 OCR Classification based on Fonts

Based on an OCR system's capability to recognize different character sets, a classification [Line, 1993], in order of difficulty, is as follows.

Fixed font OCRs

OCR machines of this category deal with the recognition of characters in only one specific typewritten font. Examples of such fonts are OCR-A, OCR-B, Pica, Elite, etc. These fonts are characterized by fixed spacing between characters. OCR-A and OCR-B are the American and European standard fonts specially designed for optical character recognition, where each character has a unique shape to avoid ambiguity with other characters similar in shape. Using these character sets, it is quite common for commercial OCR machines to achieve a recognition rate as high as 99.99% with a high reading speed. The first generation OCRs were fixed font machines, and the methods applied were usually based on template matching and correlation.

Multi-font OCRs

Multi-font OCR machines recognize characters from more than one font, as opposed to a fixed font system, which can only recognize symbols of one specific font. For the earlier generation OCRs, the limit on the number of recognized fonts was due to the pattern recognition algorithm used, template matching, which required that a library of bitmap images of each character from each font be stored (the requirement of a huge database). The accuracy is quite good, even on degraded images, as long as the fonts in the library are selected with care.

Omni font OCRs

An omni font OCR machine can recognize symbols of most non-stylized fonts without having to maintain huge databases of specific font information. Usually, omni font technology is characterized by the use of feature extraction. The database of an omni font system contains a description of each symbol class instead of the symbols themselves. This gives flexibility in automatic recognition of characters from a variety of fonts. A number of current OCR systems for English claim to be omni font.

Although omni font is the common term used for these OCRs, this does not mean that they recognize characters from all existing fonts. Font classification can reduce the number of alternative shapes for each class, leading essentially to single-font character recognition [Zhu, Tan & Wang, 2001]. The following is an overview of approaches used in the literature for recognizing fonts in English.

Font Recognition approaches

There are basically two approaches used for font identification in English [Zramdini & Ingold, 1998]:

1. A priori font classification, which identifies the font without any knowledge of the content of the characters, and

2. A posteriori font classification, which classifies the font using knowledge of the characters.

In the first, global features are extracted from a word, line or paragraph. These are features that are generally detected by non-experts in typography (text density, size, orientation and spacing of the letters, serifs, etc.). In the second approach, local features are extracted from individual characters. These features are based on letter peculiarities, such as the shapes of serifs and the differing glyph shapes used for particular letters such as g and a. This kind of approach may derive substantial benefit from knowledge of the letter classes.

Zramdini and Ingold [Zramdini & Ingold, 1993] aim at the identification of global typographical features such as font weight, typeface, slope and size of the text of an image block, from a given set of already learned fonts. Jung, Shin and Srihari [Jung, Shin & Srihari, 1999] proposed a font classification method based on the definition of typographical attributes such as ascenders, descenders and serifs, and the use of a neural network classifier. Bazzi et al. [Bazzi et al., 1997] focused on the problem of language-independent recognition, i.e., script-independent features are used. For example, a line is divided into a sequence of cells, and features such as intensity, the vertical derivative of intensity and the horizontal derivative of intensity are used, but script-dependent features are not.

In the method used by Khoubyari and Hull [Khoubyari & Hull, 1996], clusters of word images are generated from an input document and matched to a database of function words derived from fonts and document images. The font or document that matches best provides the identification of the predominant font and function words. Khorsheed and Clocksin [Khorsheed & Clocksin, 2000] transformed each word into a normalized polar image, to which a two-dimensional Fourier transform is applied. The resultant spectrum tolerates variations in size, rotation or displacement. Allier and Emptoz [Allier & Emptoz, 2003] treated the printed document as texture at the character level. A multi-channel Gabor filtering approach is used: each filter is applied to the original textured image, and a simple feature vector is created using statistical calculations. Classification is done using Bayesian theory. Shi and Pavlidis [Shi & Pavlidis, 1997] use a hybrid font recognition approach that combines an a priori approach and an a posteriori approach. Page properties such as the histogram of word length and stroke slopes are used for font feature extraction. This font information is extracted from either the entire page or some selected short words like a, an, am, as. The system utilizes contextual information at the word level. Following a similar method, Zhu, Tan and Wang [Zhu, Tan & Wang, 2001] considered the document as an image containing specific textures and regarded font recognition as texture identification. The original image is preprocessed to form a uniform block of text. A block of text printed in each font can be seen as having a specific texture, whose spatial frequency and orientation contents represent its features; these texture features are used to identify different fonts. A multi-channel Gabor filtering technique is used to extract features from the uniform text block, and a weighted Euclidean distance classifier is used to identify the fonts.
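
The sketch below extracts multi-channel Gabor texture features from a block of text in the spirit of this texture-based view of font recognition. The choice of frequencies and orientations, the channel statistics, and the input file are all assumptions, and scikit-image is assumed to be available.

```python
# Multi-channel Gabor texture features for a uniform block of text (illustrative only).
import numpy as np
from skimage import io
from skimage.filters import gabor

block = io.imread("text_block.png", as_gray=True)           # hypothetical uniform text block

features = []
for frequency in (0.1, 0.2, 0.3):                            # assumed spatial frequencies
    for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):   # four orientations
        real, imag = gabor(block, frequency=frequency, theta=theta)
        magnitude = np.hypot(real, imag)
        features.extend([magnitude.mean(), magnitude.std()])  # per-channel statistics

feature_vector = np.array(features)
# The vector would then be compared to stored font prototypes, e.g. with a
# (weighted) Euclidean distance, as in the texture-based systems described above.
```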

2.4 Indian language OCRs

At present, reasonably efficient and inexpensive OCR packages are commercially available to recognize printed text in widely used languages such as English. These systems can process documents that are typewritten or printed. While a large amount of literature is available on the recognition of Roman, Chinese and Japanese characters, relatively little work has been reported on the recognition of Indian language scripts. Nevertheless, under the aegis of the TDIL Programme, thirteen Resource Centers for Indian Language Technology Solutions (RCILTS) have been established at various educational institutes and R&D organizations, covering all Indian languages [JLT, July 2003]. Under this program, OCRs, human-machine interface systems and other tools are being developed for different Indian languages. Thus, OCR systems for Indian scripts have just started appearing. A brief summary of the techniques used in these OCRs, as given in the newsletter [JLT, October 2003], along with other reported works, is presented in this section. Detailed performance reports, as given in the Language Technology Products Testing Reports from the July 2004 newsletter of TDIL [JLT, July 2004], are also presented in this section.

Techniques used in different Indian script OCRs

We now present the studies on OCRs for different Indian scripts, along with a detailed description of the methods used in them.

Devnagari OCR

OCR work on printed Devnagari script started in the 1970s. Sinha and Mahabala [Sinha & Mahabala, 1979] presented a syntactic pattern analysis system with an embedded picture language for the recognition of handwritten and machine-printed Devnagari characters. For each symbol of the Devnagari script, the system stores a structural description in terms of primitives and their relationships. Pal and Chaudhuri [Pal & Chaudhuri, 1997] reported a complete OCR system for printed Devnagari. In this system, headline deletion is used to segment the characters from the word, and text lines are divided into three horizontal zones for an easier recognition procedure. After preprocessing and segmentation using zonal information and shape characteristics, the basic, modified and compound characters are separated. A structural feature-based tree classifier recognizes modified and basic characters, while compound characters are recognized by a tree classifier followed by a template matching approach. The method reports about 96% accuracy.

Bansal [Bansal, 1999] described a Devnagari OCR in her doctoral thesis. Here, segmentation is done using a two-stage, hybrid approach. The initial segmentation extracts the header line and delineates the upper strip from the rest. This yields vertically separated character boxes that could be conjuncts, touching characters, shadow characters, lower modifiers or a combination of these. In the second stage, segmentation is done based on structural information obtained from boundary traversal. Vertical bar features, horizontal zero crossings, the number and position of vertex points, moments, etc. are used in the classification. For each feature, the distances from the reference prototypes of the candidate characters are computed, and a classifier based on distance matching is employed for recognition. An error detection and correction phase is also included as post-processing. A performance of 93% accuracy at the character level is reported.

Problems that arise in developing OCR systems for noisy images are addressed in the work by Iyer et al. [Iyer et al., 2005]. Lines are segmented into word-like units based on the dips in the vertical projection profile of the line. Some statistical data, such as minimum and average widths, height, etc., are computed. Basic geometrical shapes such as a full vertical bar, a horizontal line, diagonal lines in both orientations, and circles and semicircles of varying radii and orientations are used to form the feature vector. Characters are classified using a rule-based approach. The rule base consists of more than one rule for a given character to account for different font-specific representations of the same character. A Hamming distance metric is employed for classification, and a character recognition rate of only 55% is reported. The authors also trained a feed-forward back-propagation neural network with a single hidden layer; a character recognition rate of 76% is reported with this neural network approach.

Bangla OCR

Ray and Chatterjee presented a recognition system based on a nearest neighbor classifier employing features extracted by using a string connectivity criterion [Ray & Chatterjee, 1984].

Chaudhuri and Pal reported a complete OCR for printed Bangla [Chaudhuri & Pal, 1998], in which a combination of template- and feature-matching approaches is used. A histogram-based thresholding approach is used to convert the image into a binary image. The skew angle is determined from the skew of the headline. Text lines are partitioned into three zones, and the horizontal and vertical projection profiles are used to segment the text into lines, words, and characters. A primary grouping of characters into basic, modified and compound characters is made before the actual classification. A few stroke features are used for this purpose along with a tree classifier, where the decision at each node of the tree is taken on the basis of the presence or absence of a particular feature. Recognition of compound characters is done in two stages: in the first stage, characters are grouped into small subsets by the above tree classifier; at the second stage, characters in each group are recognized by a run-based template matching approach. Some character-level statistics, such as individual character occurrence frequency, bi-gram and tri-gram statistics, etc., are utilized to aid the recognition process. For single-font, clear documents, a character-level recognition accuracy of 99.1% is reported.

Recognition of isolated and continuous printed multi-font Bengali characters is reported by Mahmud et al. [Mahmud et al., 2003]. This is based on Freeman chain code features, which are explained as follows. When objects are described by their skeletons or contours, they can be represented by chain coding, where the ON pixels are represented as sequences of connected neighbors along lines and curves. Instead of storing the absolute location of each ON pixel, the direction from its previously coded neighbor is stored. The chain codes from the center pixel are 0 for east, 1 for north-east, and so on. This is represented pictorially in Figure 2.12 (a) and (b). The chain code gives the boundary of the character image, and the slope distribution of the chain code captures the curvature properties of the character. In this work, the connected components of each character are divided into four regions with the center of mass as the origin, and the slope distribution of the chain code in these four regions is used as a local feature. Using the chain code representation, classification is done by a feed-forward neural network. Testing on three types of fonts, with accuracies of approximately 98% for isolated characters and 96% for continuous characters, is reported.

Figure 2.12 (a) Chain code directions and (b) their graphical representation
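
The sketch below encodes an ordered list of boundary pixels as a Freeman chain code with the convention described above (0 = east, 1 = north-east, and so on, counter-clockwise); the example boundary is hypothetical.

```python
# Freeman chain coding of an ordered boundary; codes follow 0 = east, 1 = north-east, ...
# (the row index grows downward, as in image coordinates).
DIRECTION_CODE = {(0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
                  (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7}

def chain_code(boundary):
    """boundary: ordered list of (row, col) pixels along a contour."""
    return [DIRECTION_CODE[(r1 - r0, c1 - c0)]
            for (r0, c0), (r1, c1) in zip(boundary[:-1], boundary[1:])]

# Example: an L-shaped path going south twice and then east twice -> [6, 6, 0, 0]
print(chain_code([(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]))
```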

Gurmukhi (Punjabi) OCR

Lehal and Singh presented an OCR system for printed Gurmukhi script [Lehal & Singh, 2000]. The skew angle is determined by calculating horizontal and vertical projections at different angles, at fixed intervals in the range 0 to 90 degrees; the angle at which the difference between the sums of the heights of peaks and valleys is maximum is identified as the skew angle. For line and word segmentation, horizontal and vertical projection profiles are used, respectively. Each word is segmented into connected components or sub-symbols, where each sub-symbol corresponds to the connected portion of the character lying in one of the three zones. Connected components are formed by grouping together black pixels having 8-connectivity. The primary feature set is made up of features that are expected to be font and size invariant, such as whether the number of junctions with the headline equals 1, the presence of a sidebar, the presence of a loop, and a loop along the headline. The secondary feature set is a combination of local and global features: the number of endpoints and their locations, the number of junctions and their locations, horizontal projection count, right profile depth, left profile depth, right and left profile directions, and aspect ratio. A binary tree classifier is used for the primary features, and a nearest neighbor classifier with a variable-sized vector is used for the secondary features. This multi-stage classifier is used to classify the sub-symbols, which are then combined using heuristics and finally converted to characters. A recognition rate of 96.6% was reported.

Lehal and Singh also developed a post-processor for Gurmukhi [Lehal & Singh, 2002]. In this, statistical information about the Punjabi language, such as word length, the shape of words, the frequency of occurrence of different characters at specific positions in a word, information about visually similar-looking words, grammar rules of the Punjabi language, and heuristics are utilized.

Telugu OCR

The first reported work on OCR of Telugu characters is by Rajasekaran and Deekshatulu [Rajasekaran & Deekshatulu, 1977]. It identifies 50 primitive features and proposes a two-stage syntax-aided character recognition system. A knowledge-based search is used in the first stage to recognize and remove the primitive shapes. In the second stage, the pattern obtained after the removal of primitives is coded by tracing along points on it. Classification is done by a decision tree. Primitives are joined and superimposed appropriately to define individual characters.

Rao and Ajitha utilized the characteristic feature of Telugu characters being composed of circular segments of different radii [Rao & Ajitha, 1995]. Recognition consists of segmenting the characters into their constituent components and identifying them. The feature set is chosen as the circular segments, which preserve the canonical shapes of Telugu characters. The recognition scores are reported as ranging from 78 to 90% across different subjects, and from 91 to 95% when the reference and test sets were from the same subject.

Sukhaswami, Seetharamulu and Pujari proposed a neural network based system [Sukhaswami, Seetharamulu & Pujari, 1995]. A Hopfield neural network working as an associative memory was initially chosen for recognition. Due to the limitation in the storage capacity of the Hopfield neural network, they later proposed a Multiple Neural Network Associative Memory (MNNAM), in which the networks work on mutually disjoint sets of training patterns. They demonstrated that the storage shortage could be overcome by this scheme.

Negi, Bhagvati and Krishna reported an OCR for Telugu [Negi, Bhagvati & Krishna, 2001]. Instead of segmenting the words into characters as is usually done, words are split into connected components (glyphs). The Run Length Smearing Algorithm and Recursive X-Y Cuts [Nagy, Seth & Vishwanathan, 1992] are used to segment the input document image into words. About 370 connected components (depending on the font) are identified as sufficient to compose all the characters, including punctuation marks and numerals. Template matching based on the fringe distance [Brown, 1994] is used to measure the similarity or distance between the input and each template, and the template with the minimum fringe distance is marked as the recognized character. The template code of the recognized character is converted into ISCII, the Indian Standard Code for Information Interchange. Raw OCR accuracy with no post-processing is reported as 92%.
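
A simplified, symmetric version of fringe-distance template matching is sketched below: each binary image is turned into a map of distances to its nearest foreground pixel, and the score sums the map values of one image under the foreground pixels of the other. This is an illustration of the general idea only, not the exact measure used in the system above; SciPy is assumed to be available.

```python
# Simplified fringe-distance template matching (symmetric variant, illustrative only).
import numpy as np
from scipy.ndimage import distance_transform_edt

def fringe_map(binary):
    """Distance of every pixel to the nearest foreground (1) pixel."""
    return distance_transform_edt(binary == 0)

def fringe_distance(glyph, template):
    """Symmetric fringe distance between two equally sized binary images."""
    return (fringe_map(template)[glyph == 1].sum() +
            fringe_map(glyph)[template == 1].sum())

def recognize_glyph(glyph, templates):
    """templates: dict mapping label -> binary template; returns the closest label."""
    return min(templates, key=lambda lab: fringe_distance(glyph, templates[lab]))
```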

Pujari, Naidu and Jinaga proposed a recognizer that relies on wavelet multiresolution analysis for capturing the distinctive characteristics of the Telugu script [Pujari, Naidu & Jinaga, 2002]. Gray-level input text images are line-segmented using horizontal projections, and vertical projections are used for word segmentation. Images are uniformly scaled to 32x32 using a zero-padding technique. A wavelet representation with three levels of down-sampling reduces a 32x32 image to a set of four 8x8 images, of which only the average image is considered for further processing. The 8x8 character images are converted to binary images using the mean grey level as the threshold, and the resulting string of 64 bits is used as the signature of the input symbol. A Hopfield-based dynamic neural network is designed for recognition. Of 444 test patterns, the system correctly recognized 415, giving a recognition rate of 93.46%. The authors reported that the same system, when applied to recognize English characters, resulted in a very low recognition rate, since the directional features that are prevalent in Latin scripts are not preserved during signature computation with the wavelet transformation.

Lakshmi and Patvardhan presented recognition of basic Telugu symbols [Lakshmi & Patvardhan, 2003]. Each character (basic symbol) is resized to 36 columns while maintaining the original aspect ratio. A preliminary classification is done by grouping all the symbols with approximately the same height (number of rows). The feature vector is computed from a set of seven invariant moments based on the second- and third-order moments, and recognition is done using the k-nearest neighbor algorithm on these feature vectors. A single font type is used for both training and test data. Testing is done on noisy character images with Gaussian noise, salt-and-pepper noise and speckle noise added. Preprocessing such as line, word, and character segmentation is not addressed in this work.
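
In the spirit of this moment-based scheme, the sketch below computes the seven Hu invariant moments of a symbol image and feeds them to a k-nearest neighbor classifier. The library choices (scikit-image, scikit-learn), the value of k, and the data variables are assumptions made for illustration.

```python
# Seven invariant (Hu) moments per symbol, classified with k-nearest neighbors.
import numpy as np
from skimage.measure import moments_central, moments_normalized, moments_hu
from sklearn.neighbors import KNeighborsClassifier

def hu_features(symbol):
    """symbol: 2-D binary array of one basic symbol (foreground = 1)."""
    mu = moments_central(symbol.astype(float))
    nu = moments_normalized(mu)
    return moments_hu(nu)                      # seven rotation/scale-invariant values

# Hypothetical training data: lists of symbol images and their labels.
# X = np.array([hu_features(img) for img in train_images])
# knn = KNeighborsClassifier(n_neighbors=3).fit(X, train_labels)
# predicted = knn.predict([hu_features(test_image)])
```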

The authors extended the work to a multi-font OCR [Lakshmi & Patvardhan, 2002]. Preprocessing stages such as binarization, noise removal, skew correction using the Hough transform method, and line and word segmentation using horizontal and vertical projections are included in this work. Basic symbols are obtained from each word using a connected components approach. After a preliminary classification as in the previous work, pixel gradient directions are chosen as the features. Recognition is again done using the k-nearest neighbor algorithm on these feature vectors. The training vectors are created with three different fonts and three sizes: 25, 30 and 35. Testing is done on characters of different sizes and also of some different fonts. A recognition accuracy of more than 92% for most of the images is claimed.

DRISHTI is the OCR for the Telugu language developed by the Resource Center for Indian Language Technology Solutions (RCILTS) at the University of Hyderabad [JLT, July 2003]. The techniques used in DRISHTI are as follows. For binarization, three options are provided: global (the default), percentile-based and an iterative method. Skew detection and correction are done by maximizing the variance of the horizontal projection profile. Text and graphics separation is done using the horizontal projection profile. Multi-column text detection is done using the Recursive X-Y Cuts technique, which is based on recursively splitting a document into rectangular regions using vertical and horizontal projection profiles alternately. Word segmentation is done using a combination of the Run-Length Smearing Algorithm (RLSA) and connected-component labeling, and words are decomposed into glyphs by running the connected component labeling algorithm again. Recognition is based on template matching using fringe distance maps; the template with the best matching score is output as the recognized glyph.

A semi-automatic, adaptive OCR was developed by Rawat et al. for the recognition of Indian languages [Rawat et al., 2006]. Features in the eigenspace are used for classification using a Support Vector Machine (SVM), and Principal Component Analysis (PCA) is used for feature dimensionality reduction. To resolve confusing characters during the recognition process, a resolver module that uses language-specific information is designed. The post-processor is based on contextual information and a language-based reverse dictionary approach to correct wrongly recognized words. The performance of the prototype system is tested on datasets taken from four Indian languages, including Hindi, Telugu and Tamil.


More information

International Journal of Advance Research in Engineering, Science & Technology

International Journal of Advance Research in Engineering, Science & Technology Impact Factor (SJIF): 3.632 International Journal of Advance Research in Engineering, Science & Technology e-issn: 2393-9877, p-issn: 2394-2444 (Special Issue for ITECE 2016) Analysis and Implementation

More information

On Segmentation of Documents in Complex Scripts

On Segmentation of Documents in Complex Scripts On Segmentation of Documents in Complex Scripts K. S. Sesh Kumar, Sukesh Kumar and C. V. Jawahar Centre for Visual Information Technology International Institute of Information Technology, Hyderabad, India

More information

Indian Multi-Script Full Pin-code String Recognition for Postal Automation

Indian Multi-Script Full Pin-code String Recognition for Postal Automation 2009 10th International Conference on Document Analysis and Recognition Indian Multi-Script Full Pin-code String Recognition for Postal Automation U. Pal 1, R. K. Roy 1, K. Roy 2 and F. Kimura 3 1 Computer

More information

Building Multi Script OCR for Brahmi Scripts: Selection of Efficient Features

Building Multi Script OCR for Brahmi Scripts: Selection of Efficient Features Building Multi Script OCR for Brahmi Scripts: Selection of Efficient Features Md. Abul Hasnat Center for Research on Bangla Language Processing (CRBLP) Center for Research on Bangla Language Processing

More information

Segmentation Based Optical Character Recognition for Handwritten Marathi characters

Segmentation Based Optical Character Recognition for Handwritten Marathi characters Segmentation Based Optical Character Recognition for Handwritten Marathi characters Madhav Vaidya 1, Yashwant Joshi 2,Milind Bhalerao 3 Department of Information Technology 1 Department of Electronics

More information

Classification of Printed Chinese Characters by Using Neural Network

Classification of Printed Chinese Characters by Using Neural Network Classification of Printed Chinese Characters by Using Neural Network ATTAULLAH KHAWAJA Ph.D. Student, Department of Electronics engineering, Beijing Institute of Technology, 100081 Beijing, P.R.CHINA ABDUL

More information

An Efficient Character Segmentation Based on VNP Algorithm

An Efficient Character Segmentation Based on VNP Algorithm Research Journal of Applied Sciences, Engineering and Technology 4(24): 5438-5442, 2012 ISSN: 2040-7467 Maxwell Scientific organization, 2012 Submitted: March 18, 2012 Accepted: April 14, 2012 Published:

More information

Topic 6 Representation and Description

Topic 6 Representation and Description Topic 6 Representation and Description Background Segmentation divides the image into regions Each region should be represented and described in a form suitable for further processing/decision-making Representation

More information

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes 2009 10th International Conference on Document Analysis and Recognition Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes Alireza Alaei

More information

A two-stage approach for segmentation of handwritten Bangla word images

A two-stage approach for segmentation of handwritten Bangla word images A two-stage approach for segmentation of handwritten Bangla word images Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri #, Dipak Kumar Basu Computer Science & Engineering Department,

More information

Recognition of Unconstrained Malayalam Handwritten Numeral

Recognition of Unconstrained Malayalam Handwritten Numeral Recognition of Unconstrained Malayalam Handwritten Numeral U. Pal, S. Kundu, Y. Ali, H. Islam and N. Tripathy C VPR Unit, Indian Statistical Institute, Kolkata-108, India Email: umapada@isical.ac.in Abstract

More information

Degraded Text Recognition of Gurmukhi Script. Doctor of Philosophy. Manish Kumar

Degraded Text Recognition of Gurmukhi Script. Doctor of Philosophy. Manish Kumar Degraded Text Recognition of Gurmukhi Script A Thesis Submitted in fulfilment of the requirements for the award of the degree of Doctor of Philosophy Submitted by Manish Kumar (Registration No. 9000351)

More information

Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network

Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network 139 Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network Harmit Kaur 1, Simpel Rani 2 1 M. Tech. Research Scholar (Department of Computer Science & Engineering), Yadavindra College

More information

N.Priya. Keywords Compass mask, Threshold, Morphological Operators, Statistical Measures, Text extraction

N.Priya. Keywords Compass mask, Threshold, Morphological Operators, Statistical Measures, Text extraction Volume, Issue 8, August ISSN: 77 8X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Combined Edge-Based Text

More information

A Document Image Analysis System on Parallel Processors

A Document Image Analysis System on Parallel Processors A Document Image Analysis System on Parallel Processors Shamik Sural, CMC Ltd. 28 Camac Street, Calcutta 700 016, India. P.K.Das, Dept. of CSE. Jadavpur University, Calcutta 700 032, India. Abstract This

More information

FRAGMENTATION OF HANDWRITTEN TOUCHING CHARACTERS IN DEVANAGARI SCRIPT

FRAGMENTATION OF HANDWRITTEN TOUCHING CHARACTERS IN DEVANAGARI SCRIPT International Journal of Information Technology, Modeling and Computing (IJITMC) Vol. 2, No. 1, February 2014 FRAGMENTATION OF HANDWRITTEN TOUCHING CHARACTERS IN DEVANAGARI SCRIPT Shuchi Kapoor 1 and Vivek

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION 1.1 Introduction Pattern recognition is a set of mathematical, statistical and heuristic techniques used in executing `man-like' tasks on computers. Pattern recognition plays an

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

HCR Using K-Means Clustering Algorithm

HCR Using K-Means Clustering Algorithm HCR Using K-Means Clustering Algorithm Meha Mathur 1, Anil Saroliya 2 Amity School of Engineering & Technology Amity University Rajasthan, India Abstract: Hindi is a national language of India, there are

More information

II. WORKING OF PROJECT

II. WORKING OF PROJECT Handwritten character Recognition and detection using histogram technique Tanmay Bahadure, Pranay Wekhande, Manish Gaur, Shubham Raikwar, Yogendra Gupta ABSTRACT : Cursive handwriting recognition is a

More information

Recognition-based Segmentation of Nom Characters from Body Text Regions of Stele Images Using Area Voronoi Diagram

Recognition-based Segmentation of Nom Characters from Body Text Regions of Stele Images Using Area Voronoi Diagram Author manuscript, published in "International Conference on Computer Analysis of Images and Patterns - CAIP'2009 5702 (2009) 205-212" DOI : 10.1007/978-3-642-03767-2 Recognition-based Segmentation of

More information

Word-wise Hand-written Script Separation for Indian Postal automation

Word-wise Hand-written Script Separation for Indian Postal automation Word-wise Hand-written Script Separation for Indian Postal automation K. Roy U. Pal Dept. of Comp. Sc. & Engg. West Bengal University of Technology, Sector 1, Saltlake City, Kolkata-64, India Abstract

More information

Chapter Review of HCR

Chapter Review of HCR Chapter 3 [3]Literature Review The survey of literature on character recognition showed that some of the researchers have worked based on application requirements like postal code identification [118],

More information

A Simple Text-line segmentation Method for Handwritten Documents

A Simple Text-line segmentation Method for Handwritten Documents A Simple Text-line segmentation Method for Handwritten Documents M.Ravi Kumar Assistant professor Shankaraghatta-577451 R. Pradeep Shankaraghatta-577451 Prasad Babu Shankaraghatta-5774514th B.S.Puneeth

More information

Image representation. 1. Introduction

Image representation. 1. Introduction Image representation Introduction Representation schemes Chain codes Polygonal approximations The skeleton of a region Boundary descriptors Some simple descriptors Shape numbers Fourier descriptors Moments

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Keyword Spotting in Document Images through Word Shape Coding

Keyword Spotting in Document Images through Word Shape Coding 2009 10th International Conference on Document Analysis and Recognition Keyword Spotting in Document Images through Word Shape Coding Shuyong Bai, Linlin Li and Chew Lim Tan School of Computing, National

More information

Digital Image Processing Fundamentals

Digital Image Processing Fundamentals Ioannis Pitas Digital Image Processing Fundamentals Chapter 7 Shape Description Answers to the Chapter Questions Thessaloniki 1998 Chapter 7: Shape description 7.1 Introduction 1. Why is invariance to

More information

Carmen Alonso Montes 23rd-27th November 2015

Carmen Alonso Montes 23rd-27th November 2015 Practical Computer Vision: Theory & Applications 23rd-27th November 2015 Wrap up Today, we are here 2 Learned concepts Hough Transform Distance mapping Watershed Active contours 3 Contents Wrap up Object

More information

1. INTRODUCTION. AMS Subject Classification. 68U10 Image Processing

1. INTRODUCTION. AMS Subject Classification. 68U10 Image Processing ANALYSING THE NOISE SENSITIVITY OF SKELETONIZATION ALGORITHMS Attila Fazekas and András Hajdu Lajos Kossuth University 4010, Debrecen PO Box 12, Hungary Abstract. Many skeletonization algorithms have been

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu ECG782: Multidimensional Digital Signal Processing Spring 2014 TTh 14:30-15:45 CBC C313 Lecture 10 Segmentation 14/02/27 http://www.ee.unlv.edu/~b1morris/ecg782/

More information

Image Processing Fundamentals. Nicolas Vazquez Principal Software Engineer National Instruments

Image Processing Fundamentals. Nicolas Vazquez Principal Software Engineer National Instruments Image Processing Fundamentals Nicolas Vazquez Principal Software Engineer National Instruments Agenda Objectives and Motivations Enhancing Images Checking for Presence Locating Parts Measuring Features

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Third Edition Rafael C. Gonzalez University of Tennessee Richard E. Woods MedData Interactive PEARSON Prentice Hall Pearson Education International Contents Preface xv Acknowledgments

More information

Comparative Performance Analysis of Feature(S)- Classifier Combination for Devanagari Optical Character Recognition System

Comparative Performance Analysis of Feature(S)- Classifier Combination for Devanagari Optical Character Recognition System Comparative Performance Analysis of Feature(S)- Classifier Combination for Devanagari Optical Character Recognition System Jasbir Singh Department of Computer Science Punjabi University Patiala, India

More information

Recognition of Gurmukhi Text from Sign Board Images Captured from Mobile Camera

Recognition of Gurmukhi Text from Sign Board Images Captured from Mobile Camera International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 17 (2014), pp. 1839-1845 International Research Publications House http://www. irphouse.com Recognition of

More information

LECTURE 6 TEXT PROCESSING

LECTURE 6 TEXT PROCESSING SCIENTIFIC DATA COMPUTING 1 MTAT.08.042 LECTURE 6 TEXT PROCESSING Prepared by: Amnir Hadachi Institute of Computer Science, University of Tartu amnir.hadachi@ut.ee OUTLINE Aims Character Typology OCR systems

More information

RECOGNIZING TYPESET DOCUMENTS USING WALSH TRANSFORMATION. Attila Fazekas and András Hajdu University of Debrecen 4010, Debrecen PO Box 12, Hungary

RECOGNIZING TYPESET DOCUMENTS USING WALSH TRANSFORMATION. Attila Fazekas and András Hajdu University of Debrecen 4010, Debrecen PO Box 12, Hungary RECOGNIZING TYPESET DOCUMENTS USING WALSH TRANSFORMATION Attila Fazekas and András Hajdu University of Debrecen 4010, Debrecen PO Box 12, Hungary Abstract. In this paper we present an effective character

More information

Skew Detection Technique for Binary Document Images based on Hough Transform

Skew Detection Technique for Binary Document Images based on Hough Transform Skew Detection Technique for Binary Document Images based on Hough Transform Manjunath Aradhya V N*, Hemantha Kumar G, and Shivakumara P Abstract Document image processing has become an increasingly important

More information

DATABASE DEVELOPMENT OF HISTORICAL DOCUMENTS: SKEW DETECTION AND CORRECTION

DATABASE DEVELOPMENT OF HISTORICAL DOCUMENTS: SKEW DETECTION AND CORRECTION DATABASE DEVELOPMENT OF HISTORICAL DOCUMENTS: SKEW DETECTION AND CORRECTION S P Sachin 1, Banumathi K L 2, Vanitha R 3 1 UG, Student of Department of ECE, BIET, Davangere, (India) 2,3 Assistant Professor,

More information

Isolated Curved Gurmukhi Character Recognition Using Projection of Gradient

Isolated Curved Gurmukhi Character Recognition Using Projection of Gradient International Journal of Computational Intelligence Research ISSN 0973-1873 Volume 13, Number 6 (2017), pp. 1387-1396 Research India Publications http://www.ripublication.com Isolated Curved Gurmukhi Character

More information

Optical Character Recognition

Optical Character Recognition Optical Character Recognition Jagruti Chandarana 1, Mayank Kapadia 2 1 Department of Electronics and Communication Engineering, UKA TARSADIA University 2 Assistant Professor, Department of Electronics

More information

Separation of Overlapping Text from Graphics

Separation of Overlapping Text from Graphics Separation of Overlapping Text from Graphics Ruini Cao, Chew Lim Tan School of Computing, National University of Singapore 3 Science Drive 2, Singapore 117543 Email: {caorn, tancl}@comp.nus.edu.sg Abstract

More information

CoE4TN4 Image Processing

CoE4TN4 Image Processing CoE4TN4 Image Processing Chapter 11 Image Representation & Description Image Representation & Description After an image is segmented into regions, the regions are represented and described in a form suitable

More information

Handwritten Devanagari Character Recognition Model Using Neural Network

Handwritten Devanagari Character Recognition Model Using Neural Network Handwritten Devanagari Character Recognition Model Using Neural Network Gaurav Jaiswal M.Sc. (Computer Science) Department of Computer Science Banaras Hindu University, Varanasi. India gauravjais88@gmail.com

More information

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1 Machine vision systems Problem definition Image acquisition Image segmentation Connected component analysis Machine vision systems - 1 Problem definition Design a vision system to see a flat world Page

More information

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University CS443: Digital Imaging and Multimedia Binary Image Analysis Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University Outlines A Simple Machine Vision System Image segmentation by thresholding

More information

Segmentation of Kannada Handwritten Characters and Recognition Using Twelve Directional Feature Extraction Techniques

Segmentation of Kannada Handwritten Characters and Recognition Using Twelve Directional Feature Extraction Techniques Segmentation of Kannada Handwritten Characters and Recognition Using Twelve Directional Feature Extraction Techniques 1 Lohitha B.J, 2 Y.C Kiran 1 M.Tech. Student Dept. of ISE, Dayananda Sagar College

More information

Abstract -Fingerprints are the most widely. Keywords:fingerprint; ridge pattern; biometric;

Abstract -Fingerprints are the most widely. Keywords:fingerprint; ridge pattern; biometric; Analysis Of Finger Print Detection Techniques Prof. Trupti K. Wable *1(Assistant professor of Department of Electronics & Telecommunication, SVIT Nasik, India) trupti.wable@pravara.in*1 Abstract -Fingerprints

More information

CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS

CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS CIRCULAR MOIRÉ PATTERNS IN 3D COMPUTER VISION APPLICATIONS Setiawan Hadi Mathematics Department, Universitas Padjadjaran e-mail : shadi@unpad.ac.id Abstract Geometric patterns generated by superimposing

More information

Time Stamp Detection and Recognition in Video Frames

Time Stamp Detection and Recognition in Video Frames Time Stamp Detection and Recognition in Video Frames Nongluk Covavisaruch and Chetsada Saengpanit Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, Thailand E-mail: nongluk.c@chula.ac.th

More information

Vision. OCR and OCV Application Guide OCR and OCV Application Guide 1/14

Vision. OCR and OCV Application Guide OCR and OCV Application Guide 1/14 Vision OCR and OCV Application Guide 1.00 OCR and OCV Application Guide 1/14 General considerations on OCR Encoded information into text and codes can be automatically extracted through a 2D imager device.

More information

A Generalized Method to Solve Text-Based CAPTCHAs

A Generalized Method to Solve Text-Based CAPTCHAs A Generalized Method to Solve Text-Based CAPTCHAs Jason Ma, Bilal Badaoui, Emile Chamoun December 11, 2009 1 Abstract We present work in progress on the automated solving of text-based CAPTCHAs. Our method

More information

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE K. Kaviya Selvi 1 and R. S. Sabeenian 2 1 Department of Electronics and Communication Engineering, Communication Systems, Sona College

More information

Skew Detection and Correction of Document Image using Hough Transform Method

Skew Detection and Correction of Document Image using Hough Transform Method Skew Detection and Correction of Document Image using Hough Transform Method [1] Neerugatti Varipally Vishwanath, [2] Dr.T. Pearson, [3] K.Chaitanya, [4] MG JaswanthSagar, [5] M.Rupesh [1] Asst.Professor,

More information

SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT

SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT ABSTRACT Rupak Bhattacharyya et al. (Eds) : ACER 2013, pp. 11 24, 2013. CS & IT-CSCP 2013 Fakruddin Ali Ahmed Department of Computer

More information

HOUGH TRANSFORM CS 6350 C V

HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM The problem: Given a set of points in 2-D, find if a sub-set of these points, fall on a LINE. Hough Transform One powerful global method for detecting edges

More information

A Technique for Offline Handwritten Character Recognition

A Technique for Offline Handwritten Character Recognition A Technique for Offline Handwritten Character Recognition 1 Shilpy Bansal, 2 Mamta Garg, 3 Munish Kumar 1 Lecturer, Department of Computer Science Engineering, BMSCET, Muktsar, Punjab 2 Assistant Professor,

More information

NOVATEUR PUBLICATIONS INTERNATIONAL JOURNAL OF INNOVATIONS IN ENGINEERING RESEARCH AND TECHNOLOGY [IJIERT] ISSN: VOLUME 5, ISSUE

NOVATEUR PUBLICATIONS INTERNATIONAL JOURNAL OF INNOVATIONS IN ENGINEERING RESEARCH AND TECHNOLOGY [IJIERT] ISSN: VOLUME 5, ISSUE OPTICAL HANDWRITTEN DEVNAGARI CHARACTER RECOGNITION USING ARTIFICIAL NEURAL NETWORK APPROACH JYOTI A.PATIL Ashokrao Mane Group of Institution, Vathar Tarf Vadgaon, India. DR. SANJAY R. PATIL Ashokrao Mane

More information

Segmentation of Bangla Handwritten Text

Segmentation of Bangla Handwritten Text Thesis Report Segmentation of Bangla Handwritten Text Submitted By: Sabbir Sadik ID:09301027 Md. Numan Sarwar ID: 09201027 CSE Department BRAC University Supervisor: Professor Dr. Mumit Khan Date: 13 th

More information

Text line Segmentation of Curved Document Images

Text line Segmentation of Curved Document Images RESEARCH ARTICLE S OPEN ACCESS Text line Segmentation of Curved Document Images Anusree.M *, Dhanya.M.Dhanalakshmy ** * (Department of Computer Science, Amrita Vishwa Vidhyapeetham, Coimbatore -641 11)

More information

Enhancing the Character Segmentation Accuracy of Bangla OCR using BPNN

Enhancing the Character Segmentation Accuracy of Bangla OCR using BPNN Enhancing the Character Segmentation Accuracy of Bangla OCR using BPNN Shamim Ahmed 1, Mohammod Abul Kashem 2 1 M.S. Student, Department of Computer Science and Engineering, Dhaka University of Engineering

More information

COMPUTER AND ROBOT VISION

COMPUTER AND ROBOT VISION VOLUME COMPUTER AND ROBOT VISION Robert M. Haralick University of Washington Linda G. Shapiro University of Washington A^ ADDISON-WESLEY PUBLISHING COMPANY Reading, Massachusetts Menlo Park, California

More information

Recognition of Off-Line Handwritten Devnagari Characters Using Quadratic Classifier

Recognition of Off-Line Handwritten Devnagari Characters Using Quadratic Classifier Recognition of Off-Line Handwritten Devnagari Characters Using Quadratic Classifier N. Sharma, U. Pal*, F. Kimura**, and S. Pal Computer Vision and Pattern Recognition Unit, Indian Statistical Institute

More information

Slant Correction using Histograms

Slant Correction using Histograms Slant Correction using Histograms Frank de Zeeuw Bachelor s Thesis in Artificial Intelligence Supervised by Axel Brink & Tijn van der Zant July 12, 2006 Abstract Slant is one of the characteristics that

More information

Topic 4 Image Segmentation

Topic 4 Image Segmentation Topic 4 Image Segmentation What is Segmentation? Why? Segmentation important contributing factor to the success of an automated image analysis process What is Image Analysis: Processing images to derive

More information

An Accurate Method for Skew Determination in Document Images

An Accurate Method for Skew Determination in Document Images DICTA00: Digital Image Computing Techniques and Applications, 1 January 00, Melbourne, Australia. An Accurate Method for Skew Determination in Document Images S. Lowther, V. Chandran and S. Sridharan Research

More information

Structural Feature Extraction to recognize some of the Offline Isolated Handwritten Gujarati Characters using Decision Tree Classifier

Structural Feature Extraction to recognize some of the Offline Isolated Handwritten Gujarati Characters using Decision Tree Classifier Structural Feature Extraction to recognize some of the Offline Isolated Handwritten Gujarati Characters using Decision Tree Classifier Hetal R. Thaker Atmiya Institute of Technology & science, Kalawad

More information

Segmentation of Images

Segmentation of Images Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a

More information

Keywords Connected Components, Text-Line Extraction, Trained Dataset.

Keywords Connected Components, Text-Line Extraction, Trained Dataset. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Language Independent

More information

An Improvement Study for Optical Character Recognition by using Inverse SVM in Image Processing Technique

An Improvement Study for Optical Character Recognition by using Inverse SVM in Image Processing Technique An Improvement Study for Optical Character Recognition by using Inverse SVM in Image Processing Technique I Dinesh KumarVerma, II Anjali Khatri I Assistant Professor (ECE) PDM College of Engineering, Bahadurgarh,

More information

EE 584 MACHINE VISION

EE 584 MACHINE VISION EE 584 MACHINE VISION Binary Images Analysis Geometrical & Topological Properties Connectedness Binary Algorithms Morphology Binary Images Binary (two-valued; black/white) images gives better efficiency

More information

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE sbsridevi89@gmail.com 287 ABSTRACT Fingerprint identification is the most prominent method of biometric

More information

Automatic Detection of Change in Address Blocks for Reply Forms Processing

Automatic Detection of Change in Address Blocks for Reply Forms Processing Automatic Detection of Change in Address Blocks for Reply Forms Processing K R Karthick, S Marshall and A J Gray Abstract In this paper, an automatic method to detect the presence of on-line erasures/scribbles/corrections/over-writing

More information

Line and Word Segmentation Approach for Printed Documents

Line and Word Segmentation Approach for Printed Documents Line and Word Segmentation Approach for Printed Documents Nallapareddy Priyanka Computer Vision and Pattern Recognition Unit Indian Statistical Institute, 203 B.T. Road, Kolkata-700108, India Srikanta

More information

Complementary Features Combined in a MLP-based System to Recognize Handwritten Devnagari Character

Complementary Features Combined in a MLP-based System to Recognize Handwritten Devnagari Character Journal of Information Hiding and Multimedia Signal Processing 2011 ISSN 2073-4212 Ubiquitous International Volume 2, Number 1, January 2011 Complementary Features Combined in a MLP-based System to Recognize

More information

A Statistical approach to line segmentation in handwritten documents

A Statistical approach to line segmentation in handwritten documents A Statistical approach to line segmentation in handwritten documents Manivannan Arivazhagan, Harish Srinivasan and Sargur Srihari Center of Excellence for Document Analysis and Recognition (CEDAR) University

More information