Chapter 2. Components


OCR: General Architecture and Components

In some areas that require the automation of human intelligence, such as chess playing, tremendous improvements have been achieved over the last few decades. On the other hand, humans still outperform even the most powerful computers at relatively routine functions such as vision. Automation of character recognition is one such area; it has been the subject of intense research for the last three decades, yet many problems remain open. The literature survey shows that extensive work has been carried out on English-language OCR. Researchers from different countries have also worked on their own languages, giving rise to technologies suited to those languages, but the results are still not satisfactory. We studied papers covering international languages such as English, Chinese, Korean, Japanese, Mongolian, Thai and Arabic. A few papers were available on Indian languages such as Bangla, Devanagari, Tamil, Telugu, Malayalam and Kannada. Since the amount of HCR work on Indian languages was found to be limited, we broadened the survey to include offline/online and printed/handwritten recognition across languages, with a view to identifying useful directions and guidelines from these approaches. The different architectural stages and the technologies used in each stage are analyzed. Though the specifics of architecture and model vary with the approach used, in general an OCR system has four stages, as shown in figure 2.1.

[Figure 2.1 General Architecture of OCR: input raw data -> preprocessing -> feature extraction -> classification -> post-processing -> recognition result]

The extensive survey investigated the kinds of methods used by different researchers in each stage of an OCR system. As the processing methods of the different stages depend on data acquisition, we also analyzed the ways of acquiring the images and the different parameters that influence the input data.

2.1 Input Data

Since the invention of the printing press in the fifteenth century by Johannes Gutenberg, most archived written language has been in the form of printed-paper documents. In such documents, text is presented as a visual image, mostly in black on a high-contrast background that is generally white. Written language is also encountered in the form of handwriting inscribed on paper or registered on an electronically sensitive surface. Some of the major aspects of collecting and storing the input image are discussed below.

The image: The image acquired for recognition by an OCR system can be a page of text, a word or a character. A page of text is available only in the offline case, and it may be printed text, handwritten text or a mixture of both. In online cases, the image is of a cursive-style or mixed-style [refer ] handwritten word or character.

Online acquisition: In the case of online writing, a special pen is used to write on an electronic surface such as a Liquid Crystal Display (LCD). The digitizers are usually electromagnetic-electrostatic tablets which send the coordinates of the pen tip to the host computer at regular intervals. Hence the noise generated along with the data is low and can be controlled by the writer. This gives the image of the character or word along with information on the movement of the pen in the horizontal (x) and vertical (y) directions, captured as two separate 1-D digital signals while writing.
There are different devices with different sensors to collect the online information while writing. Devices like PDAs and Tablet PCs are transparent position-sensing devices, and they also give the pen-tip

position coordinates along with a visual display. Some devices like the super pen can also measure the pressure exerted while writing.

Offline acquisition: In the case of offline writing, the printed or handwritten data is converted to digital form either by scanning the writing on paper using a scanner or by capturing a still image using a camera, producing offline text images. This process yields a digital image in a specified image file format. Cameras are used in cases where fragile documents need to be preserved: such paper documents cannot be forced flat, and the light source for digital cameras is usually uneven. These factors require camera-captured input images to be handled differently from scanned images.

Resolution: The resolution of the captured character image plays an important role in deciding the quality of the input image. Since the character has two states, with a black trace of the character contour on white paper, capturing the image as a binary image can reduce the tonal resolution and hence the storage space. In some cases, this can result in character images with broken edges, background noise, etc., due to improper ink flow, non-uniform paper quality, and so on. A reduction in spatial resolution results in step changes in the contour. The contour of the character image will be smooth and continuous only when both the tonal (gray-level) and spatial (dots/inch) resolutions are good. As the resolution increases, the storage space and the processing time increase. To capture a text image with the same quality of information as on paper, both resolutions should be set based on the smallest character size and on fine details such as small strokes and circular shapes in a character. Hence it is necessary to acquire the image with good resolution and with proper brightness and contrast adjustments, so that the OCR system can convert it into a binary image and make the character shapes within the text image suitable for recognition.
Even though color scanners and tablets enable data acquisition at high resolution, there is always a trade-off between the acquired image quality and the complexity of the algorithms, which limits the recognition rate.

Storage requirements: The raw-data storage requirements differ widely depending on the acquisition device and the resolution settings. For example, an average cursively written word requires about 230 bytes in the online case (sampled at 100 samples/sec) and about 80 Kbytes in the offline case (scanned at 300 dots per inch (dpi)).

A typical 8.5 x 11 inch page scanned at a resolution of 300 dpi with 256 gray levels needs 8.4 megabytes of space.

File format: Standard image file formats like BMP, GIF, PNG, JPG and TIFF are used to store the image. Choosing the right file format is of vital importance: each format suits a specific type of image, and matching the image to the correct format yields a small file size, good quality and a fast-loading graphic. The user can view the image on a computer screen using standards like VGA (640x480 pixels with 16 colors or gray shades) and SVGA (800x600 or 1024x768 pixels with 256 colors or gray shades).

2.2 Preprocessing

Most recognition methods use feature extraction to assign an image to a prototype class. The recognition accuracy strongly depends on the selected features, and feature extraction in turn depends on how well the text is represented in the image with respect to the background. This text can be a page, a word or a character. In online cases, the text is usually a character or a word; in offline cases, it is usually a page of a document. As the acquired images contain various types of noise, for the reasons stated in sections and 1.4.2, they need to be preprocessed to remove all unwanted information and leave only the text prominent. The preprocessing stage is thus concerned with processing the acquired image to make it suitable for feature extraction, so that only the shape of the character or word contributes to the features. The intention is to make the strokes forming the character of uniform thickness, with no unintended breaks, and to ensure that no other marks are present in the image. This is generally a hard problem; for example, distinguishing a dot that is part of the character from one caused by noise. Hence preprocessing is in itself a research area, and many researchers concentrate on this domain of character recognition.
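The storage figures quoted above follow from simple arithmetic; the sketch below reproduces them (the function name `page_bytes` is illustrative, not from the source):

```python
# Back-of-envelope storage estimate for an uncompressed page scan.
def page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size of a scanned page in bytes."""
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel // 8

# 8.5 x 11 inch page at 300 dpi, 256 gray levels (8 bits/pixel)
size = page_bytes(8.5, 11, 300, 8)
print(size / 1e6)  # about 8.4 megabytes
```

A binary scan of the same page (1 bit/pixel) would need one eighth of this, which is why binarization is attractive despite its risks.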
The preprocessing techniques broadly fall under the domain of image processing. The main objectives of preprocessing are noise reduction, normalization of the data and compression of information. Some of the techniques used to achieve these objectives are discussed briefly below, explaining their purpose and the commonly used methods.

2.2.1 Noise reduction

Text-image distortions such as disconnected edges, bumps and gaps in edges, filled loops, local intensity variations and non-uniform edge thickness need to be eliminated. Many techniques have been proposed in the literature to reduce the effect of these noises. The standard techniques are available in image-processing toolboxes; depending on their requirements, some researchers have developed their own specialized noise-reduction algorithms.

Filtering

Filters are used to remove or diminish the effect of noise and to enhance the edges for better discrimination from the background. The basic idea is to convolve a pre-defined mask with the image. The mask size decides the neighborhood distance considered in the computation. For example, a 3x3 mask considers all neighborhood pixels at a distance of one: the immediate 8 neighbors of the center pixel are used in the decision. The convolution computes a new value for the center pixel as a function of the gray values of its m x m neighborhood pixels x(i-k, j-l) and the mask weights w_kl, as given by (2.1):

    y(i, j) = Σ_k Σ_l w_kl x(i-k, j-l)        (2.1)

The mask size has an effect on the noise: as the mask becomes bigger, more neighborhood pixels participate in the decision and may introduce unexpected distortions. Various spatial- and frequency-domain filters can be designed for smoothing, sharpening, contrast adjustment, salt-and-pepper noise elimination, thresholding, etc. Some of these are explained below; their effect is shown in figure 2.2.

[Figure 2.2 Effect of filters: (a) original image, (b) smoothing, (c) sharpening, (d) contrast adjustment, (e) salt-and-pepper noise, (f) salt-and-pepper noise removed, (g) Gaussian noise, (h) Gaussian low-pass filter, (i) Gaussian smoothing, (j) contrast adjustment]
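The neighborhood weighting of equation (2.1) can be sketched in Python; this is a minimal illustration using SciPy's `convolve` with toy values, not code from the source:

```python
import numpy as np
from scipy.ndimage import convolve

# A tiny "image": a bright ring with a dark hole at the centre.
img = np.array([[0, 0, 0, 0, 0],
                [0, 9, 9, 9, 0],
                [0, 9, 0, 9, 0],
                [0, 9, 9, 9, 0],
                [0, 0, 0, 0, 0]], dtype=float)

# 3x3 averaging mask: every weight w_kl = 1/9, so each output pixel
# becomes the mean of its immediate 8 neighbours and itself.
mask = np.ones((3, 3)) / 9.0

smoothed = convolve(img, mask, mode='constant', cval=0.0)
print(smoothed[2, 2])  # centre hole pulled towards its bright neighbours: 8.0
```

A larger mask would average over a wider neighborhood, smoothing more strongly but also blurring fine strokes, which is exactly the trade-off the text describes.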

Smoothing: Ragged edges due to Gaussian noise and writer, paper and pen imperfections are smoothed using these filters. Common smoothing filters are the averaging filter, the Wiener filter, and low-pass filters such as the Gaussian and Butterworth filters. As smoothing cuts high-frequency components, it will blur the image if not used carefully.

Sharpening: To enhance the sharpness of the edges, filters such as the high-boost filter, Gaussian filter and high-pass filters are used. These filters must be used carefully to avoid the noise becoming dominant.

Removal of salt-and-pepper noise: This noise, also called speckle noise, consists of distinct white (salt) and black (pepper) spots in the image. An order-statistic filter, the median filter, is used to remove it. The size of the mask should be chosen carefully: if the noisy pixels number more than half of the pixels used in the median computation, the noise cannot be eliminated. Hence the mask size is chosen depending on the density of the noise. However, if the edge density is low, this filter may produce broken edges.

Contrast adjustment: Low-contrast character images are enhanced by making dark portions darker and bright portions brighter. The position and slope of the input-output gray-level mapping curve, around which both the dark and bright portions are stretched, should be chosen carefully. This deepens the valley between the dark and bright regions and hence simplifies thresholding, but there is a possibility of the background becoming foreground and vice versa.

Thresholding: To increase processing speed, it is often desirable to represent gray-scale or color images as binary images by picking a threshold value such that everything above that value is set to 1 and everything below it is set to 0.
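The median filter's behaviour on salt-and-pepper noise can be shown in a few lines; this is a toy illustration using SciPy's `median_filter`, not code from the source:

```python
import numpy as np
from scipy.ndimage import median_filter

img = np.full((5, 5), 200, dtype=np.uint8)
img[2, 2] = 0    # a "pepper" pixel
img[0, 4] = 255  # a "salt" pixel

# 3x3 median: an isolated speck is outvoted by its 8 clean neighbours.
cleaned = median_filter(img, size=3)
print(cleaned[2, 2])  # 200: the dark speck is removed
```

If more than half the pixels in the window were noisy, the median itself would be a noise value, which is the failure case described above.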
According to a survey [162], these algorithms have evolved from global thresholding to local adaptive thresholding to allow for variations in the image background; today they range from relatively simple algorithms to some that are rather complex. Global thresholding picks one threshold value for the entire document image, often based on an estimate of the background level from the intensity histogram of the image. Adaptive (local) thresholding is used for images in which different regions may require different threshold values.
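Otsu's criterion, discussed below, can be implemented directly from the gray-level histogram. The sketch here is a straightforward (unoptimized) rendering of the minimum-within-class-variance idea, with a toy image; it is not the implementation used by the works cited:

```python
import numpy as np

def otsu_threshold(img):
    """Global Otsu threshold: minimise the weighted sum of the
    within-class variances over the gray-level histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, np.inf
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        var0 = ((levels[:t] - mu0) ** 2 * p[:t]).sum() / w0
        var1 = ((levels[t:] - mu1) ** 2 * p[t:]).sum() / w1
        within = w0 * var0 + w1 * var1      # weighted within-class variance
        if within < best_var:
            best_t, best_var = t, within
    return best_t

# Dark "text" (near 30) on a bright background (near 220):
img = np.array([[30, 35, 220], [25, 210, 225], [215, 220, 230]], dtype=np.uint8)
t = otsu_threshold(img)
binary = (img >= t).astype(np.uint8)  # 1 = background, 0 = text
```

The threshold lands in the valley between the two histogram modes, separating ink from background with one global value.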

In [164], a comparison of many thresholding techniques is given, using word-recognition power as the evaluation criterion to check the accuracy of a character recognition system. Of those tested, Niblack's locally adaptive method produces the best result. In this method, the threshold for each pixel is determined by examining the mean m(x,y) and the standard deviation s(x,y) of the pixels in its neighborhood. In [27], a comparative performance evaluation of thresholding algorithms applied to OCR is done, using the Hausdorff, Jaccard and Yule distance measures to quantify the similarity between the thresholded and original images. When all the measured parameters of each method are summed, Lloyd, Otsu, local average thresholding based on Otsu, the Local Contrast Technique (LCT) and the Nonlinear Dynamic Method (NDA) score highest. Otsu's thresholding is used by many researchers [18][16][84]. Otsu binarization is a global thresholding method in which the threshold is fixed by minimizing the weighted sum of the within-class variances of the foreground and background pixels, computed from the gray-level histogram [163].

Thinned edge extraction: The contour thickness of a thresholded image varies from image to image, and may also vary within an image due to factors such as the pressure exerted while writing, the pen-tip width and the ink flow. As the contour thickness influences the feature-extraction computations, the character shape can be conveyed by a single-pixel-width trace of the contour. To extract such a contour, techniques like the Laplacian of Gaussian (LoG) and the Canny edge detector are used. These methods respond at the transitions from white to black and from black to white; hence these operators produce double edge responses, as shown in figure 2.3, and preserve the edge-thickness information.

[Figure 2.3 Edge extraction: effect of the LoG and Canny edge detection techniques]
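Niblack's local rule, T(x,y) = m(x,y) + k * s(x,y), can be sketched with windowed means computed by uniform filtering. This is an illustrative implementation under assumed defaults (window 15, k = -0.2), not the exact variant evaluated in [164]:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack(img, window=15, k=-0.2):
    """Niblack's locally adaptive threshold: T(x,y) = m(x,y) + k*s(x,y),
    where m and s are the mean and standard deviation of the window
    centred at (x,y). Returns True for background, False for ink."""
    img = img.astype(float)
    mean = uniform_filter(img, window)
    sq_mean = uniform_filter(img * img, window)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0))
    return img >= (mean + k * std)
```

Because the threshold adapts to each neighborhood, a dark stroke is detected even where the page illumination varies, which a single global threshold cannot guarantee.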

Morphological operations

Morphology is mathematically a set theory used for the analysis and processing of geometrical shapes in an image. Morphological operations can be applied to both binary and gray-level images; as character images are binary, we consider here only binary morphological operations. These operations pass a structuring element over an input image, creating an output image of the same size. The structuring element is used to construct a morphological operation that is sensitive to specific shapes in the input image. The structuring element is itself a binary image, with an origin indicating the position of the pixel to be modified and a size indicating the neighborhood shape to be considered. Some examples of structuring elements are shown in figure 2.4; the center of the structuring element is marked with O, indicating the pixel position to be modified based on the filter effect. There may be more than one structuring element, and the output pixel under the origin of the structuring element may be decided immediately, or may be marked first and decided later based on further neighborhood observations.

[Figure 2.4 Structuring elements with O as origin: 2x2 and 3x3 neighborhoods]

The two basic morphological operations, given in equation (2.2), are dilation and erosion:

    Dilation:  D(I, A) = I ⊕ A = { x | (Â)x ∩ I ≠ ∅ }
    Erosion:   E(I, A) = I ⊖ A = { x | (Â)x ⊆ I }        (2.2)

where I and A are the sets of pixels in the input image and the structuring element respectively, and (Â)x means the reflection of A about its origin followed by a shift by x positions. One can choose a structuring element that incorporates these effects and hence avoid performing them explicitly. The results depend on the size, shape and origin chosen

for the structuring element. In dilation, when any pixel in the structuring element matches a black pixel (in our case) in the input image, the output pixel under the origin is set to black. This tends to close holes in the image by expanding the black regions; it also makes objects larger by turning every background pixel that touches the object into an object pixel. In erosion, only when every pixel in the structuring element matches a black pixel in the input image is the output pixel under the origin set to black. This tends to make objects smaller by turning object pixels that touch the background into background pixels. Various morphological operations can be defined using these basic operations, and new structuring elements can be designed to smooth the contour, join broken edges, prune wild points and edges, open joints, find the skeleton of the character, thin the character, extract boundaries, clean noise, etc. Some of these are discussed below.

Opening: Opening smooths the contours, breaks narrow bridges and eliminates thin protrusions; thus opening separates objects that are just touching one another. Opening of an image I is performed with two basic operations, erosion (E) followed by dilation (D), using the same structuring element A, as shown in equation (2.3). This technique is useful for eliminating small islands and thin filaments of object pixels.

    OPEN(I, A) = D(E(I, A), A)        (2.3)

Closing: Closing fuses narrow breaks and eliminates small holes; thus closing fills narrow gaps and joins contours. Closing of an image I is performed with dilation (D) followed by erosion (E), using the same structuring element A, as shown in equation (2.4). This technique is useful for eliminating small islands and thin filaments of background pixels.

    CLOSE(I, A) = E(D(I, A), A)        (2.4)

Boundary extraction: This is used to extract the boundary of the binary image.
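Opening and closing, as in equations (2.3) and (2.4), can be composed from SciPy's erosion and dilation primitives. A toy illustration (the image values are invented for demonstration):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

# 3x3 structuring element.
A = np.ones((3, 3), dtype=bool)

img = np.zeros((9, 9), dtype=bool)
img[1:8, 1:8] = True   # a 7x7 object...
img[4, 4] = False      # ...with a one-pixel hole
img[0, 0] = True       # and an isolated noise pixel

# Opening: erosion then dilation (equation 2.3).
opened = binary_dilation(binary_erosion(img, A), A)
# Closing: dilation then erosion (equation 2.4).
closed = binary_erosion(binary_dilation(img, A), A)

print(opened[0, 0])  # False: the isolated island is removed
print(closed[4, 4])  # True:  the small hole is filled
```

Note the asymmetry: opening removes the noise island but leaves the hole, while closing fills the hole; in practice the two are often combined.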
Hence it is also called the morphological gradient operator. The object boundary is extracted by first eroding the image I by the structuring element A and then taking the set difference between I and its erosion, as given by equation (2.5). The size of the structuring element A decides the boundary thickness; for example, a 3x3 element gives a one-pixel-wide boundary.
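This set difference between the image and its erosion, as defined in equation (2.5), is a one-liner with SciPy; a toy illustration:

```python
import numpy as np
from scipy.ndimage import binary_erosion

A = np.ones((3, 3), dtype=bool)  # 3x3 element -> one-pixel-wide boundary

img = np.zeros((7, 7), dtype=bool)
img[1:6, 1:6] = True  # a solid 5x5 object

# Boundary = I minus the erosion of I.
boundary = img & ~binary_erosion(img, A)
print(boundary.sum())  # 16: the one-pixel outline of the 5x5 square
```

A larger structuring element erodes more, so the difference (and hence the boundary) becomes correspondingly thicker.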

    BOUNDARY(I, A) = I - (I ⊖ A)        (2.5)

Thinning / skeletonization: In OCR problems, most of the information can be extracted from the shape of the strokes. As the stroke thickness is influenced by a number of factors, such as the pen tip, ink flow, paper quality and pressure exerted, stroke-shape extraction becomes difficult. There is therefore a need to remove redundant pixels, and a skeleton of a given character or stroke is in most cases sufficient for recognition. Thinning and skeletonization methods generate a minimally connected line that is equidistant from the boundaries. Skeletonization is a reconstructable thinning algorithm that preserves all details of the internal structure of the objects within the resulting skeleton; hence the original object can be reconstructed from its skeleton. Thinning is similar, as a thinned image is a skeleton of the object, but with some loss of information, and hence cannot be used for object reconstruction. The behavior of these operations depends on the structuring element, as it determines the situations in which an object pixel is set to background. A group of structuring elements is used to produce the skeleton, and the process is repeated until no change in the output image is observed. This operation is shown by equation (2.6):

    I ⊗ K{A} = ((...((I ⊗ A1) ⊗ A2)...) ⊗ A8)        (2.6)

Here I ⊗ K{A} indicates K successive passes over I, each pass a two-level decision using a group of structuring elements (usually 8, for 8-direction processing) {A} = {A1, A2, ..., A8}, repeated until convergence.

There are two basic approaches to thinning: pixel-wise and non-pixel-wise. In pixel-wise thinning, the image is processed locally and iteratively until a one-pixel-wide skeleton is obtained. These methods include erosion and iterative contour peeling; they are very sensitive to noise and may deform the shape of the character. Non-pixel-wise methods use some global information about the character during thinning.
Clustering-based thinning methods use the cluster centers as the skeleton. In [28], a skeleton-growing algorithm uses the gray-level image for skeletonization. It controls the development of the skeleton using iterative skeletonization and deletion of boundary pixels, nested within iterative binarization of the gray-level image. The results were compared with three other algorithms that

worked on binary images. The effects of the three other algorithms are given by images 1, 2 and 3 in figure 2.5: they failed to separate lines that are touching or very close to each other. The skeleton-growing algorithm could handle these problems, as shown in the 4th image of figure 2.5.

[Figure 2.5 Effect of different thinning methods]

In [13], pre-thinning logic is first applied to reduce the effect of binarization noise. The image is then thinned, and post-thinning logic is applied to remove thinning distortions such as hairs and split fork points. Post-thinning is also called pruning.

Pruning: The extra tail pixels generated by thinning or skeletonization are removed by pruning using equation (2.7):

    I_pruned = I_thinned ⊗ {A_i}        (2.7)

That is, the thinned image is processed with the structuring elements in {A_i} until no change occurs. One of the structuring elements in {A_i} is shown in figure 2.6: the 1 in the center is an object pixel, 0 is a background pixel, and X can be 0 or 1. The other structuring elements can be obtained by 45° rotations of this element. The pixel removal performed by pruning also cleans the image.

[Figure 2.6 A 3x3 pruning structuring element (center 1 = object pixel, 0 = background, X = don't care)]
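The pixel-wise "process locally and iteratively until one pixel wide" idea can be made concrete with the classical Zhang-Suen thinning scheme; this is one well-known instance of that family, offered as a sketch rather than any of the algorithms compared above:

```python
import numpy as np

def zhang_suen_thin(img):
    """Pixel-wise iterative thinning (Zhang-Suen): delete boundary
    pixels in two alternating sub-passes until nothing changes."""
    img = img.astype(np.uint8).copy()
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for i in range(1, img.shape[0] - 1):
                for j in range(1, img.shape[1] - 1):
                    if img[i, j] == 0:
                        continue
                    # clockwise 8-neighbourhood P2..P9, starting above
                    p = [img[i-1, j], img[i-1, j+1], img[i, j+1],
                         img[i+1, j+1], img[i+1, j], img[i+1, j-1],
                         img[i, j-1], img[i-1, j-1]]
                    b = sum(p)  # number of object neighbours
                    # number of 0 -> 1 transitions around the circle
                    a = sum(p[k] == 0 and p[(k+1) % 8] == 1 for k in range(8))
                    if step == 0:
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((i, j))
            for i, j in to_delete:
                img[i, j] = 0
                changed = True
    return img.astype(bool)
```

Applied to a thick stroke, the passes peel the contour from alternating sides, leaving an approximately centred one-pixel line; the transition count `a == 1` is what prevents the stroke from being broken.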

Cleaning: This is used to remove spurious points left in the image after operations such as thinning. Cleaning is done using erosion with a structuring element (similar to the pruning element with the X values set to 0) whose center matches the noise and whose neighborhood matches the object, and vice versa.

Compression

Here compression refers to representing the character shape in the image with minimal information. Thresholding, thin-edge extraction and morphological thinning (skeletonization) algorithms are used for compression. A minimal representation is essential in template-matching techniques, where large numbers of images need to be preserved for comparison. Feature extraction also becomes fast, owing to logical computations with only the two values 0 and 1 and to the minimal data involved in feature computation.

Normalization

Normalization methods aim to remove commonly observed variations introduced during writing which may otherwise influence the feature-extraction process. The document, the individual text lines, individual words and even characters within a word may be skewed, and some strokes of a character may be out of proportion compared to others. The shape of each character in a text should be made as close as possible to the natural standard shape of the character, so that the effect of these variations on the features is minimized. This is achieved by three basic types of normalization: skew normalization, slant normalization and size normalization. Contrast normalization of the text-page image is also needed, as variations in the page background may influence the normalization process. These normalization methods are briefly described below.

Skew normalization

There are two types of skew. Global skew is the misalignment of the complete page with respect to the horizontal direction, caused while scanning, writing, etc.
Local skew is the misalignment of individual characters in a given word, or with respect to the base line of a text, as shown in figure 2.7.

Every text line has a base line, and usually all the base lines should be parallel; if they are not, each base line needs to be normalized for orientation. Some characters can be distinguished only by their position relative to the base line (e.g. 9 and g in handwritten form), so identifying the baseline is important for recognizing a character. A wide variety of skew-detection algorithms have been proposed in the literature; the commonly used methods are the Hough transform, analysis of projection profiles, run-length analysis, etc.

[Figure 2.7 Skew and slope for normalization]

In [121], a skew-detection method is proposed for Bangla script. The script has a shirorekha (head-line), and the approach is based on detecting the shirorekha of words. Individual words in a text line are detected by connected-component labeling. As the upper envelopes of selected components contain the shirorekha information, they are found by column-wise scanning from the top of each component. Portions of the upper envelope satisfying the properties of a digital straight line are detected and then clustered into groups belonging to single text lines. Estimates from these individual clusters give the skew angle of each text (base) line. The proposed multi-skew detection technique has an accuracy of about 98.3%.

Slant normalization

Slant normalization is also called tilt correction. Some people write upright and some with a slant; the angle between the longest stroke in a word and the vertical direction is referred to as the slant. Slant normalization transforms all characters to a standard form with no slant. [19] uses alif, a very commonly used Arabic character that is almost vertical when there is no tilt, for tilt identification: the method scans for occurrences of this letter and estimates the tilt from the letter's slope.
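Of the skew-detection methods listed, projection-profile analysis is the simplest to sketch: the rotation that makes text lines horizontal maximizes the variance of the row-wise ink counts. A minimal illustration (angle range and step are assumptions, not values from the source):

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, angles=np.arange(-10, 10.5, 0.5)):
    """Projection-profile skew estimation: return the trial rotation
    (in degrees) that maximises the variance of row ink counts,
    i.e. the rotation that deskews the page."""
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        rot = rotate(binary.astype(float), a, reshape=False, order=0)
        profile = rot.sum(axis=1)      # ink per row
        score = profile.var()          # sharp peaks when lines are horizontal
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle
```

For a page skewed by +5°, the function returns approximately -5, the correcting rotation; finer angle steps trade computation for precision.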

In [122], a method to correct the slant of Arabic text is proposed. To calculate the average slant angle of a word, all contours in the word are traversed and the slant angle of each near-vertical line is computed. For a contour pixel on row n, the absolute differences between the x-coordinates of adjacent points on rows n-2, n-1, n, n+1 and n+2 of the same contour are added; this sum gives the slantness of the contour, and if it is below a threshold the points are assumed to be part of a near-vertical line. The weighted average of the individual slant angles of all contours in a word gives the global average slant angle of the word, used for individual word slant correction.

Size normalization

Size normalization adjusts the size of the character in the image to a certain standard. The original sizes of handwritten characters vary greatly and may influence feature extraction. Some characters can be distinguished only by the aspect ratio of their shape (e.g. O and 0). Hairline strokes and small openings of characters are much less likely to be detected in text set in a small font size of 6 or 8 points (1 point = 1/72 inch) than in normal 10 to 12 point font sizes. Such images have to be scaled by size normalization before further processing. In most cases, the size of the character is normalized while retaining the shape and aspect ratio of the character; linear interpolation/decimation scaling methods are very commonly used [56]. Strokes with disproportionate lengths may distort the aspect ratio; other normalization techniques handle these kinds of problems, but they are computationally expensive. In [21], a fuzzy normalization method is applied to Chinese handwritten characters to deal with irregular stroke lengths by compressing the redundant portions and preserving the important features.
For special sample images with irregular stroke lengths, the recognition results improved from 80% using normal scaling to 85% using fuzzy normalization. In [22], resampling of handwritten-digit gray images is based on multi-rate filter theory, implemented by a cascade of interpolation and decimation filters using Hamming and Gaussian window functions. The recognition accuracy of the multi-rate method (96.8%) is compared with ratio-based normalization (96.7%) and simple scaling (96.5%), an improvement of 0.3% over normal scaling.
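Aspect-ratio-preserving size normalization, the common case described above, can be sketched with bilinear interpolation: scale so the longer side matches the target, then pad to a square frame. The function name and the 32-pixel target are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import zoom

def normalize_size(char_img, target=32):
    """Scale a character image so its longer side equals `target`,
    preserving the aspect ratio, then centre it in a target x target frame."""
    h, w = char_img.shape
    scale = target / max(h, w)
    scaled = zoom(char_img.astype(float), scale, order=1)  # bilinear
    out = np.zeros((target, target))
    oh, ow = scaled.shape
    oh, ow = min(oh, target), min(ow, target)
    y0, x0 = (target - oh) // 2, (target - ow) // 2
    out[y0:y0 + oh, x0:x0 + ow] = scaled[:oh, :ow]
    return out

norm = normalize_size(np.ones((60, 20)))  # a tall, narrow character
print(norm.shape)  # (32, 32)
```

A 60x20 character ends up occupying roughly a 32x11 region of the frame, so a thin character remains thin; scaling both axes independently to 32x32 would instead destroy the O-versus-0 distinction mentioned above.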

In [56], to make the input images the same size, cubic-spline interpolation in polynomial form and in poly-phase network form, and linear decimation by block averaging, are used. Interpolation in polynomial form shows better recognition rates than the poly-phase network form, while the latter provides a low-computational-complexity solution for real-time applications.

In [123], a Normalization-Cooperated Gradient Feature (NCGF) extraction method is proposed to alleviate the distortion of the original stroke direction caused by prior size normalization. Five 1-D coordinate normalization methods are used: Linear Normalization (LN), Non-Linear Normalization (NLN), Moment Normalization (MN), Bi-Moment Normalization (BMN) and Modified Centroid-Boundary Alignment (MCBA). They are extended to Pseudo-2-Dimensional (P2D) versions called Line Density Projection Interpolation (LDPI), P2DMN, P2DBMN and P2DCBA. The experimental results show that NCGF with the P2D methods has a lower error rate than the NCCF and Normalization-Based Gradient Function (NBGF) methods. The effect of each of these methods is shown in figure 2.8.

[Figure 2.8 Effect of different normalization methods on Chinese characters]

In [124], a fuzzy normalization method is proposed to normalize the irregular stroke lengths of Chinese characters, and the recognition results of fuzzy- and scale-normalized images are compared. When invariant features such as the number of strokes and ring data are tested with a maximum-distance clustering method on special samples, the fuzzy-normalized images give a recognition rate of 85%, compared to 80% for scaling normalization.

Contrast normalization

Contrast normalization deals with the correction of non-uniform contrast and brightness of the background surrounding the image. Eliminating the background of a page of text and making it uniformly

white improves the contrast of the text image. To normalize the local contrast in a document, researchers have proposed different methods. In [20], a least-mean-square (error) estimation method is proposed that also uses a generalized fuzzy operator to enhance the object of interest.

All these techniques are applied in many areas of image processing. While enhancing the image in a specific way, most of the techniques may introduce unexpected distortions and should therefore be applied with care. The outcome of the preprocessing stage should be a clean, normalized image with maximal shape information, maximal compression and minimal noise. In general, given the compromise between introducing further distortion and correcting existing problems, forming an effective preprocessing pipeline is a challenging open problem, often highly dependent on the nature of the images, the types of problems usually observed, and their severity for the intended task.

Preprocessing of a page of text image

The acquired image in document image processing is usually a page of text. To extract the text, one needs to perform line, word and character extraction [16][17], skew detection and correction at page level [121], etc. The background may be noisy due to old paper, rough surfaces, thin paper, colored paper, paper folds, back-page ink visibility, etc., and it may also be embedded with patterns or pictures. Hence background elimination or contrast normalization is needed to obtain a uniform-intensity background. Some researchers use smoothing and median filters to reduce the background noise [25], and some analyze the intensity variations in the background and foreground to generate a uniform-intensity background [23][24]. The detection of interfering marks such as blots, underscores and creases is complex, as they occur at random positions in the image and may even affect the text quality at those positions.
Another equally time-consuming task is the localized skew correction that is necessary for hand-set pages, and correction of the baseline curl in pages copied from bound volumes. Since we restrict our attention to a single character (including Kagunita) image at a time, we do not discuss these issues further in this thesis.

2.2.4 Preprocessing of character images

The characters extracted from a page of text need to be processed further. Compared to printed characters, handwritten characters need extensive preprocessing due to writing variations and noise from the pen, ink, paper quality and writing environment. The methods discussed here deal with handwritten character preprocessing; they can also be applied to printed characters depending on the kinds of problems (e.g., distortion, uneven print) observed. As we are dealing with character recognition, a survey of character preprocessing is of utmost importance, and we investigated the preprocessing applied to character images by the researchers. The broad categories of preprocessing operations and their roles have been discussed in the earlier sections. The character image extracted from a page of text needs to be normalized to make the character shape close to a standard shape, bounded within a fixed-size image with proportionate stroke lengths [20][21][22]. Some of the normalization operations are deslanting (removing slant) [122], size normalization (to compensate for scaling) [56][123] and stroke length normalization (to compensate for velocity and disproportionate stroke lengths) [124]. While writing, due to rapid or erratic pen movements, a hook-like pen movement may occur at the end or beginning of a stroke; dehooking [141] removes such hooks. Blocking algorithms perform bounding box extraction to detect exactly where the textual information lies within the scanned image, producing boundary-touching character images by removing the surrounding background pixels [15]. Sharpening filters help to sharpen the character shape in the image; in the case of blurred, faint or fine edges with intensities close to the background, a sharpening filter helps preserve the edges.
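The bounding-box extraction performed by blocking algorithms can be sketched in a few lines of NumPy; `bounding_box` is a hypothetical helper name, not taken from any cited work:

```python
import numpy as np

def bounding_box(binary):
    """Crop a binary character image (foreground = 1) to the tight box
    around its foreground pixels, removing the surrounding background."""
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return binary  # empty image: nothing to crop
    return binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

The result is the boundary-touching character image the text describes: every edge of the crop contains at least one foreground pixel.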
Smoothing removes jitter noise from the image and reduces abrupt directional changes between adjacent pixels [15]. Gray-tone scanning enables adaptive binarization and gray-scale feature extraction. Adaptive local binarization helps cope with uneven contrast, but fine or faint connecting strokes are more easily detected by full gray-scale processing. Binarization converts a gray image into a black-and-white image; this process may cause fragmentation, which can be remedied by filling in the broken joints. Thinning reduces the patterns to a single-pixel-wide thin-line representation of the image called the skeleton. Any unwanted edges remaining after skeletonization need to be pruned, and spurious dots need to be cleaned [15]. This also reduces the amount of data, and shape analysis for feature extraction is easier with thin-line patterns.
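As a concrete instance of global binarization, a minimal NumPy implementation of Otsu's classic threshold (one standard choice, not necessarily the method used in the cited works) might look like:

```python
import numpy as np

def otsu_threshold(gray):
    """Global Otsu binarization: pick the threshold that maximizes the
    between-class variance of the gray-level histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class empty: undefined split
        mu0 = (levels[:t] * p[:t]).sum() / w0
        mu1 = (levels[t:] * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

def binarize(gray):
    """Dark text on a light page: foreground = 1 below the threshold."""
    return (gray < otsu_threshold(gray)).astype(np.uint8)
```

An adaptive local variant would apply the same criterion per window, which is what helps with the uneven contrast mentioned above.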

The above preprocessing methods are chosen by researchers depending on the quality of the input image and the common problems observed in their usage scenario. A general pipeline flexible enough to handle any deformation observed in the input image is still a research area. We discuss our approach and studies in a later chapter.

2.3 Segmentation

Segmentation, in general, deals with separating out parts from a larger entity. As mentioned earlier, an image for character recognition may be a page, which can be segmented into lines, which in turn can be segmented into words, then into individual characters and then into strokes (in some cases). Though we deal with character-level images, and segmentation is therefore not a major concern in our work, we briefly discuss the approaches to the various types of segmentation in this section for completeness. A survey on segmentation is given in [57][160]. In offline OCR, segmentation deals with extraction of:

- text lines from a page image
- words from a text-line image
- character images from a word image
- individual strokes from a character image (part of feature extraction)

Character and individual-stroke segmentation fall under internal segmentation, whereas paragraph, line and word segmentation fall under external segmentation. The segmentation performed depends on whether the HCR system uses words or characters for recognition. Holistic approaches attempt to recognize the attributes present within whole words rather than partitioning a word and attempting to recognize each part of it [11][12]. With the analytical approach, on the other hand, the characters are individually separated from the rest of the word; the strokes can then be extracted from the character image, or the whole character image is used for further processing. The analytical approach is better suited for on-line applications [7][44][30]. For off-line applications, both the analytical and holistic approaches can be used.
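For intuition, external segmentation of well-separated horizontal text lines can be sketched with a simple horizontal projection profile; this naive baseline (names are illustrative) is far weaker than the methods surveyed next, which must handle skew and overlapping lines:

```python
import numpy as np

def segment_lines(binary_page):
    """Split a binary page (foreground = 1) into text-line images by
    cutting at rows whose horizontal projection contains no ink.
    Assumes roughly horizontal, non-overlapping lines."""
    profile = binary_page.sum(axis=1)  # ink count per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                     # line begins
        elif ink == 0 and start is not None:
            lines.append(binary_page[start:y])  # line ends
            start = None
    if start is not None:
        lines.append(binary_page[start:])
    return lines
```

The same idea with a vertical profile gives a crude word/character splitter, which is exactly where cursive and overlapped writing breaks it.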
External segmentation: Because of inter-line distance variability and baseline skew variability, line segmentation in an unconstrained handwritten document is very difficult. This

becomes still more complicated when overlapping occurs between two consecutive text lines. The techniques followed by some researchers are as follows. In [16], the Run Length Smearing Algorithm (RLSA) and morphological operations are used to segment individual text lines from an unconstrained handwritten document image. RLSA links together neighboring black/white areas that are within a predefined distance, so that each word forms a connected component. On these components, morphological erosion is performed to extract the foreground and background information needed for line segmentation. [17] uses a dual method based on the interdependency between text lines and inter-line gaps, using histogram peaks and inter-peak valleys for line identification. The intra-line curve cuts through the character strokes of a text line as many times as possible, as long as these lines are straight. The imaginary inter-line curve that separates the text lines above and below it is generated similarly, subject to conditions such as that the inter-line curve must not cross the intra-line curve and vice versa. Both curves grow in parallel, guiding one another, and after a few iterations semi-optimal piecewise-linear curves for both text lines and inter-line gaps are obtained for line identification.

Internal segmentation: The complexity of segmenting the characters within a word increases as we move from words of isolated discrete characters to cursive handwriting, more complex mixed handwriting and overlapped writing. This is still a complex unsolved problem. The natural skewness in handwritten words poses challenges for automatic character segmentation: handwritten words contain both consistent and inconsistent skewness, and as the majority of HCR systems depend upon upright images, skewed images severely degrade their performance. In [14], the analytic segmentation technique is used to separate individual characters from a word.
Firstly, a simple heuristic approach is used to identify valid segmentation points between the characters. This usually looks for the minima or arcs between characters that are common in handwritten cursive script; in many cases these arcs are the ideal segmentation points. Holes occur in characters that are totally or partially closed (e.g., a, u, o), and sometimes the segmentation points may cut such a holed character in half. To avoid such cuts, after deciding a segmentation point, a hole-seeking algorithm checks that it has not segmented a character in half, by checking for holes and for the closeness of two segmentation points relative to the average

character width. The process of finding correct segmentation points is automated using a feed-forward neural network with back-propagation, trained on manually segmented handwritten words. During testing, the heuristically segmented word image is input to the network to obtain the correctly identified segmentation points; these are retained and the remaining points are removed. In [70], segmentation of handwritten English numerals and alphabets is done by moving a "marble" down either side of the touching characters to select the cut point: the marble moves downwards, diagonally downwards, to the right or to the left based on its current position and surroundings, and the cut is made at the point where the marble falls. Segmentation is also used for multi-script recognition: script identification can be done with a holistic approach and the subsequent text recognition of a particular script with an analytic approach [81], thus exploiting both approaches. In [34], online signature verification is done by extracting global features from the signature using the holistic approach and local features using the analytic approach. Segmentation down to the word level can be part of preprocessing, but the segmentation of characters, and of strokes from characters, is often part of feature extraction; in such cases the feature extraction process is called segmentation-based feature extraction [7].

2.4 Feature Extraction

Character recognition algorithms do not usually work on raw images; they use features identified from the image as the input dimensions for recognition. Identifying the right set of features, and extracting them from the image effectively and efficiently, are often the key elements of the character recognition problem. In this section, we briefly review the existing literature in this regard. Our approach is detailed in chapter 7.
A good feature set should capture the characteristics that help distinguish a class from other classes, while remaining invariant to differences within the class. Based on the segmentation model, HCR can be segmentation-based (analytic approach) or segmentation-free (holistic approach) [7][11]. Isolated-character applications use the analytic approach. For cursive word recognition, some researchers have also used the analytic approach, wherein the characters in the word are first segmented and then used for

feature extraction [40]. But segmentation is a difficult task, so some researchers have attempted word recognition based on the holistic approach with dictionary support. A survey of feature extraction is presented in [158]. The chief design task is to select the best set of features: the one that maximizes the recognition rate with the fewest elements. This problem can be formulated as a dynamic programming problem of selecting the k best features out of N with respect to a cost function such as Fisher's discriminant ratio. Selecting features with such a methodology requires expensive computation and most of the time yields a suboptimal solution. Therefore, feature selection is mostly done by heuristics or intuition for a specific type of application, usually guided by empirical experiments, an analysis of the shapes of the various characters in the target set, etc.

2.4.1 Holistic approach

Human readers easily resolve the confusion between similar shapes because they do not consider each letter or numeral in isolation; they also adapt instantly to each typeface and even to a mixture of typefaces. Automating the same ability is very difficult. Holistic approaches mimic the way humans perceive text. The holistic strategy employs top-down approaches that recognize the full word, eliminating the segmentation problem. The price for this computational saving is that the OCR problem is constrained to a limited vocabulary. This scheme can tolerate dramatic amounts of deformation within words, as often seen in cursive script. However, it depends greatly on its prescribed lexicon, since the lexicon entries are the units against which the objects of recognition are compared. Most of the conventional handwritten word recognition methods found in the literature are lexicon-driven: one is given a handwritten word image together with a list of possible target words, the lexicon. The recognition of the word image is basically a matching process.
Each algorithm gives a way of matching the word image against the given text words in the lexicon, and the best match gives the recognition result [11][12]. Due to the complexity introduced by a whole cursive or mixed-handwriting word (compared to that of a single character or stroke), the recognition accuracy is lower.
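The Fisher-discriminant-ratio criterion mentioned in the feature selection discussion above can be sketched as follows; the greedy top-k pick over per-feature scores is the cheap heuristic alternative to exhaustive selection (function names are illustrative):

```python
import numpy as np

def fisher_ratio(features, labels):
    """Score each feature column by Fisher's discriminant ratio
    (between-class variance over within-class variance);
    higher means more discriminative."""
    classes = np.unique(labels)
    overall = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in classes:
        fc = features[labels == c]
        between += fc.shape[0] * (fc.mean(axis=0) - overall) ** 2
        within += ((fc - fc.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)  # guard zero variance

def top_k_features(features, labels, k):
    """Greedily keep the k highest-scoring features."""
    return np.argsort(fisher_ratio(features, labels))[::-1][:k]
```

Scoring features independently ignores correlations between them, which is one reason such heuristics yield suboptimal subsets, as the text notes.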

2.4.2 Analytic approach

Most HCR systems use the analytic approach and recognize individual characters primarily by their shape [6]. Shape is a property of a class of characters and also of a particular method of observation or measurement; it cannot depend on size, color or location. In most languages the measurements relate to features such as length, width, curvature, orientation and the relative position of strokes, and they should be invariant to translation and scale. Hence the challenge is to find descriptions of the character image that are invariant to transformations that alter figures only in unimportant ways, yet sensitive to transformations that change figures in important ways. Every character image should have some unique feature that identifies it uniquely; all characters that cannot be distinguished by a given method of measurement are said to have the same shape, resulting in ambiguity. Hence different kinds of shape measurements are needed for feature extraction. The analytic strategies employ bottom-up approaches, starting from the stroke or character level and working towards producing meaningful text. Segmentation is required, which not only adds extra complexity to the problem but also introduces segmentation error into the system. However, with the cooperation of the segmentation stage, the problem is reduced to the recognition of simple isolated characters or strokes, which can be handled for an unlimited vocabulary with high recognition rates [29]. The two strategies are compared in table 2.1.

Table 2.1 Strategies: Holistic vs. Analytic

Holistic Strategy: whole-word recognition; limited vocabulary; no segmentation.
Analytic Strategy: sub-word or letter recognition; unlimited vocabulary; requires explicit or implicit segmentation.

Based on the method of data acquisition and the kind of information available for feature extraction, we have two categories of features: off-line features and on-line features.

Offline features

To recognize a handwritten character image, different primitive features are extracted from the preprocessed character image during or after segmentation [29]. The features can be extracted from the whole image or from specific parts of it by dividing the image into several overlapping or non-overlapping zones (windows or cells). Hundreds of features are mentioned in the literature; the major ones can be broadly categorized as follows.

Statistical features: These features are derived from the statistical distribution of foreground points in the image. They provide high speed and low complexity and accommodate style variations to some extent. Statistical features can be extracted from the whole image or from zones of it. Features such as the density of points, number of strokes, counts of strokes in each direction, area, perimeter, number of crossings (the number of times line segments are traversed by vectors in specified directions), curve distance from the boundary, compactness, counts of start and end points, pen-ups and number of sub-patterns can be extracted from the whole image and also from each zone.

Structural features: The two common classes of structural features are straight lines and curved lines. These features describe the structure of the character shape: whether a shape is straight, curved, circular, etc. Usually a character shape has many structural features, and hence it needs to be segmented. For a segment identified as a straight line, its orientation or inclination angle distinguishes it as a vertical line, horizontal line, positive slant, negative slant, etc. Categorizing curved lines is a more complex task. These features have high tolerance to distortion and style variation, and they also tolerate a certain degree of translation and rotation. The sequence of structural features forming a character shape may itself be used as a feature.
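A minimal example of zone-based statistical features, here just the foreground density of each cell of a 3x3 grid (an illustrative sketch, not a specific cited method):

```python
import numpy as np

def zone_densities(binary, rows=3, cols=3):
    """Statistical zoning feature: fraction of foreground pixels in
    each cell of a rows x cols grid, flattened into a vector."""
    h, w = binary.shape
    feats = []
    for i in range(rows):
        for j in range(cols):
            cell = binary[i * h // rows:(i + 1) * h // rows,
                          j * w // cols:(j + 1) * w // cols]
            feats.append(cell.mean())  # density of this zone
    return np.array(feats)
```

Other statistical features listed above (crossing counts, perimeter, etc.) can be computed per zone in the same way to build a longer feature vector.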
Geometrical features: Since a character shape may have many structural features, simply identifying a segment as a line or curve, or as a vertical line, may not describe the shape fully; geometrical features describe it further. Hence geometrical features are also referred to as structural features in the literature. The length of a line or curve, its angle of orientation, etc., are some of the geometric features. These features may represent global (whole-image) or zonal (local, part-of-image) properties of characters.

Topological (positional) features: These features determine the relative position of a geometrical feature: the line/curve position within a zone (the whole image may be considered a single zone), the positions of the start and end points of the character, and so on. For example, if an image is divided into non-overlapping 3x3 zones, a curve may lie in the left region of zone (1,1). As topological features further characterize geometrical features, and geometrical features distinguish structural features, we refer to these three categories collectively as structural features in this report unless they need to be distinguished.

Global transformation and series expansion features: The transform-domain representation of an image generally highlights information that cannot be visualized in the spatial domain and can be used for generating features. The signal representation of the image provides additional opportunities, as it can be transformed into other domains (e.g., time and frequency). Such signals can be represented as a linear combination of a series of simpler, well-defined functions; the coefficients of the linear combination provide a compact encoding known as a series expansion. Some common transform and series-expansion methods of feature extraction are as follows.

Fourier Transform: It represents the spatial-domain image as a summation of sinusoids of varying frequencies and amplitudes. The general procedure is to choose the magnitude spectrum of the image as the feature vector. One of the most attractive properties of the Fourier transform is its ability to recognize position-shifted characters, since only the magnitude spectrum is observed and the phase is ignored. The drawback is that the local (time) information is lost.

Gabor Transform: The Gabor transform is a variation of the windowed Fourier transform in which the window is defined by a Gaussian function.
This transformation maps a signal into a two-dimensional function of time and frequency. The drawback is that the window size remains fixed for all frequencies. By varying the width and the orientation angle of the Gaussian function, Gabor wavelets can be generated: width variation helps extract edges of varying thickness, and the orientation angle helps extract edges in a particular direction.

Wavelet transform: Wavelet transformation performs multi-resolution analysis. It provides more flexibility than the Gabor transform, in that one can vary the window size
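The translation tolerance of Fourier magnitude features noted above can be verified with a small NumPy experiment (using circular shifts, for which the magnitude spectrum is exactly invariant by the shift theorem):

```python
import numpy as np

# A small "character" and a translated copy of it.
img = np.zeros((16, 16))
img[4:8, 4:8] = 1.0
shifted = np.roll(np.roll(img, 3, axis=0), 2, axis=1)

# The 2-D Fourier magnitude ignores the phase shift introduced by
# translation, so the magnitude spectra coincide even though the
# spatial images differ.
mag = np.abs(np.fft.fft2(img))
mag_shifted = np.abs(np.fft.fft2(shifted))
print(np.allclose(mag, mag_shifted))  # True: magnitudes match
```

This is precisely the invariance exploited when magnitude spectra are used as features, and discarding the phase is also why local position information is lost.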


More information

N.Priya. Keywords Compass mask, Threshold, Morphological Operators, Statistical Measures, Text extraction

N.Priya. Keywords Compass mask, Threshold, Morphological Operators, Statistical Measures, Text extraction Volume, Issue 8, August ISSN: 77 8X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Combined Edge-Based Text

More information

CoE4TN4 Image Processing

CoE4TN4 Image Processing CoE4TN4 Image Processing Chapter 11 Image Representation & Description Image Representation & Description After an image is segmented into regions, the regions are represented and described in a form suitable

More information

Binary Image Processing. Introduction to Computer Vision CSE 152 Lecture 5

Binary Image Processing. Introduction to Computer Vision CSE 152 Lecture 5 Binary Image Processing CSE 152 Lecture 5 Announcements Homework 2 is due Apr 25, 11:59 PM Reading: Szeliski, Chapter 3 Image processing, Section 3.3 More neighborhood operators Binary System Summary 1.

More information

Keywords: Thresholding, Morphological operations, Image filtering, Adaptive histogram equalization, Ceramic tile.

Keywords: Thresholding, Morphological operations, Image filtering, Adaptive histogram equalization, Ceramic tile. Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Blobs and Cracks

More information

Vivekananda. Collegee of Engineering & Technology. Question and Answers on 10CS762 /10IS762 UNIT- 5 : IMAGE ENHANCEMENT.

Vivekananda. Collegee of Engineering & Technology. Question and Answers on 10CS762 /10IS762 UNIT- 5 : IMAGE ENHANCEMENT. Vivekananda Collegee of Engineering & Technology Question and Answers on 10CS762 /10IS762 UNIT- 5 : IMAGE ENHANCEMENT Dept. Prepared by Harivinod N Assistant Professor, of Computer Science and Engineering,

More information

Development of an Automated Fingerprint Verification System

Development of an Automated Fingerprint Verification System Development of an Automated Development of an Automated Fingerprint Verification System Fingerprint Verification System Martin Saveski 18 May 2010 Introduction Biometrics the use of distinctive anatomical

More information

Artifacts and Textured Region Detection

Artifacts and Textured Region Detection Artifacts and Textured Region Detection 1 Vishal Bangard ECE 738 - Spring 2003 I. INTRODUCTION A lot of transformations, when applied to images, lead to the development of various artifacts in them. In

More information

Structural Feature Extraction to recognize some of the Offline Isolated Handwritten Gujarati Characters using Decision Tree Classifier

Structural Feature Extraction to recognize some of the Offline Isolated Handwritten Gujarati Characters using Decision Tree Classifier Structural Feature Extraction to recognize some of the Offline Isolated Handwritten Gujarati Characters using Decision Tree Classifier Hetal R. Thaker Atmiya Institute of Technology & science, Kalawad

More information

CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS

CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS 130 CHAPTER 6 DETECTION OF MASS USING NOVEL SEGMENTATION, GLCM AND NEURAL NETWORKS A mass is defined as a space-occupying lesion seen in more than one projection and it is described by its shapes and margin

More information

Segmentation and Grouping

Segmentation and Grouping Segmentation and Grouping How and what do we see? Fundamental Problems ' Focus of attention, or grouping ' What subsets of pixels do we consider as possible objects? ' All connected subsets? ' Representation

More information

Segmentation of Kannada Handwritten Characters and Recognition Using Twelve Directional Feature Extraction Techniques

Segmentation of Kannada Handwritten Characters and Recognition Using Twelve Directional Feature Extraction Techniques Segmentation of Kannada Handwritten Characters and Recognition Using Twelve Directional Feature Extraction Techniques 1 Lohitha B.J, 2 Y.C Kiran 1 M.Tech. Student Dept. of ISE, Dayananda Sagar College

More information

Topic 6 Representation and Description

Topic 6 Representation and Description Topic 6 Representation and Description Background Segmentation divides the image into regions Each region should be represented and described in a form suitable for further processing/decision-making Representation

More information

COMPUTER AND ROBOT VISION

COMPUTER AND ROBOT VISION VOLUME COMPUTER AND ROBOT VISION Robert M. Haralick University of Washington Linda G. Shapiro University of Washington A^ ADDISON-WESLEY PUBLISHING COMPANY Reading, Massachusetts Menlo Park, California

More information

Digital Image Processing Fundamentals

Digital Image Processing Fundamentals Ioannis Pitas Digital Image Processing Fundamentals Chapter 7 Shape Description Answers to the Chapter Questions Thessaloniki 1998 Chapter 7: Shape description 7.1 Introduction 1. Why is invariance to

More information

Image Processing: Final Exam November 10, :30 10:30

Image Processing: Final Exam November 10, :30 10:30 Image Processing: Final Exam November 10, 2017-8:30 10:30 Student name: Student number: Put your name and student number on all of the papers you hand in (if you take out the staple). There are always

More information

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall 2008 October 29, 2008 Notes: Midterm Examination This is a closed book and closed notes examination. Please be precise and to the point.

More information

Mathematical Morphology and Distance Transforms. Robin Strand

Mathematical Morphology and Distance Transforms. Robin Strand Mathematical Morphology and Distance Transforms Robin Strand robin.strand@it.uu.se Morphology Form and structure Mathematical framework used for: Pre-processing Noise filtering, shape simplification,...

More information

CHAPTER 3 IMAGE ENHANCEMENT IN THE SPATIAL DOMAIN

CHAPTER 3 IMAGE ENHANCEMENT IN THE SPATIAL DOMAIN CHAPTER 3 IMAGE ENHANCEMENT IN THE SPATIAL DOMAIN CHAPTER 3: IMAGE ENHANCEMENT IN THE SPATIAL DOMAIN Principal objective: to process an image so that the result is more suitable than the original image

More information

SKEW DETECTION AND CORRECTION

SKEW DETECTION AND CORRECTION CHAPTER 3 SKEW DETECTION AND CORRECTION When the documents are scanned through high speed scanners, some amount of tilt is unavoidable either due to manual feed or auto feed. The tilt angle induced during

More information

K S Prasanna Kumar et al,int.j.computer Techology & Applications,Vol 3 (1),

K S Prasanna Kumar et al,int.j.computer Techology & Applications,Vol 3 (1), Optical Character Recognition (OCR) for Kannada numerals using Left Bottom 1/4 th segment minimum features extraction K.S. Prasanna Kumar Research Scholar, JJT University, Jhunjhunu, Rajasthan, India prasannakumarks@acharya.ac.in

More information

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering Digital Image Processing Prof. P.K. Biswas Department of Electronics & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Image Segmentation - III Lecture - 31 Hello, welcome

More information

Fundamentals of Digital Image Processing

Fundamentals of Digital Image Processing \L\.6 Gw.i Fundamentals of Digital Image Processing A Practical Approach with Examples in Matlab Chris Solomon School of Physical Sciences, University of Kent, Canterbury, UK Toby Breckon School of Engineering,

More information

Segmentation of Characters of Devanagari Script Documents

Segmentation of Characters of Devanagari Script Documents WWJMRD 2017; 3(11): 253-257 www.wwjmrd.com International Journal Peer Reviewed Journal Refereed Journal Indexed Journal UGC Approved Journal Impact Factor MJIF: 4.25 e-issn: 2454-6615 Manpreet Kaur Research

More information

2D Image Processing INFORMATIK. Kaiserlautern University. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz

2D Image Processing INFORMATIK. Kaiserlautern University.   DFKI Deutsches Forschungszentrum für Künstliche Intelligenz 2D Image Processing - Filtering Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 What is image filtering?

More information

Chapter 3: Intensity Transformations and Spatial Filtering

Chapter 3: Intensity Transformations and Spatial Filtering Chapter 3: Intensity Transformations and Spatial Filtering 3.1 Background 3.2 Some basic intensity transformation functions 3.3 Histogram processing 3.4 Fundamentals of spatial filtering 3.5 Smoothing

More information

Filtering and Enhancing Images

Filtering and Enhancing Images KECE471 Computer Vision Filtering and Enhancing Images Chang-Su Kim Chapter 5, Computer Vision by Shapiro and Stockman Note: Some figures and contents in the lecture notes of Dr. Stockman are used partly.

More information

Chapter Review of HCR

Chapter Review of HCR Chapter 3 [3]Literature Review The survey of literature on character recognition showed that some of the researchers have worked based on application requirements like postal code identification [118],

More information

Edges and Binary Images

Edges and Binary Images CS 699: Intro to Computer Vision Edges and Binary Images Prof. Adriana Kovashka University of Pittsburgh September 5, 205 Plan for today Edge detection Binary image analysis Homework Due on 9/22, :59pm

More information

Computer Vision I - Basics of Image Processing Part 2

Computer Vision I - Basics of Image Processing Part 2 Computer Vision I - Basics of Image Processing Part 2 Carsten Rother 07/11/2014 Computer Vision I: Basics of Image Processing Roadmap: Basics of Digital Image Processing Computer Vision I: Basics of Image

More information

Image Analysis Image Segmentation (Basic Methods)

Image Analysis Image Segmentation (Basic Methods) Image Analysis Image Segmentation (Basic Methods) Christophoros Nikou cnikou@cs.uoi.gr Images taken from: R. Gonzalez and R. Woods. Digital Image Processing, Prentice Hall, 2008. Computer Vision course

More information

One Dim~nsional Representation Of Two Dimensional Information For HMM Based Handwritten Recognition

One Dim~nsional Representation Of Two Dimensional Information For HMM Based Handwritten Recognition One Dim~nsional Representation Of Two Dimensional Information For HMM Based Handwritten Recognition Nafiz Arica Dept. of Computer Engineering, Middle East Technical University, Ankara,Turkey nafiz@ceng.metu.edu.

More information

Hidden Loop Recovery for Handwriting Recognition

Hidden Loop Recovery for Handwriting Recognition Hidden Loop Recovery for Handwriting Recognition David Doermann Institute of Advanced Computer Studies, University of Maryland, College Park, USA E-mail: doermann@cfar.umd.edu Nathan Intrator School of

More information

Segmentation Based Optical Character Recognition for Handwritten Marathi characters

Segmentation Based Optical Character Recognition for Handwritten Marathi characters Segmentation Based Optical Character Recognition for Handwritten Marathi characters Madhav Vaidya 1, Yashwant Joshi 2,Milind Bhalerao 3 Department of Information Technology 1 Department of Electronics

More information

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors

Texture. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors. Frequency Descriptors Texture The most fundamental question is: How can we measure texture, i.e., how can we quantitatively distinguish between different textures? Of course it is not enough to look at the intensity of individual

More information

Perception. Autonomous Mobile Robots. Sensors Vision Uncertainties, Line extraction from laser scans. Autonomous Systems Lab. Zürich.

Perception. Autonomous Mobile Robots. Sensors Vision Uncertainties, Line extraction from laser scans. Autonomous Systems Lab. Zürich. Autonomous Mobile Robots Localization "Position" Global Map Cognition Environment Model Local Map Path Perception Real World Environment Motion Control Perception Sensors Vision Uncertainties, Line extraction

More information

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier

Computer Vision 2. SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung. Computer Vision 2 Dr. Benjamin Guthier Computer Vision 2 SS 18 Dr. Benjamin Guthier Professur für Bildverarbeitung Computer Vision 2 Dr. Benjamin Guthier 1. IMAGE PROCESSING Computer Vision 2 Dr. Benjamin Guthier Content of this Chapter Non-linear

More information

Lecture 4 Image Enhancement in Spatial Domain

Lecture 4 Image Enhancement in Spatial Domain Digital Image Processing Lecture 4 Image Enhancement in Spatial Domain Fall 2010 2 domains Spatial Domain : (image plane) Techniques are based on direct manipulation of pixels in an image Frequency Domain

More information

Image representation. 1. Introduction

Image representation. 1. Introduction Image representation Introduction Representation schemes Chain codes Polygonal approximations The skeleton of a region Boundary descriptors Some simple descriptors Shape numbers Fourier descriptors Moments

More information

Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network

Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network 139 Handwritten Gurumukhi Character Recognition by using Recurrent Neural Network Harmit Kaur 1, Simpel Rani 2 1 M. Tech. Research Scholar (Department of Computer Science & Engineering), Yadavindra College

More information

CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS

CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS 8.1 Introduction The recognition systems developed so far were for simple characters comprising of consonants and vowels. But there is one

More information

An Intuitive Explanation of Fourier Theory

An Intuitive Explanation of Fourier Theory An Intuitive Explanation of Fourier Theory Steven Lehar slehar@cns.bu.edu Fourier theory is pretty complicated mathematically. But there are some beautifully simple holistic concepts behind Fourier theory

More information

Morphological Image Processing

Morphological Image Processing Morphological Image Processing Ranga Rodrigo October 9, 29 Outline Contents Preliminaries 2 Dilation and Erosion 3 2. Dilation.............................................. 3 2.2 Erosion..............................................

More information

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network Utkarsh Dwivedi 1, Pranjal Rajput 2, Manish Kumar Sharma 3 1UG Scholar, Dept. of CSE, GCET, Greater Noida,

More information

Text line Segmentation of Curved Document Images

Text line Segmentation of Curved Document Images RESEARCH ARTICLE S OPEN ACCESS Text line Segmentation of Curved Document Images Anusree.M *, Dhanya.M.Dhanalakshmy ** * (Department of Computer Science, Amrita Vishwa Vidhyapeetham, Coimbatore -641 11)

More information

A New Algorithm for Detecting Text Line in Handwritten Documents

A New Algorithm for Detecting Text Line in Handwritten Documents A New Algorithm for Detecting Text Line in Handwritten Documents Yi Li 1, Yefeng Zheng 2, David Doermann 1, and Stefan Jaeger 1 1 Laboratory for Language and Media Processing Institute for Advanced Computer

More information

MR IMAGE SEGMENTATION

MR IMAGE SEGMENTATION MR IMAGE SEGMENTATION Prepared by : Monil Shah What is Segmentation? Partitioning a region or regions of interest in images such that each region corresponds to one or more anatomic structures Classification

More information

Chapter 10: Image Segmentation. Office room : 841

Chapter 10: Image Segmentation.   Office room : 841 Chapter 10: Image Segmentation Lecturer: Jianbing Shen Email : shenjianbing@bit.edu.cn Office room : 841 http://cs.bit.edu.cn/shenjianbing cn/shenjianbing Contents Definition and methods classification

More information

Logical Templates for Feature Extraction in Fingerprint Images

Logical Templates for Feature Extraction in Fingerprint Images Logical Templates for Feature Extraction in Fingerprint Images Bir Bhanu, Michael Boshra and Xuejun Tan Center for Research in Intelligent Systems University of Califomia, Riverside, CA 9252 1, USA Email:

More information

Slant Correction using Histograms

Slant Correction using Histograms Slant Correction using Histograms Frank de Zeeuw Bachelor s Thesis in Artificial Intelligence Supervised by Axel Brink & Tijn van der Zant July 12, 2006 Abstract Slant is one of the characteristics that

More information

Lecture: Segmentation I FMAN30: Medical Image Analysis. Anders Heyden

Lecture: Segmentation I FMAN30: Medical Image Analysis. Anders Heyden Lecture: Segmentation I FMAN30: Medical Image Analysis Anders Heyden 2017-11-13 Content What is segmentation? Motivation Segmentation methods Contour-based Voxel/pixel-based Discussion What is segmentation?

More information

11. Gray-Scale Morphology. Computer Engineering, i Sejong University. Dongil Han

11. Gray-Scale Morphology. Computer Engineering, i Sejong University. Dongil Han Computer Vision 11. Gray-Scale Morphology Computer Engineering, i Sejong University i Dongil Han Introduction Methematical morphology represents image objects as sets in a Euclidean space by Serra [1982],

More information

Vision. OCR and OCV Application Guide OCR and OCV Application Guide 1/14

Vision. OCR and OCV Application Guide OCR and OCV Application Guide 1/14 Vision OCR and OCV Application Guide 1.00 OCR and OCV Application Guide 1/14 General considerations on OCR Encoded information into text and codes can be automatically extracted through a 2D imager device.

More information

SIFT - scale-invariant feature transform Konrad Schindler

SIFT - scale-invariant feature transform Konrad Schindler SIFT - scale-invariant feature transform Konrad Schindler Institute of Geodesy and Photogrammetry Invariant interest points Goal match points between images with very different scale, orientation, projective

More information

Efficient Nonlinear Image Processing Algorithms

Efficient Nonlinear Image Processing Algorithms Efficient Nonlinear Image Processing Algorithms SANJIT K. MITRA Department of Electrical & Computer Engineering University of California Santa Barbara, California Outline Introduction Quadratic Volterra

More information

SCENE TEXT BINARIZATION AND RECOGNITION

SCENE TEXT BINARIZATION AND RECOGNITION Chapter 5 SCENE TEXT BINARIZATION AND RECOGNITION 5.1 BACKGROUND In the previous chapter, detection of text lines from scene images using run length based method and also elimination of false positives

More information

Image Enhancement: To improve the quality of images

Image Enhancement: To improve the quality of images Image Enhancement: To improve the quality of images Examples: Noise reduction (to improve SNR or subjective quality) Change contrast, brightness, color etc. Image smoothing Image sharpening Modify image

More information

LITERATURE REVIEW. For Indian languages most of research work is performed firstly on Devnagari script and secondly on Bangla script.

LITERATURE REVIEW. For Indian languages most of research work is performed firstly on Devnagari script and secondly on Bangla script. LITERATURE REVIEW For Indian languages most of research work is performed firstly on Devnagari script and secondly on Bangla script. The study of recognition for handwritten Devanagari compound character

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

Time Stamp Detection and Recognition in Video Frames

Time Stamp Detection and Recognition in Video Frames Time Stamp Detection and Recognition in Video Frames Nongluk Covavisaruch and Chetsada Saengpanit Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, Thailand E-mail: nongluk.c@chula.ac.th

More information