Chapter 2. Literature Survey and Objectives. 2.1 Literature Survey


Kevin Goodwin

In India, there are 18 official languages recognized by the Indian constitution. Two or more of these languages may be written in one script, and twelve different scripts are used for writing them. Under the three-language formula, many Indian documents are written in three languages: English, Hindi and the official language of the state. For example, a money order form in the state of Tamil Nadu is written in English, Hindi and Tamil, because Tamil is the official language of that state.

The need for some form of automated or semi-automated OCR has been recognized for decades. Because segmentation is a crucial part of OCR, this phase deserves particular attention. Today there are numerous algorithms that perform this task, each with its own strengths and weaknesses. This survey reviews a number of papers related to the present work.

Dunn and Wang [1992] surveyed techniques for segmenting images of handwritten text into individual characters. The topic is divided into two categories: straight segmentation and segmentation recognition techniques. Straight segmentation, discussed first in the paper, forms rules to identify members of a character set without identifying their specific classification. It is useful for printed character sets but less effective for cursive text. Because the character boundaries are pre-determined, it greatly reduces the complexity of the search for a word hypothesis. Several approaches to segmentation recognition are also discussed, and each is analyzed for its relevance to printed, cursive, on-line and off-line input data.

Segmentation recognition strategies are more expensive because of the increased complexity of the search for optimum word hypotheses. However, the inherent ambiguity of cursive text requires this type of segmentation.

Fujisawa et al. [1992] presented a pattern-oriented segmentation method for optical character recognition that leads to document structure analysis. As a case study, the segmentation of handwritten numerals that touch each other is considered first. Connected pattern components are extracted, and the spatial interrelations between components are measured and used to group them into meaningful character patterns. Stroke shapes are analyzed, and on the basis of that analysis a method is described for finding the touching positions, which separates almost all of the connected numerals correctly. The authors handled ambiguities by generating multiple hypotheses and verifying them by recognition. An extended form of pattern-oriented segmentation, tabular form recognition, is also considered. Images of tabular forms are analyzed, and frames in the tabular structure are segmented. By identifying semantic relationships between label frames and data frames, the information on the form can be properly recognized.

Abulhaiba and Ahmed [1993] presented an automatic off-line character recognition system for totally unconstrained handwritten numerals using fuzzy logic. The system was trained and tested on field data collected by the U.S. Postal Services Department from dead letter envelopes. It was trained on 1,763 unnormalized samples. The training process produced a feasible set of 105 Fuzzy Constrained Character Graph Models (FCCGMs), which tolerate large variability in size, shape and writing style. Characters were recognized by applying a set of rules to match a character tree representation to an FCCGM. A character tree is obtained by first converting the character skeleton into an approximate polygon and then

transforming the polygon into a tree structure suitable for recognition purposes. The system was tested on 1,812 unnormalized samples, not including the training set, and it proved to be powerful in recognition rate and in tolerance to multiple writers, pens, paper textures and ink colors.

Akindele and Belaid [1993] described a page segmentation method that cuts a document page image into polygonal blocks as well as classical rectangular blocks. The inter-column and inter-paragraph gaps are extracted as horizontal and vertical lines, and an intersection table is built from these lines. The points of intersection between the lines are treated as vertices of polygonal blocks. With the aid of four-connected chain codes and the derived intersection table, simple isothetic polygonal blocks are constructed from these points of intersection. The method is robust enough to obtain polygonal blocks of any shape and any number of sides.

Pavlidis [1993] stated that research in optical character recognition (OCR) has focused on the shape analysis of binarized images, on the assumption of good-quality documents and isolated characters. Such assumptions are challenged by the conditions met in practice. Binarization is difficult for low-contrast documents, and characters often touch each other, not only on the sides but also between lines. The author discussed current efforts to treat OCR as a signal processing problem in which the causes of noise and distortion, as well as the idealized images (definitions of typefaces), are modeled and subjected to quantitative analysis. The key idea of the analysis is that, while printed text images may be binary in an ideal state, the images seen by the sensors are gray scale because of convolution distortion and other causes. Finally, it is stated that binarization should be carried out at the same time as feature extraction.

Liang et al. [1994] proposed a new discrimination function for segmenting touching

characters. This function is based on both pixel projection and profile projection. A dynamic recursive segmentation algorithm is developed for effectively segmenting touching characters. Contextual information and a spelling checker are used to correct errors caused by incorrect recognition and segmentation. According to the paper, the proposed algorithm achieved good recognition accuracy.

Seni and Cohen [1994] described techniques for separating a line of unconstrained handwritten text (text written in a natural manner) into words. When the writing style is unconstrained, recognition of individual components may be unreliable, so these components must be grouped into word hypotheses before recognition algorithms, which may require dictionaries, can be used. The proposed system uses original algorithms to determine distances between components in a text line and to detect punctuation. The algorithms are tested on a number of handwritten text lines extracted from postal address blocks. A detailed performance analysis of the complete system and its components is presented in the paper.

Avi-Itzhak et al. [1995] stated that optical character recognition (OCR) refers to a process by which printed documents are transformed into ASCII files for the purpose of compact storage, editing, fast retrieval and other file manipulations through the use of a computer. The recognition stage of an OCR process is made difficult by added noise, image distortion, and the various character typefaces, sizes and fonts that a document may have. In this study, a neural network approach is introduced to perform high-accuracy recognition of multi-size and multi-font characters. A novel centroid dithering training process with a low noise-sensitivity normalization procedure is used to achieve high-accuracy results. The study is divided into two parts. The first part focuses on single-size and single-font characters, and a two-layered neural network is trained to recognize

the full set of 94 ASCII character images in 12 pt Courier font. The second part trades accuracy for additional font and size capability, and a larger two-layered neural network is trained to recognize the full set of 94 ASCII character images for all font sizes from 8 to 32 and for 12 commonly used fonts. The performance of these two networks is evaluated on a database of more than one million character images from the testing data set.

Congedo et al. [1995] presented a procedure for the segmentation of handwritten numeric strings. The procedure uses a hypothesis-then-verification strategy. Multiple segmentation algorithms based on contiguous row partition work sequentially on the binary image until an acceptable segmentation is obtained. For this purpose, a new set of algorithms simulating a "drop falling" process is introduced. Drop fall algorithms attempt to build a segmentation path by mimicking an object falling or rolling in between the two characters that make up a connected component. There are four primary types of drop fall algorithm, which differ in the direction and the starting point of the drop fall: top left (or left descending), top right (or right descending), bottom left (or left ascending), and bottom right (or right ascending). The experimental tests demonstrate the effectiveness of the new algorithms in obtaining high-confidence segmentation hypotheses.

Lu [1995] provided insight into character segmentation. Although the paper deals with machine-printed characters, it gives a basis for understanding segmentation. According to the paper, segmentation can be divided into three parts. The first part is the Classical Approach, in which segments are identified on the basis of character-like properties. This process of cutting up the image into meaningful components is called dissection.
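
As a rough illustration of dissection, the sketch below cuts a binary text-line image at columns that contain no foreground pixels, using the vertical projection profile. It is only one simple dissection rule and is not taken from Lu [1995]; the function names and the list-of-rows image representation (1 for ink, 0 for background) are assumptions made for this example.

```python
def vertical_projection(image):
    """Count foreground pixels in each column of a binary image (list of rows)."""
    return [sum(column) for column in zip(*image)]


def dissect_columns(image):
    """Return (start, end) column ranges of runs that contain ink.

    The end index is exclusive. A cut is placed wherever the vertical
    projection drops to zero, i.e. at a blank column.
    """
    profile = vertical_projection(image)
    segments = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a new ink run begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # blank column closes the run
            start = None
    if start is not None:                  # run reaches the right edge
        segments.append((start, len(profile)))
    return segments


# Two blobs separated by one blank column.
line_image = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
]
print(dissect_columns(line_image))
```

A rule of this kind fails as soon as characters touch or overlap, which is the situation that motivates the recognition-based and holistic strategies in this survey.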
The second part is Recognition-Based Segmentation, in which the system searches the image for components that match classes in the alphabet. Holistic Methods are the

third part, in which the system seeks to recognize words as a whole, thus avoiding the need to segment them into characters.

Casey and Lecolinet [1996] aimed to provide an appreciation of the range of character segmentation techniques that have been developed. The techniques are listed under four headings. The classical approach consists of methods that partition the input image into sub-images, which are then classified. The second class of methods segments the image either explicitly, by classification of pre-specified windows, or implicitly, by classification of subsets of spatial features collected from the image as a whole. The third strategy is a hybrid of the first two, employing dissection together with recombination rules but using classification to select from the range of admissible segmentation possibilities offered by these sub-images. Finally, the holistic approach avoids segmentation by recognizing entire character strings as units.

Lee [1996] proposed a new scheme for off-line recognition of totally unconstrained handwritten numerals using a simple multilayer cluster neural network trained with the back-propagation algorithm. The paper highlighted that the use of genetic algorithms avoids the problem of becoming trapped in local minima when training the multilayer cluster neural network with the gradient descent technique, and hence improves the recognition rates. In the proposed scheme, Kirsch masks are adopted for extracting feature vectors, and a three-layer cluster neural network with five independent sub-networks is developed for classifying similar numerals efficiently. To verify its performance, the proposed multilayer cluster neural network was tested on a handwritten numeral database, and the correct recognition rates were reported.

Lu and Shridhar [1996] presented an overview of the most important techniques used in segmenting characters from handwritten words. It is well recognized that it is difficult

to segment individual characters from handwritten words without support from recognition and context analysis. One common characteristic of all existing handwritten word recognition algorithms is that the character segmentation process is closely coupled with the recognition process. The review consists of three major portions: hand-printed word segmentation, handwritten numeral segmentation and cursive word segmentation. Every algorithm discussed in the paper is accompanied by a flow chart to give a clear grasp of the algorithm. One section summarizes the terms and measurements commonly used in handwritten character segmentation.

Messelodi and Modena [1996] presented an algorithm for text segmentation and recognition that is mainly suited to complex problems in which many merged characters are present. The basic idea is to define a distance between lines of text and strings, which makes it possible to postpone the final decision about text segmentation and character classification until contextual analysis is performed. The distance takes into account both the segmentation hypotheses generated by a text segmentation module and the character classification hypotheses produced by a probabilistic classifier. The algorithm has been tested by reading text on book covers, and the experimental results highlight the quality of the proposed solution.

Trier et al. [1996] presented an overview of feature extraction methods for off-line recognition of segmented (isolated) characters. Selection of a feature extraction method is probably the single most important factor in achieving high recognition performance in character recognition systems. The feature extraction methods discussed in the paper are categorized with reference to invariance properties, reconstructability, and the expected distortions and variability of the characters. The paper also discussed the problem of choosing the appropriate feature extraction method for a given application. Different

feature extraction methods are designed for different representations of the characters.

Yu and Jain [1996] proposed a robust and fast skew detection algorithm based on a hierarchical Hough transform. It is capable of detecting the skew angle of various document images, including technical articles, postal labels, handwritten text, forms, drawings and bar codes. The algorithm is robust even when black margins introduced by photocopying are present in the image and when the document is scanned at a resolution as low as 50 dpi. The algorithm has two steps. In the first step, the centroids of connected components are quickly extracted using a graph data structure. In the second step, a hierarchical Hough transform (at two different angular resolutions) is applied to the selected centroids. The skew angle corresponds to the location of the highest peak in the Hough space. The performance of the algorithm is shown on a number of document images collected from various application domains. The algorithm is not very sensitive to its parameters.

Chaudhuri and Pal [1997 a] proposed an OCR system that can read two Indian language scripts, Bangla and Devnagari (Hindi), the most popular ones in the Indian subcontinent. These scripts have the same origin in the ancient Brahmi script and share many features, and hence a single system can be modeled to recognize them. The proposed model performs document digitization, skew detection, text line segmentation and zone separation, word and character segmentation, and grouping of characters into basic, modifier and compound character categories. These steps are carried out for both scripts by the same set of algorithms. The feature sets, the classification tree and the knowledge base (required for error correction, such as the lexicon) differ for Bangla and Devnagari. The system shows good performance for single-font scripts printed on clear documents.

Chaudhuri and Pal [1997 b] considered scanned documents in Devnagari and Bangla

for skew angle detection. Most characters in these scripts have a horizontal line at the top, called the head line. The character head lines mostly join one another within a word, so that the word appears as a single component. In the proposed method, the components are labeled. The upper envelope of a component is found by column-wise scanning from an imaginary line above the component. Portions of the upper envelope that satisfy the properties of a digital straight line are detected and clustered as belonging to single text lines. Estimates from individual clusters are combined to obtain the skew angle. An advantage of the method is that character segmentation and zone detection can readily be done from the head line information, which is useful in OCR approaches for these scripts.

Chung and Yoon [1997] presented a performance comparison of several feature selection methods based on neural network node pruning. It is assumed that features are extracted and presented as the inputs of a three-layered perceptron classifier. Under this assumption, the authors applied five feature selection methods before, during or after neural network training in order to prune only the input nodes of the network. Four of them are node pruning methods: the node saliency method, the node sensitivity method, and two interactive pruning methods using different contribution measures. The last is a statistical method based on principal component analysis (PCA). The first two prune input nodes during training, whereas the last three do so before or after network training. Using gradient features and upper-down and left-right hole concavity features, experiments on handwritten English alphabet and digit recognition were performed with and without pruning for each of the five feature selection algorithms. The experimental results show that the node saliency method outperforms the others.
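
The upper-envelope idea behind the head line based skew estimate of Chaudhuri and Pal [1997 b], described above, can be illustrated with a much simplified sketch. Instead of detecting digital straight line segments and clustering them per text line, the code below fits a single least-squares line to the upper envelope of one component; that substitution, the function names and the list-of-rows image representation (1 for ink) are all assumptions made for this example, not the authors' procedure.

```python
import math


def upper_envelope(image):
    """For each column, the row index of the first ink pixel seen from the top.

    Columns without ink yield None.
    """
    envelope = []
    for column in zip(*image):
        top = next((row for row, pixel in enumerate(column) if pixel), None)
        envelope.append(top)
    return envelope


def skew_angle_degrees(image):
    """Estimate skew by fitting a least-squares line to the upper envelope.

    Rows grow downward, so a positive angle means the head line descends
    from left to right in image coordinates.
    """
    points = [(x, y) for x, y in enumerate(upper_envelope(image)) if y is not None]
    if len(points) < 2:
        raise ValueError("need at least two non-empty columns")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan2(sxy, sxx))


# A level head line, and one that drops one row per column.
level_word = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
tilted_word = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
print(skew_angle_degrees(level_word), skew_angle_degrees(tilted_word))
```

A single global fit is sensitive to ascending strokes that rise above the head line, which is one reason the original method restricts itself to envelope portions that behave like digital straight lines.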
Peake and Tan [1997] presented a detailed review of current script and language

identification techniques. The proposed method is based on texture analysis for script identification and does not require character segmentation, whereas existing schemes rely on either connected component analysis or character segmentation. A uniform text block on which texture analysis can be performed is produced from a document image by simple processing. Multiple channel (Gabor) filters and grey level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified on the basis of the features of training documents using the k-NN classifier. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.

Alpaydin [1998] suggested that learners based on different paradigms can be combined for improved accuracy. Each learning method assumes a certain model that comes with a set of assumptions, which may lead to error if the assumptions do not hold. Learning is an ill-posed problem, and with finite data each algorithm converges to a different solution and fails under different circumstances. The author stated that classifiers based on these paradigms generalize differently, fail on different patterns and to a certain extent complement each other, and therefore looked for ways to combine them for higher accuracy. One way to obtain complementary classifiers is to use different input representations. The combination methods investigated are voting, mixture of experts, stacking and cascading. The approach was tested on real-world applications such as optical handwritten digit recognition and pen-based handwritten digit recognition, and the paper claims that it gave satisfactory results.

Chaudhuri and Pal [1998] presented a complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world.
This is the first OCR system among all script forms used in the Indian subcontinent. The

captured image is subjected to skew correction, text-graphics separation, line segmentation, zone detection, and word and character segmentation using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters, which number about seventy-five and account for about ninety-six percent of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by a template matching approach. Feature detection is simple and robust, and preprocessing steps such as thinning and pruning are avoided.

Madhvanath and Govindaraju [1998] proposed a methodology of coarse holistic features and heuristic prediction of ideal features from ASCII to address certain issues. One of these issues concerns perceptual holistic features. These are visually obvious features of the word shape that have been cited in reading studies as being used in fluent reading. While these features have been used for word recognition when the lexicon of possible words is small and static, their application to the general problem of omni-scriptor handwritten word detection is limited by their variability at the word level and the paucity of samples for word-level training. Real-world examples of handwritten words are instances of the ideal paradigm of the word class, distorted by the scriptor, stylus, medium and intervening electronic imaging processes. This provides the basis for the proposed methodology, which has applications in verification and lexicon reduction for handwritten word recognition.

Reddy and Nagabhushan [1998] described a three-dimensional (3-D) neural network recognition system for conflict resolution in the recognition of unconstrained handwritten numerals.
This neural network classifier is a combination of a modified self-organizing map

(MSOM) and learning vector quantization (LVQ). The 3-D neural network recognition system has many layers of such neural network classifiers, and the number of layers forms the third dimension. Experiments were carried out with SOM, MSOM, SOM with LVQ, and MSOM with LVQ networks. These experiments on a database of unconstrained handwritten samples show that the combination of MSOM and LVQ performs better than the other networks in terms of classification, recognition and training time. The 3-D neural network eliminates the substitution error.

Tang et al. [1998] presented an off-line recognition system based on multifeature and multilevel classification for handwritten Chinese characters. Ten classes of multifeatures, such as peripheral shape features, stroke density features and stroke direction features, are used in the proposed system. The multilevel classification scheme consists of a group classifier and a five-level character classifier, for which two new techniques, overlap clustering and a Gaussian distribution selector, are developed. Experiments were conducted on a number of Chinese characters in daily use, and the paper claims a high recognition rate.

Jung et al. [1999] proposed a segmentation method for machine-printed character strings of arbitrary length. It exploits recognition-based segmentation combined with heuristic and holistic methods. The merged part of touching characters produces patterns whose shapes differ from the primitive character patterns. However, the far-left and far-right patterns in the touching characters are not affected by the touching. The algorithm first constructs a line adjacency graph (LAG) from a word image. Blobs are found as connected components of the LAG, and small dot noise is removed.
Secondly, since a word in English can be divided into three typographical zones (the ascender, the x-height and the descender), the location of the connected components among those zones is also

examined. Thirdly, the right profile of the touching character is compared with that of the sample characters in the prototype, and the touching characters are then segmented using the width of one of the candidates in the prototype. Finally, the upward, downward and left profiles of the segmented pattern are compared with those of the candidate. The third and final steps are repeated until confirmed by successful matching of the resulting character patterns. The method has been tested on touching characters in Times and Helvetica, which are proportional-pitch fonts, and was found to be promising.

Lee and Kim [1999] proposed an integrated segmentation and recognition method using a cascade neural network. In the proposed method, a new type of cascade neural network is developed to learn the spatial dependences in connected handwritten numerals. This cascade neural network is extended from the multilayer feed-forward neural network, and the extension improves the discrimination and generalization power. The performance of the proposed method is verified through recognition experiments. The experimental results indicate that the proposed method has higher discrimination and generalization power than previous integrated segmentation and recognition (ISR) methods, and that its network size is smaller.

Lehal and Singh [1999] described a feature extraction and hybrid classification scheme for machine recognition of Gurmukhi characters, using a binary decision tree and a nearest neighbor classifier. The classification process is completed in three stages. In the first stage, the characters are grouped into sets depending on their zonal positions. In the second stage, the characters in the middle zone set are further distributed into smaller subsets by a binary

decision tree using a set of robust and font-independent features. In the third stage, the nearest neighbor classifier is applied using the special features that distinguish the characters. The significant point of this scheme is that a character image is tested against only certain subsets of classes at each stage, which enhances computational efficiency.

Oh et al. [1999] proposed a new approach to combining multiple features in handwriting recognition based on two ideas: feature-selection-based combination and class-dependent features. A non-parametric method is used for feature evaluation. The first part of the paper is devoted to the evaluation of features in terms of their class separation and recognition capabilities. In the second part, multiple feature vectors are combined to produce a new feature vector. Based on the fact that a feature has different discriminating powers for different classes, a new scheme for selecting and combining class-dependent features is proposed. In this scheme, a class is considered to have its own optimal feature vector for discriminating itself from the other classes. Using an architecture of modular neural networks as the classifier, a series of experiments was conducted on unconstrained handwritten numerals. The results indicate that the selected features are effective in separating pattern classes and that the new feature vector, derived from a combination of two types of features, further improves the recognition rate.

Arica and Yarman [2000] introduced a set of one-dimensional features to represent two-dimensional shape information for the HMM (Hidden Markov Model) based handwritten optical character recognition problem. The proposed feature set embeds two-dimensional information into an observation sequence in the form of a one-dimensional string selected from a code book. It provides a consistent normalization among distinct classes of shapes, which is very convenient for HMM-based shape recognition schemes.
The normalization parameters, which maximize the recognition rate, are dynamically estimated in the training

stage of the HMM. The proposed character recognition system is tested on handwritten data from the NIST database and a local database. The experimental results indicate very high recognition rates.

Chen and Wang [2000] proposed a new approach for segmenting single- or multiple-touching handwritten numeral strings (two digits). Most of the available algorithms for segmenting connected digits focus on the analysis of foreground pixels. Some concentrate on the analysis of background pixels only, and others depend on a recognizer. In this paper, a combination of background and foreground analysis is used instead. Thinning of both the foreground and background regions is first performed on the image of the connected numeral string, and the feature points on the foreground and background skeletons are extracted. Several possible segmentation paths are then constructed, and useless strokes are removed in the process. Finally, the geometric properties of each possible segmentation path are determined, and these parameters are analyzed with a mixture Gaussian probability function to decide on the best segmentation path or to reject the candidates. Experimental results show that the proposed algorithm achieves a good accuracy rate.

Kim et al. [2000 a] presented a methodology that combines an HMM (hidden Markov model) and an MLP (multilayer perceptron) for cursive word recognition. An explicit segmentation-based HMM is designed and combined with an implicit segmentation-based MLP using weighting coefficients. The main motivation is that more distinct classifiers can complement each other better. A new probability measure for the hybrid classifier is introduced along with conventional combining schemes. The results reported in the paper showed good

segmentation.

Kim et al. [2000 b] described a scheme for recognizing unconstrained handwritten numeral strings by a composite segmentation method that combines two concepts: recognition-free and recognition-based segmentation. A digit group detector is designed to separate touching digits from isolated digits by the recognition-free segmentation method. The touching digits are subsequently segmented by prioritizing segmentation points, which are obtained by analyzing the ligature and touching types. Four special kinds of candidate segmentation points and six touching types are defined to obtain more stable segmentation points. According to the paper, the proposed algorithm achieved a good success rate.

Lehal and Singh [2000] presented a system for recognition of machine-printed Gurmukhi script. Character recognition in Gurmukhi faces major problems related to the unique characteristics of the script, such as connectivity of characters on the headline, a large number of similar characters, and two or more characters in a word having intersecting minimum bounding rectangles. A set of very simple and easy-to-compute features is used, and a hybrid classification scheme consisting of a binary decision tree and nearest neighbors is employed.

Nicchiotti and Scagliola [2000] proposed a simple procedure for the over-segmentation of cursive words, based on the analysis of the handwritten profiles and on the extraction of white holes. Straight segmentation tries to decompose the image into a set of sub-images, each corresponding to a character. In segmentation recognition strategies, the image is subdivided into a set of sub-images (strokes) whose combinations are used to generate character candidates. The number of sub-images is greater than the number of characters, and the process is therefore also referred to as over-segmentation. Recognition is then
used to select the correct character hypothesis from the character candidates. The procedure follows the policy of using simple rules on complex data and sophisticated rules on simpler data. Experimental results show robustness and performance comparable with the best reported in the literature. Plamondon and Srihari [2000] observed that handwriting has persisted as a means of communication and of recording information in day to day life even with the introduction of new technologies. Given its significance in human transactions, machine recognition of handwriting has practical significance, as in reading handwritten notes, postal addresses on envelopes, amounts on bank cheques, handwritten fields in forms, etc. This overview describes the nature of handwritten language and how it is transduced into electronic data. It also gives insight into the concepts behind written language recognition algorithms. Both the online case (which pertains to the availability of trajectory data during writing) and the off line case (which pertains to scanned images) are considered. Algorithms for preprocessing, character and word recognition, and performance with practical systems are indicated. Other fields of application, such as signature verification, writer authentication, and handwriting learning tools, are also considered in the paper. Alimoglu and Alpaydin [2001] investigated techniques to combine multiple representations of a handwritten digit to increase classification accuracy without significantly increasing system complexity or recognition time. In pen based recognition, the input is the dynamic movement of the pen tip over the pressure sensitive tablet. There is also the image formed as a result of this movement. On a real world database containing more than eleven thousand handwritten digits, the authors noticed that the two multi-layer perceptron (MLP) based classifiers using these
representations make errors on different patterns, implying that a suitable combination of the two would lead to higher accuracy. Therefore, they implemented and compared voting, mixture of experts, stacking and cascading. Combining the two MLP classifiers yields higher accuracy because the two classifiers/representations fail on different patterns. The authors especially advocate a multistage cascading scheme in which the second, costlier image based classifier is employed only in a small percentage of cases. Arica and Yarman [2001] provided an update for readers working in the character recognition area. First, an overview of character recognition systems and their evolution over time is presented. Then, the available character recognition (CR) techniques, with their strengths and weaknesses, are reviewed. Finally, the current status of CR is discussed and directions for future research are suggested. Special attention is given to off line handwriting recognition, since this area requires more research to reach the ultimate goal of machine simulation of human reading. Madhvanath and Govindaraju [2001] took a fresh look at the potential role of the holistic paradigm in handwritten word recognition. Under the holistic paradigm, a word is treated as a single, indivisible entity and is recognized from its overall shape rather than from its character contents. The survey presents an overview of studies of the reading process that provide evidence for the existence of a parallel holistic reading process in both developing and skilled readers. Handwriting recognition approaches are characterized as forming a continuous spectrum based on the visual complexity of the unit of recognition employed. An attempt is made to interpret well known paradigms of word recognition in this framework.
An overview of the features, methodologies, representations, and matching techniques employed by holistic approaches is also presented in the paper.
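The cascading idea reported for Alimoglu and Alpaydin [2001] above can be illustrated with a minimal sketch. It assumes two classifiers that each return a label and a confidence score; the function names, the confidence threshold and the toy score vectors are hypothetical and are not taken from the paper. The cheap classifier answers whenever it is confident enough, and the costlier one is consulted only for the remaining cases.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, List[float]]
Classifier = Callable[[Sample], Tuple[int, float]]


def make_cascade(cheap: Classifier, costly: Classifier, threshold: float):
    """Build a two-stage cascade: fall back to `costly` only when the
    confidence of `cheap` is below `threshold` (an assumed tuning knob)."""
    def classify(sample: Sample) -> Tuple[int, str]:
        label, confidence = cheap(sample)
        if confidence >= threshold:
            return label, "cheap"
        label, _ = costly(sample)
        return label, "costly"
    return classify


def argmax_with_confidence(scores: List[float]) -> Tuple[int, float]:
    # Label is the index of the highest score; confidence is that score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical stand-ins for the pen based and image based classifiers:
# each simply reads precomputed class scores stored in the sample.
def pen_classifier(sample: Sample) -> Tuple[int, float]:
    return argmax_with_confidence(sample["pen"])


def image_classifier(sample: Sample) -> Tuple[int, float]:
    return argmax_with_confidence(sample["image"])


classify = make_cascade(pen_classifier, image_classifier, threshold=0.8)

easy = {"pen": [0.05, 0.9, 0.05], "image": [0.1, 0.8, 0.1]}
hard = {"pen": [0.4, 0.35, 0.25], "image": [0.1, 0.2, 0.7]}

print(classify(easy))  # the cheap stage is confident enough
print(classify(hard))  # the costly stage is consulted
```

In this toy setup the threshold controls how often the second stage runs, which is the trade-off between accuracy and recognition time that the cascading scheme is meant to balance.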

Srihari et al. [2001] undertook a study to objectively validate the hypothesis that handwriting is individualistic. Handwriting samples of one thousand five hundred individuals, representative of the US population with respect to gender, age, ethnic groups, etc., were obtained. Differences in handwriting were analyzed using computer algorithms that extract features from scanned images of handwriting. Attributes characteristic of the handwriting, such as line separation, slant and character shapes, were obtained. These attributes, which are a subset of those used by expert document examiners, were used to quantitatively establish individuality through machine learning approaches. Using global attributes of handwriting and very few characters in the writing, the ability to determine the writer with a high degree of confidence was established. The work is a step towards providing scientific support for admitting handwriting evidence in court. The mathematical approach and the resulting software also have the promise of aiding the expert document examiner. Acharyya and Kundu [2002] presented an efficient and computationally fast method for segmenting the text and graphics parts of document images based on textural cues. It is assumed that the graphics part has different textural properties from the non-graphics (text) part. The segmentation method uses multiscale wavelet analysis and statistical pattern recognition. The authors used M band wavelets, which decompose an image into M × M bandpass channels. Various combinations of these channels represent the image at different scales and orientations in the frequency plane. The objective is to transform the edges between textures into detectable discontinuities and to create feature maps that give a measure of the local energy around each pixel at different scales.
From these feature maps, a scale space signature is derived, which is the vector of features at different scales taken at each single pixel in an image. The paper claims that segmentation is achieved by simple analysis of the scale space signature with traditional
k means clustering. No prior information regarding the font size, scanning resolution, type of layout, etc. of the document is assumed in the proposed segmentation scheme. Arica and Yarman [2002] proposed a new analytic scheme, which uses a sequence of image segmentation and recognition algorithms, for the off line cursive handwriting recognition problem. First, some global parameters, such as slant angle, baselines, stroke width and height, are estimated. Second, a segmentation method finds character segmentation paths by combining gray scale and binary information. Third, a hidden Markov model (HMM) is employed for shape recognition to label and rank the character candidates. For this purpose, a string of codes is extracted from each segment to represent the character candidates. The estimation of feature space parameters is embedded in the HMM training stage together with the estimation of the HMM model parameters. Finally, information from a lexicon and from the HMM ranks is combined in a graph optimization problem for word level recognition. This method corrects most of the errors produced by the segmentation and HMM ranking stages by maximizing an information measure in an efficient graph search algorithm. The experiments indicate higher recognition rates compared to the available methods reported in the literature. Ashwin and Sastry [2002] described an OCR system for printed text documents in Kannada, which is a South Indian language. A scanned image of a page of Kannada text is given as input to the system, and the output is a machine editable file compatible with most typesetting software. The proposed system extracts words from the document image, and the segmented words are further divided into sub character level pieces. The structure of the script is used for segmentation. A novel set of features for the recognition problem, which are computationally simple to extract, is proposed.
The final recognition is achieved by
employing a number of two class classifiers based on the Support Vector Machine (SVM) method. The recognition is independent of the font and size of the printed text. Garain and Chaudhuri [2002] observed that one of the important reasons for poor recognition rates in optical character recognition (OCR) systems is error in character segmentation. The existence of touching characters in scanned documents is a major obstacle to designing an effective character segmentation procedure. In this paper, a new technique based on fuzzy multifactorial analysis is presented for identification and segmentation of touching characters. A predictive algorithm is developed for effectively selecting possible cut columns for segmenting the touching characters. The proposed method has been applied to printed documents in Devanagari and Bangla, as the authors considered these the two most popular scripts of the Indian sub continent. The results obtained from a test set of considerable size show that a reasonable improvement in recognition rate can be achieved with a modest increase in computation. Kapoor et al. [2002] proposed an accurate and exhaustive approach to detect the skew angle of images of words/characters of cursive Devanagari script. This approach was applied to 235 writing samples and a total collection of around 6000 samples. It is efficient in terms of time and simpler than the existing processes. The method extends the work carried out by Pal and Chaudhuri. A heuristic approach is applied to detect the skew angle, and the inherent dominating features of the structure of the Devanagari script are used to accurately calculate the skew of a Devanagari word. Pal et al. [2002] dealt with a new scheme for automatic segmentation of unconstrained handwritten connected numerals. This approach is mainly based on the water reservoir concept. A reservoir is a metaphor to illustrate the region where numerals touch.
Reservoir is obtained
by considering accumulation of water poured from the top or from the bottom of the numerals. First, considering reservoir location and size, touching positions are decided. Next, by analyzing the reservoir boundary, touching position and topological features of the touching pattern, the best cutting point is determined. Finally, the cutting path for segmentation is generated in combination with morphological structural features. Pal and Datta [2003] proposed a robust scheme to segment unconstrained handwritten Bangla text into lines, words and characters. For line segmentation, the text is first divided into vertical stripes. The stripe width is computed by statistical analysis of the text height in the document. The horizontal histograms of these stripes and the relationship of the minimal values of the histograms are used to segment text lines. Lines are segmented into words based on the vertical projection profile. For segmentation of characters, the water reservoir principle is used. First, isolated and touching characters in a word are identified. Next, touching characters of the word are segmented based on the reservoir base area points and structural features of the component. Devessar et al. [2003] suggested a new approach to segment machine printed Gurmukhi text. To resolve the issue of touching characters, a two pass mechanism is used. In pass one, the segmentation point is approximated, while in pass two the cutting point is optimized. This approach has been very successful in segmenting pairs as well as triplets of touching characters. It can easily be extended to other Indian scripts such as Devanagari and Bangla, which have horizontal lines at the top called headlines. Pal and Sarkar [2003] worked on an Optical Character Recognition system for printed Urdu. Here, the document image is captured using a flatbed scanner and passed through skew correction, line segmentation and character segmentation modules.
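As a rough illustration of the water reservoir idea used in the schemes of Pal et al. [2002] and Pal and Datta [2003] described above, the sketch below pours water from the top onto the upper profile of a binary component and reports how much each column would hold. This is a simplified one dimensional reading of the concept under stated assumptions (a binary image stored as a list of rows, 1 for ink), not the authors' algorithm, which also considers reservoirs from the bottom, reservoir boundaries and topological features. The function names are hypothetical.

```python
from typing import List

Image = List[List[int]]  # rows of 0 (background) and 1 (ink)


def top_profile(image: Image) -> List[int]:
    """Height of the topmost ink pixel in each column, measured from the
    bottom row; 0 when the column has no ink."""
    rows, cols = len(image), len(image[0])
    heights = []
    for c in range(cols):
        height = 0
        for r in range(rows):
            if image[r][c]:
                height = rows - r
                break
        heights.append(height)
    return heights


def top_reservoir_depths(image: Image) -> List[int]:
    """Depth of water each column retains when water is poured from the top.

    A column holds water up to the lower of the highest ink to its left
    and the highest ink to its right (including itself).
    """
    heights = top_profile(image)
    n = len(heights)
    left_max, right_max = [0] * n, [0] * n
    running = 0
    for i in range(n):
        running = max(running, heights[i])
        left_max[i] = running
    running = 0
    for i in reversed(range(n)):
        running = max(running, heights[i])
        right_max[i] = running
    return [min(left_max[i], right_max[i]) - heights[i] for i in range(n)]


# Two vertical strokes joined at the bottom, as in a pair of touching digits.
touching = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

print(top_reservoir_depths(touching))
```

Columns with a large depth lie inside a top reservoir, and the base of that reservoir is a natural place to start looking for a touching position in this simplified setting.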
These modules
are developed by combining conventional and newly proposed techniques. Next, individual characters are recognized using a combination of topological, contour and water reservoir concept based features. The feature detection methods are simple and robust. This approach achieves good character level accuracy on average. Pal et al. [2003 a] dealt with a new technique for automatic segmentation of unconstrained handwritten connected numerals. A robust scheme is presented to take care of the variability in the writing styles of different individuals. The scheme is mainly based on features obtained from the water reservoir concept. A reservoir is a metaphor to illustrate the region where numerals touch. The reservoir is obtained by considering accumulation of water poured from the top or from the bottom of the numerals. First, considering reservoir location and size, the touching position (top, middle or bottom) is decided. Next, by analyzing the reservoir boundary, touching position and topological features of the touching pattern, the best cutting point is determined. Finally, the cutting path for segmentation is generated in combination with morphological structural features. Pal et al. [2003 b] stated that a document page may contain two or more different scripts. For Optical Character Recognition (OCR) of such a page, it is necessary to separate the different scripts before feeding them to their individual OCR systems. In this paper, an automatic scheme is presented to identify text lines of different Indian scripts in a document. For the separation task, the scripts are first grouped into a few classes according to script characteristics. In the next step, features based on the water reservoir principle, contour tracing, profile, etc. are employed to identify them without using any expensive OCR like algorithms. Zhang et al. [2003] studied the analysis of handwritten characters
(allographs) and found that it plays an important role in forensic document examination. However, a comprehensive and quantitative study of the individuality of handwritten characters has so far been lacking. Based on a large number of handwritten characters extracted from handwriting samples of one thousand individuals in the US, the individuality of handwritten characters is quantitatively measured through identification and verification models. The study shows that, in general, alphabetic characters bear more individuality than numerals, and that using a certain number of characters significantly outperforms the global features of handwriting samples in handwriting identification and verification. Moreover, the quantitative measurement of the discriminative power of characters offers general guidance for selecting the most informative characters when examining forensic documents. Grau et al. [2004] presented a new image segmentation system. The system is based on computing a tree representation of the original image in which image regions are assigned to tree nodes, followed by a correspondence process with a model tree that embeds prior knowledge about the images. An algorithm is proposed that minimizes an error function quantifying the difference between the input image tree and the model tree. Another algorithm is proposed for automatically calculating the model tree from a set of manually segmented images. Results on synthetic and MR brain images are presented in the paper. Pal and Roy [2004] stated that there are printed artistic documents in which the text lines of a single page may not be parallel to each other. These text lines may have different orientations or may be curved. For optical character recognition (OCR) of these documents, such lines need to be extracted properly. A novel scheme, mainly based on the water reservoir analogy, is proposed to extract individual
text lines from printed Indian documents containing multioriented and/or curved text lines. In the proposed scheme, connected components are first labeled and identified as either isolated or touching. Next, each touching component is classified as either straight type (S-type) or curve type (C-type), depending on the reservoir base area and envelope points of the component. Based on the type of a component, two candidate points are computed from each touching component. Finally, candidate regions (neighborhoods of the candidate points) of each component are detected. After analyzing these candidate regions, components are grouped to obtain individual text lines. Tripathy and Pal [2004] proposed a scheme based on the water reservoir concept for the segmentation of unconstrained Oriya handwritten text into individual characters. The text image is first segmented into lines, lines are then segmented into individual words, and words into individual characters. For line segmentation, the document is divided into vertical stripes. The width of a stripe is calculated by analyzing the heights of the water reservoirs obtained from different components of the document. Stripe wise horizontal histograms are then computed, and the relationship of the peak valley points of the histograms is used for line segmentation. Based on the vertical projection profile and structural features of Oriya characters, text lines are segmented into words. For character segmentation, the isolated and connected characters in a word are first detected. Touching characters of the word are then segmented using structural, topological and water reservoir concept based features. Zheng et al. [2004] addressed the problem of identifying text in noisy document images. The paper focuses on segmenting and distinguishing handwriting from machine printed text, because handwriting in a document often
indicates corrections, additions, or other supplemental information that should be treated differently from the main content, and because the segmentation and recognition techniques required for machine printed and handwritten text are significantly different. The proposed scheme treats noise as a separate class and models noise based on selected features. Trained Fisher classifiers are used to identify machine printed text and handwriting from noise. The context is further exploited to refine the classification. A Markov Random Field based approach is used to model the geometrical structure of the printed text, handwriting, and noise to rectify misclassifications. According to the results in the paper, the scheme can significantly improve page segmentation in noisy document collections. Jindal et al. [2005] identified different kinds of degradation found in Gurmukhi script documents, namely touching characters, broken characters, heavy printed characters, faxed documents and typewritten documents. The problems associated with each kind of degradation are discussed, along with some possible solutions. Pal and Tripathy [2005] proposed a scheme towards the recognition of Indian stylistic documents. Here, using features based on the water reservoir concept, the characters are segmented from the stylistic documents without any skew correction. Next, individual characters are recognized. For recognition, contour distances of the outer contour points of the characters are calculated from the centroid. These contour distances are then arranged in a particular order to obtain size and rotation invariant features. Finally, the input character is recognized by computing statistical features on these arranged contour distances. Jindal et al. [2006] stated that multiple horizontally overlapping lines are normally found in printed newspapers of almost every language due to high compression methods
used for printing the newspapers. For any optical character recognition (OCR) system, the presence of horizontally overlapping lines decreases the recognition accuracy drastically. In this paper, the authors proposed a solution for segmenting horizontally overlapping lines. The whole document is divided into strips, and the proposed algorithm is applied to segment horizontally overlapping lines and associate small strips with their respective lines. The results reveal that the algorithm is almost ninety percent accurate when applied to the Gurmukhi script. Li et al. [2006] dealt with curvilinear text line detection and segmentation in handwritten documents. Assuming no prior knowledge of the script, the authors modeled text line detection as an image segmentation problem by enhancing text line structure using a Gaussian window and adopting the level set method to evolve text line boundaries. Experiments show that the proposed method achieves high accuracy in detecting text lines in both handwritten and machine printed documents in many scripts. Jindal et al. [2007] stated that horizontally overlapping lines are normally found in printed newspapers of any Indian script. Along with these overlapping lines, a few other broken components of a line (stripes), containing less text than a complete line, are also found. The horizontally overlapping lines and other stripes make it very difficult to estimate the boundary of a line, leading to incorrect line segmentation, which in turn decreases the recognition accuracy. In this paper, the authors proposed a solution for segmenting horizontally overlapping lines and solved the problem of other stripes in the eight most widely used printed Indian scripts. The whole document is divided into stripes, and the proposed algorithm is applied to segment horizontally overlapping lines and associate small stripes with their respective lines. Sulem et al.
[2007] surveyed line segmentation methods and observed that
there is a huge number of historical documents in libraries and in various National Archives that have not been converted to electronic form. Although automatic reading of complete pages remains, in most cases, a long term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is to segment the document into text lines. Because of the low quality and complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The authors presented a survey of existing methods, developed during the last decade and dedicated to documents of historical interest. Jindal et al. [2008] stated that the performance of an OCR system depends upon the printing quality of the input document. A number of OCR systems correctly identify fine printed documents in Indian and other scripts, but little work has been reported on the recognition of degraded documents. Consequently, the performance of a standard OCR system that works well on fine printed documents decreases when it is tested on degraded documents. Feature extraction is an important task in designing an OCR system for recognizing degraded documents. In this paper, the authors discussed efficient structural features selected for recognizing degraded printed Gurmukhi script characters. Li et al. [2008] proposed a novel approach based on density estimation and a state of the art image segmentation technique called the level set method. From an input document image, a probability map is estimated, in which each element represents the probability that the underlying pixel belongs to a text line. The level set method is then exploited to determine the boundaries of neighboring text lines by evolving an initial estimate.
The proposed algorithm does not use any script specific knowledge.

Extensive quantitative experiments on freestyle handwritten documents with diverse scripts, such as Arabic, Chinese, Korean, and Hindi, demonstrate that the algorithm consistently performs well. Palacios and Gupta [2008] described the problems of processing cheques. Nowadays, in many countries, bank cheques are preprinted with the account number and the cheque number in special ink and format. These two numeric fields can be easily read and processed using automated techniques. However, the amount written on a filled cheque is usually read by human eyes, which involves significant time and cost. The system described in this paper uses the scanned image of a bank cheque to 'read' the cheque. It includes three main modules which, once implemented, allow for fully automated bank cheque processing: the detection of strings within the image, the segmentation and recognition of strings in a feedback loop, and the post processing steps that help to ensure higher recognition accuracy. The major benefit of the integrated system is the ability to address the complex problem of reading handwritten bank cheques by implementing efficient algorithms for each processing step. According to the paper, all modules have been implemented and tested for reading the value of the cheque using different image databases. Due to the particular requirements of this application, the system can be tuned to yield low levels of incorrect readings. This leads to higher levels of rejection than those encountered in other handwriting recognition applications. A 'rejected' cheque can be read subsequently by human eyes or by other, more advanced automated approaches. However, a cheque 'read' incorrectly is more difficult to deal with, in terms of the cost and time involved in rectifying the mistake.
As such, the proposed architecture can be geared towards producing the most suitable balance between inaccurate readings and rejection level, in accordance with user preferences. The experimental results presented in the paper do not focus on the best
possible results for a particular database of cheques; rather, they show the benefits attained independently by each of the proposed modules. Bukhari et al. [2009] stated that handwritten document images contain text lines with multiple orientations, touching and overlapping characters in consecutive text lines, and small inter line spacing, all of which make text line segmentation a difficult task. In the paper, the authors modeled text line extraction as a general image segmentation task. The central lines of parts of text lines are computed using ridges over the smoothed image. State of the art active contours (snakes) are then adapted over the ridges, which results in text line segmentation. Chaudhuri and Bera [2009] dealt with text line identification in handwritten Indian scripts. The scripts discussed in the paper include Bangla, Hindi, Gurmukhi and Malayalam, as well as English. A new dual method based on the interdependency between text lines and inter line gaps is proposed. The scheme draws curves simultaneously through the text and inter line gap points found from strip wise histogram peaks and inter peak valleys. The curves start from the left and move right, while points of one type guide the curve of the other type so that the curves do not intersect. These curves are then allowed to evolve iteratively so that the text line curves cross more character strokes and the inter line curves cross fewer, while the curves are kept as straight as possible. After several iterations, the curves stabilize and define the final text lines and inter line gaps. The approach works well on text of different scripts with various geometric layouts, including poetry. Philip and Samuel [2009] described an Optical Character Recognition (OCR) system for printed text documents in Malayalam, which is one of the South Indian languages.
It is a known fact that Indian scripts are rich in patterns, and the combinations of such patterns make the problem even more complex. In the paper, however, these complex patterns
are exploited to obtain the solution. The proposed system segments the scanned document image into text lines, words, and further into characters and sub characters. The proposed segmentation algorithm is influenced by the structure of the script. A novel set of features that are computationally simple to extract is proposed. The proposed approaches are based on the distinctive structural features of machine printed text lines written in these scripts. A lateral cross sectional analysis is performed along each row of the normalized binary image matrix, resulting in distinct features. The final recognition is done through classifiers based on the Support Vector Machine (SVM) method. The proposed algorithms have been tested on a variety of printed Malayalam characters and give good results. Yin and Liu [2009] suggested a novel text line segmentation algorithm based on minimal spanning tree (MST) clustering with distance metric learning. Given a distance metric, the connected components (CCs) of the document image are grouped into a tree structure, from which text lines are extracted by dynamically cutting the edges using a new hyper volume reduction criterion and a straightness measure. By learning the distance metric in a supervised manner on a dataset of pairs of CCs, the proposed algorithm is made robust enough to handle various documents with multi skewed and curved text lines. The results presented in the paper suggest that the proposed method works very well. Das et al. [2010] addressed the segmentation of overlapped text lines and characters in Telugu text. Segmentation is an important task in any OCR system, and the accuracy of an OCR system depends mainly on the segmentation algorithm used. Segmentation of Telugu text is difficult compared with Latin based languages because of its structural complexity and larger character set. It contains vowels, consonants and compound characters, and some of the characters may overlap.
Profile based methods can only segment non overlapping lines and characters. The proposed algorithm
is based on projection profiles, connected components and spatial vertical relationships. To segment the image into lines and characters, the connected components are first extracted from the document image and labeled. For each connected component, the top, bottom, left and right positions are identified. A nearest neighborhood method is then used to cluster the connected components. According to the results shown in the paper, good character segmentation accuracy can be achieved even with overlapping lines and characters. Kumar and Sengar [2010] described line, word, character and top character segmentation for printed Hindi text in Devanagari script and Gurmukhi script. The global horizontal projection method computes the sum of all black pixels in every row and constructs the corresponding histogram. Based on the peak/valley points of the histogram, individual lines and words are separated. Nallapareddy et al. [2010] proposed a robust method for segmentation of individual text lines based on the modified histogram obtained from run length based smearing. A complete line and word segmentation system for some popular printed Indian languages is presented in the paper. Both foreground and background information is used for accurate line segmentation. There may be touching or overlapping characters between two consecutive text lines, and most line segmentation errors are caused by such occurrences. Sometimes, interline space and noise also make line segmentation a difficult task. The proposed method can handle these situations accurately. Word segmentation from individual lines is also discussed. The results of the proposed method on documents in Bangla, Devanagari, Kannada and Telugu scripts, as well as on some multi script documents, are shown in the paper. Aradhya and Naveena [2011] proposed a novel method for text line segmentation of

33 unconstrained handwritten Kannada script. The proposed method consists of two phases. In the first phase, mathematical morphology technique is used to bridge the gap between character components. In the second phase, component extension technique is used for text line extract. Mahender and Kale [2011] stated that writing has been the most natural method of collecting, storing and transmitting information through the centuries, now serves not only for the communication among humans, but also for the communication of humans and machines. The free style handwriting recognition is difficult not only because of the great amount of variations involved in the shape of characters, but also because of the overlapping and the interconnection of the neighboring characters. Authors have presented a structured based feature extraction and rule based recognition scheme for handwritten Marathi word. Pradeep et. al. [2011] gave an off line handwritten alphabetical character recognition system using multilayer feed forward neural network. A new method, called, diagonal based feature extraction is introduced for extracting the features of the handwritten alphabets. The proposed recognition system performs quite well yielding higher levels of recognition accuracy compared to the systems employing the conventional horizontal and vertical methods of feature extraction. This system can be suitable for converting handwritten documents into structural text form and recognizing handwritten names. Borrowing from past literature, it can be summarized that though there is rapidly growing body of literature on how to segment scanned documents of International scripts as well as Indian languages but relatively few studies are there that have examined how to effectively segment a document written in Gurmukhi script. The studies which are available for Gurmukhi script that mostly deals with machine printed texts. 59
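As an illustration of the global horizontal projection method summarized above from Kumar and Sengar [2010], the following minimal Python sketch counts the black pixels on every row and cuts the page at the empty valleys of the resulting histogram. It is not the authors' implementation: the function names, the 0/1 list of lists image representation and the `min_ink` threshold are assumptions made here for illustration only.

```python
def horizontal_projection(image):
    """Count black pixels on every row of a binary image (1 = black, 0 = white)."""
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Return (top, bottom) row indices, inclusive, of each text line.

    Assumption for this sketch: a row belongs to a text line when its black
    pixel count is at least min_ink; runs of rows below that count are the
    valleys that separate consecutive lines.
    """
    profile = horizontal_projection(image)
    lines = []
    start = None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                      # a new text line begins
        elif count < min_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # line that runs to the last row
        lines.append((start, len(profile) - 1))
    return lines


# Toy page with two text lines separated by two blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0, 0],
]
print(horizontal_projection(page))  # [0, 3, 2, 0, 0, 5, 2]
print(segment_lines(page))          # [(1, 2), (5, 6)]
```

A sketch of this kind separates lines only when at least one blank row lies between them, which is consistent with the limitation noted above that profile based methods handle only non overlapping lines.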

2.2 Need of The Study

Text can be categorized in order of increasing difficulty, as noted by Lu [1995]: well separated and unbroken characters in proportional spacing, in which characters occupy different amounts of horizontal space depending on their shapes; broken characters, where a single character has more than one component; touching characters, where more than one character lies in a single connected component; and text that contains both broken and touching characters. In most OCR systems, character recognition is performed on individual characters. The pre processing stage yields a clean document, in the sense that a sufficient amount of shape information, high compression and low noise are obtained on the normalized image.

According to Pal et al. [2003 b], in India, there are 18 official (Indian constitution accepted) languages. Two or more of these languages may be written in one script. Twelve different scripts are used for writing these languages. Under the three language formula, many Indian documents are written in three languages, namely English, Hindi and the state official language. For example, a money order form in the state of Punjab may be written in English, Hindi and Gurmukhi, because Gurmukhi (Punjabi) is the state official language of Punjab. Some properties common to Indian language scripts are described below.

Properties of Indian Language Scripts

Assamese, Bangla, English, Gujarati, Hindi, Konkani, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Rajasthani, Sanskrit, Tamil, Telugu and Urdu are the official languages of India. Hindi is the most popular language in India, followed by Bangla. Globally, English is the most popular language, whereas Hindi and Bangla are the 4th and 5th most popular languages in the world. The scripts used for the Indian languages are not all different. One script may be used to write several languages. For example, Bangla script is used to write the Assamese and Bangla (Bengali) languages, while Devanagari script is used to write the Hindi, Marathi, Rajasthani, Sanskrit and Nepali languages. Constitution wise, there are twelve different scripts which are used to write these 18 languages. Pal et al. [2003 b] stated that these scripts are Urdu, Tamil, Telugu, Gurmukhi (Punjabi), Devanagari, Bangla, English, Gujarati, Kannada, Kashmiri, Malayalam and Oriya. Examples of different script lines are shown in figure 2.1.

Figure 2.1: Different Indian script lines (from top to bottom: Devanagari, Bangla, Gurmukhi, Malayalam, Kannada, English, Tamil, Telugu, Urdu, Kashmiri, Gujarati, Oriya)

Most Indian scripts have an alphabet system of basic characters, which are the vowel and consonant characters. Apart from these basic characters, there are compound characters formed by combining two or more basic characters. The shape of a compound character is usually more complex than that of its constituent basic characters. In some scripts (such as Gurmukhi, Devanagari or Bangla), many characters of the alphabet system have a horizontal line at the upper part. In Devanagari this line is called sirorekha, while in Bangla it is called matra. In the present study, it is referred to as the head line. When two or more characters are put side by side to form a word, the head line portions of these characters touch one another and generate a long head line, which is used as a feature for script identification.

In most Indian languages, a text line may be partitioned into three zones: higher zone, heart zone and lower zone. The zoning is shown in figure 2.2.

Figure 2.2: Different zones of English, Devanagari and Gurmukhi text line

The higher zone denotes the portion above the head line. The portion below the head line is known as the heart zone. This zone covers the main portion of basic as well as compound characters. The lower zone is the portion below the base line. In texts whose script lines do not contain a head line, the mean line separates the higher zone and the heart zone. The base line separates the heart zone and the lower zone. Pal et al. [2003 b] opined that the mean line (base line) can be defined as an imaginary line where most of the uppermost (lowermost) points of the characters of a text line lie. The uppermost and lowermost boundary lines of a text line are named the upper line and the lower line.

Features of Indian Languages and Scripts

A feature is something present in a symbol or character of a script; for example, a feature can be a side bar, a loop and so on. A character may contain one feature, a combination of features, or none. Kumar et al. [2003] are of the opinion that certain features are common to Indian scripts; some of these are given below.

Common Alphabet: The alphabets of Indian languages have been derived from the Sanskrit alphabet. Usually, there is a common set containing 33 consonants and 15 vowels. In addition, there are three to four consonants and two to three vowels which are used in specific languages or in the classical forms of others. This difference is not very significant in practice. The basic letters of the alphabet are formed by individual consonants and vowels. The only exception is the Tamil language, which uses twelve fewer consonants. However, the structure of Tamil is not very different either, as this change can be modeled as dropping some of the consonants from the master list.

Akshara or Akhar: Akshara is the notion used for the basic unit, called a character, of Indian languages; with reference to Gurmukhi it is also known as akhar. It forms the fundamental linguistic unit, like a character in English. An akhar can be made up of 0, 1, 2 or 3 consonants and a vowel. The combination of one or more akhars makes a word. As the languages are completely phonetic, each akhar can be pronounced independently. Samyuktaksharas are akhars that combine more than one consonant. They are also called combo characters. The last of the consonants is the main one in a samyuktakshara.

Diverse Graphemes: The commonality in the alphabet does not mean that the same graphic forms are used to print it. Each language uses a different script consisting of dissimilar graphemes for printing. Thus, printed matter in one language is inaccessible to readers of another language, even though the underlying alphabet is largely common. As noted above, there are twelve major scripts in India. The Devanagari script is the most widely used one, being used to write Hindi, Marathi, Konkani and Nepali. Here Nepali is the language of the neighboring nation Nepal. For the individual graphemes and their combinations, different philosophies are used in different scripts. Some have a head line while others have non touching graphemes. The grapheme of one of the consonants is usually at the heart of the printed akshara. The vowel appears as a matra or vowel modifier. These can appear above, below, to the right or to the left of it, or in combinations. The supporting consonants of a samyuktakshara also appear as modifier graphemes above, below, to the right or to the left of the main one. These modifiers could be truncated or scaled down forms of the basic consonant, but could also be completely different. They may touch each other or the main consonant in some cases, or may be separated. These rules are not consistent even within a script, and certainly not across scripts.

Formless Font Design: With the wide use of information technology over the last few decades, different fonts have been designed for each Indian script. The fonts are built from glyphs and follow the graphical structure of each script, which is different for different languages. It is not possible to use a consistent set of rules for this step for all scripts. No conventions have been followed.

Gurmukhi Script

Lehal and Singh [2002] noted that the word Gurmukhi is derived from the combination of two words, Guru and Mukh. Gurmukhi means to record the sayings from the mukh (mouth or lips) of the Gurus, i.e. from the Guru's mukh. The credit for originating this script goes to Guru Angad Dev Ji. He not only rearranged but also modified and shaped certain letters into a script. A new shape and order was given to the alphabet, making it precise and accurate. Those letters were retained which depicted sounds of the then spoken language. There was also some rearrangement of the letters: for example, ਸ (sassa) and ਹ (haha) were shifted to the first line, and ੳ (ura) was given the first place in the new alphabet. It is believed that Gurmukhi belongs to the Brahmi family. Aryans developed an Aryan script which is known as Brahmi. This script was adapted by Aryans as per their local needs. The Brahmi script was introduced between the 8th and 6th centuries B.C. Gurmukhi script is primarily used for the Punjabi language, which is the world's 14th most widely spoken language.

Gurmukhi script is a logical composition of its constituent symbols in two dimensions. It is an alphabetic script. Lehal and Singh [1999] explained that the Gurmukhi script alphabet consists of 41 consonants, 12 vowels and 3 half characters which lie at the feet of consonants. The vowels and consonants are shown in figures 2.3 and 2.4 respectively. Besides the consonants and the vowels, the other constituent symbols in Gurmukhi are a set of vowel modifiers called matras, placed to the left, right, above or at the bottom of a character or conjunct, and pure consonant forms (also called half letters) corresponding to some consonants, which yield conjuncts when combined with other consonants.

Figure 2.3: Vowels and Vowel diacritics (Laga Matra)

Figure 2.4: Consonants (Vianjans) of Gurmukhi Script
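The zoning of a text line described earlier in this section (a higher zone above the head line, a heart zone between the head line and the base line, and a lower zone below the base line) can be illustrated with a short Python sketch. The rules used to locate the two lines are assumptions made for this illustration and are not taken from the cited papers: the head line is taken to be the row with the most black pixels, and the base line is taken to be the lowest row whose black pixel count still reaches a fraction `ink_fraction` of the average count below the head line. The function name and the 0/1 list of lists image format are likewise hypothetical.

```python
def find_zones(line_image, ink_fraction=0.5):
    """Split a binary text line image (1 = black) into three zones.

    Each zone is returned as a half-open row range (start, stop).
    Assumptions of this sketch (not from the cited papers):
      * the head line is the row with the largest black pixel count;
      * the base line is the lowest row below the head line whose count is
        at least ink_fraction (0 < ink_fraction <= 1) times the mean count
        of the non-empty rows below the head line;
      * the head line row itself is assigned to the heart (middle) zone.
    """
    profile = [sum(row) for row in line_image]
    n = len(profile)
    head = max(range(n), key=lambda y: profile[y])
    below = [count for count in profile[head + 1:] if count > 0]
    if below:
        threshold = ink_fraction * sum(below) / len(below)
        base = max(y for y in range(head + 1, n) if profile[y] >= threshold)
    else:
        base = head  # nothing below the head line
    return {
        "upper": (0, head),
        "middle": (head, base + 1),
        "lower": (base + 1, n),
    }


# Toy text line: one vowel mark above the head line, three body rows,
# and one sparse row standing in for a half character at the foot.
line = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
]
print(find_zones(line))
# {'upper': (0, 1), 'middle': (1, 5), 'lower': (5, 6)}
```

On handwritten text the head line is rarely a single straight row, so a heuristic of this kind would at best be a starting point.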

The writing style is from left to right, and the concept of upper/lower case (as in English) is absent. Most of the characters have a horizontal line at the upper part. This line, called the head line, mostly connects the characters of a word. Lehal and Singh [2000] suggested that a word in Gurmukhi script can be partitioned into three horizontal zones. The upper zone denotes the region above the head line. The area below the head line, where the major part of each character is located, is the center zone or heart zone. The lower zone lies below the heart zone. These zones are shown in figure 2.5.

a) Upper zone from line number 1 to 2
b) Heart zone from line number 3 to 4
c) Lower zone from line number 4 to 5

Figure 2.5: Three zones in Gurmukhi script

Gurmukhi script has the following characteristics:

- The Gurmukhi script alphabet consists of 41 consonants, 12 vowels and 3 half characters, which lie at the feet of consonants.
- The characters of a word are mostly connected by a horizontal line called the head line.
- All Gurmukhi letters have uniform height. All letters can be written between two parallel horizontal lines; the only exception is ੳ (ura), whose top curve extends beyond the upper line.
- From left to right, letters have almost uniform length; only ਅ (aira) and ਘ (ghaggha) may be slightly longer than the rest.
- The form of a letter is not affected when a vowel symbol or diacritic is attached to it; the only exception is ੳ (ura), to which an additional curve is added that represents two syllables.
- A word in Gurmukhi script can be partitioned into three horizontal zones. The upper zone denotes the region above the head line, where vowels reside, while the middle zone or heart zone represents the area below the head line, where the consonants and some sub parts of vowels are present. The lower zone represents the area below the middle zone, where some of the vowels and certain half characters lie at the foot of consonants.
- The half characters in the lower zone frequently touch the consonants lying above them in the middle zone.
- There are many multi component characters in Gurmukhi script. A multi component character is a character that can be decomposed into isolated parts.
- The bounding boxes of two or more characters in a word may intersect or overlap vertically.

Lehal and Singh [2002] asserted that the Gurmukhi script is a two dimensional composition of consonants, vowels and half characters, which requires segmentation in the vertical as well as the horizontal direction. Thus the segmentation of Gurmukhi text calls for a two dimensional analysis instead of the one dimensional analysis commonly used for Roman script.

The literature survey reveals that a dedicated segmentation method is required for handwritten Gurmukhi script, for the following reasons:

- The letters in cursive writing are often connected.
- The individual letters in a cursive word are often written so as to be unidentifiable as isolated characters.
- Writing style varies from writer to writer.
- If the handwritten line is slanting, it is difficult to segment.
- Writing quality is not uniform throughout a handwritten document.
- Font size, which is very important for character segmentation, cannot be guessed.
- Some handwritten letters are ambiguous. For example, m (in English) can also be interpreted as the pair nn, as shown in figure 2.6 (a). Similarly, in Gurmukhi the character ਗ can be segmented as ਰ followed by ਾ, as shown in figure 2.6 (b).

Figure 2.6 (a): Incorrect Segmentation of a character in English

Figure 2.6 (b): Incorrect Segmentation of a character in Gurmukhi
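Two ideas that recur in this chapter, extracting connected components with their top, bottom, left and right positions, and characters whose bounding boxes overlap, can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, 8-connectivity is assumed, and "overlap" is interpreted here as two boxes sharing at least one column, so that no straight vertical cut separates them.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, bottom, left, right) of 8-connected
    black regions in a binary image (1 = black), in raster scan order."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            top = bottom = r
            left = right = c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, bottom, left, right))
    return boxes


def columns_overlap(a, b):
    """True when two boxes share at least one column (assumed meaning of
    overlap in this sketch): a straight vertical cut cannot separate them."""
    return a[2] <= b[3] and b[2] <= a[3]


# Toy word: a consonant on the left, a second character on the right,
# and a detached mark at the foot of the first consonant.
word = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
boxes = connected_components(word)
print(boxes)                                # [(0, 2, 0, 1), (1, 2, 3, 4), (4, 4, 1, 2)]
print(columns_overlap(boxes[0], boxes[1]))  # False
print(columns_overlap(boxes[0], boxes[2]))  # True
```

In the toy word the foot mark shares a column with the consonant above it, so a purely column based (one dimensional) cut would either split or merge them, which is the situation that motivates the two dimensional analysis discussed above.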
