Amharic Character Recognition using a fast signature based algorithm

Dr JOHN COWELL
Dept. of Computer Science, De Montfort University, The Gateway, Leicester, LE1 9BH, England.
jcowell@dmu.ac.uk

Dr FIAZ HUSSAIN
Dept. of Computing & Information Systems, University of Luton, Park Square, Luton, LU1 3JU, England.
fiaz.hussain@luton.ac.uk

Abstract

The Amharic language is the principal language of over 20 million people, mainly in Ethiopia. An extensive literature survey reveals no journal or conference papers on Amharic character recognition. The Amharic script has 33 basic characters, each with seven orders, giving 231 distinct characters, not including numbers and punctuation symbols. The characters are cursive but not connected and, unlike other cursive scripts, do not use dots. This paper describes the Amharic script and discusses the difficulties of applying conventional structural and syntactic recognition processes. Two statistical algorithms for identifying Amharic characters are described. In both, the characters are normalised for size and orientation. The first compares the character against a series of templates. The second derives a characteristic signature from the character and compares this against a set of signature templates. The signatures used are fifty times smaller than the original character and the recognition process is correspondingly faster, but with some loss of accuracy. The statistical techniques described have been fully implemented and the resulting performance is outlined.

Keywords: optical character recognition, OCR, confusion matrix, Amharic character recognition, structural recognition, syntactic recognition, character signature

1. Introduction

Optical character recognition systems for Latin characters have been available for over a decade and perform well on clear typed text. Development of these commercial applications continues, and is concerned with coping with the widest variety of fonts and with character recognition in less constrained environments, such as the identification of vehicle licence plates for road pricing schemes. There has been considerable recent research into Arabic OCR for both off-line and on-line systems. Off-line systems are those where only the final printed characters are available. On-line systems are those where the characters are written on a graphics tablet, so that information on the speed and direction of movement of the pen is also available. Comprehensive surveys covering off-line techniques for Arabic script recognition are given by Mori et al. [1], Tappert et al. [2] and Amin [3]. A more up-to-date review is provided by Plamondon and Srihari [4]. El-Wakil and Shoukry [5] review on-line recognition techniques, while both areas are reviewed by Al-Badr and Mahmoud [6].

There are roughly two dozen non-Latin scripts in wide usage, and research has also been directed at other non-Latin scripts such as Japanese, Chinese, Hindi and Tibetan. A notable exception to this research effort is the Amharic character set. This is the principal script of over 20 million people, mainly in Ethiopia. An extensive review of the literature reveals no journal or conference papers which discuss the problem of recognition of the Amharic script, apart from some MSc theses from Addis Ababa University [7-9].

In printed Amharic material very few fonts are available, for two reasons. Firstly, there is little commercial incentive to develop and distribute fonts in such a relatively small market. Secondly, and more importantly, there is no standardised mapping between Latin keyboards and the keystrokes required to generate the Amharic characters. This means that a typist trained to use one font and set of mappings cannot type in another font which uses different mappings without retraining. For these reasons multi-font support is not important in an Amharic OCR system.

This paper begins with a description of the Amharic character set and then discusses possible approaches to the development of an OCR system. Finally, two statistical approaches are described. Both of these systems have been implemented.

2.
The Amharic Character Set

The Amharic script has 33 basic characters. There are six orders derived from the basic forms. The first five orders represent a combination of a consonant and a vowel. The sixth may represent either a consonant alone or a consonant followed by a vowel [10,11]. Therefore, there are 231 (7 × 33 = 231) core characters in the Amharic writing system. Besides these, there are over forty others which contain a special feature, usually representing labialization. The list of Amharic characters is shown in Table 1.

Table 1. The Amharic character set (transliterations of the seven orders of each basic character)

1st   2nd   3rd   4th   5th   6th   7th
hä    hu    hi    ha    he    h     ho
lä    lu    li    la    le    l     lo
hä    hu    hi    ha    he    h     ho
mä    mu    mi    ma    me    m     mo
sä    su    si    sa    se    s     so
rä    ru    ri    ra    re    r     ro
sä    su    si    sa    se    s     so
šä    šu    ši    ša    še    š     šo
qä    qu    qi    qa    qe    q     qo
bä    bu    bi    ba    be    b     bo
tä    tu    ti    ta    te    t     to
čä    ču    či    ča    če    č     čo
hä    hu    hi    ha    he    h     ho
nä    nu    ni    na    ne    n     no
ňä    ňu    ňi    ňa    ňe    ň     ňo
ä     u     i     a     e           o
wä    wu    wi    wa    we    w     wo
ä     u     i     a     e           o
kä    ku    ki    ka    ke    k     ko
hä    hu    hi    ha    he    h     ho
zä    zu    zi    za    ze    z     zo
žä    žu    ži    ža    že    ž     žo
yä    yu    yi    ya    ye    y     yo
gä    gu    gi    ga    ge    g     go
dä    du    di    da    de    d     do
jä    ju    ji    ja    je    j     jo
ţä    ţu    ţi    ţa    ţe    ţ     ţo
ćä    ću    ći    ća    će    ć     ćo
şä    şu    şi    şa    şe    ş     şo
şä    şu    şi    şa    şe    ş     şo
ρä    ρu    ρi    ρa    ρe    ρ     ρo
fä    fu    fi    fa    fe    f     fo
pä    pu    pi    pa    pe    p     po

Notable differences from many other non-Latin scripts such as Arabic are that Amharic characters do not use dots and that the characters, although cursive, are not connected.

3. Structural Approaches to OCR

A popular approach to character recognition is the structural and syntactic approach, in which the character is broken into primitives and the spatial relationships between these components are expressed using operators to create sentences in a pattern grammar. One of the best descriptions of the structural and syntactic approach to pattern recognition remains the work of the late K.S. Fu [12]. This approach often requires the characters to be thinned in order to extract information on stroke intersections [13-16]. Figure 1 shows six stages of thinning an Amharic character, removing a layer of edge pixels at each iteration. Note the development of superfluous tails at the stroke ends and the lack of relationship between the original and the thinned form. These problems and some solutions are discussed in detail by the authors [22].

Figure 1 Growth of superfluous tails when thinning.

Structural and syntactic recognition systems have a number of shortcomings that result from the required thinning process. These include unwanted tails in the thinned form and sensitivity to minor variations in the original. For these reasons, statistical rather than structural approaches were used. In addition to the template and signature based methods described in this paper, a neural network approach is an obvious choice as a statistical recogniser, and this forms the next phase of the research.

4.
The Recognition Process

Prior to submission to the recognition system the input characters must be normalised for size and orientation, both of which are critical in this type of statistical recognition.

4.1 Normalising for Size

To normalise for size the character is converted to a 100 × 100 representation. The character is scaled so that the distance between its two most distant pixels is 100 pixels in both the x and the y direction. This process is discussed in more detail in earlier published work by the authors [17]. Figure 2 shows the effect of normalising two of the 231 characters of the Amharic character set for both size and orientation.

Figure 2 Typical Amharic script templates (two characters, each shown before and after normalisation).

4.2 Normalising for Orientation

Since the approach used is intended to be general purpose and could be used for applications where the orientation of the characters is not known, the original character is mapped onto a new axis. This is achieved by creating a list of edge pixels and calculating the longest chord that can be drawn between any pair of pixels forming the character outline. An edge pixel, here, is defined as one that is black but has one or more adjacent white pixels, including diagonally adjacent white pixels (that is, 8-connectedness rather than 4-connectedness). The line defined by the two end points of the longest chord is used as the new vertical axis of the normalised character. The horizontal axis of the new co-ordinate frame is at right angles to this axis, and the point of intersection (0,0) of both axes is the lowest edge point. This fully defines the new co-ordinate frame.

To normalise the bitmapped character for orientation, it is rotated about the intersection of the axes so that it can be mapped onto the new co-ordinate frame. This is achieved by multiplying the co-ordinates of every pixel of the character by the direction cosines of the new co-ordinate system. Figure 3 shows a typical character before and after normalisation for orientation.
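The two normalisation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the character is held as a set of black (x, y) co-ordinates, the longest chord is found by brute force, rotation is applied before scaling, and the function names (`edge_pixels`, `longest_chord`, `normalise`) are hypothetical.

```python
import math
from itertools import combinations

# Offsets of the eight neighbours used to decide whether a pixel is an edge pixel.
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def edge_pixels(black):
    """Return the black pixels that have at least one white 8-neighbour.

    `black` is a set of (x, y) tuples; anything not in the set is white.
    """
    return [(x, y) for (x, y) in black
            if any((x + dx, y + dy) not in black for dx, dy in NEIGHBOURS)]


def longest_chord(edge):
    """Return the pair of edge pixels that are furthest apart (brute force)."""
    return max(combinations(edge, 2), key=lambda pair: math.dist(pair[0], pair[1]))


def normalise(black, size=100):
    """Rotate so the longest chord is vertical, then scale to size x size.

    Assumes the character has at least two edge pixels.
    """
    (x0, y0), (x1, y1) = longest_chord(edge_pixels(black))
    length = math.hypot(x1 - x0, y1 - y0)
    # Direction cosines of the new vertical axis, which lies along the chord.
    ux, uy = (x1 - x0) / length, (y1 - y0) / length
    # Project every pixel onto the new horizontal axis (uy, -ux)
    # and onto the new vertical axis (ux, uy).
    rotated = [(x * uy - y * ux, x * ux + y * uy) for (x, y) in black]
    min_h = min(h for h, _ in rotated)
    min_v = min(v for _, v in rotated)
    span_h = (max(h for h, _ in rotated) - min_h) or 1.0  # avoid dividing by zero
    span_v = (max(v for _, v in rotated) - min_v) or 1.0
    # Stretch both extents independently so that each covers 0 .. size - 1.
    return {(round((h - min_h) / span_h * (size - 1)),
             round((v - min_v) / span_v * (size - 1))) for h, v in rotated}
```

Because each source pixel is mapped forward onto the new grid, enlarging a small character in this way leaves gaps between pixels. A practical version would map each destination pixel back to the source instead.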
Figure 3 Normalising for orientation, showing the new vertical axis and the new horizontal axis.

4.3 Recognising Characters

In the recognition phase two alternative sets of templates were used and their performance compared:

1. The first technique compares each character against each of the template characters. The degree of closeness is given as a percentage. The template returning the highest percentage is deemed to identify (that is, to recognise) the input character.

2. The second technique uses a signature which can be quickly derived from the normalised character. The signature for each character is produced through a process of iteration. The number of black pixels is counted in each of the 100 rows and then in each of the 100 columns. These counts are compared against the corresponding counts of black pixels in a set of templates. If a statistical template is used, the value of a pixel is based on the intensity of that pixel rather than simply being 0 (white) or 1 (black).

The main disadvantage of the first technique is that for a 100 × 100 character, 10,000 pixels have to be compared for each template. In the signature based system only 200 values have to be compared. Despite the time taken to derive the signature, there is an improvement in speed of about 10,000/200 = 50 times, but this is achieved at the loss of some accuracy. This variation can be expressed using a Confusion Matrix, as discussed in section 5.

An important benefit shared by both of these recognition processes is that we do not simply get a recognised output, but also a good reflection of the level of confusion embedded in the process. In this way we gain knowledge of likely candidates for misinterpretation and can take steps to minimise their effect.

4.4 Extracting the Characters

When a page is scanned, the result is usually a 256-level grey-scale image. The recognition process is greatly simplified by converting this to a set of black characters on a white background by the application of a global threshold. All of the pixels with an intensity less than the threshold are converted to black, and the others are converted to white. It is often sufficient to use the same fixed threshold for every image; however, it is straightforward to derive one from the distribution of intensities in the input image. There will typically be two large peaks corresponding to the background and the characters.
Choosing a value midway between these two extremes will provide a satisfactory threshold.

Amharic characters are not connected, which simplifies the process of extracting individual characters. The image is scanned horizontally. When the first 'black' pixel is encountered it is converted to 'grey', that is, some value which is neither black nor white. Using conventional region growing techniques, all black pixels which touch this grey pixel are converted to grey, until no more pixels can be changed. The grey character can then be presented to the character recognition part of the system. Before the next character can be identified, the grey pixels are converted to white, which erases the character. The process then begins again and continues until no more characters are found.

4.5 Identifying Individual Characters

Figure 4 shows the interactive interface of the recognition system prototype, which identifies individual characters. The Language menu option is used to select a small textual configuration file which identifies the number and names of the characters in the character set to be recognised and the location of the templates to be used. No changes are required to the system to recognise a new character set. All that is required is the configuration file and a set of representative characters of that character set which can be used to produce the templates. Since the matching process is the same for both the full template and the signature recognition process, the process which is to be used is simply identified in the configuration file. At this stage, we simply input the name of the character or number to be recognised, which corresponds to a file name. The system responds by locating the bitmap for the input, which is shown in the first output window in Figure 4. The recognition process then follows the phases described above.
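The thresholding and character extraction of section 4.4 can be sketched as below. This is an illustrative sketch rather than the system's actual code. It assumes that the image is a list of rows of integer intensities from 0 to 255, that the character peak lies in the darker half of the histogram and the background peak in the lighter half, and that "touching" pixels means 8-connected neighbours. All function names are hypothetical.

```python
from collections import deque

WHITE, BLACK, GREY = 0, 1, 2


def midpoint_threshold(grey_image):
    """Pick a threshold midway between the dark and the light histogram peak."""
    histogram = [0] * 256
    for row in grey_image:
        for value in row:
            histogram[value] += 1
    # Assumption: characters peak in 0..127 and the background peaks in 128..255.
    dark_peak = max(range(0, 128), key=lambda v: histogram[v])
    light_peak = max(range(128, 256), key=lambda v: histogram[v])
    return (dark_peak + light_peak) // 2


def binarise(grey_image, threshold):
    """Pixels darker than the threshold become BLACK, all others WHITE."""
    return [[BLACK if value < threshold else WHITE for value in row]
            for row in grey_image]


def extract_characters(binary_image):
    """Return one list of (x, y) pixels per connected black region, in scan order."""
    image = [row[:] for row in binary_image]  # work on a copy
    height = len(image)
    width = len(image[0]) if height else 0
    characters = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != BLACK:
                continue
            image[y][x] = GREY  # seed pixel of a new character
            queue = deque([(x, y)])
            region = []
            while queue:
                cx, cy = queue.popleft()
                region.append((cx, cy))
                # Grow the region into every black 8-neighbour.
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if image[ny][nx] == BLACK:
                            image[ny][nx] = GREY
                            queue.append((nx, ny))
            characters.append(region)
            for cx, cy in region:
                image[cy][cx] = WHITE  # erase before looking for the next one
    return characters
```

Each returned region could then be passed to the normalisation and recognition stages. The grey marker serves the same purpose as in the text: it separates pixels already claimed by the current character from black pixels not yet visited.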
A practical problem was found with Amharic text, since many Amharic characters are visually very similar and many have pronunciations which sound identical to non-native speakers. To reduce mistakes and to ensure that there was no confusion about which character was being referred to, we adopted a simple system of naming the characters. The first two letters refer to the language, Am in this case (the authors also use other non-Latin character sets, so this part is essential). The next one or two characters in the sequence give the number of the basic character, which is a value between 1 and 33 for Amharic. An underscore provides a break before a digit between 1 and 7 which identifies the order. For example, the third order of the first basic character is referred to as Am1_3. The sample characters used in the experimental phase are further identified by another underscore and a number from 1 to the number of sample characters tested. Figure 4 shows the identification of the fourth sample of the Amharic character we identify as Am1_3, that is the third order of the first character in the Amharic character set.

Figure 4 The recognition software in action.

For each pixel in the character and the corresponding pixel in a template, the difference in intensity is found. This is summed over the whole image to yield a closeness of fit. The smaller the sum of the differences, the closer the match between a character and a template. The template which gives the closest match identifies the character.

4.6 The Signature Comparison System

The signature for each character is produced through iteration. The number of black pixels is counted in each of the 100 rows and then in each of the 100 columns. These counts are compared against the corresponding counts of black pixels in a set of templates [18,19]. For each row, the modulus of the difference in the number of pixels is calculated and the resultant values are added. The process is repeated for the columns, and the two difference values (one for rows and the other for columns) are added. A complete match would yield a sum of zero, while the other extreme, a 100% exclusive-or of the input character with a template, would yield a value of 20,000. This outcome can be more readily appreciated by converting the result to a value between 0 and 100 by dividing the resulting difference value by 200. The recognition process is approximately 40 times faster than comparing every pixel.

5. Experimental Results

The recognition algorithm described not only identifies the template character which most closely matches the input character but also other template characters which are similar.

Character
Am1_1  100
Am1_2   64  100
Am1_3   56   48  100
Am1_4   60   61   51  100
Am1_5   51   54   67   47  100
Am1_6   94   63   54   63   51  100
Am1_7   56   72   60   59   60   55  100
Am2_1   60   61   52   65   48   59   56  100
Am2_2   64   62   57   61   54   62   60   58  100
Am2_3   47   65   55   70   53   48   63   62   54  100
Am2_4   65   66   50   77   44   65   58   75   65   66  100
Am2_5   60   67   46   70   50   59   57   70   59   73   76  100
Am2_6   59   59   52   63   48   59   55   91   56   59   70   69  100
Am2_7   69   57   58   66   52   67   58   61   75   48   66   55   60  100
Char   1_1  1_2  1_3  1_4  1_5  1_6  1_7  2_1  2_2  2_3  2_4  2_5  2_6  2_7

Figure 5 - The Confusion Matrix for Amharic script using template comparison

If every template character is compared against every other template character, a closeness of fit between every pair of characters can be produced and presented as a triangular matrix which shows how closely pairs of characters resemble each other. This is known as the Confusion Matrix [20,21]. Since the entire matrix has 231 columns and rows, only a portion of the Confusion Matrix is shown in Figure 5 by way of illustration. The distribution of results is shown in Figure 6.
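The signature comparison of section 4.6 and the construction of a Confusion Matrix can be sketched as follows. This is a minimal sketch under assumptions: characters are square lists of rows holding 1 (black) or 0 (white), and the closeness rating is taken to be 100 minus the scaled difference, which the text implies but does not state explicitly. The function names and the `size` parameter are hypothetical.

```python
def signature(character):
    """Black-pixel count of every row followed by that of every column.

    `character` is a square list of rows holding 1 (black) or 0 (white).
    """
    row_counts = [sum(row) for row in character]
    column_counts = [sum(column) for column in zip(*character)]
    return row_counts + column_counts


def signature_difference(sig_a, sig_b):
    """Sum of the absolute differences over all row and column counts."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b))


def rating(sig_a, sig_b, size=100):
    """Closeness from 0 to 100 (assumed: 100 minus the scaled difference)."""
    worst_case = 2 * size * size  # 20,000 for a 100 x 100 character
    return 100 - 100 * signature_difference(sig_a, sig_b) / worst_case


def recognise(character, templates, size=100):
    """Name of the template signature closest to the character's signature.

    `templates` maps a character name such as "Am1_3" to its signature.
    """
    sig = signature(character)
    return max(templates, key=lambda name: rating(sig, templates[name], size))


def confusion_matrix(templates, size=100):
    """Lower-triangular ratings keyed by (row_name, column_name)."""
    names = list(templates)
    return {(a, b): rating(templates[a], templates[b], size)
            for i, a in enumerate(names) for b in names[:i + 1]}
```

With `size=100` the scaling is the division by 200 described above. One limitation is visible directly in the code: two different shapes with the same row and column counts, such as the two diagonals of a square, have identical signatures and so receive a rating of 100.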
In Figure 6 the vertical axis is the percentage of character pairs having a particular confusion rating, and the horizontal axis shows the confusion rating between 0 and 100.

Figure 6 - Distribution when using template comparison.

The highest rating of 97 is given by 4 pairs, and 37 pairs give a rating of 90 or more. Experimental work shows that character pairs with a rating of 90 or over on clear type are readily confused as the quality of the input character diminishes.

The situation with the signature templates is even worse. The number of pairs with a rating of 90 or over is 377. Indeed, two pairs give a rating of 99 and 23 pairs a rating of 98. Even on very clear input images, characters with a rating of 99 are very likely to be confused. The distribution is shown in Figure 7. Experimental work shows that if the quality of the images falls slightly, many errors occur. For Amharic characters, the signature templates produce an unacceptable error rate which is not compensated for by the greatly increased recognition speed compared to the template comparison approach.

These results indicate that the identification of Amharic script is far more demanding than the recognition of Latin script or of other cursive scripts such as Arabic, because of the greater number of characters and the greater similarity between pairs of characters. In Arabic script, similar characters can usually be distinguished by an analysis of the number and position of dots. This is not the case in Amharic, which does not use dots and in which characters are distinguished by the number and position of small attached embellishments. Experimental work shows that on very clear printed type, characters with a confusion rating of 97 can be distinguished every time; however, as the quality of the input character diminishes, the confusion factor increases.

Figure 7 - Distribution when using signature comparison.
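The distributions plotted in Figures 6 and 7, and the counts of pairs rated 90 or over, can be derived from the pairwise ratings as in the sketch below. The helper names are hypothetical. The sample data are the 21 off-diagonal ratings among the seven orders of the first basic character, copied from the excerpt in Figure 5, so they illustrate the calculation only and do not reproduce the full 231-character results.

```python
from collections import Counter

# Off-diagonal template-comparison ratings for Am1_1 .. Am1_7, from Figure 5.
FIGURE_5_EXCERPT = [
    64,
    56, 48,
    60, 61, 51,
    51, 54, 67, 47,
    94, 63, 54, 63, 51,
    56, 72, 60, 59, 60, 55,
]


def rating_distribution(pair_ratings):
    """Percentage of character pairs found at each (rounded) confusion rating."""
    counts = Counter(round(r) for r in pair_ratings)
    total = len(pair_ratings)
    return {r: 100 * c / total for r, c in sorted(counts.items())}


def pairs_at_or_above(pair_ratings, cutoff=90):
    """Number of character pairs whose rating reaches the cutoff."""
    return sum(1 for r in pair_ratings if r >= cutoff)


distribution = rating_distribution(FIGURE_5_EXCERPT)
risky_pairs = pairs_at_or_above(FIGURE_5_EXCERPT)
```

In this excerpt only one pair, Am1_1 and Am1_6 with a rating of 94, reaches the cutoff of 90 above which the text reports that characters are readily confused.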

6. Conclusions

This paper describes a fast recognition system based on creating image signatures which can be used for any character set. The system normalises characters for size and orientation. Two template comparison techniques are presented: one compares every pixel of the input character to a set of templates, the other uses a set of signatures. The template comparison system achieves nearly perfect recognition rates for very clear text, but the quality of the image is very important: as it deteriorates, the recognition rate falls significantly.

The system has been demonstrated using the Amharic character set but could read any character set with a small amount of work to create the signatures for idealised characters.

The system not only identifies a character but also gives a measure of how close other characters are to the one recognised. The Confusion Matrix gives the degree of similarity between characters. The use of the Confusion Matrix gives an indication of how likely a character is to be confused with other characters and highlights possible problem areas.

Results to date are encouraging, and work has already begun to assess the performance of the recognition system using real, everyday data.

Bibliography

[1] S. Mori, C.Y. Suen and K. Yamamoto, Historical review of OCR research and development, Proceedings IEEE, vol. 80, 1029-1058, (1992).
[2] C.C. Tappert, C.Y. Suen and T. Wakahara, On-line handwriting recognition - a survey, Proceedings 9th International Conference on Pattern Recognition (ICPR9), Rome, Italy, IEEE, New York, N.Y., USA, 1123-1132, (1988).
[3] Amin A., Off-line Arabic character recognition - the state of the art [review], Pattern Recognition, vol. 31, no. 5, 517-530, (1998).
[4] R. Plamondon and S.N. Srihari, On-line and off-line Handwriting Recognition: A Comprehensive Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, 63-84, (2000).
[5] El-Wakil M.S. and Shoukry A.A., On-line recognition of handwritten isolated Arabic characters, Pattern Recognition, vol. 22, no. 2, 97-106, (1989).
[6] Al-Badr B. and Mahmoud S.A., Survey and bibliography of Arabic optical text recognition, Signal Processing, vol. 41, no. 1, 49-77, (1995).
[7] Ermias Abebe, Recognition of Formatted Amharic text using optical character recognition, Masters thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, (1998).
[8] Worku Alemu, The Application of OCR Techniques to the Amharic Script, Masters thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, (1997).
[9] Yaregal Assabie Lake, Optical character recognition of Amharic text: an integrated approach, Masters thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, (2001).
[10] Bender, M. et al., Language in Ethiopia, London: Oxford University Press, (1976).
[11] Ullendorff, E., The Ethiopians: An Introduction to the Country and People, 3rd ed., London: Oxford University Press, (1973).
[12] Fu K.S., Syntactic models in pattern recognition and applications, in Pattern Recognition in Practice, ed. Gelsema E.S., (1980).
[13] Bazzi I., Schwartz R. and Makhoul J., An Omnifont Open-Vocabulary OCR System for English and Arabic, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, 495-504, (1999).
[14] Romeo-Pakker K., Ameur A., Olivier C. and Lecourtier Y., Structural analysis of Arabic handwriting: segmentation and recognition, Machine Vision and Applications, vol. 8, no. 4, (1995).
[15] Bushofa and Spann M., Segmentation and recognition of Arabic characters by structural classification, Image and Vision Computing, vol. 15, 167-179, (1998).
[16] Cowell J., Syntactic pattern recognizer for vehicle identification numbers, Image and Vision Computing, vol. 13, no. 1, 13-19, (1995).
[17] Hussain, F. and Cowell, J., Character recognition of Arabic and Latin Script, Proceedings IV2000 conference, London, (2000).
[18] Hussain, F. and Cowell, J., A fast signature based algorithm for recognition of isolated Arabic characters, IASTED conference on Visualisation, Imaging and Image Processing (VIIP), Malaga, September 2002.
[19] Kinser, Jason, Image signatures: Ontology and classification, CGIM2001 Computer Graphics and Imaging conference, IASTED, Hawaii, USA, (2001).
[20] Cowell, J. and Hussain, F., The Confusion Matrix identifying Conflicts in Arabic and Latin Character Recognition, Proceedings CGIM2000, Las Vegas, November 2000.
[21] Cowell, J. and Hussain, F., Resolving Conflicts in Arabic and Latin Character Recognition, EG2001, UCL, London.
[22] Cowell, J. and Hussain, F., Extracting Features from Arabic Characters, Proc. CGIM2001, Hawaii, (2001).

Acknowledgements

The authors wish to express their thanks to Yaregal Assabie Lake (Computer Science Department, Addis Ababa University) for his valuable assistance in providing an understanding of the Amharic character set.