Linear Discriminant Analysis in Ottoman Alphabet Character Recognition

ZEYNEB KURT, H. IREM TURKMEN, M. ELIF KARSLIGIL
Department of Computer Engineering, Yildiz Technical University, 34349 Besiktas / Istanbul, TURKEY
zeyneb@ce.yildiz.edu.tr, irem@ce.yildiz.edu.tr, elif@ce.yildiz.edu.tr

Abstract: - This paper proposes a novel Linear Discriminant Analysis (LDA) based Ottoman character recognition system. LDA reduces the dimensionality of the data while retaining as much as possible of the variation present in the original dataset. In the proposed system, the training set consisted of 33 classes, one for each character of the Ottoman alphabet. First, the training images were normalized to reduce variations in illumination and size. Then characteristic features were extracted by LDA. To apply LDA, the number of samples in the training set must be larger than the number of features of each sample; to satisfy this condition, Principal Component Analysis (PCA) was applied as an intermediate step. The same processing was applied to the unknown test images, and a k-nearest neighbor approach was used for classification.

Key-Words: - Ottoman Character Recognition, PCA, LDA

1 Introduction

In recent years, rapidly growing interest in document archiving and retrieval systems has increased attention to character recognition. Numerous approaches have been proposed for optical or handwritten character recognition, mostly for Latin-based alphabets. In contrast, alphabets such as Arabic, Chinese, and Japanese have picture-like characters that require more complex recognition algorithms. This paper presents an Ottoman alphabet recognition system. The Ottoman alphabet was formed by adding new letters from the Farsi and Turkish alphabets to the Arabic alphabet. It has also developed original features of its own and has become an art form through its various scriptural styles.

There are few implementations of automated Ottoman alphabet recognition. Atici and Yarman-Vural applied chain coding to characters in a Hidden Markov Model based Ottoman alphabet recognition system [1]. Ozturk et al. proposed a method for recognizing isolated Ottoman characters with neural networks [2]. We propose a new global-feature-based approach for Ottoman alphabet recognition using Linear Discriminant Analysis (LDA). LDA projects the data onto a lower-dimensional space while preserving as much of the class-discriminatory information as possible.

Fig. 1. Block diagram of the proposed system

Ottoman script is fundamentally cursive. Since this paper focuses on character recognition, the segmentation of connected characters has not been studied. Figure 1 illustrates the block diagram of the proposed system. The outline of the paper is as follows: Section 2 presents the preprocessing steps of the system. Section 3 describes the extraction
of the characteristic features using LDA. In Section 4, the decision steps of the system are described. Finally, in Section 5, the experimental results of our implementation are given and the performance of the proposed system is discussed.

2 The Preprocessing

Ottoman characters have many points in common in their main shapes, while differing in detail. In the proposed system, each Ottoman character is considered a class. Our main goal is to discuss how the components that capture within-class similarities and between-class differences influence the success rate. Since our purpose is to measure performance on the recognition of isolated characters, the classification success of the separate characters is scrutinized. The objective of LDA is to find the most efficient combinations for splitting up multiple classes, maximizing the between-class differences. It is therefore widely used in applications such as face recognition, speech recognition, image summarization, and data classification [5]. As a precondition for applying LDA, the dimensionality of each sample must be less than the number of samples in the training set. Since our training set does not satisfy this precondition, dimension reduction must be applied first; for this purpose, PCA is applied to the training set as an intermediate step.

The main purpose of preprocessing is to reduce variation in size and illumination. The preprocessing steps are applied to both the training and the test images. In this study, 256-gray-level images with a light-colored background were used. Contrast enhancement not only improved the identification of the characters and thinned them, but also eliminated potential noise in the images. The images were then aligned and resized. Figure 2 shows the preprocessing steps applied to the images.
For each image, only the region inside the frame drawn through the tangents to the extreme points of the character along the x and y axes (its bounding box) is taken into consideration. After this alignment, the images may still be of different sizes, so each character was normalized to 32x32 pixels.
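The cropping and size normalization described above can be sketched as follows. This is a minimal reconstruction, not the authors' code: the function name, the binarization threshold, and the nearest-neighbor resampling are my own choices.

```python
import numpy as np

def normalize_character(img, out_size=32, threshold=128):
    """Crop a gray-level character image to its bounding box and
    rescale it to out_size x out_size pixels.
    Assumes dark ink on a light background, as in the paper."""
    ink = img < threshold                      # character (ink) pixels
    rows = np.flatnonzero(ink.any(axis=1))     # tangent rows (y extremes)
    cols = np.flatnonzero(ink.any(axis=0))     # tangent columns (x extremes)
    crop = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # nearest-neighbor resampling onto the target grid
    r = (np.arange(out_size) * crop.shape[0] / out_size).astype(int)
    c = (np.arange(out_size) * crop.shape[1] / out_size).astype(int)
    return crop[np.ix_(r, c)]
```

Nearest-neighbor resampling keeps the binarized strokes crisp; a smoother interpolation could equally be used before contrast enhancement.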
Fig. 2. Examples of the preprocessing steps

3 The Feature Extraction

Each Ottoman character has features that are similar to those of other characters as well as features that distinguish it from them. In this study, LDA was used to extract the characteristic features of the Ottoman characters. To apply LDA, the number of samples in the training set must be larger than the number of features of each sample. To achieve this, we first applied PCA: selecting the eigenvectors with the highest eigenvalues reduces the dimensionality. We then applied LDA to this reduced feature set.

3.1 Principal Component Analysis

PCA reduces the dimensionality of data while expressing its most important characteristic features. PCA finds basis vectors for a subspace that maximizes the variance retained in the projected data or, equivalently, gives uncorrelated projected distributions or, equivalently, minimizes the least-squares reconstruction error [3]. Each of the p images of size n*m is treated as an [n*m]*1 column vector corresponding to the original data [4].

F = X * X^T    (1)
The covariance matrix F is defined by Equation (1), where X is the mean-subtracted data matrix of size [n*m]*p, p being the number of images in the training set. Because of the high dimensionality, F is instead calculated using (2):

F = X^T * X    (2)

The eigenvalues and associated eigenvectors are obtained as in (3), where F is the covariance matrix, I is the set of eigenvalues, and V is the set of associated eigenvectors:

F * V = I * V    (3)

Not all of the acquired eigenvectors have to be used. The eigenvectors are therefore sorted by their eigenvalues in descending order, and only a selected number of eigenvectors with the highest eigenvalues are kept. PCA was applied to both the training and the test data, and the resulting projections in the eigenspace were given to LDA as input.

3.2 Linear Discriminant Analysis

LDA aims to determine the linear combinations of features that separate the objects and events into multiple classes. The combinations obtained by LDA can be used directly as a linear classifier, or to reduce the dimensionality before the classification step. LDA maximizes the between-class scatter while minimizing the within-class scatter [6]. The eigenspace projections obtained by PCA for each character of the alphabet were vectorized. The mean vector M of the whole character set was calculated, as well as the mean vector M_i of each character class; the class mean was then subtracted from each sample of that class. Let m be the number of samples in each class, n be the number of classes, and A_ij be the mean-centered eigenspace projection of the j-th sample of class i. The scatter matrix of each class was acquired by:

S_1 = A_11 * A_11^T + ... + A_1m * A_1m^T    (4)
...
S_n = A_n1 * A_n1^T + ... + A_nm * A_nm^T    (5)
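The per-class scatter of Equations (4)-(5) is a sum of outer products of the mean-centered column vectors, which collapses to a single matrix product. A minimal numpy sketch (the function name is my own):

```python
import numpy as np

def class_scatter(A):
    """Scatter matrix S_i of one class, Eq. (4)-(5): the columns of A
    are the mean-centered eigenspace projections A_i1 ... A_im, and
    A @ A.T equals the sum of the outer products A_ij * A_ij^T."""
    return A @ A.T
```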
The within-class scatter matrix S_W was obtained by (6), and the between-class scatter matrix S_B was built by (7), where M_i is the mean vector of the i-th class and M is the mean vector of all characters in the training set:

S_W = S_1 + S_2 + ... + S_n    (6)

S_B = sum_{i=1..n} m * (M_i - M) * (M_i - M)^T    (7)

The eigenvectors V and eigenvalues I were calculated from the generalized eigenvalue problem:

S_B * V = I * S_W * V    (8)

The eigenvectors were sorted by their associated eigenvalues in descending order, and the first n-1 eigenvectors were kept. The Fisher-space projections of each class in the training set were obtained using these eigenvectors.

4 Classification

Each test image was first projected into the eigenspace and then into the Fisher space. The k-nearest neighbor rule was used to classify the unknown test images. The system performance was evaluated by 10-fold cross-validation [7].

5 Experimental Results

We conducted experiments on a database of 10 sample images for each of the 33 Ottoman alphabet characters. The samples, gathered from electronic resources, were both handwritten and in printed format. For the k-nearest neighbor classifier, k was selected as 3 by 10-fold cross-validation. Figure 3(a) illustrates the correct classification of three test examples.
Fig. 3 (a). Three samples of the test data that were recognized correctly. (b). Training-set samples of these characters, respectively.

Figure 4 illustrates the false classification of three test examples. As can clearly be seen, these test examples are very similar to characters of the incorrect classes.

Fig. 4 (a). Three samples of the test data that were recognized incorrectly. (b). Training-set samples of the classes the system decided on for the incorrectly recognized letters.

As given in Table 1, the success ratio of our Ottoman character recognition system is 88%.

Table 1. The success ratio of the proposed system

  True classifications          290
  False classifications          40
  Total number of characters    330
  Recognition ratio             88%
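The feature-extraction and classification pipeline of Sections 3-4 can be sketched end to end as follows. This is a minimal reconstruction under stated assumptions, not the authors' implementation: the function names, the small ridge added to S_W to keep it invertible, and the use of scipy.linalg.eigh for the generalized eigenproblem of (8) are my own choices.

```python
import numpy as np
from scipy.linalg import eigh  # generalized symmetric eigensolver

def pca_project(X, k):
    """PCA via the small p x p matrix X^T X (Eq. 2); the columns of X
    are the [n*m]-dimensional image vectors of the training set."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    vals, vecs = np.linalg.eigh(Xc.T @ Xc)       # Eq. (2)-(3)
    order = np.argsort(vals)[::-1][:k]           # k largest eigenvalues
    U = Xc @ vecs[:, order]                      # map back to image space
    U /= np.linalg.norm(U, axis=0)               # unit-length basis vectors
    return U.T @ Xc, U, mean                     # eigenspace projections

def fisher_projection(P, labels):
    """Fisher projection: solve S_B v = lambda S_W v (Eq. 8) and keep
    the first n_classes - 1 eigenvectors."""
    d = P.shape[0]
    M = P.mean(axis=1, keepdims=True)            # overall mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Pc = P[:, labels == c]
        Mc = Pc.mean(axis=1, keepdims=True)      # class mean M_i
        A = Pc - Mc
        Sw += A @ A.T                            # Eq. (4)-(6)
        Sb += Pc.shape[1] * ((Mc - M) @ (Mc - M).T)  # Eq. (7)
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d)) # small ridge on S_W
    W = vecs[:, np.argsort(vals)[::-1][:np.unique(labels).size - 1]]
    return W.T @ P, W

def knn_classify(train_proj, train_labels, test_vec, k=3):
    """k-nearest-neighbor vote in Fisher space (k = 3 in the paper)."""
    dists = np.linalg.norm(train_proj - test_vec[:, None], axis=0)
    return np.bincount(train_labels[np.argsort(dists)[:k]]).argmax()
```

A test image is classified by applying the same chain: subtract the training mean, project with U, project with W, and take the k-NN vote among the Fisher-space projections of the training set.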
6 Conclusion

This study presents a Linear Discriminant Analysis based automatic Ottoman alphabet character recognition system. The approach retains class separability while reducing dimensionality. As the performance of the proposed system is very promising for the recognition of individual characters, in future work we will study the recognition of cursive characters in scripts. This will enable the processing, recognition, and archiving of large numbers of historical documents.

References

1. Atici AA and Yarman-Vural FT (1997) A Heuristic Algorithm for Optical Character Recognition of Arabic Script. Signal Processing, Vol. 62, No. 1, pp. 87-99.
2. Ozturk A, Gunes S and Ozbay Y (2000) Multifont Ottoman Character Recognition. In: Proceedings of the 7th IEEE Int. Conf. on Electronics, Circuits and Systems (ICECS), December 17-20, 2000, Jounieh, Lebanon, pp. 945-949.
3. Turk M and Pentland A (1991) Eigenfaces for Recognition. Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86.
4. Turk M and Pentland A (1991) Face Recognition Using Eigenfaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Maui, Hawaii, USA, pp. 586-591.
5. Belhumeur PN, Hespanha JP and Kriegman DJ (1996) Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. In: Proceedings of the 4th European Conference on Computer Vision (ECCV'96), Cambridge, UK, pp. 45-58.
6. Etemad K and Chellappa R (1997) Discriminant Analysis for Recognition of Human Face Images. Journal of the Optical Society of America A, pp. 1724-1733.
7. Kung SY, Mak MW and Lin SH (2004) Biometric Authentication: A Machine Learning Approach. Prentice Hall.