Alternatives for Page Skew Compensation in Writer Identification

Jin Chen and Daniel Lopresti
Department of Computer Science & Engineering, Lehigh University, Bethlehem, PA 18015, USA
Email: {jic207, lopresti}@cse.lehigh.edu

Abstract

Traditionally, page images undergo pre-processing before the later stages of document analysis are applied. One common pre-processing step is to calculate and correct for simple page skew through a compensating rotation. Such operations modify the original input image, however, and in doing so may discard or obscure useful information. In this paper, we examine the impact of page deskewing on the task of writer identification for complicated handwritten documents. As an alternative to rotating the page image, we demonstrate a method that compensates for page skew during feature extraction. An experimental evaluation involving 61 Arabic writers and 610 page images shows that handling page skew during feature extraction benefits writer ID with a statistically significant 1.4% gain in accuracy. In addition, we obtain a 4.7% gain after improving an existing contour-based feature extraction method.

I. INTRODUCTION

Traditional techniques for document image analysis (DIA) follow a paradigm of pre-processing data, extracting features from it, and training a classifier to decode testing data. Noise and artifacts are normalized or removed during pre-processing so that the feature extraction and classification modules can work on improved data. However, such pre-processing usually modifies the original image and may discard information that could be exploited in later stages of document analysis. For example, slant correction, stroke-width normalization, and vertical scaling are commonly used for offline handwriting recognition, but Schlapbach and Bunke found that slant correction, and its combination with vertical scaling or width normalization, can be harmful to writer ID [1].
Writer identification, which is the problem of assigning a sample of unknown handwriting to one of a list of known writers, has been studied for decades. Over the years, a number of classifiers have proven useful in identifying writers, including Neural Networks [2], [3], K-nearest-neighbors (KNNs) [4], [5], [6], [7], [8], Hidden Markov Models (HMMs) [9], [10], Gaussian Mixture Models (GMMs) [11], [12], Support Vector Machines (SVMs) [13], and weighted Euclidean Distance classifiers (WED) [4]. Features have been based on connected-component contours [7], grapheme codebooks [7], [14], Gabor filtering [4], chain-code encoding [15], morphological operations [3], and many others [5], [13], [6]. Given carefully prepared datasets, researchers are able to achieve reasonably high accuracy with the help of discriminative classifiers and feature sets [16]. Recently, however, interest has shifted toward challenging data, which may contain various types of noise and artifacts.

One type of artifact in handwritten documents is page skew, which is introduced during document scanning. In practice, pre-printed ruling lines, which are designed to help people write neatly, may help determine the page skew, as shown in Figure 1. On the other hand, rulings can interfere with efforts to segment handwritten strokes [17], [18], [19], [20]. Thus, to examine the impact of page skew on writer ID, we have to take pre-printed rulings into account. In general, there are two ways to handle pre-printed rulings: remove them during pre-processing, or detect them and compensate for them during later processing stages. Following the pre-processing paradigm, Arvind et al. detect the ruling lines within segmented handwritten blocks by computing horizontal projection profiles and design several rules to remove them [19]. Cao et al. design a set of heuristics to recover broken handwritten strokes after removing rulings [20]. Other, machine-learning based methods include [17], [18].
None of these approaches seems ideal. First, training-based approaches face the difficulty of creating ground truth, which is either tedious to label at the pixel level or relies on synthetic datasets, which are less convincing. Second, removal-based approaches use simple shape analysis and tend to create false-alarm strokes and/or to miss broken ruling segments. Nevertheless, ruling line removal has been widely used in applications such as check processing and form processing.

In our previous work [21], we avoided the problem of recovering broken strokes after removing rulings and instead handled the impact of rulings during feature extraction. This paradigm differs from the pre-processing one in that we do not modify any part of the image, but detect different image components and deal with them during later processing, e.g., feature extraction. It turned out that this paradigm lets us exploit the fact that writers may treat rulings differently, which is itself a cue for writer ID.

In this paper, we examine the impact of correcting page skew during pre-processing and instead try to compensate for the page skew during feature extraction. First, we detect the pre-printed rulings using a model-based method [22]. Next, we overcome the effects of ruling lines during the extraction of contour-hinge based features [7]. Then, we propose our ways of handling page skew during feature extraction rather than rotating the image during pre-processing. Finally, we discuss several issues that arise when implementing the feature extraction module and examine their impact on writer ID performance.

Fig. 1: An Arabic document with negative page-wise skew.

Fig. 2: Quantization of the angular plane in feature extraction.

II. WRITER ID SYSTEM

A. Page Skew Detection

Our model-based ruling line detection algorithm exploits characteristics of pre-printed rulings such as consistent spacing β1, approximately equal length L, skew angle β2, and thickness H [22]. It casts ruling detection as a multi-line linear regression problem. One advantage is that it guarantees a globally optimal solution under the Least Squares Error (LSE) criterion. The result of the algorithm is a set of ruling model parameters from which the rulings can be rendered again. In our experiments, these pre-printed rulings are represented as lists of pixel sequences, and the skew angle β2 is taken to be the skew of the document image.

B. Feature Extraction

Contour-hinge features are a bivariate probability distribution function (PDF) that captures both the orientation and the curvature of contours [7]. After extracting contours through connected-component analysis, we examine each pair of adjacent segments (each 10 pixels long) along the contours and compute their angles against the horizontal axis. Quantizing the angular plane ([0, 2π)) into a 24-bin histogram, we vote in these bins while traversing all contours. As shown in Figure 2, this quantization strategy separates positive and negative skew angles, so it may cause jumps between bins in the PDF matrix when dealing with ruling contours. We shall discuss this in detail in Section IV. In our experiments, all the features are computed on a text line basis.

We observe that the testing datasets in Bulacu and Schomaker's work [7] do not contain pre-printed rulings. Ruling lines complicate feature extraction because handwriting tends to overlap them. In this work, we follow the same strategy as in our previous work to deal with ruling lines, as shown in Figure 3.
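The 24-bin quantization of hinge angles described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the y-up coordinate convention, the convention that both hinge legs are measured outward from the pivot, and the optional `offset_deg` parameter (which mimics the shifted quantization discussed in Section IV) are all assumptions made for illustration.

```python
import math

N_BINS = 24                      # 2n bins with n = 12
BIN_WIDTH_DEG = 360.0 / N_BINS   # 15 degrees per bin


def quantize_angle(angle_deg, offset_deg=0.0):
    """Map an angle in degrees to a bin index in [0, N_BINS).

    offset_deg is a hypothetical knob: 0.0 gives bins that start at 0 degrees,
    7.5 shifts them so that bin 0 is centred on the horizontal direction.
    """
    wrapped = (angle_deg + offset_deg) % 360.0
    # The small epsilon guards against floating-point values that land just
    # below an exact bin edge (e.g. 179.99999999999997 instead of 180).
    return int(math.floor(wrapped / BIN_WIDTH_DEG + 1e-9)) % N_BINS


def leg_angle_deg(pivot, end):
    """Angle of the leg pivot -> end against the horizontal axis (y up)."""
    return math.degrees(math.atan2(end[1] - pivot[1], end[0] - pivot[0]))


def hinge_indices(prev_pt, pivot, next_pt):
    """Bin indices (phi1, phi2) of the two hinge legs attached at the pivot."""
    phi1 = quantize_angle(leg_angle_deg(pivot, next_pt))
    phi2 = quantize_angle(leg_angle_deg(pivot, prev_pt))
    return phi1, phi2


# A horizontal ruling contour gives (0, 12) or (12, 0) depending on the
# traversal direction.
print(hinge_indices((0, 0), (10, 0), (20, 0)))
print(hinge_indices((20, 0), (10, 0), (0, 0)))
# Angles just below and just above zero fall into different bins.
print(quantize_angle(-0.5), quantize_angle(0.5))
```

With the default offset, angles of -0.5° and 0.5° land in bins 23 and 0 respectively, which is the kind of bin jump the text refers to for nearly horizontal ruling contours.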
First, rulings are extracted by the detection algorithm described above. When traversing the contours, we check whether the current pivot pixel lies on any ruling line. If so, we simply skip the computation of the pivot's angular indices in the PDF matrix. We show the effect of ruling compensation in Figure 3, where blue marks the valid contour pixels that contribute to the PDF matrix and red marks the ruling contour pixels.

In their paper [7], the authors use only half of the matrix as a feature vector (φ2 ≥ φ1), considering the other half redundant, which results in 300-D feature vectors (n(2n + 1) = 300, n = 12). In Section IV we examine other options for generating feature vectors, which may provide more discriminating power.

In general, there are two different ways to handle page skew during feature extraction. First, when traversing contours, we can explicitly subtract the page skew from the hinge segment angles. Second, we can transform the coordinates of the extracted contour pixels against the skew direction before the traversal. We investigate both methods in the experimental evaluation.

C. Writer Identification

Support Vector Machines (SVMs) are often used for writer ID. SVMs construct a maximum-margin hyperplane in a higher-dimensional vector space: a classification problem that is not linearly separable in the original vector space may become linearly separable after the feature vectors are projected into a higher-dimensional space by a mapping function. These mapping functions are called kernels in the literature. We

use the Radial Basis Function (RBF) kernel because it offers better discriminability than the linear kernel, while using fewer parameters than the polynomial kernel. In our experiments, we employ the libsvm tool [24] for writer ID, where we set the cost c = 10000 and normalize feature vectors into the unit hyper-cube.

Fig. 3: Dealing with rulings during feature extraction. In the lower half, blue pixels are valid contour pixels that contribute to the PDF matrix, while red pixels are ruling contours that are not counted in the PDF matrix.

Fig. 4: A workflow diagram showing the processing modules (pre-processing and feature extraction) in the different feature extraction methods under evaluation. (φ1, φ2) denotes the actual angles used to index the PDF matrix.

Fig. 5: Accumulated feature vector distances between the three methods in comparison (Deskew Page vs. Subtract Skew, Deskew Page vs. Transform Contour, Subtract Skew vs. Transform Contour), plotted against page skew in degrees over [-1, 1].

III. EXPERIMENTAL SETUP

Our Arabic dataset was provided by the Linguistic Data Consortium (LDC) [25]. We randomly selected a subset of 61 writers, each of whom contributed 10 handwritten pages. Each page was scanned at 600 DPI with a bitonal setting. A typical page image measures 5104 × 6600 pixels (width × height). We randomly divided the dataset into five folds, each having two pages from each writer. Each page was annotated with polygon bounding boxes for handwritten text lines. Using 5-fold cross-validation, each text line was tested once. In total, we used 4,893 text lines for experimental evaluation.
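The fold construction described above (five folds, each holding two pages from every writer) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the page identifiers, and the fixed random seed are hypothetical, and the actual assignment of pages to folds is not specified beyond being random.

```python
import random


def split_into_folds(pages_by_writer, n_folds=5, seed=0):
    """Assign each writer's pages to folds so every fold gets an equal share.

    pages_by_writer maps a writer id to a list of page ids (hypothetical names).
    Returns a list of n_folds lists of (writer, page) pairs.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for writer, pages in pages_by_writer.items():
        shuffled = list(pages)
        rng.shuffle(shuffled)
        if len(shuffled) % n_folds != 0:
            raise ValueError("page count must be a multiple of n_folds")
        per_fold = len(shuffled) // n_folds
        for k in range(n_folds):
            # Slice k holds this writer's pages for fold k.
            for page in shuffled[k * per_fold:(k + 1) * per_fold]:
                folds[k].append((writer, page))
    return folds


# 61 writers with 10 pages each, as in the dataset described above.
pages = {f"w{w}": [f"w{w}_p{p}" for p in range(10)] for w in range(61)}
folds = split_into_folds(pages)
print([len(f) for f in folds])
```

Splitting per writer rather than over the pooled pages keeps every writer represented in both the training and the test portion of each cross-validation round.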
First, we examined different ways of handling page skew:

Deskew Page: this served as the baseline system, which rotates the image against the page skew direction, as in traditional pre-processing.
Subtract Skew: when traversing contours, subtracted the page skew from the angles of hinge segments, and then computed the indices in the PDF matrix.
Transform Contour: compensated for the page skew by transforming the coordinates of the extracted contours during feature extraction.

As shown in Figure 4, all three methods involved page skew detection, contour extraction, and traversal. The baseline rotated the image during pre-processing while the proposed methods did not. The three methods can be seen as differing in the stage at which they handle page skew. Deskew Page compensates for page skew during pre-processing, by rotating the bitmap directly. Transform Contour first extracts the contours and then rotates them before computing the contour-hinge angles. Subtract Skew ignores page skew until indexing into the PDF matrix. In theory, these methods should generate the same feature vector. Due to the discrete 2-D digital grid, however, they may generate significantly different feature vectors, as in Figure 5. We generated this figure by extracting features from a standard ellipse under different page skews in [-1.0°, 1.0°]. After extracting feature vectors with the three methods, we computed the accumulated distances between pairs of methods and plotted their distributions. As we can see, the distances are substantial and may therefore result in different writer ID performance.
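The difference between Subtract Skew and Transform Contour can be sketched for a single hinge. The code below is a hedged illustration rather than the evaluated system: the helper names, the y-up coordinate convention, and the sign convention for the skew angle are assumptions, and the points are kept as real-valued coordinates, so the discretization effects that Figure 5 measures do not appear here.

```python
import math

BIN_WIDTH_DEG = 15.0  # 24 bins over the full angular plane


def bin_of(angle_deg):
    """Bin index in [0, 24) for an angle in degrees."""
    return int(math.floor((angle_deg % 360.0) / BIN_WIDTH_DEG + 1e-9)) % 24


def leg_angle_deg(pivot, end):
    """Angle of the leg pivot -> end against the horizontal axis (y up)."""
    return math.degrees(math.atan2(end[1] - pivot[1], end[0] - pivot[0]))


def subtract_skew_indices(prev_pt, pivot, next_pt, skew_deg):
    """Subtract Skew: measure angles on the skewed contour, then subtract."""
    phi1 = leg_angle_deg(pivot, next_pt) - skew_deg
    phi2 = leg_angle_deg(pivot, prev_pt) - skew_deg
    return bin_of(phi1), bin_of(phi2)


def rotate_point(pt, angle_deg):
    """Rotate a point about the origin by angle_deg (counter-clockwise)."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return (pt[0] * c - pt[1] * s, pt[0] * s + pt[1] * c)


def transform_contour_indices(prev_pt, pivot, next_pt, skew_deg):
    """Transform Contour: rotate contour points against the skew first."""
    p, q, r = (rotate_point(pt, -skew_deg) for pt in (prev_pt, pivot, next_pt))
    return bin_of(leg_angle_deg(q, r)), bin_of(leg_angle_deg(q, p))


hinge = ((0, 0), (8, 6), (14, 14))  # hypothetical prev, pivot, next points
print(subtract_skew_indices(*hinge, skew_deg=0.8))
print(transform_contour_indices(*hinge, skew_deg=0.8))
```

On continuous coordinates the two functions agree, which matches the statement that the methods are equivalent in theory. Any disagreement in practice would come from where rounding to the pixel grid happens, which this sketch deliberately leaves out.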

TABLE I: Writer ID performance of the different methods.

           Deskew    Subtract Skew    Transform Contour
Fold 0     75.83%    77.76%           75.73%
Fold 1     74.35%    71.58%           78.48%
Fold 2     80.61%    79.45%           80.19%
Fold 3     78.31%    77.29%           81.81%
Fold 4     81.59%    82.48%           81.40%
Average    78.14%    77.71%           79.52%

TABLE II: Options in implementing feature extraction.

                       Deskew    Rotate Contour
Half PDF               73.47%    74.89%
Full PDF               78.14%    79.52%
Adjust Quantization    81.42%    82.00%

Second, we examined the benefits of using the full PDF matrix as feature vectors:

Half PDF: served as the baseline method, which used half of the PDF matrix as feature vectors (n(2n + 1) = 300-D, n = 12), as in [7].
Full PDF: used the full matrix as feature vectors, so the feature vectors are (2n)^2 = 576-D, n = 12.

Finally, we discuss the effect of rulings on the quantization strategy, and also their combined impact on writer ID. In the experimental evaluation, all experiments used the same SVM configuration and the full PDF matrix for features, except for the one that addresses this issue explicitly.

IV. EXPERIMENTAL RESULTS

A. Evaluation

First, we show the performance of our proposed systems, which compensate for page skew during feature extraction rather than rotating images during pre-processing. The experimental results are summarized in Table I. The baseline seemed to outperform the subtract-skew based system, but the statistical significance test showed that this performance difference (0.43%) is not significant. In other words, these two systems performed similarly. When we chose to rotate the contours during feature extraction, we obtained a performance gain (1.4%) that is statistically significant (at a confidence level of 99%). This result validated our hypothesis that the damage caused by rotating bitmaps during traditional pre-processing can be avoided by handling page skew during feature extraction instead.
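The significance statements above rely on the McNemar test described in Section IV-B. A minimal sketch of that computation follows; the function names are illustrative, the disagreement counts in the usage lines are hypothetical and not taken from the experiments, and 6.635 is the standard χ² critical value for one degree of freedom at the 0.01 level.

```python
CHI2_1DOF_CRITICAL_99 = 6.635  # chi-square critical value, 1 dof, alpha = 0.01


def mcnemar_statistic(n01, n10):
    """McNemar statistic with continuity correction.

    n01: samples misclassified by the proposed system only.
    n10: samples misclassified by the baseline system only.
    """
    if n01 + n10 == 0:
        raise ValueError("the two systems never disagree; test is undefined")
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)


def is_significant(n01, n10, critical=CHI2_1DOF_CRITICAL_99):
    """True when the null hypothesis of equal performance is rejected."""
    return mcnemar_statistic(n01, n10) > critical


# Hypothetical disagreement counts, for illustration only.
print(mcnemar_statistic(40, 80), is_significant(40, 80))
print(mcnemar_statistic(50, 55), is_significant(50, 55))
```

Only the samples on which the two systems disagree enter the statistic, which is why two systems with similar overall accuracy can still differ significantly, or fail to, depending on how their errors overlap.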
In the following discussion, we use only the transform-contour based system for performance comparison.

Next, we show why we chose to use the full PDF matrix as the feature vector for writer ID. In Bulacu and Schomaker's work [7], half of the matrix is used as a feature vector, on the grounds that the other half contains only redundant information. This idea assumes the contours are symmetric with respect to the horizontal axis, so that the PDF matrix is symmetric. We investigated this by rotating a standard ellipse by different angles for feature extraction. For each skew angle in [-1.0°, 1.0°], we computed the accumulated transpose matrix distance D in the PDF matrix M:

$$D = \sum_{i=1}^{2n} \sum_{j=i+1}^{2n} \left| M[i][j] - M[j][i] \right| \qquad (1)$$

where n = 12. Then, we computed the average distance for each bin in the skew range. Likewise, we also computed this metric using all the text line images in our evaluation dataset. The difference between the two is shown in Figure 6.

Fig. 6: Transposed PDF matrix distance of different objects, plotted against page skew in degrees. The ellipse is rotated at different angles in [-1.0°, 1.0°] and the other curve is summarized over the evaluation dataset.

Although the PDF matrix of the text lines seemed symmetric, the distance between its transposed elements is significantly larger than that of a truly symmetric object. Hence, we considered that this difference might be useful information to exploit. The results in Table II validated our hypothesis, showing that the subtle differences in the transposed entries (shown in Figure 6) played an important role in identifying writers. All methods obtained a large performance gain over the baseline, which used only half of the PDF matrix.

Finally, we discuss a subtle but practical issue that arises when dealing with page skew using angle subtraction during feature extraction.
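Before turning to that issue, the transpose distance of Eq. (1) and the half versus full feature vectors can be made concrete with a short sketch. It assumes that Eq. (1) sums absolute differences between transposed entries, that the PDF matrix is stored as a 24 × 24 list of lists indexed as M[φ1][φ2], and that the half vector keeps the entries with φ2 ≥ φ1. The function names are illustrative.

```python
def transpose_distance(M):
    """Accumulated distance between transposed entries of a square matrix."""
    size = len(M)
    return sum(
        abs(M[i][j] - M[j][i])
        for i in range(size)
        for j in range(i + 1, size)
    )


def full_feature(M):
    """Full PDF: every entry of the matrix, row by row."""
    return [value for row in M for value in row]


def half_feature(M):
    """Half PDF: entries with phi2 >= phi1 (upper triangle plus diagonal)."""
    size = len(M)
    return [M[i][j] for i in range(size) for j in range(i, size)]


n = 12
size = 2 * n
# A symmetric toy matrix: its transpose distance is zero by construction.
symmetric = [[float(i + j) for j in range(size)] for i in range(size)]
print(len(half_feature(symmetric)), len(full_feature(symmetric)))
print(transpose_distance(symmetric))
```

For n = 12 the two vectors have n(2n + 1) = 300 and (2n)^2 = 576 entries. Whenever the transpose distance is non-zero, the half vector drops information that the full vector retains.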
One strategy for quantizing the angular plane is shown in Figure 2. It separates positive and negative angles, which works fine when no rulings are present. It will, however, cause problems when rulings are prevalent, as in our evaluation dataset. Without loss of generality, suppose the contours on ruling lines are traversed in the counter-clockwise direction. Since the page skew is usually small in our dataset (within ±1°), locally the contour hinge segments are always horizontal, and thus the matrix indices are (φ1, φ2) = (0, 12) and (φ1, φ2) = (12, 0) (Figure 2). For positive skew, subtracting it bumps the indices to (23, 11) and (11, 23), respectively. For negative skew, however, subtracting it from the hinge segment angles does not change the indices. This is undesirable because the PDF matrices now vary significantly merely because of the direction of page skew. The same holds when the contours are traversed in the other direction.

There are two ways to solve this issue. In addition to detecting rulings as we did in our experiments, we also tried adjusting the quantization strategy so that the 0th bin covers both positive and negative angles. This was done by rotating the x-/y-axes counter-clockwise by 15°/2 = 7.5°. After this adjustment, we repeated the experiment with the baseline and with the transform-contour based system, and obtained significant performance gains, as shown in Table II. Again, the transform-contour based system outperformed the deskewing-based system with statistical significance.

B. Statistical Significance Test

In the experimental evaluation, all performance losses or gains were validated using the McNemar test [26]:

$$Z^2 = \frac{\left( \left| n_{01} - n_{10} \right| - 1 \right)^2}{n_{10} + n_{01}} \qquad (2)$$

where we first divided the misclassified samples into two groups and then stated the hypothesis test (F0 and F1 denote the performance of the baseline system and the proposed system, respectively):

n01: number of samples misclassified by the proposed system, but not by the baseline system.
n10: number of samples misclassified by the baseline system, but not by the proposed system.
Null hypothesis H0: F0 = F1.
Alternative hypothesis H1: F0 < F1.

The test statistic Z^2 approximately follows the χ² distribution with 1 degree of freedom. Looking this up in the χ² table, we concluded that the performance gains we obtained and reported here are statistically significant at a confidence level of 99%.

V. CONCLUSION

Traditional pre-processing techniques usually modify the original image before later stages of document analysis are applied. For example, images are modified by rotating bitmaps during deskewing. In this paper, we investigated the impact of image rotation and proposed methods to compensate for page skew while retaining all the information in the original image.
Experimental results involving 61 writers with 610 Arabic handwritten documents showed that our methods performed better than the deskewing-based method. In addition, we examined the complexity of feature extraction when dealing with pre-printed rulings and showed how to adapt the quantization strategy, as well as the benefits of exploiting the full PDF matrix for writer ID. For future work, we plan to examine the fundamental reasons why deskewing tends to modify the contour characteristics of handwritten text lines.

ACKNOWLEDGEMENT

The authors acknowledge insightful discussions with George Nagy on the idea of not altering input images through pre-processing. This work is supported by a DARPA IPTO grant administered by Raytheon BBN Technologies.

REFERENCES

[1] A. Schlapbach and H. Bunke, "Writer identification using an HMM-based handwriting recognition system: to normalize the input or not?" in Proc. of the 12th international Graphonomics Society, 2005, pp. 138-142.
[2] R. Sabourin and J. Drouhard, "Off-line signature verification using directional PDF and Neural Networks," in Proc. the International Conference on Pattern Recognition, Vancouver, BC, Canada, 1992, pp. 321-325.
[3] E. Zois and V. Anastassopoulos, "Morphological waveform coding for writer identification," Pattern Recognition, vol. 33, pp. 385-398, 2000.
[4] H. Said, T. Tan, and K. Baker, "Personal identification based on handwriting," Pattern Recognition, vol. 33, pp. 149-160, 2000.
[5] C. Hertel and H. Bunke, "A set of novel features for writer identification," J. Kittler and M. Nixon, Eds. Springer, 1998.
[6] B. Li, Z. Sun, and T. Tan, "Hierarchical shape primitive features for online text-independent writer identification," in Proc. 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, August 2009, pp. 986-990.
[7] M. Bulacu and L. Schomaker, "Text-independent writer identification and verification using textural and allographic features," IEEE Transaction on Pattern Analysis and Machine Intelligence, vol. 29, pp. 701-717, 2007.
[8] S. Fiel and R. Sablatnig, "Writer retrieval and writer identification using local features," in Proceedings of the 10th International Workshop on Document Analysis Systems, 2012, pp. 145-149.
[9] A. Schlapbach and H. Bunke, "A writer identification and verification system using HMM based recognizers," Pattern Analysis and Application, vol. 10, pp. 33-43, 2007.
[10] Y. Yamazaki, T. Nagao, and N. Komatsu, "Text-indicated writer verification using Hidden Markov Models," in Proc. International Conference on Document Analysis and Recognition, 2003, pp. 329-332.
[11] A. Schlapbach and H. Bunke, "Off-line writer identification using Gaussian Mixture Models," in Proc. of the 18th International Conference on Pattern Recognition, 2006, pp. 992-995.
[12] A. Schlapbach and H. Bunke, "Off-line writer identification and verification using Gaussian Mixture Models," Studies in Computational Intelligence, vol. 90, pp. 409-428, 2008.
[13] E. Justino, F. Bortolozzi, and R. Sabourin, "A comparison of SVM and HMM classifiers in the off-line signature verification," Pattern Recognition Letters, vol. 26, pp. 1377-1385, 2004.
[14] A. Bensefia, T. Paquet, and L. Heutte, "A writer identification and verification system," Pattern Recognition Letters, vol. 26, pp. 2080-2092, 2005.
[15] I. Siddiqi and N. Vincent, "A set of chain code based features for writer recognition," in Proc. the 10th international Conference on Document Analysis and Recognition, 2009, pp. 981-985.
[16] G. Louloudis, N. Stamatopoulos, and B. Gatos, "ICDAR 2011 writer identification contest," in Proceedings of the 19th International Conference on Document Analysis and Recognition, 2011, pp. 1475-1479.
[17] H. Cao and V. Govindaraju, "Handwritten carbon form preprocessing based on markov random field," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[18] W. Abd-Almageed, J. Kumar, and D. Doermann, "Page rule-line removal using linear subspaces in monochromatic handwritten Arabic documents," in Proc. of the 12th International Conference on Document Analysis and Recognition, 2009, pp. 768-772.
[19] K. Arvind, J. Kumar, and A. Ramakrishnan, "Line removal and restoration of handwritten strokes," in Proc. of the 7th international Conference on Computational Intelligence and Multimedia Application, 2007, pp. 208-214.
[20] H. Cao, R. Prasad, and P. Natarajan, "A stroke regeneration method for cleaning rule-lines in handwritten document images," in Proc. of the MOCR workshop at the 10th international Conference on Document Analysis and Recognition, 2009.
[21] J. Chen and D. Lopresti, "Exploiting ruling line artifacts in writer identification," in Proceedings of the 2012 21st International Conference on Pattern Recognition, September 2012, pp. 3737-3740.
[22] J. Chen and D. Lopresti, "A model-based ruling line detection algorithm for noisy handwritten documents," in Proceedings of the 11th International Conference on Document Analysis and Recognition, September 2011, pp. 404-408.
[23] B. Gatos, D. Danatsas, I. Pratikakis, and S. Perantonis, "Automatic table detection in document images," in Proceedings of the Third International Conference on Advances in Pattern Recognition, 2005, pp. 609-618.
[24] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[25] S. Strassel, "Linguistic resources for Arabic handwriting recognition," in Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 2009.
[26] T. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, pp. 1895-1923, 1998.