The Impact of Ruling Lines on Writer Identification


Jin Chen, Lehigh University, Bethlehem, PA 18015, USA
Daniel Lopresti, Lehigh University, Bethlehem, PA 18015, USA
Ergina Kavallieratou, University of Aegean, Samos, Greece

Abstract

Paper often includes pre-printed ruling lines to help people write more neatly. This particular example of real-world noise can, however, have a serious impact on applications such as handwriting recognition and writer identification. In this work, we investigate the effects of ruling lines on writer ID. We study a method for detecting and removing ruling lines and test its utility for Arabic writer identification through a series of experiments. Our preliminary results show that under realistic assumptions where ruling lines are expected to have different properties across the collection, e.g., thickness, spacing, etc., removing them significantly improves identification performance. We conclude with a discussion of work-in-progress to examine follow-up questions raised by our initial investigations.

Keywords: Writer Identification; Ruling-line Artifacts

I. INTRODUCTION

Writer identification is the task where, given a query and a set of known writers, the system attempts to output the identity of the author of the handwriting. In general, the output is a list of potential authors with associated confidence scores in descending order [1], [2]. Sometimes, a rejection option is available as well [2]. If any text content is used for identity establishment, the identification task is usually called text-dependent; otherwise, it is text-independent.

Since the survey by Plamondon and Lorette that summarized the state of the art in 1989 [3], there have been significant improvements in the field. Researchers have investigated the problem in English [1], Chinese [4], Arabic [5], and many other languages [6], [7].
In terms of recognition techniques, K-nearest Neighbors [8], Neural Networks [9], Hidden Markov Models [2], and Support Vector Machines [10] have been found to be useful in discriminating writers.

Unlike the case in handwriting recognition, writer identification strives to preserve as much as possible the characteristics of inter-writer variation. Schlapbach and Bunke investigated three normalization steps in off-line English writer identification [11]: character width normalization, vertical scaling, and slant correction. From experimentation, the authors observed that slant correction is likely to hamper performance.

Although extensive research has been done in writer identification, previous work has assumed that the input is provided on clean, unlined paper. This may be a reasonable simplifying assumption to start with, but it is not likely to hold in practice. Provided as helpful guides on writing pads and invoice forms, ruling lines often overlap with the handwriting of interest. It is, therefore, understood that ruling lines must be removed before attempting handwriting recognition, and past work has attempted to address this need in that field. However, the impact of ruling lines on writer identification has not yet received similar attention.

In this paper, we take the first steps in studying the impact of ruling lines on the task of writer identification. Using a database of Arabic offline handwritten documents collected by the Linguistic Data Consortium (LDC) [12], we first split it into groups that contain ruling-line-only, ruling-line-free, and mixed handwritten text lines. Then we run SVM classifiers on different settings of training and testing datasets. Our experimental results show that under realistic assumptions where ruling lines are expected to have different properties across the collection, e.g., thickness, spacing, etc., removing them significantly improves identification performance.
On the other hand, in an artificial setting where ruling lines are always present and have the same properties across all of the samples, removing them hampers identification performance, as might be expected.

The remainder of the paper is organized as follows. We first discuss related work on ruling line removal for handwriting recognition in Section II. In Section III we explain the method we have chosen to use in our studies. Next, we describe our experimental setup in Section IV. We present some preliminary experimental results in Section V, and conclude with a discussion of ongoing work in Section VI.

II. RELATED WORK

There has been much work in the field of forms processing and handwriting recognition that deals with ruling lines. For example, in Cao and Govindaraju's work [13], a patch-based Markov random field (MRF) shape model was trained on pre-defined shape patterns to recover the deformations of image shapes. The authors reported significant improvements over traditional approaches using a database of handwritten carbon forms. However, the method is computationally expensive to scale up.

Abd-Almageed, et al., introduced a ruling line removal algorithm based on modeling in linear subspaces [14]. In

the training phase, they used ruling-line-only pages, while in the testing stage, they first projected feature vectors into the subspaces and then computed the reconstruction error. They implemented a synthetic evaluation scheme and obtained approximately 88% for both recall and precision.

Arvind, et al., proposed a rule-based method that first detected the ruling lines within segmented handwritten blocks by computing the horizontal projection profile [15]. Based on minimizing the profile entropy, the authors computed the skew angle and detected the positions of ruling lines by investigating the peak positions in the horizontal projection profile. They performed run-length analysis to determine which pixels belonged to the ruling lines. After removing the lines, they designed an algorithm to correct broken strokes. They observed an accuracy of 86.33%.

As an improvement, Cao, et al., modified Arvind's algorithm to recover deformations of the shape of the handwriting [16]. They also introduced a simple technique to detect falsely connected sections after removing ruling lines. They achieved a word error rate (WER) reduction of 15.5% using 57 pages. In addition, their algorithms achieved a mismatch ratio of 1.37% on synthesized clean handwriting and ruling-line-only images.

Although an ultimate goal of our work is to develop ruling line removal algorithms appropriate for writer identification applications, this is not the subject of our current paper. Thus, we adapt the algorithms from [15] and [16] with a few necessary changes. In the following section, we briefly describe the particular approach we have implemented.

III. RULING LINE REMOVAL

In our context, the ruling lines on a page can be represented by a model-based approach, which attempts to capture all of the lines with a compact set of parameters, or a line-by-line approach, which considers each ruling line in isolation. Model-based ruling line removal has several attractive properties, such as the ability to handle light and largely incomplete lines. However, the line-by-line approach is a more general one in that it can operate on various kinds of handwritten documents: not only writing pads, but also forms, tables, etc. In our experiments, we adopt the line-by-line approach, in which we first decompose the page into pre-identified text lines. It should be clear, however, how the method generalizes to full-page images. Figure 1 shows an example of overlapping ruling lines and handwriting from our test collection.

Figure 1: Arabic handwriting on a page with ruling lines.

A. Ruling Line Detection

After applying a generic median filter to remove scanning noise, we detect the positions of ruling lines and the skew simultaneously. The idea is from the work of Arvind, et al. [15], where the underlying assumption is that ruling lines usually dominate the horizontal projection profiles (HPPs). We first estimate a global skew interval of handwritten lines (±12°), then we rotate the line images by each skew angle (1° at a time) and compute the HPPs. Within each profile, the entropy is computed by the following equation:

    $E = -\sum_{i} HPP(i) \log(HPP(i))$    (1)

where the row index i ranges from 0 to the height of the rotated image and HPP(i) is the pixel count in row i. The problem of finding ruling lines thus becomes an entropy minimization problem. The result of this step is illustrated in Figure 2b.

Next, we estimate the line thickness by investigating the histogram of vertical run-lengths [17]. We select the peak value as the estimate. Then, we analyze all horizontal run-lengths around the position and run a least-squares line fitting to acquire an optimal ruling line (one pixel wide). Finally, all horizontal run-lengths around this central ruling line are considered to be part of the line. Figure 2c shows the detected ruling line in a sample.

B. Stroke Deformation Recovery

Once the ruling line pixels have been identified, we remove them by assigning white pixels accordingly. The next problem is to recover broken handwritten strokes. Following the strategy from [16], there are three sub-steps: broken stroke reconnection, thinned stroke recovery, and U-shape pattern detection and stroke regeneration. We briefly describe each of these steps for completeness. Here the term "sections" means the stroke segments that are created by removing ruling lines.

Broken strokes are recognized by computing the distances between sections above the ruling line and those below it. If the distance between two sections is within a pre-defined threshold and their lengths are comparable to the stroke thickness, we consider them broken stroke segments and reconnect them by drawing a trapezoid. Otherwise, we draw

a parallelogram connecting the shorter section to the nearest end of the longer one.

Thinned strokes are caused by cases where horizontal strokes have partially overlapped with a ruling line. One solution is to examine each section around the line to determine whether its vertical run-lengths are significantly shorter than the estimated handwriting thickness. If so, we draw extra ink pixels column by column in the direction of the ruling line.

U-shape recovery is more difficult. The idea is to examine two sections that are on the same side of the ruling line and also close to each other. If two sections form a U-shape pattern and the imaginary bottom line of the U-shape is around the position of a line, we consider them caused by removal of the ruling line and thus in need of recovery. The particular stroke recovery method is slightly different from either [15] or [16]: we draw a straight line at the middle part of two sections and partial ellipses at the ends to make the artificial strokes more natural. As an example, the result of stroke deformation recovery is shown in Figure 2d.

Figure 2: Illustration of ruling line detection, removal, and deformation recovery. (a) Original line image. (b) The horizontal projection after median filtering and skew correction. (c) Detected ruling line. In the zoomed-in segment, red pixels mark the central position of the ruling line and blue ones are run-lengths associated with the ruling line. (d) The result of ruling line removal.

IV. EXPERIMENTAL SETUP

Turning to our experiment, we first introduce the database we are working on in Section IV-A, then we discuss our usage of contour-hinge features and the SVM classifier in Section IV-B.

A. Data Preparation

The Arabic database we are working on is from the DARPA MADCAT project as provided by the Linguistic Data Consortium (LDC). In the current release, there are 7,447 Arabic handwritten document files scanned at 600 dpi and then binarized.
The 70 writers are native Arabic speakers. We first partition the database based on the presence of ruling lines on a given page. Next, we extract line images according to the ground-truth file associated with each document. To ensure sufficient data for training and testing, we filter out 10 writers who have a very small number of handwritten pages that contain ruling lines. Next, we cluster each writer's text line images by their document page IDs. Because of the uneven distribution of the numbers of handwritten documents among the remaining 60 writers, we would only utilize a small portion of our data if we were to select document pages first and then divide these pages into lines. Instead, we divide document pages into text lines and then select from each writer 100 text lines which include ruling lines and 500 text lines which do not. During the text line selection, we ensure that there is at most one single page that straddles both training and testing datasets. In the end, we combine 40 text lines per writer from the ruling-line-only dataset and 40 text lines per writer from the ruling-line-free dataset to

be the mixed dataset for experimentation. All datasets have the same 60 writers and there are no overlapping samples between them.

To avoid biased sampling, we split each writer's handwritten lines into four disjoint subsets to conduct four-fold cross-validation. The data are thus equally divided into four subsets, and each subset in turn serves as a testing set with the remaining three as the training set. The results are then computed as the average performance over all folds. In this way, we ensure that each sample is used for both training and testing, and is tested exactly once. As a control experiment, we generate a line image dataset that is only pre-processed by a generic median filter to exclude the scanning noise. A breakdown of all three datasets is shown in Table I.

Table I: Datasets used in our experiments. Sample sizes are in text lines.

    Dataset             Training     Testing      Total
    Ruling-line-only    2,[...]      [...]        [...],600
    Ruling-line-free    20,700       6,900        27,600
    Mixed               3,600        1,200        4,800

B. Feature Extraction and Classification

There has been extensive work in the literature studying feature extraction for writer identification [18], [1]. Here we implement one particular set of features from Bulacu and Schomaker's work [1]. In this set of so-called contour-hinge features, for each two adjacent segments (5 pixels long) along the contours, their angles against the horizontal axis are computed and serve as two random variables. By quantizing the angle plane ([0, 2π)) into 24 bins, we accumulate the count in each bin as we traverse all contours. As those authors did in their work, we only consider cases where the second angle is no smaller than the first. In the end, we normalize the distribution table to compute the joint probability distribution function as a feature vector (300-dimensional).

For classification, we use Support Vector Machines (SVMs) for writer identification.
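The contour-hinge computation described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes contours have already been traced and are given as ordered lists of (x, y) pixels, and it assumes that the two legs of a hinge run from a contour pixel to the pixels 5 steps ahead and 5 steps behind it. The names `angle_bin` and `contour_hinge` are hypothetical.

```python
import math

N_BINS = 24  # angle bins over [0, 2*pi), as stated in the text
LEG = 5      # leg length in contour pixels, as stated in the text


def angle_bin(dx, dy, n_bins=N_BINS):
    """Quantize the direction (dx, dy) into one of n_bins bins over [0, 2*pi)."""
    phi = math.atan2(dy, dx) % (2.0 * math.pi)
    # Guard against phi rounding up to exactly 2*pi.
    return min(int(phi / (2.0 * math.pi) * n_bins), n_bins - 1)


def contour_hinge(contours, n_bins=N_BINS, leg=LEG):
    """Return the normalized joint histogram of hinge angle pairs.

    `contours` is a list of closed contours, each an ordered list of (x, y)
    points (an assumed input format). Only pairs whose second angle bin is
    no smaller than the first are counted, so the flattened upper triangle
    has n_bins * (n_bins + 1) / 2 entries (300 for 24 bins).
    """
    counts = [[0] * n_bins for _ in range(n_bins)]
    for pts in contours:
        n = len(pts)
        if n < 2 * leg + 1:
            continue  # too short to form a hinge
        for i in range(n):
            x0, y0 = pts[i]
            xa, ya = pts[(i + leg) % n]  # end of the forward leg
            xb, yb = pts[(i - leg) % n]  # end of the backward leg
            b1 = angle_bin(xa - x0, ya - y0, n_bins)
            b2 = angle_bin(xb - x0, yb - y0, n_bins)
            if b2 >= b1:
                counts[b1][b2] += 1
    flat = [counts[a][b] for a in range(n_bins) for b in range(a, n_bins)]
    total = sum(flat)
    return [c / total for c in flat] if total else [0.0] * len(flat)
```

With 24 bins, the upper triangle including the diagonal holds 24 × 25 / 2 = 300 entries, which matches the 300-dimensional feature vector mentioned in the text.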
An SVM constructs a hyperplane with maximum margin in a higher dimensional vector space, where a classification problem that is not linearly separable in the original vector space may become linearly separable after projecting the feature vectors into the higher dimensional space by a mapping function. The functions that define these mappings are called kernels in the literature. The choice of kernel is critical for determining how to perform the projection into higher dimensional spaces. Commonly used kernels are the linear, polynomial, radial basis function (RBF), and Gaussian radial basis kernels:

    $K(x, y) = x \cdot y$    (2)
    $K(x, y) = (x \cdot y + 1)^d$    (3)
    $K(x, y) = \exp(-\gamma \|x - y\|^2), \quad \gamma > 0$    (4)
    $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$    (5)

Note that the generic form of SVM is only applicable to 2-class classification. The common way of using it for multi-class classification is to run k(k - 1)/2 2-class classifiers, where k is the number of classes, and then vote for a multi-class decision using the outputs of all the 2-class classifiers.

In our experiments, we employ the libsvm tool [19]. We use the RBF kernel because it offers better discriminability than the linear kernel, while using fewer parameters than the polynomial kernel. From our experimental results, we found that setting the cost c = 10000 performed best. To facilitate SVM training and testing, we normalize feature vectors into the unit hypercube. In addition, there is also a probability output option available for us to compute the Top-N list.

V. EXPERIMENTAL RESULTS

Since we want to investigate the impact of ruling lines on writer identification, we treat the feature extraction and classification steps as black boxes. We run the classifier under different conditions: ruling-line-only, ruling-line-free, and mixed datasets. The results are summarized in a 3 × 3 matrix as shown in Table II. In this table, bold figures indicate significant improvements when ruling lines are removed.
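A few quantities from the classification setup in Section IV-B, together with the Top-N rates discussed in the results, can be made concrete with a small sketch. This is plain Python written for illustration rather than a call into libsvm: all function names are hypothetical, and the per-sample probability dictionaries are an assumed stand-in for the classifier's probability output.

```python
import math


def num_pairwise_classifiers(k):
    """Number of 2-class SVMs in a pairwise voting scheme with k classes."""
    return k * (k - 1) // 2


def rbf_kernel(x, y, gamma):
    """RBF kernel of Eq. (4): exp(-gamma * ||x - y||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def scale_to_unit_hypercube(vectors):
    """Min-max scale every feature dimension into [0, 1].

    Constant dimensions are mapped to 0.0 to avoid division by zero.
    """
    dims = range(len(vectors[0]))
    lo = [min(v[d] for v in vectors) for d in dims]
    hi = [max(v[d] for v in vectors) for d in dims]
    return [[(v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
             for d in dims] for v in vectors]


def top_n_rate(prob_rows, true_labels, n):
    """Fraction of samples whose true writer is among the n most probable.

    prob_rows[i] maps writer id -> estimated probability for sample i.
    """
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:n]:
            hits += 1
    return hits / len(true_labels)
```

For the 60 writers used here, `num_pairwise_classifiers(60)` gives 1,770 two-class classifiers. In practice, the scaling bounds would normally be estimated on the training set and reused for the test set; the sketch scales a single collection for brevity.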
For convenience in the following discussion, we use the notation E(train/test) to represent an experiment that trains on some dataset and tests on another (or the same) dataset. For example, E(RLO/RLF) means the experiment that trains on the ruling-line-only dataset and tests on the ruling-line-free dataset. It is clear that removing ruling lines in E(RLO/RLF), E(RLF/RLO), and E(RLO/M) gives us significant improvements over the control groups. Plotting the latter two in Figure 3, we find that the performance increases quickly within the Top-10 choices. In addition, both experiments have an identification rate greater than 90% for the Top-10 choices.

The fact that E(RLF/RLO) outperforms E(RLO/RLF) might be due to the fact that the number of samples in the ruling-line-free dataset significantly exceeds that in the ruling-line-only dataset. It is generally accepted that a more extensively trained classifier will outperform one with less training. However, more experiments are needed to validate this hypothesis in this case.

It is no surprise that removing ruling lines in the mixed dataset does not make a difference in E(M/RLO). This is because ruling lines in the control group and cleaned line images in our experiment are both modeled in the training set. However, it is interesting to observe that this experiment gives the best performance

across the table. Moreover, the classification performance in E(RLF/RLF) is not surprising, since trying to remove ruling lines in a dataset that is free of ruling lines does not affect any data. Therefore, the writer identification rates for E(RLF/RLF) and the control group are identical.

As previously stated, we employed four-fold cross-validation, and the figures shown in the table are the mean values over all of the folds. Since in data preparation we ensure that each fold is randomly generated, the classification results across folds are quite close to one another. As a quantitative measure, the standard deviations for E(RLO/RLO) are only 0.01 (control group) and 0.01 (experimental group), respectively.

It is also interesting to see that there is a slight performance loss in E(RLO/RLO), where ruling lines always appear in both training and testing. As we suggested earlier, removing ruling lines will improve performance when lines are expected to have different properties, e.g., they are not always present, have different thicknesses, spacings, etc. Thus in an artificial experiment such as this, a classifier might effectively treat the ruling lines as another feature to be used in identifying the writer in question. Of course, this is unlikely to be useful in practice since it is unrealistic to assume two pages drawn from different sources will have the same properties.

Table II: Writer identification accuracy under different training and testing conditions. The first figure in each cell represents the control group (no attempt to remove ruling lines) and the second figure represents the experimental group (ruling lines, if any, removed). Rows give the training dataset and columns give the testing dataset.

    Training \ Testing        Ruling-line-only (RLO)   Ruling-line-free (RLF)   Mixed (M)
    Ruling-line-only (RLO)    (62.5%, 58.0%)           (31.6%, 50.2%)           (65.7%, 74.3%)
    Ruling-line-free (RLF)    (49.0%, 54.9%)           (74.7%, 74.7%)           (62.7%, 65.4%)
    Mixed (M)                 (87.5%, 86.3%)           (62.2%, 64.6%)           (62.0%, 61.0%)

Figure 3: Performance gains for ruling line removal. The plot, titled "Performance Gains using Different Datasets", shows identification rate against Top-N choice for Exp(RLO/RLF), Exp(RLO/RLF Control), Exp(RLO/M), and Exp(RLO/M Control). Dashed curves are performance of the control groups.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have investigated the impact of ruling lines on writer identification, demonstrating our work using a collection of Arabic handwritten documents. The algorithm we used detects ruling lines by computing horizontal projection profiles while correcting the skew simultaneously. We then applied a series of post-processing steps that try to correct for deformations caused by the line removal. By testing on ruling-line-only, ruling-line-free, and mixed datasets, we found that removing ruling lines is useful for improving writer identification performance.

To date, we have investigated one algorithm for ruling line detection and removal. It would be interesting to know how other ruling line removal algorithms perform in the context of writer identification. As ongoing work, we are examining a model-based line removal method [20] that operates at the page level. This approach takes advantage of the properties of pre-printed ruling lines and does not require extensive training.

It would also be interesting to investigate the impact of ruling lines with different properties (layouts, thicknesses, spacings, etc.) on writer identification. We are currently collecting various blank writing pads with pre-printed ruling lines and using them to synthesize page images. When this ground truth is ready, we will be able to generate a large collection of data for training and testing purposes. This will allow us to compute better statistics regarding the performance of ruling line removal algorithms.

VII. ACKNOWLEDGMENT

This work is supported by a DARPA IPTO grant administered by Raytheon BBN Technologies.

REFERENCES

[1] M. Bulacu and L. Schomaker, "Text-independent writer identification and verification using textural and allographic features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, 2007.
[2] A. Schlapbach and H. Bunke, "A writer identification and verification system using HMM based recognizers," Pattern Analysis and Applications, vol. 10, pp. 33-43, 2007.
[3] R. Plamondon and G. Lorette, "Automatic signature verification and writer identification - the state of the art," Pattern Recognition, vol. 22, 1989.
[4] X. Li and X. Ding, "Writer identification of Chinese handwriting using grid microstructure feature," in ICB, 2009.
[5] A. Ayman and Z. R. Abu, "Arabic writer identification based on hybrid spectral-statistical measures," Journal of Experimental and Theoretical Artificial Intelligence, vol. 19, no. 4.
[6] B. Helli and M. Moghadam, "Persian writer identification using extended Gabor filter." Heidelberg: Springer.
[7] U. Garain and T. Paquet, "Off-line multi-script writer identification using AR coefficients," in Proc. of the 10th International Conference on Document Analysis and Recognition, 2009.
[8] B. Li, Z. Sun, and T. Tan, "Hierarchical shape primitive features for online text-independent writer identification," in Proc. 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, August 2009.
[9] R. Sabourin and J. Drouhard, "Off-line signature verification using directional PDF and neural networks," in Proc. the International Conference on Pattern Recognition, Vancouver, BC, Canada, 1992.
[10] E. Justino, F. Bortolozzi, and R. Sabourin, "A comparison of SVM and HMM classifiers in the off-line signature verification," Pattern Recognition Letters, vol. 26, 2004.
[11] A. Schlapbach and H. Bunke, "Writer identification using an HMM-based handwriting recognition system: to normalize the input or not?" in Proc. of the 12th International Graphonomics Society, 2005.
[12] The Linguistic Data Consortium.
[13] H. Cao and V. Govindaraju, "Handwritten carbon form preprocessing based on Markov random field," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
[14] W. Abd-Almageed, J. Kumar, and D. Doermann, "Page rule-line removal using linear subspaces in monochromatic handwritten Arabic documents," in Proc. of the 12th International Conference on Document Analysis and Recognition, 2009.
[15] K. Arvind, J. Juman, and A. Ramakrishnan, "Line removal and restoration of handwritten strokes," in Proc. of the 7th International Conference on Computational Intelligence and Multimedia Application, 2007.
[16] H. Cao, R. Prasad, and P. Natarajan, "A stroke regeneration method for cleaning rule-lines in handwritten document images," in Proc. of the MOCR Workshop at the 10th International Conference on Document Analysis and Recognition, 2007.
[17] G. Kim and V. Govindaraju, "A lexicon driven approach to handwritten word recognition for real-time applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, 1997.
[18] S. Srihari, S. Cha, H. Arora, and S. Lee, "Individuality of handwriting," Journal of Forensic Science, vol. 47, pp. 1-17.
[19] C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software available at cjlin/libsvm.
[20] D. Lopresti and E. Kavallieratou, "Ruling line removal in handwritten page images," in Proc. of the 20th International Conference on Pattern Recognition, 2010, accepted for publication.
