Detecting Dense Foreground Stripes in Arabic Handwriting for Accurate Baseline Positioning


Felix Stahlberg
Qatar Computing Research Institute, HBKU, Doha, Qatar
Email: fstahlberg@qf.org.

Stephan Vogel
Qatar Computing Research Institute, HBKU, Doha, Qatar

Abstract

Since Arabic script has a strong baseline, many state-of-the-art recognition systems for handwritten Arabic make use of baseline-dependent features. For printed Arabic, the baseline can be detected reliably by finding the maximum in the horizontal projection profile or in the Hough-transformed image. However, the performance of these methods drops significantly on handwritten Arabic. In this work, we present a novel approach to baseline detection in handwritten Arabic which is based on the detection of stripes in the image with dense foreground. Such a stripe usually corresponds to the area between lower and upper baseline. Our method outperforms a previous method by 22.4% relative for the task of finding acceptable baselines in Tunisian town names in the IFN/ENIT database.

I. INTRODUCTION

Robust recognition of handwritten Arabic usually relies on sophisticated normalization procedures in the preprocessing step. Common preprocessing operations include line thinning, skew correction, and size normalization. For Arabic, many recognition systems also require the extraction of the baselines. Arabic script features two baselines: the lower and the upper baseline. Figure 1 gives an example. The presence or absence of ascenders (above the upper baseline) and descenders (below the lower baseline) is a useful feature for recognition systems. Furthermore, concavity and foreground pixel distribution features turned out to be useful when calculated separately for the regions below and above the lower baseline [1]. Therefore, the accuracy of baseline estimation has a direct impact on the final performance of many state-of-the-art recognition systems.
In this paper, we automatically extract the lower baseline in handwritten Tunisian town names in the IFN/ENIT database [2]. Our method assumes that the zone between both baselines (core zone) usually includes a large fraction of all foreground pixels, i.e. it is a dense foreground region in the image. This assumption is justified by the fact that ascenders and descenders usually contribute relatively few foreground pixels compared to the core zone. It may hold even for scripts other than Arabic that have two baselines (e.g. Latin).

Fig. 1. Upper and lower baselines in Arabic script.

II. RELATED WORK

Since the position of the lower baseline gives valuable hints to the recognition system, much research effort has been devoted to automatically finding the (lower) baseline in an image. Recently, neural nets have been used successfully for baseline estimation in Latin script [3], [4]. The baseline in horizontal printed Arabic text can be estimated reliably using the horizontal projection method. If the orientation of the printed text line is unknown, the maximum in the Hough space [5] is often assumed to correspond to the baseline. Both methods exploit the fact that most Arabic letters contain many pixels along the baseline [1]. However, for short and/or handwritten Arabic words, these approaches are no longer reliable. The authors of [6] report that for more than a fifth of the words in the IFN/ENIT database the maximum in the Hough space does not match the baseline position. More accurate baselines can be found by approximating the script skeleton with piecewise linear curves [7], [8]. Other more recent approaches include [9], [10], [11], [12], [13], [14]. A comparative overview of different methods for Arabic can be found in [15]. Unfortunately, the evaluation of the presented methods is often limited to example images and does not include standardized test sets or error measures, which makes them difficult to compare.
The authors of [6] define the Baseline Error as the difference between the detected baseline and the ground truth in pixels, normalized by the image width. According to the judgement of native Arabic speakers, they classify a baseline error greater than 7 pixels as insufficient. They also present a skeleton-based baseline estimation method that produces insufficient baselines for only 12.5% of the words in the IFN/ENIT database [2]. In this work, we propose a novel method for baseline estimation that reduces the fraction of insufficient baselines to 9.7% on the same dataset (22.4% relative improvement).

III. BASELINE DETECTION BASED ON DENSE FOREGROUND STRIPES

Horizontal projection and Hough space maximization assume that the straight line in the image which accumulates the most foreground pixels corresponds to the lower baseline. We relax this assumption and search for stripes in the image that contain a certain fraction of all foreground pixels. We thereby assume that the area between lower and upper baseline (core zone) contains at least a certain fraction of the foreground pixels (ɛ). We additionally require that the maximum of the projection profile of the stripe in the stripe direction lies in the lower half of the stripe. From all stripes that satisfy both constraints we select one according to some optimization criterion (for example the narrowest stripe). We detect the lower baseline at the bottom border of the stripe.

Fig. 2 illustrates the basic idea of our method. The highlighted area marks a stripe which includes 20% of the foreground pixels (ɛ = 20%). The bottom border of the stripe fits the lower baseline in Fig. 1 very accurately. In contrast, the maximum in the Hough space (shown as a dashed line in Fig. 2) does not correspond to the correct baseline position. Note that the upper border of the stripe does not match the upper baseline, i.e. ɛ = 20% underestimates the number of pixels in the core zone. However, this underestimation is not critical since we are only interested in the lower baseline, which is placed correctly.

A. Formal Algorithm Description

Since diacritics and dots are irrelevant for baseline estimation, we remove them from the image by detecting small connected components. Let $P \in \{0, 1\}^{h \times w}$ be the matrix of pixel values in the $h \times w$ image after removing small connected components. We set $P_{y,x} = 0$ for all background pixels and $P_{y,x} = 1$ for all foreground pixels. The coordinate (0, 0) corresponds to the top-left image corner. We formalize a stripe as a triple $(y_1, y_2, m)$ where $y_1$ is the intersection of the upper stripe border and the y-axis, $y_2$ is the intersection of the lower stripe border and the y-axis ($y_1 \le y_2$), and $m$ is the slope of the stripe.
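The diacritic removal mentioned at the start of this subsection could look roughly like the following sketch. It is only an illustration: the function name `remove_small_components`, the use of 8-connectivity, and the size threshold `min_size` are assumptions, since the text does not state how "small" is defined.

```python
from collections import deque


def remove_small_components(P, min_size):
    """Return a copy of the binary image P (list of rows of 0/1) in which every
    8-connected foreground component with fewer than min_size pixels is erased.
    min_size is a hypothetical threshold; the paper does not specify one."""
    h = len(P)
    w = len(P[0]) if h else 0
    out = [row[:] for row in P]
    seen = [[False] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if P[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first search collects one connected component.
            component = [(y0, x0)]
            seen[y0][x0] = True
            queue = deque(component)
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and P[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                            component.append((ny, nx))
            # Small components (dots, diacritics) are set to background.
            if len(component) < min_size:
                for y, x in component:
                    out[y][x] = 0
    return out
```

In practice the threshold would probably have to depend on image resolution and pen width; a library routine for connected-component labelling could replace the hand-written search.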
The function count(·) is the number of foreground pixels in the stripe:

$$\mathrm{count} : (y_1, y_2, m) \mapsto \sum_{x=1}^{w} \; \sum_{y = y_1 + m x}^{y_2 + m x} \begin{cases} P_{y,x} & \text{if } y \in [1, h] \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

The projection project(·) in $m$ direction onto the y-axis can be formulated as follows:

$$\mathrm{project} : (y, m) \mapsto \mathrm{count}(y, y, m) \tag{2}$$

Let $S$ be the set of all stripes which satisfy the requirements postulated above. First, the stripe needs to contain at least a fraction $\epsilon \in \mathbb{R}$ of all foreground pixels:

$$\forall (y_1, y_2, m) \in S : \; \epsilon \le \frac{\mathrm{count}(y_1, y_2, m)}{\sum_{x=1}^{w} \sum_{y=1}^{h} P_{y,x}} \tag{3}$$

Second, the maximum of the projection within the stripe must be in the lower half of the stripe:

$$\forall (y_1, y_2, m) \in S : \; \frac{y_1 + y_2}{2} \le \arg\max_{y \in [y_1, y_2]} \mathrm{project}(y, m) \tag{4}$$

The lower baseline $bl = (y_2, m)$ is detected at the lower border of the best stripe $(y_1, y_2, m) \in S$. We explore two different optimization criteria:

height minimizes the height of the stripe and thereby maximizes its foreground pixel density.

$$\mathrm{height} : S \mapsto \arg\min_{(y_1, y_2, m) \in S} \; y_2 - y_1 \tag{5}$$

min maximizes the minimum in the projection.

$$\mathrm{min} : S \mapsto \arg\max_{(y_1, y_2, m) \in S} \; \min_{y \in [y_1, y_2]} \mathrm{project}(y, m) \tag{6}$$

The motivation behind the min criterion is that all lines within the core zone in slope direction $m$ usually accumulate many foreground pixels. A small value of $\mathrm{project}(y, m)$ within the stripe indicates misplaced stripe boundaries.

B. Search

Algorithm 1 search(ɛ, ρ)
 1: S ← ∅
 2: n ← ɛ · Σ_{x=1..w} Σ_{y=1..h} P_{y,x}   {Required number of pixels}
 3: for i ← 0 to ρ do
 4:   Θ ← −45° + 90° · i / ρ   {Stripe angle in [−45°, +45°]}
 5:   m ← tan Θ
 6:   π ← (project(1, m), ..., project(h, m))
 7:   for y_1 ← 1 to h do
 8:     if π_{y_1} > 0 then
 9:       acc ← π_{y_1}
10:       y_2 ← y_1
11:       while acc < n or (y_1 + y_2)/2 > arg max_{y ∈ [y_1, y_2]} π_y do
12:         y_2 ← y_2 + 1
13:         acc ← acc + π_{y_2}
14:       end while
15:       S ← S ∪ {(y_1, y_2, m)}
16:     end if
17:   end for
18: end for
19: return criterion(S)   {Either height or min.}

Our algorithm for finding the best stripe in an image is listed in Alg. 1.
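To make Alg. 1 concrete, here is a minimal pure-Python sketch under several assumptions that are not taken from the paper: rows and columns are 0-based, the offset m·x is rounded to the nearest pixel row, the inner loop gives up when the stripe would leave the image, and ties in the arg max are resolved in favour of the topmost row. The names `projection` and `search` are illustrative and do not refer to the authors' Java implementation.

```python
import math


def projection(P, m):
    """pi[y] = number of foreground pixels on the line with slope m that
    intersects the y-axis at row y (Eq. 2, with 0-based indices)."""
    h, w = len(P), len(P[0])
    pi = [0] * h
    for x in range(w):
        shift = math.floor(m * x + 0.5)  # assumed rounding of m*x to a pixel row
        for y in range(h):
            row = y + shift
            if 0 <= row < h:
                pi[y] += P[row][x]
    return pi


def search(P, eps, rho, criterion="height"):
    """Return the best stripe (y1, y2, m), or None if no stripe is feasible.
    rho must be at least 1."""
    h = len(P)
    n = eps * sum(map(sum, P))  # required number of foreground pixels
    candidates = []  # entries: (y1, y2, m, minimum of pi inside the stripe)
    for i in range(rho + 1):
        theta = -45.0 + 90.0 * i / rho  # equidistant angles in [-45, +45] degrees
        m = math.tan(math.radians(theta))
        pi = projection(P, m)
        for y1 in range(h):
            if pi[y1] == 0:
                continue
            acc, y2, peak, low = pi[y1], y1, y1, pi[y1]
            # Grow the stripe until it holds n pixels and the projection
            # maximum (row `peak`) lies in its lower half.
            while acc < n or (y1 + y2) / 2 > peak:
                y2 += 1
                if y2 >= h:  # assumed guard: stripe would leave the image
                    break
                acc += pi[y2]
                low = min(low, pi[y2])
                if pi[y2] > pi[peak]:
                    peak = y2
            else:  # loop ended without break: the stripe is feasible
                candidates.append((y1, y2, m, low))
    if not candidates:
        return None
    if criterion == "height":
        best = min(candidates, key=lambda s: s[1] - s[0])
    else:  # "min" criterion
        best = max(candidates, key=lambda s: s[3])
    return best[:3]
```

The lower baseline would then be read off as `(y2, m)` from the returned stripe. Because m·x is rounded per column, the pixel count of a stripe equals the sum of the projection values between `y1` and `y2`, which is what the accumulator in the inner loop relies on.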
The search is challenging since $S$ is an uncountably infinite set: decreasing $y_1$ or increasing $y_2$ of a feasible stripe again results in a feasible stripe, and the slope $m \in \mathbb{R}$ is real-valued. We address the latter challenge by equidistant quantization of the slope angle Θ in the interval [−45°, +45°]: First, we choose a resolution ρ. The outer loop in Alg. 1 (lines 3-18) tries out ρ different values for Θ. Then, the projection of the image in Θ direction is calculated and stored in the h-dimensional vector π. The vector π is similar to a single column in a Hough-transformed image. However, the Hough transformation represents lines with polar coordinates (r, Θ) (Fig. 3(a)). In contrast, $\pi_r$ represents the line in Θ-direction that intersects the y-axis at y = r (Fig. 3(b)). The advantage of this representation is that we avoid quantization errors for r since only integer values need to be considered. The disadvantage is that we are not able to represent all possible lines (e.g. lines parallel to the y-axis cannot be represented). However, this is not an issue since the baseline cannot be vertical.

Then, for each slope m = tan Θ and each possible stripe start $y_1$, we find the narrowest feasible stripe and add it to $S$ (the inner loop in Alg. 1). This gives us the best stripe according to both criteria (height and min) for the given $y_1$ and $m$. In line 19, we return the best stripe in $S$ according to either the height or the min criterion. The set $S$ has at most $h \cdot \rho$ elements and thus can be traversed efficiently.

Fig. 2. Baseline detection based on dense foreground stripes (ɛ = 0.2, min criterion).

Fig. 3. Line representation in our algorithm compared to the polar coordinate representation used by the Hough transformation. (a) Hough transform. (b) Axis-aligned projection.

IV. EXPERIMENTS

A. Baseline Error Measure

The authors of [6] suggest the Baseline Error measure to assess baseline detection methods. It is defined as the area between the estimated and the ground truth baseline divided by the width of the image w (Fig. 4). Based on human judgement, they classify a baseline error of up to 5 pixels as excellent, between 5 and 7 pixels as acceptable, and higher than 7 pixels as insufficient. Our goal is to reduce the number of insufficient baselines.

B. The IFN/ENIT Database

The IFN/ENIT database [2] is a widely used corpus for Arabic handwriting recognition research. It consists of 26,459 images of 937 handwritten Tunisian town names (sets a to d).
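As an illustration of the Baseline Error measure from Sec. IV-A, the sketch below computes the area between two straight baselines over the image width and applies the three quality classes. It assumes that both baselines are given as hypothetical `(intercept, slope)` pairs; how the ground truth is stored in the database is not described here, and the function names are illustrative.

```python
def baseline_error(est, truth, w):
    """Area between two straight baselines over x in [0, w], divided by w.
    Each baseline is an (intercept, slope) pair (assumed representation)."""
    a = est[0] - truth[0]   # signed vertical offset at the left border
    c = est[1] - truth[1]   # difference of the slopes
    d0, d1 = a, a + c * w   # signed offsets at the left and right border
    if d0 * d1 >= 0:
        # The lines do not cross inside the image: the area is a trapezoid.
        area = abs(d0 + d1) * w / 2
    else:
        # The lines cross at x0: the area consists of two triangles.
        x0 = -a / c
        area = (abs(d0) * x0 + abs(d1) * (w - x0)) / 2
    return area / w


def quality(error):
    """Quality classes reported for the Baseline Error, in pixels."""
    if error <= 5:
        return "excellent"
    if error <= 7:
        return "acceptable"
    return "insufficient"
```

For two parallel baselines the measure reduces to their vertical distance, which gives a quick sanity check for an implementation.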
Many groups working on baseline estimation for handwritten Arabic use this corpus [15] because the database provides manually verified baseline information. However, baseline estimation on the IFN/ENIT database is challenging [6]: the writing styles of different writers differ widely. Moreover, the words are sometimes very short, or the baseline is discontinuous or curved. We address the latter challenges in Sec. IV-E. We show in the next sections that our approach is able to cope with different writing styles and short words.

C. Impact of the Parameters ɛ and ρ

The central parameter of our method is ɛ, which defines the required fraction of foreground pixels covered by the stripe. Fig. 5 shows the fraction of insufficient baselines depending on the ɛ parameter for both the height and the min criterion. The minima of both curves are not at the same position (ɛ = 25% for min and ɛ = 35% for height): there is no single value for ɛ which optimizes both criteria. The best result is achieved with the height criterion and ɛ = 35%, producing no more than 14.5% insufficient baselines.

Fig. 5. Baseline error over ɛ for both optimization criteria height (labelled width in the plot) and min. Axes: ɛ (horizontal), fraction of insufficient baselines (vertical).

Fig. 6. Impact of the resolution for the slope angle (ρ parameter). Axes: ρ (horizontal), fraction of insufficient baselines and runtime in hours (vertical).

The parameter ρ controls the resolution of the slope angle Θ and has a major impact on the runtime. Fig. 6 plots the fraction of insufficient baselines and the runtime over ρ using ɛ = 35% and the height criterion. Our implementation¹ is written in Java and based on OpenCV [16]. Our testing platform was a Kubuntu system with Linux kernel on an 8-core Intel® Core™ i7-3635QM processor at 2.40 GHz, with an HDD and sufficient RAM.
The runtime measurements include loading all 26,459 images from the HDD, binarization, diacritic removal and baseline estimation. The runtime of the baseline estimation is linear in ρ. The time complexity of the other steps is low and independent of ρ (see the green curve at ρ = 0). The error drops quickly with increasing ρ and stays at the same level for ρ ≥ 40. In the following experiments we choose ρ = 130 to eliminate accuracy degradations due to ρ quantization errors.

Fig. 4. Baseline error measure.

¹ Download from

D. Combination of Multiple Baselines

We showed in the previous section that the exact value of the resolution ρ is not decisive for the baseline quality as long as it is not too small. However, how can we know the best value for ɛ in advance? Our solution is to combine multiple baselines resulting from different values for ɛ. This approach has two advantages: First, we cover a whole range for ɛ and do not have to commit to a single value. Second, the combination works even better than the optimal individual ɛ value.

Fig. 7 demonstrates the improvements through combination. First, we choose a range for the ɛ parameter (visualized in Fig. 7 by the two horizontal axes). Second, we estimate a baseline for each multiple of 5% in that range, i.e. we sample ɛ with 5% resolution. The combination is done by selecting the baseline with the most horizontal slope (i.e. with the smallest absolute slope value $|m|$). This simple criterion is justified by the fact that the slope angle of most baselines in the IFN/ENIT database is close to 0°. Note that the time complexity is not significantly higher than with only a single ɛ value since all baselines can be estimated in a single pass with only minor changes to Alg. 1.

As shown in Fig. 7, the fraction of insufficient baselines can be reduced to 10.6% with the height criterion (ɛ-range from 20% to 50%) and to 10.4% with the min criterion (ɛ-range from 15% to 45%). Changing the ɛ-range has only minor impact: for example, all ranges with a lower bound between 15% and 30% and an upper bound between 40% and 60% result in no more than 12% insufficient baselines using the height criterion (Fig. 7(a)). This enables us to use the same ɛ-ranges for a variety of resolutions, writing styles and pen sizes (Sec. IV-E): the exact boundary values are not critical for the accuracy. The error surface for the min criterion is steeper than for the height criterion, indicating that min is slightly less tolerant of changes to the ɛ parameter.

Fig. 7. Fraction of insufficient baselines when multiple baselines are combined. The horizontal axes define the boundaries of the ɛ range. (a) Baseline estimation using the height criterion. (b) Baseline estimation using the min criterion.

The best result is achieved when we combine baselines from both optimization criteria. In Fig. 8 the ɛ-range for the height criterion is fixed to [25%, 45%]. As in Fig. 7(b), the horizontal axes define the upper and lower bound of the ɛ parameter for min. Setting the ɛ-range for min to [20%, 40%] reduces the fraction of insufficient baselines to 9.7%. Tab. I compares our work with other methods in the literature.
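The combination step described in this section can be sketched in a few lines. The helpers below are hypothetical: `epsilon_grid` assumes that both range bounds are multiples of the step, and `combine` assumes each baseline is an `(y2, m)` pair; pooling the baselines of both criteria into one list before selecting is also an assumption about how the two criteria are combined.

```python
def epsilon_grid(lo_pct, hi_pct, step_pct=5):
    """Sampled epsilon values (as fractions) for a range given in percent."""
    return [p / 100 for p in range(lo_pct, hi_pct + 1, step_pct)]


def combine(baselines):
    """Select the most horizontal baseline, i.e. the one with the smallest |m|.
    Each baseline is a (y2, m) pair; ties keep the first candidate."""
    return min(baselines, key=lambda b: abs(b[1]))


# Usage with made-up baseline estimates, one per sampled epsilon value.
grid = epsilon_grid(25, 45)
estimates = [(41, 0.12), (40, -0.03), (40, 0.05), (39, 0.08), (38, -0.2)]
best = combine(estimates)
```

With these made-up estimates, `best` is the second candidate, whose slope is closest to zero.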
To the best of our knowledge, the skeleton-based approach by Pechwitz et al. [6] with 12.5% insufficient baselines is the best in the literature evaluated on the IFN/ENIT database with the baseline error evaluation scheme presented in Sec. IV-A. We improve on their result by 22.4% relative, reaching a final result of 9.7% (a relative improvement of 55.7% compared to traditional Hough space maximization).

TABLE I. COMPARISON WITH RELATED WORK

Method                                                  Insufficient baseline rate
Related work
  Hough space maximization [5], [6]                     21.9%
  Skeleton-based method by Pechwitz et al. [6]          12.5%
This work
  Without combination (ɛ = 35%, height criterion)       14.5%
  With combination (ɛ ∈ [25%, 45%] for height,
    ɛ ∈ [20%, 40%] for min)                             9.7%

E. Curved or Discontinuous Baselines

If the baseline is straight, we can apply our approach to a whole line instead of single words without modification. If the baseline is curved or discontinuous, our method fails since it tries to fit a straight dense stripe to the image. The authors of [18] suggest a collection of skeleton-based geometric and topological features for curved baseline extraction. In this work, we vertically split the image into smaller segments and apply our baseline estimation method to each segment separately to deal with curved baselines. Our splitting procedure ensures that the width of each segment is greater than or equal to twice the height and that we do not cut any foreground strokes. Fig. 9 gives an example. The first splitting position $s_1$ is at the leftmost vertical line with a distance to the left image corner greater than $2h$ that does not cross any foreground strokes. Starting from $s_1 + 2h$, we look for the next vertical line that does not cut foreground and assign the next splitting position $s_2$ to it. We proceed until we reach the right corner of the image. The algorithm for curved baseline extraction is as follows:

1) Estimate the baseline from the whole image and rotate it such that the estimated baseline is horizontal. This is a first rough normalization step.
2) Split the image into smaller segments.
3) Estimate the baseline for each of the segments separately and rotate/translate them such that the baseline is horizontal at the center of the image.
4) Concatenate the segments to a single image.

Fig. 8. Combining both optimization criteria. The ɛ-range for height is fixed to [25%, 45%]. The ɛ-range for min is defined by the horizontal axes.

Fig. 9. Splitting an image of curved text lines into smaller segments.
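The splitting procedure of Sec. IV-E could be sketched as follows. This is an illustration under assumptions: a column counts as a valid cut if it contains no foreground pixel, columns are 0-based, and a final cut is dropped when it would leave a last segment narrower than twice the image height (the text does not say how the right end is handled). The name `split_positions` is hypothetical.

```python
def split_positions(P):
    """Return the cut columns for a binary image P (list of rows of 0/1).
    Every segment between consecutive cuts is at least 2*h columns wide and
    no cut crosses a foreground stroke."""
    h, w = len(P), len(P[0])
    # A column is a valid cut if it contains no foreground pixel.
    blank = [all(P[y][x] == 0 for y in range(h)) for x in range(w)]
    cuts = []
    x = 2 * h                      # earliest position of the first cut
    while x < w:
        if blank[x]:
            cuts.append(x)
            x += 2 * h             # next cut must be at least 2*h further right
        else:
            x += 1
    # Assumed handling of the right end: merge a too-narrow last segment
    # into its left neighbour by dropping the final cut.
    if cuts and w - cuts[-1] < 2 * h:
        cuts.pop()
    return cuts
```

Each resulting segment would then be passed to the stripe search separately and rotated or translated according to its own baseline, as in steps 3) and 4).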

The described procedure results in a normalized image with a rectified, straight and horizontal baseline at the center. Fig. 10 shows some examples taken from the KHATT database [17]. This corpus consists of line images from handwritten forms filled out by a variety of writers with different writing styles. The slope of the text often changes multiple times within a single text line. However, our approach is able to normalize the text reliably using the same ɛ-ranges which performed best on the IFN/ENIT database (ɛ ∈ [25%, 45%] for height, ɛ ∈ [20%, 40%] for min). Since KHATT does not provide ground truth baseline information, we have to restrict the evaluation to visual inspection and are not able to report Baseline Errors as for the IFN/ENIT database.

Fig. 10. Baseline normalization in images with curved baselines and various writing styles from the KHATT database [17].

V. CONCLUSION

In this work we studied baseline estimation in handwritten Arabic words. Our method finds stripes in the image that cover areas of dense foreground. Since a large share of the foreground pixels in Arabic script lies between the upper and the lower baseline, we place the lower baseline at the bottom border of the best stripe. We studied two optimization criteria: height minimizes the height of the stripe, and min maximizes the minimum in the projection onto the y-axis in the stripe direction. Our method requires two parameters. The ρ parameter controls the trade-off between accuracy and runtime. The ɛ parameter defines the required fraction of foreground pixels in the stripe. Using the height criterion with ɛ = 35% results in no more than 14.5% insufficient baselines on the IFN/ENIT database [2] consisting of handwritten Tunisian town names. The accuracy can be improved further by combining a set of baselines resulting from different values for ɛ.
Since multiple baselines can be estimated in a single pass, the time complexity is not significantly higher than with only a single value for ɛ. The best result is achieved when combining baselines estimated using ɛ ∈ [25%, 45%] for height and ɛ ∈ [20%, 40%] for min, resulting in 9.7% insufficient baselines, i.e. more than 9 out of 10 estimated baselines have acceptable quality. In case of curved or discontinuous baselines, we segment the image into smaller parts and apply our baseline estimation method to each segment separately. By rotating and translating each segment according to its estimated baseline, we can rectify the complete baseline to a straight horizontal line. In the future, we plan to use our new baseline estimation algorithm for feature extraction in a recognition system to improve the recognition accuracy.

REFERENCES

[1] L. Likforman-Sulem, R. A. H. Mohammad, C. Mokbel, F. Menasri, A. Bianne-Bernard, and C. Kermorvant, "Features for HMM-based Arabic handwritten word recognition systems," in Guide to OCR for Arabic Scripts.
[2] M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, H. Amiri et al., "IFN/ENIT-database of handwritten Arabic words," in CIFED.
[3] S. Espana-Boquera, M. J. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martinez, "Improving offline handwritten text recognition with hybrid HMM/ANN models," Pattern Analysis and Machine Intelligence, IEEE Transactions on.
[4] O. Morillot, L. Likforman-Sulem, and E. Grosicki, "New baseline correction algorithm for text-line recognition with bidirectional recurrent neural networks," Journal of Electronic Imaging.
[5] R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Communications of the ACM, vol. 15, no. 1.
[6] M. Pechwitz, H. El Abed, and V. Märgner, "Handwritten Arabic word recognition using the IFN/ENIT-database," in Guide to OCR for Arabic Scripts.
[7] M. Pechwitz and V. Märgner, "Baseline estimation for Arabic handwritten words," in ICFHR.
[8] F. Farooq, V. Govindaraju, and M. Perrone, "Pre-processing methods for handwritten Arabic documents," in ICDAR.
[9] M. Ziaratban and K. Faez, "A novel two-stage algorithm for baseline estimation and correction in Farsi and Arabic handwritten text line," in ICPR.
[10] H. Boubaker, M. Kherallah, and A. M. Alimi, "New algorithm of straight or curved baseline detection for short Arabic handwritten writing," in ICDAR.
[11] H. Boukerma and N. Farah, "A novel Arabic baseline estimation algorithm based on sub-words treatment," in ICFHR.
[12] M. Blumenstein, C. K. Cheng, and X. Y. Liu, "New preprocessing techniques for handwritten word recognition," in VIIP.
[13] P. Nagabhushan and A. Alaei, "Tracing and straightening the baseline in handwritten Persian/Arabic text-line: A new approach based on painting-technique," International Journal on Computer Science and Engineering, vol. 2, no. 4.
[14] T. Abu-Ain, S. N. H. S. Abdullah, B. Bataineh, K. Omar, and A. Abu-Ein, "A novel baseline detection method of handwritten Arabic-script documents based on sub-words," in Soft Computing Applications and Intelligent Systems.
[15] A. Al-Shatnawi and K. Omar, "A comparative study between methods of Arabic baseline detection," in ICEEI.
[16] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media.
[17] S. A. Mahmoud, I. Ahmad, M. Alshayeb, W. G. Al-Khatib, M. T. Parvez, G. A. Fink, V. Märgner, and H. E. Abed, "KHATT: Arabic offline handwritten text database," in ICFHR.
[18] H. Boubaker, M. Kherallah, and A. M. Alimi, "New algorithm of straight or curved baseline detection for short Arabic handwritten writing," in ICDAR, 2009.
