Applying Preattentive Visual Guidance in Document Image Analysis


Di Wen 1,2, Xiaoqing Ding 1,2
1 Department of Electronic Engineering, Tsinghua University
2 State Key Laboratory of Intelligent Technology and Systems
Beijing, P.R. China
{wendi, dxq}@ocrserv.ee.tsinghua.edu.cn

Abstract. In this paper, we present a novel methodology for document image analysis (DIA) that harnesses the mechanism of preattentive visual guidance in human vision. Summarizing the psychophysical research on preattentive vision, we suggest using two types of computation to simulate this biological process: visual similarity clustering and visual saliency detection. Based on a computational implementation of these two processes, we develop a biologically plausible way to guide the interpretation of document images, which is distinctly different from previous DIA methods. Experimental results demonstrate the effectiveness of these two processes, whose outputs can be further utilized by any task-oriented DIA application.

Keywords: Document image analysis, preattentive visual guidance, texture synthesis, dynamic clustering, visual saliency.

1 Introduction

Detecting and segmenting semantic contents from document images is a challenging task. In recent years, dozens of mature algorithms have been proposed in the document image analysis (DIA) domain, oriented at different application scenarios. Some of them are quite successful in automatically converting specific classes of paper-based documents in batches into their electronic counterparts [1, 2, 3, 4]. However, little attention has been paid to the adaptability of DIA methods when they encounter constantly changing conditions, such as from simple to complex layouts, from upright to geometrically distorted images, or from clean to cluttered backgrounds.
As a result, current DIA systems are quite specialized for specific classes of samples and quite demanding of image quality, which greatly reduces the usability of OCR techniques. In this paper, we propose a new attempt towards a generic DIA approach that adapts to various cases. Our original idea comes from the mechanism of preattentive visual guidance in human vision.

In the literature, most DIA methods process binary images by investigating simple geometric features. For example, the well-known RLSA method [1] discriminates text from non-text by the distance between foreground pixels and extracts text contents after a run-length smearing preprocessing step. Unfortunately, the RLSA method is sensitive to noise, page orientation and font size, making it clearly unsuitable for constructing adaptive systems. Another group of methods, the connected component aggregation methods [5, 6], utilize the alignment and spacing consistency of text components to perform a bottom-up hierarchical reconstruction of document layouts. These bottom-up methods are well suited to segmenting arbitrary layouts only if the text line aggregation results are reliable, which is hard to achieve using geometric features alone. The biggest problem of the bottom-up methods is their local merging scheme, which lacks any global directive. Furthermore, analysis of binary connected components is unreliable in noisy, skewed and irregular documents. Limited by simple geometric features, the bottom-up methods are inadequate for discriminating text and non-text contents robustly.

In order to make full use of image appearance in document analysis, some researchers proposed texture-based methods [7, 8, 9, 10], which model each homogeneous region in a document image as a visual texture pattern. Texture statistics are extracted as features, and dynamic or static classifiers are applied to classify the image pixels as text or non-text. Frequently used texture features include Gabor responses [7], morphological masked pixel values [8, 10] and wavelet coefficients [9]. Although the texture-based methods aim at modeling the visual appearance of different contents in document images, the texture features they use are derived from mathematical perspectives and hence are not biologically plausible. It is therefore hard to tell to what extent these texture features are capable of characterizing unknown samples. Model generality is very important for achieving adaptability. Based on the above observations, we consider a new adaptive DIA solution from a biologically plausible perspective.
Summarizing the latest progress in psychophysical research on preattentive vision, we propose a novel computational model that simulates the preattentive visual guidance mechanism in human vision and apply it to document image analysis. The model contains two kinds of computation: visual similarity clustering (VSC) and visual saliency detection (VSD). The former simulates the categorical characteristics of preattentive vision to summarize homogeneous regions. The latter simulates the visual center-surround characteristics to spot salient regions. These two kinds of information can be combined to guide the visual search for specific document contents in the following stage. Initial experimental results show that the outputs of VSC and VSD are very adaptive and highly complementary, which is helpful for robust and efficient interpretation of various document images.

The rest of the paper is arranged as follows. Section 2 introduces our computational model for preattentive visual guidance as background. The implementation of the VSC and VSD computations is described in detail in sections 3 and 4 respectively. Experimental results are presented in section 5, followed by the conclusion and discussion in section 6.

2 Computational Model for Preattentive Visual Guidance

Human vision is highly robust in capturing useful objects in a field of background distracter elements. Such adaptive power is attributed to the visual guidance mechanism in the primary visual cortex. That is, high-level cognitive functionality is not executed at every position in the scene. Instead, the visual guidance mechanism serves as a system bottleneck that directs the interpretation of the whole scene [11]. Most vision researchers now agree that much of the work in visual guidance is contributed by low-level neurons, is based on simple visual properties, and is completed in a very short time [12]. Psychophysical experiments have shown that some visual search tasks can be finished accurately within 200~250 milliseconds, which is too short for human eyes to move and pay serial attention. The visual properties identified in such experiments are called preattentive [12].

[Figure: block diagram. The input image passes through linear filtering to give preattentive visual features. Visual similarity clustering (VSC) produces homogeneous regions H1, H2, ..., HM, and visual saliency detection (VSD) produces salient regions S1, S2, ..., SN (preattentive processing). Selective attention then determines focuses of attention F1, F2, ..., FP for attentive cognition (attentive processing).]
Fig. 1. Our computational model for preattentive visual guidance in DIA

In this paper, we consider two types of hints that can be discerned in the preattentive stage of document image analysis: visual similarity and visual saliency. The former was conjectured by Julesz and further supported by psychophysical and computational experiments [13, 14, 15]. Theories and initial computational models for the latter have also been proposed in recent years by Treisman, Ullman, Koch, Itti and others [11, 16, 17, 18]. Motivated by these two streams of research, we propose a computational model characterizing the preattentive visual guidance mechanism in document image analysis. As shown in figure 1, the input image is first decomposed by a series of preattentive feature channels.
These separate feature maps then go through two independent processes: visual similarity clustering (VSC) and visual saliency detection (VSD). The VSC process categorizes homogeneous regions by measuring visual similarity. The VSD process, on the other hand, points out salient regions that differ from their neighbors. Both processes compute only preattentive visual properties. In a document image, the VSC results are useful for aggregating homogeneous text contents, while the VSD results provide hints for finding conspicuous titles, separating lines, edges, graphics and so on. In the following stage, named selective attention and oriented to a specific object extraction task, a series of focuses of attention (FOA) is determined by consulting the VSC and VSD results. The final attentive cognition of the various document contents is executed serially in these selected FOAs.

3 Visual Similarity Clustering

To discuss subjective visual similarity perception, we must first find a numerical way to measure it objectively. Before the studies on the human vision system, this problem had no generic solution. The early characteristic features for texture were proposed merely for mathematical convenience or from task-dependent heuristics [19]. Later, physiological research on the visual cortex revealed the spatial/frequency representation of images in the human vision system. This discovery inspired researchers to use spatial/frequency localized filters as generic texture feature extractors. In 1995, Heeger and Bergen accomplished the first texture synthesis experiment by matching the histograms of image pyramids between the synthesized and target images [14]. Their experiments reveal that image pairs with matching histograms also share a similar visual appearance. Later, Zhu et al. offered a stricter mathematical framework to ensure the convergence of histogram matching and further argued that marginal histogram statistics pooled from Gabor filters can serve as generic features to characterize various homogeneous textures [15, 20]. Based on Zhu's work, Liu developed a quantitative measurement of visual similarity between two image patterns, based on the χ²-distance of Gabor histogram features [19]. His work was applied to texture classification.

In our work, we are interested in using visual similarity measurement for document image segmentation, which is a variant of texture classification. The first problem is therefore how to characterize visual patterns in document images using representative texture features.
Second, the similarity-based segmentation should be self-adaptive to various document contents. Since the conceptual texture patterns cannot be statically defined across different document images, the best way to achieve adaptability is through dynamic clustering. Therefore, our major implementation problems in VSC are: first, to select a series of representative filters and histogram bins to extract the Gabor histogram features from document images; and second, to derive a mathematically plausible way to cluster the features so as to obtain adaptive segmentation results.

To solve the first problem, we must investigate at what scales the document contents exhibit homogeneous visual textures. Generally speaking, font sizes of the perceptible text contents in document images vary from 2 pts to 32 pts (we consider that characters bigger than 32 pts cannot be treated as homogeneous texture). At very low resolution, text contents show a line pattern and a character blob pattern. As the resolution increases, the character stroke pattern gradually emerges. These three patterns can be well captured by Gabor and Laplacian of Gaussian filters. Therefore, we first prepare a bank of filters containing Gabor filters (denoted G) and Laplacian of Gaussian filters (denoted LoG) covering consecutive scales and orientations in the frequency domain. The mathematical expressions of these two kinds of filters are as follows:

$$\mathrm{Gabor}(x, y \mid T, \theta) = \exp\Big\{-\frac{1}{2T^2}\big[4(x\cos\theta + y\sin\theta)^2 + (-x\sin\theta + y\cos\theta)^2\big]\Big\}\,\exp\Big\{-j\,\frac{2\pi}{T}(x\cos\theta + y\sin\theta)\Big\} \quad (1)$$

$$\mathrm{LoG}(x, y \mid T) = C\,(x^2 + y^2 - T^2)\exp\Big(-\frac{x^2 + y^2}{T^2}\Big) \quad (2)$$

To capture the three typical texture patterns of text contents from 2 pts to 32 pts, we choose parameters T = 2, 2, 4, 6, 8, 10 and θ = 0º, 45º, 90º, 135º, which results in a filter bank consisting of 24 G filters (using only the cosine components) and 6 LoG filters. To further select representative filters from these 30 filters, a visual similarity testing experiment is performed. That is, we choose a sufficient subset of the filter bank whose histograms fully characterize the visual appearance of a reference image I_obs, which is picked as a typical visual pattern in document images. The filter selection process is presented in [21], in which we followed the Minimax entropy principle and used Markov Chain Monte Carlo sampling [15] to match the histograms between the synthesized and original images. As the histograms from more and more filter channels are matched, the synthesized image becomes more and more similar to the reference image. The experiment finally selects the following 13 filters to form the representative filter set:

1. G(T, θ), T = 2, 4, and θ = 0º, 45º, 90º, 135º;
2. G(T, θ), T = 6, and θ = 0º, 90º;
3. LoG(T), T = 2, 4,

[Figure: histogram plots for Pattern1, Pattern2 and Pattern3.]
Fig. 2. GHF of different texture patterns, with the concatenated histograms of all 13 channels and the magnified histograms in two particular channels (G(4,90º) and LoG(4)).
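As a rough illustration of equations (1) and (2), the following sketch samples the cosine Gabor component and the LoG filter on an integer grid and collects the selected filters in a dictionary. It is only a sketch under assumptions: the kernel support of ±2T pixels, the zero-mean normalization standing in for the unspecified constant C, and the names `gabor_cos`, `log_kernel` and `representative_bank` are all choices made here, not details from the paper. Only the twelve filters whose parameters are legible in the list above are built; the third LoG scale is left out.

```python
import math

def gabor_cos(T, theta_deg, half=None):
    """Cosine (real) part of Eq. (1), sampled on a (2*half+1)^2 grid."""
    theta = math.radians(theta_deg)
    half = int(math.ceil(2 * T)) if half is None else half  # assumed support
    c, s = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            u = x * c + y * s    # coordinate along the wave direction
            v = -x * s + y * c   # coordinate across the wave direction
            envelope = math.exp(-(4 * u * u + v * v) / (2.0 * T * T))
            row.append(envelope * math.cos(2 * math.pi * u / T))
        kernel.append(row)
    return kernel

def log_kernel(T, half=None):
    """Eq. (2) without the constant C, shifted to zero mean (an assumption)."""
    half = int(math.ceil(2 * T)) if half is None else half
    grid = range(-half, half + 1)
    raw = [[(x * x + y * y - T * T) * math.exp(-(x * x + y * y) / float(T * T))
            for x in grid] for y in grid]
    mean = sum(map(sum, raw)) / float(len(raw) ** 2)
    return [[v - mean for v in row] for row in raw]

def representative_bank():
    """The selected filters whose parameters are legible in the text."""
    bank = {}
    for T in (2, 4):
        for theta in (0, 45, 90, 135):
            bank[("G", T, theta)] = gabor_cos(T, theta)
    for theta in (0, 90):
        bank[("G", 6, theta)] = gabor_cos(6, theta)
    for T in (2, 4):
        bank[("LoG", T)] = log_kernel(T)
    return bank
```

Each kernel would then be convolved with the document image to give one response channel; the convolution itself is omitted to keep the sketch short.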

To make the histograms extracted from different images comparable, we normalize the responses of each filtered image to the fixed range [-1, 1] before pooling histograms. With 11 histogram bins extracted for each filter channel, we construct the 143-dimensional Gabor histogram feature (GHF) vector for each input image. Figure 2 compares the GHF vectors of different visual patterns found in document images. In our image segmentation experiment, the GHF vector H_υ at site υ is calculated within a window surrounding υ. The window size we choose is 32.

For the second problem, we define the distance metric between two GHF vectors as their Euclidean distance, simply for the convenience of performing K-means clustering:

$$D(H_{\upsilon_1}, H_{\upsilon_2}) = \Big(\sum_i \big(h^{(i)}_{\upsilon_1} - h^{(i)}_{\upsilon_2}\big)^2\Big)^{1/2} \quad (3)$$

The other clustering parameter, the initial class number K, is set to 4 in our experiment. This is determined empirically by observing that there are usually 4 types of contents in a document image: text, photograph, line drawing and white space. Experiments have also shown that setting K>4 causes the homogeneous class corresponding to text contents to split into smaller classes. Therefore, by setting K=4 initially we can preserve the homogeneity of text regions. For simple plain documents, which probably contain fewer than 4 distinct classes of visual patterns, we allow the clustering procedure to drop the empty classes.

After setting the distance measurement and initial state, the K-means clustering is ready to run. The clustered results are dynamic, depending on the specific contents of different document samples. Therefore, we also need to identify which class in the clustered results corresponds to the text contents. To resolve the uncertainty caused by unsupervised clustering, it is necessary to introduce supervised constraints.
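The feature pooling and clustering steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the filter responses inside a window are already available as flat lists, that normalization to [-1, 1] divides by the peak magnitude of the whole filtered image, and that the initial centers are taken at evenly spaced positions in the input (the text does not say how K-means is initialized). The function names are hypothetical.

```python
import math

def window_histogram(values, peak, bins=11):
    """Histogram of one channel's responses in a window, scaled to [-1, 1]."""
    peak = peak or 1.0                       # guard against an all-zero channel
    hist = [0.0] * bins
    for v in values:
        t = (v / peak + 1.0) / 2.0           # map [-1, 1] onto [0, 1]
        hist[min(int(t * bins), bins - 1)] += 1.0
    return [h / len(values) for h in hist]

def ghf(channel_windows, channel_peaks, bins=11):
    """Concatenate per-channel histograms (13 channels x 11 bins = 143 dims)."""
    vec = []
    for values, peak in zip(channel_windows, channel_peaks):
        vec.extend(window_histogram(values, peak, bins))
    return vec

def euclidean(a, b):
    """Distance between two GHF vectors, as in the Euclidean metric above."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(v, centers):
    return min(range(len(centers)), key=lambda c: euclidean(v, centers[c]))

def kmeans(vectors, k=4, iters=50):
    """K-means that drops empty classes, so fewer than k classes may remain."""
    # Assumed initialization: evenly spaced samples from the input list.
    centers = [list(vectors[(i * len(vectors)) // k]) for i in range(k)]
    for _ in range(iters):
        labels = [nearest(v, centers) for v in vectors]
        new_centers = []
        for c in range(len(centers)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:                      # an empty class is simply dropped
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
        if new_centers == centers:
            break
        centers = new_centers
    return [nearest(v, centers) for v in vectors], centers
```

With two well separated groups of vectors and K=4, the duplicate centers end up empty and are removed, which mirrors the reduction of K on simple documents described in the text.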
In our experiments, we observed that texture features in the main body text regions usually show higher energy in the Gabor filter responses. Therefore, we calculate the following texture energy for each GHF vector:

$$E_\upsilon = \sum_{i=1}^{13} E^{(i)}_\upsilon \quad (4)$$

Here, E_υ^(i) refers to the variance of the filter responses in the i-th channel, which can be computed from the histogram. The texture energy E_k for a whole class of GHF vectors can then be estimated as the most frequent texture energy among its members. In this way we can sort the K classes in descending order of texture energy (i.e., class 1 has the maximal texture energy). We select class 1 and class 2 as the candidate classes for text contents. In our experiments, for most tested document images, text contents occupy one of these two clustered classes.

4 Visual Saliency Detection

In contrast to the top-down categorical VSC process, the VSD process undertakes a bottom-up investigation of how different a site is from its neighbors.

Here we use the computational architecture proposed by Itti [18] to obtain a saliency map of the input image. In [18], Itti computed saliency in three independent feature channels: the color channel, the intensity channel and the orientation channel. In our work, since the similarity of Gabor orientation features has already been investigated in the VSC process, we carry out visual saliency detection in only two feature channels: color and intensity (for gray-scale images, only the intensity channel is calculated). The input image is first decomposed into one intensity and four color channels. A multi-scale representation of each channel is then constructed using a Gaussian pyramid. The center-surround difference is calculated between coarse scales (surround values) and fine scales (center values), resulting in 12 feature maps in the color channel and 6 feature maps in the intensity channel. Finally, the color saliency map, the intensity saliency map and the overall saliency map are calculated by normalizing and combining these feature maps. Computational details can be found in [18].

5 Experimental Results

We perform several groups of experiments to test the adaptability of VSC and VSD in providing homogeneous regions and salient regions across various document images. These two kinds of information can be further utilized by any task-oriented module to detect specific contents in document images. For example, a module interested in extracting text lines can access the homogeneous regions in the VSC results, while one looking for titles, separating lines, edges, graphics and so on can access the salient regions in the VSD results. In the VSC process, the more semantically homogeneous contents are categorized into the same class, the more layout segmentation can benefit.
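The six intensity feature maps mentioned in section 4 can be illustrated with a short sketch. It follows the scale pairing used in [18], with center scales c in {2, 3, 4} and surround scales s = c + δ, δ in {3, 4}, on a nine-level pyramid. Everything else is an assumption made to keep the example self-contained: a 2x2 box average stands in for the Gaussian pyramid, the surround is read by nearest-neighbour lookup rather than interpolation, the normalization and combination steps are omitted, and the input is assumed to be a list-of-lists gray image of at least 256 x 256 pixels with power-of-two sides.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (stand-in for Gaussian)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1] +
              img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def pyramid(img, levels=9):
    """Scales 0..8, where scale 0 is the input image."""
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample(out[-1]))
    return out

def center_surround(pyr, c, s):
    """|center - surround| at scale c; scale s is looked up per coarse cell."""
    factor = 2 ** (s - c)
    center, surround = pyr[c], pyr[s]
    return [[abs(center[y][x] - surround[y // factor][x // factor])
             for x in range(len(center[0]))] for y in range(len(center))]

def intensity_feature_maps(img):
    """Six maps: c in {2, 3, 4} paired with s = c + 3 and s = c + 4."""
    pyr = pyramid(img)
    return [center_surround(pyr, c, c + d) for c in (2, 3, 4) for d in (3, 4)]
```

A uniform page gives all-zero maps, while a small bright block on a dark page produces a strong response where the fine-scale value departs from its coarse surround, which is the behaviour that makes titles and rules stand out.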
On the other hand, the more the salient regions are separated from the homogeneous regions, the easier it is to find the salient objects. Therefore, we pay attention to two criteria in evaluating the performance of our preattentive visual guidance computation: the region homogeneity in VSC and the degree to which the VSC and VSD results complement each other.

The first experiment is on complex newspaper images. Figure 3 shows the VSC clustering results for 3 newspaper images scanned at 150 dpi. A 4-class segmentation is obtained for each sample, from which we can see that the main body text in each sample stably occupies a major visual class. In the English sample, it belongs to class 2 (dark gray), while in the Chinese samples it belongs to class 1 (black). This is easily explained: when there is a strong periodic texture pattern in the image (e.g., halftone or background texture), text contents will not occupy the first class; otherwise, they will appear in the first class. In figure 4, the pixels clustered as text contents are shown separately so that the homogeneity of the clustered results can be examined. As we can see, the majority of the text contents are successfully segmented. It should be mentioned that we have not added any spatial continuity constraints to the clustering, yet the homogeneity is still satisfactory, which indicates that our GHF features do reflect visual similarity. It is interesting to see that the halftone patterns in the photographs also make them stand out as a unique visual class.

Fig. 3. Segmentation results in the first experiment, using K=4. The 4 pixel values (black, dark gray, light gray and white) represent the 4 clustered classes, sorted by their texture energy from high to low. (a) Segmentation result of a complex Chinese newspaper, with the body text occupying class 1. (b) Segmentation result of an English newspaper, with the body text occupying class 2. (c) Segmentation result of another Chinese newspaper, less complex, with the body text occupying class 1.

Fig. 4. Text contents extracted from the segmentation results in figure 3.

The second experiment is on simple plain document images. The experiment is repeated on both upright and skewed samples to test the adaptability of VSC to skew. Figure 5 shows the clustered results. Notice that they have been reduced to only 2 classes.

Fig. 5. Segmentation results for simple plain documents. The algorithm automatically reduced the class number K to 2 in order to adapt to the simple contents.

The third experiment compares the VSC and VSD results on the same image. Figure 6a shows the saliency map calculated for a newspaper image. The gray-scale values in it indicate the saliency values detected at these pixels. Figure 6b shows a thresholded version of 6a. We can see that the main titles, separating lines, edges and boundaries pop out in the results. Compared with the homogeneous regions shown in figure 6c (i.e., the same results as in figure 4c), the VSD results are highly complementary to the VSC results and reflect the discontinuous changes and highlights in the image.

Fig. 6. Comparison of the VSD and VSC results. (a) The saliency map calculated by VSD; (b) the thresholded version of (a); (c) the homogeneous text contents segmented from the VSC results.


6 Conclusion and Future Work

We have demonstrated a computational method to implement preattentive visual guidance in document image analysis. Our ultimate goal is to achieve real adaptability for target segmentation in any type of document sample. The VSC computation is proposed to categorize similar contents in the image, and the VSD computation is introduced to simulate the detection of salient regions. Both processes are based on current findings from psychophysics experiments. Initial experiments show that the VSC process is able to cluster the image contents into visually homogeneous regions, especially for the main body text. The clustered results are quite stable across distinctly different document samples. The VSD results reveal the salient regions in document images, corresponding to major titles, separating lines, edges and so on. Both results can be further utilized in a specific DIA task to extract and interpret different types of semantic contents. Being data-driven, both processes have the inherent potential to achieve adaptability, and the experimental results support this.

Our future work will focus on two problems. The first is to develop a more efficient visual similarity clustering algorithm. Since the current normalization method in VSC tends to diminish the difference between histograms, a better normalization method and distance metric are needed to improve the numerical characterization of visual similarity. The second is to add more user-driven heuristics after the preattentive stage to extract task-oriented contents from the VSC and VSD results, so that preattentive visual guidance can really benefit the adaptability of document image analysis.
This work is supported by the National Natural Science Foundation of China (project ).

References

[1] Wong, K.Y., R.G. Casey, and F.M. Wahl: Document Analysis System. IBM Journal of Research and Development (1982). 26(6)
[2] Ittner, D.J. and H.S. Baird: Language-free layout analysis. In Proceedings of the 2nd International Conference on Document Analysis and Recognition (ICDAR '93), Oct. 1993, Tsukuba Science City, Japan, 1993
[3] Tang, Y.Y., S.-W. Lee, and C.Y. Suen: Automatic document processing: a survey. Pattern Recognition (1996). 29(12)
[4] Nagy, G., S. Seth, and M. Viswanathan: A prototype document image analysis system for technical journals. IEEE Computer (1992). 25(7)
[5] Drivas, D. and A. Amin: Page Segmentation and Classification Utilizing Bottom-Up Approach. In Proceedings of the Third International Conference on Document Analysis and Recognition, Aug. 1995
[6] Liang, J., I.T. Phillips, and R.M. Haralick: An optimization methodology for document structure extraction on Latin character documents. IEEE Trans. on Pattern Analysis and Machine Intelligence (2001). 23(7)
[7] Jain, A.K. and S. Bhattacharjee: Text Segmentation Using Gabor Filters for Automatic Document Processing. Machine Vision and Applications (1992). 5(3)
[8] Jain, A.K. and Y. Zhong: Page Segmentation Using Texture Analysis. Pattern Recognition (1996). 29(5)
[9] Li, J. and R.M. Gray: Context-Based Multiscale Classification of Document Images Using Wavelet Coefficient Distributions. IEEE Trans. on Image Processing (2000). 9(9)
[10] Chen, J.-L.: A simplified approach to the HMM based texture analysis and its application to document segmentation. Pattern Recognition Letters (1997). 18(10)
[11] Itti, L. and C. Koch: Computational Modeling of Visual Attention. Nature Reviews Neuroscience (2001). 2(3)
[12] Healey, C.G., K.S. Booth, and J.T. Enns: High-speed visual estimation using preattentive processing. ACM Transactions on Computer-Human Interaction (TOCHI) (1996). 3(2)
[13] Julesz, B.: Visual pattern discrimination. IRE Transactions on Information Theory (1962). IT-8
[14] Heeger, D.J. and J.R. Bergen: Pyramid-based texture analysis/synthesis. In Computer Graphics Proceedings, SIGGRAPH 95, Los Angeles, CA, USA, 6-11 Aug. 1995
[15] Zhu, S.C., Y.N. Wu, and D. Mumford: Minimax Entropy Principle and Its Application to Texture Modeling. Neural Computation (1997). 9(8)
[16] Treisman, A. and G. Gelade: A feature integration theory of attention. Cognitive Psychology (1980). 12(2)
[17] Koch, C. and S. Ullman: Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology (1985). 4(4)
[18] Itti, L., C. Koch, and E. Niebur: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence (1998). 20(11)
[19] Liu, X. and D. Wang: Texture Classification Using Spectral Histogram. IEEE Trans. on Image Processing (2003). 12(6)
[20] Zhu, S.C., Y.N. Wu, and D.B. Mumford: FRAME: Filters, Random fields And Maximum Entropy -- towards a unified theory for texture modeling. International Journal of Computer Vision (1998). 27(3)
[21] Wen, D. and X. Ding: Visual similarity based document layout analysis. To appear in Journal of Computer Science and Technology (2006). 21(3)
