Learning to Segment Document Images

Similar documents
On Segmentation of Documents in Complex Scripts

Optical Character Recognition (OCR) for Printed Devnagari Script Using Artificial Neural Network

A Statistical approach to line segmentation in handwritten documents

Character Segmentation for Telugu Image Document using Multiple Histogram Projections

Spatial Topology of Equitemporal Points on Signatures for Retrieval

Localization, Extraction and Recognition of Text in Telugu Document Images

A New Algorithm for Detecting Text Line in Handwritten Documents

Accurate Image Registration from Local Phase Information

An Efficient Character Segmentation Based on VNP Algorithm

CS 223B Computer Vision Problem Set 3

An ICA based Approach for Complex Color Scene Text Binarization

CS 231A Computer Vision (Fall 2012) Problem Set 3

N.Priya. Keywords Compass mask, Threshold, Morphological Operators, Statistical Measures, Text extraction

Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains

Image Segmentation for Image Object Extraction

OCR For Handwritten Marathi Script

A Graph Theoretic Approach to Image Database Retrieval

Comparative Study of ROI Extraction of Palmprint

Handwritten Devanagari Character Recognition Model Using Neural Network

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Part I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a

UW Document Image Databases. Document Analysis Module. Ground-Truthed Information DAFS. Generated Information DAFS. Performance Evaluation

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008

Collaborative Rough Clustering

CS 490: Computer Vision Image Segmentation: Thresholding. Fall 2015 Dr. Michael J. Reale

Uncertainties: Representation and Propagation & Line Extraction from Range data

Adaptive Building of Decision Trees by Reinforcement Learning

Chapter 3 Image Registration. Chapter 3 Image Registration

One-Shot Learning with a Hierarchical Nonparametric Bayesian Model

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE

Unsupervised Learning

Segmentation and Tracking of Partial Planar Templates

Robust line segmentation for handwritten documents

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

Cursive Handwriting Recognition System Using Feature Extraction and Artificial Neural Network

CRFs for Image Classification

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications

Segmentation and labeling of documents using Conditional Random Fields

A Graphics Image Processing System

Object-Based Classification & ecognition. Zutao Ouyang 11/17/2015

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Segmentation of Handwritten Textlines in Presence of Touching Components

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision

Recognition of Gurmukhi Text from Sign Board Images Captured from Mobile Camera

[Programming Assignment] (1)

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Image Analysis - Lecture 5

Convexization in Markov Chain Monte Carlo

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.

Fabric Defect Detection Based on Computer Vision

A Hierarchical Pre-processing Model for Offline Handwritten Document Images

MRF Based LSB Steganalysis: A New Measure of Steganography Capacity

Image Segmentation. Srikumar Ramalingam School of Computing University of Utah. Slides borrowed from Ross Whitaker

Character Recognition

Multi-scale Techniques for Document Page Segmentation

HOUGH TRANSFORM CS 6350 C V

Performance Comparison of Six Algorithms for Page Segmentation

Layout Segmentation of Scanned Newspaper Documents

HMM-Based Handwritten Amharic Word Recognition with Feature Concatenation

Automated Segmentation Using a Fast Implementation of the Chan-Vese Models

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

10-701/15-781, Fall 2006, Final

Performance Characterization in Computer Vision

Textural Features for Image Database Retrieval

Robust Phase-Based Features Extracted From Image By A Binarization Technique

Norbert Schuff VA Medical Center and UCSF

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

The K-modes and Laplacian K-modes algorithms for clustering

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Unsupervised Learning

Stochastic Language Models for Style-Directed Layout Analysis of Document Images

EUSIPCO

Prototype Selection for Handwritten Connected Digits Classification

Graph Matching Iris Image Blocks with Local Binary Pattern

A Two-stage Scheme for Dynamic Hand Gesture Recognition

Supervised Learning for Image Segmentation

Clustering CS 550: Machine Learning

Optimization Model of K-Means Clustering Using Artificial Neural Networks to Handle Class Imbalance Problem

Unsupervised Learning: Clustering

Digital Image Processing Laboratory: MAP Image Restoration

Clustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic

Pixels. Orientation π. θ π/2 φ. x (i) A (i, j) height. (x, y) y(j)

A Model-based Line Detection Algorithm in Documents

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

Fine Classification of Unconstrained Handwritten Persian/Arabic Numerals by Removing Confusion amongst Similar Classes

Projected Texture for Hand Geometry based Authentication

OTCYMIST: Otsu-Canny Minimal Spanning Tree for Born-Digital Images

CS4733 Class Notes, Computer Vision

Robust Ring Detection In Phase Correlation Surfaces

Application of Characteristic Function Method in Target Detection

A Point Matching Algorithm for Automatic Generation of Groundtruth for Document Images

Hand Written Character Recognition using VNP based Segmentation and Artificial Neural Network

Deep Generative Models Variational Autoencoders

Image Analysis Image Segmentation (Basic Methods)

A Novel q-parameter Automation in Tsallis Entropy for Image Segmentation

Human Motion Detection and Tracking for Video Surveillance

Image Processing. BITS Pilani. Dr Jagadish Nayak. Dubai Campus

Transcription:

Learning to Segment Document Images K.S. Sesh Kumar, Anoop Namboodiri, and C.V. Jawahar Centre for Visual Information Technology, International Institute of Information Technology, Hyderabad, India Abstract. A hierarchical framework for document segmentation is proposed as an optimization problem. The model incorporates the dependencies between various levels of the hierarchy unlike traditional document segmentation algorithms. This framework is applied to learn the parameters of the document segmentation algorithm using optimization methods like gradient descent and Q-learning. The novelty of our approach lies in learning the segmentation parameters in the absence of groundtruth. 1 Introduction Document image layout has a hierarchical (tree like) representation, with each level encapsulating some unique information that is not present in other levels. The representation contains the complete page in the root node, and the text blocks, images and background form the next layer of the hierarchy. The text blocks can have a further detailed representations like text lines, words and components, which form the remaining layers of the representation hierarchy. A number of segmentation algorithms [1] have been proposed to fill the hierarchy in a top-down or bottom-up fashion. However, traditional approaches assume independence between different levels of the hierarchy, which is incorrect in many cases. The problem of document segmentation is that of dividing a document image (I) into a hierarchy of meaningful regions like paragraphs, text lines, words, image regions, etc. These regions are associated with a homogeneity property φ( ) and the segmentation algorithms are parameterized by θ. The nature of the parameters θ depend on the algorithm that is used to segment the image. Most of the current page segmentation algorithms decide the parameters θ,apriori, and use it irrespective of the page characteristics as shown in [2]. There are also algorithms that estimate the document properties (like average lengths of connected components) and assign values to θ using a predefined method [3]. Though these methods are reasonably good for homogeneous document image collections, they tend to fail in the case of wide variations in the image and its layout characteristics. The problem becomes even more challenging in the case of Indian language documents due to the presence of large variations in fonts, styles, etc. This warrants a method that can learn from examples and solve the document image segmentation for a wide variety of layouts. Such algorithms can be very effective for large and diverse document image collections such as digital libraries. In this paper, we propose a page segmentation algorithm based on a learning framework, where the parameter vector θ is learnt from a set of example documents, using homogeneity properties of regions, φ( ). The feedback from different stages of the segmentation process is used to learn the parameters that maximize φ( ). S.K. Pal et al. (Eds.): PReMI 2005, LNCS 3776, pp. 471 476, 2005. c Springer-Verlag Berlin Heidelberg 2005

472 K.S. Sesh Kumar, A. Namboodiri, and C.V. Jawahar Hierarchical framework: The hierarchical segmentation framework is characterized at each level i, byaparametersetθ i, 1 i n. The input to the segmentation algorithm is the document image, (I), which is generated by a random process, parameterized by θ. Given an input document I, the goal is to find the set values for parameters θ =(θ 1,θ 2,...,θ n ), that maximizes the joint probability of the parameters for a given (I), using (1). ˆθ = arg max P (θ 1,θ 2,...,θ n I) (1) θ 1,θ 2,...,θ n = arg max P (θ 1 I).P (θ 2 I)...P(θ n I) (2) θ 1,θ 2,...,θ n = arg max P (θ 1 I).P (θ 2 I,θ 1 )...P(θ n I,θ 1,...,θ n 1 ) (3) θ 1,θ 2,...,θ n On assuming independence between different levels of the hierarchy, the parameters are conditionally independent of each other as they characterize the segmentation process at each level. Hence (1) can be rewritten as (2). However, this is not a valid assumption as errors at a level of the hierarchy is propagated downwards deteriorating the overall performance of the segmentation system. Hence the segmentation algorithm and its parameters at the lower levels depend on the parameters of the upper levels and the input image I. To incorporate this dependency into the formulation, we need to rewrite (1) as (3). To achieve optimal segmentation the joint probability of the parameters P (θ 1, θ n ) needs to be modeled. However, the distribution is very complex in case of page segmentation algorithms. In addition a large set of algorithms that can be used at different levels of hierarchy. Hence a fitness measure of segmentation is defined that approximates the posterior probability at each stage. Our goal is to compute the parameter θ i that maximizes the posterior P (θ i /I,θ 1,θ 2,...,θ i 1 ) or the fitness function at level i. Conventionally, segmentation has been viewed as a deterministic partitioning scheme characterized by the parameter θ. The challenge has been in finding an optimal set of values for θ, to segment the input Image I into appropriate regions. In our formulation, the parameter vector θ is learned based on the feedback calculated in the form a homogeneity measure φ( ) (a model based metric to represent the distance from ideal segmentation). The feedback is propagated upwards in the hierarchy to improve the performance of each level above, estimating the new values of the parameters to improve the overall performance of the system. Hence the values of θ are learned based on the feedback from present and lower levels of the system. In order to improve the performance of an algorithm over time using multiple examples or by processing an image multiple times, the algorithm is given appropriate feedback in the form of homogeneity measure of the segmented regions. Feedback mechanisms for learning the parameters could be employed at various levels. 2 Hypothesis Space for Page Segmentation A document image layout is conventionally understood as a spatial distribution of text and graphics components. However, for a collection of document images, layout is characterized by a probability distribution of many independent random variables. For example, consider a conference proceedings formatted according to a specific layout style

Learning to Segment Document Images 473 sheet. The set of values for word-spaces, line spaces etc. will follow a specific statistical distribution. For a new test page, segmentation implies the maximization of the likelihood of the observations (word-spaces etc.) by assuming the distribution. In our case, the objective is to maximize the likelihood of the set of observations by learning the parameters of the distribution, indirectly in θ-space. We assume the existence and possibly the nature of a distribution of the region properties, but not the knowledge of the ground truth. For the problem of page segmentation, assumption of an available distribution (ground-truth) may not be very appropriate. Hence the problem is formulated as learning of the parameter vector θ, which minimizes an objective function J(I,φ,θ). The function J( ) could be thought of as an objective quality measure of the segmentation result. The parameter θ includes all the parameters in the hierarchy of document segmentation, θ 1,,θ n. In this case, the objective function, J( ), can be expanded as a linear combination of n individual functions. Objective functions can be defined for different segmentation algorithms based on the homogeneous properties of the corresponding regions it segments. Segmentation Quality Metric: A generic objective function should be able to encapsulate all the properties associated with the homogeneity function φ. In the experiments, a function is considered that includes additional properties of the layout, such as density of text pixels in the segmented foreground and background regions and a measure that accounts for partial projections in multi-column documents. In this work, the inter line variance (σ 1 ), the variance of the line height (σ 2 ), the variance of the distance between words (σ 3 ), the density of foreground pixels within a line (ILD) and between two lines (BLD), the density of foreground pixels within a word(iwd) and between words (BWD) are considered. We use the following objective functions J(I,φ l,θ l ) and J(I,φ w,θ w ), that needs to be maximized for best line and word segmentation respectively: J(I,φ l,θ l )= 1 + 1 BLD + ILD (4) 1+σ 1 1+σ 2 J(I,φ w,θ w )= 1 BWD + IWD (5) 1+σ 3 The value of each of the factors in the combination in the above equations falls in the range [0, 1], and hence J(I,φ l,θ l ) [ 1, 3] and J(I,φ w,θ w ) [ 1, 1]. 3 Learning Segmentation Parameters The process of learning tries to find the set of parameters, θ, that maximizes the objective function J(I,φ,θ): argmin J(I,φ,θ). Let θ i be the current estimate of the parameter θ vector for a given document. We compute a revised estimate, θ i+1, using an update function, D θ (.), which uses the quality metric J( ) in the neighborhood of θ i to compute δθ i. θ i+1 = θ i + δθ i ; δθ i = D θ (J(I,φ, θ i )) (6) In order to define the parameter vector θ, we need to look at the details of the segmentation algorithm that is employed. The algorithm used for segmentation operates in

474 K.S. Sesh Kumar, A. Namboodiri, and C.V. Jawahar two stages. In the first stage, image regions are identified based on the size of connected components (size θ c ) and are removed. The remaining components are labeled as text. The second stage identifies the lines of text in the document based on recursive XY cuts as suggested in [1]. The projection profiles in the horizontal and vertical directions of each text block are considered alternately. A text block is divided at points in the projection, where the number of foreground pixels fall below a particular threshold, θ n. In order to avoid noise pixels forming a text line, text line with a height less than a threshold, θ l is removed. The lines segmented are further segmented into words using a parameter θ w. We also restrict the number of projections to a maximum of 3, which is sufficient for most real-life documents. One could employ a variety of techniques to learn the optimal set of values for these parameters (θ c,θ n,θ l and θ w ). The parameter space corresponding to the above θ is large and a simple gradientbased search would often get stuck at local maxima, due to the complexity of the function being optimized. To overcome these difficulties, a reinforcement learning based approach called Q-learning is used. Peng et al. [4] suggested a reinforcement learning based parameter estimation algorithm for document image segmentation. However, our formulation is fundamentally different from that described in [4] in two ways: 1) It assumes no knowledge about the ground truth of segmentation or recognition, and 2) The state space is modeled to incorporate the knowledge of the results of the intermediate stages of processing unlike in [4] that uses a single state after each stage of processing. 3.1 Feedback Based Parameter Learning The process of segmentation of a page, is carried out in multiple stages such as separation of text and image regions, and the segmentation of text into columns, blocks text lines and words. The sequential nature of the process lends well to learning in the Q-learning framework. In Q-learning, we consider the process that is to be learned as a sequence of actions that takes the system from a starting state (input image I) to a goal state (segmented image) [5]. An action a t from a state s t takes us to a new state s t+1 and could result in a reward, r t. The final state is always associated with a reward depending on the overall performance of the system. The problem is to find a sequence of actions that Q*(.) 3 2.5 2 1.5 1 0.5 Evaluation of Segmentation 1.4 1.2 1 0.8 0.6 0.4 0.2 GaussianNoise-Fixed GaussianNoise-Learn SaltAndPepper-Fixed SaltAndPepper-Learn 0 0 10 20 30 40 50 60 70 80 90 100 Number of Iterations (a) 0 0 2 4 6 8 10 Noise Levels (b) Fig. 1. (a): Q( ) over iterations, (b): segmentation accuracy of our approach and that of using fixed parameters in the presence of noise

Learning to Segment Document Images 475 maximizes the overall reward obtained in going from the start state to the goal state. The optimization is carried out with the help of the table (Q(s t,a t )) that maintains an estimate of the expected reward for the action a t from the state, s t.iftherearen steps in the process, the Q table is updated using the following equation: Q(s t,a t )=r t + γr t+1 +...+ γ n 1 r t+n 1 + γ n max (Q(s t+n,a)) (7) a In the example of using XY cuts, the process contains two stages, and the above equation would reduce to Q(s t,a t )=r t + γ max a Q(s t+1,a), wherer t is the immediate reward of the first step. Figure 1(a) shows the improvement of Q (s t,a t )= max a Q(s t,a t ) value over iterations for a particular document image. We note that the Q function converges in less than 100 iterations in our experiments. 4 Experimental Results The segmentation algorithm was tested on a variety of documents, including new and old books in English and Indian languages, handwritten document images, as well as scans of historic palm leaf manuscripts collected by the Digital Library of India (DLI) project. Figure 2 shows examples of segmentation on a palm leaf, handwritten document, an indian language document with skew and a document with complex layout. Note that the algorithm is able to segment the document with multiple columns and images even when there is considerable variation between the font sizes, inter-line spacing among the different text blocks and unaligned text lines. Performance in presence of Noise: Varying amounts of Gaussian noise and Salt-and- Pepper noise were added to the image and segmentation was carried out using the parameters learned from the resulting image. Figure 1(b) shows the segmentation performance of the learning-based approach as well as that using a fixed set of parameters, for varying amounts of noise in the data. We notice that the accuracy of the learning (a) (b) (c) Fig. 2. Results of segmentation on (a) palm leaf, (b) handwritten document, (c) line and word segmentation of Indian language document with skew and (d) a multicolumn document (d)

476 K.S. Sesh Kumar, A. Namboodiri, and C.V. Jawahar approach is consistently higher than that using a fixed set of parameters. At high levels of noise, the accuracy of the algorithm is still above 70%, while that using a fixed set of parameters falls to 20% or lower. The accuracy of segmentation, ρ, is computed as ρ =(n(l) n(c L S L M L F L ))/n(l), where n(.) gives the cardinality of the set, L is the set of ground truth text lines in the document, C L is the set of ground truth text lines that are missing, S L is the set of ground truth text lines whose bounding boxes are split, M L is the set of ground truth text lines that are horizontally merged, and F L is the set of noise zones that are falsely detected [2]. We notice that the algorithm performs close to optimum even in presence of significant amounts of noise. Meanwhile, the performance of the same algorithm, when using a fixed set of parameters, degrades considerably in presence of noise.our algorithm gave a performance evaluation of 97.80% on CEDAR dataset with five layouts and five documents of each layout and 91.20% on the Indian language DLI pages. Integrating Skew Correction: Skew correction is integrated into the learning framework by introducing a skew correction stage to the sequence of processes in segmentation, after the removal of image regions. An additional parameter, θ s, would be required to denote the skew angle of the document, which is to be rectified. The results of segmentation with skew can be viewed in 2(c). Defining an immediate reward for the skew correction stage, computed from the autocorrelation function of the projection histogram helps in speeding up the learning process. The proposed algorithm is able to achieve robust segmentation results in the presence of various noises and a wide variety of layouts. The parameters are automatically adapted to segment the individual documents/text blocks, along with correction of document skew. The algorithm can easily be extended to include additional processing stages, such as thresholding, word and character segmentation, etc. 5 Conclusion We have proposed a learning-based page segmentation algorithm using an evaluation metric which is used as feedback for segmentation. The proposed metric evaluates the quality of segmentation without the need of ground truth at various levels of segmentation. Experiments show that the algorithm can effectively adapt to a variety of document styles, and be robust in presence of noise and skew. References 1. G.Nagy, S.Seth, M.Vishwanathan: A Prototype Document Image Analysis System for Technical Journals. Computer 25 (1992) 10 12 2. Mao, S., Kanungo, T.: Emperical performance evaluation methodology and its application to page segmentation algorithms. IEEE Transactions on PAMI 23 (2001) 242 256 3. Sylwester, D., Seth, S.: Adaptive segmentation of document images. In: Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, WA (2001) 827 831 4. J.Peng, B.Bhanu: Delayed reinforcement learning for adaptive image segmentation and feature extraction. IEEE Transactions on Systems, Man and Cybernetics 28 (1998) 482 488 5. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT Press (1998)