Automatic Article Extraction in Old Newspapers Digitized Collections


David Hebert, Pierrick Tranouez, Thomas Palfray, Thierry Paquet, Stephane Nicolas

ABSTRACT

We present a complete method for article segmentation in old newspapers, which deals with the complex layout analysis of degraded documents. The workflow can process large numbers of documents and generates digital objects in METS/ALTO format in order to facilitate the indexing and browsing of information in digital libraries. The analysis of the document image is performed by a two-stage scheme. In the first stage, pixels are labelled with a Conditional Random Field model in order to label the areas of interest at a low logical level. In the second stage, this first logical representation of the document content is analysed to obtain a higher logical representation including article segmentation and reading order. This top-level structural analysis relies on the generation of an article separation grid applied recursively on the document image, which allows the analysis of any type of Manhattan page layout, even for complex structures with multiple columns and overlapping entities. The method benefits from both a local analysis using a probabilistic model trained by machine learning and a more global structural analysis using recursive rules. It is evaluated on a dataset of daily local press document images covering several time periods and different page layouts.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX...$

Keywords

page layout analysis, information extraction from document images, logical structure, article extraction in newspapers, document image labelling, conditional random field, structural analysis

1. INTRODUCTION

During the last twenty years, archives and national libraries worldwide have launched digitization programs for their historical collections in order to preserve them, while providing remote access to a wider range of potential users thanks to the internet. Old newspaper archives are emblematic of this trend. However, digitization programs for such collections require indexing facilities for the document images obtained after scanning. Indeed, considering the large number of documents these collections contain, and the large number of articles each document can contain, smart indexing and retrieval is required. This is only conceivable through textual queries from users. This is why textual transcriptions of the digitized collections are necessary, as well as a deep understanding of the document structure so as to extract each article of the collection.

Standard OCR technologies cannot fully automate the whole process of transcription generation and page layout understanding for old newspapers. A specific digitization process is required, dedicated to complex page layout analysis, character recognition (OCR), logical structure detection and reading order detection. Physical structure extraction is the process that extracts the physical structure of the document, such as columns and lines of text, prior to the OCR process. Logical structure analysis is the process that gives access to the information units of the document and their organization. It gives access to descriptors (logical tags, also known as metadata) such as Title, Sub-Title, Chapter, Article, Paragraph, Caption, etc.

Generally, these two processes of document structure extraction (known as physical and logical layout analysis) operate separately and sequentially, one after the other. This is justified by the fact that, most of the time, physical segmentation of document images can be performed without any additional knowledge, whereas logical layout extraction is generally performed using a document model (e.g. a style sheet) that expresses the relations between the physical and logical entities of the two representations of the document. However, it is now well known that difficult segmentation tasks must incorporate a recognition stage in order to improve their performance.

Therefore, we have developed a new methodology dedicated to the logical labelling of old newspaper images. This method is intended to extract metadata from the images of the digitized collection, through the joint use of a pixel sequence classification method based on Conditional Random Field modelling and a set of rules defining the concept of an article within a newspaper. Based on physical descriptors, the pixel labelling directly performs a logical analysis that gives a first, low logical level of segmentation. Then, based on the detected logical entities, the set of rules brings a higher logical level in order to detect articles in a newspaper.

In the first part of this paper we give an overview of the related work in the literature. The second part describes our method. The third part is dedicated to the evaluation of the approach, which was tested on old newspaper issues from the Journal de Rouen, a regional French newspaper. Finally, we conclude with a discussion of the potential of the method and of future work.

2. RELATED WORK

Since 2001, the ICDAR conference has organized a document page segmentation competition [2] in which some of the proposed algorithms may have goals similar to those of the system we propose in this paper.
Nevertheless, the document dataset used for this competition contains only modern documents, so the proposed methods may be inefficient for old newspapers. Among the contributions, we can mention the work described in [1], which is based on a pixel labelling stage that may be adaptable to old documents. In [3], an approach based on the detection of maximal empty rectangles to delimit columns and text blocks is described. This method is integrated in the OCROPUS open-source OCR system. Although interesting, this method does not cope with the difficulties inherent to old documents (skew, deformations, ...). A more interesting method that takes these difficulties into account is described in [11]. The authors propose a multiscale approach to extract text blocks in old newspapers. Although efficient, this method is limited to detecting text blocks: it provides neither the logical structure nor the reading order. The approach we propose in this paper tries to bring a solution to this problem.

3. PROPOSED APPROACH

The method we present in this paper has been implemented as a complete system dedicated to processing a large number of old newspaper document images. It automatically analyzes the logical organization of pages so as to extract articles. This extraction process has to perform both physical and logical layout analysis. As opposed to the traditional approaches, which separate physical and logical analysis, we have designed a methodology that performs logical analysis both at pixel level and at text block level. First, we proceed to a logical labelling of the image at pixel level, which gives a low-level representation of the logical structure to be extracted. This image processing stage embeds some knowledge about the physical organisation of the logical information to be extracted. It is based on a machine learning approach (Conditional Random Fields).
The labelled image obtained is further analyzed during a second stage that recursively builds the logical entities of the document by referring to a generic layout model. This reconstruction stage can be viewed as a specific bidimensional parser that provides the parse tree of the document image. Finally, the system provides XML METS/ALTO output files. These files contain the logical structure describing the reading order of the articles, as well as their physical layout, composed of the detected text lines and the associated characters recognized by OCR. The following two subsections provide an in-depth description of the two main steps of the methodology.

3.1 Logical labelling at pixel level

The proposed method for article extraction in newspaper document images relies on a first segmentation stage using a Conditional Random Field (CRF) model with multiscale quantization feature functions. This approach has been presented in detail in [8], and we recall only its most important steps here.

Conditional Random Fields, introduced in 2001 by Lafferty et al. [9], have opened a new way for sequence and image analysis. In its original formulation, a CRF is a stochastic model of a process that accounts for the dependencies between a sequence of discrete observations (originally a word sequence) and a sequence of labels that can be associated with these observations (originally part-of-speech tags). The application of CRFs to image labelling requires some adaptation in order to deal with numerical values instead of discrete observations. This adaptation can be viewed as a preprocessing step dedicated to providing the CRF with discrete observations extracted from the raw numerical pixel values of the image. In the field of computer vision, [7, 10] use the outputs of a neural network or an SVM to feed a CRF. CRFs have also been applied to document structure extraction: [12] uses a 2D-CRF based approach on top of a first neural network classification stage.
Another example can be found in [4], where the CRF model is introduced at a second stage after a pixel classification stage. Training these systems is particularly difficult because they require training two subsystems: a local classifier, and then a CRF that is fed with the output of the first classifier.

In the proposed system, we use a conditional random field with multi-scale quantization feature functions [8]. Such an approach requires training only one CRF. Physical descriptors of the image, made of run-length features, are quantized using several quantization functions and fed to the CRF. Let us define a linear quantization function with quantifier $q$ that quantizes the continuous observation $o$ as follows:

$$Q(o, q) : O \to X, \quad o \mapsto x = \mathrm{round}\left(\frac{o}{q}\right)$$

Assuming $o$ ranges within the interval $[o_{min}, o_{max}]$, the quantization function can take only $N_d$ discrete values, with $N_d = (o_{max} - o_{min})/q$. To avoid having to choose a single value of $q$, we use multiple quantization functions. Let $q_1, q_2, \ldots, q_N$ be a set of quantizers, each defining a quantization function $Q_i(o) = Q(o, q_i)$. By choosing a dyadic law of quantifiers, $q_i = 2 q_{i-1} = q_1 2^{i-1}$, we build a multi-scale quantization scheme with the ability to keep most of the original information contained in the continuous features without any assumption about the distribution of these features. With these multiscale quantizations, the CRF model can be written as in equation (1):

$$p(y \mid X) = \frac{1}{Z(X)} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \lambda_k f_k\big(y_{t-1}, y_t, Q_1(X), \ldots, Q_N(X), t\big) \right) \qquad (1)$$

In the old newspaper article extraction workflow, we use a CRF with multiscale quantized feature functions on a set of physical descriptors made of vertical and horizontal run lengths, as described in [8]. We choose a first quantization step $q_1 = 2$ with the dyadic law $q_i = 2 q_{i-1}$, giving 10 quantization scales on our images.

One strength of a CRF model is its ability to deal with contextual information in the observation space. Concretely, a decision at position $t$ takes into account not only the observation at this position but also the neighbouring ones in a defined vicinity. The logical labelling uses a sequential CRF along the horizontal direction with a context of the two previous and the two next observations around the current pixel to label. This system provides a fine segmentation of the image at pixel level, where each pixel is associated with a logical label specifying the logical function of the entity this pixel belongs to. The logical functions are the following:

- Vertical separator
- Horizontal separator
- Titles (composed of title characters and title inter-words)
- Text lines (composed of characters, inter-characters and inter-words)
- Noise
- Background

An example of the labelling result is given in figure 1.

Figure 1: The CRF logical labelling at pixel level for the image (a)
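The multi-scale quantization of run-length features described above can be sketched as follows. This is only an illustrative sketch, not the implementation of [8]: the function names (`horizontal_run_lengths`, `multiscale_quantize`, `context_window`) are hypothetical, the rounding mode (halves rounded up) and the clamping of the context window at row borders are assumptions, and only the values $q_1 = 2$, the dyadic law, the 10 scales and the context of two observations on each side come from the text.

```python
import numpy as np

def horizontal_run_lengths(row):
    """For each pixel of a binary row, return the length of the run it belongs to."""
    row = np.asarray(row)
    out = np.zeros(len(row), dtype=int)
    start = 0
    for i in range(1, len(row) + 1):
        # A run ends at the end of the row or when the pixel value changes.
        if i == len(row) or row[i] != row[start]:
            out[start:i] = i - start
            start = i
    return out

def multiscale_quantize(obs, q1=2, n_scales=10):
    """Quantize observations with the dyadic quantizers q_i = q1 * 2**(i-1).

    Returns an integer array of shape (n_scales, len(obs)).
    """
    obs = np.asarray(obs, dtype=float)
    quantizers = q1 * 2 ** np.arange(n_scales)
    # Assumption: halves are rounded up (np.round would round halves to even).
    return np.floor(obs[None, :] / quantizers[:, None] + 0.5).astype(int)

def context_window(features, t, radius=2):
    """Quantized observations from t-radius to t+radius.

    Assumption: positions outside the row are clamped to the nearest border.
    """
    n = features.shape[1]
    idx = np.clip(np.arange(t - radius, t + radius + 1), 0, n - 1)
    return features[:, idx]

if __name__ == "__main__":
    runs = horizontal_run_lengths([0, 0, 0, 1, 1, 0])
    feats = multiscale_quantize(runs)          # 10 scales x 6 pixels
    print(context_window(feats, t=3).shape)    # (10, 5)
```

Each column of the result gives one pixel's run length seen at every scale, so coarse scales group long runs together while fine scales keep short runs distinct.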
This labelling stage provides a precise logical description of the document at pixel level, but it does not yet provide the logical structure of the document as a whole. This is the goal of the second labelling stage, which we present in the following subsection.

3.2 Logical structure extraction

Logical structure extraction from document images aims at producing the parse tree of the document, where each node of the tree accounts for a particular entity of its logical organization, e.g. title, sub-title, paragraph, figure, caption, etc. In addition, the order in which these entities are organized, namely the logical order, is associated with the tree. Logical structure extraction is the process that takes as input a raw representation of the document at pixel level and provides a high-level representation. Therefore, logical structure extraction can be viewed as a particular bidimensional parser. As already mentioned, this stage generally takes place after a first physical segmentation stage of the document. In the method that we propose here, we exploit both the physical properties of the image entities and the labelling produced by the CRF labelling stage. The labels are Title, Text, H separator and V separator. The approach is composed of two main steps:

1- the detection of labelled atomic entities in the image
2- the reconstruction of higher-level entities using a generic document model.

We now detail these two steps.

Labelled atomic entities detection

The atomic entities on which the reconstruction process operates are text lines, titles, and horizontal and vertical separators. These entities are detected in the image using the output of the CRF labelling stage.
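As a rough illustration of how atomic entities might be derived from the pixel labels of Section 3.1, the sketch below merges the fine-grained labels into text and title classes and then gives every connected component its most frequent label, ignoring background. The label strings, the function names, the use of 4-connectivity and the restriction of the vote to text and title pixels are all assumptions made for the example; the paper does not specify them.

```python
from collections import Counter, deque

# Hypothetical label names; only the grouping itself comes from the paper.
MERGE = {
    "character": "text", "inter-character": "text", "inter-word": "text",
    "title-character": "title", "title-inter-word": "title",
}

def merge_labels(grid):
    """Map fine-grained pixel labels to entity labels (text, title, others unchanged)."""
    return [[MERGE.get(label, label) for label in row] for row in grid]

def majority_relabel(grid, entity_labels=("text", "title")):
    """Give each 4-connected component of entity pixels its most frequent label."""
    height, width = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if seen[y][x] or grid[y][x] not in entity_labels:
                continue
            # Flood fill to collect one connected component of entity pixels.
            component, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and grid[ny][nx] in entity_labels):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Majority vote inside the component; background never takes part.
            winner = Counter(grid[cy][cx] for cy, cx in component).most_common(1)[0][0]
            for cy, cx in component:
                out[cy][cx] = winner
    return out

if __name__ == "__main__":
    pixels = [
        ["character", "inter-character", "title-character"],
        ["background", "background", "background"],
        ["title-character", "title-inter-word", "background"],
    ]
    for row in majority_relabel(merge_labels(pixels)):
        print(row)
```

In the small example, the stray title pixel attached to a text component is relabelled as text, while the separate title component is left untouched.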
First, titles and text lines are detected by applying the following label merging rules:

- text line = characters + inter-characters + inter-words
- title = title characters + title inter-words

Despite the very good performance of the CRF labelling stage, some pixel labelling inconsistencies must be detected and corrected before the reconstruction process can take place. Most of the inconsistencies occur when entities with different labels are connected together (for example, text entities and title entities being connected together). Such erroneous cases are corrected by labelling the entities involved with the most frequent label (background pixels are not considered). The text lines are obtained by extracting the connected components that are labelled text in the resulting image. Despite the robustness of the extraction process, some text lines may be connected because of strong deformations in the image due to degradation or digitization artefacts. Possibly connected text lines are detected by computing the average surface of the text lines in the whole document image. Text entities with a surface much higher than the mean surface are then considered erroneous. These situations are corrected by a specific algorithm that separates them. The results provided by this detection and labelling process are shown in figure 2. Structural entities such as text lines, titles, and vertical and horizontal separators can then be extracted from the image.

Figure 2: Labelled entities detection

Article extraction using a layout model

The atomic entities of interest detected in the image match one of the following labels: title, text, horizontal separator and vertical separator. The page layout of a newspaper follows some precise layout rules, namely a layout model. A page layout model is made of a precise description of the allowed spatial physical organization of articles as well as a description of the organization of an article. The model that was implemented in this study copes with complex multi-section and multi-column page layouts. A page is a section. A section can be composed of many sections that are organized sequentially, one below the other, or hierarchically, one inside the other. They are separated by a large horizontal separator that spans the whole width of the section. Each section can contain multiple columns, and the number of columns can vary between sections. Columns are separated by vertical separators that span the whole height of the section. A section contains a sequence of articles (figure 3(a)). An article begins with a title entity followed by at least one text entity, and ends with a horizontal separator entity or another title entity.

Figure 3: (a) A section is the set of blocks of the same color. (b) The unambiguous reading order inside sections.

The detection of the spatial organization of pages in sections appears to be the key issue that further enables the detection of articles. Indeed, articles are organized sequentially within each section and are separated by titles and/or horizontal separators. Article extraction is therefore implemented in two main steps. First, the physical grid constituted by vertical and horizontal separators is detected and text blocks are assigned to their surrounding cell. Section delimiters are also detected at the end of this process. Second, articles are detected easily, as they are made of the text blocks delimited between successive titles and horizontal separators, following the reading order of the section.

Figure 4: The grid of cells containing at least one text line
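The article rule stated above (a title, at least one text entity, then a horizontal separator or another title) can be sketched as a single pass over the entities of a section. The sketch assumes the entities are already sorted in reading order; the `Article` class, the `group_articles` function and the entity kinds `"title"`, `"text"` and `"hsep"` are hypothetical names. The paper does not say how text without a preceding title or a title without text should be treated, so keeping the former as an untitled article and dropping the latter are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Article:
    title: Optional[str]
    blocks: List[str] = field(default_factory=list)

def group_articles(entities: List[Tuple[str, str]]) -> List[Article]:
    """Group (kind, content) entities, given in reading order, into articles."""
    articles: List[Article] = []
    current: Optional[Article] = None

    def close() -> None:
        nonlocal current
        # An article needs at least one text block (assumption: otherwise drop it).
        if current is not None and current.blocks:
            articles.append(current)
        current = None

    for kind, content in entities:
        if kind == "title":
            close()                      # a new title ends the previous article
            current = Article(title=content)
        elif kind == "hsep":
            close()                      # a horizontal separator ends the article
        elif kind == "text":
            if current is None:          # assumption: keep untitled text as an article
                current = Article(title=None)
            current.blocks.append(content)
        else:
            raise ValueError(f"unknown entity kind: {kind!r}")
    close()
    return articles

if __name__ == "__main__":
    section = [
        ("title", "A"), ("text", "a1"), ("text", "a2"),
        ("title", "B"), ("text", "b1"), ("hsep", ""),
        ("text", "c1"),
    ]
    for article in group_articles(section):
        print(article)
```

With this rule, a text line that the labelling stage mistakes for a title opens a new article, which is one way a single labelling error can turn into an over-segmented article.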
Detection of the separator grid and text blocks.

The first step of our text block detection consists in extending all the entities that act as logical separators. These are the entities labelled "separator" and those labelled "title". Extending a horizontal (resp. vertical) separator consists in extending its width (resp. height) until it touches the left and right vertical (resp. top and bottom horizontal) separators. We apply the following steps sequentially:

1- Create the vertical and horizontal separator mask
2- Connect the neighbouring vertical separators
3- Extend vertical separators as long as they do not cross a horizontal separator or a title
4- Connect the neighbouring horizontal separators
5- Extend horizontal separators and titles as long as they do not cross a vertical separator

At the end of this process, a grid covering the entire image is obtained, and it serves to extract the articles. For that purpose, each grid cell is associated with the text lines it surrounds. The title entities are also associated with their surrounding grid cell. Cells containing no text line are removed. Finally, we obtain a list of text blocks made of the remaining grid cells (figure 4).

Reading order detection.

Section detection is based on the detection of the horizontal section delimiters. These are the horizontal delimiters that span multiple columns. Text blocks are grouped within the section they belong to, and they are organized sequentially following a top-down, left-to-right reading order. By definition, a section follows this unambiguous reading order (see figure 3). Within each section, following the reading order, articles are made of the successive text blocks delimited between title entities and horizontal separators (figure 5).

Figure 5: The final article segmentation

4. RESULTS

4.1 Quantitative evaluation

This method was tested on a dataset containing 42 document images from a French regional newspaper, the Journal de Rouen. The results were checked manually by visual inspection. We determined the article detection rate and the article over-segmentation rate. These results are given in table 1.

Table 1: Results of the whole process of logical segmentation into articles

#articles  #detected  #correct  %correct  %over-seg
226        245        194       85.84     8.

The analysis of the errors shows that a large proportion of them are due to labelling errors produced by the CRF segmentation stage. For example, 14 text lines were erroneously labelled as titles. This leads to the detection of 14 extra articles when applying the editorial rules, that is, 28 detected articles instead of 14. These errors produce over-segmentation in most cases.

4.2 Mass evaluation of the method

The proposed method has been used extensively during the digitization process of the old newspaper Journal de Rouen. 21550 documents of 4 pages on average have been processed by the workflow described in this paper. These documents are newspapers published between 1767 and 1843. The layout is not constant and evolves over this long period of time. The simplest one is given by the running example of this paper (figure 1(a)). Some examples of more complex layouts are shown in figures 6 and 7. Results cannot be quantified but seem consistent with the 85% found in the quantitative evaluation. Some regions of documents are more difficult to analyse, such as documents with tables or documents without any vertical separator between columns. The errors induced by missing separators can be avoided by evolving the separator definition.
At this time, a separator is a physical entity, but considering large white spaces as separators can solve these issues.

Figure 6: Some examples of layouts available in the collection and their associated article segmentation

5. CONCLUSION AND FUTURE WORK

We presented in this paper a logical segmentation method based on the analysis of low-level labelling results produced by a CRF model, using a set of rules defined by a generic layout model. The proposed method is able to segment the textual content of old newspapers with a complex Manhattan structure (multiple columns), using a small set of simple rules. With this method we obtain an article segmentation rate of 85.84% on a test dataset containing 42 images of the Journal de Rouen, one of the oldest French regional newspapers. These first results are promising and allow us to identify two main areas for improvement. The first one consists in improving our CRF model, because we noted that the majority of the errors come from this stage. In a second step, we will further improve both the CRF model and the layout rules in order to take into account other important entities of the document structure, such as figures, pictures, captions and tables.

6. REFERENCES

[1] C. An, D. Yin, and H. S. Baird. Document segmentation using pixel-accurate ground truth. In ICPR. IEEE.
[2] A. Antonacopoulos, S. Pletschacher, D. Bridson, and C. Papadopoulos. ICDAR 2009 page segmentation competition. In ICDAR. IEEE Computer Society, 2009.

[3] T. M. Breuel. Two geometric algorithms for layout analysis. In Proceedings of the 5th International Workshop on Document Analysis Systems V, DAS 02, London, UK. Springer-Verlag.
[4] S. Chaudhury, M. Jindal, and S. Dutta Roy. Model-guided segmentation and layout labelling of document images using a hierarchical conditional random field. In Proceedings of the 3rd International Conference on Pattern Recognition and Machine Intelligence, PReMI 09, Berlin, Heidelberg. Springer-Verlag.
[5] T.-M.-T. Do and T. Artières. Conditional Random Fields for Online Handwriting Recognition. In G. Lorette, editor, Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France), Oct. 2006. Université de Rennes 1, Suvisoft.
[6] S. Feng, R. Manmatha, and A. McCallum. Exploring the use of conditional random field models and HMMs for historical handwritten document recognition. In Proceedings of the 2nd IEEE International Conference on Document Image Analysis for Libraries (DIAL), pages 30-37, Washington, DC, USA, 2006.
[7] X. He, R. S. Zemel, and M. A. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR.
[8] D. Hebert, T. Paquet, and S. Nicolas. Continuous CRF with multi-scale quantization feature functions: application to structure extraction in old newspaper. In International Conference on Document Analysis and Recognition (ICDAR). IEEE, March.
[9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML 01, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[10] C.-H. Lee, S. Wang, A. Murtha, M. R. G. Brown, and R. Greiner. Segmenting brain tumors using pseudo-conditional random fields. In D. N. Metaxas, L. Axel, G. Fichtinger, and G. Székely, editors, MICCAI (1), volume 5241 of Lecture Notes in Computer Science. Springer.
[11] A. Lemaitre, J. Camillerapp, and B. Couasnon. Approche perceptive pour la reconnaissance de filets bruités - application à la structuration de pages de journaux. In Actes du Xème Colloque International Francophone sur l'Ecrit et le Document, CIFED 08, pages 61-66, France. A. T. et Thierry Paquet (ed.).
[12] S. Nicolas, J. Dardenne, T. Paquet, and L. Heutte. Document image segmentation using a 2D conditional random field model. In ICDAR. IEEE Computer Society, 2007.

Figure 7: Some other examples of layouts available in the collection and their associated article segmentation
