The Book Structure Extraction Competition with the Resurgence Software at Caen University

Emmanuel Giguet and Nadine Lucas
GREYC Cnrs, Caen Basse Normandie University, BP 5186, F CAEN Cedex, France

Abstract. The GREYC Island team participated in the Structure Extraction Competition, part of the INEX Book track, for the first time, with the Resurgence software. We used a minimal strategy primarily based on top-down document representation. The main idea is to use a model describing relationships between elements in the document structure. Chapters are represented and implemented by frontiers between chapters. The page is also used as a unit. The periphery-center relationship is calculated on the entire document and reflected on each page. The strong points of the approach are that it deals with the entire document; it handles books without ToCs, and titles that are not represented in the ToC (e.g. preface); it is not dependent on lexicon, hence tolerant to OCR errors and language independent; it is simple and fast.

1 Introduction

The GREYC Island team participated for the first time in the Book Structure Extraction Competition, held at ICDAR in 2009 as part of the INEX evaluations [1]. The Resurgence software was modified to this end. This software was designed to handle various document formats, in order to process academic articles (mainly in pdf format) and news articles (mainly in HTML format) in various text parsing tasks [2]. We decided to join the INEX competition because the team was also interested in handling voluminous documents, such as textbooks.

The experiment was conducted over one month. It was run from pdf documents to ensure control of the entire process. The document content is extracted using the pdf2xml software [3]. The evaluation rules were not thoroughly studied, for we simply wished to check whether we were able to handle large corpora of voluminous documents.
The huge amount of memory needed to handle books was indeed a serious obstacle, compared with the ease of handling academic articles. The OCR texts were also difficult to cope with. Therefore, Resurgence was modified in order to handle the corpus. We could not propagate our principles to all the levels of the book hierarchy at once. We consequently focused on chapter detection.

In the following, we explain the main difficulties, our strategy and the results on the INEX book corpus. We provide corrected results obtained after a few modifications were made. In the last section, we discuss the advantages of our method and make proposals for future competitions.

S. Geva, J. Kamps, and A. Trotman (Eds.): INEX 2009, LNCS 6203, pp , Springer-Verlag Berlin Heidelberg 2010

2 Our Book Structure Extraction Method

2.1 Challenges

In the first stage of the experiment, the huge amount of memory needed to handle books was found to be a serious hindrance: pdf2xml required up to 8 GB of memory and Resurgence required up to 2 GB to parse the content of large books (> 150 MB). This was because the whole content of the book was stored in memory. The underlying algorithms did not actually require the whole content to be available at once. They were designed that way because they were meant to process short documents. Therefore, Resurgence was modified in order to load only the necessary pages. The objective was to allow processing on ordinary laptop computers.

The fact that the corpus consisted of OCR documents also challenged our previous program, which detected the structure of electronic academic articles. A new branch in Resurgence had to be written in order to tolerate OCR documents.

We had no time to propagate our document parsing principles to all the levels of the book hierarchy at once. We consequently focused on chapter detection.

2.2 Strategy

Very few principles were tested in this experiment. The strategy in Resurgence is based on document positional representation, and does not rely on the table of contents (ToC). This means that the whole document is considered first. Then document constituents are considered top-down (by successive subdivision), with focus on the middle part (main body). The document is thus the unit that can ultimately be broken down into pages.

The main idea is to use a model describing relationships between elements in the document structure. The model is a periphery-center dichotomy. The periphery-center relationship is calculated on the entire document and reflected on each page. The algorithm aims at retrieving the main content of the book, bounded by annex material such as preface and postface with a different layout.
It ultimately retrieves the page body in a page, surrounded by margins [2]. However, for this first experiment, we focused on chapter title detection, so that the program detects only one level, i.e. chapter titles.

Chapter title detection throughout the document was conducted using a sliding window, which is used to detect chapter transitions. The window captures a four-page context with a look-ahead of one page and a look-behind of two pages. The underlying idea is that a chapter begins after a blank, or at least is found in a relatively empty zone at the top of a page. The half-page fill rate is the simple cue used to decide on a chapter transition.

The beginning of a chapter is detected by one of the two patterns below, where i is the page where a chapter starts. Figures 1 and 2 illustrate the two patterns.

Pattern 1:
- top and bottom of page i-2 equally filled
- bottom of page i-1 less filled than top of page i-1
- top of page i less filled than bottom of page i
- top and bottom of page i+1 equally filled

Pattern 2:
- any content for page i-2
- empty page i-1
- top of page i less filled than bottom of page i
- top and bottom of page i+1 equally filled
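The two patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Resurgence implementation: the `PageFill` record, the `TOLERANCE` and `EMPTY` thresholds and the function names are hypothetical, and the fill rates of each half page are assumed to be already computed as values between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class PageFill:
    top: float     # fill rate of the top half page, 0.0 (blank) to 1.0 (full)
    bottom: float  # fill rate of the bottom half page


TOLERANCE = 0.15  # hypothetical: largest gap still counted as "equally filled"
EMPTY = 0.02      # hypothetical: at or below this, a half page counts as blank


def equally_filled(p: PageFill) -> bool:
    return abs(p.top - p.bottom) <= TOLERANCE


def is_empty(p: PageFill) -> bool:
    return p.top <= EMPTY and p.bottom <= EMPTY


def top_lighter(p: PageFill) -> bool:
    return p.top < p.bottom - TOLERANCE


def bottom_lighter(p: PageFill) -> bool:
    return p.bottom < p.top - TOLERANCE


def starts_chapter(pages: list[PageFill], i: int) -> bool:
    """Apply the four-page window (i-2, i-1, i, i+1) to page i."""
    if i < 2 or i + 1 >= len(pages):
        return False  # the window does not fit at the edges of the book
    before2, before1, page, after = pages[i - 2], pages[i - 1], pages[i], pages[i + 1]
    # Conditions shared by both patterns: pages i and i+1.
    if not (top_lighter(page) and equally_filled(after)):
        return False
    pattern1 = equally_filled(before2) and bottom_lighter(before1)
    pattern2 = is_empty(before1)  # any content is accepted for page i-2
    return pattern1 or pattern2


def chapter_starts(pages: list[PageFill]) -> list[int]:
    """Return the indices of all pages where a chapter is detected."""
    return [i for i in range(len(pages)) if starts_chapter(pages, i)]
```

Because only four `PageFill` records are inspected at a time, a sketch like this never needs the whole book content in memory, which fits the constraint described in Section 2.1.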

Fig. 1. View of the four-page sliding window to detect a chapter beginning. Pattern 1 matches. Excerpt from 2009 book id = 00AF1EE1CC79B277.

Fig. 2. View of the four-page sliding window to detect a chapter announced by a blank page. Pattern 2 matches. Excerpt from 2009 book id = 00AF1EE1CC79B277.

Chapter title extraction is made from the first third of the beginning page. The model assumes that the title begins at the top of the page. The end of the title was not carefully looked for. The title is roughly delineated by a constraint rule allowing a number of lines containing at most 40 words.

2.3 Experiment

The program detected only chapter titles. No effort was made to find the sub-titles. The three runs were not very different, since runs 2 and 3 amount to post-processing of the ToC generated by run 1.

- Run 1 was based on minimal rules as stated above.
- Run 2 was the same, plus removal of white spaces at the beginning and end of the title (trim).
- Run 3 was the same, plus trim, plus pruning of lower-case lines following a would-be title in upper case.

2.4 Commented Results

The entire corpus was handled. The results were equally poor for the three runs. This was due to a page numbering bug where p = p-1. The intriguing non-zero value of 0,08% came from rare cases where the page contained two chapters (two poems).

Table 1. Book Structure Extraction official evaluation (footnote 1)

RunID        Participant                           F-measure (complete entries)
MDCS         Microsoft Development Center Serbia   41,51%
XRCE-run2    Xerox Research Centre Europe          28,47%
XRCE-run1    Xerox Research Centre Europe          27,72%
XRCE-run3    Xerox Research Centre Europe          27,33%
Noopsis      Noopsis inc.                          8,32%
GREYC-run1   GREYC - University of Caen, France    0,08%
GREYC-run2   GREYC - University of Caen, France    0,08%
GREYC-run3   GREYC - University of Caen, France    0,08%

Table 2. Detailed results for GREYC

                             Precision   Recall    F-Measure
Titles                       19,83%      13,60%    13,63%
Levels                       16,48%      12,08%    11,85%
Links                        1,04%       0,14%     0,23%
Complete entries             0,40%       0,05%     0,08%
Entries disregarding depth   1,04%       0,14%     0,23%

The results were recomputed after correcting the unfortunate page number shift in the INEX grid (Table 3). The alternative evaluation grid suggested by [4, 5] was also applied. In Table 4, for the GREYC result, the run corrected for p = p-1 is reported under the name "GREYC-1".

1 doucet/structureextraction2009/
2 AlternativeResults.html

Table 3. GREYC results with page numbering correction

         precision   recall   F-measure
run-1    10,41       7,41     7,66
run-2    10,56       7,61     7,85
run-3    11,22       7,61     8,02

Table 4. Alternative evaluation (XRCE Link-based Measure). Columns: Links (Precision, Recall, F1) and Accuracy for valid links (Title, Level). Rows: MDCS, XRCE-run, XRCE-run, XRCE-run, Noopsis, GREYC, GREYC.

The results still suffered from insufficient attention to the evaluation rules. Notably, the title hierarchy is not represented, which impairs recall. Titles were roughly segmented on the right side, which impairs precision. Title accuracy is also very low for the same reason. However, level accuracy balances the bad results reflected in the F1 measure. The idea behind level accuracy is that good results at a given level are more satisfying than errors scattered everywhere. The accuracy for the chapter level, which was the only level we attempted, was 73,2%, the second highest. It means that few chapter beginnings were missed by Resurgence. Errors reflect both non-responses and wrong responses. Our system returned 80 non-responses for chapters, out of 527 in the sample, and very few wrong responses. Chapter titles starting in the second half of the page were missed, as well as some chapters where the title was not clearly contrasted against the background.

2.5 Corrections after Official Competition

A simple corrective strategy was applied in order to better compare methods. First, the bug on page numbers was corrected. Second, a new feature boosted precision. In a supplementary run (run 4), both the page number shift and chapter title detection were amended. The right end of the title is detected by calculating the line height disruption: a contrast between the line height of the would-be title and the line height of the rest of the page. These corrections result in better precision, as shown in Table 5 (line GREYC-2) with the XRCE link-based measure. The recall rate is not improved because the subtitles are still not looked for.
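As an illustration of the line height disruption, the sketch below cuts a would-be title at the first line whose height no longer contrasts with the rest of the page, while keeping the 40-word limit from Section 2.2. It is a minimal sketch under assumptions of ours, not the code of run 4: the `Line` record, the use of the median as the reference line height, the `HEIGHT_CONTRAST` ratio and the premise that title lines are taller than body lines are all hypothetical. Restricting the input to the first third of the page is left to the caller.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    height: float  # line height in layout units (e.g. points)


MAX_TITLE_WORDS = 40   # word limit stated in Section 2.2
HEIGHT_CONTRAST = 1.2  # hypothetical ratio: title lines must be this much taller


def extract_title(lines: list[Line]) -> str:
    """Collect lines from the top of the page until the height contrast ends."""
    if not lines:
        return ""
    # Assumption: the median line height approximates the body text height.
    body_height = median(line.height for line in lines)
    title_parts: list[str] = []
    word_count = 0
    for line in lines:
        words = len(line.text.split())
        if line.height < HEIGHT_CONTRAST * body_height:
            break  # line height disruption: the body text starts here
        if word_count + words > MAX_TITLE_WORDS:
            break  # coarse constraint: at most 40 words in a title
        title_parts.append(line.text.strip())
        word_count += words
    return " ".join(title_parts)
```

With this rule, a page whose first lines are no taller than the body yields an empty title, which would count as a non-response rather than a wrong response.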

Table 5. Corrected run (GREYC-2) with better title extraction compared with previous results (XRCE Link-based Measure). Columns: Links (Precision, Recall, F1) and Accuracy for valid links (Title, Level). Rows: GREYC, GREYC, GREYC.

Table 6 reorders the final results of the Resurgence program (GREYC-2) against the other participants' known performance. The measure is the XRCE Link-based measure.

Table 6. Best alternative evaluation (XRCE Link-based Measure). Columns: Links (Precision, Recall, F1) and Accuracy for valid links (Title, Level). Rows: MDCS, XRCE-run, GREYC, Noopsis.

3 Discussion

The experiment was preliminary. We were pleased to be able to handle the entire corpus and go through evaluation, since only four competitors out of eleven finished [1]. The results were very bad, even if small corrections significantly improved them. In addition to the unfortunate page shift, the low recall is due to the fact that the hierarchy of titles was not addressed, as mentioned earlier. This will be addressed in the future.

3.1 Reflections on the Experiment

On the scientific side, the importance of quick and light means to handle the corpus was salient. The pdf2xml program, for instance, returned too much information for our needs and was expensive to run. We wish to contribute to the improvement of the software on these aspects.

Although the results are bad, they showed some strong points of the Resurgence program, which is based on relative position and differential principles. We intend to explore this direction further. The advantages are the following:
- The program deals with the entire document, not only the table of contents;
- It handles books without ToCs, and titles that are not represented in the ToC (e.g. preface);
- It is dependent on typographical position, which is very stable in the corpus;

- It is not dependent on lexicon, hence tolerant to OCR errors and language independent;
- Last, it is simple and fast.

Some examples below underline our point. They illustrate problems that are met in classical literal approaches but are avoided by the positional solution.

Example 1. Varying forms for the string chapter due to OCR errors handled by Resurgence

CHAPTEK CHAPTEE CH^PTEE CHAP TEE CHA 1 TKR C II APT Kit (MI A I TKIl C II A P T E II C H A P TEH C H A P T E R C II A P T E U Oil A PTKR

Since no expectations bear on language-dependent forms, chapter titles can be extracted from any language. A reader can detect a posteriori that a book is written in French (first series) or in English (second series).

TABLE DES MATIÈRES, DEUXIEME PARTIE, CHAPITRE, TROISIÈME PARTIE, QUATRIÈME PARTIE
PREFACE, CHAPTER, TABLE OF CONTENTS, INTRODUCTION, APPENDIX

Since position is used instead of a list of expected and memorized forms, fairly common strings are extracted, such as CHAPTER or SECTION, but also uncommon ones, such as PSALM or SONNET. When chapters have no numbering and no prefix such as chapter, they are found as well, for instance a plain title Christmas Day.

Resurgence did not rely on the numbering of chapters, which is an important source of OCR errors, as in the following series. Hence they were retrieved as they were by our robust extractor.

II HI IV V VI
SECTION VI SECTION VII SECTION YTIT SECTION IX SKOTIOX XMI SECTION XV SECTION XVI
THE FIRST SERMON THE SECOND SERMON THE THIRD SERMON THE FOURTH SERMON
CHAPTEE TWELFTH CHAPTER THIRTEENTH CHAPTER FOURTEENTH

The approach we used was minimal, but reflects an original breakthrough to improve robustness without sacrificing quality.

3.2 Proposals

Concerning evaluation rules, generally speaking, it is unclear whether the ground truth depends on the book or on the ToC. If the ToC is the reference, it is an error to extract prefaces, for instance. Participants using the whole text as the main reference would be penalized if they extracted the whole hierarchy of titles as it appears in the book when the ToC represents only the higher levels, as is often the case.

Concerning details, it should be clear whether or when the prefix indicating the book hierarchy level (Chapter, Section, and so on) and the numbering should be part of the extracted result. As mentioned earlier and as can be seen in Figure 1, the chapter title is not necessarily preceded by such mentions, but in other cases there is no specific chapter title and only a number. The ground truth is not clear either on the case of the extracted title: sometimes the case differs between the ToC and the actual title in the book.

It would be very useful to provide results by title depth (level) as suggested by [4], because it seems that providing complete results for one or more levels would be more satisfying than missing some items at all levels. It is important to get coherent and comparable text spans for many tasks, such as indexing, helping navigation or text mining.

The reason why the beginning and end of the titles are overrepresented in the evaluation scores is not clear, and a more straightforward edit distance for extracted titles should be provided.

There is also a bias introduced by a semi-automatically constructed ground truth. Manual annotation is still to be conducted to improve the ground truth quality, but it is time-consuming. We had technical difficulties meeting that requirement in summer. It might be a better idea to open annotation to a larger audience and for a longer period of time.

It might be a good idea to give the bounding box containing the title as a reference for the ground truth. This solution would solve conflicts between manual annotation and automatic annotation, leaving man or machine to read and interpret the content of the bounding box. It would also alleviate conflicts between ToC-based and text-based approaches.

The corpus provided for the INEX Book track is very valuable: it is the only available corpus offering full books. Although it comprises old printed books only, it is interesting because it provides various examples of layout.

References

1. Doucet, A., Kazai, G.: ICDAR 2009 Book Structure Extraction Competition. In: 10th International Conference on Document Analysis and Recognition ICDAR 2009, Barcelona, Spain, pp IEEE, Los Alamitos (2009)
2. Giguet, E., Lucas, N., Chircu, C.: Le projet Resurgence: Recouvrement de la structure logique des documents électroniques. In: JEP-TALN-RECITAL 08 Avignon (2008)
3. Déjean, H.: pdf2xml open source software (2010), (last visited March 2010)
4. Déjean, H., Meunier, J.-L.: XRCE Participation to the Book Structure Task. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX LNCS, vol. 5631, pp Springer, Heidelberg (2009) doi: /
5. Déjean, H., Meunier, J.-L.: XRCE Participation to the Book Structure Task. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX LNCS, vol. 5631, pp Springer, Heidelberg (2009)
