Classification and Information Extraction for Complex and Nested Tabular Structures in Images



Abstract

Understanding technical documents, such as manuals, is one of the most important steps in the automatic reporting and/or troubleshooting of defects. The majority of the relevant information exists in tabular structures. There are some solutions for extracting tabular structures from text. However, extracting tabular information from images, and on top of that from complex and nested tables, is still a major problem. This paper proposes classification and information extraction methods for complex tabular structures in document images. These are hybrid approaches using both image layout and OCRed text. On a real-world technical document dataset from a German railway company (Deutsche Bahn AG), the proposed methods outperform other state-of-the-art approaches. As a result, the proposed approaches won the competition held by Deutsche Bahn AG in 2016 against other participating research groups and companies.

Index Terms: Classification, Information Extraction, Table Information Extraction, Tabular Structure

I. INTRODUCTION

Most technical documents that are used for reporting and/or troubleshooting of problems exist as documents with a tabular layout. Automating this process requires document understanding. Existing solutions can extract tables, or more specifically tabular structures. Still, performing this process on images that contain complex, nested tables is a problem, as most state-of-the-art solutions only work for simple text-based scenarios. In 2016, the German railway company Deutsche Bahn AG (DB) held a competition [1] on classifying and extracting information from document images with complex, nested tabular structures. Such documents give rise to many challenges, both in general and specifically with respect to classification and information extraction (IE).
The common challenges include:
i) Varying image resolutions: from low to high, as can be seen in figures 1(a) and 1(b).
ii) Bad optical character recognition (OCR) results even for high resolution images, as can be seen in figure 1(c) (which is the OCR result of figure 1(b) using Tesseract [2]).

The specific challenges with respect to classification are:
i) A high intra-class complexity: documents that appear similar in structure and layout but belong to different classes, as can be seen in figure 2(a).
ii) Labeled and non-labeled class introductions: as illustrated by figure 2(b), a class name is sometimes introduced by the keyword Benennung and sometimes not, even within the same class.

The specific challenges with respect to IE are:
i) A complex and nested layout consisting of multiple tables, as in figures 2(a) to 3.
ii) Non-unique labeling: the same label is used multiple times in one image, as visualized by figure 3.
iii) Different reading orders within the same document: as shown in figure 3, a value is sometimes above, below, behind, or in front of its corresponding label, and some tables have to be read bottom-up.
iv) Different field sets for the same class, due to intra-class complexity.

State-of-the-art approaches for document classification, more precisely general document classification and table type classification, are:
i) Image-based approaches [3], [4]. These suffer from the high intra-class complexity.
ii) Keywords-based approaches [4], which have problems with labeled and non-labeled class introductions.
iii) Text-based approaches [3], [4]. These have problems with the different resolutions and the bad OCR results.
iv) Deep learning methods, such as a combination of region-based and holistic analysis via a convolutional neural network (CNN) [5].
However, no neural network approach will work here, because the set of documents is too small and a proper ground truth (GT) is lacking. (More details on the latter are provided later in this paper.) Given the common and the classification-specific challenges of the DB technical document scenario described above, none of these individual approaches can work well for document classification.

For the information extraction, table information extraction approaches as well as general IE methods are suitable, especially approaches for semi-structured documents: because most fields are introduced by labels, the given documents are not considered unstructured. However, field positions may vary from document to document even within the same class, so the documents cannot be considered fully structured either. State-of-the-art methods here are:
i) Regular expression (RegEx)-based approaches [6], [7]: since much of the needed information is introduced or followed by a label, RegEx-based matching seems promising. Still, the non-unique labeling, the different reading orders, and the different field sets will cause trouble for any RegEx method.
ii) Region of interest (ROI)-based approaches [8], [9]: because the documents consist of tabular structures, these could help guide the IE process to the right locations. The different label sets and the non-unique naming will, on the other hand, cause problems for ROI-based approaches.
iii) A hybrid ontology-based IE, which has recently been proposed and shows promising results [10]. Even though it worked for a small number of sentences, i.e. mostly four sentences, it is not applicable to the scenario given here. The problem is that in this scenario there are, if anything, only labels and no complete sentences.

As before, none of these individual state-of-the-art IE approaches is suitable for the DB scenario, because of the mentioned challenges.

Fig. 1. Common challenges for document images with complex, nested tabular layout: (a) low resolution image, (b) high resolution image, (c) OCR result of figure 1(b) using Tesseract.
Fig. 2. Challenges for the document type classification: (a) similar layout and structure, but different classes; (b) labeled and non-labeled class introduction.
Fig. 3. Challenges for the information extraction: non-unique labeling and different reading orders.

In this paper, hybrid approaches combining image layout and OCRed text are proposed for both classification and information extraction of document images with complex and nested tabular structures. These techniques overcome the limitations of existing state-of-the-art approaches and also outperform competing approaches in the DB competition from 2016.

The rest of the paper is organized as follows: Section II explains the proposed methods in detail. Section III presents the approaches of other research groups and companies that also participated in the competition. Section IV contains a performance evaluation of all methods named here, including the competitors', the proposed, and state-of-the-art approaches, followed by a discussion. Finally, section V summarizes the results and draws a conclusion.

II. THE PROPOSED CLASSIFICATION AND INFORMATION EXTRACTION METHODS FOR COMPLEX TABULAR STRUCTURES IN IMAGES

In this section, novel document classification and information extraction techniques, using both document layout and OCRed text in a hybrid manner, are described in detail.

A. The Proposed Hybrid Classification Approach

The proposed document classification method uses document layout and OCRed text information in an interleaved fashion. The procedure is as follows:
1) Increase the image contrast to improve the OCRed text.
2) Extract the biggest rectangles (at most five) from the image using connected components (CC) analysis. The rationale behind this operation is that the class of the document is most probably located within one of these rectangles. If this is not the case, further processing is done as discussed in the next steps.
3) As a first trial, get the OCRed text of the detected rectangles using Tesseract.
Then find a good match between any string from the generated OCR text and any class name in the list of all class names. Matching uses a threshold, which was experimentally set to 0.6. If more than one good match is found, the one with the highest matching ratio is selected. The process of finding a good match works as follows: First, search for the keyword Benennung, since for some images it is followed by the class name. If it is found, the string following it is extracted and compared with the provided list of classes, subject to the given threshold. Looking only for this keyword is not effective enough, however. Otherwise, go through the OCRed text line by line: whenever a good match is found, store its matching ratio, so that the match with the highest ratio can be chosen at the end. It is important to mention that some class names are prefixes of others. Additionally, in some cases the class name is split over two or more lines. The presented technique takes these observations into account by continuously searching for the longest possible available class name (a.k.a. greedy matching).
4) As a second trial, in case no good match was found, apply image denoising before the OCR to produce better OCRed text. Afterwards, a good match (as defined above) is searched for between strings from the OCRed text and the given class names, but with a threshold of [...].
5) Finally, as a last trial, get the OCR of the whole image with increased contrast and again search for a good match (as defined above) with a threshold of 0.5. Otherwise, the image is assigned to the class Unknown, which is actually one of the possible classes.

B. The Proposed Hybrid Information Extraction Approach

The proposed information extraction method is based on the combination of relative ROIs and RegEx.
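Before turning to the extraction steps, the threshold-based class-name matching of Section II-A can be illustrated with a minimal sketch. It is not the authors' implementation: the source does not name the matching ratio, so Python's `difflib.SequenceMatcher` is an assumed stand-in, and the helper names (`ratio`, `spans`, `pick`, `best_match`, `classify`), the `max_lines` limit, and the example class names are all hypothetical. OCR itself is left out; the trials are passed in as already OCRed text together with their thresholds.

```python
from difflib import SequenceMatcher

KEYWORD = "Benennung"  # label that sometimes introduces the class name


def ratio(a, b):
    # Assumption: difflib's ratio stands in for the paper's unnamed matching ratio.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def spans(lines, max_lines=3):
    # Single lines plus joins of consecutive lines, so that class names split
    # over several OCR lines can still be matched (max_lines is an assumption).
    for i in range(len(lines)):
        for j in range(i + 1, min(i + max_lines, len(lines)) + 1):
            yield " ".join(lines[i:j])


def pick(strings, class_names, threshold):
    # Best (name, ratio) at or above the threshold. Ties go to the longer
    # class name, which covers class names that are prefixes of others.
    best = None
    for s in strings:
        for name in class_names:
            r = ratio(s, name)
            if r >= threshold and (
                best is None or (r, len(name)) > (best[1], len(best[0]))
            ):
                best = (name, r)
    return best


def best_match(ocr_text, class_names, threshold):
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    # First try the strings that directly follow the keyword.
    tails = []
    for ln in lines:
        pos = ln.lower().find(KEYWORD.lower())
        if pos >= 0:
            tail = ln[pos + len(KEYWORD):].strip(" :")
            if tail:
                tails.append(tail)
    # Otherwise scan the text line by line (including joined lines).
    return pick(tails, class_names, threshold) or pick(spans(lines), class_names, threshold)


def classify(trials, class_names):
    # trials: (ocr_text, threshold) pairs in the order they should be attempted.
    for ocr_text, threshold in trials:
        match = best_match(ocr_text, class_names, threshold)
        if match:
            return match[0]
    return "Unknown"
```

A caller would pass one `(ocr_text, threshold)` pair per trial, for example the rectangle OCR with 0.6 first and the whole-image OCR with 0.5 last; the threshold of the denoised second trial has to be supplied by the caller, since its value is not given here.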
The steps are described as follows:
1) Detect all non-empty rectangles in the contrast-enhanced image using a CC-based approach, and extract them in reading order. The coordinates of the top left corner of each rectangle are saved, in order to find its surrounding rectangles and to compare its location with that of other rectangles.
2) Get the OCRed text for each rectangle using Tesseract.
3) Describe an extraction method for each of the desired information fields, using a set of RegEx and four functions that return all rectangles located to the left, right, top, and bottom of a specific rectangle.
4) Finally, extract all fields by applying the defined extraction methods to the detected rectangles and their OCRed text.

III. COMPETING APPROACHES

The DB competition for technical document classification and information extraction was an open challenge. Including us, ten different companies and research groups from all around Germany participated in the competition. We are thankful to the following two participants, one from the commercial side (team Hybrid) and one from the research side (team Awesome), who provided us with a brief description of their solutions. In section IV, we compare our solution (the methods proposed in this paper) with those of these participants. As a quick summary, our solution took 1st place, and the teams Hybrid and Awesome took 3rd and 5th place, respectively.

A. Kofax-based Solution

The DB Systel company [11] participated in the DB competition under the team name Hybrid. Their system is based on the commercial Kofax [12] Transformation products. The usual task of these products is to digitize formal documents such as bills, forms, and delivery notes using predefined functionality, including features like keyword databases, RegEx, classification, and even trainable classifiers.
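For comparison with the solutions described in this section, the combination of relative ROIs and RegEx from Section II-B can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the `Rect` type, the pixel tolerance `TOL`, the way "left", "right", "top", and "bottom" are decided from top-left corners alone, and the example labels and patterns are all hypothetical. Rectangle detection and OCR are assumed to have happened already.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: int      # top-left corner, in pixels
    y: int
    text: str   # OCRed content of the rectangle


TOL = 10  # assumed tolerance for "same row" / "same column"


def right_of(r, rects):
    # Rectangles in the same row to the right, nearest first.
    return sorted((o for o in rects if abs(o.y - r.y) <= TOL and o.x > r.x), key=lambda o: o.x)


def left_of(r, rects):
    return sorted((o for o in rects if abs(o.y - r.y) <= TOL and o.x < r.x), key=lambda o: -o.x)


def below(r, rects):
    # Rectangles in the same column underneath, nearest first.
    return sorted((o for o in rects if abs(o.x - r.x) <= TOL and o.y > r.y), key=lambda o: o.y)


def above(r, rects):
    return sorted((o for o in rects if abs(o.x - r.x) <= TOL and o.y < r.y), key=lambda o: -o.y)


DIRECTIONS = {"left": left_of, "right": right_of, "top": above, "bottom": below}


def extract(rects, label_re, value_re, directions):
    # For every rectangle whose text matches the label pattern, look in the
    # given directions (in order) for the nearest rectangle whose text
    # matches the value pattern, and return that value.
    for r in rects:
        if not re.search(label_re, r.text):
            continue
        for d in directions:
            for o in DIRECTIONS[d](r, rects):
                m = re.search(value_re, o.text)
                if m:
                    return m.group(0)
    return None


# Hypothetical layout: the label "Datum" occurs twice, with different reading orders.
rects = [
    Rect(0, 0, "Datum"), Rect(100, 0, "12.03.1998"),
    Rect(0, 40, "4711-0815"), Rect(100, 40, "07.11.2001"),
    Rect(0, 80, "Zeichnungs-Nr."), Rect(100, 80, "Datum"),
]
DATE = r"\d{2}\.\d{2}\.\d{4}"
first_date = extract(rects, r"Datum", DATE, ("right",))
second_date = extract(rects, r"Datum", DATE, ("top",))
drawing_no = extract(rects, r"Zeichnungs-Nr", r"\d{4}-\d{4}", ("top",))
```

In this sketch a non-unique label is disambiguated per field by the direction list and the value pattern, which is one plausible reading of how a per-field extraction method could cope with differing reading orders.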
DB Systel's solution for the DB competition is as follows: The Kofax image classifier is used to create a layout-based classification of the documents during a preprocessing step. Based on this information, specific classifiers are trained for the ten most represented classes. There is mostly one classifier, which takes the layout relations of the information fields into account. Several scripts are also implemented to extend the general functionality of Kofax Transformation with more task-specific functions, such as OCR substitution. Additionally, all documents that could not be classified into the ten most represented classes are combined into one abstract class called "no match". To process the documents that belong to the "no match" class, trainable classifiers are used because of the layout differences in these documents.

B. OCR- and Image Processing-based Solution

The Computing Services (ZEDAT) [13], a research group of Freie Universität Berlin, also participated in the competition, under the team name Awesome. For document classification, Tesseract-based OCR is applied to the input document images after adding all the given class names to the Tesseract dictionary. Then, each text line is compared with the list of given class names using the Hamming distance. The class name corresponding to the best matching string is recorded as the class name of the input document. For the IE, the document is first split into individual cells using Canny edges and Hough lines as separators. Afterwards, the Tesseract OCR engine is applied to each cell independently. Then IE is performed for each field in multiple steps. First, each cell is checked on its own: it is searched for a specific keyword of each field, and an attempt is made to find the corresponding field value. Since this attempt worked for only 10% of the fields, a second step finds fields without any specific keywords. To this end, the structure of the unmatched fields is analyzed and several simple RegEx are defined to find the field values.
As a third step, the surrounding cells are checked for supporting keywords. Additionally, the different positions of the field value need to be taken into account for each field. By combining these three steps, the different fields and their values are detected.

IV. PERFORMANCE EVALUATION & DISCUSSION

This section contains a performance evaluation and its interpretation, for both classification and information extraction. It contains a comparison within the DB competition, i.e. between the proposed approaches and the competitors' approaches, and a comparison with state-of-the-art approaches.

A. The DB Datasets

The organizer of the competition, i.e. Deutsche Bahn AG, provided 909 sample documents, a list of 250 different classes, and a set of 55 fields, where each document belongs to a single class and contains a subset of the provided fields. No classification GT was provided for those samples; however, the organizer provided the field values for a subset of 46 samples as a GT for the IE. The data set consisting of 909 samples will from now on be referred to as the DB-909-dataset. It is used by the organizers for comparing the results of all participants. Additionally, for comparing the results of the proposed methods with the state-of-the-art approaches (which were not included in the competition), the GT for document classification was manually created for the DB-909-dataset. For the IE, the provided 46 samples with their field values are used. This subset of samples will from now on be referred to as the DB-46-dataset.

B. The DB Competition

DB asked the participants to upload their solution files, i.e. the files containing the classification assignment and the IE results. To avoid problem-agnostic solutions, the organizers computed the accuracies themselves. The organizer's computed results for classification, information extraction, and a combination of both as a final score are shown in table I.
The proposed methods took 2nd place in the classification (by a small margin behind the 1st place holder), 1st place for the IE, and 1st place in the final rank. This shows how promising the proposed solutions are for this area.

TABLE I
PERFORMANCE COMPARISON (CLASSIFICATION, INFORMATION EXTRACTION, AND FINAL RANK) BETWEEN COMPETITION PARTICIPANTS, USING THE DB-909-dataset, COMPUTED BY THE COMPETITION ORGANIZERS (I.E. THE DB).

Solution     Classification          Information extraction    Final rank
             Accuracy    Rank        Accuracy    Rank
proposed     [...] %     2nd*        [...] %     1st**          1st
DB Systel    [...] %     4th         [...] %     4th            3rd
ZEDAT        [...] %     7th         [...] %     3rd            5th
* 1st of classification: [...] %
** 2nd of IE: [...] %

C. State-of-the-art comparison

For the classification task, the proposed method was compared to an image-based classifier using the scale-invariant feature transform (SIFT), a keywords-based approach using a bag-of-words, a text-based approach using Logistic Regression and binary feature weights, and a voting-based approach. The latter combined the results of the three previously mentioned methods through a majority vote. In case of a conflict, the output of the text-based method was used, because the text-based classifier achieved the best accuracy among those approaches. The comparison was run on the DB-909-dataset in all five cases. As can be seen in table II, the proposed classification approach achieves an accuracy of [...] %. Hence it outperforms its classical competitors by more than doubling their performance.

TABLE II
CLASSIFICATION PERFORMANCE COMPARISON OF THE PRESENTED METHOD WITH RESPECT TO STATE-OF-THE-ART CLASSIFICATION APPROACHES OVER THE DB-909-dataset, WHICH IS COMPUTED BY THE AUTHORS OF THIS PAPER.

Method      image-based    keywords-based    text-based    voting     proposed approach
Accuracy    [...] %        [...] %           [...] %       [...] %    [...] %

For the IE, the proposed approach was compared to a RegEx-based and an ROI-based method. All methods were executed on the DB-46-dataset. The classification run that is required beforehand in order to run the IE, performed with the text-based classifier, achieved an accuracy of [...] %. Due to that fairly low accuracy, the IE was executed twice for each of the RegEx- and ROI-based approaches: First, it was run directly after the classification finished, without adapting the results manually. This is referred to as Auto. in table III. Second, the classification results were manually corrected before the IE was run. This is referred to as Manual in the table. As the proposed approach performs the IE without considering the sample's class, this distinction was not made for this method. The comparison between the GT and the respective approach was done using the Levenshtein distance over the overall result set. As table III illustrates, the proposed method is, with an accuracy of [...] %, the most accurate approach.

TABLE III
INFORMATION EXTRACTION PERFORMANCE COMPARISON OF THE PRESENTED METHOD WITH RESPECT TO STATE-OF-THE-ART IE APPROACHES OVER THE DB-46-dataset, WHICH IS COMPUTED BY THE AUTHORS OF THIS PAPER.

Method      RegEx (Auto.)    RegEx (Manual)    ROI (Auto.)    ROI (Manual)    proposed approach
Accuracy    [...] %          [...] %           [...] %        [...] %         [...] %

V. CONCLUSION

This paper presented classification and information extraction (IE) methods for understanding technical document images containing complex and nested tabular structures. Both of the proposed techniques follow a hybrid approach: the document classification uses both image layout and OCRed text, and the information extraction from nested and complex tabular structures uses regions of interest (ROIs) and regular expressions (RegEx).
The proposed methods participated in an open challenge organized by Deutsche Bahn AG (DB) for classification and IE on their technical documents dataset of around 900 samples. These samples come from 250 different document classes, and each sample contains a subset of 55 different fields. Altogether, ten different commercial companies and research groups participated in this competition. According to the results, which were computed by the organizers, our proposed approaches won the competition and outperformed all nine other participating methods. Additionally, after the competition, we compared our classification and IE approaches with widely used state-of-the-art classification and IE techniques, respectively, that were not included in the competition. In this case too, our presented methods outperformed all the state-of-the-art approaches.

As the performance evaluation of the IE shows, even though our IE method achieved the best results compared to other participants and state-of-the-art approaches, the results are still not good enough. Extracting information from nested and complex tabular structures, like the ones handled in this paper, is a very challenging task that cannot be solved efficiently in a fully automated manner. Therefore, our goal for the near future is to introduce a semi-automatic IE technique based on our proposed solution, which interactively involves a user in the loop for correction and then learns the correction behavior in order to successively automate the correction.

ACKNOWLEDGMENT

To be added after review.

REFERENCES

[1] Deutsche Bahn AG. (2016) DB Digital Header:challenge. [Online]. Available: digital-headerchallenge.html
[2] tesseract ocr. (2017) Tesseract Open Source OCR Engine. GitHub. [Online]. Available:
[3] S. Bukhari and A. Dengel, "Visual appearance based document classification methods: Performance evaluation and benchmarking," in th International Conference on Document Analysis and Recognition (ICDAR), Aug 2015, pp
[4] N. Chen and D. Blostein, "A survey of document image classification: problem statement, classifier architecture and performance evaluation," vol. 10, pp. 1 16,
[5] A. W. Harley, A. Ufkes, and K. G. Derpanis, "Evaluation of deep convolutional nets for document image classification and retrieval," Feb
[6] A. Bartoli, A. D. Lorenzo, E. Medvet, and F. Tarlao, "Inference of regular expressions for text extraction from examples," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp , May
[7] R. Fagin, B. Kimelfeld, F. Reiss, and S. Vansummeren, "A relational framework for information extraction," SIGMOD Rec., vol. 44, no. 4, pp. 5 16, May [Online]. Available: http: //doi.acm.org/ /
[8] R. Girshick, "Fast r-cnn," in International Conference on Computer Vision (ICCV),
[9] T. Gogar, O. Hubacek, and J. Sedivy, "Deep neural networks for web page information extraction," in 12th IFIP WG 12.5 International Conference and Workshops, AIAI 2016, Thessaloniki, Greece, September 16-18, 2016, Proceedings. Springer, Cham, 2016, pp
[10] F. Gutierrez, D. Dou, S. Fickas, D. Wimalasuriya, and H. Zong, "A hybrid ontology-based information extraction system,"
[11] DB Systel GmbH.
[12] Kofax Inc.
[13] Die Zentraleinrichtung für Datenverarbeitung (ZEDAT), Freie Universität Berlin.
