Identifying Table Boundaries in Digital Documents via Sparse Line Detection

Size: px

Start display at page:

Download "Identifying Table Boundaries in Digital Documents via Sparse Line Detection"

Lee Walsh
5 years ago
Views:

1 Identfyng Table Boundares n Dgtal Documents va Sparse Lne Detecton Yng Lu, Prasenjt Mtra, C. Lee Gles College of Informaton Scences and Technology The Pennsylvana State Unversty Unversty Park, PA, USA, ylu@st.psu.edu, pmtra@st.psu.edu, gles@st.psu.edu ABSTRACT Most pror work on nformaton extracton has focused on extractng nformaton from text n dgtal documents. However, often, the most mportant nformaton beng reported n an artcle s presented n tabular form n a dgtal document. If the data reported n tables can be extracted and stored n a database, the data can be quered and joned wth other data usng database management systems. In order to prepare the data source for table search, accurately detectng the table boundary plays a crucal role for the later table structure decomposton. Table boundary detecton and content extracton s a challengng problem because tabular formats are not standardzed across all documents. In ths paper, we propose a smple but effectve preprocessng method to mprove the table boundary detecton performance by consderng the sparse-lne property of table rows. Our method easly smplfes the table boundary detecton problem nto the sparse lne analyss problem wth much less nose. We desgn eght lne label types and apply two machne learnng technques, Condtonal Random Feld (CRF) and Support Vector Machnes (SVM), on the table boundary detecton feld. The expermental results not only compare the performances between the machne learnng methods and the heurstcal-based method, but also demonstrate the effectveness of the sparse lne analyss n the table boundary detecton. Categores and Subject Descrptors H.3 [INFORMATION STORAGE AND RETRIEVAL]; H.3.7 [Dgtal Lbrares] General Terms Algorthms, Performance Keywords Table Boundary Detecton, Sparse Lne Property, Table Labelng, Condtonal Random Feld, Support Vector Machne, Table Data Collecton Permsson to make dgtal or hard copes of all or part of ths work for personal or classroom use s granted wthout fee provded that copes are not made or dstrbuted for proft or commercal advantage and that copes bear ths notce and the full ctaton on the frst page. To copy otherwse, to republsh, to post on servers or to redstrbute to lsts, requres pror specfc permsson and/or a fee. CIKM 08, October 26 30, 2008, Napa Valley, Calforna, USA. Copyrght 2008 ACM /08/10...$ INTRODUCTION Table, as a specfc document component, s wdely used n web pages, scentfc documents, fnancal reports, etc. Researchers always use tables to concsely dsplay the latest expermental results or statstcal fnancal data n a condensed fashon. Other researchers, for example, who are conductng the emprcal studes n the same topc, can quckly obtan valuable nsghts va examnng these tables. Along wth the rapd expanson of the Internet, tables become a valuable nformaton source n the nformaton retreval feld. Based on the ncreasng demands to unlock the nformaton n tables, more applcatons appear n the table-related felds, e.g., the table search [12]. Although approaches on table analyss are dverse, they share two same analyzng steps: table boundary detecton and table structure decomposton. For the further table content storage and sharng (e.g., the table data extracton and the table search), locatng the table boundary n a document s the frst and crucal step. Dfferent from most table detecton works, whch are the pre-defned layout based and the rule-based methods, we apply machne learnng technques on the table boundary detecton feld n ths paper. Pre-defned layout based algorthms usually work well for one doman, but are dffcult to extend. For the rule-based methods, the performance s always heavly affected by the qualty of the rules. When the testng data set s large enough, t s dffcult to determne the good values for thresholds. Machne learnng methods aregoodchocestodealwthsuchproblems. Wangetal. [23] appled the decson tree and Support Vector Machne (SVM) technques to classfy the web tables nto genune tables and non-genune tables. However, ther work starts wth the dentfed tables wthout any detal about how to detect the table boundary n a document page. Davd et al. [19] compared the expermental results of extractng the tables from plan-text government statstcal reports usng Condtonal Random Felds (CRF) and Hdden Markov Models (HMM) respectvely. Unfortunately, many adopted lne labels (e.g., separator) n [19] are too specfc to be applcable n other document medum (e.g., HTML, PDF). In general, table boundary detecton problem can be transformed nto the problem of dentfyng the table lnes, whch consttute the table boundares. By the observaton of tables wth dverse layouts from dfferent documents, we dentfy that all the table lnes share an mportant property: majorty lnes belongng to the table areas are sparse n terms of the text densty. Exstng flter-out based table-lne dscoverng methods dentfy the table lnes from the entre set of lnes of a document accordng to certan rules, whch n 1311

2 turn results n low recall. However, for applcatons such as table search [12], recall s more mportant than precson because once the false negatve table lnes are removed, t s dffcult to retreve them back. However, the false postve rate can be easly lowered n later table structure decomposton step. In ths paper, we propose a novel but effectve method to quckly locate the boundary of a table by takng advantage of the aforementoned property. We also propose an exclusve based method for dentfyng table lnes, whch generates hgh recall and saves substantal effort to analyze the nosy lnes. To our best knowledge, there are no proposed works that compare the machne learnng based methods wth the rule-based methods on the table boundary detecton. In ths paper, we apply two machne learnng methods (CRF and SVM) on the table boundary detecton Furthermore, we elaborate the feature selecton, analyze the factor effects of dfferent features, and compare the performance of CRF/SVM approaches wth our proposed rulebased method. Instead of extractng tables from the HTML documents [23] or the plan-text documents [19], we focus on Portable Document Format (PDF). PDF s a wdely used document format n dgtal lbrares because t can preserve the appearance of the orgnal document. Although a good number of researches have been done to dscover the document layout by convertng the PDFs to other types of fles (e.g., mage, html, text) n the past two decades, automatcally dentfyng the document logcal structures nformaton (e.g., words, text lnes, paragraphs, etc) and extractng the document components (e.g., fgures, tables, mathematcal formulas, etc) as well as the content [2] are stll a challengng problem. The major reasons are as follows: 1) the structural nformaton s not explctly marked up because of the un-tagged nature of PDF format; 2) the text sequences are often messly generated by the exstng PDF-to-text tools; 3) new noses can be generated by some necessary tools (e.g., OCR), f convertng the PDFs nto other meda (e.g., mage). The rest of the paper s organzed as follows. Secton 2 revews several relevant studes n table boundary detecton area and the appled machne learnng methods n ths feld. Secton 3 ntroduces the sparse-lne property of the table lnes. Secton 4 descrbes n detal the sparse lne detecton and the nose lne removng usng the condtonal random feld and support vector machne (SVM) technques. We elaborate the label types and the feature sets. Secton 5 explans the lne constructon before the lne labelng. Secton 6 explans how to locate the table boundary based on the labeled lnes as well as the table keywords. The detaled expermental results are dsplayed n Secton 7. We conclude our paper wth plans for future work n Secton RELATED WORKS 2.1 Related works on Table Detecton Researchers n the automatc table extracton feld largely focus on analyzng the table structure n a specfc document meda. Chen et al. [3] used heurstc rules and cell smlartes to dentfy tables. They tested ther table detecton algorthm on 918 tables from arlne nformaton web pages and acheved an F-measure of 86.50%. Penn et al. [18] proposed a set of rules for dentfyng genunely tabular nformaton and news lnks n HTML documents. They tested ther algorthm on 75 web ste front-pages and acheved an F-measure of 88.05%. Yoshda et al. proposed a method to ntegrate WWW tables accordng to the category of objects presented n each table [27]. Ther data set contans 35,232 table tags gathered from the web. They estmated ther algorthm parameters usng all table data and then evaluated algorthm accuracy on 175 of the tables. The average F-measure reported n ther paper s 82.65%. Zanbb [28] provdes a survey wth detaled descrpton of each method. All the methods can be dvded nto three categores: pre-defned layout based [22], heurstcs based [17, 6, 9, 8], and statstcal based. Pre-defned layout based algorthms usually work well for one doman, but s dffcult to extend. Heurstcs based methods need a complex postprocessng and the performance reles largely on the choce of features and the qualty of tranng data. Most approaches descrbed so far utlze purely geometrc features (e.g. pxel dstrbuton, lne-art, whte streams) to determne the logcal structure of the table, and dfferent document medums requre dfferent process methodologes: OCR [19], X-Y cut [4], tag classfcaton and keyword searchng [10][3][24] etc. In the past two decades, a good number of researches have been done to dscover the document layout by convertng the PDFs to mage fles. However, the mage analyss step can ntroduce nose (e.g., some text may not be recognzed or some mages may not be correctly recognzed). In addton,because of the lmted nformaton n the btmap mages, most of them only work on some specfc document types wth mnmal object overlap: e.g., busness letters, techncal journals, and newspapers. Some researchers combne the tradtonal layout analyss on mages wth low-level content extracted from the PDF fle. Even f the verson 6 of PDF allows a user to create a fle contanng structure nformaton, most of them do not contan such nformaton. Chao et al. [2] reported ther work on extract the layout and content from PDF documents. Hadjar et al. have developed a tool for extractng the structures from PDF documents. They beleve that, to dscover the logcal components of a document, all/most of the page objects need to be analyzed such as text objects, mage objects, path objects, etc, whch are lsted by PDF document content stream. However, the object overlappng problem happens frequently. If all the objects are analyzed, more effort needs to be spent to frstly segment these objects from each other. In addton, even such objects/structures are dentfed, they are stll too hgh level to fulfll many specal goals, e.g., detectng the tables, fgures, mathematcal formulas, footnotes, references, etc. Instead of convertng the PDF documents nto other types of meda (e.g., mage or HTML) and then applyng the exstng technques, we process PDF documents drectly from the text level. 2.2 Related works on table analyss wth machne learnng approaches Several machne learnng approaches are appled n the table analyss feld, e.g., decson tree [20], Nave Bayes classfer [29], Support Vector Machne (SVM) [1], Condtonal random felds (CRF) [11]etc. Hurst mentoned n [5] that a Nave Bayes classfer algorthm produced adequate results but no detaled algorthm and expermental nformaton was provded. Wang et. al. tred both the decson tree classfer and SVM to classfy each gven table entty as ether genune or non-genune table based on features from layout, content type, and word group perspectves. Decson tree 1312

3 learnng s one of the most wdely used and practcal methods for nductve nference. It s a method for approxmatng dscrete-valued functons that s robust to nosy data. Comparng wth our work, they started wth the detected tables and all features are related to the table tself (e.g., the number of columns). How to detect these tables, the key problem of our paper, s mssng n ther work. Condtonal random felds (CRFs) s ntally ntroduced by Lafferty et al. [11] n 2001 as a framework for buldng probablstc models to segment and label sequence data. After the brth, CRF s appled n bo-nformatcs, computatonal lngustcs and speech recognton felds. Condtonal Random Felds (CRF) have been shown to be useful n partof-speech taggng [11], shallow parsng [21], named entty recognton for newswre data [15], as well as table detecton [19]. To the best of our knowledge, Pnto et al. [19] dd the most related work as we dd. Comparng wth our work, the dfference spans the followng areas: 1) they extract table from a more specfc document type plan-text government statstcal reports; 2) because of the specfc document nature, they adopted several specal labels and correspondng features (e.g., BLANKLINE label and SEPARATOR features), whch are not applcable for other document types; 3) ther features focus on whte space, text, and separator nstead of of the coordnate features, whch are mportant for the table structure decomposton; 4) although they clamed that ther paper concentrated on locatng the table and dentfyng the row postons and types, they dd provde the detal about the table locatng. In order to mprove the performance table data extracton, we zoom n the table boundary detecton problem and elaborate the feature selecton n our paper. Moreover, we consder the coordnate features, whch not only play a crucal role n the table boundary feld, but also are unavodable n the later cell segmentaton phase. Dfferent from most CRF applcatons, the nput data s a document lne nstead of a word. Red rectangle: Sparse lnes Outsde rectangle: Non-sparse lnes Lne label: Capton Lne label: Capton Sparse lnes wthout label: OTHERSPARSE Lne label: Headngs Lne label: Headngs Lne label: Headers/ Footers Fgure 1: The sparse lnes n a PDF page 3. THE SPARSE-LINE PROPERTY OF TABLES Tables present structural data and relatonal nformaton n a two-dmensonal format and n a condensed fashon. Scentfc researchers always use tables to concsely dsplay ther latest expermental results or statstcal data. Other researchers can quckly obtan valuable nsghts by examnng and ctng tables. Tables have become an mportant nformaton source for nformaton retreval. The demand for locatng such nformaton (table search) s ncreasng. To successfully get the table data from a PDF document, detectng the boundary of the table s a crucal phase. Based on the observaton, we notce that dfferent lnes n the same document page have dfferent wdths, text denstes, and the szes of the nternal spaces between words. A document page contans at least one column. Many journals/conferences requre two (e.g., ACM and IEEE templates) or three even four columns. In a document, some lnes have the same length as the wdth of the document column, some are longer (e.g., cross over multple document columns) or shorter (e.g., the headng 1. INTRODUCTION n our paper) than a column. From the nternal space perspectve, the majorty of the lnes contan normal space szes between adjacent words whle some lnes have large spaces. In ths paper, we defne sparse lne as follows. Defnton 1. Sparse Lne: A document lne s a sparse lne f any of the followng condton s satsfed: 1). The mnmum space gap between a par of consecutve words wthn the lne s larger than a threshold sg. 2). The length of the lne s much shorter than a threshold ll; Snce the majorty of the lnes n a document belong to the non-sparse category, separatng the document lnes nto sparse/non-sparse categores accordng to the text nternal space/densty and then gettng rd of the non-sparse category become a frutful preprocessng step for the table boundary detecton. Such a method has two advantages: 1) the sparse lnes cover nearly the entre table content lnes; 2) Narrowng down the table boundary to the sparse lnes at the early stage can save substantal tme and effort to analyze nose lnes. There are tables whose cells can cross over multple table columns. In order to collect all such cells, method proposed n [26] sets up constrants on the number of such long cells wthn a table boundary. However, determnng a reasonable value s dffcult. For example, f the value s set up too tght, part of a table could be mssed out. If the value s loose, nose lnes wll be ncluded nto the table boundary. Unlke the approach n [26], our method treats the long cells as nonsparse lnes and remove temporally. To decde whether a sparse lne should be ncluded nto the same table boundary, we only need to check the vertcal space gaps between ths sparse lne and ts prevous neghbor sparse lne. Once we merge these two sparse lnes nto the same table boundary, the prevously temporally removed long lnes (f exsts any) between those two sparse lnes should be retreved back. Dfferent defntons of the much shorter than may generate dfferent sparse lne labelng results. We defne t as the half of the document column wdth. We show a snapshot of a PDF document page n Fgure 1 as an example. We hghlght the sparse lnes n red rectangles. Apparently, the table body content lnes are labeled as sparse lnes. Ten sparse lnes are not located wthn the table boundary: two headng lnes, one footer lne, three capton lnes, and four short lnes that are the last lne n a paragraph. We label them as sparse lnes because they satsfy the second condton. Snce such short-length lnes also happen n some table rows wth only one flled cell, we consder them as sparse lnes to avod mssng out the potental table lnes. Such nose non-table sparse lnes are very few because they usually only exst at the headngs or the last lne of a para- 1313

4 graph. In addton, the short length restrcton also reduce the frequency. We can easly get rd of them based on the coordnate nformaton later. 4. MACHINE LEARNING TECHNIQUES 4.1 Support Vector Machnes SVM [1] s a bnary classfcaton method whch fnds an optmal separatng hyperplane x : wx + b = 0 to maxmze the margn between two classes of tranng samples, whch s the dstance between the plus-plane x : wx + b =1 and the mnus-plane x : wx + b = 1. Thus, for separable noseless data, maxmzng the margn equals mnmzng the objectve functon w 2 subject to, wy(x + b) 1. In the noseless case, only the so-called support vectors, vectors closest to the optmal separatng hyperplane, are useful to determne the optmal separatng hyperplane. Unlke classfcaton methods where mnmzng loss functons on wrongly classfed samples are affected serously by mbalanced data, the decson hyperplane n SVM s not affected much. However, for nseparable nosy data, SVM mnmzes the objectve functon: w 2 + C n =1 ε subject to, wy (x + b) 1 ε,andε 0, where ε s the slack varable, whch measures the degree of msclassfcaton of a sample x. Ths nosy objectve functon has ncluded a loss functon that s affected by mbalanced data. In order to ncrease the mportance of recall n SVM, a cut-off classfcaton threshold value t<0should be selected. In methods wth outputs of class probablty [0, 1], then a threshold value t<0.5 should be chosen. As noted before, for noseless data, SVM s stable, but for nosy data, SVM s affected much by mbalanced support vectors. In our work, the latter approach s appled for SVM,.e., when t<0, recall s to be mproved but precson decreases. When t>0, a reverse change s expected. 4.2 Condtonal Random Felds Condtonal Random Felds (CRFs) are undrected statstcal graphcal models, whch are well suted to sequence analyss. The prmary advantage of CRFs over Hdden Markov Models (HMM) s ther condtonal nature, resultng n the relaxaton of the ndependence assumptons requred by HMMs n order to ensure tractable nference. Addtonally, CRFs avod the label bas problem, a weakness exhbted by Maxmum Entropy Markov Models (MEMMs) and other condtonal Markov models based on drected graphcal models. Let o =< o 1,o 2,..., o n > be an sequence of observed nput data sequence, for example n our case as a sequence of nput lnes of text n a PDF document page. Let S be a set of states n a fnte state machne, each correspondng to a label l L (e.g., sparse lne, non-sparse lne, headng lne, etc.) Let s =< s 1,s 2,..., s n > be the sequence of states n S that correspond to the labels assgned to the lnes n the nput sequence o. Lnear-chan CRFs defne the condtonal probablty of a state sequence gven an nput sequence to be: P (s o) = 1 n m exp( λ jf j(s 1,s, o,)) (1) Z o =1 j=1 where Z o s a normalzaton factor of all state sequences, the sum of the scores of all possble state sequences. f j(s 1, s, o,) s one arbtrary feature functon of m functons that descrbes a feature over ts arguments, and λ j s a learned weght for each such feature functon. Z o = n exp( λ jf j(s 1,s, o,)) (2) s S =1 j Intutvely, the learned feature weght λ j for each feature f j should be postve for features that are correlated wth the target label, negatve for features that ant-correlated wth the label, and near zero for relatvely unnformatve features. These weghts set to maxmze the condtonal log lkelhood of labeled sequences n a tranng set D = <o,l > (1),..., < o, l > (n) : LL(D) = n m λ 2 j log(p (l () o () ) (3) 2σ 2 =1 j=1 When the tranng state sequence are fully labeled and unambguous, the objectve functon s convex, thus the model s guaranteed to fnd the optmal weght settngs n terms of LL(D). Once these settngs are found, the labelng for a new, unlabeled sequence can be done usng a modfed Vterb algorthm. We use a weght parameter θ to boost features correspondng to the true class durng the testng process. Smlar to the classfcaton threshold t n SVM, θ can tune the trade-off between recall and precson, and may be able to mprove the overall performance, snce the probablty of the true class ncreases. Durng the testng process, the sequence of labels s s determned by maxmzng the probablty model P (s o) = 1 Z o exp( n m =1 j=1 λjfj(s 1,s, o,,θs)), where f j(s 1,s, o,,θ s)= =1 o θs tj(s 1,s, o,), θs s a vector wth θ s = θ when s = true, orθ s =1whens = false, and λ j s the parameters learned whle tranng. 4.3 Lne Labels Dfferent from the tradtonal table boundary detecton works, we use an exclusve method to label all the potental table lnes. Fgure 2 shows the ncluson-relaton of the lne types n a document page. The sze of each block does not represent the rato of a lne type n the page. Each lne type corresponds to a label n the machne learnng methods. Non-Sparse Lnes Captons Headngs footnotes references Headers & footers formulas... Sparse lnes True table lnes Labeled Table lnes True negatve (recall) False Postve (precson) Fgure 2: Composton of a PDF page wth lne types We desgn a set of labels by examnng a large number of lnes n scentfc PDF documents. Each lne wll be ntally labeled as ether SPARSE or NONSPARSE. A lne labeled as NONSPARSE satsfes nether of the condtons 1314

5 n Secton 3. NONSPARSE lnes usually cover the followng document components: document ttle, abstract, paragraphs, etc. SPARSE lnes cover other specfc document components entrely/partally: tables, mathematcal formulas, texts n fgures, short headngs, afflatons, document headers and footers, and references, etc. Even though sparse lnes cover almost all table lnes, a few non-table lnes mngle n. Removng these nose lnes can facltate the table boundary detecton effcently. Therefore, for the labeled SPARSE lnes, we label them as the followng sx categores: CAPTIONSPARSE, HEADINGSPARSE, FO- OTNOTESPARSE, REFERENCESPARSE, HEADERFO- OTERSPARSE, andothersparse. CAPTIONSPARSE refers to a lne that s the frst lne of a table capton or a fgure capton. HEADINGSPARSE marks short document headngs. Usually the lnes labeled wth the HEAD- INGSPARSE or the CAPTIONSPARSE, orfootnotes- PARSE only satsfy the second condton mentoned n Secton 3. To label a lne wth these labels, addtonal features should be consdered. For HEADINGSPARSE lnes, the font sze and type are the key features. For CAPTION- SPARSE lnes, we should examne whether the lne starts wth the defned keywords or not (e.g., Table or Fgure). For FOOTNOTESPARSE lnes, the specfc startng symbol s the most mportant factor. To dentfy the HEAD- ERFOOTERSPARSE lnes, checkng the Y-axs coordnate s key. Although a large part of lnes wth these labels also exst n non-sparse lne group, we can easly zoom n the table boundary nto the last category OTHERSPARSE by removng such lnes from sparse lne set wth ths method. 4.4 Feature sets Wse choce of features s always vtal to the fnal results. The feature based statstcal model CRFs reduce the problems to fndng an approprate feature set. Ths secton outlnes the man features used n these experments. Overall, our features can be classfed nto three categores: the orthographc features, the lexcal features, and the document layout features. Instead of the features about whte space and separators n [19], we emphasze the layout features Orthographc features Most related works treat the vocabulary as the smplest and most obvous feature set. Such features defne how these nput data appear (e.g., captalzaton etc), based on regular expressons as well as prefxes and suffxes. Because the lne layout s much more mportant than ther appearance for our lne labelng problem, we do not have to consder so many orthographc features as they dd. Our orthographc features nclude: IntalCaptcal, AllCaptcal, FontSze, Font- Type, BoldOrNot, HasDot, HasDgtal, AllDgtal, etc Lexcal features The lexcal features ncludes: TableKwdBegnnng, FgureKwdBegnnng, ReferenceKwdBegnnng, AbstractKwdBegnnng, SpecalCharBegnnng, DgtalBegnnng, Superscrpt- Begnnng, SubscrptBegnnng, LneItself Layout features Our crucal features come from the layout perspectve. The layout features nclude: LneNumFromDocTop, LneNum- ToDocBottom, NumOfTextPeces, LneWdth, CharacterDensty, LargestSpaceInLne, LeftX, rghtx, MddleX, DsTo- PrevLne, DsToNextLne. Table 1 lsts the detaled descrpton for every layout feature. Table 1: The man document layout features n our experment Document Layout Features Descrpton LneNumF romdoct op Index number from the page top LneNumT odocbottom Index number to the page bottom NumOfT extp eces Number of text peces LneW dth Wdth of the lne CharacterDensty LneW dth/number of acters LargestSpaceInLne Largest space wthn the lne LeftX X-axs value of leftmost lne end EndX X-axs value of rghtmost lne end MddleX X-axs value of mddle pont T hedst op revlne Vertcal gap to the prevous lne T hedst onextlne Vertcal gap to the next lne Conjuncton features So far all the prevous features are descrbed over a sngle predcate. In order to capture relatonshps that a lnear combnaton of features cannot capture, we look at the conjuncton of features. CRFs provde the functon to determne the label of a lne by takng nto account nformaton from another lne. In our work, we set the wndow sze of features as -1,0,1to conjunct the current lne wth the prevous lne and the followng fle. In order to avod the overflowed memory and exacerbated overftng generated by the all possble conjunctons and only generate those feature conjunctons wth sgnfcant mprovement functons, we turn to feature nducton as descrbed n [14]. We start wth no feature and choose new features nteractvely. In each teraton, we evaluate some sets of canddates usng the Gaussan pror, and add the best ones nto the model. 5. LINE CONSTRUCTION IN PDFS Dfferent from most CRF applcatons, the unt of our problem s a document lne, nstead of a sngle word. Before classfyng the document lnes, we have to construct the lnes frst. To construct the document lnes, we deal wth the PDF source fle acter by acter as well as the related glyph nformaton through analyzng the text operators 1. Adobe s Acrobat word-fnder provdes the coordnate of the four corners of the quad(s) of the word. The PDFlb Text Extracton Toolkt (TET) also provdes the functon to extract the text n the dfferent levels (acter, word, lne, paragraph, etc.). However, t only provdes the content nstead of other style nformaton n all the levels except the acter level. If we want to do some further work, content tself s usually not enough. We have to calculate the correspondng coordnates for the hgher levels by mergng the acters. Smlar to Xpdf lbrary, we adopt a bottomup approach to reconstruct these acters nto words then lnes wth the ad of ther poston nformaton and saves the results. To convert acters nto words then lnes, we adopt some heurstcs based on the dstance between acters/words. For each document page, we construct lnes accordng to ther nternal word relatve poston nformaton and wdth. Wthn a same word, dfferent acters have the same font propertes. However, wthn a same lne, font dversty may exst among dfferent words (e.g., the superscrpt, the subscrpt, or mathematcal symbols). The man unque place of our method s that we only analyze the 1 PDF Reference Ffth Edton, Verson

6 coordnate nformaton. Font nformaton, the frequently adopted parameter, s not used n our method as the rule to decde whether merge the next word nto the same lne or not. Therefore, the font nformaton does not affect the performance of the sparse lne detecton. Table 2: The parameter thresholds we adopted for word reconstructon P Defnton the vertcal dstance between two top Y-axs values: alpha = Y +1 Y the vertcal dstance between two bottom Y-axs values: beta = Y +1 Y the horzontal dstance between these two acters: = X +1 X δ the vertcal dstance of two acters θ the maxmal wdth of the space wth a word η the maxmum vertcal dstance between two acters n a same lne Formally, we defne a document as a set of pages D = n k=1(p k ), where n s the total page number. Each page P k, can be denoted as an aggregaton of acters C {Character}. c and c +1 are a par adjacent (no other acter exsts between them) acters. Intally, we get the coordnate of the frst acter c 0 n a document page. All the acters n C share a common set of attrbutes {([X, X ], [Y,Y ],W,H,F,T)}, where [X, Y ] s the par of coordnators of the upper-left corner of the acter whle [X,Y ] s the coordnates of the bottom-rght corner of the acter. The orgnal pont of the X-Y axes s the left-bottom corner of a document page. W/H denotes the wdth/heght of the component, F s the font sze, and T s the text. Fgure 3 shows the coordnates of an example acter. not use the font sze because n many journals and archves, the font nformaton (the font type and the font sze) s not so standard as we maged. Consderng such unrelable nformaton wll ncur more error to the fnal results. Y -1 y y` x x` Fgure 3: The coordnates of a acter n a PDF document page. Snce the acter C s the fundamental component of a document, other components can be constructed recursvely from t. For example, a document page P k can be denoted as an aggregaton of words W = {w j w j =([X wj,y wj ]), ([X w j Y w j ]),W wj,h wj,f wj,t wj }. A document word w j s equal to m =1c,wherem s the total number of acters n the word w j. Fgure 4 enumerates all the relatve postons of a par of adjacent acters c and c +1. Ther coordnates are ([X,X ], [Y,Y ]) and ([X +1,X +1], [Y +1,Y +1) respectvely. For the word reconstructon, we defne several parameters and thresholds, whch are lsted n Table 2: Fgure 4 (a) presents a common acter par n the same lne. Y +1 = Y ( = 0) and Y +1 = Y ( = 0). If s smaller than a gven threshold θ, the second acter c X = =0 (a) (b) (c) (d) (e) (f) (g) (h) Fgure 4: The coordnates of the example cases of the acter pars can merge wth c nto a same word. Otherwse, we treat c as the last acter of the current word and c +1 as the startng acter of a new word. Fgure 4 (b) (e) dsplay several examples of the samelne acter neghbors wth partal vertcal overlaps. Superscrpt s a typcal case of Fgure 4 (b) whle subscrpt s a typcal case of Fgure 4 (c). Fgure 4 (d) and (e) show the font sze changng wthn a document lne. All these acter pars are also same-lne acters. Every case has to satsfy some fxed condtons as follows: Y +1 Y Y +1 Y (Fgure 4 (b)), Y Y +1 Y Y +1 (Fgure 4 (c)), Y +1 Y Y Y +1 (Fgure 4 (d)), and Y Y +1 Y +1 Y (Fgure 4 (e)). For these cases, we decde whether c and c +1 go nto the same word or not, by comparng wth the same threshold θ; To analyze the acter pars n Fgure 4 (f) (h), we ntroduce another threshold η: the maxmum vertcal dstance between two acters n a same document lne. In 4(f), the fxed constrant s δ =(Y +1 Y ) > 0. In Fgure 4(g) and (h), the fxed constrant s δ =(Y Y +1) > 0. If δ>η, C and C +1 wll belong to dfferent lnes. Otherwse, we treat them as the acter neghbors n a same lne and decde whether they go to the a same word. Startng a new document column n a page s a typcal example wth large for case f, and startng the next lne s a typcal example wth large but mnus for case h. Usng the Table 1 n Fgure1 as the example, we show the merged words n Fgure 5. Each red rectangle refers an ndependent word. Fgure 5: The merged words n a table after the acter word phase Now a document page p k can be denoted as an aggregaton of words W. Smlar to the acters, we can also

7 treat words as rectangle objects n a document page. The coordnate nature of acters n the prevous secton s also applcable to the words. For a par of word neghbors, w and w +1, the possble relatve locatons are same of the cases lsted n Fgure 4. Usng the concept n Secton 5.1, we should treat the word here as the acter there and treat the text pece here as the word there. We beleve that n the non-sparse lnes, all the words can be merged nto one pece. The parameters and the thresholds n Table 2 can be reused wth only the value resets of and θ. Stll usng the Table 1 n Fgure1 as the example, we show the merged lnes n Fgure 6. After the combnaton, we check the number of text peces n each lne along the Y-axs sequence, f the number s larger than one, we label ths lne as the sparse lne. If the number s one but t satsfy the frst condton n Secton 3, we also treat t as a sparse lne. Stll usng the Table 1 n Fgure1 as the example, we show the merged lnes n Fgure 6. For all the eght lnes, the number of text peces are 1, 1, 1, 1, 5, 5, 4, and 1 respectvely. We treat lne 5, 6, and 7 are sparse lnes because they contan more than one text pece. We also treat the lne 3 and 4 as sparse lnes because of the small wdth. Fgure 6: The merged lnes n a table 6. DETECTING THE TABLE BOUNDARY BASED ON THE KEYWORDS After the sparse lne detecton and nosy lne removal, we can easly detect the table boundary by combnng the lnes labeled as OTHERSPASE wth the table keywords. Here we defne the man table content rows as the table boundary, whch does not have to nclude the table capton and the footnote. In order to enhance the performance of the table startng locaton detecton, we consder the keyword nformaton. We can drectly detect the table boundary by detectng the tabular structure wthn the sparse lne areas. In our method, we defne a keyword lst, whch lsts all the possble startng keywords of table captons, such as Table, TABLE, Form, FORM, etc. Most tables have one of these keywords n ther captons. If more than one tables are dsplayed together, the keyword s very useful to separate the tables from one another. Once we detect a lne (not only the sparse lne) startng wth a keyword, we treat t as a table capton canddate. Then we check other lnes that are located around the capton and merge them nto a sparse area accordng to the vertcal dstances between adjacent lnes. Such sparse-lne areas are the detected table boundary. The vertcal dstance s the key feature to flter out most remanng nose lnes. Because the texts wthn the detected table boundary wll be analyzed carefully n the later table structure decomposton phase, we treat recall more mportant than precson here. Once we locate the table boundary, we check ths area and try to retreve the long table lnes that are labeled as non-sparse lnes to mprove the recall. 7. EXPERIMENTS AND RESULTS In ths secton, we demonstrate the expermental results of evaluatng our table boundary detecton wth two machne learnng methods. Our experments can be dvded nto four parts: the performance evaluaton of dfferent methods, dfferent feature settngs, dfferent datasets, and dfferent parameter settngs. 7.1 Data Set We focus on tables n PDF scentfc documents. Although Wang [25] tred to buld a general table ground truth database, he focused on the web tables. No benchmark dataset exsts n PDF table analyss feld. In our work, we drectly analyze PDF documents nstead of convertng them to HTML or Image fle. Instead of analyzng tables from a specfc doman, we am to collect tables as much dfferent varetes as possble from dgtal lbrares. The collecton of ths paper comes from dverse journals and proceedngs n three sources: chemcal scentfc dgtal lbrares (Royal Chemstry Socety 2 ), Cteseer 3, and archeology 4 n chemstry, computer scence and archeology felds. The sze of each PDF repostory we collected exceeds 100, 000, 10, 000 and 8, 000 respectvely n terms of scentfc papers. All the documents span the years 1950 to From these documents, we randomly choose 300 pages wth and wthout tables for our experments as the tranng set. Among these pages, we refer 100 pages from the chemstry feld as the dataset H, 100 pages from the computer scence feld as the dataset S, and 100 come from the archeology feld as the dataset A. The total number of the lnes n three datasets are 10177, 13151, and 9741 respectvely. For every document lne, we manually dentfy t wth a label defned n secton 5.3. In order to get an accurate and robust evaluaton on the table boundary detecton performance, we adopt a hold-out method by randomly dvdng the dataset nto fve parts and n each round we tran four of the fve parts and tested on the remanng one part. The fnal overall performance comes from the combned fve results. In our experment, we use the java-mplemented, frst-order CRF mplementaton Mallet to tran two versonsofthecrfwthbnaryfeaturesandtheactualvalues. For SVM, we adopt SVM lght [7]. We dvde the table boundary detecton problem nto four man sub-problems as follows: 1) Construct the lnes n a document page; 2) Remove all the non-sparse lnes from the lne set; 3) Remove all the nosy sparse lnes; 4) Label table lnes by consderng the keywords. In ths secton, we check the performance of each step and analyze the mpact effect of dfferent feature set. 7.2 Text Extracton from PDFs PDF document content stream lsts each PDF document as a sequence of pages, whch n turn can be recursvely decomposed nto a seres of components, such as text, graphcs, and mages. The correspondng objects to those com

8 ponents are text objects, mage objects, path objects, etc. Most of the exstng works to dscover the logcal components of a document focus on analyzng most f not all of the page objects. For example, regroupng all the objects to form the mage s a tradtonal task for document analyss system. However, the object overlappng problem happens frequently and the researchers have to make more effort to segment objects from each other frst. Even when such objects or structures are dentfed, they are stll too hgh level to fulfll specfc goals, ncludng our table boundary detecton. For most table related applcatons (e.g., table data extracton and table search), the majorty research nterests are focused on the text (the table content), nstead of the borderlnes. We classfy the tables nto two categores accordng to the content type: the text table and the mage table. Text tables refer to those tables that all parts are composed of texts. Image tables refer to those tables that are mage themselves or contan mages n some cells. All three tables n Fgure 1 are text tables. By random examnng thousands of the PDFs n the computer scence, chemstry, bology, and archeology felds, we notce that more than 92% table contents consst of pure texts whle only few tables contan mages. Because most tables are composed of texts, the text extractng tools, whch only provde the very low level nformaton (acters, words, coordnates, etc) wthout the structure nformaton, s enough for our goal. Many PDF converters are avalable off the shelf (Xpdf, PDF2TEXT, PDFBOX, Text Extractng tool, PDF- TEXTSTREAM, etc) to extract texts from documents. The nformaton obtaned wth the help of these tools can be dvded nto two categores: the text content and the text style. The text content refers to the text strngs; The text style ncludes the correspondng text attrbutes: the font, the sze, lne spacng and color, etc; We tred all the tools and TET has the best performance on text extracton. The text streams extracted from PDF fles can correspond to varous objects: a acter, a partal word, a complete word, a lne, etc. In addton, the order of these text streams does not always correspond to the readng order. A lne reconstructon and a readng order resortng steps are necessary n order to correctly extract the text from a PDF fle. The detals of our text sequence resortng algorthm s beyond the topc of ths paper and can be found n [13]. In ths paper, after the lne reconstructon and the sparse lne detecton, t s assumed that the text sequence s correct as a matter of course. 7.3 Performance of lne constructon We evaluate the performance of our lne constructon n Secton 5 based on 300 selected PDF pages. Gven the number of total lnes T, the number of constructed lnes that do not have any error C. Usually an error lne contans at least one of the followng problems: 1) the constructed lne ncludes some texts that should not belong to t; 2) the constructed lne msses a part of the text; If the mssed texts are ncluded n the prevous/next lne, we do not count the error duplcately. Wthn the lnes n the 300 PDF pages, we accurately constructed %(32757) lnes. The man reason for the errors s the superscrpts/subscrpts n the documents wth dense layouts. 7.4 Performance of sparse lne detecton We perform a fve-user study to evaluate the qualty of the sparse lne detecton. Each user checks the detected sparse lnes n 20 randomly selected PDF document pages n each dataset. The evaluaton metrcs are precson and recall. we ts defne the recall as and the precson as ts. ts ts+fn ts+fp refers to the labeled true postve sparse lnes n a page. fn refers to the false negatves (they are sparse lnes but we mss them). fp refers to the false postves (they are not sparse lnes, but we label them as the sparse lnes). Table 4 dsplays the results based on the 300 PDF pages. Table 3: The performance evaluaton of the sparse lne detecton datasets H A S The Number of PDF pages Recall of sparse lne detecton Precson of sparse lne detecton There are two goals n our method: 1) removng nonsparse lnes as much as possble; 2) keepng true table lnes n the sparse lne set as much as possble. To acheve a more objectve evaluaton, we also check the performance of ths step from the perspectves of these two goals. Some tables have long cells and very small spaces between the adjacent table columns because of the crowd layout. In order to keep these table lnes (goal 2), we regulate thresholds by settng ll wth a tolerate value and sp wth a smaller value. The trade off s mslabelng some non-sparse lnes as sparse lnes. Because we have further steps to remove the nose from the sparse lnes, ncludng such non-sparse lnes (low precson) s not a bg problem. Wthn the datasets H, A and S, 84.63% lnes are labeled as non-sparse lnes and can be easly removed as nose. Wthn the remanng sparse lnes, whch account for 15.37% n the whole dataset, almost half (44.23%) of them are real table lnes and 95.35% table lnes are ncluded n the sparse lne set. There are two reasons for the mssed table lnes: 1) we label some table lnes as non-sparse lnes because they contan long cross-column cells wthout large space gap. Such mssed lnes can be retreved n secton 6. 2) the text mssng problem nherted from the text extracton tools. Ths defcency falls outsde the topc our paper. 7.5 Performance of Nose lne removal We use the same test data n secton 7.3 to evaluate the performance of the nose lne removal. The measurement methods are stll precson and recall. Lettl be the real table lnes that are kept n the sparse lne set, sp be the latest sze of the sparse lne set after each nose removal, and to be the real table lnes that are removed. We defne the precson as tl tl and precson as. sp tl+to In Fgure 7 (a), X-axs lsts all nose type to be removed from the sparse lne set as well as the postprocessng n Secton 6. FP refers the begnnng dataset all lnes n a page. RNS refers the non-sparse lnes. RH refers the nosy headng lnes. HF refers the nosy header and footnote lnes. CAP s the nosy capton lnes, REF s the nosy reference lnes, and PP represents the postprocessng step. Along wth the nose removng, the sze of the sparse lne dataset decreases and the precson of the table lne labelng ncrease steadly. Non-sparse lne removng and the postprocessng are two crucal steps for the table boundary detecton problem. The results on three datasets are consstent wthout any remarkable dfference. The precson value s mproved 1318

9 from 8.59% to 59.72% on average after removng all noses. In addton, the further steps are much easer because of the dramatcally reduced sparse lne set. Although the results are not good enough, the remanng false postve table lnes scatter the page and the large dstance to the table capton s an mportant feature to dentfy them. Fgure 7 (b) shows the recall curves wth the same expermental condtons. The ntal recall values are 100% because no lne s removed. Along wth each step, the recall s decreasng because few true table lnes are mslabeled and removed. Wthn three datasets, dataset S has the worst recall value because most computer scence documents do not follow the standard template strctly and some true table lnes are mslabeled. 7.6 Keyword Combnaton After the nose removal n the prevous secton, the typcal false postve table lnes are the lnes wth short length. Such lnes are usually located at the end of paragraphs, the last lne of a table capton, or a short table footnote wthout specal begnnng symbol etc. Consderng the dstance features, most of the frst type can be fltered out. For those mssed true table lnes, analyzng the locaton nformaton of adjacent sparse lne sectons together wth the table capton help us to retreve them back. Based on the method n Secton 6, the precson values s enhanced to 95.32% and the precson values s close to 98.34%. 7.7 Impact effects of feature sets In order to compare the mpact effect of dfferent feature sets, we mplement three set of experments completed n the tme allotted: one CRF model usng only the orthographc features descrbed n secton 5.4.1, the second system adds the lexcal feature set, and the thrd system uses all features. Every model s tested wth all three datasets separately. The results are lsted n Table 4. We use the results based on the rule-based method as the comparson baselne. The evaluaton metrcs are precson, recall and F -measure. Gven the number of the correctly-labeled true table lnes by each method A, the number of true postve table lnes but overlooked B, and the number of true negatve non-table lnes that s msdentfed as table lnes A A+C A+B A C, the Precson s, the Recall s, and the F- measure=(2*recall*precson)/(recall+precson). Table 4: Average accuracy of table boundary detecton wth dfferent feature sets after all the nose lne removal feature sets, and datasets Recall Precson CRF, Orthographc, H 42.18% 44.96% CRF, Orthographc, S 41.66% 45.14% CRF, Orthographc, A 40.89% 45.16% CRF, Orthographc+Lexcal, H 61.22% 61.66% CRF, Orthographc+Lexcal, S 59.30% 59.81% CRF, Orthographc+Lexcal, A 59.98% 60.58% CRF, Orthographc+Lexcal+Layout, H 98.92% 96.28% CRF, Orthographc+Lexcal+Layout, S 97.33% 96.49% CRF, Orthographc+Lexcal+Layout, A 98.76% 96.20% It s not surprsng to notce that wthn the multple feature sets, the layout features play the most mportant mpact effect on the fnal performance of the table lne detecton. For the rule-based method and CRF wth less features, the recall values are always less than that of precsons. Asthe jonng of more layout features, not only the overall table Table 5: Average accuracy of table boundary detecton wth dfferent methods after all the nose lne removal Method and datasets F-measure Rule-based method, H + S + A 91.93% CRF, H+S+A 96.36% SVM lnear, H+S+A 94.38% Max Ent n [19] 88.7% CRF Bnary n [19] 91.2% CRF Contnuous n [19] 91.8% C4.5 n [16] < 95% Bp n [16] < 91% Det n [16] < 70% boundary detecton performance s heavly ncreased, but also the recalls exceed the precsons, and satsfy the nature of the further table data search demand. Wthn the experments wth the same method and the same features, dfferent tranng datasets have smlar performances. Usually the dataset H has better performance comparng the datasets S and A because that the document layout and table structure n chemcal papers are more standard than those n other two felds. 7.8 Impact effect of parameters Fgure 7(c) shows the effect of the feature boostng parameter θ for CRF. In the prevous secton, we use the default parameter settng: θ = 1.0. In ths secton, we test dfferent values: 0.5, 1.0, 1.5, 2.0, 2.5, 3.0. Ifθ<1.0, the non-sparse lnes get more preference. In each θ value, we compare the average precson results among dfferent feature settngs based on dfferent datasets. We notce that no matter how we change the θ and datasets, more features generate better results than fewer features and the contrbuton of the layout feature s much hgher than others. As the ncreasng of θ, the trend of precson s deceasng. The more features we consder, the more robust the results along the changng of θ. 7.9 Impact effect of dfferent technques Moreover, we compare our results based on CRF and SVM methods wth several man table boundary detecton results publshed n other related works. The evaluaton method s F-measure. For our CRF and SVM methods, we adopt all the features on all three datasets. Comparng wth our prevous rule-based method, our CRF experments mprove the performance by 54.90% and our SVM experments mprove the performance by 30.36%. Wthn all the publshed works, Ng et. al, acheved the best results wth C4.5 method n [16]. Our work wth CRF method enhance the F-measure by more than 27.2%. 8. CONCLUSIONS In ths paper, we propose a novel method to detect the table boundary. Because most tables are text-based, we clam that the text object of PDF provdes enough nformaton for table detecton. Wthn the text object, we beleve that the font sze s not so relable as other work stated. Based on the sparse-lne nature of tables, we propose a fast but effectve method to detect the table boundary by only processng the sparse lnes n a document page. Processng the sparse lnes solely can also mprove the performance of the text sequence resortng problem. Combnng dfferent keywords, 1319

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a