Bootstrapping Structured Page Segmentation


Huanfeng Ma and David Doermann
Laboratory for Language and Media Processing
Institute for Advanced Computer Studies (UMIACS)
University of Maryland, College Park, MD
{hfma,

ABSTRACT

In this paper, we present an approach to the bootstrapped learning of a page segmentation model. The idea evolved from attempts to segment dictionaries, which often have a consistent page structure, and is extended to the segmentation of more general structured documents. In cases of highly regular structure, the layout can be learned from examples of only a few pages. The system is first trained using a small number of samples, and a larger test set is processed based on the training result. After corrections are made to a selected subset of the test set, the corrected samples are combined with the original training samples to generate bootstrap samples. The newly created samples are used to retrain the system, refine the learned features and resegment the test samples. This procedure is applied iteratively until the learned parameters are stable. With this approach, we do not need to provide a large training set initially, and bootstrapping refines the results step by step. We have applied this segmentation to many structured documents, such as dictionaries, phone books and spoken language transcripts, and obtained satisfactory segmentation performance.

Keywords: Bootstrap, Document Segmentation, OCR

1. INTRODUCTION AND RELATED WORK

Although a great deal of information is available online, much information is still available only in printed form. Many printed documents have a repeated structure at the physical and semantic levels, and the layout is often based on the function of the entries. Figure 1 shows typical examples of structured documents. In a bilingual dictionary, the content is usually structured into translation entries; in a phone book, the content is structured into a person's name and his or her personal information. A restaurant menu may have a hierarchical structure, i.e. the content is first structured into food types and then into individual dishes. Analysis of the structure can help humans extract and organize information. Recognizing this consistency can also help drive automated document analysis systems.

Figure 1. Structured document examples: (a) English-French dictionary; (b) phone book; (c) advertisement; (d) restaurant menu.

The process of document layout structure analysis is often divided into two tasks: physical segmentation and logical analysis. Physical segmentation usually divides a page into zones with specific physical characteristics. Logical analysis labels each extracted zone with a specific functional or logical label. The structural complexity of different documents makes it difficult to design a generic document analysis tool that can be applied to all documents. Furthermore, since logical analysis is based on the physical segmentation result, the quality of the physical segmentation is crucial to document image understanding and dominates its performance. In this paper, we present a segmentation approach that combines physical and logical segmentation and can be used to segment structured pages by learning the physical and semantic features that characterize the functionality of unique entries. A bootstrap technique is applied to the generation of training data to improve the accuracy of training and segmentation. Before describing our page segmentation approach in detail, however, we provide a brief survey of related work.

Liang, Phillips and Haralick [1] present a probability-based text-line identification and segmentation approach. Their approach consists of two phases: offline statistical training and online text-line segmentation. In the online text-line segmentation phase, an iterative, relaxation-like method is applied to find an optimal partition of source entities by improving a conditional probability. Kopec and Chou [11] apply a stochastic approach to build Markov source models for text-line segmentation, under the assumption that a symbol template is given and that the zones (or text columns) have already been extracted. Under the assumption that the physical layout structures of document images can be modeled by a stochastic regular grammar, Kanungo and Mao [5] use a generative stochastic document model to model a Chinese-English dictionary page; a weighted finite state automaton that models the projection profile at each level of the document physical layout tree is used to segment the dictionary page on all levels. S. Lee and D. Ryu [6] propose a parameter-free method to segment document images with various font sizes, text-line spacings and document layout structures.
There are also rule-based segmentation methods, which perform the segmentation based on rules that are either set up manually by the user [7] or learned automatically by training [9, 10].

2. PAGE SEGMENTATION

In document analysis, pages are first segmented into different levels of entities based on physical features. With journal articles, for example, the page can be represented as a hierarchical structure of zones, text-lines, words and characters. The segmentation is performed based on physical features such as spacing, relative position and text-line attributes. After the physical segmentation result is obtained, logical analysis is usually applied to the highest level, the zones. For journal articles, zones can be classified as title, author, abstract, body and references.

The structured page segmentation problem we present in this paper differs from traditional page segmentation problems in that the document is assumed to have repeating entries with similar structures, such as an entry in a dictionary, phone book, table of contents or bibliography. Furthermore, these logical entries may occur in a single physical zone. In some sense, this problem is a combination of physical and logical segmentation because (1) pages are first physically segmented into physical zones; (2) one physical zone can be further functionally segmented into multiple entries based on their functional characteristics; and (3) extracted entries are classified into different logical types. We often find that, for these types of documents, the author designs the document so that different functional properties of the zones allow the reader to distinguish between them. The functional characteristics of an entry differ between document types but are often consistent within a single document. In dictionaries, for example, one entry may extend across columns or zones. Other entries may need to be ignored (i.e. classified as noise) because they are not of interest for logical labeling (for example, a page number, header or footer).
In a dictionary, our functional segmentation essentially inserts a new level (entry) between the zone and the text-line of the typical hierarchical representation. Thus the segmentation is based not only on the structural features of the page, but also on the structural features of the entry. Another novelty of our segmentation approach is the application of a bootstrap technique for the generation of new training samples. Bootstrapping helps to make the segmentation adaptive and improves the segmentation performance.

In our segmentation, we start with OCR results that include text size, font, face, text-line and text-zone information. The goal of the segmentation is to segment each zone into multiple entries with similar features (the features are defined in the next subsection) or to organize multiple text-lines into one entry. The page segmentation is illustrated in Figure 2, and each iteration consists of the following three steps:

1) Feature Extraction: The segmentation system is automatically trained using a small set of labeled samples, and the features of entries are extracted.
2) Segmentation: Pages are segmented based on the extracted features.
3) Correction and Bootstrapping: The segmented results are fed back to the user, who can make corrections to a small subset of the results with errors. Based on the corrected segmentation results, bootstrap samples are generated and used to retrain the system.

Because training is only worthwhile at scale, we concentrate on documents with a significant volume or number of pages. Correction and training require an operator who knows the structure of the document.

Figure 2. Diagram of the page segmentation approach (blocks: Training Samples, Prior Knowledge, Feature Extraction, Feature Space, Test Document, Physical & Functional Decomposition, Segmentation, Selected Results, User's Correction, Bootstrapping, Final Results).

2.1 Feature Extraction

Based on a study of different types of structured documents, the following features have been shown to be useful and can be extracted and applied. Examples are shown in Figure 3.

Figure 3. Feature example (annotated labels: Entry, Special Symbol, Word Style, Symbol Pattern, Word Style Pattern, Line Structure).

Special symbols: Special symbols such as punctuation, numbers and other non-alphabetic symbols are often used to start a new entry, to end an entry, or to mark the continuation of a text line.
Word font, face and size: Word font, face and size (especially those of the first word in each entry) are often important entry features. In a dictionary page, for example, the first word of each entry (typically the headword) can be bold, all capitals, in a different font, or larger than the rest of the entry.
Word patterns: Words often form distinguishable patterns that can be used to describe the consistency of the entry structure.
Symbol patterns: Combined with other symbols or regular characters, special symbols can form consistent patterns that mark the beginning or ending of an entry.
Line structures: The indent, spacing, length and height of the text lines in an entry can all be captured in the line structure to represent the features of an entry.
Other features: Other features can also be used to segment the entries, such as the spacing between adjacent entries, the position of text, script type, word spacing, character case and so on.
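The feature list above can be made concrete with a minimal sketch that turns one candidate entry into a set of binary features. This is only an illustration under stated assumptions: the `TextLine` record, its field names and the feature names are hypothetical and do not correspond to any particular OCR output format or to the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class TextLine:
    """One OCR text line. All fields are hypothetical, not a real OCR schema."""
    text: str
    indent: float          # left offset of the line inside its zone
    first_word_bold: bool  # style flag of the first word on the line


def entry_features(lines: List[TextLine], end_symbols: str = ".;") -> Set[str]:
    """Return the names of the binary features observed in one entry."""
    features: Set[str] = set()
    if not lines:
        return features
    first, last = lines[0], lines[-1]
    # Word style: the first word (typically the headword) is bold.
    if first.first_word_bold:
        features.add("first_word_bold")
    # Word style: the first word is written in capitals.
    words = first.text.split()
    if words and words[0].isupper():
        features.add("first_word_all_caps")
    # Special symbol: the entry ends with one of the assumed ending symbols.
    if last.text.rstrip().endswith(tuple(end_symbols)):
        features.add("ending_symbol")
    # Line structure: hanging (negative) indent of the first text line.
    if len(lines) > 1 and all(first.indent < ln.indent for ln in lines[1:]):
        features.add("hanging_indent")
    return features
```

A set of feature names per entry is a convenient input for counting feature occurrences during training.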

During the training (feature extraction) phase, a Bayesian framework is used to assign and update the probabilities of the extracted features. Based on the estimated probabilities, each extracted feature is assigned a weight that is used to compute the entry score from all extracted features. The detailed procedure is as follows:

(1) Count the occurrences of the different features in the training samples;
(2) Compute the feature occurrence rate as the feature probability. Suppose there are N training entries in total and K extracted features; then for feature i ($1 \le i \le K$), the probability is computed as $p_i = K_i / N$, where $K_i$ is the number of occurrences of feature i;
(3) Assign feature weights based on the computed probabilities: $w_i = p_i / A$, where $A = \sum_{i=1}^{K} p_i$, so that the weights sum to 1;
(4) Treat the extracted features as a feature space; each entry is projected into this space and a voting score is computed as $FV = \sum_{i=1}^{K} w_i S_i$, where $S_i = 1$ if feature i occurs and $S_i = 0$ otherwise;
(5) Obtain the minimum, maximum and average voting scores of the entries; these values are used as thresholds in the segmentation stage.

Different types of entries may have different feature occurrences, so this procedure is run for each entry type (refer to Section 2.3 for a discussion of entry types). In the feature extraction stage, we scale the weights by a constant factor to facilitate computation. From Figure 4, we can see that the line structure (negative indent of the first text line) has the heaviest weight, which means it is the most important feature in this document.

2.2 Segmentation

The segmentation is an iterative procedure that maximizes the feature voting score of an entry in the feature space. Based on the features extracted in the feature extraction phase, a document can, in principle, be segmented into entries by searching for the beginning and ending text-lines of each entry. This search is a threshold-based, iterative, relaxation-like procedure, and the threshold can be estimated from the training set.
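The probability, weight and voting-score computation of Section 2.1, which also yields the thresholds mentioned above, can be sketched as follows. This is a minimal sketch rather than the authors' implementation: entries are assumed to be given as sets of hypothetical feature names, and the weights are normalized to sum to 1 without the additional scaling factor.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def train_weights(entries: List[Set[str]]) -> Dict[str, float]:
    """Steps (1)-(3): count features, estimate p_i = K_i / N, normalize to w_i."""
    n = len(entries)
    if n == 0:
        raise ValueError("at least one training entry is required")
    counts = Counter(f for entry in entries for f in entry)  # K_i per feature
    probs = {f: c / n for f, c in counts.items()}            # p_i = K_i / N
    total = sum(probs.values())                              # A = sum of p_i
    if total == 0:
        return {}
    return {f: p / total for f, p in probs.items()}          # w_i = p_i / A


def voting_score(features: Iterable[str], weights: Dict[str, float]) -> float:
    """Step (4): FV = sum of w_i over the features that occur (S_i = 1)."""
    return sum(weights.get(f, 0.0) for f in set(features))


def score_thresholds(entries: List[Set[str]],
                     weights: Dict[str, float]) -> Tuple[float, float, float]:
    """Step (5): minimum, maximum and average voting score of the entries."""
    scores = [voting_score(e, weights) for e in entries]
    return min(scores), max(scores), sum(scores) / len(scores)
```

Because different entry types may show different features, the same functions would be run once per entry type on that type's training entries.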
Considering that there is a relatively small number of text lines on one page, this search can be done by brute force. The approach is iterative and the training set can be generated by bootstrapping, so the initial segmentation results can be refined step by step. The segmentation procedure is as follows:

1) Search candidates for the first text line in one zone by feature matching. This operation is equivalent to determining whether the first line in a zone is the beginning of a new entry or the continuation of an entry in the previous zone or on the previous page.
2) Search for the end of an entry. This operation is replaced by searching for the beginning of the next entry, because the beginning of a new entry marks the ending of the previous entry.
3) Remove the extracted entries and iterate until all new entries are identified.

Once we obtain the initial segmentation results, and before going to the next step, the results are traversed and, if necessary, two simple operations (splitting and merging) are applied. The details can be found in previous work [4].

Figure 4. Extracted features and assigned weights of selected entries in the document of Figure 3 (features: First Word Style, Word Style Pattern, Ending Symbol, Symbol Pattern, Word Symbol Pattern, Line Structure).
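The search for entry beginnings can be illustrated with a deliberately simplified, single-pass sketch. It is not the iterative, relaxation-like procedure itself: it assumes that each text line has already been mapped to a set of hypothetical "entry start" feature names, that `weights` comes from training, and that one fixed `threshold` is used.

```python
from typing import Dict, List, Set, Tuple


def segment_zone(line_features: List[Set[str]],
                 weights: Dict[str, float],
                 threshold: float) -> Tuple[List[List[int]], bool]:
    """Group the line indices of one zone into entries.

    A line opens a new entry when its voting score reaches the threshold;
    the end of an entry is implied by the beginning of the next one.
    Returns the entries and a flag telling whether the first line of the
    zone looks like the continuation of an entry from a previous zone or page.
    """
    entries: List[List[int]] = []
    first_is_continuation = False
    for idx, feats in enumerate(line_features):
        score = sum(weights.get(f, 0.0) for f in feats)
        if score >= threshold:
            entries.append([idx])         # beginning of a new entry
        elif not entries:
            entries.append([idx])         # zone starts in the middle of an entry
            first_is_continuation = True
        else:
            entries[-1].append(idx)       # line belongs to the current entry
    return entries, first_is_continuation
```

In this simplified form the threshold could, for example, be taken from the minimum voting score observed on the training entries.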

2.3 Correction and Bootstrapping

Because of the complexity of many structured documents, it is difficult to determine the optimal value of some parameters. We attempt to learn as much as possible about the features from the given training set. One possible way to do this is to generate a new training set from the original set and selected new segmentation results. This is the bootstrap technique, first proposed by Efron [3] in 1979. The newly generated training samples are called bootstrap samples. The bootstrap samples can be generated from the original training samples, from the new segmentation results, or from a combination of both. Because the original training set is usually small, we always generate bootstrap samples from the combined set of original training samples and selected segmentation results. Before the segmentation results are combined with the original training samples to generate bootstrap samples, the operator corrects the segmentation results by performing one or more of the following operations:

Splitting: split one segmented entry into two or more individual entries;
Merging: merge two or more adjacent entries into one single entry;
Resizing: change the size of a segmented entry;
Moving: change the bounding box position of a segmented entry;
Removing: remove a segmented entry;
Relabeling: change the type label of an entry.

In their paper, Hamamoto et al. [2] analyze four different procedures to generate bootstrap samples. We applied two of the four procedures in our approach; they differ only in the computation of the weights. First, let $X_N = \{x_1, x_2, \ldots, x_N\}$ be a set extracted from the original training samples and the newly selected segmentation results for a given entry type, where the $x_i$ ($i = 1, 2, \ldots, N$) are feature vectors in which each element is the probability of a specific feature. We generate a bootstrap sample set $X_N^B = \{x_1^b, x_2^b, \ldots, x_N^b\}$ of size N from the original set, so one of the following two procedures can be applied to generate the desired bootstrap set.

Procedure 1:
1) Select one sample entry with feature vector $x_r$ from $X_N$;
2) Find the k closest samples $x_{r1}, x_{r2}, \ldots, x_{rk}$ to $x_r$ in the feature space;
3) Compute a bootstrap sample $x^b = \sum_{j=1}^{k} w_j x_{rj}$, where the weight $w_j$ is given by $w_j = c_j / \sum_{l=1}^{k} c_l$, $c_j$ is chosen from a uniform distribution on [0, 1], and $\sum_{j=1}^{k} w_j = 1$;
4) Repeat until all N samples are selected.

Procedure 2:
1) Select one sample entry with feature vector $x_r$ from $X_N$;
2) Find the k closest samples $x_{r1}, x_{r2}, \ldots, x_{rk}$ to $x_r$ in the feature space;
3) Compute one bootstrap sample $x^b = \frac{1}{k} \sum_{j=1}^{k} x_{rj}$;
4) Repeat until all N samples are selected.

In the first step of both procedures, the sample entries are chosen such that no entry is selected more than once. The generated bootstrap samples are linear combinations of the source training samples. The difference is that the first procedure combines samples using random weights, while the second generates bootstrap samples by computing an unweighted mean.

We assume the document has a consistent functional structure. Some features that occur in one entry type may never appear in another. For example, the line indent feature will not appear in a single-line entry, while the ending special symbol may not appear in an entry that continues on the next page, so we generate the bootstrap samples separately for each predefined entry type. These entry types could be:

Regular entry: a complete multi-line entry that starts and ends on the same page;
Continuation entry: an entry that is the continuation of an entry from the previous page;
Un-terminated entry: an entry that does not end on the current page and continues on the next page;
Open entry: an entry that is the continuation of an entry from the previous page and does not end on the current page;
Single-line entry: a regular entry that contains only one text line.

2.4 Post-Processing

Figure 5. Segmentation errors caused by OCR errors: (a) zone output of OCR; (b) wrong segmentation entry.

The entry segmentation result depends heavily on the zone segmentation result. In other words, if the zone segmentation result is incorrect, it is impossible to obtain correct entry segmentation results from it without adjustment (see Figure 5). The task of the post-processing stage is therefore to correct the entry segmentation errors caused by incorrect zone segmentation and to make the segmentation approach more adaptive. The post-processing procedure is briefly illustrated by the flowchart in Figure 6.

Figure 6. Flowchart of post-processing.

The statistical information used includes the average entry width, the average text-line height within an entry, the average regular text-line width, the word spacing within an entry, the relative position of the entries of interest, and so on. The main operation in the post-processing stage is to browse the words in each entry that differs significantly from the statistical information obtained from the training samples, and to reorganize these words into text-lines and then into entries. Figure 7 shows the segmentation result after post-processing.

Figure 7. Corrected result after post-processing.
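Returning to the two bootstrap procedures of Section 2.3, the generation of bootstrap samples can be sketched as follows. This is a minimal sketch under stated assumptions: feature vectors are plain lists of floats, closeness is measured by squared Euclidean distance, and each sample counts as its own nearest neighbour; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence

Vector = Sequence[float]


def _k_nearest(x: Vector, samples: List[Vector], k: int) -> List[Vector]:
    """Return the k samples closest to x (assumption: x itself is included)."""
    def sq_dist(y: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return sorted(samples, key=sq_dist)[:k]


def bootstrap_samples(samples: List[Vector], k: int, weighted: bool,
                      rng: Optional[random.Random] = None) -> List[List[float]]:
    """Generate one bootstrap sample per input sample.

    weighted=True follows Procedure 1 (random weights c_j / sum of c_l);
    weighted=False follows Procedure 2 (unweighted mean of the k neighbours).
    """
    rng = rng or random.Random()
    out: List[List[float]] = []
    for x in samples:                      # every sample is selected exactly once
        neigh = _k_nearest(x, samples, k)
        uniform = [1.0 / len(neigh)] * len(neigh)
        if weighted:
            c = [rng.random() for _ in neigh]   # c_j drawn uniformly from [0, 1)
            total = sum(c)
            w = [cj / total for cj in c] if total > 0 else uniform
        else:
            w = uniform
        # The bootstrap sample is a linear combination of the neighbours.
        out.append([sum(wj * y[d] for wj, y in zip(w, neigh))
                    for d in range(len(x))])
    return out
```

Since the bootstrap samples are generated per entry type, the function would be called once for each type, for example `bootstrap_samples(regular_vectors, k=3, weighted=True)`, where `regular_vectors` is a hypothetical list holding the feature vectors of the regular entries.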


3. EXPERIMENT RESULTS AND PERFORMANCE EVALUATION

We have applied the presented approach to the segmentation of three categories of structured documents: (1) dictionaries; (2) voice transcriptions; and (3) phone books. For the first two categories, we provide results and an evaluation, while for the last category we only show the segmentation results.

3.1 Dictionary Segmentation Results

The segmentation approach was applied to four different dictionaries with different structural features: a French-English dictionary (63 pages), an English-French dictionary (657 pages), a Turkish-English dictionary (99 pages) and an English-Turkish dictionary (52 pages). The French-English and English-French pages are taken from the same bilingual dictionary, so they have the same features. Figures 8-10 show the segmentation results for these dictionaries.

Figure 8. English-French dictionary segmentation results: (a) word; (b) text-line; (c) entry (the page number is noise).

Figure 9. Turkish-English dictionary segmentation result (with many single-line entries).

Figure 10. English-Turkish dictionary segmentation result (different entry features).

Figure 11. Progressive performance improvement based on bootstrapping (4 dictionaries): (a) accuracy rate improvement; (b) initial features and weights; (c) features and weights after bootstrapping.


Figure 11 shows the performance improvement for dictionary segmentation based on bootstrapping. The evaluation is based on statistics from 5 pages of each dictionary. The initial segmentation was based on four training entries (one training entry for each entry type described in Section 2.3). The iterations that follow the initial segmentation are based on different numbers of added training entries, which are used to generate bootstrap samples. The chart in Figure 11(a) shows that the segmentation can be refined step by step by applying the bootstrap technique. Figures 11(b) and 11(c) show the extracted features and assigned weights in the initial step and after bootstrapping. It can be seen that the weights changed after bootstrapping. New features may also be extracted in the bootstrapping step. For example, when the LRSpace (line-entry spacing difference) feature exists in a document, providing only non-adjacent entries is not sufficient to extract it, but bootstrapping adds new training entries, which makes it possible to extract this feature.

The final performance evaluation is shown in Table 1. Obtaining ground truth for such a large data set is very time-consuming, so the evaluation is based only on the available ground truth for these dictionaries. In Table 1, Correct Entries and Incorrect Entries are two complementary parts of the evaluation results, so the four error percentages under Incorrect Entries and the percentage under Correct Entries sum to 100%. The False Alarm error measures the impact of noise on the segmentation result, and the Mislabeled Entries error measures the labeling of correctly segmented entries. From Table 1, we can see that the presented segmentation algorithm works well: the lowest percentage of correct segmentation is higher than 96%, and the best result is higher than 99%. Figures 12 and 13 show examples of all the errors listed in Table 1 (except the missed entry error, which is easy to understand).
Among the incorrect entry errors, the overlapped error is less serious than the other three error types because it is usually caused by (1) a character or symbol distributed over two text-lines, or (2) text-line spacing that is too small (Figure 12). In Figure 13, the false alarm occurs due to noise. The images shown in Figure 13(c) are two parts of one single entry that was divided across two adjacent pages. The entry type in Figure 13 should be un-terminated, but it was labeled normal in the segmentation result. This type of error is usually caused by a special symbol that is normally used to terminate an entry (a period, '.', in this case).

Table 1. Result Evaluation of Four Dictionaries
Documents (in column order): English-French, French-English, English-Turkish, Turkish-English
Page No:
Total Entries:
Correct Entries: 939 (96.%), 2372 (97.9%), 349 (99.26%), 2627 (98.98%)
Incorrect Entries, Missed: (.5%), (.4%), (.4%)
Incorrect Entries, Overlapped and Merged: 22 (%), 528 (2.62%), 39 (.6%), 6 (.25%), 6 (.7%), 4 (.%), 7 (.26%)
Incorrect Entries, Split: 53 (.26%), 5 (.2%), 6 (.45%), 9 (.72%)
False Alarm: 43 (.2%), 6 (.25%), 8 (.23%), 2 (.8%)
Mislabeled Entries: 6 (.8%), 2 (.49%), (.3%), (.38%)

Figure 12. Incorrect entry errors: (a) overlapped; (b) merged; (c) split.

Figure 13. False alarm and mislabeling errors: false alarm; mislabeling error.
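The relationship between the evaluation categories can be sketched in a few lines. This is only an illustration with hypothetical counts: it assumes that the total number of entries is the sum of the correct entries and the four incorrect categories, and that the false alarm and mislabeling rates are also expressed relative to that total, which the text does not state.

```python
from typing import Dict

SEGMENTATION_KEYS = ("correct", "missed", "overlapped", "merged", "split")


def entry_rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw entry counts into percentages of the total entry count."""
    total = sum(counts[k] for k in SEGMENTATION_KEYS)
    if total == 0:
        raise ValueError("no entries to evaluate")
    rates = {k: 100.0 * counts[k] / total for k in SEGMENTATION_KEYS}
    # Assumption: these two rates use the same denominator as the others.
    for extra in ("false_alarm", "mislabeled"):
        rates[extra] = 100.0 * counts.get(extra, 0) / total
    return rates


# Hypothetical counts, not taken from Table 1.
example = entry_rates({"correct": 960, "missed": 10, "overlapped": 20,
                       "merged": 5, "split": 5, "false_alarm": 12})
```

With these counts the correct and incorrect categories add up to 100%, while the false alarm and mislabeling rates are reported on top of that.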


3.2 APOLLO 15 Voice Transcription Segmentation Results

For the dictionary parsing problem, the motivation is obvious: we wish to segment the dictionaries into entries that can be tagged and used as lexical resources. Typed transcriptions of audio content provide a related challenge. We are currently integrating audio, video and photographs from the Apollo 15 mission into an audio retrieval interface. The audio and the scanned images of typed transcripts come from the Lunar Module (LM), the Command Module (CM) and mission control. Our goal is to synchronize references to the transcription images with the audio as it is played. To do this, our first task was to segment the document images into spoken units and to label the times, sources and spoken text regions. While this text is not complicated, unparameterized segmentation will not be as accurate as a modeled segmentation. Figure 14 shows the segmentation results for the transcriptions.

The transcription contains 5 different parts (around 34 pages in total), and each individual part has different structural features. Table 2 shows the evaluation results of the segmentation, based only on the available ground truth; the structure of Table 2 is exactly the same as that of Table 1. Compared with the four dictionaries, the transcription documents have relatively simpler structures and more obvious structural features, so the segmentation results are more accurate than the dictionary results, with the lowest percentage higher than 98% and the highest at 99.87%. However, due to physical noise and logical noise (entries of no interest), the false alarm and mislabeling errors are significantly higher than in the dictionary results: the highest false alarm rate reaches 8.37% (.25% in the dictionary results), and the highest mislabeling rate reaches 2.75% (.49% in the dictionary results).

Figure 14. Segmentation results of voice transcription ((a) and (b) have different features).

Table 2. Result Evaluation of Transcriptions
Columns: Document; Page No; Total Entries; Correct Entries; Incorrect Entries (Missed, Overlapped, Merged, Split); False Alarm; Mislabeled Entries
AS15_CM: (99.2%) (.5%) 2 (.57%) 2 (.9%) 2 (.9%) 77 (3.65%) 5 (.23%)
AS15_LM: (98.7%) 35 (.78%) 3 (.5%) 73 (3.7%) (.48%)
AS15_PAO: 23 4 (99.2%) 6 (.53%) 3 (.27%) 94 (8.37%) 7 (.38%)
AS15_PAC: (99.22%) 4 (.26%) 8 (.52%) 8 (5.9%) 45 (2.75%)
AS15_TEC: (99.87%) 2 (.3%) 9 (.59%)

Besides the dictionaries and transcriptions, we also applied this segmentation approach to several pages of a phone book to test its robustness; the result is shown in Figure 15. The last text-line is ignored as noise (a part of no interest).

Figure 15. Segmentation of phone book.

We are using ScanSoft's Developer Kit 2 (SDK2) to obtain the zone segmentation results. Since our approach starts with the OCR results, the segmentation performance depends significantly on the OCR output; once the OCR output has errors, the segmentation result is poor even if post-processing is applied to it. The fact that SDK2 only supports the recognition of Roman and Latin characters makes our segmentation results for documents containing characters of other languages (such as Arabic or Hebrew) even worse. Part of our future work is therefore to make our segmentation approach independent of the OCR results and to relax the restriction that the current approach only works on Roman and Latin languages.

4. SUMMARY AND FUTURE WORK

In this paper, we present an approach to page segmentation that uses a bootstrapping technique to learn a segmentation model. The segmentation system is first trained using a small set of samples. After the operator makes corrections to a selected set of newly generated segmentation results, the corrected results are combined with the original training set to generate a set of bootstrap samples, which are used to retrain the system. Starting with OCR results, this approach can be applied to the segmentation of any structured document whose structure can be learned from training. We applied this approach to many structured documents, such as dictionaries and voice transcripts, and obtained satisfactory results. The experimental results show that the bootstrap technique can improve segmentation performance even with a small set of training samples.

Many structured documents contain pictures, figures, tables, forms and other content that differs from regular word content, which makes the segmentation more difficult because these parts usually do not have consistent structures, although they may have a consistent position and size. Another part of our future work is to extend our approach to documents with these elements. Currently we concentrate only on black-and-white documents. Since many documents are in color, and different colors often represent different functional entries, we are extending our work to color documents. In color document analysis, color can be added to the feature space as a new feature.
ACKNOWLEDGEMENTS

This research is supported by a DARPA TIDES project grant; the authors thank them for the support.

REFERENCES

1. J. Liang, I.T. Phillips, R.M. Haralick, An Optimization Methodology for Document Structure Extraction on Latin Character Documents, IEEE Trans. Pattern Analysis & Machine Intelligence, vol. 23:7, July 2001.
2. Y. Hamamoto, S. Uchimura, S. Tomita, A Bootstrap Technique for Nearest Neighbor Classifier Design, IEEE Trans. Pattern Analysis & Machine Intelligence, vol. 19:1, 73-79, January 1997.
3. B. Efron, Bootstrap Methods: Another Look at the Jackknife, Annals of Statistics, vol. 7, 1-26, 1979.
4. D. Doermann, H. Ma, B. Karagol-Ayan, D. W. Oard, Translation Lexicon Acquisition from Bilingual Dictionaries, Proc. SPIE Conf. on Document Recognition and Retrieval, 37-48, San Jose, CA, January.
5. T. Kanungo, S. Mao, Stochastic Language Model for Analyzing Document Physical Layout, Proc. SPIE Conf. on Document Recognition and Retrieval, San Jose, CA, January.
6. S. Lee, D. Ryu, Parameter-Free Geometric Document Layout Analysis, IEEE Trans. Pattern Analysis & Machine Intelligence, vol. 23:11, November 2001.
7. S. Mao, T. Kanungo, Stochastic Language Models for Automatic Acquisition of Lexicons from Printed Bilingual Dictionaries, DLIA 2001 Advance Program, Seattle, WA, Sep. 2001.
8. R.M. Haralick, Document Image Understanding: Geometric and Logical Layout, Proc. Int. Conf. on Computer Vision and Pattern Recognition, Seattle, WA, 1994.
9. L. Robadey, O. Hitz, R. Ingold, A Pattern-Based Method for Document Structure Recognition, DLIA 2001 Advance Program, Seattle, WA, Sep. 2001.
10. D. Malerba, F. Esposito, Learning Rules for Layout Analysis Correction, DLIA 2001 Advance Program, Seattle, WA, Sep. 2001.
11. G. E. Kopec, P. A. Chou, Document Image Decoding Using Markov Source Models, IEEE Trans. Pattern Analysis & Machine Intelligence, vol. 16:6, 602-617, June 1994.
12. K. Urwin, Langenscheidt's Standard French Dictionary, Germany, 1988.
