Going digital: Challenge & solutions in a newspaper archiving project
Andrey Lomov
ATAPY Software, Russia
Problem Description
- Poor recognition results caused by low image quality: noise, white holes in characters, complicated layout, etc.
- FineReader sometimes glues neighboring newspaper columns together, which results in incorrect article assembly
- Newspaper headers are misinterpreted as images because of their large and irregular print
- Some figures and photos are misinterpreted as text
Zoning mistakes that may appear on specific layouts
Typical ABBYY FineReader zoning mistakes
- Not all of the picture is included in the picture block
- Some parts of the image are marked as text blocks
- Several pictures are marked as a single picture block
Target area
- Problem: insufficient accuracy of image and text block segmentation on specific newspaper pages in FineReader applications
- Goal: implement an analysis algorithm that helps Engine SDK segment newspaper pages properly
- The customer's main requirement: segment the page so that blocks can be assembled into articles using their relative position and order
ATAPY Page Zoning Algorithm
Solution principles
Intelligent image processing
- Step 1: Deskew
- Step 4: Protect figures and text from image modifications
- Step 5: Advanced garbage remover
- Step 6: Filling holes in faint characters
Preliminary image analysis
- Step 2: Search for vertical and horizontal separators, characters and figures
- Step 3 (optional): Build a grid from the separators in order to delimit further regions for analysis and recognize them in FineReader Engine
Correction algorithm
- Step 7: Recognize image in FineReader Engine
- Step 8: Check and correct resulting layout
Deskew (Intelligent image processing algorithms)
Deskew in ABBYY products:
- FREngine 8: up to 7 degrees
- FREngine 9: up to 12 degrees
- FREngine 10: up to 25 degrees
Advanced ATAPY Deskew based on the Hough Transform algorithm: up to 45 degrees
[Figure: sample image, 15°]
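The slide names the Hough Transform but does not describe ATAPY's implementation. As a minimal sketch of the general idea only: every black pixel votes, for each candidate angle, for the offset of the line it would lie on, and the angle that concentrates the votes into the fewest offsets is taken as the skew. The function name, the 0.5 degree step and the sum-of-squares score are assumptions made for illustration.

```python
import math

def estimate_skew_degrees(points, max_angle=45.0, step=0.5):
    """Estimate the dominant text-line angle (degrees) of black pixels.

    points: list of (x, y) coordinates of black pixels.
    Hypothetical helper, not part of any ABBYY or ATAPY API.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        a = math.radians(angle)
        c, s = math.cos(a), math.sin(a)
        bins = {}
        for x, y in points:
            # Offset of the line through (x, y) that runs at this angle.
            rho = round(y * c - x * s)
            bins[rho] = bins.get(rho, 0) + 1
        # Aligned text lines pile many votes into few bins,
        # which maximizes the sum of squared bin counts.
        score = sum(v * v for v in bins.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Rotating the page by the negative of the returned angle would then straighten the text lines. A production version would typically vote on edge or baseline pixels only, to keep the cost manageable on full-resolution scans.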
Prior to grid building (Preliminary image analysis)
- Search for areas that can be identified as separators:
  - Vertical and horizontal white gaps without overlapping characters
  - Images whose height is significantly larger than their width (e.g. 10:1) and whose width exceeds a certain minimum
- Find black elements (thin lines) that can be interpreted as separators
- Store black lines in the resulting layout and replace them with white in the current layout
- Join adjacent separators into one
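Two of the operations above, finding white gaps and joining adjacent separators, can be sketched on a binary image. This is an illustration under simplifying assumptions: the image is a list of rows with 1 for black and 0 for white, a gap must be white over the full image height, and the names and thresholds are hypothetical.

```python
def vertical_white_gaps(image, min_width=2):
    """Return [(x_start, x_end)] for runs of fully white columns (inclusive)."""
    width = len(image[0])
    is_white = [all(row[x] == 0 for row in image) for x in range(width)]
    gaps, start = [], None
    # The trailing False closes a run that reaches the right edge.
    for x, white in enumerate(is_white + [False]):
        if white and start is None:
            start = x
        elif not white and start is not None:
            if x - start >= min_width:
                gaps.append((start, x - 1))
            start = None
    return gaps

def join_adjacent(spans, max_distance=1):
    """Merge spans whose distance is at most max_distance into one."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_distance:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Horizontal gaps follow from the same function applied to the transposed image, for example `vertical_white_gaps([list(col) for col in zip(*image)])`.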
Grid building steps (Preliminary image analysis)
1. Remove black lines
2. Build the page bounding rectangle
3. Find separators crossing the borders of the bounding rectangle
4. Find intersecting separators within the rectangle
5. Find pending separators (ones that do not end at a crossing with another separator) and drop them
6. Make sure every intersection point has 3 or 4 separator lines leaving it
7. Build the resulting grid
Remove black lines (Preliminary image analysis)
Build separator lines (Preliminary image analysis)
Filter separators and build the grid (Preliminary image analysis)
Detect and protect figures and text from image modifications (Intelligent image processing)
- Detect figures as huge clusters of black dots, lines, etc.
- Find the figure's boundaries:
  - Left and right white gaps or black lines
  - Top and bottom white gaps or black lines
- Create a surrounding rectangle
- Detect potential characters as small clusters of black dots
- Search for the boundaries of each character
- Protect character boundaries from image modifications
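One common way to find such clusters is connected-component labeling. The sketch below assumes a binary image (1 for black), 8-connectivity, and a single pixel-count threshold separating figures from potential characters. The threshold and function names are hypothetical, and the slides do not state which criterion ATAPY actually uses.

```python
from collections import deque

def connected_components(image):
    """Return [(pixel_count, (left, top, right, bottom))] per black cluster."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    result = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            queue = deque([(x, y)])
            seen[y][x] = True
            count, l, t, r, b = 0, x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                count += 1
                l, t, r, b = min(l, cx), min(t, cy), max(r, cx), max(b, cy)
                for dx in (-1, 0, 1):
                    for dy in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            result.append((count, (l, t, r, b)))
    return result

def classify(components, figure_min_pixels):
    """Split bounding boxes into (figures, potential_characters) by size."""
    figures = [box for count, box in components if count >= figure_min_pixels]
    characters = [box for count, box in components if count < figure_min_pixels]
    return figures, characters
```

The returned boxes are the surrounding rectangles that later steps would treat as protected areas.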
Advanced garbage remover (Intelligent image processing)
- Get the garbage size from user-defined settings
- Clear unprotected areas only
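A minimal sketch of these two rules, assuming that "garbage" means a 4-connected black cluster of at most `max_garbage_pixels` pixels and that protected areas are given as inclusive rectangles. Both the parameter names and the cluster definition are assumptions for illustration.

```python
def remove_garbage(image, max_garbage_pixels, protected):
    """Return a copy of image with small unprotected black clusters cleared.

    image: list of rows, 1 = black, 0 = white.
    protected: list of (left, top, right, bottom) inclusive rectangles.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    seen = set()

    def is_protected(x, y):
        return any(l <= x <= r and t <= y <= b for l, t, r, b in protected)

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or (x, y) in seen:
                continue
            # Collect the whole cluster with an iterative flood fill.
            stack, cluster = [(x, y)], []
            seen.add((x, y))
            while stack:
                cx, cy = stack.pop()
                cluster.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and (nx, ny) not in seen):
                        seen.add((nx, ny))
                        stack.append((nx, ny))
            small = len(cluster) <= max_garbage_pixels
            if small and not any(is_protected(px, py) for px, py in cluster):
                for px, py in cluster:
                    out[py][px] = 0
    return out
```

A cluster is kept as soon as any of its pixels falls inside a protected rectangle, so specks that belong to a detected character or figure survive.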
Filling holes in faint characters (Intelligent image processing)
- Get the hole size from user-defined settings
- Process a protected potential character only if its size is greater than the size from user-defined settings
- Fill holes for each area
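The procedure can be sketched as follows, assuming that a hole is a white region fully enclosed inside a character's bounding box (it does not touch the box border) and no larger than the configured hole size. The parameter names `max_hole_pixels` and `min_box_area` stand in for the user-defined settings and are hypothetical.

```python
def fill_holes(image, char_boxes, max_hole_pixels, min_box_area=0):
    """Return a copy of image with small enclosed white holes filled.

    image: list of rows, 1 = black, 0 = white.
    char_boxes: list of (left, top, right, bottom) inclusive rectangles.
    """
    out = [row[:] for row in image]
    for l, t, r, b in char_boxes:
        if (r - l + 1) * (b - t + 1) <= min_box_area:
            continue  # character too small to be processed
        seen = set()
        for y in range(t, b + 1):
            for x in range(l, r + 1):
                if image[y][x] != 0 or (x, y) in seen:
                    continue
                stack, region, enclosed = [(x, y)], [], True
                seen.add((x, y))
                while stack:
                    cx, cy = stack.pop()
                    region.append((cx, cy))
                    if cx in (l, r) or cy in (t, b):
                        enclosed = False  # region reaches the box border
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if (l <= nx <= r and t <= ny <= b
                                and image[ny][nx] == 0 and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            stack.append((nx, ny))
                if enclosed and len(region) <= max_hole_pixels:
                    for px, py in region:
                        out[py][px] = 1
    return out
```

The size limit matters because legitimate counters, such as the inside of an "o", are also enclosed white regions; only holes smaller than such counters should be filled.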
Correction algorithm
Page segmentation algorithm for specific newspapers
- Exclude empty areas from text blocks
- Distinguish text fragments, titles, footers and headers in the recognized text
- Restore figures and titles
- Split and join blocks in text columns
Exclude empty areas from text blocks (Correction algorithm)
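The slide shows this step only as a picture. One plausible reading, given here purely as an assumption, is that each text block is shrunk to the bounding box of the black pixels it actually contains. The function name is hypothetical.

```python
def trim_block(image, block):
    """Shrink block to the bounding box of its black pixels.

    image: list of rows, 1 = black, 0 = white.
    block: (left, top, right, bottom), inclusive.
    Returns the trimmed block, or None if the block holds no black pixels.
    """
    left, top, right, bottom = block
    ink = [(x, y)
           for y in range(top, bottom + 1)
           for x in range(left, right + 1)
           if image[y][x] == 1]
    if not ink:
        return None
    xs = [x for x, _ in ink]
    ys = [y for _, y in ink]
    return (min(xs), min(ys), max(xs), max(ys))
```

In a layout produced by a recognition engine, the same trimming could also be driven by the recognized character rectangles rather than raw pixels.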
Block segmentation by type (Correction algorithm)
- Block segmentation by type: titles and subtitles, text fragments
- Block segmentation by style: bold, italic, underline
[Figure: sample page with blocks labeled Subtitle, Title, Subtitle, Text, Picture, Text]
Restore figures (Correction algorithm)
Correcting the initially detected picture blocks
Restore titles (Correction algorithm)
Split and merge blocks in text columns
Incorrect placement of text blocks
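The merging half of this step might look like the sketch below. It assumes that two blocks belong together when their left and right edges agree within a tolerance and the vertical gap between them is small. The tolerances and the function name are hypothetical, and splitting is not covered.

```python
def merge_column_blocks(blocks, edge_tolerance=5, max_gap=10):
    """Merge vertically stacked blocks that share a text column.

    blocks: list of (left, top, right, bottom) rectangles.
    """
    merged = []
    # Process top to bottom so a block can only join one above it.
    for block in sorted(blocks, key=lambda blk: (blk[1], blk[0])):
        l, t, r, b = block
        for i, (ml, mt, mr, mb) in enumerate(merged):
            same_column = (abs(l - ml) <= edge_tolerance
                           and abs(r - mr) <= edge_tolerance)
            if same_column and 0 <= t - mb <= max_gap:
                merged[i] = (min(l, ml), mt, max(r, mr), b)
                break
        else:
            merged.append(block)
    return merged
```

In practice the gap limit would likely be tied to the line height of the recognized text, so that a title or picture between two fragments keeps them apart.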
Comparison with FineReader 8
[Figure: two panels, FineReader 8 and ATAPY segmentation]
Comparison with FineReader 9
[Figure: two panels, FineReader 9 and ATAPY segmentation]
Comparison with FineReader 10
[Figure: two panels, FineReader 10 and ATAPY segmentation]
Summary
ATAPY Software has developed a sophisticated algorithm for newspaper page zoning that improves ABBYY FineReader recognition results and expands standard SDK functionality when digitizing newspapers and other wide-format paper media.
ABBYY Europe Developers Conference, Munich 2010
Questions?
Thank you for your attention!
Andrey Lomov
AndreyL@atapy.com