Going digital: Challenges & solutions in a newspaper archiving project. Andrey Lomov, ATAPY Software, Russia

Problem Description
- Poor recognition results caused by low image quality: noise, white holes in characters, complicated layout, etc.
- FineReader sometimes merges neighboring newspaper columns, which results in incorrect article assembly
- Newspaper headers are misinterpreted as images because of large and irregular print
- Some figures and photos are misinterpreted as text

Zoning mistakes that may appear on specific layouts

Typical ABBYY FineReader zoning mistakes
- Not all of the picture is included in a picture block
- Some parts of the image are marked as text blocks
- Several pictures are marked as a single picture block

Target area
- Insufficient accuracy of image and text block segmentation on specific newspaper pages in FineReader applications
- Implement an analysis algorithm that helps the Engine SDK segment newspaper pages properly
- The customer's main requirement is to segment the page so that blocks can be assembled into articles using their relative position and order information

ATAPY Page Zoning Algorithm: solution principles

Intelligent image processing
- Step 1: Deskew
- Step 4: Protect figures and text from image modifications
- Step 5: Advanced garbage remover
- Step 6: Filling holes in faint characters

Preliminary image analysis
- Step 2: Search for vertical and horizontal separators, characters and figures
- Step 3 (optional): Build a grid from the separators in order to delimit further regions for analysis and recognize them in FineReader Engine

Correction algorithm
- Step 7: Recognize image in FineReader Engine
- Step 8: Check and correct resulting layout

Deskew (Intelligent image processing algorithms)
Deskew in ABBYY products:
- FREngine 8: up to 7 degrees
- FREngine 9: up to 12 degrees
- FREngine 10: up to 25 degrees
Advanced ATAPY deskew based on the Hough transform algorithm: up to 45 degrees
Slide illustration: 15°
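The slides do not show how the Hough-based deskew is implemented. As a minimal sketch of the general idea only, and not ATAPY's code, the function below estimates the skew angle from the coordinates of black pixels: for every candidate angle, each pixel votes for the line through it at that angle, and the angle whose votes concentrate in the fewest lines wins. The function name, the scoring rule (sum of squared vote counts) and the 0.5 degree step are assumptions made for illustration.

```python
import math
from collections import Counter


def estimate_skew_degrees(points, max_angle=45.0, step=0.5):
    """Estimate the slope angle (degrees) of the dominant text-line direction.

    points: iterable of (x, y) coordinates of black pixels.
    The result is the angle of lines of the form y = y0 + x * tan(angle),
    measured in the same coordinate system as the input points.
    """
    points = list(points)
    best_angle, best_score = 0.0, -1
    n_steps = int(round(2 * max_angle / step))
    for i in range(n_steps + 1):
        angle = -max_angle + i * step
        theta = math.radians(angle)
        c, s = math.cos(theta), math.sin(theta)
        # Hough accumulator for one angle: every point votes for the
        # offset (rho) of the line through it with this slope.
        votes = Counter(round(y * c - x * s) for x, y in points)
        # Votes pile up in few bins when the angle matches the text lines.
        score = sum(v * v for v in votes.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Rotating the page by the negative of the returned angle would then straighten the text lines. A production implementation would normally work on a downsampled image and a coarse-to-fine angle search to keep the cost manageable.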

Prior to grid building (Preliminary image analysis)
- Search for areas that can be identified as separators:
  - Vertical and horizontal white gaps without overlapping characters
  - Images whose height is significantly larger than their width (e.g. 10:1) and whose width has a certain minimal value
- Find black elements (thin lines) that can be interpreted as separators
- Store black lines in the resulting layout and replace them with white in the current layout
- Join adjacent separators into one
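The white-gap search and the 10:1 shape test described above can be sketched as follows. This is an illustrative sketch, not the presented implementation: the image is assumed to be a binarized page stored as a list of rows with 1 for black and 0 for white, and the function names and default thresholds are hypothetical.

```python
def find_vertical_gaps(image, min_width=2):
    """Return (first_col, last_col) ranges of columns that contain no black pixel."""
    height = len(image)
    width = len(image[0]) if image else 0
    empty = [all(image[y][x] == 0 for y in range(height)) for x in range(width)]
    gaps, start = [], None
    # A trailing False closes a gap that runs to the right edge.
    for x, is_empty in enumerate(empty + [False]):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            if x - start >= min_width:
                gaps.append((start, x - 1))
            start = None
    return gaps


def is_thin_vertical_line(width, height, min_ratio=10, min_width=1):
    """Shape test for a black element that may act as a vertical separator."""
    return width >= min_width and height >= min_ratio * width
```

Horizontal gaps can be found with the same routine applied to the transposed image, for example `find_vertical_gaps([list(col) for col in zip(*image)])`.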

Grid building steps (Preliminary image analysis)
- Remove black lines
- Build the page bounding rectangle
- Find separators crossing the borders of the bounding rectangle
- Find intercrossing separators within the rectangle
- Find pending separators (ones that do not end at a crossing with another separator) and drop them
- Make sure every separator crossing has 3 or 4 lines leaving the intersection point
- Build the resulting grid
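One way to picture the "drop pending separators" step is the sketch below. It assumes, purely for illustration, that separators are axis-aligned segments given as `('h', y, x_start, x_end)` or `('v', x, y_start, y_end)` and that an endpoint counts as anchored when it lies on the bounding rectangle or on a perpendicular separator. The representation and function names are hypothetical and are not taken from the presentation.

```python
def _endpoints(sep):
    kind, pos, a, b = sep
    return [(a, pos), (b, pos)] if kind == 'h' else [(pos, a), (pos, b)]


def _is_anchored(kind, point, others, rect):
    """True if an endpoint touches the page border or a perpendicular separator."""
    x, y = point
    left, top, right, bottom = rect
    if kind == 'h':
        if x in (left, right):
            return True
        return any(k == 'v' and pos == x and a <= y <= b for k, pos, a, b in others)
    if y in (top, bottom):
        return True
    return any(k == 'h' and pos == y and a <= x <= b for k, pos, a, b in others)


def drop_pending(separators, rect):
    """Repeatedly remove separators that have a free (unanchored) endpoint."""
    kept = list(separators)
    changed = True
    while changed:
        changed = False
        for sep in list(kept):
            others = [s for s in kept if s is not sep]
            if not all(_is_anchored(sep[0], p, others, rect) for p in _endpoints(sep)):
                kept.remove(sep)
                changed = True
    return kept
```

The loop repeats until nothing changes because removing one pending separator can leave another one without an anchor.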

Remove black lines (Preliminary image analysis)

Build separator lines (Preliminary image analysis)

Filter separators and build the grid (Preliminary image analysis)

Detect and protect figures and text from image modifications (Intelligent image processing)
- Detect figures as huge clusters of black dots, lines, etc.
- Find the figure's boundaries:
  - Left and right white gaps or black lines
  - Top and bottom white gaps or black lines
- Create a surrounding rectangle
- Detect potential characters as small clusters of black dots
- Search for the boundaries of each character
- Protect character boundaries from image modifications
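Detecting clusters of black dots and their surrounding rectangles is commonly done with connected-component labeling. The sketch below takes that approach as an assumption, since the slides do not name the technique. The image is a list of rows with 1 for black and 0 for white, and the size threshold separating "figure" from "character" is a hypothetical parameter.

```python
from collections import deque


def connected_components(image):
    """Return (left, top, right, bottom, pixel_count) for each 8-connected black cluster."""
    if not image:
        return []
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited black pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            left = right = x0
            top = bottom = y0
            count = 0
            while queue:
                x, y = queue.popleft()
                count += 1
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom, count))
    return boxes


def classify(box, figure_min_side=20):
    """Label a cluster as 'figure' or 'character' by the shorter side of its box."""
    left, top, right, bottom, _ = box
    shorter_side = min(right - left + 1, bottom - top + 1)
    return "figure" if shorter_side >= figure_min_side else "character"
```

The bounding boxes of the clusters labeled as characters are what later steps would treat as protected regions.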

Advanced garbage remover (Intelligent image processing)
- Get the garbage size from user-defined settings
- Clear unprotected areas only
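A minimal sketch of such a garbage remover, under stated assumptions: "garbage" is taken to mean a 4-connected black cluster of at most `max_garbage_pixels` pixels, and protected areas are given as inclusive `(left, top, right, bottom)` boxes. Both the definition and the names are illustrative and not taken from the slides.

```python
def remove_garbage(image, protected_boxes, max_garbage_pixels):
    """Return a copy of image with small unprotected black specks cleared."""
    height = len(image)
    width = len(image[0]) if image else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    def is_protected(x, y):
        return any(l <= x <= r and t <= y <= b for l, t, r, b in protected_boxes)

    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Collect all pixels of the cluster that contains (x0, y0).
            stack, pixels = [(x0, y0)], []
            seen[y0][x0] = True
            while stack:
                x, y = stack.pop()
                pixels.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            small = len(pixels) <= max_garbage_pixels
            # A cluster touching any protected box is left untouched.
            if small and not any(is_protected(x, y) for x, y in pixels):
                for x, y in pixels:
                    cleaned[y][x] = 0
    return cleaned
```

Protection matters for small but meaningful marks, such as the dot of an "i" or a period, which are no larger than noise specks.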

Filling holes in faint characters (Intelligent image preprocessing)
- Get the hole size from user-defined settings
- Process a protected potential character only if its size is greater than the size from the user-defined settings
- Fill holes in each area
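Hole filling can be sketched as follows, assuming that a "hole" is a white region inside a character box that does not touch the box border and has at most `max_hole_pixels` pixels. The box format, the `min_char_side` parameter and the function name are hypothetical.

```python
def fill_holes(image, char_boxes, max_hole_pixels, min_char_side=3):
    """Return a copy of image with small enclosed white regions painted black."""
    filled = [row[:] for row in image]
    for left, top, right, bottom in char_boxes:
        # Skip boxes that are too small to be treated as characters.
        if min(right - left + 1, bottom - top + 1) < min_char_side:
            continue
        seen = set()
        for y0 in range(top, bottom + 1):
            for x0 in range(left, right + 1):
                if filled[y0][x0] != 0 or (x0, y0) in seen:
                    continue
                # Flood-fill one white region, restricted to the box.
                stack, region, touches_border = [(x0, y0)], [], False
                seen.add((x0, y0))
                while stack:
                    x, y = stack.pop()
                    region.append((x, y))
                    if x in (left, right) or y in (top, bottom):
                        touches_border = True
                    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                        if (left <= nx <= right and top <= ny <= bottom
                                and filled[ny][nx] == 0 and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            stack.append((nx, ny))
                if not touches_border and len(region) <= max_hole_pixels:
                    for x, y in region:
                        filled[y][x] = 1
    return filled
```

With a generous `max_hole_pixels` this would also close the legitimate counters of letters such as "o" or "e", which is presumably why the hole size is exposed as a user setting.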

Correction algorithm (Page segmentation algorithm for specific newspapers)
- Exclude empty areas from text blocks
- Distinguish text fragments, titles, footers and headers in the recognized text
- Restore figures and titles
- Split and join blocks in text columns

Exclude empty areas from text blocks (Correction algorithm)
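The slide shows this step only as a picture. One plausible reading, offered here as an assumption, is that each text block is shrunk to the tight bounding box of the black pixels it contains. The sketch below uses inclusive `(left, top, right, bottom)` blocks on a 0/1 image, and the function name is hypothetical.

```python
def trim_block(image, block):
    """Shrink a block to the bounding box of its black pixels, or None if it is empty."""
    left, top, right, bottom = block
    rows = [y for y in range(top, bottom + 1)
            if any(image[y][x] for x in range(left, right + 1))]
    cols = [x for x in range(left, right + 1)
            if any(image[y][x] for y in range(top, bottom + 1))]
    if not rows:
        return None
    return (cols[0], rows[0], cols[-1], rows[-1])
```

A block that comes back as `None` contains no ink at all and can be dropped from the layout.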

Block segmentation by type (Correction algorithm)
Block segmentation by type:
- titles and subtitles
- text fragments
Block segmentation by style:
- bold
- italic
- underline
Labels in the slide illustration: Subtitle, Title, Subtitle, Text, Picture, Text

Restore figures (Correction algorithm)
- Correcting initially detected picture blocks

Restore titles (Correction algorithm)

Split and merge blocks in text columns
- Incorrect placement of text blocks

Comparison with FineReader 8 (slide panels: FineReader 8, ATAPY segmentation)

Comparison with FineReader 9 (slide panels: FineReader 9, ATAPY segmentation)

Comparison with FineReader 10 (slide panels: FineReader 10, ATAPY segmentation)

Summary
ATAPY Software has developed a sophisticated algorithm for newspaper page zoning that improves ABBYY FineReader recognition results and expands standard SDK functionality when digitizing newspapers and other wide-format paper media.
ABBYY Europe Developers Conference, Munich 2010

Questions? Thank you for your attention!
Andrey Lomov, AndreyL@atapy.com