Stelae Technologies ... Data Conversion a Walkthrough. «Extracting the Intelligence from Content»


Transcription:


Stelae Technologies
«Extracting the Intelligence from Content»
Data Conversion: a Walkthrough

Khemeia: the product

Product: Khemeia converts unstructured information into structured, semantically tagged content.
Used by organizations that:
- convert and transform content
- enrich content with metadata
Position: unique on the market, with over 70 algorithms combining multiple analysis methodologies.
Competition: mainly solutions with a large manual workflow.

Khemeia: what does it do?

Technical
- Content types: maintenance manuals, reference documents, catalogues
- Inputs: PDF, ASCII, ATF, Word
- Outputs: S1000D, ATA, XML, HTML, SGML, DITA

Legal
- Content types: judgments, legislation, regulation, contracts
- Inputs: PDF, OCR, HTML, RTF, Word
- Outputs: XML, EPUB, HTML

Financial
- Content types: company accounts, financial statements
- Inputs: PDF, OCR, Excel, Word, HTML
- Outputs: iXBRL, XBRL, customer-specific XML
- Taxonomy: UK GAAP, US GAAP, Indian GAAP, Irish GAAP

Publishing
- Content types: STM, newspapers, magazines, books
- Inputs: PDF, Word, InDesign, QuarkXPress
- Outputs: XML, EPUB, NITF, DITA, DocBook

Content conversion: the problem

Input: PDF, Word, text
Output: XML, S1000D, XBRL/iXBRL, mobile devices

Competing solutions:
- Outsourced
- Semi-automated scripts
- Expensive
- Time consuming
- Error prone
- Extensive QA required

Khemeia: the solution

Customer benefits:
- Enriched content for users
- Improved indexing
- Better search results

Features:
- Cloud based
- Automatic processing
- Rapid deployment
- Ultra-fast conversion times
- One product for multiple content types (legal, technical)
- Multi-language

Faster speed to market:
- XBRL filing of company accounts
- S1000D technical documentation for defense and aerospace

Khemeia: inputs and outputs

Input types include:
- PDF
- ATF (ASCII Technical Format)
- OCR (optical character recognition) formats
- Microsoft Word
- RTF
- HTML
- Excel
- CSV
- ASCII
- XML
- SGML
- InDesign, QuarkXPress

Output types include:
- XML
- SGML
- HTML
- RDFa
- PDF
- JPEG
- XMP
- NITF, NewsML
- XBRL / iXBRL
- S1000D
- DITA
- EPUB, e-book reader, tablet, and smartphone formats
- Customer-specific DTDs

Types of content

Application: Legal judgment

Input document: US District Court
Output: XML generated automatically per customer specification

Application: Technical documentation

Application: Reference information

Application: Investment bulletin

Application: Directory listing

Application: Contracts processing

<?xml version="1.0" encoding="utf-8"?>
<files>
  <agreement>Channel Alliance Program Agreement</agreement>
  <effect>September 15st, 2000</effect>
  <party>Masterway Telecomunicacaes Ltda</party>
  <address>Rua do Ouvidor, 161 / 603, Rio de Janeiro, RJ, Brazil and Av. Brigadeiro Faria Lima, 1811 - Cjs. 1005/1010, Sao Paulo, SP, Brazil.</address>
  <region>Brazil</region>
  <termination>two (2) years</termination>
  <govlaw>the State of California</govlaw>
  <date>September 15st</date>
</files>
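To illustrate how a downstream system might consume contract output like the extract above, here is a minimal Python sketch. It assumes the flat `<files>` layout shown on the slide; the function name `contract_fields` is hypothetical, only a few of the fields are reproduced, and nothing here is part of Khemeia itself.

```python
import xml.etree.ElementTree as ET

# A few fields copied from the slide's sample output (assumed flat layout).
CONTRACT_XML = """<?xml version="1.0" encoding="utf-8"?>
<files>
  <agreement>Channel Alliance Program Agreement</agreement>
  <party>Masterway Telecomunicacaes Ltda</party>
  <region>Brazil</region>
  <termination>two (2) years</termination>
  <govlaw>the State of California</govlaw>
</files>"""


def contract_fields(xml_text):
    """Return a dict mapping each child tag of the root to its stripped text."""
    # Parse from bytes so the encoding declaration is honoured by the parser.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {child.tag: (child.text or "").strip() for child in root}


fields = contract_fields(CONTRACT_XML)
print(fields["party"], "/", fields["govlaw"])
```

Because every extracted item is a named element, a consumer such as a contract database can look fields up by tag instead of re-reading the source document.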

Application: Financial accounts

Accounts output: iXBRL extract

Application: Invoice processing

Inputs: OCR, PDF
Analysis and extraction of metadata: customer name, supplier names, product type, quantity, amount, VAT numbers, ...
Outputs: integration into ERP applications (SAP, ...)
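The kind of metadata extraction listed above can be sketched with plain regular expressions. This is only an illustration under stated assumptions: the invoice text, the field labels, and the simplified VAT pattern (two letters followed by nine digits) are all hypothetical, and the slides do not describe how Khemeia implements this step.

```python
import re

# Hypothetical OCR text of an invoice; labels and values are invented.
SAMPLE_INVOICE = """\
Supplier: Example Parts Ltd
Customer: Sample Aviation plc
VAT No: GB123456789
Product: Hydraulic seal
Quantity: 12
Amount: 1,450.00
"""

# One pattern per metadata field; each captures the value after the label.
FIELD_PATTERNS = {
    "supplier": re.compile(r"^Supplier:\s*(.+)$", re.MULTILINE),
    "customer": re.compile(r"^Customer:\s*(.+)$", re.MULTILINE),
    "vat_number": re.compile(r"^VAT No:\s*([A-Z]{2}\d{9})$", re.MULTILINE),
    "product": re.compile(r"^Product:\s*(.+)$", re.MULTILINE),
    "quantity": re.compile(r"^Quantity:\s*(\d+)$", re.MULTILINE),
    "amount": re.compile(r"^Amount:\s*([\d,]+\.\d{2})$", re.MULTILINE),
}


def extract_invoice_metadata(text):
    """Return a dict of field name to extracted string, or None if not found."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


metadata = extract_invoice_metadata(SAMPLE_INVOICE)
print(metadata)
```

Fields that come back as `None` are natural candidates for manual review before the record is passed on to an ERP system.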

Case studies

Customer example: Aerospace

Legacy data:
- Aircraft maintenance manuals
- ASCII text and scanned paper

Khemeia processing:
- Automated structuring into linked S1000D data modules
- Images linked to part numbers

Benefits:
- Speed of deployment
- Automation and accuracy
- Security considerations: NATO & MOD classified information
- Khemeia identified as the only viable solution

Customer example: Legal publisher

- Real-time information crawled from over 40 sources (PDF, Word, HTML)
- Khemeia automated metadata extraction and structuring according to different XML schemas
- Output delivered to content mining and CMS solutions (e.g. Temis & Documentum)

Benefits:
- One input, multiple outputs (archive, CMS, web publishing, content mining)
- Real-time information publishing: 2 seconds/page
- 70% cost reduction versus other solutions

Customer example: Financial statements

- Company accounts in PDF (native digital & image), Word, Excel, HTML, InDesign
- Khemeia automated conversion of financial data to XML
- XBRL taxonomy tags automatically queued to relevant financial values
- Processed and validated by an operator utilizing the pdf2xbrl editor, and output for filing as XBRL/iXBRL

Benefits:
- Unique PDF to XBRL conversion solution
- 2-4 hours of processing time per account set versus 18 hours
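As a rough illustration of attaching taxonomy tags to financial values and leaving the rest for an operator, consider the sketch below. The label-to-tag table and the `example:` tag names are hypothetical placeholders, not real taxonomy concepts, and the matching logic is an assumption rather than a description of Khemeia or the pdf2xbrl editor.

```python
from decimal import Decimal

# Hypothetical label-to-tag table; real concept names come from the chosen taxonomy.
LABEL_TO_TAG = {
    "turnover": "example:Turnover",
    "cost of sales": "example:CostOfSales",
    "gross profit": "example:GrossProfit",
}


def parse_amount(raw):
    """Convert '1,250,000' or '(300,000)' to Decimal; brackets mean negative."""
    negative = raw.startswith("(") and raw.endswith(")")
    value = Decimal(raw.strip("()").replace(",", ""))
    return -value if negative else value


def tag_line_items(rows):
    """Split (label, amount) rows into tagged values and labels needing review."""
    tagged, untagged = [], []
    for label, raw in rows:
        tag = LABEL_TO_TAG.get(label.strip().lower())
        if tag is None:
            untagged.append(label)  # left for the operator to resolve
        else:
            tagged.append((tag, parse_amount(raw)))
    return tagged, untagged


rows = [("Turnover", "1,250,000"), ("Cost of sales", "(300,000)"), ("Other income", "5,000")]
tagged, untagged = tag_line_items(rows)
print(tagged, untagged)
```

Keeping unmatched labels in a separate list mirrors the workflow on the slide, where automatic tagging is followed by operator validation.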

Behind the scenes

Workflow

Stages: Scan, OCR, Analysis, Style, Structure, Validation, QA

- Scan: image creation from paper (PDF, JPEG, TIFF)
- OCR: optical character recognition
- Analysis: content analysis, structure, metadata
- Style: font, size, bold, italics
- Structure: hierarchy, tables, images, equations
- Validation: semantic tagging, XML, DTD/XML schema
- QA: quality control, checking, correction

Error control module for quality control

Phases: Scan & OCR, Khemeia, QA
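The middle of this workflow (style, structure, validation) can be sketched as a chain of small functions. This is a toy illustration only: the input lines, the font-size rule, and the element names `title`, `heading`, and `para` are assumptions, and the validation step checks well-formedness only, whereas the slide refers to validation against a DTD or XML schema.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical OCR output: (text, font size in points, bold flag).
OCR_LINES = [
    ("Maintenance Manual", 18, True),
    ("Hydraulic system", 14, True),
    ("Check fluid level & pressure before each flight.", 10, False),
]


def style_stage(lines):
    """Style: assign a role to each line from its font size."""
    sizes = [size for _, size, _ in lines]
    largest, smallest = max(sizes), min(sizes)
    styled = []
    for text, size, _bold in lines:
        if size == largest:
            role = "title"
        elif size > smallest:
            role = "heading"
        else:
            role = "para"
        styled.append((role, text))
    return styled


def structure_stage(styled):
    """Structure: emit one XML element per line, escaping special characters."""
    parts = ["<doc>"]
    parts.extend(f"  <{role}>{escape(text)}</{role}>" for role, text in styled)
    parts.append("</doc>")
    return "\n".join(parts)


def validation_stage(xml_text):
    """Validation: raise ParseError if the XML is not well-formed."""
    ET.fromstring(xml_text)
    return xml_text


def convert(lines):
    return validation_stage(structure_stage(style_stage(lines)))


xml_output = convert(OCR_LINES)
print(xml_output)
```

Each stage takes the previous stage's output, so a failure in validation points back to a specific earlier step that can then be routed to QA.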

Khemeia: the technology is unique

Utilizes software algorithms that combine multiple analysis methodologies (unique on the market):
- Visual analysis (font, color, size, ...)
- Structure/hierarchy (e.g. titles, sub-titles, paragraphs, footnotes, etc.)
- Geometric positioning (pinpoints the position of content on the page)
- Keyword analysis (matches specific terms, i.e. key words or phrases)
- Regular expressions (elements identified by matching specific logic against content patterns)
- Integration of dictionaries/indexes (match against customer-specific taxonomies, e.g. legal terms)
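Three of the methodologies listed above (keyword analysis, regular expressions, and dictionary lookup) can be combined in a few lines. The sketch below is a simplified illustration: the keyword list, the date pattern, and the legal-term dictionary are hypothetical, and the slides do not disclose how Khemeia's own algorithms work.

```python
import re

# Hypothetical customer dictionary mapping legal terms to categories.
LEGAL_TERMS = {
    "plaintiff": "party-role",
    "defendant": "party-role",
    "injunction": "remedy",
}

# Hypothetical keyword list for keyword analysis.
KEYWORDS = ("governing law", "termination")

# Regular expression for dates such as "15 September 2000".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def analyse(text):
    """Return (method, value) pairs found by each analysis method in turn."""
    findings = []
    lowered = text.lower()
    for keyword in KEYWORDS:
        if keyword in lowered:
            findings.append(("keyword", keyword))
    for match in DATE_PATTERN.finditer(text):
        findings.append(("regex:date", match.group(0)))
    for word in re.findall(r"[a-z]+", lowered):
        if word in LEGAL_TERMS:
            findings.append((f"dictionary:{LEGAL_TERMS[word]}", word))
    return findings


sentence = "The defendant may seek an injunction after termination on 15 September 2000."
findings = analyse(sentence)
print(findings)
```

Recording which method produced each finding makes it possible to weigh or cross-check the methods against one another, which is the point of combining them.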

Khemeia: User Interface

Business problem Khemeia solves

Khemeia enables:
- Increased productivity
- Improved quality and enrichment
- Rapid deployment times
- Customer cost reductions of up to 70%

For organizations that:
- Convert and transform content
- Enrich content with metadata
- Have mainly manual workflows

Applications:
- Conversion of content into XML
- Generating metadata to define, describe and enrich the content
- Providing indexed and searchable content
- Repurposing legacy XML

Khemeia partial client list

Thank you