Stelae Technologies: «Extracting the Intelligence from Content»
Data Conversion: a Walkthrough...
Khemeia: the product
Product: Khemeia converts unstructured information into structured, semantically tagged content.
Used by: organizations that convert and transform content, or enrich content with metadata.
Position: unique on the market, with over 70 algorithms combining multiple analysis methodologies.
Competition: mainly solutions with a large manual workflow.
Khemeia: what does it do?
Technical
Content types: maintenance manuals, reference documents, catalogues
Inputs: PDF, ASCII, ATF, Word
Outputs: S1000D, ATA, XML, HTML, SGML, DITA
Legal
Content types: judgments, legislation, regulation, contracts
Inputs: PDF, OCR, HTML, RTF, Word
Outputs: XML, EPUB, HTML
Financial
Content types: company accounts, financial statements
Inputs: PDF, OCR, Excel, Word, HTML
Outputs: iXBRL, XBRL, customer-specific XML
Taxonomies: UK GAAP, US GAAP, Indian GAAP, Irish GAAP
Publishing
Content types: STM, newspapers, magazines, books
Inputs: PDF, Word, InDesign, QuarkXPress
Outputs: XML, EPUB, NITF, DITA, DocBook
Content conversion: the problem
Input: PDF, Word, text
Output: XML, S1000D, XBRL/iXBRL, mobile devices
Competing solutions: outsourced work or semi-automated scripts; expensive, time-consuming, error-prone, and requiring extensive QA
Khemeia: the solution
Customer benefits: enriched content for users, improved indexing, better search results
Features: cloud-based, automatic processing, rapid deployment, ultra-fast conversion times, one product for multiple content types (legal, technical), multi-language
Faster time to market: XBRL filing of company accounts; S1000D technical documentation for defense and aerospace
Khemeia: inputs and outputs
Input types include: PDF, ATF (ASCII Technical Format), OCR (optical character recognition) formats, Microsoft Word, RTF, HTML, Excel, CSV, ASCII, XML, SGML, InDesign, QuarkXPress
Output types include: XML, SGML, HTML, RDFa, PDF, JPEG, XMP, NITF, NewsML, XBRL/iXBRL, S1000D, DITA, EPUB and other e-book reader, tablet, and smartphone formats, customer-specific DTDs
Types of content
Application: Legal judgment
Input document: US District Court
Output: XML generated automatically per customer specification
Application: Technical documentation
Application: Reference information
Application: Investment bulletin
Application: Directory listing
Application: Contracts processing
<?xml version="1.0" encoding="utf-8"?>
<files>
  <agreement>channel Alliance Program Agreement </agreement>
  <effect>september 15st, 2000</effect>
  <party>masterway Telecomunicacaes Ltda</party>
  <address>rua do Ouvidor, 161 / 603, Rio de Janeiro, RJ, Brazil and Av. Brigadeiro Faria Lima, 1811 - Cjs. 1005/1010, Sao Paulo, SP, Brazil. </address>
  <region>brazil</region>
  <termination>two (2) years</termination>
  <govlaw>the State of California</govlaw>
  <date>september 15st</date>
</files>
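As an illustration of how a downstream system might consume output like the contract XML above, here is a minimal Python sketch. It assumes the flat `<files>` layout shown on the slide; the function name `contract_fields` and the abridged sample are hypothetical and not part of Khemeia.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the slide's sample output (address and date omitted).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<files>
  <agreement>channel Alliance Program Agreement </agreement>
  <effect>september 15st, 2000</effect>
  <party>masterway Telecomunicacaes Ltda</party>
  <region>brazil</region>
  <termination>two (2) years</termination>
  <govlaw>the State of California</govlaw>
</files>"""


def contract_fields(xml_bytes: bytes) -> dict:
    """Return {tag: text} for each direct child of the root element.

    Assumes every field is a direct child of the root and appears once,
    as in the sample; repeated tags would overwrite each other.
    """
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip() for child in root}


fields = contract_fields(SAMPLE)
print(fields["party"], "|", fields["govlaw"])
```

Values are passed through as extracted, so any cleanup (for example normalising capitalisation or dates) would be a separate step.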
Application: Financial accounts
Accounts output: iXBRL extract
Application: Invoice processing
Inputs: OCR, PDF
Analysis and extraction: metadata such as customer name, supplier names, product type, quantity, amount, VAT numbers, ...
Outputs: integration into ERP applications such as SAP, ...
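The kind of metadata extraction listed above can be sketched in Python as follows. This is only an illustration: the patterns, the sample invoice text, and the function name `extract_invoice_metadata` are assumptions made for the example and do not describe Khemeia's actual rules.

```python
import re
from decimal import Decimal

# Hypothetical patterns: a two-letter country prefix followed by 8-12 digits
# for VAT numbers, and an amount with two decimals after the word "total".
VAT_RE = re.compile(r"\b([A-Z]{2})\s?(\d{8,12})\b")
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]*(\d[\d,]*\.\d{2})", re.IGNORECASE)

OCR_TEXT = """Invoice 0042
Supplier VAT No: GB123456789
Quantity: 3
Total due: 1,250.00 EUR"""


def extract_invoice_metadata(text: str) -> dict:
    """Pull VAT numbers and the total amount out of OCR'd invoice text."""
    vat_numbers = [m.group(1) + m.group(2) for m in VAT_RE.finditer(text)]
    total_match = TOTAL_RE.search(text)
    total = None
    if total_match:
        # Drop thousands separators before converting to a decimal amount.
        total = Decimal(total_match.group(1).replace(",", ""))
    return {"vat_numbers": vat_numbers, "total": total}


print(extract_invoice_metadata(OCR_TEXT))
```

Real invoices vary widely in layout and number formats, so a production rule set would need per-supplier or per-locale patterns and a validation step before the values are handed to an ERP system.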
Case studies
Customer example: Aerospace
Legacy data: aircraft maintenance manuals in ASCII text and scanned paper
Khemeia processing: automated structuring into linked S1000D data modules; images linked to part numbers
Benefits: speed of deployment; automation and accuracy
Security considerations: NATO and MOD classified information
Khemeia identified as the only viable solution
Customer example: Legal publisher
Real-time information crawled from more than 40 sources (PDF, Word, HTML)
Khemeia processing: automated metadata extraction and structuring according to different XML schemas
Output delivered to content mining and CMS solutions (e.g. Temis and Documentum)
Benefits: one input, multiple outputs (archive, CMS, web publishing, content mining); real-time information publishing at 2 seconds/page; 70% cost reduction versus other solutions
Customer example: Financial statements
Company accounts in PDF (native digital and image), Word, Excel, HTML, InDesign
Khemeia processing: automated conversion of financial data to XML; XBRL taxonomy tags automatically assigned to the relevant financial values
Processed and validated by an operator using the pdf2xbrl editor, then output for filing as XBRL/iXBRL
Benefits: unique PDF-to-XBRL conversion solution; 2-4 hours of processing time per account set versus 18 hours
Behind the scenes
Workflow
Stages: Scan, OCR, Analysis, Style, Structure, Validation, QA
Scan: image creation from paper (PDF, JPEG, TIFF)
OCR: optical character recognition
Analysis: content analysis (structure, metadata)
Style: font, size, bold, italics
Structure: hierarchy, tables, images, equations, semantic tagging
Validation: XML, DTD/XML schema
QA: quality control (checking, correction, error control), with a module for quality control
Phases: Scan & OCR, then Khemeia, then QA
Khemeia: the technology is unique
Uses software algorithms that combine multiple analysis methodologies (unique on the market):
Visual analysis (font, color, size, ...)
Structure/hierarchy (e.g. titles, sub-titles, paragraphs, footnotes)
Geometric positioning (pinpoints the position of content on the page)
Keyword analysis (matches specific terms, i.e. key words or phrases)
Regular expressions (elements identified by matching specific logic against content patterns)
Integration of dictionaries/indexes (match against customer-specific taxonomies, e.g. legal terms)
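To make the idea of combining methodologies concrete, the following Python sketch applies three of the listed techniques (regular expressions, keyword analysis, and a dictionary lookup) to a single line of text. All rules, term lists, and the function name `tag_line` are hypothetical; the slides do not describe Khemeia's actual algorithms.

```python
import re

# Hypothetical rules for illustration only.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S")  # e.g. "2.1 Governing law"
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)
KEYWORDS = {"plaintiff", "defendant", "judgment"}   # keyword analysis
LEGAL_TERMS = {"governing law", "force majeure"}    # customer dictionary


def tag_line(line: str) -> dict:
    """Combine regex, keyword, and dictionary signals for one line."""
    lowered = line.lower()
    return {
        "is_heading": bool(HEADING_RE.match(line)),
        "dates": DATE_RE.findall(line),
        "keywords": sorted(
            k for k in KEYWORDS if re.search(rf"\b{re.escape(k)}\b", lowered)
        ),
        "terms": sorted(t for t in LEGAL_TERMS if t in lowered),
    }


print(tag_line("2.1 Governing law"))
print(tag_line("The plaintiff filed on 15 September 2000."))
```

Visual and geometric signals (font, size, position on the page) are left out here because they require layout information that plain text does not carry.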
Khemeia: user interface
Business problem Khemeia solves
Khemeia enables: increased productivity; improved quality and enrichment; rapid deployment times; customer cost reductions of up to 70%
For organizations that: convert and transform content; enrich content with metadata; have mainly manual workflows
Applications: conversion of content into XML; generating metadata to define, describe and enrich the content; providing indexed and searchable content; repurposing legacy XML
Khemeia partial client list
Thank you