Smarter Document Capture. This presentation will begin at 2:00 PM EDT (1 PM Central, 12 PM Mountain, 11 AM Pacific). Please check that the volume on your computer is on; this presentation runs through voice over IP. Until then, enjoy the sounds of silence.
AIIM Presents: Smarter Document Capture Peggy Winton VP, AIIM Market Access Ari Gross CEO, CVISION Technologies Inc. Ralph Gammon editor, Document Imaging Report
About AIIM AIIM is the community focused on providing education, research, and best practices to help organizations find, control, and optimize their information for maximum value. Learn more about AIIM at www.aiim.org.
About AIIM We offer year-round programming in: Market Education Peer Networking Industry Advocacy & Research Professional Development & Training
Smarter Document Capture Ari Gross CEO CVISION Technologies Inc.
Smart Captured Documents. Web optimization: on-demand access. Recognition: OCR, ICR, bar codes, form coding. PDF/A: reproducibility, long-term archiving. Compression: image files at electronic file sizes. Metadata: embedded field info, database independence. Color imaging: improved appearance and recognition.
Compression. Significant progress in compression technology. Scanned files can be compressed as small as the original electronically generated files. Amenable to web hosting, email, and backups. Print on demand. Example file sizes: Word document, 921 KB; scanned TIFF, 13,124 KB; standard PDF, 13,058 KB; compressed PDF, 870 KB.
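The file sizes on this slide can be turned into ratios with a few lines of arithmetic. The sketch below uses only the four sizes quoted above; the function and variable names are illustrative.

```python
# File sizes in KB, copied from the slide.
sizes_kb = {
    "Word document": 921,
    "Scanned TIFF": 13_124,
    "Standard PDF": 13_058,
    "Compressed PDF": 870,
}


def compression_ratio(original_kb, compressed_kb):
    """Return how many times smaller the compressed file is than the original."""
    return original_kb / compressed_kb


# Scanned TIFF compared with the compressed PDF.
tiff_ratio = compression_ratio(sizes_kb["Scanned TIFF"], sizes_kb["Compressed PDF"])

# Compressed PDF as a fraction of the electronically generated Word file.
share_of_word = sizes_kb["Compressed PDF"] / sizes_kb["Word document"]

print(f"Compressed PDF is {tiff_ratio:.1f}x smaller than the scanned TIFF")
print(f"Compressed PDF is {share_of_word:.0%} of the Word document's size")
```

On these numbers the compressed scan is roughly 15 times smaller than the TIFF and slightly smaller than the Word file it was printed from.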
Recognition. OCR (Optical Character Recognition): recognizes printed text. ICR (Intelligent Character Recognition): recognizes handwritten text. Barcode: identifies and reads barcodes. Form recognition: identifies the form type and extracts relevant database fields.
Metadata. Metadata insertion supports document portability, i.e., platform independence. Makes documents self-aware, e.g., re-attach dead documents. Consistent with ARMA and NARA recommendations. Useful for encoding important document information, e.g., database field data and retention policy. Automated insertion into the document management system.
Recognition (OCR, ICR): Advantage Color. [Chart: Color vs. B&W Recognition Rates. Number of words recognized (0 to 450) for each of 36 invoices; green = color invoices, blue = bitonal (B&W) invoices.] Words recognized: color invoices, 4,390; B&W invoices, 2,824; a 55% improvement.
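The 55% figure follows directly from the two word totals on the slide. A minimal check, using only those totals and the invoice count (the variable names are illustrative):

```python
# Totals reported on the slide for the same 36 invoices.
color_words = 4390
bw_words = 2824
num_invoices = 36

# Relative improvement of color over bitonal recognition.
improvement = (color_words - bw_words) / bw_words

# Average number of additional words recognized per invoice.
extra_per_invoice = (color_words - bw_words) / num_invoices

print(f"Improvement: {improvement:.0%}")
print(f"Extra words per invoice: {extra_per_invoice:.1f}")
```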
Elements of Smart Captured Documents. Smart captured documents increase the functionality of image files. Smart documents include support for web optimization, OCR, reproducibility, metadata, and auto-indexing. PDF supports smart documents. PDF/A is a restricted version of PDF (1.4), especially suited for document reproducibility and archiving. Smart captured documents result in improved corporate ROI. Smart captured documents are very compelling for Web-based, distributed database, and email applications.
Smarter Document Capture Exploring next-generation document images Ralph Gammon editor, Document Imaging Report
Traditional Document Images. Captured in centralized environments with high-speed scanners. Black-and-white TIFF, Group 4 compressed. Meta data managed through document management systems. Not considered a long-term archiving format.
Who am I? Editor of the Document Imaging Report since 1998. Premier source of news and analysis in the document capture and imaging market. Accept no advertising in print publication. Paid subscription. www.documentimagingreport.com. Publisher: RMG Enterprises.
The Potential of Document Imaging. Color scanners now available for the same price as black-and-white. Distributed capture infrastructure in place. Advanced compression methods increase usability of color images. PDF/A approved as an ISO standard.
PDF: A better file format? Stands for Portable Document Format. Introduced by Adobe in 1993. According to Adobe, more than 500 million free PDF readers have been downloaded. In 2007, Adobe submitted the PDF specs to ISO for ratification as an international standard. PDF/A (archiving) approved in 2005.
PDF: A Versatile Document Format. Supports both imaged and electronically generated files. Supports color and bi-tonal compression: Group 4, JBIG2, JPEG, JPEG 2000. Supports image segmentation and layering. Self-describing image format.
PDF: A self-defining image format. Provides a structured container for carrying important document information: full-text OCR results for searchability, and meta data such as when the document was created, who the author is, when it was scanned, what type of document it is, etc. Historically, this information has been kept in a database separate from the image. If you change image management systems, this information needs to be transitioned. Kept that way, meta data is not portable.
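As a rough illustration of meta data traveling inside the file, the sketch below formats a few fields as a PDF document information dictionary, using the PDF conventions for literal strings and dates. The field values are hypothetical, the helper names are illustrative, and a real PDF writer would also have to produce the rest of the file structure around this fragment.

```python
from datetime import datetime


def pdf_string(text):
    """Escape text as a PDF literal string (backslashes and parentheses)."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def pdf_date(moment):
    """Format a datetime in the PDF date style D:YYYYMMDDHHmmSS (time zone omitted)."""
    return moment.strftime("D:%Y%m%d%H%M%S")


def info_dictionary(fields):
    """Render key/value pairs as the body of a PDF information dictionary."""
    entries = " ".join(f"/{key} {pdf_string(value)}" for key, value in fields.items())
    return f"<< {entries} >>"


# Hypothetical values for one scanned invoice.
fields = {
    "Title": "Invoice 1047 (scanned)",
    "Author": "Accounts Payable",
    "Subject": "Invoice",
    "CreationDate": pdf_date(datetime(2008, 4, 9, 14, 0, 0)),
}

info = info_dictionary(fields)
print(info)
```

Because the fields sit in the file itself, they move with the document when it is emailed or migrated to another image management system.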
Capturing Meta Data. Several options for capturing meta data: key entry, bar codes, OCR/ICR/IDR. Meta data entry increasingly automated: improvements in OCR/ICR, voting, database look-ups, better image quality, introduction of intelligent document recognition (IDR). More meta data means more options: data mining, records management, automated workflows.
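Voting, as listed above, generally means running a field through more than one recognition engine and accepting the value most of them agree on. The sketch below shows one simple way to do that with a majority vote; the three engine outputs and the review rule are hypothetical, and commercial products may weight engines or vote character by character instead.

```python
from collections import Counter


def vote(readings):
    """Return the most common reading and the share of engines that produced it."""
    counts = Counter(readings)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(readings)


# Hypothetical outputs from three OCR engines reading the same invoice number.
# The second engine confused the digit 0 with the letter O.
readings = ["INV-20481", "INV-2O481", "INV-20481"]

value, agreement = vote(readings)

# Assumed rule for this sketch: send the field to key entry
# unless at least two thirds of the engines agree.
needs_review = agreement < 2 / 3

print(value, f"{agreement:.0%}", "review" if needs_review else "accept")
```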
PDF Compression: Why Smaller is Better. The cost of storage is falling, but it can still be significant when talking about millions of document images. When viewing on the Web, smaller files mean faster downloads and a better user experience. In distributed scanning environments, smaller files are simpler to move around. JBIG2 compression can create bi-tonal PDF files similar in size to the original electronically created files. PDF color compression techniques can create file sizes up to 100 times smaller (and higher quality) than their JPEG counterparts.
PDFs for Web Viewing. The PDF viewer is universal. PDFs can be optimized for Web viewing; this is helped by advanced compression that creates smaller files to download. Also supports multi-page files and use of intelligent downloads.
Why color images?
Why Color Document Images? Truer representation of the original. Contains more information. Adoption of color printing. Better for Web viewing. Improved recognition rates.
Why not color scanning? Color file sizes can be very large. A typical 300 dpi scanned color page represents 24 MB of raw data. Even a JPEG-compressed document image can be more than 10 times the size of its bi-tonal counterpart (400 KB for color vs. 40 KB for bi-tonal). JPEG is not optimized for image viewing.
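The 24 MB figure can be reproduced with simple arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) and 24-bit color, neither of which is stated on the slide, and reports the result in binary megabytes.

```python
def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes for a page scanned at the given resolution."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: US Letter, 8.5 x 11 inches, scanned at 300 dpi.
color_bytes = raw_size_bytes(8.5, 11, 300, 24)   # 24-bit color
bitonal_bytes = raw_size_bytes(8.5, 11, 300, 1)  # 1-bit black-and-white

print(f"Color:   {color_bytes:,} bytes ({color_bytes / 2**20:.1f} MB)")
print(f"Bitonal: {bitonal_bytes:,} bytes ({bitonal_bytes / 2**20:.1f} MB)")
```

Before any compression, the color page is 24 times the size of the bi-tonal page, simply because each pixel takes 24 bits instead of 1.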
How advanced color compression works. PDF supports MRC (Mixed Raster Content). MRC enables segmenting of a document into layers. Once segmented, those layers can be compressed separately. This enables optimal compression of each layer and file sizes 10 to 100 times smaller.
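The layering idea can be shown on a toy image. The sketch below splits a tiny RGB page into a 1-bit text mask, a foreground (ink) layer, and a background (paper) layer, then rebuilds the page from them. It is a deliberately simplified illustration: the luminance threshold is an assumed segmentation rule, and each color layer is collapsed to a single average color, whereas real MRC encoders keep low-resolution image layers and compress each with a suitable codec.

```python
def luminance(pixel):
    """Approximate perceived brightness of an (r, g, b) pixel."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def segment(image, threshold=128):
    """Split an image into a 1-bit mask plus foreground and background pixels."""
    mask = [[1 if luminance(p) < threshold else 0 for p in row] for row in image]
    foreground = [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if m]
    background = [p for row, mrow in zip(image, mask) for p, m in zip(row, mrow) if not m]
    return mask, foreground, background


def average_color(pixels):
    """Collapse a layer to one color (the extreme case of downsampling)."""
    n = len(pixels)
    return tuple(round(sum(p[i] for p in pixels) / n) for i in range(3))


def reconstruct(mask, fg_color, bg_color):
    """Rebuild the page: the mask selects ink or paper for each pixel."""
    return [[fg_color if m else bg_color for m in row] for row in mask]


PAPER = (250, 245, 230)  # cream paper
INK = (20, 30, 90)       # dark blue ink

image = [
    [PAPER, INK, INK, PAPER],
    [PAPER, INK, PAPER, PAPER],
    [PAPER, INK, INK, PAPER],
    [PAPER, PAPER, PAPER, PAPER],
]

mask, foreground, background = segment(image)
rebuilt = reconstruct(mask, average_color(foreground), average_color(background))

for row in mask:
    print("".join("#" if m else "." for m in row))
print("Identical to original:", rebuilt == image)
```

The mask carries the sharp text edges and can go to a bi-tonal codec such as JBIG2, while the smooth color layers tolerate heavy lossy compression, which is where the size savings come from.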
Lossy vs. Lossless. Lossless can be a misnomer, as any color document captured in black-and-white is losing information. Perceptually lossless images are those that, when viewed from a certain distance, appear identical to the original to human observers. While compression formats like JBIG2 and JPEG are not technically lossless, they can be classified as perceptually lossless. Best practices call for users to adjust their advanced compression settings until they are satisfied that images for a certain type of document are perceptually lossless.
PDF/A: long-term archiving format. Designed so that a PDF/A file created today will be able to be decoded by a PDF/A reader in perpetuity. Internally contains all resources necessary to be rendered. Contains provisions for meta data. Approved as an ISO standard in 2005. Has yet to gain widespread adoption, but we are starting to see some initiatives at the international and state government level. Applicable across electronic files and images. Applications are available for testing the validity of PDF/A files. The PDF/A Competence Center is dedicated to developing best practices around PDF/A adoption (www.pdfa.org).
Levels of image enhancement. Basic: deskewing, despeckling, auto-crop, blank-page removal, analog color dropout. More advanced: line removal, electronic color dropout, auto-rotation based on text, multi-streaming. Most advanced: grayscale thresholding, color segmentation.
Grayscale can be as good as color
Example of Grayscale Thresholding
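Grayscale thresholding converts a grayscale scan to a bi-tonal image by choosing a cut-off between ink and paper. The slides do not say which algorithm is used; the sketch below uses Otsu's method, a common choice that picks the gray level maximizing the separation between the dark and light groups. The sample pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that best separates dark from light pixels.

    Pixels with value <= t are treated as dark. Uses Otsu's between-class variance.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += histogram[level]
        dark_sum += level * histogram[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no dark pixels yet, nothing to separate
        if light_count == 0:
            break     # every pixel is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# Hypothetical gray levels: three ink pixels and five paper pixels.
pixels = [30, 35, 40, 200, 210, 220, 225, 230]

threshold = otsu_threshold(pixels)

# 1 marks ink (dark), 0 marks paper (light).
binary = [1 if value <= threshold else 0 for value in pixels]

print("Threshold:", threshold)
print("Bi-tonal:", binary)
```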
Summary. PDF represents a more versatile file format than TIFF or JPEG. PDF represents a smarter, self-contained file format. PDF/A represents an ISO-certified long-term image file format. Technological advances in the following areas have combined to make PDF a more attractive imaging format: compression, display, meta data capture, and scanning.
Questions? On the bottom left-hand side of your screen, type your question in the white box and click the Submit Question button.
Upcoming Webinars. April 23rd: Implement Your ECM Roadmap in 2008. May 7th: Finding Content: The best information in the world is worthless if you can't access and use it. May 14th: Shop Smart: Critical Buying Decisions for Capture. June 4th: Enterprise Report Management Can't be Overlooked. June 18th: Get Rid of Your Paper! Or not. Register today at www.aiim.org/webinars.