PDF recompression using JBIG2

Similar documents
PDF Enhancements Tools for a Digital Library

}w!"#$%&'()+,-./012345<ya

Bi-Level Image Compression

Mathematical formulae recognition and logical structure analysis of mathematical papers

Digitization g of Mathematical Journals. The beginning: Current research subjects: URL:http;// Policy:

Multimedia Networking ECE 599

}w!"#$%&'()+,-./012345<ya

SOFTWARE AND MULTIMEDIA. Chapter 6 Created by S. Cox

A New Compression Method Strictly for English Textual Data

}w!"#$%&'()+,-./012345<ya

Designing a Semantic Ground Truth for Mathematical Formulas

Stonelaw High School. Computing Science. BGE - Computer Systems

CS101 Lecture 12: Image Compression. What You ll Learn Today

Data Compression. Media Signal Processing, Presentation 2. Presented By: Jahanzeb Farooq Michael Osadebey

IMAGE COMPRESSION. Image Compression. Why? Reducing transportation times Reducing file size. A two way event - compression and decompression

Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems

So, what is data compression, and why do we need it?

More Bits and Bytes Huffman Coding

A COMPRESSION TECHNIQUES IN DIGITAL IMAGE PROCESSING - REVIEW

Features. Sequential encoding. Progressive encoding. Hierarchical encoding. Lossless encoding using a different strategy

MULTIMEDIA AND CODING

Chapter 1. Data Storage Pearson Addison-Wesley. All rights reserved

Wireless Communication

Compression of line-drawing images using vectorizing and feature-based filtering

Ch. 1.4 Histograms & Stem-&-Leaf Plots

Data Compression Fundamentals

Unicode. Standard Alphanumeric Formats. Unicode Version 2.1 BCD ASCII EBCDIC

JPEG: An Image Compression System

CS 335 Graphics and Multimedia. Image Compression

Document management applications Raster Image Transport and Storage Use of ISO (PDF/raster) Version 1.0

Refinement of digitized documents through recognition of mathematical formulae

Support Vector Machines for Mathematical Symbol Recognition

Smarter Document Capture

The Standardization process

Lecture 6 Review of Lossless Coding (II)

Advanced Topics in Curricular Accessibility: Strategies for Math and Science Accessibility

GCSE Computing. Revision Pack TWO. Data Representation Questions. Name: /113. Attempt One % Attempt Two % Attempt Three %

Dictionary Based Compression for Images

06/12/2017. Image compression. Image compression. Image compression. Image compression. Coding redundancy: image 1 has four gray levels

DYNAMIC HIERARCHICAL DICTIONARY DESIGN FOR MULTI-PAGE BINARY DOCUMENT IMAGE COMPRESSION**

DML-CZ: The objectives and the first steps

Compression; Error detection & correction

Study of Tools & Techniques for Accessing Mathematics by Partially Sighted / Visually Impaired persons. Akashdeep Bansal

Lossless Image Compression with Lossy Image Using Adaptive Prediction and Arithmetic Coding

Perceptual Coding. Lossless vs. lossy compression Perceptual models Selecting info to eliminate Quantization and entropy encoding

Digital Image Representation Image Compression

LORD PCAA LIONS Mat.Hr.Sec School, Reserve Line, Sivakasi

Bits and Bit Patterns

IMAGE COMPRESSION SYSTEMS A JPEG PERSPECTIVE

Digital Image Processing

Compression; Error detection & correction

Review for the Final

OCR J276 GCSE Computer Science

A Short Introduction to PDF

JPEG: An Image Compression System. Nimrod Peleg update: Nov. 2003

EPS Import Functionality for ReportLab

ISO/TR TECHNICAL REPORT. Document management Electronic imaging Guidance for the selection of document image compression methods

Engineering Mathematics II Lecture 16 Compression

JPEG IP core (KJN series)

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

Experiments in Mathematical Web Animation

Multimedia Communications ECE 728 (Data Compression)

An introduction to JPEG compression using MATLAB

CS 206 Introduction to Computer Science II

CMPT 365 Multimedia Systems. Media Compression - Image

RLG Model Request for Information (RFI) for Digital Imaging Services

Jianhui Zhang, Ph.D., Associate Prof. College of Computer Science and Technology, Hangzhou Dianzi Univ.

Report on DML-CZ project

EE67I Multimedia Communication Systems Lecture 4

EFFICIENT IMAGE COMPRESSION AND DECOMPRESSION ALGORITHMS FOR OCR SYSTEMS. Boban Arizanović, Vladan Vučković

GUJARAT TECHNOLOGICAL UNIVERSITY

Analysis of Parallelization Effects on Textual Data Compression

Procedural Compression: Efficient, Low Bandwidth Remote Android Graphics

JPEG Joint Photographic Experts Group ISO/IEC JTC1/SC29/WG1 Still image compression standard Features

Xuena Bao, Dajiang Zhou, Peilin Liu, and Satoshi Goto, Fellow, IEEE

Lossy compression CSCI 470: Web Science Keith Vertanen Copyright 2013

Compression Part 2 Lossy Image Compression (JPEG) Norm Zeck

DCT Based, Lossy Still Image Compression

Encoding. A thesis submitted to the Graduate School of University of Cincinnati in

OPTIMIZING PDFS WITH ACROBAT PRO 8

Use of GeoGebra in teaching about central tendency and spread variability

Comparative Study of Dictionary based Compression Algorithms on Text Data

DIS: Design and imaging software

Embedded lossless audio coding using linear prediction and cascade coding

Compression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:

From Electronical Questionnaires to Accessible Maths on Web

CSCD 443/533 Advanced Networks Fall 2017

End-to-End Data. Presentation Formatting. Difficulties. Outline Formatting Compression

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system: An entry level JPEG 2000 encoder

MPEG-4 Structured Audio Systems

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

Elementary Computing CSC 100. M. Cheng, Computer Science

MULTIMEDIA COMMUNICATION

Data Compression Scheme of Dynamic Huffman Code for Different Languages

Data Compression. An overview of Compression. Multimedia Systems and Applications. Binary Image Compression. Binary Image Compression

What is Publisher, anyway?

A New Progressive Lossy-to-Lossless Coding Method for 2.5-D Triangle Mesh with Arbitrary Connectivity

Mathematical Document Representation and Processing

The Best-Performance Digital Video Recorder JPEG2000 DVR V.S M-PEG & MPEG4(H.264)

Integrated OCR software for mathematical documents and its output with accessibility

Transcription:

PDF recompression using JBIG2 Final identity with strapline (stacked) Radim Hatlapatka <208155@mail.muni.cz> 17th March 2010

Outline Motivation Standard JBIG2 PDF recompression Jbig2enc and leptonica Current achievements Planned next steps Conclusion

Motivation of this project decrease size of PDF documents improvement of perceptually lossless compression of encoder jbig2enc by removing flyspecks applying to documents containing scanned pages containing mathematical expressions (on DML-CZ and EuDML)

Motivation of this project decrease size of PDF documents improvement of perceptually lossless compression of encoder jbig2enc by removing flyspecks applying to documents containing scanned pages containing mathematical expressions (on DML-CZ and EuDML)

Motivation of this project decrease size of PDF documents improvement of perceptually lossless compression of encoder jbig2enc by removing flyspecks applying to documents containing scanned pages containing mathematical expressions (on DML-CZ and EuDML)

JBIG2 standard (ISO/IEC 14492) for compression of bi-level images with a very good compression ratio designed to compress text in images, i.e. scanned text, faxes supports compression of multipage documents by using global dictionary supports both lossless and lossy compression (also perceptually lossless compression) segments each page to several regions, they are typically three (text region, halftone region and generic region) apply specific coding for each region (modifications of arithmetic or Huffman coding) supported in PDF since version 1.4

JBIG2 standard (ISO/IEC 14492) for compression of bi-level images with a very good compression ratio designed to compress text in images, i.e. scanned text, faxes supports compression of multipage documents by using global dictionary supports both lossless and lossy compression (also perceptually lossless compression) segments each page to several regions, they are typically three (text region, halftone region and generic region) apply specific coding for each region (modifications of arithmetic or Huffman coding) supported in PDF since version 1.4

JBIG2 standard (ISO/IEC 14492) for compression of bi-level images with a very good compression ratio designed to compress text in images, i.e. scanned text, faxes supports compression of multipage documents by using global dictionary supports both lossless and lossy compression (also perceptually lossless compression) segments each page to several regions, they are typically three (text region, halftone region and generic region) apply specific coding for each region (modifications of arithmetic or Huffman coding) supported in PDF since version 1.4

JBIG2 standard (ISO/IEC 14492) for compression of bi-level images with a very good compression ratio designed to compress text in images, i.e. scanned text, faxes supports compression of multipage documents by using global dictionary supports both lossless and lossy compression (also perceptually lossless compression) segments each page to several regions, they are typically three (text region, halftone region and generic region) apply specific coding for each region (modifications of arithmetic or Huffman coding) supported in PDF since version 1.4

JBIG2 standard (ISO/IEC 14492) for compression of bi-level images with a very good compression ratio designed to compress text in images, i.e. scanned text, faxes supports compression of multipage documents by using global dictionary supports both lossless and lossy compression (also perceptually lossless compression) segments each page to several regions, they are typically three (text region, halftone region and generic region) apply specific coding for each region (modifications of arithmetic or Huffman coding) supported in PDF since version 1.4

JBIG2 standard (ISO/IEC 14492) for compression of bi-level images with a very good compression ratio designed to compress text in images, i.e. scanned text, faxes supports compression of multipage documents by using global dictionary supports both lossless and lossy compression (also perceptually lossless compression) segments each page to several regions, they are typically three (text region, halftone region and generic region) apply specific coding for each region (modifications of arithmetic or Huffman coding) supported in PDF since version 1.4

JBIG2 standard (ISO/IEC 14492) for compression of bi-level images with a very good compression ratio designed to compress text in images, i.e. scanned text, faxes supports compression of multipage documents by using global dictionary supports both lossless and lossy compression (also perceptually lossless compression) segments each page to several regions, they are typically three (text region, halftone region and generic region) apply specific coding for each region (modifications of arithmetic or Huffman coding) supported in PDF since version 1.4

What is halftone Figure: Picture of how halftone works 1 1 Extracted from <http://encyclopedia.tfd.com/halftone> 8. 2. 2010

Example of how is saved JBIG2 image in PDF 2 0 obj «/DecodeParms «/JBIG2Globals 1 0 R» /Width 3265 /BitsPerComponent 1 /Height 4911 /Filter /JBIG2Decode /Subtype /Image /Length 4582 /ColorSpace /DeviceGray /Type /XObject» stream... endstream

PDF recompressor idea in steps: 1 extraction of image bit streams and conversion to format accepted by encoder jbig2enc 2 invoking of encoder on extracted images 3 replacing images by their compressed version according to JBIG2 standard input of concrete image input of global data and making reference to it (global data put only once) used libraries PDFBox used for extraction of images (extraction including conversion) IText used for decryption of PDF document and replacing original images their compressed version

PDF recompressor idea in steps: 1 extraction of image bit streams and conversion to format accepted by encoder jbig2enc 2 invoking of encoder on extracted images 3 replacing images by their compressed version according to JBIG2 standard input of concrete image input of global data and making reference to it (global data put only once) used libraries PDFBox used for extraction of images (extraction including conversion) IText used for decryption of PDF document and replacing original images their compressed version

PDF recompressor idea in steps: 1 extraction of image bit streams and conversion to format accepted by encoder jbig2enc 2 invoking of encoder on extracted images 3 replacing images by their compressed version according to JBIG2 standard input of concrete image input of global data and making reference to it (global data put only once) used libraries PDFBox used for extraction of images (extraction including conversion) IText used for decryption of PDF document and replacing original images their compressed version

PDF recompressor idea in steps: 1 extraction of image bit streams and conversion to format accepted by encoder jbig2enc 2 invoking of encoder on extracted images 3 replacing images by their compressed version according to JBIG2 standard input of concrete image input of global data and making reference to it (global data put only once) used libraries PDFBox used for extraction of images (extraction including conversion) IText used for decryption of PDF document and replacing original images their compressed version

PDF recompressor idea in steps: 1 extraction of image bit streams and conversion to format accepted by encoder jbig2enc 2 invoking of encoder on extracted images 3 replacing images by their compressed version according to JBIG2 standard input of concrete image input of global data and making reference to it (global data put only once) used libraries PDFBox used for extraction of images (extraction including conversion) IText used for decryption of PDF document and replacing original images their compressed version

PDF recompressor idea in steps: 1 extraction of image bit streams and conversion to format accepted by encoder jbig2enc 2 invoking of encoder on extracted images 3 replacing images by their compressed version according to JBIG2 standard input of concrete image input of global data and making reference to it (global data put only once) used libraries PDFBox used for extraction of images (extraction including conversion) IText used for decryption of PDF document and replacing original images their compressed version

Jbig2enc and leptonica open-source JBIG2 encoder [4] uses open-source library leptonica [1] for manipulation with images and bitmaps of symbols supports only arithmetic coding

Jbig2enc and leptonica open-source JBIG2 encoder [4] uses open-source library leptonica [1] for manipulation with images and bitmaps of symbols supports only arithmetic coding

Jbig2enc and leptonica open-source JBIG2 encoder [4] uses open-source library leptonica [1] for manipulation with images and bitmaps of symbols supports only arithmetic coding

Modification of jbig2enc comparision of all templates (representative symbols) with the same size for finding equivalence two templates are considered equivalent if there is not found big enough accumulation of differences we look for accumulations in shapes such as points or lines unification of two equivalent symbols to one

Modification of jbig2enc comparision of all templates (representative symbols) with the same size for finding equivalence two templates are considered equivalent if there is not found big enough accumulation of differences we look for accumulations in shapes such as points or lines unification of two equivalent symbols to one

Preliminary results These are average results from aplying to 810 documents Figure: Comparision of reductions in size through different number of pages

Problematic comparision Figure: Example of problematic comparision

Current and future steps OCR tools and techniques interface for calling InftyReader 1 that will return up to ten results containing recognised letter with score (accuracy of a hit) applying some other procedures on results that we got like featured vectors to decide which to choose heuristics to decrease number of compared symbols suggestions are welcome PDF recompressor improve posibilities such as recompression of images already compressed according to JBIG2 standard combination with pdfsizeopt.py [6] developed by Péter Szábo (Google) 1 info about InftyProject can be found at <http://www.inftyproject.org/>

Current and future steps OCR tools and techniques interface for calling InftyReader 1 that will return up to ten results containing recognised letter with score (accuracy of a hit) applying some other procedures on results that we got like featured vectors to decide which to choose heuristics to decrease number of compared symbols suggestions are welcome PDF recompressor improve posibilities such as recompression of images already compressed according to JBIG2 standard combination with pdfsizeopt.py [6] developed by Péter Szábo (Google) 1 info about InftyProject can be found at <http://www.inftyproject.org/>

Current and future steps OCR tools and techniques interface for calling InftyReader 1 that will return up to ten results containing recognised letter with score (accuracy of a hit) applying some other procedures on results that we got like featured vectors to decide which to choose heuristics to decrease number of compared symbols suggestions are welcome PDF recompressor improve posibilities such as recompression of images already compressed according to JBIG2 standard combination with pdfsizeopt.py [6] developed by Péter Szábo (Google) 1 info about InftyProject can be found at <http://www.inftyproject.org/>

Current and future steps OCR tools and techniques interface for calling InftyReader 1 that will return up to ten results containing recognised letter with score (accuracy of a hit) applying some other procedures on results that we got like featured vectors to decide which to choose heuristics to decrease number of compared symbols suggestions are welcome PDF recompressor improve posibilities such as recompression of images already compressed according to JBIG2 standard combination with pdfsizeopt.py [6] developed by Péter Szábo (Google) 1 info about InftyProject can be found at <http://www.inftyproject.org/>

Conclussion prototype is already functional and tested at sample taken from data on DML-CZ still much to do to achieve optimal compression ratio of perceptually lossless compression suggestions are welcome

Questions?

References Dan Bloomberg: Leptonica. <http://www.leptonica.com/>. R. Hatlapatka: JBIG2 compression. Brno 2010. Bachaleor thesis at Faculty of Informatics MU. Advisor doc. RNDr. Petr Sojka, Ph.D. http://is.muni.cz/th/208155/fi_b/. R. Hatlapatka: Websites of the project PDF recompression. http://www.fi.muni.cz/~xhatlap/pdfrecompression.html. Adam Langley: Jbig2enc. <http://github.com/agl/jbig2enc/>. Masakazu Suzuki: Infty project. Faculty of Mathematics and Graduate School of Mathematics Kyushu University, 2008. <http://www.inftyproject.org/>. Péter Szábo: Optimizing PDF output size of TEX documents. <http://code.google.com/p/pdfsizeopt/>.