Data Hiding in Binary Text Documents (see Footnote 1)

Q. Mei, E. K. Wong, and N. Memon


Department of Computer and Information Science
Polytechnic University
5 Metrotech Center, Brooklyn, NY 11201

ABSTRACT

With the proliferation of digital media such as digital images, digital audio, and digital video, robust digital watermarking and data hiding techniques are needed for copyright protection, copy control, annotation, and authentication. While many techniques have been proposed for digital color and grayscale images, not all of them can be directly applied to binary text images. The difficulty is that changing pixel values in a binary document can introduce irregularities that are visually very noticeable. We propose a new method for data hiding in binary text documents by embedding data in the 8-connected boundary of a character. We have identified a fixed set of pairs of five-pixel-long boundary patterns for embedding data. One of the patterns in a pair requires deletion of the center foreground pixel, whereas the other requires the addition of a foreground pixel. A unique property of the proposed method is that the two patterns in each pair are duals of each other: changing the pixel value of one pattern at the center position results in the other. This property allows easy detection of the embedded data without referring to the original document, and without using any special enforcing techniques for detecting embedded data.

Keywords: data hiding, watermarking, binary text documents, authentication

1. Introduction

As digital devices such as scanners and digital cameras become more widely available, and mass storage media for digital data become more affordable, the use of digital images in practical applications is becoming more widespread. Practical image applications range from famous works of art and bank checks to medical images depicting an obscure disease.
Reliable methods for copyright protection, copy control, annotation, and authentication are therefore needed. A variety of digital watermarking and data hiding techniques have been proposed for such purposes. However, most of the methods developed to date are for grayscale and color images [1], in which the gray level or color value of a selected group of pixels is changed by a small amount without causing visually noticeable artifacts. These techniques cannot be directly applied to binary images, where each pixel has a value of either 0 or 1. Arbitrarily changing pixels in a binary image causes very noticeable artifacts (see Figure 1 for an example). A different class of embedding techniques must therefore be developed for binary images. This has important applications in a wide variety of document images that are represented as binary foreground and background, e.g., text documents.

There has been limited work on watermarking and data hiding in binary images. In [2], the input binary image is divided into 3 x 3 (or larger) blocks. The flipping priorities of the pixels in a 3 x 3 block are then computed, and the pixels with the lowest scores are the ones changed. The flipping priority of a pixel indicates the estimated visual distortion that would be caused by flipping its value from 0 to 1 or from 1 to 0. Data are embedded by making the total number of black pixels in a block either odd or even. Shuffling is used to equalize the uneven embedding capacity. The data hiding technique is not robust to printing and scanning and is hence suitable only for steganography and authentication (fragile watermarking) applications. In [3], data is embedded in text documents by shifting line and word spacing. This approach has low embedding capacity, but the embedded data is robust to photocopying, scanning, and printing. In [4], the difference between the average widths of character strokes extracted from two sets of symmetrically arranged partitions is used to embed data. This method can only be applied to documents containing characters. In [5], data is embedded in dithered images by changing the dithering patterns, and in fax images by changing the run-lengths. The method cannot be applied to general binary images but is claimed to be robust to printing and scanning.

In this paper, we propose a new method for data hiding in binary text document images by changing pixel values along the non-smooth portions of character boundaries. The method could also be applied to other types of binary images that contain connected components. Our method uses an efficient table lookup procedure to determine the boundary patterns used to embed data. Like the method in [2], our technique is not robust to printing and scanning and hence is useful only in steganography and authentication applications.

The rest of this paper is organized as follows. In Section 2, we describe our proposed method for embedding and extracting data. In Section 3, we present some experimental results. In Section 4, we give our conclusions.

Footnote 1: This work was supported in part by NSF REU Grant #9619749 made to Polytechnic University. Corresponding author: wong@poly.edu

2. Proposed Approach

In this section, we present our proposed data hiding technique, which embeds information bits along character boundaries.

2.1 Boundary Patterns Selection

In our proposed approach, data are embedded in the 8-connected boundary of a character. We assume that the input text document is in binary image form, or has been converted into a binary image from a grayscale or color document.
We have identified 100 pairs of five-pixel-long boundary patterns for embedding data. One of the patterns in a pair requires addition of a foreground pixel adjacent to the center pixel, whereas the other requires deletion of the center foreground pixel. For convenience, we refer to these operations as the Add and Delete operations, and call the two types of patterns A-patterns (Add) and D-patterns (Delete), respectively. A unique property of the chosen patterns is that the A- and D-patterns in each pair are duals of each other; that is, changing the center pixel of one pattern results in the other pattern. We refer to this operation as flipping the pattern: flipping an A-pattern results in a D-pattern, and vice versa. Figure 2 shows 28 example dual pairs from the 100 pairs. In the figure, the black pixels are boundary pixels, and the gray pixels are foreground object pixels. The A-patterns encode information bit 1, and the D-patterns encode information bit 0. The duality property allows easy detection of the embedded data without referring to the original document, and without using any special enforcing techniques in the detection process.

In obtaining the 100 boundary pattern pairs, the goal is to preserve the overall shape of a character, and to minimize noticeable artifacts and distortions around the boundary after embedding data. First, we assume that each of the five consecutive boundary pixels does not touch (as an 8-connected neighbor) any pixel in the boundary segment other than the one immediately preceding or following it. This is a reasonable assumption for the boundary of a character or connected component of reasonable size (in terms of pixel count). We start out with the set of all possible five-pixel-long boundary patterns satisfying this requirement.
The following types of boundary patterns are then eliminated from the initial set:
(a) Boundary segments that do not preserve length after Add or Delete operations
(b) Straight line segments
(c) Boundary segments with a 90-degree corner
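Categories (b) and (c) can be illustrated with a small geometric test. The following is a minimal sketch, assuming a segment is given as five (row, col) boundary pixels in traversal order; the function names and the exact corner criterion (two straight two-step arms meeting perpendicularly at the center pixel) are assumptions made for illustration and are not taken from the paper. Category (a) depends on the Add and Delete operations themselves and is not covered here.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, col)


def steps(segment: List[Pixel]) -> List[Tuple[int, int]]:
    """Return the four step vectors between the five consecutive pixels."""
    return [(r2 - r1, c2 - c1)
            for (r1, c1), (r2, c2) in zip(segment, segment[1:])]


def is_straight(segment: List[Pixel]) -> bool:
    """Category (b): all steps identical (horizontal, vertical, or diagonal)."""
    s = steps(segment)
    return all(step == s[0] for step in s)


def is_right_angle_corner(segment: List[Pixel]) -> bool:
    """Category (c), under an ASSUMED criterion: both two-step arms are
    straight and the arm directions are perpendicular at the center pixel."""
    s = steps(segment)
    arms_straight = s[0] == s[1] and s[2] == s[3]
    (ar, ac), (br, bc) = s[1], s[2]
    return arms_straight and (ar * br + ac * bc == 0)


def keep_candidate(segment: List[Pixel]) -> bool:
    """True if the segment survives elimination rules (b) and (c)."""
    return not is_straight(segment) and not is_right_angle_corner(segment)
```

For example, a horizontal run of five pixels is rejected by `is_straight`, while a segment with a single one-pixel step in it passes both tests and would remain a candidate.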

Eliminating the boundary segments in category (a) ensures that a boundary segment remains five pixels long after an Add or Delete operation. This allows the embedded data to be extracted from the same fixed five-pixel partitioning of the boundary, without using a special enforcing procedure. Straight-line segments are eliminated because they no longer look straight after a pixel is added or deleted at the center position, and the change may become noticeable. Here, we eliminated five-pixel-long horizontal, vertical, and diagonal straight-line segments. Finally, boundary segments that form a 90-degree corner at the center pixel are eliminated, because they no longer look like a sharp 90-degree corner after a pixel is added or deleted at the center position.

2.2 Embedding and Extracting Data

The 100 pairs of boundary patterns are stored in a lookup table called the pattern table. In the embedding process, the input image is scanned left to right and top to bottom to extract all connected components, which correspond to characters or other symbols in a text document. For each connected component, the first upper-left foreground pixel encountered in the scanning process is used as the starting pixel. An 8-connected boundary following algorithm is then used to obtain the closed outer boundary of the connected component. Certain characters, such as "o" and "b", contain one or more inner boundaries; these inner boundaries are not used in our current implementation. The outer boundary of a character is then traversed clockwise and divided into a set of consecutive, non-overlapping, five-pixel-long segments. If the last boundary segment is less than five pixels long, it is discarded. Each boundary segment is then matched against the patterns in the pattern table. If a boundary segment matches a pattern in the pattern table, it is called a valid boundary segment.
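The partitioning and table lookup just described, together with the bit convention from Section 2.1 (A-patterns encode 1, D-patterns encode 0), can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it works on lists of boundary coordinates rather than image pixels, it identifies a segment by its four step vectors, and its pattern table holds a single hypothetical dual pair because the paper's 100 pairs are not reproduced in the text. All names (`PATTERN_TABLE`, `embed`, `extract`, and so on) are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]                 # (row, col)
Key = Tuple[Tuple[int, int], ...]       # four steps between five pixels

# HYPOTHETICAL one-pair table; the real table would hold 100 dual pairs.
D_KEY: Key = ((0, 1), (-1, 1), (0, 1), (0, 1))   # assumed Delete pattern
A_KEY: Key = ((0, 1), (0, 1), (-1, 1), (0, 1))   # its assumed Add dual
PATTERN_TABLE: Dict[Key, Tuple[str, Key]] = {
    D_KEY: ("D", A_KEY),
    A_KEY: ("A", D_KEY),
}


def segment_key(seg: List[Pixel]) -> Key:
    """Describe a segment by the step vectors between consecutive pixels."""
    return tuple((r2 - r1, c2 - c1) for (r1, c1), (r2, c2) in zip(seg, seg[1:]))


def split_segments(boundary: List[Pixel], length: int = 5) -> List[List[Pixel]]:
    """Consecutive non-overlapping segments; a short last piece is discarded."""
    n_full = len(boundary) // length
    return [boundary[i * length:(i + 1) * length] for i in range(n_full)]


def flip(seg: List[Pixel]) -> List[Pixel]:
    """Replace the center pixel so the segment becomes its dual pattern."""
    _, dual = PATTERN_TABLE[segment_key(seg)]
    (r, c), (dr, dc) = seg[1], dual[1]
    return seg[:2] + [(r + dr, c + dc)] + seg[3:]


def embed(boundary: List[Pixel], bits: List[int]) -> List[Pixel]:
    """Embed bits in valid segments: 1 needs an A-pattern, 0 a D-pattern."""
    out: List[Pixel] = []
    remaining = iter(bits)
    for seg in split_segments(boundary):
        entry = PATTERN_TABLE.get(segment_key(seg))
        if entry is not None:                      # valid boundary segment
            bit = next(remaining, None)
            if bit is not None and entry[0] != ("A" if bit == 1 else "D"):
                seg = flip(seg)
        out.extend(seg)
    out.extend(boundary[len(out):])                # short tail kept unchanged
    return out


def extract(boundary: List[Pixel]) -> List[int]:
    """Read one bit from every valid segment; no original image is needed."""
    kinds = (PATTERN_TABLE.get(segment_key(s)) for s in split_segments(boundary))
    return [1 if k[0] == "A" else 0 for k in kinds if k is not None]


# Toy boundary (not a real closed contour): D-segment, straight run, A-segment, tail.
toy_boundary = [(2, 0), (2, 1), (1, 2), (1, 3), (1, 4),
                (1, 5), (1, 6), (1, 7), (1, 8), (1, 9),
                (1, 10), (1, 11), (1, 12), (0, 13), (0, 14),
                (0, 15), (0, 16)]
print(extract(toy_boundary))                   # [0, 1]
print(extract(embed(toy_boundary, [1, 0])))    # [1, 0]
```

Because only the center pixel of a valid segment changes, the five-pixel partitioning is the same before and after embedding, which is what lets `extract` work without the original. In this sketch, valid segments beyond the end of the message are left untouched, so a real system would need some way (for instance a length header) to tell the extractor where the message ends.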
Valid boundary segments are used to embed data; the other segments on the boundary are simply ignored. If the data bit to be embedded is 0 and the current boundary segment is an Add pattern, the pattern is flipped to become a Delete pattern; otherwise no change is necessary. Similarly, if the data bit to be embedded is 1 and the current boundary segment is a Delete pattern, the pattern is flipped to become an Add pattern; otherwise no change is necessary. Data bits are embedded in the characters (or connected components) of a document in left-to-right, top-to-bottom order. Figure 3 shows a block diagram of the embedding process.

In the extraction process, the same procedure as in the embedding process is used to extract five-pixel-long boundary segments from connected components. Valid boundary segments are again identified using a table lookup procedure and converted to binary data bits. Figure 4 shows a block diagram of the extraction process.

The data hiding capacity c of each connected component lies in the range [0, int{N/5}], where N is the number of boundary points in the connected component and int{ } denotes truncation to an integer. The total data hiding capacity C of a text document is C = c_ave * M, where c_ave is the average capacity per character and M is the total number of characters in the document.

3. Experimental Results

A set of experiments was performed on a SUN Ultra SPARC 60 workstation to test the validity of the proposed method. Figure 5(a) shows a signature image (see Footnote 2) of size 287 x 61 pixels. It has seven connected components. Figure 5(b) shows the marked image with 91 bits of embedded data, which consist of 14 bits of header information and the ASCII representation of the 11 letters POLYTECHNIC (77 bits). Figure 5(c) shows the difference image. The same image was used in the experiment in [2], where it was reported that 7 letters were embedded using that method. Figure 6(a) shows a text paragraph at 72 dpi resolution.
The image was generated using the Paint program on a Windows 98 PC. It has 334 connected components. Figure 6(b) shows the marked image with 648 embedded bits. The marked images in Figures 5(b) and 6(b) are almost visually identical to the originals.

We also applied our method to a test document containing a full page of text in font size 11. Table I shows the test results for the document scanned at 100, 200, and 300 dpi. The total number of connected components at 300 dpi is 2,284, which corresponds to the 2,060 characters in the document plus other symbols such as periods and commas. At 200 dpi, the number of connected components increases slightly to 2,326. A visual inspection shows that at 200 dpi, some of the connected components were broken into two or more pieces. We suspect this is caused by either the scanning process or the software that converts the original scanned image from grayscale to binary.

At 300 dpi, the total number of embedded bits is 13,000, an average of 5.69 bits per connected component (CC). This decreases to 6,560 bits and 2.82 bits/CC at 200 dpi. The decrease in data hiding capacity as resolution decreases is expected, because the characters or connected components have fewer pixels and consequently have shorter boundaries in terms of pixels. An interesting observation is that although the boundary length (in pixels) increases 1.5 times from 200 dpi to 300 dpi, the data hiding capacity increases about 2 times. A possible explanation is that more valid boundary patterns are matched at 300 dpi. A more thorough investigation is needed.

At 100 dpi, visual inspection shows that an unacceptable number of connected components were broken into two or more pieces. The data hiding capacity drops to only 721 bits in total, or 0.17 bits/CC. We expect that with a better scanning and grayscale-to-binary conversion process, the number of broken connected components would be reduced, and the number of connected components would be about the same as in the 200 or 300 dpi images. The data hiding capacity would then improve as the size of the connected components gets bigger.

Footnote 2: The signature was obtained from the U.S. White House website http://whitehouse.gov during the summer of 2000.

4. Conclusions

A novel data hiding technique for binary text documents was developed. Experimental results demonstrated the good data hiding capacity of the technique. In the current implementation, we used only the outer boundary of a character to embed data. If inner boundaries are included, the data hiding capacity can be further increased.
Since the method hides data in non-smooth portions of text character boundaries, alterations are hardly noticeable. The duality property of the Add-Delete patterns allows easy extraction of hidden data without complicated enforcing techniques, and without referring to the original document. The proposed method is useful for annotating messages in a text document, and for detecting alterations. This method could also be applied to other binary images with connected components. In future work, we will explore the use of boundary segments other than five pixels long for embedding data, and study their data hiding capacity.

References

[1] M. Swanson, M. Kobayashi, and A. Tewfik, Multimedia Data Embedding and Watermarking Technologies, Proceedings of the IEEE, vol. 86, no. 6, pp. 1064-1087, June 1998.
[2] M. Wu, E. Tang, and B. Liu, Data Hiding in Digital Binary Images, Proc. Int'l Conf. on Multimedia and Expo, Jul 31-Aug 2, 2000, New York, NY.
[3] S. H. Low, N. F. Maxemchuk, and A. M. Lapone, Document Identification for Copyright Protection Using Centroid Detection, IEEE Trans. on Comm., vol. 46, no. 3, pp. 372-383, Mar 1998.
[4] T. Amano and D. Misaki, Feature Calibration Method for Watermarking of Document Images, Proc. 5th Int'l Conf. on Document Analysis and Recognition, pp. 91-94, 1999, Bangalore, India.
[5] K. Matsui and K. Tanaka, Video-steganography: How to Secretly Embed a Signature in a Picture, Proc. of IMA Intellectual Property Project, vol. 1, no. 1, 1994.

Figure 1. Effect of Arbitrarily Changing Pixel Values on a Binary Image

Figure 2. Twenty-eight of the 100 Dual A- and D-Patterns.

[Block diagram: Original Text Document -> Extract Character Boundary -> Divide into Segments -> Match Segment Pattern (using the Pattern Table) -> Embed Data -> Marked Text Document]
Figure 3. Data Embedding Process

[Block diagram: Marked Text Document -> Extract Character Boundary -> Divide into Segments -> Match Segment Pattern (using the Pattern Table) -> Extract Data]
Figure 4. Data Extraction Process

(a) Original Signature Image (287 x 61 pixels)
(b) Image with Letters POLYTECHNIC Embedded
(c) Difference Image
Figure 5. Experiment Result from the Signature Image

(a) Original Text Image at 72 dpi
(b) Marked Text Image with 648 bits embedded
Figure 6. Experimental Result for a Paragraph of Text

Table I. Results from a Full Text Document

Resolution (dpi)                  100     200     300
# Connected Components (CC)     4,254   2,326   2,284
# Bits embedded                   721   6,560  13,000
Ave # bits/CC                    0.17    2.82    5.69
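
The capacity figures reported above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the per-component bound int{N/5}, the bits/CC row of Table I, and the 91-bit payload of the signature experiment. The function and variable names are illustrative, and the 7 bits per letter is inferred from the reported 77 bits for 11 letters rather than stated directly in the text.

```python
# Table I values: resolution in dpi -> (connected components, bits embedded)
TABLE_I = {
    100: (4254, 721),
    200: (2326, 6560),
    300: (2284, 13000),
}

HEADER_BITS = 14      # header size reported for the signature experiment
BITS_PER_LETTER = 7   # inferred: 77 bits / 11 letters


def capacity_upper_bound(n_boundary_pixels: int) -> int:
    """Upper bound int{N/5} on the bits one connected component can carry."""
    return n_boundary_pixels // 5


def bits_per_cc(dpi: int) -> float:
    """Average number of embedded bits per connected component."""
    components, bits = TABLE_I[dpi]
    return bits / components


def payload_bits(message: str) -> int:
    """Header plus 7-bit ASCII letters, as in the signature experiment."""
    return HEADER_BITS + BITS_PER_LETTER * len(message)


for dpi in sorted(TABLE_I):
    print(dpi, round(bits_per_cc(dpi), 2))
print(payload_bits("POLYTECHNIC"))
```

The ratio of embedded bits between 300 dpi and 200 dpi is 13,000 / 6,560, or roughly 1.98, which agrees with the "about 2 times" increase noted in Section 3.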