Data Hiding in Binary Text Documents^1

Q. Mei, E. K. Wong, and N. Memon

Department of Computer and Information Science
Polytechnic University
5 Metrotech Center, Brooklyn, NY 11201

ABSTRACT

With the proliferation of digital media such as digital images, digital audio, and digital video, robust digital watermarking and data hiding techniques are needed for copyright protection, copy control, annotation, and authentication. While many techniques have been proposed for digital color and grayscale images, not all of them can be directly applied to binary text images. The difficulty lies in the fact that changing pixel values in a binary document can introduce irregularities that are visually very noticeable. We propose a new method for data hiding in binary text documents by embedding data in the 8-connected boundary of a character. We have identified a fixed set of pairs of five-pixel-long boundary patterns for embedding data. One of the patterns in a pair requires deletion of the center foreground pixel, whereas the other requires the addition of a foreground pixel. A unique property of the proposed method is that the two patterns in each pair are duals of each other: changing the pixel value of one pattern at the center position results in the other. This property allows easy detection of the embedded data without referring to the original document, and without using any special enforcing techniques for detecting embedded data.

Keywords: data hiding, watermarking, binary text documents, authentication

1. Introduction

As digital devices such as scanners and digital cameras become more widely available, and mass storage media for digital data become more affordable, the use of digital images in practical applications is becoming more widespread. Practical image applications range from famous works of art and bank checks to medical images depicting an obscure disease.
Reliable methods for copyright protection, copy control, annotation, and authentication are therefore needed. A variety of digital watermarking and data hiding techniques have been proposed for such purposes. However, most of the methods developed to date are for grayscale and color images [1], in which the gray level or color value of a selected group of pixels is changed by a small amount without causing visually noticeable artifacts. These techniques cannot be directly applied to binary images, where each pixel has a value of either 0 or 1. Arbitrarily changing pixels in a binary image causes very noticeable artifacts (see Figure 1 for an example). A different class of embedding techniques must therefore be developed for binary images. This has important applications in a wide variety of document images that are represented as binary foreground and background, e.g., text documents.

There has been limited work on watermarking and data hiding in binary images. In [2], the input binary image is divided into 3 x 3 (or larger) blocks. The flipping priorities of the pixels in a 3 x 3 block are then computed, and the pixels with the lowest scores are the ones to be changed. The flipping priority of a pixel is indicative of the estimated visual distortion that would be caused by flipping the value of the pixel from 0 to 1 or from 1 to 0. Data are embedded such that the total number of black pixels in a block is either odd or even. Shuffling is used to equalize the uneven embedding capacity. The data hiding technique is not robust to printing and scanning and is hence suitable only for steganography and authentication (fragile watermarking) applications. In [3], data are embedded in text documents by shifting line and word spacing. This approach has low embedding capacity, but the embedded data are robust to photocopying, scanning, and printing. In [4], the difference between the average widths of character strokes extracted from two sets of partitions arranged symmetrically is used to embed data. This method can only be applied to documents containing characters. In [5], data are embedded in dithered images by changing the dithering patterns, and in fax images by changing the run-lengths. The method cannot be applied to general binary images but is claimed to be robust to printing and scanning.

^1 This work was supported in part by NSF REU Grant # 9619749 made to Polytechnic University. Corresponding author: wong@poly.edu

In this paper, we propose a new method for data hiding in binary text document images by changing pixel values along the non-smooth portions of character boundaries. The method could also be applied to other types of binary images that contain connected components. Our method uses an efficient table lookup procedure for determining the boundary patterns in which to embed data. Like the method in [2], our technique is not robust to printing and scanning and hence is useful only in steganography and authentication applications.

The rest of this paper is organized as follows. In Section 2, we describe our proposed method for embedding and extracting data. In Section 3, we present some experimental results. In Section 4, we give our conclusions.

2. Proposed Approach

In this section, we present our proposed data hiding technique, which embeds information bits along character boundaries.

2.1 Boundary Patterns Selection

In our proposed approach, data are embedded in the 8-connected boundary of a character. We assume that the input text document is in binary image form, or has been converted into a binary image from a grayscale or color document.
We have identified 100 pairs of five-pixel-long boundary patterns for embedding data. One of the patterns in a pair requires addition of a foreground pixel adjacent to the center pixel, whereas the other requires the deletion of the center foreground pixel. For convenience, we refer to these operations as the Add and Delete operations, and call the two types of patterns A (Add)-patterns and D (Delete)-patterns, respectively. A unique property of the chosen patterns is that the A- and D-patterns in each pair are duals of each other; that is, changing the center pixel of one pattern results in the other pattern. We refer to this operation as flipping the pattern: flipping an A-pattern results in a D-pattern, and vice versa. Figure 2 shows 28 example dual pairs from the 100 pairs. In the figure, the black pixels are boundary pixels, and the gray pixels are foreground object pixels. The A-patterns encode information bit 1, and the D-patterns encode information bit 0. The duality property allows easy detection of the embedded data without referring to the original document, and without using any special enforcing techniques in the detection process.

In obtaining the 100 boundary pattern pairs, the goal is to preserve the overall shape of a character, and to minimize noticeable artifacts and distortions around the boundary after embedding data. First, we assume that each of the five consecutive boundary pixels does not touch (as an 8-connected neighbor) any pixel in the boundary segment other than the one immediately preceding or following it. This is a reasonable assumption for the boundary of a character or connected component of reasonable size (in terms of pixel count). We start out with the set of all possible five-pixel-long boundary patterns satisfying this requirement.
The following types of boundary patterns are then eliminated from the initial set:

(a) Boundary segments that do not preserve length after Add or Delete operations
(b) Straight line segments
(c) Boundary segments with a 90 degree corner
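As a rough sketch of how the initial set and eliminations (b) and (c) might be enumerated, the code below describes a five-pixel segment by four Freeman chain codes. The chain-code representation and all function names are assumptions made for illustration and do not come from the paper. Category (a) depends on the exact geometry of the Add and Delete operations and is not modeled, so this sketch is not expected to reproduce the 100 pairs.

```python
from itertools import product

# Freeman chain-code steps (dx, dy): code 0 is east, codes increase
# counter-clockwise. This encoding is an illustrative assumption.
STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def pixels(codes, start=(0, 0)):
    """Return the five pixel coordinates visited by four chain-code steps."""
    pts = [start]
    for c in codes:
        dx, dy = STEPS[c]
        x, y = pts[-1]
        pts.append((x + dx, y + dy))
    return pts


def self_touching(codes):
    """True if two non-consecutive pixels coincide or are 8-connected neighbors."""
    pts = pixels(codes)
    for i in range(len(pts)):
        for j in range(i + 2, len(pts)):
            dx = abs(pts[i][0] - pts[j][0])
            dy = abs(pts[i][1] - pts[j][1])
            if max(dx, dy) <= 1:
                return True
    return False


def is_straight(codes):
    """Category (b): horizontal, vertical, or diagonal straight segment."""
    return len(set(codes)) == 1


def has_right_angle_at_center(codes):
    """Category (c): the direction turns by 90 degrees at the center pixel."""
    # codes[1] arrives at the center pixel, codes[2] leaves it.
    return (codes[2] - codes[1]) % 8 in (2, 6)


def candidate_patterns():
    """Chain-code patterns surviving the non-touching test and rules (b), (c)."""
    return [c for c in product(range(8), repeat=4)
            if not self_touching(c)
            and not is_straight(c)
            and not has_right_angle_at_center(c)]
```

Note that an axis-aligned corner such as east, east, south, south already fails the non-touching assumption, because the pixels on either side of the corner pixel are diagonal neighbors; under this encoding, rule (c) therefore matters mainly for diagonal V-shaped corners.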

Elimination of boundary segments in category (a) above ensures that a boundary segment remains five pixels long after an Add or Delete operation. This allows the embedded data to be extracted from the same fixed five-pixel-long partitioning of the boundary, without using a special enforcing procedure. Straight-line segments are eliminated since they no longer look straight after a pixel is added or deleted at the center position, and the change may become noticeable. Here, we eliminated five-pixel-long horizontal, vertical, and diagonal straight-line segments. Finally, boundary segments that form a 90-degree corner at the center pixel are eliminated, since they no longer look like a sharp 90-degree corner after a pixel is added or deleted at the center position.

2.2 Embedding and Extracting Data

The 100 pairs of boundary patterns are stored in a lookup table called the pattern table. In the embedding process, the input image is scanned left to right and top to bottom to extract all connected components, which correspond to characters or other symbols in a text document. For each connected component, the first upper-left foreground pixel encountered in the scanning process is used as the starting pixel. An 8-connected boundary following algorithm is then used to obtain the closed outer boundary of the connected component. Certain characters, such as o and b, contain one or more inner boundaries; these inner boundaries are not used in our current implementation. The outer boundary of a character is then traversed in a clockwise manner and divided into a set of consecutive non-overlapping five-pixel-long segments. If the last boundary segment is less than five pixels long, it is discarded. The set of consecutive boundary segments is then matched against the patterns in the pattern table. If a boundary segment matches a pattern in the pattern table, it is called a valid boundary segment.
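The segmentation and table lookup steps above, together with the bit-embedding and extraction rules described in the remainder of this subsection, can be sketched as follows. This is a minimal illustration under stated assumptions: the pattern table is a toy dictionary with two placeholder pairs (the real table holds 100 pairs), segments are represented by hypothetical pattern keys rather than pixel coordinates, and flipping is modeled as swapping a key for its dual rather than changing a pixel in the image.

```python
from typing import Dict, List, Optional, Sequence, Tuple

SEGMENT_LEN = 5  # five-pixel-long boundary segments, as in the text

# Hypothetical toy pattern table: key -> (pattern type, key of its dual).
# The keys are placeholders, not the patterns of Figure 2.
PATTERN_TABLE: Dict[str, Tuple[str, str]] = {
    "pair0_A": ("A", "pair0_D"),
    "pair0_D": ("D", "pair0_A"),
    "pair1_A": ("A", "pair1_D"),
    "pair1_D": ("D", "pair1_A"),
}


def split_boundary(boundary: Sequence, seg_len: int = SEGMENT_LEN) -> List[tuple]:
    """Cut a traversed boundary into consecutive non-overlapping segments.

    A trailing segment shorter than seg_len is discarded, so the number of
    segments is len(boundary) // seg_len (the upper bound on capacity).
    """
    n_full = len(boundary) // seg_len
    return [tuple(boundary[i * seg_len:(i + 1) * seg_len]) for i in range(n_full)]


def embed(segments: Sequence[str], bits: Sequence[int]) -> List[str]:
    """Embed bits into valid segments: A-patterns carry 1, D-patterns carry 0."""
    out: List[str] = []
    remaining = iter(bits)
    for seg in segments:
        entry = PATTERN_TABLE.get(seg)
        if entry is None:
            out.append(seg)          # not a valid segment: ignored
            continue
        bit: Optional[int] = next(remaining, None)
        if bit is None:
            out.append(seg)          # payload exhausted: leave unchanged
            continue
        kind, dual = entry
        wanted = "A" if bit == 1 else "D"
        out.append(seg if kind == wanted else dual)  # flip only on mismatch
    return out


def extract(segments: Sequence[str]) -> List[int]:
    """Read one bit from every valid segment, without the original document."""
    return [1 if PATTERN_TABLE[s][0] == "A" else 0
            for s in segments if s in PATTERN_TABLE]
```

In this sketch, extract returns a bit for every valid segment, including those left untouched after the payload ends, so a real implementation would need some way to delimit the message; the header bits mentioned in Section 3 may serve that purpose, although the text does not say so.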
Valid boundary segments are used to embed data; other segments on the boundary are simply ignored in the process. If the data bit to be embedded is a 0 and the current boundary segment is an Add pattern, the pattern is flipped to become a Delete pattern; otherwise, no change is necessary. Similarly, if the data bit to be embedded is a 1 and the current boundary segment is a Delete pattern, the pattern is flipped to become an Add pattern; otherwise, no change is necessary. Data bits are embedded in the characters (or connected components) of a document in left-to-right, top-to-bottom order. Figure 3 shows a block diagram of the embedding process.

In the extraction process, the same procedure as in the embedding process is used to extract five-pixel-long boundary segments from connected components. Valid boundary segments are, again, identified using a table lookup procedure and converted to binary data bits. Figure 4 shows a block diagram of the extraction process.

The data hiding capacity c for each connected component is bounded by [0, int{N/5}], where N is the number of boundary points in the connected component and int{ } denotes truncation to an integer. The total data hiding capacity C for a text document is c_ave * M, where c_ave is the average capacity per character and M is the total number of characters in the document.

3. Experimental Results

A set of experiments was performed on a SUN Ultra Spark 60 workstation to validate the proposed method. Figure 5(a) shows a signature image^2 of size 287 x 61 pixels. It has seven connected components. Figure 5(b) shows the marked image with 91 bits of embedded data, which consist of 14 bits of header information and the ASCII representation of the 11 letters POLYTECHNIC (77 bits). Figure 5(c) shows the difference image. The same image was used in the experiment in [2], where it was reported that 7 letters were embedded using that method.

Figure 6(a) shows a text paragraph at 72 dpi resolution.
The image was generated using the Paint program on a Windows 98 PC. It has 334 connected components. Figure 6(b) shows the marked image with 648 embedded bits. It can be seen that the marked images in Figures 5(b) and 6(b) are visually almost identical to the originals.

^2 The signature was obtained from the U.S. White House website http://whitehouse.gov during the summer of 2000.

We also applied our method to a test document with a full page of text in font size 11. Table I shows the test results for the document scanned at 100, 200, and 300 dpi. The total number of connected components at 300 dpi is 2,284, which corresponds to the 2,060 characters in the document plus other symbols such as periods, commas, etc. At 200 dpi, the number of connected components increases slightly to 2,326. A visual inspection shows that at 200 dpi, some of the connected components are broken into two or more pieces. We suspect this is caused by either the scanning process or the software that converts the original scanned image from grayscale to binary. At 300 dpi, the total number of embedded bits is 13,000, with an average of 5.69 bits per connected component (CC). This decreases to 6,560 bits and 2.82 bits/CC at 200 dpi. The decrease in data hiding capacity as resolution decreases is expected because the characters or connected components have fewer pixels, and consequently have shorter boundaries in terms of pixels. An interesting observation is that although the boundary length (in pixels) increases 1.5 times from 200 dpi to 300 dpi, the data hiding capacity increases about 2 times. A possible explanation is that more valid boundary patterns are matched at 300 dpi. A more thorough investigation is needed.

At 100 dpi, visual inspection shows that an unacceptable number of connected components are broken into two or more pieces. The data hiding capacity drops to only 721 bits in total, or 0.17 bits/CC. We expect that with a better scanning and grayscale-to-binary conversion process, the number of broken connected components would be reduced, and the number of connected components would be about the same as in the 200 or 300 dpi images. The data hiding capacity would then improve as the size of the connected components gets bigger.

4. Conclusions

A novel data hiding technique for binary text documents was developed. Experimental results demonstrated good data hiding capacity of the technique. In the current implementation, we used only the outer boundary of a character to embed data. If inner boundaries are included, the data hiding capacity can be further increased.
Since the method hides data in non-smooth portions of text character boundaries, alterations are hardly noticeable. The duality property of the Add-Delete patterns allows easy extraction of hidden data without complicated enforcing techniques, and without referring to the original document. The proposed method is useful for annotating a text document with messages, and for detecting alterations. The method could also be applied to other binary images with connected components. In future work, we will explore the use of boundary segments other than five pixels long for embedding data, and study their data hiding capacity.

References

[1] M. Swanson, M. Kobayashi, and A. Tewfik, Multimedia Data Embedding and Watermarking Technologies, Proceedings of the IEEE, vol. 86, no. 6, pp. 1064-1087, June 1998.
[2] M. Wu, E. Tang, and B. Liu, Data Hiding in Digital Binary Images, Proc. Int'l Conf. on Multimedia and Expo, Jul 31-Aug 2, 2000, New York, NY.
[3] S. H. Low, N. F. Maxemchuk, and A. M. Lapone, Document Identification for Copyright Protection Using Centroid Detection, IEEE Trans. on Comm., vol. 46, no. 3, pp. 372-83, Mar 1998.
[4] T. Amano and D. Misaki, Feature Calibration Method for Watermarking of Document Images, Proc. 5th Int'l Conf. on Document Analysis and Recognition, pp. 91-94, 1999, Bangalore, India.
[5] K. Matsui and K. Tanaka, Video-steganography: How to Secretly Embed a Signature in a Picture, Proc. of IMA Intellectual Property Project, vol. 1, no. 1, 1994.

Figure 1. Effect of Arbitrarily Changing Pixel Values on a Binary Image

Figure 2. Twenty-eight of the 100 Dual A- and D-Patterns

[Block diagram: Original Text Document -> Extract Character Boundary -> Divide into Segments -> Match Segment Pattern (using the Pattern Table) -> Embed Data -> Marked Text Document]

Figure 3. Data Embedding Process

[Block diagram: Marked Text Document -> Extract Character Boundary -> Divide into Segments -> Match Segment Pattern (using the Pattern Table) -> Extract Data]

Figure 4. Data Extraction Process

(a) Original Signature Image (287 x 61 pixels)
(b) Image with Letters POLYTECHNIC Embedded
(c) Difference Image

Figure 5. Experimental Result from the Signature Image

(a) Original Text Image at 72 dpi
(b) Marked Text Image with 648 Bits Embedded

Figure 6. Experimental Result for a Paragraph of Text

Table I. Results from a Full Text Document

Resolution (dpi)                  100      200      300
# Connected components (CC)     4,254    2,326    2,284
# Bits embedded                   721    6,560   13,000
Ave. # bits/CC                   0.17     2.82     5.69
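The averages in Table I and the 91-bit payload reported in Section 3 can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable and function names are illustrative, and the 7 bits per letter is inferred from the reported 77 bits for 11 letters.

```python
# Figures reported in Section 3 and Table I.
HEADER_BITS = 14
LETTERS = "POLYTECHNIC"
BITS_PER_LETTER = 7  # inferred: 77 bits / 11 letters

# Resolution in dpi -> (connected components, bits embedded)
TABLE_I = {100: (4254, 721), 200: (2326, 6560), 300: (2284, 13000)}


def payload_bits() -> int:
    """Header bits plus the ASCII bits of the embedded letters."""
    return HEADER_BITS + len(LETTERS) * BITS_PER_LETTER


def bits_per_cc(dpi: int) -> float:
    """Average number of embedded bits per connected component."""
    cc, bits = TABLE_I[dpi]
    return round(bits / cc, 2)


def capacity_ratio(low_dpi: int, high_dpi: int) -> float:
    """Ratio of embedded bits between two scanning resolutions."""
    return TABLE_I[high_dpi][1] / TABLE_I[low_dpi][1]
```

Here capacity_ratio(200, 300) comes out just under 2, consistent with the observation that capacity roughly doubles while the boundary length grows by a factor of 1.5.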