DATA EMBEDDING IN TEXT FOR A COPIER SYSTEM

Anoop K. Bhattacharjya and Hakan Ancin
Epson Palo Alto Laboratory
3145 Porter Drive, Suite 104, Palo Alto, CA 94304
e-mail: {anoop, ancin}@erd.epson.com

Abstract

In this paper, we present a scheme for embedding data in copies (color or monochrome) of predominantly text pages that may also contain color images or graphics. Embedding data imperceptibly in documents or images is a key ingredient of watermarking and data hiding schemes. It is comparatively easy to hide a signal in natural images, since the human visual system is less sensitive to signals embedded in noisy regions containing high spatial frequencies. In other instances, e.g., simple graphics or monochrome text documents, additional constraints need to be satisfied to embed signals imperceptibly. Data may be embedded imperceptibly in printed text by altering some measurable property of a font, such as the position of a character or the font size. This scheme, however, is not very useful for embedding data in copies of text pages, as it would require accurate text segmentation and possibly optical character recognition, both of which would considerably degrade the error-rate performance of the data-embedding system. Similarly, other schemes that alter pixels on text boundaries perform poorly because of boundary-detection uncertainties introduced by scanner noise, sampling and blurring. The scheme presented in this paper ameliorates these problems by using a text-region based embedding approach. Since the bulk of documents reproduced today contain black-on-white text, this data-embedding scheme can form a print-level layer in applications such as copy tracking and annotation.

1. Introduction

In this paper, we present a method for embedding or hiding information in copies of predominantly text documents, such that the embedded signal is visually imperceptible. The method is also applicable to originals containing color graphics and images in addition to text.
A number of methods have been proposed for embedding signals in images of natural scenes [1]. Data may be embedded imperceptibly in printed text by altering some measurable property of a font, such as the position of a character or the font size. This scheme, however, is not very useful for embedding data in copies of text pages, as it would require accurate text segmentation and possibly optical character recognition on the document copy, both of which would considerably degrade the error-rate performance of the data-embedding system. Similarly, other schemes that alter pixels on text boundaries perform poorly because of boundary-detection uncertainties introduced by scanner noise, sampling and blurring.

Another approach is to embed the hidden data in the halftoning patterns used by the printer to generate a copy. This approach, however, works best for documents that contain natural images or continuous-tone content. Many printers today employ halftone patterns for printer tracking. These systems are inadequate for copy-tracking applications that may require additional annotation, such as a copier serial number or user identification. Since a large percentage of reproduced documents consist of black-and-white text, there is a need for schemes that can hide data imperceptibly in copies of such pages.

In the scheme presented in this paper, we identify small (sub-character sized) regions that consist mainly of pixels meeting the criteria for text-character parts described below, and embed data by modulating the lightness of these regions. Although the method relies on the existence of these regions, it does not rely on their actually representing parts of text characters. While the variations in lightness do not affect perceived text quality, they can be picked up easily by a scanner and decoded to retrieve the message.
The robustness of the scheme is improved by using an error-correcting code coupled with a bit-dispersal scheme that disperses the message bits throughout the document. The steps involved in data embedding and retrieval are presented in the following sections.

2. The data embedding and retrieval system

This section presents the steps by which data is embedded into and retrieved from the copy of a text document. The processing requires two scans of the original document. The first is a preview scan, at a lower resolution, that is used to identify the various components of the document and to establish a coordinate system based on the paragraphs, lines and words found in the document. The second is a full-resolution scan that is used to generate the document copy. The data from this scan is processed together with the results of the preview scan to embed or retrieve the message. As part of a copier pipeline, this data may then be sent for printing. The principal steps of the preview processing are shown in Figure 1. Once a site list is obtained from an analysis of the preview image, the bits to be embedded are used to modulate the pixel intensities in the scanned image, in regions determined by the site list. Details of the preview-processing and data-embedding steps are provided in the following sections.

2.1. Preview processing

Before performing the copy scan, the copier performs a preview scan to determine candidate sites in the text document for embedding data. This scan is typically of a lower resolution than the scan used for making a copy, so that the memory and processing requirements of the preview scan are minimized. In this paper, the preview scan resolution is assumed to be half or a third of the copy scan resolution. The preview image is first segmented into regions that approximately correspond to text, image and background regions.

2.1.1. Image segmentation. Image segmentation is a two-step process. First, the pixels are classified based on their luminance and color-saturation values. Pixels with low luminance and low saturation are classified as text, those with high luminance and low saturation are classified as background, and the remaining pixels are classified as image pixels [2]. These labels may be further refined using run-length information as described in [3]; however, most documents do not require this level of sophistication for adequate initial segmentation.
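The first segmentation step can be sketched as a simple per-pixel rule. The following is a minimal illustration only: the paper does not give the luminance or saturation thresholds, nor the exact color transform, so the Rec. 601 luma weights, the HSV-style saturation, and the threshold values `dark`, `bright` and `low_sat` are all assumptions chosen for this example.

```python
TEXT, BACKGROUND, IMAGE = "text", "background", "image"


def luminance_saturation(r, g, b):
    """Return (luminance, saturation), each in [0, 1], for 8-bit RGB input."""
    rf, gf, bf = r / 255.0, g / 255.0, b / 255.0
    # Assumed transform: Rec. 601 luma weights for luminance.
    lum = 0.299 * rf + 0.587 * gf + 0.114 * bf
    mx, mn = max(rf, gf, bf), min(rf, gf, bf)
    # Assumed transform: HSV-style saturation.
    sat = 0.0 if mx == 0 else (mx - mn) / mx
    return lum, sat


def classify_pixel(rgb, dark=0.35, bright=0.75, low_sat=0.25):
    """Label one pixel as text, background or image.

    The three thresholds are hypothetical values, not taken from the paper.
    """
    lum, sat = luminance_saturation(*rgb)
    if sat <= low_sat and lum <= dark:
        return TEXT          # low luminance, low saturation
    if sat <= low_sat and lum >= bright:
        return BACKGROUND    # high luminance, low saturation
    return IMAGE             # everything else
```

In a real copier the thresholds would presumably be tuned to the scanner; the point of the sketch is only that the rule is a pair of comparisons per pixel, which keeps the preview pass cheap.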
Second, a morphological filter is used to delete very small and very large regions of connected text labels. Pixels corresponding to the deleted text labels are marked as unknown. The binary image composed of text and non-text pixels is analyzed further to establish a rotation- and translation-invariant reference frame for the document.

2.1.2. Connected components labeling, deskewing and block identification. A connected-components algorithm [4] is used to identify connected regions of text pixels. Text-label components with areas and lengths that are smaller or larger than preset thresholds are deleted, and the corresponding pixels are marked as non-text. Very long components are excluded as potential sites because they are susceptible to greater cumulative registration errors during data extraction. The components that survive this step are used to determine the skew angle of the document and thereby establish the orientation of the page.

The orientation of the page is established using a Hough-transform technique, as follows. First, the components are grouped in a hierarchical structure based on the inter-component distance. This hierarchical structure groups the components into characters, words, lines and paragraphs. The grouping is performed by calculating the distance between the elements of a group at a given level. Individual characters form the lowest level in the hierarchy; these correspond simply to the connected components themselves. Note that with this classification, characters may not correspond to actual text characters, i.e., a text character may be composed of multiple components, or multiple text characters may fuse into a single component. While this misclassification would affect character recognition, it does not affect skew detection or data embedding. The median component height is used as a length scale to group components into word and paragraph elements.
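The component labeling and size filtering described above might look like the sketch below. It is an illustration under stated assumptions: the paper does not specify the connectivity or the thresholds, so 8-connectivity and the parameters `min_area`, `max_area` and `max_length` are hypothetical, and a plain breadth-first flood fill stands in for whichever labeling algorithm from [4] the authors used.

```python
from collections import deque
from statistics import median


def label_components(mask):
    """Return the 8-connected components of truthy cells as lists of (row, col)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right), inclusive."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def filter_components(components, min_area, max_area, max_length):
    """Keep components whose area and longest side fall inside the thresholds."""
    kept = []
    for pixels in components:
        top, left, bottom, right = bounding_box(pixels)
        length = max(bottom - top + 1, right - left + 1)
        if min_area <= len(pixels) <= max_area and length <= max_length:
            kept.append(pixels)
    return kept


def median_height(components):
    """Median bounding-box height, used as the length scale for grouping."""
    return median(b[2] - b[0] + 1 for b in map(bounding_box, components))
```

The median height returned by the last function is the quantity the text uses as the length scale for the word, line and paragraph grouping.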
Words are formed as groups of characters that are closer than a preset inter-word distance, determined as a fixed proportion of the median component height. Similarly, a preset inter-line distance is used to group words into lines. Paragraphs are determined by two methods. The first method uses indentation of the first word in a line to find paragraphs. The second method marks paragraphs by looking for lines separated by more than a preset inter-paragraph distance.

Once the page has been described as a collection of words, lines and paragraphs, the centroids of all the components in a given line are used to determine its orientation. This is done by applying a Hough transform to the family of straight lines defined by the centroid of each component belonging to the same line grouping. Since the page orientation obtained in this manner is symmetric with respect to horizontal and vertical reflections, the retrieval algorithm needs to monitor two scan directions to retrieve an embedded bit stream. This ensures that if the page is rotated by 180 degrees on the scanner bed, the embedded message can still be retrieved.

Once the page orientation is known, the page is deskewed, and the bounding boxes of all the components belonging to a character, word, line or paragraph grouping, as described above, are used to define character, word, line and paragraph boxes respectively. The paragraph boxes are used to define multiple coordinate frames for the document, one for each paragraph. With the coordinate/reference frames established, the next step is the identification of sites for embedding the hidden message.
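A Hough-style skew estimate from component centroids can be sketched as follows. This is a minimal illustration, not the authors' implementation: the angular search range, the angular step and the bin width (`max_angle_deg`, `step_deg`, `rho_bin`) are assumed values, and the function simply returns the candidate angle at which the most centroids fall into a single offset bin.

```python
import math
from collections import Counter


def estimate_skew(centroids, max_angle_deg=10.0, step_deg=0.1, rho_bin=1.0):
    """Estimate the skew angle in degrees from (x, y) component centroids.

    For each candidate angle, every centroid votes for the signed offset of
    the line through it at that angle. The angle whose fullest offset bin
    collects the most votes is returned.
    """
    best_angle, best_votes = 0.0, -1
    steps = int(round(2 * max_angle_deg / step_deg))
    for i in range(steps + 1):
        angle = -max_angle_deg + i * step_deg
        t = math.radians(angle)
        s, c = math.sin(t), math.cos(t)
        # Offset of the line through (x, y) with direction (cos t, sin t).
        votes = Counter(round((y * c - x * s) / rho_bin) for x, y in centroids)
        peak = max(votes.values())
        if peak > best_votes:
            best_votes, best_angle = peak, angle
    return best_angle
```

For example, centroids sampled along a text line tilted by 3 degrees yield an estimate close to 3; the page would then be rotated by the negative of that angle to deskew it. Note that this estimate alone cannot tell a page from its 180-degree rotation, which is why the retrieval step described above checks two scan directions.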
2.1.3. Site selection. Sites for intensity modulation are determined by one of two methods. The first uses a coordinate system associated with each paragraph or line element to embed the data. If a paragraph block is used to establish the local coordinate frame, the pixels in each paragraph block are partitioned into a fine square grid with 3x3 pixels in each grid cell. The sites in which data will be embedded are chosen from among the grid cells. Site selection proceeds as follows. First, the grid cells that contain predominantly text-type pixels are identified. To perform this selection, the 90th percentile of the luminance histogram of all text components is chosen as a threshold. Any grid cell in which more than a preset percentage of pixels fall below this threshold is marked as a candidate site for data embedding. Data is embedded in these sites by modulating the luminance of all pixels belonging to a candidate site's cell.

The second method for site selection uses a local coordinate frame associated with characters that have long strokes. Such strokes are detected using a morphological operator. The height of the stroke provides a scale-independent coordinate system for modulating pixel intensities at locations along the stroke defined by this local coordinate system.

Two or more candidate sites are required for embedding each bit. For example, a bit may be embedded in two sites using the following scheme. If the difference between the average luminance of the pixels belonging to the current site and that of the next site is positive, the bit is a 1; if the difference is negative, the bit is a 0. Similar difference-based schemes may be used for embedding a single bit in three or more sites.
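The grid-based site selection and the two-site embedding rule can be sketched together. This is a minimal illustration under stated assumptions: the "preset percentage" (`min_fraction`) and the modulation depth (`delta`) are hypothetical values, the nearest-rank percentile is one of several possible definitions, and luminance is taken to be on a 0 to 255 scale.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    k = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[k - 1]


def candidate_cells(lum, text_mask, cell=3, min_fraction=0.8):
    """Return the top-left (row, col) of each grid cell that is mostly text.

    lum and text_mask are equally sized 2-D lists. min_fraction is an
    assumed stand-in for the paper's "preset percentage".
    """
    rows, cols = len(lum), len(lum[0])
    text_values = [lum[r][c] for r in range(rows) for c in range(cols)
                   if text_mask[r][c]]
    threshold = percentile(text_values, 90)
    sites = []
    for r in range(0, rows - cell + 1, cell):
        for c in range(0, cols - cell + 1, cell):
            dark = sum(lum[y][x] < threshold
                       for y in range(r, r + cell)
                       for x in range(c, c + cell))
            if dark / (cell * cell) > min_fraction:
                sites.append((r, c))
    return sites


def mean(values):
    return sum(values) / len(values)


def embed_bit(site_a, site_b, bit, delta=6.0):
    """Shift two sites so that mean(a) - mean(b) is +delta for 1, -delta for 0.

    delta is a hypothetical modulation depth. Clamping to [0, 255] can
    reduce the achieved difference for sites very close to black or white.
    """
    target = delta if bit else -delta
    shift = (target - (mean(site_a) - mean(site_b))) / 2
    new_a = [min(255.0, max(0.0, v + shift)) for v in site_a]
    new_b = [min(255.0, max(0.0, v - shift)) for v in site_b]
    return new_a, new_b


def decode_bit(site_a, site_b):
    """Read the bit back from the sign of the mean-luminance difference."""
    return 1 if mean(site_a) - mean(site_b) > 0 else 0
```

Because the decoder looks only at the sign of a difference between neighbouring sites, a uniform brightness change introduced by printing and rescanning leaves the decoded bit unchanged, which is presumably part of the appeal of a difference-based rule.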
For example, a bit may be embedded in three sites using average grid-cell luminance differences as follows: if the first difference is positive and the next is negative, the bit is a 1; if the first difference is negative and the next is positive, the bit is a 0. The number of independently controllable sites available for bit embedding is extracted from the candidate site list based on the number of sites required to embed a bit.

A line or word synchronization scheme is used to minimize cumulative errors due to site-identification errors. In this scheme, message words are always embedded starting at a line or word boundary, and the embedded message is repeated multiple times throughout the document, depending on the number of available sites. During data extraction, the decoder attempts to decode the embedded data from the start of every line or word boundary. This provides increased robustness against cumulative errors due to random site misclassification. The site list output by the preview-processing module consists of independently controllable sites that also satisfy the line- and paragraph-synchronization constraints. This site list also contains page orientation information, so that pixels belonging to each site may be mapped to the higher scan resolution used for copying the document.

2.2. Data embedding and retrieval from the high-resolution image

The data to be embedded in the document is first coded using an error-correcting code. The resulting bits are then scrambled so that they are dispersed uniformly across the page. This scrambling is achieved by using a dispersed-dither matrix of the kind typically used for halftoning in color printers. The ranks of a dispersed-dither matrix [5] have the property that each successive rank is located at a position in the matrix that is as far away (spatially) as possible from the locations containing all previous ranks.
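To make the rank property concrete, the sketch below builds a Bayer index matrix, a classical dispersed-dot ordered-dither array, by the usual recursive doubling. The paper does not say which dither matrix the authors use, so Bayer's construction is swapped in here purely as a familiar example; `bits_by_rank` and its wrap-around of the rank with a modulo are likewise assumptions made for illustration.

```python
def bayer_matrix(order):
    """Return the 2**order x 2**order Bayer index matrix with ranks 0..n*n-1."""
    quadrant_offset = ((0, 2), (3, 1))
    m = [[0]]
    for _ in range(order):
        n = len(m)
        # Each doubling places 4*M plus a quadrant offset in each quadrant.
        m = [[4 * m[r % n][c % n] + quadrant_offset[r // n][c // n]
              for c in range(2 * n)]
             for r in range(2 * n)]
    return m


def bits_by_rank(coded_bits, order):
    """Lay coded bits out over one dither tile, indexed by rank.

    Hypothetical mapping: the site at (r, c) carries
    coded_bits[rank % len(coded_bits)], so a short message repeats.
    """
    dither = bayer_matrix(order)
    return [[coded_bits[rank % len(coded_bits)] for rank in row]
            for row in dither]
```

In the 4x4 case the first row of ranks is 0, 8, 2, 10, so consecutive message bits (ranks 0, 1, 2, ...) land in different quadrants of the tile rather than in adjacent sites. A local smudge or misregistration therefore damages scattered bits of the coded stream, which is the kind of error pattern an error-correcting code handles well.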
Since the site list generated in the previous section has a fixed number of sites per line, all the sites can be arranged in a two-dimensional array. This array is tiled periodically by a large (512x512) dither array, and each site is assigned a rank based on the rank of the dither array and the index of the dither-array tile at that location. The rank of each site is used to index into the error-coded bit stream to determine the bit that will be embedded in the pixels belonging to the site.

During the high-resolution copy scan, data may be embedded in or extracted from the document. For data embedding, the pixel luminances are modulated according to the bit-embedding scheme described in the previous section. The degree of luminance modulation depends on the characteristics of the scanner and printer used in the copier, and is determined experimentally. For data retrieval, the average luminance of the pixels in each site is computed, and the data is retrieved according to the embedding scheme and the input site list. Figure 3 shows a portion of text in which data is embedded using the scheme presented in this paper. The sites chosen for pixel modulation are marked, and copy outputs with and without data embedding are presented to illustrate that they are virtually indistinguishable.

Errors may creep into the data retrieval process if the grid described in Section 2.1.3 is not constructed identically during the data embedding and retrieval phases. Typically, there may be small translation or scaling differences between the embedding and retrieval grids. This problem is countered by performing a multiple-grid search on the high-resolution scanned data. A series of site lists is constructed during preview processing by perturbing the segmentation parameters and moving the local coordinate systems by a couple of pixels along the horizontal and vertical directions. Message retrieval is then performed using these multiple-grid site lists.
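The multiple-grid search can be pictured as a loop over small grid offsets. The sketch below is hypothetical in several respects: the paper does not state how a successful retrieval is recognized, so the caller-supplied `decode_at` and `is_valid` callables (for instance, a decoder followed by an error-correcting-code check) are assumptions, as is trying the smallest shifts first. Perturbation of the segmentation parameters is not modelled.

```python
from itertools import product


def grid_offsets(max_shift=2):
    """All (dx, dy) shifts within max_shift pixels, smallest shifts first."""
    shifts = product(range(-max_shift, max_shift + 1), repeat=2)
    return sorted(shifts, key=lambda d: (abs(d[0]) + abs(d[1]), d))


def multi_grid_decode(decode_at, is_valid, max_shift=2):
    """Try each perturbed grid until one yields a valid message.

    decode_at(dx, dy) returns the message decoded with the grid shifted by
    (dx, dy); is_valid(message) accepts or rejects it. Both are
    caller-supplied and hypothetical. Returns (message, (dx, dy)), or
    (None, None) if no grid succeeds.
    """
    for dx, dy in grid_offsets(max_shift):
        message = decode_at(dx, dy)
        if is_valid(message):
            return message, (dx, dy)
    return None, None
```

With `max_shift=2` there are 25 candidate grids, so the search adds a small constant factor to retrieval in exchange for tolerance to the "couple of pixels" of misregistration mentioned above.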
3. Conclusions

We have presented a robust method for imperceptibly embedding data in text documents. The embedded data can also be retrieved robustly. However, this algorithm does not directly preserve previously embedded information. The only way to achieve that is to first retrieve the embedded bits and then, possibly, append a summary of the retrieved message to the current message to be embedded. This is a weakness that continues to challenge all algorithms for data hiding. A further drawback of this method is that the scanned document may not contain enough sites to embed large messages. In this case, one of a series of messages with varying site requirements may need to be provided for embedding. The number of sites available for data embedding, however, increases with scanning and printing resolution.

4. References

[1] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Systems Journal, Vol. 35, Nos. 3 & 4, pp. 313-336, 1996.
[2] H. Ancin and A. K. Bhattacharjya, "Text enhancement for laser copiers," in Proceedings of IEEE ICIP '99, Kobe, Japan, Oct. 25-28, 1999.
[3] H. Ancin, "Document segmentation for high quality printing," IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, Color Imaging: Device Independent Color, Color Hard Copy, and Graphic Arts II, pp. 360-371, February 1997.
[4] W. K. Pratt, Digital Image Processing, second edition, John Wiley & Sons, Inc., New York, 1991.
[5] R. Ulichney, Digital Halftoning, The MIT Press, Cambridge, Massachusetts, 1987.

Figure 1: Preview processing for data embedding/retrieval. (Flowchart: input image, segmentation, connected components labeling, deskew, block identification, site selection, site list.)

Figure 2: Embedding data in the high-resolution scanned image. (Flowchart: the site list, the bits to be embedded and the input image feed two stages, identify site-list pixels and modulate pixel values, producing the output image.)

Figure 3: (a) Original (scanned) text. (b) Pixels corresponding to sites that will be modulated in luminance to hide information are shown in a different color. The word "the" is magnified to show more detail. (c) Print output containing embedded data. (d) Print output without embedded data.