DATA EMBEDDING IN TEXT FOR A COPIER SYSTEM

Anoop K. Bhattacharjya and Hakan Ancin
Epson Palo Alto Laboratory
3145 Porter Drive, Suite 104, Palo Alto, CA 94304
e-mail: {anoop, ancin}@erd.epson.com

Abstract

In this paper, we present a scheme for embedding data in copies (color or monochrome) of predominantly text pages that may also contain color images or graphics. Embedding data imperceptibly in documents or images is a key ingredient of watermarking and data-hiding schemes. It is comparatively easy to hide a signal in natural images, since the human visual system is less sensitive to signals embedded in noisy regions containing high spatial frequencies. In other instances, e.g., simple graphics or monochrome text documents, additional constraints need to be satisfied to embed signals imperceptibly. Data may be embedded imperceptibly in printed text by altering some measurable property of a font, such as the position of a character or the font size. This scheme, however, is not very useful for embedding data in copies of text pages, as that would require accurate text segmentation and possibly optical character recognition, both of which would considerably degrade the error-rate performance of the data-embedding system. Similarly, other schemes that alter pixels on text boundaries perform poorly because of boundary-detection uncertainties introduced by scanner noise, sampling and blurring. The scheme presented in this paper ameliorates these problems by using a text-region-based embedding approach. Since the bulk of documents reproduced today contain black-on-white text, this data-embedding scheme can form a print-level layer in applications such as copy tracking and annotation.

1. Introduction

In this paper, we present a method for embedding or hiding information in predominantly text document copies, such that the embedded signal is visually imperceptible. The method is also applicable to originals containing color graphics and images in addition to text. A number of methods have been proposed for embedding signals in images of natural scenes [1]. Data may be embedded imperceptibly in printed text by altering some measurable property of a font, such as the position of a character or the font size. This scheme, however, is not very useful for embedding data in copies of text pages, as that would require accurate text segmentation and possibly optical character recognition using the document copy, both of which would considerably degrade the error-rate performance of the data-embedding system. Similarly, other schemes that alter pixels on text boundaries perform poorly because of boundary-detection uncertainties introduced by scanner noise, sampling and blurring. Another approach is to embed the data to be hidden in the halftoning patterns used by the printer to generate a copy, but this works best for documents that contain natural images or continuous-tone content. Many printers today employ halftone patterns for printer tracking. However, these systems are inadequate for copy-tracking applications that may require additional annotation, such as a copier serial number or a user identification. Since a large percentage of reproduced documents consist of black-and-white text, there is a need for schemes that can hide data imperceptibly in copies of such pages. In the scheme presented in this paper, we identify small (sub-character-sized) regions that consist mainly of pixels meeting the text-character-part criteria described below, and embed data by modulating the lightness of these regions.
Although the method relies on the existence of these regions, it does not rely on these regions actually representing parts of text characters. While the variations in lightness do not affect perceived text quality, they can be picked up easily by a scanner and decoded to retrieve the message. The robustness of the scheme is improved by using an error-correcting code coupled with a bit-dispersal scheme that spreads the message bits throughout the document. The data-embedding and retrieval steps are presented in the following sections.

2. The data embedding and retrieval system

This section presents the steps by which data is embedded into and retrieved from the copy of a text document.

The processing requires two scans of the original document. The first is a preview scan, at a lower resolution, that is used to identify the various components of the document and to establish a coordinate system based on the paragraphs, lines and words found in the document. The second scan is a full-resolution scan that is used to generate the document copy. The data from this scan is processed together with the results of the preview scan to embed or retrieve the embedded message. As part of a copier pipeline, this data may then be sent for printing. The principal steps of the preview processing are shown in Figure 1. Once a site list is obtained from an analysis of the preview, the bits to be embedded are used to modulate the pixel intensities in the scanned image, in regions determined by the site list. Details of the preview-processing and data-embedding steps are provided in the following sections.

2.1. Preview processing

Before performing the copy scan, the copier performs a preview scan to determine candidate sites in the text document for embedding data. This scan is typically of a lower resolution than the scan used for making a copy, so that the memory and processing requirements of the preview scan are minimized. In this paper, the preview scan is assumed to be half or a third of the copy-scan resolution. The preview image is first segmented into regions that approximately correspond to text, image and background regions.

2.1.1. Image segmentation

Image segmentation is a two-step process. First, the pixels are classified based on their luminance and color-saturation values. Pixels with low luminance and low saturation are classified as text, those with high luminance and low saturation are classified as background, and the remaining pixels are classified as image pixels [2]. These labels may be further refined using run-length information as described in [3]; however, most documents do not require this level of sophistication for adequate initial segmentation. A morphological filter is used to delete very small and very large regions of connected text labels, and the pixels corresponding to the deleted text labels are marked as unknown. The binary image comprising the text and non-text pixels is analyzed further to establish a rotation- and translation-invariant reference frame for the document.

2.1.2. Connected-components labeling, deskewing and block identification

A connected-components algorithm [4] is used to identify connected regions of text pixels. Text-label components with areas and lengths that are smaller or larger than preset thresholds are deleted, and the corresponding pixels are marked as non-text. Very long components are excluded as potential sites because they are susceptible to greater cumulative registration errors during data extraction. The components that survive this step are used to determine the skew angle of the document and thereby establish the orientation of the page. The page orientation is established with a Hough-transform technique using the following steps. First, the components are grouped into a hierarchical structure based on inter-component distance. This hierarchy groups the components into characters, words, lines and paragraphs, by calculating the distance between the elements of a group at a given level. Individual characters form the lowest level in the hierarchy; they correspond simply to the connected components themselves.
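A minimal sketch of the pixel classification of Section 2.1.1 and the component labeling of Section 2.1.2, using NumPy and SciPy. The luminance/saturation thresholds and the component-size limits are illustrative assumptions, not values from the paper, and the morphological cleanup is folded into a single area test for brevity.

import numpy as np
from scipy import ndimage


def segment_preview(rgb):
    """Classify preview pixels as text / background / image, then label the text components."""
    rgb = rgb.astype(np.float32) / 255.0
    # Approximate luminance and a simple saturation measure (channel spread).
    luminance = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    saturation = rgb.max(axis=-1) - rgb.min(axis=-1)

    low_sat = saturation < 0.2                 # assumed saturation threshold
    text = (luminance < 0.4) & low_sat         # dark, unsaturated pixels -> text
    background = (luminance > 0.8) & low_sat   # bright, unsaturated pixels -> background
    image = ~(text | background)               # everything else -> image pixels

    # Connected-components labeling of the text mask; drop components whose
    # pixel area is too small (noise) or too large (graphics, rules).
    labels, n = ndimage.label(text)
    areas = ndimage.sum(text, labels, index=np.arange(1, n + 1))
    keep = np.flatnonzero((areas >= 10) & (areas <= 5000)) + 1   # assumed limits
    text_clean = np.isin(labels, keep)
    return text_clean, background, image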
Note that, with this definition of characters, they may not correspond to actual text characters; a text character may be composed of multiple components, or multiple text characters may fuse into a single component. However, while this misclassification would affect character recognition, it does not affect the skew-detection and data-embedding problems. The median component height is used as a length scale to group components into word, line and paragraph elements. Words are formed as groups of characters that are closer than a preset inter-word distance, determined as a fixed proportion of the median component height. Similarly, a preset inter-line distance is used to group words into lines. Paragraphs are determined by two methods: the first uses indentation of the first word in a line to find paragraphs, and the second looks for lines separated by more than a preset inter-paragraph distance.

Once the page has been described as a collection of words, lines and paragraphs, the centroids of all the components in a given line are used to determine its orientation. This is performed by applying a Hough transform to the family of straight lines defined by the centroids of the components belonging to the same line grouping. Since the page orientation obtained in this manner is symmetric with respect to horizontal and vertical reflections, the retrieval algorithm needs to monitor two scan directions to retrieve an embedded bit stream. This ensures that if the page is rotated by 180 degrees on the scanner bed, the embedded message can still be retrieved. Once the page orientation is known, the page is deskewed, and the bounding boxes of all the components belonging to a character, word, line or paragraph grouping, as described above, are used to define character, word, line and paragraph boxes, respectively. The paragraph boxes are used to define multiple coordinate frames, one for each paragraph, for the entire document. With the coordinate/reference frames established, the next step is the identification of sites for embedding the hidden message.
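A minimal sketch of the Hough-style skew estimation from component centroids described above. For brevity it votes with all centroids at once rather than per line grouping, and the angle range and bin sizes are illustrative assumptions.

import numpy as np


def estimate_skew(centroids, angle_range_deg=5.0, angle_step_deg=0.1, rho_step=2.0):
    """Return the dominant text-line inclination (degrees) from an array of (x, y) centroids."""
    pts = np.asarray(centroids, dtype=np.float64)
    thetas = np.deg2rad(np.arange(-angle_range_deg, angle_range_deg + 1e-9, angle_step_deg))
    # For a line of inclination theta through a point (x, y), the signed offset is
    # rho = y*cos(theta) - x*sin(theta); collinear centroids share the same rho.
    rho = pts[:, 1, None] * np.cos(thetas) - pts[:, 0, None] * np.sin(thetas)
    bins = np.round(rho / rho_step).astype(np.int64)
    bins -= bins.min()

    # Hough-style voting: the angle whose most popular offset bin collects the most
    # centroids is taken as the page skew.
    best_theta, best_votes = 0.0, -1
    for j in range(len(thetas)):
        votes = np.bincount(bins[:, j]).max()
        if votes > best_votes:
            best_votes, best_theta = votes, thetas[j]
    return np.rad2deg(best_theta)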

2.1.3. Site selection

Sites for intensity modulation are determined by one of two methods. The first uses a coordinate system associated with each paragraph or line element to embed the data. If a paragraph block is used to establish the local coordinate frame, the pixels in each paragraph block are partitioned into a fine square grid with 3x3 pixels in each grid cell (partition). The sites in which data will be embedded are chosen from among these grid cells. Site selection proceeds as follows. First, the grid cells that contain predominantly text-type pixels are identified. To perform this selection, the 90th percentile of the luminance histogram of all text components is chosen as a threshold. Any grid cell that contains more than a preset percentage of pixels below this threshold is marked as a candidate site for data embedding. Data is embedded in these sites by modulating the luminance of all pixels belonging to a candidate site's cell.

The second method for site selection uses a local coordinate frame associated with characters that have long strokes. Such strokes are detected using a morphological operator. The height of the stroke provides a scale-independent coordinate system for modulating pixel intensities at locations along the stroke defined by this local coordinate system.

Two or more candidate sites are required to embed each bit. For example, a bit may be embedded in two sites using the following scheme: if the difference between the average luminance of the pixels belonging to the current site and that of the next site is positive, the bit is a 1; if the difference is negative, the bit is a 0. Similar difference-based schemes may be used to embed a single bit in three or more sites. For example, a bit may be embedded in three sites using average grid-cell luminance differences as follows: if the first difference is positive and the second is negative, the bit is a 1; if the first difference is negative and the second is positive, the bit is a 0. The number of independently controllable sites available for bit embedding is extracted from the candidate-site list based on the number of sites required to embed a bit.

A line or word synchronization scheme is used to minimize cumulative errors due to site-identification errors. In this scheme, message words are always embedded starting at a line or word boundary, and the embedded message is repeated multiple times in the document, depending on the number of available sites. During data extraction, the decoder attempts to decode the embedded data from the start of every line or word boundary. This provides increased robustness with respect to cumulative errors due to random site misclassification. The site list output by the preview-processing module consists of independently controllable sites that also satisfy the line- and paragraph-synchronization constraints. This site list also contains page-orientation information so that the pixels belonging to each site may be mapped to the higher scan resolution used for copying the document.

2.2. Data embedding and retrieval from the high-resolution image

The data to be embedded in the document is first coded using an error-correcting code. The resulting bits are then scrambled so that they are dispersed uniformly across the page. This scrambling is achieved by using a dispersed-dither matrix of the kind typically used for halftoning in color printers.
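A minimal sketch of the grid-based site selection and the two-site, difference-based bit embedding of Section 2.1.3. The cell size and 90th-percentile threshold follow the description above, but the fill-percentage criterion, the modulation rule, and the depth `delta` are illustrative assumptions (the paper determines the modulation depth experimentally for a given scanner/printer pair).

import numpy as np


def select_sites(luma, text_mask, cell=3, fill_frac=0.7):
    """Return grid cells (as slice pairs) whose pixels are predominantly dark text pixels."""
    threshold = np.percentile(luma[text_mask], 90)   # 90th percentile of text-pixel luminance
    sites = []
    h, w = luma.shape
    for r in range(0, h - cell + 1, cell):
        for c in range(0, w - cell + 1, cell):
            block = luma[r:r + cell, c:c + cell]
            if np.mean(block < threshold) >= fill_frac:   # mostly sub-threshold pixels
                sites.append((slice(r, r + cell), slice(c, c + cell)))
    return sites


def embed_bits(luma, sites, bits, delta=6.0):
    """Embed one bit per consecutive pair of sites via the sign of their mean-luminance difference."""
    out = luma.astype(np.float32).copy()
    for k, bit in enumerate(bits):
        a, b = sites[2 * k], sites[2 * k + 1]
        diff = out[a].mean() - out[b].mean()
        need = delta if bit else -delta         # positive difference -> 1, negative -> 0
        adjust = (need - diff) / 2.0            # force the pair difference to the target value
        out[a] += adjust
        out[b] -= adjust
    return np.clip(out, 0, 255)


def extract_bits(luma, sites, n_bits):
    """Recover bits from the sign of each site pair's mean-luminance difference."""
    luma = luma.astype(np.float32)
    return [int(luma[sites[2 * k]].mean() > luma[sites[2 * k + 1]].mean())
            for k in range(n_bits)]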
The ranks of a dispersed-dither matrix [5] have the property that each successive rank is located at a position in the matrix that is as far away (spatially) as possible from the locations containing all previous ranks. Since the site list generated in the previous section has a fixed number of sites per line, all the sites can be arranged in a two-dimensional array. This array is tiled periodically by a large (512x512) dither array, and each site is assigned a rank based on the rank of the dither array and the index of the dither-array tile at that location. The rank of each site is used to index into the error-coded bit stream to determine the bit that will be embedded in the pixels belonging to that site.

During the high-resolution copy scan, data may be embedded into or extracted from the document. For data embedding, the pixel luminances are modulated according to the bit-embedding scheme described in the previous section. The degree of luminance modulation depends on the characteristics of the scanner and printer used in the copier and is determined experimentally. For data retrieval, the average luminance of the pixels in each site is computed, and the data is retrieved according to the embedding scheme and the input site list. Figure 3 shows a portion of text in which data has been embedded using the scheme presented in this paper. The sites chosen for pixel modulation are marked, and copy output with and without embedded data is presented to show that the two are virtually indistinguishable.

Errors may creep into the data-retrieval process if the grid described in Section 2.1.3 is not constructed identically during the data-embedding and retrieval phases. Typically, there may be small translation or scaling differences between the embedding and retrieval grids. This problem is countered by performing a multiple-grid search on the high-resolution scanned data. A series of site lists is constructed during preview processing by perturbing the segmentation parameters and moving the local coordinate systems by a couple of pixels along the horizontal and vertical directions. Message retrieval is then performed using these multiple-grid site lists.
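A minimal sketch of the rank-based bit dispersal of Section 2.2. The variable `dither` stands in for a dispersed-dither matrix; a small random permutation is used purely as a placeholder so the example is self-contained, and the way the tile index and within-tile rank are combined into a single global rank is one plausible reading of the description above rather than the paper's exact rule.

import numpy as np


def assign_bits_to_sites(n_rows, n_cols, coded_bits, dither=None, seed=0):
    """Return an (n_rows, n_cols) array giving, for each site, the coded bit to embed there."""
    if dither is None:
        # Placeholder for a real 512x512 dispersed-dither matrix.
        rng = np.random.default_rng(seed)
        dither = rng.permutation(64 * 64).reshape(64, 64)

    dh, dw = dither.shape
    rows = np.arange(n_rows)[:, None]
    cols = np.arange(n_cols)[None, :]

    # Rank of the dither cell at each site position (the dither matrix tiles the site array).
    rank = dither[rows % dh, cols % dw]
    # Index of the tile, so that ranks remain distinct across repeated tiles.
    n_tile_cols = (n_cols - 1) // dw + 1
    tile = (rows // dh) * n_tile_cols + cols // dw
    global_rank = tile * dither.size + rank

    # The global rank indexes (cyclically) into the error-coded bit stream, which also
    # realizes the repetition of the message across the available sites.
    coded_bits = np.asarray(coded_bits)
    return coded_bits[global_rank % coded_bits.size]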

3. Conclusions

We have presented a robust method for imperceptibly embedding data in text documents. The embedded data can also be retrieved robustly. However, the algorithm does not directly preserve previously embedded information. The only way to achieve that is to first retrieve the embedded bits and then, possibly, append a summary of the retrieved message to the current message to be embedded. This is a weakness that continues to challenge all data-hiding algorithms. A further drawback of the method is that the scanned document may not contain enough sites to embed large messages. In this case, one of a series of messages with varying site requirements may need to be provided for embedding. The number of sites available for data embedding, however, increases with scanning and printing resolution.

4. References

[1] W. Bender, D. Gruhl, N. Morimoto, and A. Lu, "Techniques for data hiding," IBM Systems Journal, Vol. 35, Nos. 3 & 4, pp. 313-336, 1996.

[2] H. Ancin and A. K. Bhattacharjya, "Text enhancement for laser copiers," in Proceedings of IEEE ICIP '99, Kobe, Japan, Oct. 25-28, 1999.

[3] H. Ancin, "Document segmentation for high quality printing," IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, Color Imaging: Device Independent Color, Color Hard Copy, and Graphic Arts II, pp. 360-371, February 1997.

[4] W. K. Pratt, Digital Image Processing, second edition, John Wiley & Sons, Inc., New York, 1991.

[5] R. Ulichney, Digital Halftoning, The MIT Press, Cambridge, Massachusetts, 1987.

Figure 1: Preview processing for data embedding/retrieval (input image, segmentation, connected-components labeling, deskew, block identification, site selection, site list).

Figure 2: Embedding data in the high-resolution scanned image (the site list and the bits to be embedded are used to identify the site-list pixels in the input image and modulate their values to produce the output image).

Figure 3: (a) Original (scanned) text. (b) Pixels corresponding to sites whose luminance will be modulated to hide information, shown in a different color; the word "the" is magnified to show more detail. (c) Print output containing embedded data. (d) Print output without embedded data.