1 1398 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 Information Retrieval in Document Image Databases Yue Lu and Chew Lim Tan, Senior Member, IEEE Abstract With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how a word image is relevant to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval. Index Terms Document image retrieval, partial word image matching, primitive string, word searching, document similarity measurement. æ 1 INTRODUCTION MODERN technology has made it possible to produce, process, store, and transmit document images efficiently. In an attempt to move toward the paperless office, large quantities of printed documents are digitized and stored as images in databases. The popularity and importance of document images as an information source are evident. As a matter of fact, many organizations currently use and are dependent on document image databases. However, such databases are often not equipped with adequate index information. This makes it much harder to retrieve user-relevant information from image data than from text data. Thus, the study of information retrieval in document image databases is an important subject in knowledge and data engineering. 1.1 Research Background Millions of digital documents are constantly transmitted from one point to another over the Internet. The most common format of these digital documents is text, in which characters are represented by machine-readable codes. The majority of computer generated documents are in this format. On the other hand, to make billions of volumes of traditional and legacy documents available and accessible on the Internet, they are scanned and converted to digital images using digitization equipment. Although the technology of Document Image Processing (DIP) may be utilized to automatically convert the digital images of these documents to the machine-readable. Y. Lu is with the Department of Computer Science and Technology, East China Normal University, Shanghai , China. ylu@cs.ecnu.edu.cn.. C.L. Tan is with the Department of Computer Science, School of Computing, National University of Singapore, Kent Ridge, Singapore tancl@comp.nus.edu.sg. Manuscript received 8 Sept. 2003; revised 1 Feb. 
2004; accepted 25 Feb For information on obtaining reprints of this article, please send to: tkde@computer.org, and reference IEEECS Log Number TKDE text format using Optical Character Recognition (OCR) technology, it is typically not a cost effective and practical way to process a huge number of paper documents. One reason is that the technique of layout analysis is still immature in handling documents with complicated layouts. Another reason is that OCR technology still suffers inherent weaknesses in its recognition ability, especially with document images of poor quality. Interactive manual correction/proofreading of OCR results is usually unavoidable in most DIP systems. As a result, storing traditional and legacy documents in image format has become the alternative in many cases. Nowadays, we can find on the Internet many digital documents in image format, including scanned journal/conference papers, student theses, handbooks, out-of-print books, etc. Moreover, many digital libraries and Web portals such as MEDLINE, ACM, IEEE, etc., keep scanned document images without their text equivalents. Many tools have been developed for retrieving information from text format documents, but they are not applicable to image format documents. There is therefore a need to study the strategies of retrieving information from document images. For example, a user facing a large number of imaged documents on the Internet has to download each one to see its contents before knowing whether the document is relevant to his/her interest, e.g., by looking for keywords. Obviously, it is of practical value to study a method that is capable of notifying the user, prior to downloading, whether a document image contains information of interest to him/her. In recent years, there has been much interest in the research area of Document Image Retrieval (DIR) [1], [2]. DIR is relevant to document image processing (DIP), but there are some essential differences between them. A DIP system needs to analyze different text areas in a page document, understand the relationship among these text areas, and then convert them to a machine-readable version using OCR, in which each character object is assigned to a certain class. Many commercial OCR systems, which claim /04/$20.00 ß 2004 IEEE Published by the IEEE Computer Society

2 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1399 to achieve a recognition rate of above 99 percent for single characters, have been introduced in the market over the past few decades. It is obvious that a document image processing system can be applied for the purpose of document image retrieval. For example, a DIP system is used to automatically convert a document image to machine-readable text codes first, and traditional information retrieval strategies [3], [4] are then applied. However, the performance of a DIP system relies heavily on the quality of the scanned images. It deteriorates severely if the images are of poor quality or complicated layout. It is generally acknowledged that the recognition accuracy requirements of document image retrieval methods are considerably lower than those for many document processing applications [5]. Motivated by this observation, some retrieval methods with the ability to tolerate the recognition errors of OCR have been recently researched [6], [7], [8], [9]. Additionally, some methods are reported to improve retrieval performance by using OCR candidates [10], [11]. It is a fact that such information retrieval methods based on the technology of document image processing are still the best so far because a great deal of research has been devoted to the area, and numerous algorithms have been proposed in the past decades, especially in OCR research. However, we argue that DIR, which bypasses OCR, still has its practical value today. The main question that a DIR system seeks to answer is whether a document image contains words which are of interest to the user, while paying no attention to unrelated words. In other words, a DIR system provides an answer of yes or no with respect to the user s query, rather than the exact recognition of a character/word as in DIP. Moreover, words, rather than characters, are the basic units of meaning in information retrieval. Therefore, directly matching word images in a document image is an alternative way to retrieve information from the document. In short, DIR and DIP address different needs and have different merits of their own. The former, which is tailored for directly retrieving information from document images, could achieve a relatively higher performance in terms of recall, precision, and processing speed. 1.2 Previous Related Work In recent years, a number of attempts have been made by researchers to avoid the use of character recognition for various document image retrieval applications. For example, Chen and Bloomberg [12], [13] described a method for automatically selecting sentences and key phrases to create a summary from an imaged document without any need for recognition of the characters in each word. They built word equivalence classes by using a rank blur hit-miss transform to compare word images and using a statistical classifier to determine the likelihood of each sentence being a summary sentence. Liu and Jain [14] presented an approach to image-based form document retrieval. They proposed a similarity measure for forms that is insensitive to translation, scaling, moderate skew, and image quality fluctuations, and developed a prototype form retrieval system based on their proposed similarity measure. 
Niyogi and Srihari [15] described an approach to retrieve information from document images stored in a digital library by means of knowledge-based layout analysis and logical structure derivation techniques, in which significant sections of documents, such as the title, are utilized. Tang et al. [16] proposed methods for automatic knowledge acquisition in document images by analyzing the geometric structure and logical structure of the images. In the domain of Chinese document image retrieval, He et al. [17] proposed an index and retrieval method based on the stroke density of Chinese characters. Sptiz described character shape codes for duplicate document detection [18], information retrieval [19], word recognition [20], and document reconstruction [21], without resorting to character recognition. Character shape codes encode whether or not the character in question fits between the baseline and the x-line or, if not, whether it has an ascender or descender, and the number and spatial distribution of the connected components. To get character shape codes, character cells must first be segmented. This method is therefore unsuitable for dealing with words with connected characters. Additionally, it is lexicon-dependent, and its performance is somewhat affected by the appropriateness of the lexicon to the document being processed. Yu and Tan [22], [23] proposed a method for text retrieval from document images without the use of OCR. In their method, documents are segmented into character objects, whose image features are utilized to generate document vectors. Then, the text similarity between documents is measured by calculating the dot product of the document vectors. In the work of both Spitz and Tan et al., character segmentation is a necessary step before downstream processes can be performed. In many cases, especially with document images of poor quality, it is not easy to separate connected characters. Moreover, the word, rather than the character, is normally a basic unit of useful meaning in document image retrieval. To avoid the difficulties of separating touching adjacent characters in a word image, segmentation-free methods have been researched in some document image retrieval systems, in which image matching at the word level is commonly utilized. A segmentation-free word image matching approach treats each word object as a single, indivisible entity, and attempts to recognize it using features of the word as a whole. For example, Ho et al. [24] proposed a word recognition method based on word shape analysis without character segmentation and recognition. Every word is first partitioned into a fixed 4 10 grid. Features are extracted and their relative locations in the grid are recorded in a feature vector. The city-block distance is used to compare the feature vector of an input word and that of a word in a given lexicon. This method can only be applied to a fixed vocabulary word recognition system and has no ability of partial word matching. Some approaches have been reported in the past years for searching keywords in handwritten [25], [26], [27] and printed [28], [29], [30], [31], [32] documents. In most of these approaches, the Hidden Markov Model and Viterbi dynamic programming algorithm are widely used to recognize keywords. However, these approaches have their inherent disadvantages. Although DeCurtins and Chen s approach [28] and Chen et al. 
s approach [30], [31] are capable of searching a word portion, they can only process predefined keywords and a pretraining procedure is necessary. In our previous word searching system [32], a weighted Hausdorff distance is used to measure the dissimilarity between word images. Although the system

3 1400 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 can cope with any word the user specifies, it cannot deal with the problem of partial matching. For instance, the approach is unable to search the words undeveloped and development when the user keys in develop as the query word because the three words are treated as different from the standpoint of word image classification. Obviously, such a classification is not practical from the viewpoint of document understanding by human beings. In the compressed document images domain, Hull [33] proposed a method to detect equivalent document images by matching the pass mode codes extracted from CCITT compressed document images. He created a feature vector that counts the number of pass mode codes in each cell of a fixed grid in the image, and equivalent images are located by applying the Hausdorff distance to feature vectors. In our previous paper [34], we also presented a method for measuring the similarity between two CCITT group 4 compressed document images. 1.3 Current Work In this paper, we propose an improved document image retrieval approach with the ability of matching partial word images. First, word images are represented by feature strings. A feature string matching method based on dynamic programming is then utilized to evaluate the similarity between two feature strings or a part of them. By using the technique of inexact feature string matching, the approach can successfully handle not only kerning, but also words with heavily touching characters. The underlying idea of the proposed partial word image matching technique is somewhat similar to that of word recognition based on Hidden Markov Models. However, inexact string matching has the advantage of not requiring any training prior to use. We have tested our technique in two applications of document image retrieval. One application is to search for user-specified words and their variations in document images (word spotting). The user s query word and a word image object extracted from documents are first represented by two respective feature strings. Then, the similarity between the two feature strings is evaluated. Based on the evaluation, we can estimate how the document word image is relevant to the user-specified word. For example, when the user keys in the query word string. our method detects words such as strings and substring. Our experiments on real document images, including scanned books, student theses, journal/conference papers, etc., show promising performance by our proposed approach to word spotting in document images. The other application is content-based similarity measurement between two document images. The technique of partial word image matching is applied to assign all word objects from two document images to corresponding classes, except potential stop words, which are excluded. Document vectors are built using the occurrence frequencies of the word object classes, and the pair-wise similarity of two document images is calculated by the scalar product of the document vectors. We have used nine group articles pertaining to different domains to test the validity of our approach. The results show that our proposed method based on partial word image matching outperforms the method based on entire word image matching, with the F 1 rating improved. Fig. 1. Different spacing: (a) separated adjacent characters, (b) overlapped adjacent characters, and (c) touching adjacent characters. 
The remainder of this paper is organized as follows: Section 2 presents the feature string description for word images. Section 3 describes inexact matching between two feature strings. Section 4 gives the experimental results that demonstrate the effectiveness of the proposed partial word image matching approach in word spotting. Section 5 presents the results demonstrating the effectiveness of the proposed partial word image matching in document similarity measurement. Section 6 concludes the paper with a summary of the contributions of our proposed technique and an indication of the directions for future work. 2 FEATURE STRING DESCRIPTION FOR WORD IMAGE It is presumed that word objects are extracted from document images via some image processing such as skew estimation and correction, connected component labeling, word bounding, etc. The feature employed in our approach to represent word bitmap images is the Left-to-Right Primitive String (LRPS), which is a code string sequenced from the leftmost of a word to its rightmost. Line and traversal features are used to extract the primitives of a word image. A word printed in documents can be in various sizes, fonts, and spacings. When we extract features from word bitmaps, we have to take this fact into consideration. Generally speaking, it is easy to find a way to cope with different sizes. But, dealing with multiple fonts and the touching characters caused by condensed spacing is still a challenging issue. For a word image in a printed text, two characters could be spaced apart by a few white columns caused by intercharacter spacing, as shown in Fig. 1a. But, it is also common for one character to overlap another by a few columns due to kerning, as shown in Fig. 1b. Worse still, as shown in Fig. 1c, two or more adjacent characters may touch each other due to condensed spacing. We are thus faced with the challenge of separating such touching characters. We shall utilize inexact feature string matching to handle the problem. 2.1 LRPS Feature Representation A word is explicitly segmented, from the leftmost to the rightmost, into discrete entities. Each entity, called a primitive here, is represented using definite attributes. A primitive p is described using a two-tuple (,!), where is the Line-or-Traversal Attribute (LTA) of the primitive, and! is the Ascender-and-Descender Attribute (ADA). As a result, the word image is expressed as a sequence P of p i s P ¼< p 1 p 2...p n >¼< ð 1 ;! 1 Þð 2 ;! 2 Þ...ð n ;! n Þ >; ð1þ

4 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1401 Fig. 2. Primitive string extraction (a) straight stroke line features, (b) remainder part of (a), (c) traversal T N ¼ 2, (d) traversal T N ¼ 4, and (e) traversal T N ¼ 6. where the ADA of a primitive! 2 = { x, a, A, D, Q }, which are defined as:. x : The primitive is between the x-line (an imaginary line at the x-height running parallel with the baseline, as shown in Fig. 2) and the baseline.. a : The primitive is between the top boundary and the x-line.. A : The primitive is between the top boundary and the baseline.. D : The primitive is between the x-line and the bottom boundary.. Q : The primitive is between the top-boundary and the bottom boundary. The definition of x-line, baseline, top boundary, and bottom boundary may be found in Fig. 2. A word bitmap extracted from a document image already contains the information of the baseline and x-line, which is a byproduct of the text line extraction from the previous stage. 2.2 Generating Line-or-Traversal Attribute LTA generation consists of two steps. We extract the straight stroke line feature from the word bitmap first, as shown in Fig. 2a. Note that only the vertical stroke lines and diagonal stroke lines are extracted at this stage. Then, the traversal features of the remainder part are computed. Finally, the features from the above two steps are aggregated to generate the LTAs of the corresponding primitives. In other words, the LTA of a primitive is represented by either a straight-line feature (if applicable) or a traversal feature Straight Stroke Line Feature A run-length-based method is utilized to extract straight stroke lines from word images. We use Rða; Þ to represent a directional run, which is defined as a set of black concatenating pixels that contains pixel a, along the specified direction. jrða; Þj is the run length of Rða; Þ, which is the number of black points in the run. The straight stroke line detection algorithm is summarized as follows: 1. Along the middle line (between the x-line and baseline), detect the boundary pair ½A l ;A r Š of each stroke, where A l and A r are the left and right boundary points, respectively. 2. Detect the midpoint A m of a line segment A l A r. 3. Calculate RðA m ;Þ for different s, from which we select max as the A s s run direction. 4. If jrða m ; max Þj is near or larger than the x-height, the pixels containing A m, between the boundary points A l and A r along the direction max, are extracted as a stroke line. In the example of Fig. 2, the stroke lines are extracted as in Fig. 2a, while the remainder is as in Fig. 2b. According to its direction, a line is categorized as one of three basic stroke lines: vertical stroke line, left-down diagonal stroke line, and right-down diagonal stroke line. According to the type of stroke lines, three basic primitives are generated from extracted stroke lines. Meanwhile, their ADAs are assigned categories based on their top-end and bottom-end positions. Their LTAs are respectively expressed as:. l : Vertical straight stroke line, such as in the characters l, d, p, q, D, P, etc. For the primitive whose ADA is x or D, we further check whether there is a dot over the vertical stroke line. If the answer is yes, the LTA of the primitive is reassigned as i.. v : Right-down diagonal straight stroke line, such as in the characters v, w, V, W, etc.. w : Left-down diagonal straight stroke line, such as in the characters v, w, z, etc. 
For the primitive whose ADA is x or A, we further check whether there are two horizontal stroke lines connected with it at the top and bottom. If so, the LTA of the primitive is reassigned as z. Additionally, it is easy to detect primitives containing two or more straight stroke lines. They are:. x : One left-down diagonal straight stroke line crosses one right-down diagonal straight stroke line.. y : One left-down diagonal straight stroke line meets one right-down diagonal straight stroke line at its middle point.. Y : One left-down diagonal stroke line, one rightdown diagonal stroke line and one vertical stroke line cross at one point, like the character Y.. k : One left-down diagonal stroke line, one rightdown diagonal stroke line, and one vertical stroke line meet at one point, like the characters k and K Traversal Feature After the primitives based on the stroke line features are extracted as described above, the primitives of the remainder part in the word image are computed based on traversal features. To extract traversal features, we scan the word image column by column, and the traversal number T N is recorded by counting the number of transitions from black pixel to white pixel, or vice versa, along each column. This process is not carried out on the part represented by the stroke line features described above. According to the value of T N, different feature codes are assigned as follows:

5 1402 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 TABLE 1 Primitive String Tokens of Characters Fig. 3. Different fonts.. & : There is no image pixel in the column (T N ¼ 0). It corresponds to the blank intercharacter space. We treat intercharacter space as a special primitive. In addition, the overlap of adjacent characters caused by kerning is easily detected by analyzing the relative positions of the adjacent connected components. We can insert a space primitive between them in this case. If T N ¼ 2, two parameters are utilized to assign it a feature code. One is the ratio of its black pixel number to x- height,. The other is its relative position with respect to the x-line and the baseline, ¼ D m =D b, where D m is the distance from the topmost stroke pixel in the column to the x-line and D b is the distance from the bottommost stroke pixel to the baseline.. n : <0:2 and <0:3,. u : <0:2 and >3, and. c : >0:5 and 0:5 <<1:5. If T N 4, the feature code is assigned as:. o : T N ¼ 4,. e : T N ¼ 6, and. g : T N ¼ 8. Then, the same feature codes in consecutive columns are merged and represented by one primitive. Note that a few columns may possibly have no resultant feature codes because they do not meet the requirements of the various eligible feature codes described above, which is usually the result of noise. Such columns are eliminated straightaway. As illustrated in Fig. 2, the word image is decomposed into one part with stroke lines (as in Fig. 2a) and other parts with different traversal numbers (Fig. 2c for T N ¼ 2, Fig. 2d for T N ¼ 4, and Fig. 2e for T N ¼ 6). The primitive string is composed by concatenating the above generated primitives from the leftmost to the rightmost. The resultant primitive string of the word image in Fig. 2 is: (n,x)(l,x)(u,x)(o,x) (l,x)(u,x)(o,x)(l,x)(o,x)(n,x)(o,x)(l,x)(u,x)(&,&)(o,a) (l,a)(o,x)(l,x)(n,x)(&,&)(c,x)(e,x)(o,x)(&,&)(o,x)(e,x)(l,x)(u,x) (o,a)(l,a)(&,&)(n,x)(l,a)(o,x)(o,a)(l,a)(o,x)(n,x)(o,x) (l,x)(u,x)(&,&)(y,d). 2.3 Postprocessing To be able to deal with different fonts, primitives should be independent of typefaces. Among various fonts, the significant difference that impacts the extraction of an LRPS is the presence of serif, especially at the parts expressed by traversal features. Fig. 3 gives some examples of where serif is present and others, where it is not. It is a basic necessity to avoid the effect of serif in LRPS representation. Our observation shows that some fonts such as Times Roman contain serifs as opposed to serifless fonts such as Arial. Such serifs do not add on any distinguishing features to the characters. Primitives produced by such serifs can thus be eliminated by analyzing its preceding and succeeding primitives. For instance, a primitive (u,x) in a primitive subsequence (l,x)(u,x)(&,&) is normally generated by a right-side serif of characters such as a, h, m, n, u, etc. Therefore, we can remove the primitive (u,x) from a primitive subsequence (l,x)(u,x)(&,&). Similarly, a primitive (o,x) in a primitive subsequence (n,x)(o,x)(l,x) is normally generated by a serif of characters such as h, m, n, etc. Therefore, we can directly eliminate the primitive (o,x) from a primitive subsequence (n,x)(o,x)(l,x). More postprocessing rules can be used to eliminate the primitives caused by serif. With this postprocessing, the primitive string of the word in Fig. 
2 becomes: (l,x)(u,x) (l,x)(u,x)(o,x)(l,x)(n,x)(l,x)(&,&)(l,a)(o,x)(l,x)(&,&)(c,x)(e,x) (o,x)(&,&)(o,x)(e,x)(l,x)(l,a)(&,&)(n,x)(l,a)(o,x)(l,a)(n,x) (l,x)(&,&)(y,d). Based on the feature extraction described above, we can give each character a standard primitive string token (PST). For example, the primitive string token of the character b is (l,a)(o,x)(c,x), and that of the character p is (l,d)(o,x)(c,x). Table 1 lists the primitive string tokens of all characters in the alphabet. The primitive string of a word can be generated by synthesizing the primitive string token of each character in the word and inserting a special primitive (&,&) among them to indicate a spacing gap. Generally speaking, the resulting primitive string of a real image is not as perfect as that synthesized from the standard PST of corresponding characters; this is due to a

6 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1403 multitude of factors such as the connection between adjacent characters, the noise effect, etc. As shown in Fig. 2, the primitive substring with respect to the character h is changed from (l,a)(n,x)(l,x) to (l,a)(o,x)(l,x) because of the effect of noise. The touching characters al and th have also affected the corresponding primitive substrings. In the next section, we shall consider the technique of inexact matching for robust matching. 3 INEXACT FEATURE MATCHING Based on the processing described above, each word image is described by a primitive string. The word searching problem can then be stated as finding a particular sequence/subsequence in the primitive string of a word. The procedure of matching word images then becomes a measurement of the similarity between the string A ¼ ½a 1 ;a 2 ;...;a n Š representing the features of the query word and the string B ¼½b 1 ;b 2 ;...;b m Š representing the features of a word image extracted from a document. Matching partial words becomes evaluating the similarity between the feature string A with a subsequence of the feature string B. For example, the problem of matching the word health with the word unhealthy in Fig. 2 is to find whether there exists the subsequence A=(l,A)(n,x)(l,x)(&,&)(c,x)(e,x) (o,x)(&,&)(o,x)(e,x)(l,x)(&,&)(l,a)(&,&)(n,x)(l,a)(o,x)(&,&) (l,a)(n,x)(l,x) in the primitive sequence of the word unhealthy. In a word image, it is common for two or more adjacent characters to be connected with each other, possibly because of low scanning resolution or poor printing quality. This would result in the omission of the feature & in the corresponding feature string, compared to the primitive string generated from standard PSTs. Moreover, the noise effect would undoubtedly produce the substitution or insertion of features in the primitive string of a word image. Such deletions, insertions, and substitutions are very similar to the evolutionary mutations of DNA sequences in molecular biology [35]. Lopresti and Zhou applied the inexact string matching strategy to information retrieval [36] and duplicate document detection [37] to deal with the imprecise text data in OCR results. Drawing inspiration from the alignment problem of two DNA sequences and the progress achieved by Lopresti and Zhou, we apply the technique of inexact string matching to evaluate the similarity between two primitive strings, one from a template image and the other from a word image extracted from a document. From the standpoint of string alignment [38], two differing features that mismatch correspond to a substitution; a space contained in the first string corresponds to an insertion of an extra feature into the second string and a space in the second string corresponds to a deletion of an extra feature from the first string. Insertion and deletion are the reverse of each other. A dash ( ) is therefore used to represent a space primitive inserted into the corresponding positions in the strings in the case of deletion. Notice that we use spacing to represent the gap in a word image, whereas we use space to represent the inserted feature in a feature string. The two concepts are completely different. For a primitive string A of length n and a primitive string B of length m, V ði; jþ is defined as the similarity value of the prefixes ½a 1 ;a 2 ;...;a i Š and ½b 1 ;b 2 ;...;b j Š. The similarity of A and B is precisely the value V ðn; mþ. 
The similarity of two strings A and B can be computed by dynamic programming with recurrences [39]. The base conditions are:

\[ V(i,0) = 0 \ \ \forall i, \qquad V(0,j) = 0 \ \ \forall j. \tag{2} \]

The general recurrence relation is, for \(1 \le i \le n\), \(1 \le j \le m\):

\[ V(i,j) = \max \begin{cases} 0 \\ V(i-1,j-1) + \sigma(a_i, b_j) \\ V(i-1,j) + \sigma(a_i, -) \\ V(i,j-1) + \sigma(-, b_j), \end{cases} \tag{3} \]

where \(\sigma(\cdot,\cdot)\) denotes the matching score between two primitives, or between a primitive and the space primitive "-". The zero in the above recurrence implements the operation of restarting the recurrence, which ensures that unmatched prefixes are discarded from the computation. In our experiments, the score of matching any primitive with the space primitive is defined as:

\[ \sigma(a_k, -) = \sigma(-, b_k) = -1 \quad \text{for } a_k \ne -, \ b_k \ne -, \tag{4} \]

while the score of matching any primitive with the spacing primitive & is defined as:

\[ \sigma(a_k, \&) = \sigma(\&, b_k) = -1 \quad \text{for } a_k \ne \&, \ b_k \ne \&, \tag{5} \]

and the matching score between two spacing primitives & is given by:

\[ \sigma(\&, \&) = 2, \tag{6} \]

while the matching score between two primitives \(a_i\) and \(b_j\) is defined as:

\[ \sigma(a_i, b_j) = \sigma\big((\lambda_{a_i}, \omega_{a_i}), (\lambda_{b_j}, \omega_{b_j})\big) = \varepsilon_1(\lambda_{a_i}, \lambda_{b_j}) + \varepsilon_2(\omega_{a_i}, \omega_{b_j}), \tag{7} \]

where \(\lambda\) and \(\omega\) denote the LTA and ADA of a primitive, and \(\varepsilon_1\) is the function specifying the match value between two elements \(x'\) and \(y'\) of LTA. It is defined as:

\[ \varepsilon_1(x', y') = \begin{cases} 1 & \text{if } x' = y' \\ -1 & \text{else.} \end{cases} \tag{8} \]

Similarly, \(\varepsilon_2\) is the function specifying the match value between two elements \(x''\) and \(y''\) of ADA. It is defined as:

\[ \varepsilon_2(x'', y'') = \begin{cases} 1 & \text{if } x'' = y'' \\ -1 & \text{else.} \end{cases} \tag{9} \]

Finally, the maximum score is normalized to the interval [0,1], with 1 corresponding to a perfect match:

\[ S_1 = \max_{i,j} V(i,j) \, / \, V_A(n,n), \tag{10} \]

where \(V_A(n,n)\) is the matching score between the string A and itself. The maximum operation in (10) and the restarting operation in the recurrence ensure the ability of partial word matching. If the normalized maximum score in (10) is greater than a predefined threshold, then we recognize that one word image is matched with the other (or a portion of it).
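To make the recurrences concrete, the short sketch below implements the scoring of (4)-(9), the table of (2)-(3), and the normalized score S_1 of (10) under the primitive representation of Section 2. It is a minimal illustration under our notational assumptions (primitives as (LTA, ADA) pairs, '-' for the space primitive, ('&','&') for the spacing primitive), not the authors' implementation; all names are hypothetical.

```python
# A minimal sketch of the inexact matching above, assuming primitives are
# (LTA, ADA) tuples, '-' stands for the space primitive, and ('&', '&')
# is the spacing primitive.  Scores follow (4)-(9) as given in the text.
SPACE = '-'
SPACING = ('&', '&')

def sigma(a, b):
    """Matching score between two primitives (or a primitive and a space)."""
    if a == SPACE or b == SPACE:
        return -1                               # insertion/deletion, (4)
    if a == SPACING or b == SPACING:
        return 2 if a == b else -1              # (6) and (5)
    lta = 1 if a[0] == b[0] else -1             # epsilon_1 of (8)
    ada = 1 if a[1] == b[1] else -1             # epsilon_2 of (9)
    return lta + ada                            # (7)

def similarity_table(A, B):
    """Fill the (n+1) x (m+1) table of V(i, j) using (2) and (3)."""
    n, m = len(A), len(B)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # base conditions (2)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(0,                                        # restart
                          V[i - 1][j - 1] + sigma(A[i - 1], B[j - 1]),
                          V[i - 1][j] + sigma(A[i - 1], SPACE),
                          V[i][j - 1] + sigma(SPACE, B[j - 1]))
    return V

def partial_match_score(A, B):
    """Normalized score S_1 of (10): best cell over the self-match of A."""
    best = max(max(row) for row in similarity_table(A, B))
    self_score = similarity_table(A, A)[len(A)][len(A)]
    return best / self_score
```

Backtracking from the best cell recovers the matched portion, as Table 2 illustrates for health matched against unhealthy.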

TABLE 2
Score Table for Computing V(i, j)

On the other hand, the similarity of matching two whole word images in their entirety (i.e., no partial matching is allowed) can be represented by:

\[ S_2 = V(n,m) \, / \, \min\big(V_A(n,n), V_B(m,m)\big). \tag{11} \]

The problem can be evaluated systematically using tabular computation. In this bottom-up approach, we first compute V(i, j) for the smallest possible values of i and j, and then compute V(i, j) for increasing values of i and j. The computation is organized in a dynamic programming table of size (n+1) x (m+1), which holds the values of V(i, j) for all choices of i and j (see Table 2). The values in row zero and column zero are filled in directly from the base conditions for V(i, j). After that, the remaining n x m subtable is filled in one row at a time, in the order of increasing i; within each row, the cells are filled in the order of increasing j. In pseudocode, the algorithm is simply a doubly nested loop over i and j that applies recurrence (3) to each cell. The entire dynamic programming table for computing the similarity between a string of length n and a string of length m can thus be obtained in O(nm) time, since only three comparisons and a few arithmetic operations are needed per cell.

Table 2 illustrates the score table computed from the primitive string of the word image unhealthy, extracted from a real document image, and the word health, whose primitive string is generated by aggregating the characters' primitive string tokens according to the character sequence of the word, with the special primitive & inserted between two adjacent PSTs. It can be seen that the maximum score achieved in the table corresponds to the match of the character sequence health in the word unhealthy.

4 APPLICATION A: WORD SEARCHING

Fig. 4. System diagram for searching a user-specified word in a document image.

We now focus on the application of our proposed partial word image matching method to word spotting. Searching for and locating a user-specified keyword in image-format documents has been a topic of interest for many years and has practical value for document information retrieval. For example, by using this technique, the user can locate a specified word in document images without any prior need for the images to be OCR-processed.

4.1 System Overview
The overall system structure is illustrated in Fig. 4. When a document image is presented to the system, it goes through preprocessing, as in many document image processing systems. It is presumed that processes such as skew estimation and correction [40], if applicable, and other image-quality-related processing are performed in the first module of the system. Then, all of the connected components in the image are computed using an eight-connected-component analysis algorithm. The representation of each connected component includes, among other things, the coordinates and dimensions of its bounding box. Word objects are bounded based on a merger operation on the connected components. As a result, the left, top, right, and bottom coordinates of each word bitmap are obtained. Meanwhile, the baseline and x-line locations in each word are also available for subsequent processing. Extracted word bitmaps with baseline and x-line information are the basic units for the downstream process of word matching, and are represented with primitive strings as described in Section 2.
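The overview above leaves the merger operation unspecified. The fragment below shows one plausible way to group connected components from a single text line into word objects by their horizontal gaps; the gap threshold, the box representation, and the function name are assumptions for illustration, not details taken from the paper.

```python
# A hypothetical sketch of merging connected components into word objects.
# Components are (left, top, right, bottom) boxes from one text line;
# components whose horizontal gap is small relative to their height are
# merged into the same word.
def merge_into_words(component_boxes, max_gap=0.4):
    """Return word bounding boxes obtained by merging nearby components."""
    words = []
    for box in sorted(component_boxes):              # sort by left coordinate
        left, top, right, bottom = box
        if words:
            w_left, w_top, w_right, w_bottom = words[-1]
            height = max(bottom - top, w_bottom - w_top, 1)
            if left - w_right <= max_gap * height:   # small gap: same word
                words[-1] = (w_left, min(top, w_top),
                             max(right, w_right), max(bottom, w_bottom))
                continue
        words.append(box)                            # large gap: new word
    return words
```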
When a user keys in a query word, the system generates its corresponding word primitive token (WPT) by aggregating the characters primitive string tokens according to the character sequence of the word, with the special primitive & inserted between two adjacent PSTs. For example, the WPT of the word health is: (l,a)(n,x)(l,x)(&,&)(c,x)(e,x) (o,x)(&,&)(o,x)(e,x)(l,x)(&,&)(l,a)(&,&)(n,x)(l,a)(o,x)(&,&) (l,a)(n,x)(l,x).
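As a concrete illustration of this synthesis step, the sketch below builds a WPT from a small PST dictionary. Only the tokens that appear explicitly in the examples of this paper are included (Table 1 defines the full alphabet); the code is illustrative, not the authors' implementation.

```python
# Per-character primitive string tokens (PSTs) taken from the examples in
# the text; Table 1 of the paper lists the complete alphabet.
PST = {
    'h': [('l', 'a'), ('n', 'x'), ('l', 'x')],
    'e': [('c', 'x'), ('e', 'x'), ('o', 'x')],
    'a': [('o', 'x'), ('e', 'x'), ('l', 'x')],
    'l': [('l', 'a')],
    't': [('n', 'x'), ('l', 'a'), ('o', 'x')],
    'b': [('l', 'a'), ('o', 'x'), ('c', 'x')],
    'p': [('l', 'D'), ('o', 'x'), ('c', 'x')],   # descender ADA, printed (l,d) above
}

def word_primitive_token(word):
    """Concatenate the characters' PSTs, inserting the spacing primitive
    (&, &) between adjacent characters."""
    wpt = []
    for k, ch in enumerate(word):
        if k > 0:
            wpt.append(('&', '&'))
        wpt.extend(PST[ch])
    return wpt

# word_primitive_token('health') reproduces the WPT listed above.
```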

This primitive string is matched with the primitive string of each word object in the document image to measure the similarity between the two. According to the similarity measurement, we can estimate how relevant the word in the document image is to the query word. For a given query word, a larger score always corresponds to a better match. If the normalized maximum score in (10) is greater than the predefined threshold, then we recognize that the word image (or a portion of it) matches the query word. The match may be imprecise in the sense that certain primitives are missing. Some adjacent characters in an image word may touch each other for various reasons. This results in fewer character spacings being detected and, hence, fewer spacing primitives (&,&) in the primitive string. Looking at Table 2, a best matching track can be obtained by backtracking from the maximum score. Our investigation shows that the score along the best matching track decreases for each spacing primitive (&,&) in the query word's primitive string whose corresponding spacing primitive in the real word is missing. To remedy this problem, the normalized maximum score in (10) is modified as:

\[ S_m = \Big( \max_{i,j} V(i,j) + \Delta(N_g) \Big) \, / \, V_A(n,n), \tag{12} \]

where \(N_g\) is the number of such mismatched spacing primitives and \(\Delta(N_g) = 2N_g\) in our experiments.

4.2 Data Preparation
The digital library of our university stores a huge number of old student theses and books, and it has been deemed desirable to make these documents accessible on the Internet. The original plan was to digitize all the documents and then transform the document images into text format using a commercial OCR product. However, the OCR results were not satisfactory, and manually correcting the errors was obviously expensive and impractical for such a huge data set. It was thus decided to retain the documents as PDF files in image format. We selected 3,250 pages from 113 such PDF files of scanned student theses and 328 pages from 15 PDF files of scanned books to generate a document image database called the NUSDL database.

For various reasons, documents available on the Internet are also often in image format. Examples of such documents are the journal or conference papers available as IEEE Xplore online publications and similar papers at other journal online publication systems such as CVGIP: Graphical Models and Image Processing. We obtained 39 such documents with 294 pages, and named the collection the JPaper database.

Fig. 6. Results of searching for the word string.

To search words in such PDF files containing document images, we developed a tool based on our proposed approach, using the Acrobat SDK. When a PDF file is opened in Adobe Acrobat, the plug-in tool is able to detect and locate the user's query words in the document images without any OCR process, like the Find function provided by Adobe Acrobat for word search in text-format documents. Besides PDF document images, binary images in a UW document image database are also used in our experiments.

4.3 Experimental Results

Fig. 5. Results of searching for the word cool.

Fig. 5 shows the results of searching words in a scanned student thesis selected from the NUSDL database. In this example, the words cooler, Cryocooler, cryocooler, microcooler, and cooling in the document image are successfully detected and located when the user inputs the word cool for searching.
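Putting the pieces of Section 4.1 together, a word-spotting pass over one page reduces to scoring every extracted word object against the query's WPT and keeping the hits above a threshold. The sketch below is a hypothetical driver, not the authors' code: `match_score` stands for the normalized score of (10) or its adjusted form (12), and the 0.75 default is illustrative only (the paper tunes this threshold experimentally, see Table 3).

```python
# A hypothetical word-spotting driver: `page_word_objects` pairs each word's
# bounding box with its primitive string, and `match_score` is assumed to
# return the normalized similarity of (10) or its adjusted form (12).
def spot_word(query_wpt, page_word_objects, match_score, threshold=0.75):
    """Return (bounding box, score) for word images matching the query,
    or a portion of it, with a score above `threshold`."""
    hits = [(bbox, match_score(query_wpt, primitives))
            for bbox, primitives in page_word_objects]
    return sorted(((b, s) for b, s in hits if s > threshold),
                  key=lambda hit: hit[1], reverse=True)
```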
Another example is given in Fig. 6. The document image is stored in a PDF file which is accessible at the Web site of IEEE Xplore online publications. The words string, strings, and substring in different fonts are correctly detected and located in the document when string is used as the query word. To evaluate the performance of our system, we select 150 words as queries and search for their corresponding words and variations in the document images. The system achieves a precision ranging from to percent and

a recall 1 ranging from to percent, depending on the threshold, as shown in Table 3.

TABLE 3
Performance versus Threshold

A lower threshold results in a higher recall, but at the expense of lower precision; the reverse is true for a higher threshold. Thus, if the goal is to retrieve as many relevant words as possible, a lower threshold should be set. On the other hand, if low precision cannot be tolerated, the threshold should be raised.

It is common practice to combine recall R and precision P in some way so that system performance can be compared in terms of a single rating. Our experiments adopt one of the commonly used measures, the F_1 rating [4], which gives equal weight to R and P. It is defined as:

\[ F_1 = \frac{2RP}{R + P}. \tag{13} \]

Fig. 7 shows the average precision and recall, as well as the F_1 rating. On average, the best F_1 rating that the system can achieve is , where the precision and recall are and percent, and the threshold is set at . These results are also better than those of our previous system [32], whose precision and recall are and percent, respectively, with the best F_1 measure at .

Fig. 7. Performance versus different thresholds.

1. Precision is defined as the percentage of the number of correctly searched words over the number of all searched words, while recall is the percentage of the number of correctly searched words over the number of words that should be searched.
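For reference, the F_1 rating of (13) is straightforward to compute from a run's recall and precision; the helper below is a hypothetical convenience, not part of the described system.

```python
def f1_rating(recall, precision):
    """F_1 rating of (13): equal-weight combination of recall R and precision P."""
    return 2 * recall * precision / (recall + precision)

# For example, a recall of 0.80 and a precision of 0.70 give an F_1 of about 0.747.
```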

4.4 Comparison with Commercial OCR
The majority of the document images used in our test are packed in PDF files. Adobe Acrobat provides an OCR tool, Page Capture, which converts a page of a document image in a PDF file to text format using document image processing, so it is selected for comparison with our proposed approach. In Page Capture, if a word object in a document image cannot be recognized, its original image is kept and inserted at the corresponding position in the text for display purposes. Fig. 8a shows the OCR result on part of a document image page selected from a student thesis, in which the words TMS320C30, passband, and stopband have not been successfully converted into text format. These words therefore still cannot be searched even though the OCR tool has been applied to the page. In comparison, Fig. 8b shows the search result with the query word band using the word searching tool based on our proposed word matching approach. All of the words containing the string band are successfully detected.

Fig. 8. Comparison of Adobe Acrobat OCR and the proposed approach. (a) OCR results by Adobe Acrobat. (b) Results of searching for the word band by our approach.

To compare word searching performance between Adobe Acrobat and our proposed approach, the NUSDL database and JPaper database are used. The UW database is not included in the comparison because Adobe Acrobat can work on document images packed in PDF files only. The precision and recall achieved by Adobe Acrobat are and percent, respectively. Our approach achieves a precision of percent and a recall of percent when the threshold is set at . It can be seen that the precision of Adobe Acrobat is very high, while its recall is a little lower compared with our approach. The reason is that the OCR tool in Adobe Acrobat is lexicon dependent. A lexicon may be built into its recognition engine, so it can achieve very high precision; but it does not handle well most words that are not included in the lexicon, especially certain technical terms in student theses and papers. On the other hand, our proposed approach does not use any language or lexical information at all. In addition, our experiments also show that our proposed approach is approximately 2.6 times faster than the OCR tool.

5 APPLICATION B: DOCUMENT SIMILARITY MEASUREMENT
Measuring the similarity between documents has practical applications in document image retrieval. For instance, a user may use an entire document, rather than a keyword, to retrieve documents whose content is similar to the query document. As we have mentioned before, the word, rather than the character, is the basic unit of meaning in document image retrieval; therefore, we use word objects in documents as our basic unit of information. In this section, we present another application of our proposed partial word matching to document image retrieval, i.e., similarity measurement between two document images. The system diagram is shown in Fig. 9.

Fig. 9. System diagram for measuring the similarity between two document images.

5.1 Document Vector and Similarity
The word objects in the images are first bounded based on the analysis of the connected components. Some common words (such as a, the, by, etc.), called stop words in linguistics, are eliminated according to their shapes, as they do not contribute much to the overall weight of the document. Although the width estimates of single words are unreliable for stop word identification, stop words tend to be short and composed of only a few characters. Suppose that w and h represent the width and height of a word object, respectively. If the aspect ratio w/h of a word object is less than a certain threshold, it is excluded from the calculation of document vectors. In our experiments, this threshold is set at 2.0. This process can also prevent a short word from being matched with a long word. For example, it can prevent the word the and the word weather from being assigned to the same class, because the word the has already been eliminated. The remaining word objects in each document are represented by means of their corresponding primitive feature strings. An unsupervised classifier is employed to assign all of the word objects (except the stop words) coming from both document images to K object classes. The number of classes is initialized to zero, and the first incoming word generates a new class automatically. Each subsequent incoming word is compared with the words stored in the existing classes, using the word matching method described above to measure the similarity between two word objects. If the similarity score of the two word objects is greater than a predefined threshold, we assign them to the same class; otherwise, the incoming word is assigned to a new class. The respective occurrence frequencies of all classes in a document are used to build the document's vector. Each occurrence frequency is normalized by dividing it by the total number of word objects in the document.
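The paragraph above fully specifies the clustering and vector-building procedure, and equation (14) below gives the cosine similarity between the resulting vectors; the sketch that follows strings the two together. It is an illustration under stated assumptions, not the authors' implementation: word objects are assumed to be (width, height, primitive string) triples, `word_similarity` is assumed to return the normalized matching score of Section 3, and the threshold values and names are placeholders.

```python
import math

def build_document_vector(word_objects, word_similarity, classes,
                          aspect_threshold=2.0, match_threshold=0.75):
    """Assign word objects to classes and return {class index: frequency}.

    `classes` is a shared, growing list of class-representative primitive
    strings, so two documents classified with the same list share indices.
    Frequencies are normalized by the number of word objects kept after
    stop-word removal.
    """
    counts, kept = {}, 0
    for width, height, primitives in word_objects:
        if width / height < aspect_threshold:
            continue                      # shape-based stop-word elimination
        kept += 1
        for k, representative in enumerate(classes):
            if word_similarity(primitives, representative) > match_threshold:
                counts[k] = counts.get(k, 0) + 1
                break
        else:                             # no existing class matched
            classes.append(primitives)
            counts[len(classes) - 1] = 1
    return {k: c / kept for k, c in counts.items()} if kept else {}

def document_similarity(q_vec, y_vec):
    """Cosine similarity between two document vectors, as in (14) below."""
    dot = sum(v * y_vec.get(k, 0.0) for k, v in q_vec.items())
    norm_q = math.sqrt(sum(v * v for v in q_vec.values()))
    norm_y = math.sqrt(sum(v * v for v in y_vec.values()))
    return dot / (norm_q * norm_y) if norm_q and norm_y else 0.0
```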

The similarity score between two document vectors is defined as their scalar product divided by their lengths, which is the same as in our previous work [34]. The scalar product is calculated by summing up the products of the corresponding elements, which is equivalent to the cosine of the angle between the two document vectors seen from the origin. So, the similarity between the query document image Q and the archived document image Y is:

\[ S(Q, Y) = \frac{\sum_{k=1}^{K} q_k y_k}{\sqrt{\sum_{k=1}^{K} q_k^2 \; \sum_{k=1}^{K} y_k^2}}, \tag{14} \]

where \(Q = (q_1, q_2, \ldots, q_K)^T\) is the document vector of the image Q, \(Y = (y_1, y_2, \ldots, y_K)^T\) is the document vector of the image Y, and K is the dimension of the document vectors, which equals the number of classes generated by the unsupervised classifier.

5.2 Experimental Results
Our experiments compute the similarity measurement between document images selected from the NUSDL database. These articles address nine different kinds of topics, and they are student theses pertaining to electrical engineering, chemical engineering, mechanical engineering, medical engineering, etc. The topics and the number of documents in each group used in the experiments are listed in Table 4.

TABLE 4
Corpus Used in the Experiments

The similarity between any two document images in the corpus is calculated, with the proposed word image matching approach using (10) applied.

Fig. 10. The similarity with document C5.

Fig. 10 illustrates the similarity measurement of the document C5 in group C with the first three documents taken from each group. If the document C5 is chosen as a query document, the documents in group C (i.e., C1, C2, and C3) can be successfully retrieved from the corpus, subject to a suitable threshold. The retrieval performance of each group is tabulated in Table 5, according to different similarity thresholds. The system achieves an average precision ranging from to percent, and an average recall 2 ranging from to percent, depending on the similarity threshold. A lower threshold allows more relevant documents to be retrieved (hence, the higher recall) but at the expense of false positives (hence, the lower precision); the reverse is true for a higher threshold. Thus, if the goal is to retrieve as many relevant documents as possible, allowing irrelevant ones to be rejected by manual reading, then a lower threshold should be set. On the other hand, for low tolerance to false positives, the threshold should be raised.

2. Precision is defined as the percentage of the number of correctly retrieved articles over the number of all retrieved articles, while recall is the percentage of the number of correctly retrieved articles over the number of articles in the category.

We compare system performance based on partial word image matching (the PWM method, using (10)) with that based on entire word image matching (the EWM method, using (11)). The maximum F_1 ratings with respect to each group are illustrated in Fig. 11. From the results, we can see that the PWM method outperforms the EWM method in general. The overall F_1 rating achieved by the PWM method improves to , compared to the EWM method's .

6 CONCLUSION AND FUTURE WORK
Document images have become a popular information source in our modern society, and information retrieval in document image databases is an important topic in knowledge and data engineering research. Document image

12 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1409 TABLE 5 Retrieval Performance for Different Thresholds retrieval without OCR has its practical value, but it is also a challenging problem. In this paper, we have proposed a word image matching approach with the ability of matching partial words. To measure the similarity of two word images, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from the two word images. Based on the matching score, we can estimate how one word image is relevant to the other and decide whether one is a portion of the other. The proposed word image approach has been applied to two document image retrieval problems, i.e., word spotting and document similarity measurement. Experiments have been carried out on various document images such as scanned books, students theses, and journal/conference papers downloaded from the Internet, as well as UW document images. The test results confirm the validity and feasibility of the applications of the proposed word matching approach to document image retrieval. The setting of matching scores between different primitive codes is beyond the scope of this paper and is left to future research. How to choose a suitable threshold for achieving maximum F 1 rating also requires further investigation in the future. ACKNOWLEDGMENTS This research is jointly supported by the Agency for Science, Technology and Research, and the Ministry of Education of Singapore under research grant R /303. The authors would also like to thank Mr. Ng Kok Koon and Mrs. Khoo Yee Hoon of the digital library of the National University of Singapore for providing us with the document image data of student theses. Fig. 11. F 1 rating comparison between PWM and EWM. REFERENCES [1] D. Doermann, The Indexing and Retrieval of Document Images: A Survey, Computer Vision and Image Understanding, vol. 70, no. 3, pp , [2] M. Mitra and B.B. Chaudhuri, Information Retrieval from Documents: A Survey, Information Retrieval, vol. 2, nos. 2/3, pp , [3] G. Salton, J. Allan, C. Buckley, and A. Singhal, Automatic Analysis, Theme Generation, and Summarization of Machine- Readable Text, Science, vol. 264, pp , [4] Y. Yang and X. Liu, A Re-Examination of Text Categorization Methods, Proc. 22th Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , [5] K. Tagvam, J. Borsack, A. Condir, and S. Erva, The Effects of Noisy Data on Text Retrieval, J. Am. Soc. for Information Science, vol. 45, no. 1, pp , 1994.

13 1410 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 [6] Y. Ishitani, Model-Based Information Extraction Method Tolerant of OCR Errors for Document Images, Proc. Sixth Int l Conf. Document Analysis and Recognition, pp , [7] M. Ohtam, A. Takasu, and J. Adachi, Retrieval Methods for English Text with Misrecognized OCR Characters, Proc. Fourth Int l Conf. Document Analysis and Recognition, pp , [8] S.M. Harding, W.B. Croft, and C. Weir, Probabilistic Retrieval of OCR Degraded Text Using N-Grams, Proc. European Conf. Research and Advanced Technology for Digital Libraries (ECDL 97), pp , [9] P. Kantor and E. Voorhes, Report on the TREC-5 Confusion Track, Online Proc. TREC-5, NIST special publication , pp , [10] T. Kameshiro, T. Hirano, Y. Okada, and F. Yoda, A Document Image Retrieval Method Tolerating Recognition and Segmentation Errors of OCR Using Shape-feature and Multiple Candidates, Proc. Fifth Int l Conf. Document Analysis and Recognition, pp , [11] K. Katsuyama et al., Highly Accurate Retrieval of Japanese Document Images through a Combination of Morphological Analysis and OCR, Proc. SPIE, Document Recognition and Retrieval, vol. 4670, pp , [12] F.R. Chen and D.S. Bloomberg, Summarization of Imaged Documents without OCR, Computer Vision and Image Understanding, vol. 70, no. 3, pp , [13] D.S. Bloomberg and F.R. Chen, Document Image Summarization without OCR, Proc. Int l Conf. Image Processing, vol. 2, pp , [14] J. Liu and A.K. Jain, Image-Based Form Document Retrieval, Pattern Recognition, vol. 33, no. 3, pp , [15] D. Niyogi and S. Srihari, The Use of Document Structure Analysis to Retrieve Information from Documents in Digital Libraries, Proc. SPIE, Document Recognition IV, vol. 3027, pp , [16] Y.Y. Tang, C.D. Yan, and C.Y. Suen, Document Processing for Automatic Knowledge Acquisition, IEEE Trans. Knowledge and Data Eng., vol. 6, no. 1, pp. 3-21, Feb [17] Y. He, Z. Jiang, B. Liu, and H. Zhao, Content-Based Indexing and Retrieval Method of Chinese Document Images, Prof. Fifth Int l Conf. Document Analysis and Recognition (ICDAR 99), pp , [18] A.L. Spitz, Duplicate Document Detection, Proc. SPIE, Document Recognition IV, vol. 3027, pp , [19] A.F. Smeaton and A.L. Spitz, Using Character Shape Coding for Information Retrieval, Proc. Fourth Int l Conf. Document Analysis and Recognition, pp , [20] A.L. Spitz, Shape-Based Word Recognition, Int l J. Document Analysis and Recognition, vol. 1, no. 4, pp , [21] A.L. Spitz, Progress in Document Reconstruction, Proc. 16th Int l Conf. Pattern Recognition, vol. 1, pp , [22] Z. Yu and C.L. Tan, Image-Based Document Vectors for Text Retrieval, Proc. 15th Int l Conf. Pattern Recognition, vol. 4, pp , [23] C.L. Tan, W. Huang, Z. Yu, and Y. Xu, Imaged Document Text Retrieval without OCR, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp , June [24] T.K. Ho, J.J. Hull, and S.N. Srihari, A Word Shape Analysis Approach to Lexicon Based Word Recognition, Pattern Recognition Letters, vol. 13, pp , [25] T. Syeda-Mahmood, Indexing of Handwritten Document Images, Proc. Workshop Document Image Analysis, pp , [26] A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, and G.V. Popescu, A Line-Oriented Approach to Word Spotting in Handwritten Documents, Pattern Analysis and Applications, vol. 3, no. 2, pp , [27] R. Manmatha, C. Han, and E.M. Riseman, Word Spotting: A New Approach to Indexing Handwriting, Proc. Int l Conf. Computer Vision and Pattern Recognition, pp , [28] J. DeCurtins and E. 
Yue Lu received the BE and ME degrees in telecommunications and electronic engineering from Zhejiang University, Zhejiang, China, in 1990 and 1993, respectively, and the PhD degree in pattern recognition and intelligent systems from Shanghai Jiao Tong University, Shanghai, China. He is a professor in the Department of Computer Science and Technology, East China Normal University. From 2000 to 2004, he was a research fellow with the Department of Computer Science, National University of Singapore. From 1993 to 2000, he was an engineer at the Shanghai Research Institute of China State Post Bureau. For his outstanding contributions to the research and development of the automatic mail sorting system OVCS, he won Science and Technology Progress Awards issued by the Ministry of Posts and Telecommunications in 1995, 1997, and 1998. His research interests include document image processing, document retrieval, character recognition, and intelligent systems.

Chew Lim Tan received the BSc (Hons) degree in physics in 1971 from the University of Singapore, the MSc degree in radiation studies in 1973 from the University of Surrey, UK, and the PhD degree in computer science in 1986 from the University of Virginia. He is an associate professor in the Department of Computer Science, School of Computing, National University of Singapore. His research interests include document image and text processing, neural networks, and genetic programming. He has published more than 170 research publications in these areas.
He is an associate editor of Pattern Recognition and has served on the program committees of the International Conference on Pattern Recognition (ICPR) 2002, the Graphics Recognition Workshop (GREC) 2001 and 2003, the Web Document Analysis Workshop (WDA) 2001 and 2003, the Document Image Analysis and Retrieval Workshop (DIAR) 2003, the Document Image Analysis for Libraries Workshop (DIAL) 2004, the International Conference on Image Processing (ICIP) 2004, and the International Conference on Document Analysis and Recognition (ICDAR). He is a senior member of the IEEE and the IEEE Computer Society.
