1 1398 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 Information Retrieval in Document Image Databases Yue Lu and Chew Lim Tan, Senior Member, IEEE Abstract With the rising popularity and importance of document images as an information source, information retrieval in document image databases has become a growing and challenging problem. In this paper, we propose an approach with the capability of matching partial word images to address two issues in document image retrieval: word spotting and similarity measurement between documents. First, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from two word images. Based on the similarity, we can estimate how a word image is relevant to the other and, thereby, decide whether one is a portion of the other. To deal with various character fonts, we use a primitive string which is tolerant to serif and font differences to represent a word image. Using this technique of inexact string matching, our method is able to successfully handle the problem of heavily touching characters. Experimental results on a variety of document image databases confirm the feasibility, validity, and efficiency of our proposed approach in document image retrieval. Index Terms Document image retrieval, partial word image matching, primitive string, word searching, document similarity measurement. æ 1 INTRODUCTION MODERN technology has made it possible to produce, process, store, and transmit document images efficiently. In an attempt to move toward the paperless office, large quantities of printed documents are digitized and stored as images in databases. The popularity and importance of document images as an information source are evident. As a matter of fact, many organizations currently use and are dependent on document image databases. However, such databases are often not equipped with adequate index information. This makes it much harder to retrieve user-relevant information from image data than from text data. Thus, the study of information retrieval in document image databases is an important subject in knowledge and data engineering. 1.1 Research Background Millions of digital documents are constantly transmitted from one point to another over the Internet. The most common format of these digital documents is text, in which characters are represented by machine-readable codes. The majority of computer generated documents are in this format. On the other hand, to make billions of volumes of traditional and legacy documents available and accessible on the Internet, they are scanned and converted to digital images using digitization equipment. Although the technology of Document Image Processing (DIP) may be utilized to automatically convert the digital images of these documents to the machine-readable. Y. Lu is with the Department of Computer Science and Technology, East China Normal University, Shanghai , China. ylu@cs.ecnu.edu.cn.. C.L. Tan is with the Department of Computer Science, School of Computing, National University of Singapore, Kent Ridge, Singapore tancl@comp.nus.edu.sg. Manuscript received 8 Sept. 2003; revised 1 Feb. 
2004; accepted 25 Feb For information on obtaining reprints of this article, please send to: tkde@computer.org, and reference IEEECS Log Number TKDE text format using Optical Character Recognition (OCR) technology, it is typically not a cost effective and practical way to process a huge number of paper documents. One reason is that the technique of layout analysis is still immature in handling documents with complicated layouts. Another reason is that OCR technology still suffers inherent weaknesses in its recognition ability, especially with document images of poor quality. Interactive manual correction/proofreading of OCR results is usually unavoidable in most DIP systems. As a result, storing traditional and legacy documents in image format has become the alternative in many cases. Nowadays, we can find on the Internet many digital documents in image format, including scanned journal/conference papers, student theses, handbooks, out-of-print books, etc. Moreover, many digital libraries and Web portals such as MEDLINE, ACM, IEEE, etc., keep scanned document images without their text equivalents. Many tools have been developed for retrieving information from text format documents, but they are not applicable to image format documents. There is therefore a need to study the strategies of retrieving information from document images. For example, a user facing a large number of imaged documents on the Internet has to download each one to see its contents before knowing whether the document is relevant to his/her interest, e.g., by looking for keywords. Obviously, it is of practical value to study a method that is capable of notifying the user, prior to downloading, whether a document image contains information of interest to him/her. In recent years, there has been much interest in the research area of Document Image Retrieval (DIR) [1], [2]. DIR is relevant to document image processing (DIP), but there are some essential differences between them. A DIP system needs to analyze different text areas in a page document, understand the relationship among these text areas, and then convert them to a machine-readable version using OCR, in which each character object is assigned to a certain class. Many commercial OCR systems, which claim /04/$20.00 ß 2004 IEEE Published by the IEEE Computer Society

2 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1399 to achieve a recognition rate of above 99 percent for single characters, have been introduced in the market over the past few decades. It is obvious that a document image processing system can be applied for the purpose of document image retrieval. For example, a DIP system is used to automatically convert a document image to machine-readable text codes first, and traditional information retrieval strategies [3], [4] are then applied. However, the performance of a DIP system relies heavily on the quality of the scanned images. It deteriorates severely if the images are of poor quality or complicated layout. It is generally acknowledged that the recognition accuracy requirements of document image retrieval methods are considerably lower than those for many document processing applications [5]. Motivated by this observation, some retrieval methods with the ability to tolerate the recognition errors of OCR have been recently researched [6], [7], [8], [9]. Additionally, some methods are reported to improve retrieval performance by using OCR candidates [10], [11]. It is a fact that such information retrieval methods based on the technology of document image processing are still the best so far because a great deal of research has been devoted to the area, and numerous algorithms have been proposed in the past decades, especially in OCR research. However, we argue that DIR, which bypasses OCR, still has its practical value today. The main question that a DIR system seeks to answer is whether a document image contains words which are of interest to the user, while paying no attention to unrelated words. In other words, a DIR system provides an answer of yes or no with respect to the user s query, rather than the exact recognition of a character/word as in DIP. Moreover, words, rather than characters, are the basic units of meaning in information retrieval. Therefore, directly matching word images in a document image is an alternative way to retrieve information from the document. In short, DIR and DIP address different needs and have different merits of their own. The former, which is tailored for directly retrieving information from document images, could achieve a relatively higher performance in terms of recall, precision, and processing speed. 1.2 Previous Related Work In recent years, a number of attempts have been made by researchers to avoid the use of character recognition for various document image retrieval applications. For example, Chen and Bloomberg [12], [13] described a method for automatically selecting sentences and key phrases to create a summary from an imaged document without any need for recognition of the characters in each word. They built word equivalence classes by using a rank blur hit-miss transform to compare word images and using a statistical classifier to determine the likelihood of each sentence being a summary sentence. Liu and Jain [14] presented an approach to image-based form document retrieval. They proposed a similarity measure for forms that is insensitive to translation, scaling, moderate skew, and image quality fluctuations, and developed a prototype form retrieval system based on their proposed similarity measure. 
Niyogi and Srihari [15] described an approach to retrieve information from document images stored in a digital library by means of knowledge-based layout analysis and logical structure derivation techniques, in which significant sections of documents, such as the title, are utilized. Tang et al. [16] proposed methods for automatic knowledge acquisition in document images by analyzing the geometric structure and logical structure of the images. In the domain of Chinese document image retrieval, He et al. [17] proposed an index and retrieval method based on the stroke density of Chinese characters. Sptiz described character shape codes for duplicate document detection [18], information retrieval [19], word recognition [20], and document reconstruction [21], without resorting to character recognition. Character shape codes encode whether or not the character in question fits between the baseline and the x-line or, if not, whether it has an ascender or descender, and the number and spatial distribution of the connected components. To get character shape codes, character cells must first be segmented. This method is therefore unsuitable for dealing with words with connected characters. Additionally, it is lexicon-dependent, and its performance is somewhat affected by the appropriateness of the lexicon to the document being processed. Yu and Tan [22], [23] proposed a method for text retrieval from document images without the use of OCR. In their method, documents are segmented into character objects, whose image features are utilized to generate document vectors. Then, the text similarity between documents is measured by calculating the dot product of the document vectors. In the work of both Spitz and Tan et al., character segmentation is a necessary step before downstream processes can be performed. In many cases, especially with document images of poor quality, it is not easy to separate connected characters. Moreover, the word, rather than the character, is normally a basic unit of useful meaning in document image retrieval. To avoid the difficulties of separating touching adjacent characters in a word image, segmentation-free methods have been researched in some document image retrieval systems, in which image matching at the word level is commonly utilized. A segmentation-free word image matching approach treats each word object as a single, indivisible entity, and attempts to recognize it using features of the word as a whole. For example, Ho et al. [24] proposed a word recognition method based on word shape analysis without character segmentation and recognition. Every word is first partitioned into a fixed 4 10 grid. Features are extracted and their relative locations in the grid are recorded in a feature vector. The city-block distance is used to compare the feature vector of an input word and that of a word in a given lexicon. This method can only be applied to a fixed vocabulary word recognition system and has no ability of partial word matching. Some approaches have been reported in the past years for searching keywords in handwritten [25], [26], [27] and printed [28], [29], [30], [31], [32] documents. In most of these approaches, the Hidden Markov Model and Viterbi dynamic programming algorithm are widely used to recognize keywords. However, these approaches have their inherent disadvantages. Although DeCurtins and Chen s approach [28] and Chen et al. 
s approach [30], [31] are capable of searching a word portion, they can only process predefined keywords and a pretraining procedure is necessary. In our previous word searching system [32], a weighted Hausdorff distance is used to measure the dissimilarity between word images. Although the system

3 1400 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 can cope with any word the user specifies, it cannot deal with the problem of partial matching. For instance, the approach is unable to search the words undeveloped and development when the user keys in develop as the query word because the three words are treated as different from the standpoint of word image classification. Obviously, such a classification is not practical from the viewpoint of document understanding by human beings. In the compressed document images domain, Hull [33] proposed a method to detect equivalent document images by matching the pass mode codes extracted from CCITT compressed document images. He created a feature vector that counts the number of pass mode codes in each cell of a fixed grid in the image, and equivalent images are located by applying the Hausdorff distance to feature vectors. In our previous paper [34], we also presented a method for measuring the similarity between two CCITT group 4 compressed document images. 1.3 Current Work In this paper, we propose an improved document image retrieval approach with the ability of matching partial word images. First, word images are represented by feature strings. A feature string matching method based on dynamic programming is then utilized to evaluate the similarity between two feature strings or a part of them. By using the technique of inexact feature string matching, the approach can successfully handle not only kerning, but also words with heavily touching characters. The underlying idea of the proposed partial word image matching technique is somewhat similar to that of word recognition based on Hidden Markov Models. However, inexact string matching has the advantage of not requiring any training prior to use. We have tested our technique in two applications of document image retrieval. One application is to search for user-specified words and their variations in document images (word spotting). The user s query word and a word image object extracted from documents are first represented by two respective feature strings. Then, the similarity between the two feature strings is evaluated. Based on the evaluation, we can estimate how the document word image is relevant to the user-specified word. For example, when the user keys in the query word string. our method detects words such as strings and substring. Our experiments on real document images, including scanned books, student theses, journal/conference papers, etc., show promising performance by our proposed approach to word spotting in document images. The other application is content-based similarity measurement between two document images. The technique of partial word image matching is applied to assign all word objects from two document images to corresponding classes, except potential stop words, which are excluded. Document vectors are built using the occurrence frequencies of the word object classes, and the pair-wise similarity of two document images is calculated by the scalar product of the document vectors. We have used nine group articles pertaining to different domains to test the validity of our approach. The results show that our proposed method based on partial word image matching outperforms the method based on entire word image matching, with the F 1 rating improved. Fig. 1. Different spacing: (a) separated adjacent characters, (b) overlapped adjacent characters, and (c) touching adjacent characters. 
The remainder of this paper is organized as follows: Section 2 presents the feature string description for word images. Section 3 describes inexact matching between two feature strings. Section 4 gives the experimental results that demonstrate the effectiveness of the proposed partial word image matching approach in word spotting. Section 5 presents the results demonstrating the effectiveness of the proposed partial word image matching in document similarity measurement. Section 6 concludes the paper with a summary of the contributions of our proposed technique and an indication of the directions for future work. 2 FEATURE STRING DESCRIPTION FOR WORD IMAGE It is presumed that word objects are extracted from document images via some image processing such as skew estimation and correction, connected component labeling, word bounding, etc. The feature employed in our approach to represent word bitmap images is the Left-to-Right Primitive String (LRPS), which is a code string sequenced from the leftmost of a word to its rightmost. Line and traversal features are used to extract the primitives of a word image. A word printed in documents can be in various sizes, fonts, and spacings. When we extract features from word bitmaps, we have to take this fact into consideration. Generally speaking, it is easy to find a way to cope with different sizes. But, dealing with multiple fonts and the touching characters caused by condensed spacing is still a challenging issue. For a word image in a printed text, two characters could be spaced apart by a few white columns caused by intercharacter spacing, as shown in Fig. 1a. But, it is also common for one character to overlap another by a few columns due to kerning, as shown in Fig. 1b. Worse still, as shown in Fig. 1c, two or more adjacent characters may touch each other due to condensed spacing. We are thus faced with the challenge of separating such touching characters. We shall utilize inexact feature string matching to handle the problem. 2.1 LRPS Feature Representation A word is explicitly segmented, from the leftmost to the rightmost, into discrete entities. Each entity, called a primitive here, is represented using definite attributes. A primitive p is described using a two-tuple (,!), where is the Line-or-Traversal Attribute (LTA) of the primitive, and! is the Ascender-and-Descender Attribute (ADA). As a result, the word image is expressed as a sequence P of p i s P ¼< p 1 p 2...p n >¼< ð 1 ;! 1 Þð 2 ;! 2 Þ...ð n ;! n Þ >; ð1þ

4 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1401 Fig. 2. Primitive string extraction (a) straight stroke line features, (b) remainder part of (a), (c) traversal T N ¼ 2, (d) traversal T N ¼ 4, and (e) traversal T N ¼ 6. where the ADA of a primitive! 2 = { x, a, A, D, Q }, which are defined as:. x : The primitive is between the x-line (an imaginary line at the x-height running parallel with the baseline, as shown in Fig. 2) and the baseline.. a : The primitive is between the top boundary and the x-line.. A : The primitive is between the top boundary and the baseline.. D : The primitive is between the x-line and the bottom boundary.. Q : The primitive is between the top-boundary and the bottom boundary. The definition of x-line, baseline, top boundary, and bottom boundary may be found in Fig. 2. A word bitmap extracted from a document image already contains the information of the baseline and x-line, which is a byproduct of the text line extraction from the previous stage. 2.2 Generating Line-or-Traversal Attribute LTA generation consists of two steps. We extract the straight stroke line feature from the word bitmap first, as shown in Fig. 2a. Note that only the vertical stroke lines and diagonal stroke lines are extracted at this stage. Then, the traversal features of the remainder part are computed. Finally, the features from the above two steps are aggregated to generate the LTAs of the corresponding primitives. In other words, the LTA of a primitive is represented by either a straight-line feature (if applicable) or a traversal feature Straight Stroke Line Feature A run-length-based method is utilized to extract straight stroke lines from word images. We use Rða; Þ to represent a directional run, which is defined as a set of black concatenating pixels that contains pixel a, along the specified direction. jrða; Þj is the run length of Rða; Þ, which is the number of black points in the run. The straight stroke line detection algorithm is summarized as follows: 1. Along the middle line (between the x-line and baseline), detect the boundary pair ½A l ;A r Š of each stroke, where A l and A r are the left and right boundary points, respectively. 2. Detect the midpoint A m of a line segment A l A r. 3. Calculate RðA m ;Þ for different s, from which we select max as the A s s run direction. 4. If jrða m ; max Þj is near or larger than the x-height, the pixels containing A m, between the boundary points A l and A r along the direction max, are extracted as a stroke line. In the example of Fig. 2, the stroke lines are extracted as in Fig. 2a, while the remainder is as in Fig. 2b. According to its direction, a line is categorized as one of three basic stroke lines: vertical stroke line, left-down diagonal stroke line, and right-down diagonal stroke line. According to the type of stroke lines, three basic primitives are generated from extracted stroke lines. Meanwhile, their ADAs are assigned categories based on their top-end and bottom-end positions. Their LTAs are respectively expressed as:. l : Vertical straight stroke line, such as in the characters l, d, p, q, D, P, etc. For the primitive whose ADA is x or D, we further check whether there is a dot over the vertical stroke line. If the answer is yes, the LTA of the primitive is reassigned as i.. v : Right-down diagonal straight stroke line, such as in the characters v, w, V, W, etc.. w : Left-down diagonal straight stroke line, such as in the characters v, w, z, etc. 
For the primitive whose ADA is x or A, we further check whether there are two horizontal stroke lines connected with it at the top and bottom. If so, the LTA of the primitive is reassigned as z. Additionally, it is easy to detect primitives containing two or more straight stroke lines. They are:. x : One left-down diagonal straight stroke line crosses one right-down diagonal straight stroke line.. y : One left-down diagonal straight stroke line meets one right-down diagonal straight stroke line at its middle point.. Y : One left-down diagonal stroke line, one rightdown diagonal stroke line and one vertical stroke line cross at one point, like the character Y.. k : One left-down diagonal stroke line, one rightdown diagonal stroke line, and one vertical stroke line meet at one point, like the characters k and K Traversal Feature After the primitives based on the stroke line features are extracted as described above, the primitives of the remainder part in the word image are computed based on traversal features. To extract traversal features, we scan the word image column by column, and the traversal number T N is recorded by counting the number of transitions from black pixel to white pixel, or vice versa, along each column. This process is not carried out on the part represented by the stroke line features described above. According to the value of T N, different feature codes are assigned as follows:

5 1402 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 TABLE 1 Primitive String Tokens of Characters Fig. 3. Different fonts.. & : There is no image pixel in the column (T N ¼ 0). It corresponds to the blank intercharacter space. We treat intercharacter space as a special primitive. In addition, the overlap of adjacent characters caused by kerning is easily detected by analyzing the relative positions of the adjacent connected components. We can insert a space primitive between them in this case. If T N ¼ 2, two parameters are utilized to assign it a feature code. One is the ratio of its black pixel number to x- height,. The other is its relative position with respect to the x-line and the baseline, ¼ D m =D b, where D m is the distance from the topmost stroke pixel in the column to the x-line and D b is the distance from the bottommost stroke pixel to the baseline.. n : <0:2 and <0:3,. u : <0:2 and >3, and. c : >0:5 and 0:5 <<1:5. If T N 4, the feature code is assigned as:. o : T N ¼ 4,. e : T N ¼ 6, and. g : T N ¼ 8. Then, the same feature codes in consecutive columns are merged and represented by one primitive. Note that a few columns may possibly have no resultant feature codes because they do not meet the requirements of the various eligible feature codes described above, which is usually the result of noise. Such columns are eliminated straightaway. As illustrated in Fig. 2, the word image is decomposed into one part with stroke lines (as in Fig. 2a) and other parts with different traversal numbers (Fig. 2c for T N ¼ 2, Fig. 2d for T N ¼ 4, and Fig. 2e for T N ¼ 6). The primitive string is composed by concatenating the above generated primitives from the leftmost to the rightmost. The resultant primitive string of the word image in Fig. 2 is: (n,x)(l,x)(u,x)(o,x) (l,x)(u,x)(o,x)(l,x)(o,x)(n,x)(o,x)(l,x)(u,x)(&,&)(o,a) (l,a)(o,x)(l,x)(n,x)(&,&)(c,x)(e,x)(o,x)(&,&)(o,x)(e,x)(l,x)(u,x) (o,a)(l,a)(&,&)(n,x)(l,a)(o,x)(o,a)(l,a)(o,x)(n,x)(o,x) (l,x)(u,x)(&,&)(y,d). 2.3 Postprocessing To be able to deal with different fonts, primitives should be independent of typefaces. Among various fonts, the significant difference that impacts the extraction of an LRPS is the presence of serif, especially at the parts expressed by traversal features. Fig. 3 gives some examples of where serif is present and others, where it is not. It is a basic necessity to avoid the effect of serif in LRPS representation. Our observation shows that some fonts such as Times Roman contain serifs as opposed to serifless fonts such as Arial. Such serifs do not add on any distinguishing features to the characters. Primitives produced by such serifs can thus be eliminated by analyzing its preceding and succeeding primitives. For instance, a primitive (u,x) in a primitive subsequence (l,x)(u,x)(&,&) is normally generated by a right-side serif of characters such as a, h, m, n, u, etc. Therefore, we can remove the primitive (u,x) from a primitive subsequence (l,x)(u,x)(&,&). Similarly, a primitive (o,x) in a primitive subsequence (n,x)(o,x)(l,x) is normally generated by a serif of characters such as h, m, n, etc. Therefore, we can directly eliminate the primitive (o,x) from a primitive subsequence (n,x)(o,x)(l,x). More postprocessing rules can be used to eliminate the primitives caused by serif. With this postprocessing, the primitive string of the word in Fig. 
2 becomes: (l,x)(u,x) (l,x)(u,x)(o,x)(l,x)(n,x)(l,x)(&,&)(l,a)(o,x)(l,x)(&,&)(c,x)(e,x) (o,x)(&,&)(o,x)(e,x)(l,x)(l,a)(&,&)(n,x)(l,a)(o,x)(l,a)(n,x) (l,x)(&,&)(y,d). Based on the feature extraction described above, we can give each character a standard primitive string token (PST). For example, the primitive string token of the character b is (l,a)(o,x)(c,x), and that of the character p is (l,d)(o,x)(c,x). Table 1 lists the primitive string tokens of all characters in the alphabet. The primitive string of a word can be generated by synthesizing the primitive string token of each character in the word and inserting a special primitive (&,&) among them to indicate a spacing gap. Generally speaking, the resulting primitive string of a real image is not as perfect as that synthesized from the standard PST of corresponding characters; this is due to a

6 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1403 multitude of factors such as the connection between adjacent characters, the noise effect, etc. As shown in Fig. 2, the primitive substring with respect to the character h is changed from (l,a)(n,x)(l,x) to (l,a)(o,x)(l,x) because of the effect of noise. The touching characters al and th have also affected the corresponding primitive substrings. In the next section, we shall consider the technique of inexact matching for robust matching. 3 INEXACT FEATURE MATCHING Based on the processing described above, each word image is described by a primitive string. The word searching problem can then be stated as finding a particular sequence/subsequence in the primitive string of a word. The procedure of matching word images then becomes a measurement of the similarity between the string A ¼ ½a 1 ;a 2 ;...;a n Š representing the features of the query word and the string B ¼½b 1 ;b 2 ;...;b m Š representing the features of a word image extracted from a document. Matching partial words becomes evaluating the similarity between the feature string A with a subsequence of the feature string B. For example, the problem of matching the word health with the word unhealthy in Fig. 2 is to find whether there exists the subsequence A=(l,A)(n,x)(l,x)(&,&)(c,x)(e,x) (o,x)(&,&)(o,x)(e,x)(l,x)(&,&)(l,a)(&,&)(n,x)(l,a)(o,x)(&,&) (l,a)(n,x)(l,x) in the primitive sequence of the word unhealthy. In a word image, it is common for two or more adjacent characters to be connected with each other, possibly because of low scanning resolution or poor printing quality. This would result in the omission of the feature & in the corresponding feature string, compared to the primitive string generated from standard PSTs. Moreover, the noise effect would undoubtedly produce the substitution or insertion of features in the primitive string of a word image. Such deletions, insertions, and substitutions are very similar to the evolutionary mutations of DNA sequences in molecular biology [35]. Lopresti and Zhou applied the inexact string matching strategy to information retrieval [36] and duplicate document detection [37] to deal with the imprecise text data in OCR results. Drawing inspiration from the alignment problem of two DNA sequences and the progress achieved by Lopresti and Zhou, we apply the technique of inexact string matching to evaluate the similarity between two primitive strings, one from a template image and the other from a word image extracted from a document. From the standpoint of string alignment [38], two differing features that mismatch correspond to a substitution; a space contained in the first string corresponds to an insertion of an extra feature into the second string and a space in the second string corresponds to a deletion of an extra feature from the first string. Insertion and deletion are the reverse of each other. A dash ( ) is therefore used to represent a space primitive inserted into the corresponding positions in the strings in the case of deletion. Notice that we use spacing to represent the gap in a word image, whereas we use space to represent the inserted feature in a feature string. The two concepts are completely different. For a primitive string A of length n and a primitive string B of length m, V ði; jþ is defined as the similarity value of the prefixes ½a 1 ;a 2 ;...;a i Š and ½b 1 ;b 2 ;...;b j Š. The similarity of A and B is precisely the value V ðn; mþ. 
The similarity of two strings A and B can be computed by dynamic programming with recurrences [39]. The base conditions are:

\[ V(i,0) = 0 \ \ \forall i, \qquad V(0,j) = 0 \ \ \forall j. \tag{2} \]

The general recurrence relation is, for \(1 \le i \le n\), \(1 \le j \le m\):

\[ V(i,j) = \max \begin{cases} 0 \\ V(i-1,j-1) + \sigma(a_i, b_j) \\ V(i-1,j) + \sigma(a_i, -) \\ V(i,j-1) + \sigma(-, b_j), \end{cases} \tag{3} \]

where \(\sigma(\cdot,\cdot)\) denotes the matching score between two primitives, or between a primitive and the space primitive "-". The zero in the above recurrence implements the operation of restarting the recurrence, which ensures that unmatched prefixes are discarded from the computation. In our experiments, the score of matching any primitive with the space primitive is defined as:

\[ \sigma(a_k, -) = \sigma(-, b_k) = -1 \quad \text{for } a_k \ne -, \ b_k \ne -, \tag{4} \]

while the score of matching any primitive with the spacing primitive & is defined as:

\[ \sigma(a_k, \&) = \sigma(\&, b_k) = -1 \quad \text{for } a_k \ne \&, \ b_k \ne \&, \tag{5} \]

and the matching score between two spacing primitives & is given by:

\[ \sigma(\&, \&) = 2, \tag{6} \]

while the matching score between two primitives \(a_i\) and \(b_j\) is defined as:

\[ \sigma(a_i, b_j) = \sigma\big((\lambda_{a_i}, \omega_{a_i}), (\lambda_{b_j}, \omega_{b_j})\big) = \varepsilon_1(\lambda_{a_i}, \lambda_{b_j}) + \varepsilon_2(\omega_{a_i}, \omega_{b_j}), \tag{7} \]

where \(\lambda\) and \(\omega\) denote the LTA and ADA of a primitive, and \(\varepsilon_1\) is the function specifying the match value between two elements \(x'\) and \(y'\) of LTA. It is defined as:

\[ \varepsilon_1(x', y') = \begin{cases} 1 & \text{if } x' = y' \\ -1 & \text{else.} \end{cases} \tag{8} \]

Similarly, \(\varepsilon_2\) is the function specifying the match value between two elements \(x''\) and \(y''\) of ADA. It is defined as:

\[ \varepsilon_2(x'', y'') = \begin{cases} 1 & \text{if } x'' = y'' \\ -1 & \text{else.} \end{cases} \tag{9} \]

Finally, the maximum score is normalized to the interval [0,1], with 1 corresponding to a perfect match:

\[ S_1 = \max_{i,j} V(i,j) \, / \, V_A(n,n), \tag{10} \]

where \(V_A(n,n)\) is the matching score between the string A and itself. The maximum operation in (10) and the restarting operation in the recurrence ensure the ability of partial word matching. If the normalized maximum score in (10) is greater than a predefined threshold, then we recognize that one word image is matched with the other (or a portion of it).
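To make the recurrences concrete, the short sketch below implements the scoring of (4)-(9), the table of (2)-(3), and the normalized score S_1 of (10) under the primitive representation of Section 2. It is a minimal illustration under our notational assumptions (primitives as (LTA, ADA) pairs, '-' for the space primitive, ('&','&') for the spacing primitive), not the authors' implementation; all names are hypothetical.

```python
# A minimal sketch of the inexact matching above, assuming primitives are
# (LTA, ADA) tuples, '-' stands for the space primitive, and ('&', '&')
# is the spacing primitive.  Scores follow (4)-(9) as given in the text.
SPACE = '-'
SPACING = ('&', '&')

def sigma(a, b):
    """Matching score between two primitives (or a primitive and a space)."""
    if a == SPACE or b == SPACE:
        return -1                               # insertion/deletion, (4)
    if a == SPACING or b == SPACING:
        return 2 if a == b else -1              # (6) and (5)
    lta = 1 if a[0] == b[0] else -1             # epsilon_1 of (8)
    ada = 1 if a[1] == b[1] else -1             # epsilon_2 of (9)
    return lta + ada                            # (7)

def similarity_table(A, B):
    """Fill the (n+1) x (m+1) table of V(i, j) using (2) and (3)."""
    n, m = len(A), len(B)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # base conditions (2)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(0,                                        # restart
                          V[i - 1][j - 1] + sigma(A[i - 1], B[j - 1]),
                          V[i - 1][j] + sigma(A[i - 1], SPACE),
                          V[i][j - 1] + sigma(SPACE, B[j - 1]))
    return V

def partial_match_score(A, B):
    """Normalized score S_1 of (10): best cell over the self-match of A."""
    best = max(max(row) for row in similarity_table(A, B))
    self_score = similarity_table(A, A)[len(A)][len(A)]
    return best / self_score
```

Backtracking from the best cell recovers the matched portion, as Table 2 illustrates for health matched against unhealthy.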

TABLE 2
Score Table for Computing V(i, j)

On the other hand, the similarity of matching two whole word images in their entirety (i.e., no partial matching is allowed) can be represented by:

\[ S_2 = V(n,m) \, / \, \min\big(V_A(n,n), V_B(m,m)\big). \tag{11} \]

The problem can be evaluated systematically using tabular computation. In this bottom-up approach, we first compute V(i, j) for the smallest possible values of i and j, and then compute V(i, j) for increasing values of i and j. The computation is organized in a dynamic programming table of size (n+1) x (m+1), which holds the values of V(i, j) for all choices of i and j (see Table 2). The values in row zero and column zero are filled in directly from the base conditions for V(i, j). After that, the remaining n x m subtable is filled in one row at a time, in the order of increasing i; within each row, the cells are filled in the order of increasing j. In pseudocode, the algorithm is simply a doubly nested loop over i and j that applies recurrence (3) to each cell. The entire dynamic programming table for computing the similarity between a string of length n and a string of length m can thus be obtained in O(nm) time, since only three comparisons and a few arithmetic operations are needed per cell.

Table 2 illustrates the score table computed from the primitive string of the word image unhealthy, extracted from a real document image, and the word health, whose primitive string is generated by aggregating the characters' primitive string tokens according to the character sequence of the word, with the special primitive & inserted between two adjacent PSTs. It can be seen that the maximum score achieved in the table corresponds to the match of the character sequence health in the word unhealthy.

4 APPLICATION A: WORD SEARCHING

Fig. 4. System diagram for searching a user-specified word in a document image.

We now focus on the application of our proposed partial word image matching method to word spotting. Searching for and locating a user-specified keyword in image-format documents has been a topic of interest for many years and has practical value for document information retrieval. For example, by using this technique, the user can locate a specified word in document images without any prior need for the images to be OCR-processed.

4.1 System Overview
The overall system structure is illustrated in Fig. 4. When a document image is presented to the system, it goes through preprocessing, as in many document image processing systems. It is presumed that processes such as skew estimation and correction [40], if applicable, and other image-quality-related processing are performed in the first module of the system. Then, all of the connected components in the image are computed using an eight-connected-component analysis algorithm. The representation of each connected component includes, among other things, the coordinates and dimensions of its bounding box. Word objects are bounded based on a merger operation on the connected components. As a result, the left, top, right, and bottom coordinates of each word bitmap are obtained. Meanwhile, the baseline and x-line locations in each word are also available for subsequent processing. Extracted word bitmaps with baseline and x-line information are the basic units for the downstream process of word matching, and are represented with primitive strings as described in Section 2.
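The overview above leaves the merger operation unspecified. The fragment below shows one plausible way to group connected components from a single text line into word objects by their horizontal gaps; the gap threshold, the box representation, and the function name are assumptions for illustration, not details taken from the paper.

```python
# A hypothetical sketch of merging connected components into word objects.
# Components are (left, top, right, bottom) boxes from one text line;
# components whose horizontal gap is small relative to their height are
# merged into the same word.
def merge_into_words(component_boxes, max_gap=0.4):
    """Return word bounding boxes obtained by merging nearby components."""
    words = []
    for box in sorted(component_boxes):              # sort by left coordinate
        left, top, right, bottom = box
        if words:
            w_left, w_top, w_right, w_bottom = words[-1]
            height = max(bottom - top, w_bottom - w_top, 1)
            if left - w_right <= max_gap * height:   # small gap: same word
                words[-1] = (w_left, min(top, w_top),
                             max(right, w_right), max(bottom, w_bottom))
                continue
        words.append(box)                            # large gap: new word
    return words
```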
When a user keys in a query word, the system generates its corresponding word primitive token (WPT) by aggregating the characters primitive string tokens according to the character sequence of the word, with the special primitive & inserted between two adjacent PSTs. For example, the WPT of the word health is: (l,a)(n,x)(l,x)(&,&)(c,x)(e,x) (o,x)(&,&)(o,x)(e,x)(l,x)(&,&)(l,a)(&,&)(n,x)(l,a)(o,x)(&,&) (l,a)(n,x)(l,x).
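As a concrete illustration of this synthesis step, the sketch below builds a WPT from a small PST dictionary. Only the tokens that appear explicitly in the examples of this paper are included (Table 1 defines the full alphabet); the code is illustrative, not the authors' implementation.

```python
# Per-character primitive string tokens (PSTs) taken from the examples in
# the text; Table 1 of the paper lists the complete alphabet.
PST = {
    'h': [('l', 'a'), ('n', 'x'), ('l', 'x')],
    'e': [('c', 'x'), ('e', 'x'), ('o', 'x')],
    'a': [('o', 'x'), ('e', 'x'), ('l', 'x')],
    'l': [('l', 'a')],
    't': [('n', 'x'), ('l', 'a'), ('o', 'x')],
    'b': [('l', 'a'), ('o', 'x'), ('c', 'x')],
    'p': [('l', 'D'), ('o', 'x'), ('c', 'x')],   # descender ADA, printed (l,d) above
}

def word_primitive_token(word):
    """Concatenate the characters' PSTs, inserting the spacing primitive
    (&, &) between adjacent characters."""
    wpt = []
    for k, ch in enumerate(word):
        if k > 0:
            wpt.append(('&', '&'))
        wpt.extend(PST[ch])
    return wpt

# word_primitive_token('health') reproduces the WPT listed above.
```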

This primitive string is matched with the primitive string of each word object in the document image to measure the similarity between the two. According to the similarity measurement, we can estimate how relevant the word in the document image is to the query word. For a given query word, a larger score always corresponds to a better match. If the normalized maximum score in (10) is greater than the predefined threshold, then we recognize that the word image (or a portion of it) matches the query word. The match may be imprecise in the sense that certain primitives are missing. Some adjacent characters in an image word may touch each other for various reasons. This results in fewer character spacings being detected and, hence, fewer spacing primitives (&,&) in the primitive string. Looking at Table 2, a best matching track can be obtained by backtracking from the maximum score. Our investigation shows that the score along the best matching track decreases for each spacing primitive (&,&) in the query word's primitive string whose corresponding spacing primitive in the real word is missing. To remedy this problem, the normalized maximum score in (10) is modified as:

\[ S_m = \Big( \max_{i,j} V(i,j) + \Delta(N_g) \Big) \, / \, V_A(n,n), \tag{12} \]

where \(N_g\) is the number of such mismatched spacing primitives and \(\Delta(N_g) = 2N_g\) in our experiments.

4.2 Data Preparation
The digital library of our university stores a huge number of old student theses and books, and it has been deemed desirable to make these documents accessible on the Internet. The original plan was to digitize all the documents and then transform the document images into text format using a commercial OCR product. However, the OCR results were not satisfactory, and manually correcting the errors was obviously expensive and impractical for such a huge data set. It was thus decided to retain the documents as PDF files in image format. We selected 3,250 pages from 113 such PDF files of scanned student theses and 328 pages from 15 PDF files of scanned books to generate a document image database called the NUSDL database.

For various reasons, documents available on the Internet are also often in image format. Examples of such documents are the journal or conference papers available as IEEE Xplore online publications and similar papers at other journal online publication systems such as CVGIP: Graphical Models and Image Processing. We obtained 39 such documents with 294 pages, and named the collection the JPaper database.

Fig. 6. Results of searching for the word string.

To search words in such PDF files containing document images, we developed a tool based on our proposed approach, using the Acrobat SDK. When a PDF file is opened in Adobe Acrobat, the plug-in tool is able to detect and locate the user's query words in the document images without any OCR process, like the Find function provided by Adobe Acrobat for word search in text-format documents. Besides PDF document images, binary images in a UW document image database are also used in our experiments.

4.3 Experimental Results

Fig. 5. Results of searching for the word cool.

Fig. 5 shows the results of searching words in a scanned student thesis selected from the NUSDL database. In this example, the words cooler, Cryocooler, cryocooler, microcooler, and cooling in the document image are successfully detected and located when the user inputs the word cool for searching.
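Putting the pieces of Section 4.1 together, a word-spotting pass over one page reduces to scoring every extracted word object against the query's WPT and keeping the hits above a threshold. The sketch below is a hypothetical driver, not the authors' code: `match_score` stands for the normalized score of (10) or its adjusted form (12), and the 0.75 default is illustrative only (the paper tunes this threshold experimentally, see Table 3).

```python
# A hypothetical word-spotting driver: `page_word_objects` pairs each word's
# bounding box with its primitive string, and `match_score` is assumed to
# return the normalized similarity of (10) or its adjusted form (12).
def spot_word(query_wpt, page_word_objects, match_score, threshold=0.75):
    """Return (bounding box, score) for word images matching the query,
    or a portion of it, with a score above `threshold`."""
    hits = [(bbox, match_score(query_wpt, primitives))
            for bbox, primitives in page_word_objects]
    return sorted(((b, s) for b, s in hits if s > threshold),
                  key=lambda hit: hit[1], reverse=True)
```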
Another example is given in Fig. 6. The document image is stored in a PDF file which is accessible at the Web site of IEEE Xplore online publications. The words string, strings, and substring in different fonts are correctly detected and located in the document when string is used as the query word. To evaluate the performance of our system, we select 150 words as queries and search for their corresponding words and variations in the document images. The system achieves a precision ranging from to percent and

a recall 1 ranging from to percent, depending on the threshold, as shown in Table 3.

TABLE 3
Performance versus Threshold

A lower threshold results in a higher recall, but at the expense of lower precision; the reverse is true for a higher threshold. Thus, if the goal is to retrieve as many relevant words as possible, a lower threshold should be set. On the other hand, if low precision cannot be tolerated, the threshold should be raised.

It is common practice to combine recall R and precision P in some way so that system performance can be compared in terms of a single rating. Our experiments adopt one of the commonly used measures, the F_1 rating [4], which gives equal weight to R and P. It is defined as:

\[ F_1 = \frac{2RP}{R + P}. \tag{13} \]

Fig. 7 shows the average precision and recall, as well as the F_1 rating. On average, the best F_1 rating that the system can achieve is , where the precision and recall are and percent, and the threshold is set at . These results are also better than those of our previous system [32], whose precision and recall are and percent, respectively, with the best F_1 measure at .

Fig. 7. Performance versus different thresholds.

1. Precision is defined as the percentage of the number of correctly searched words over the number of all searched words, while recall is the percentage of the number of correctly searched words over the number of words that should be searched.
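For reference, the F_1 rating of (13) is straightforward to compute from a run's recall and precision; the helper below is a hypothetical convenience, not part of the described system.

```python
def f1_rating(recall, precision):
    """F_1 rating of (13): equal-weight combination of recall R and precision P."""
    return 2 * recall * precision / (recall + precision)

# For example, a recall of 0.80 and a precision of 0.70 give an F_1 of about 0.747.
```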

4.4 Comparison with Commercial OCR
The majority of the document images used in our test are packed in PDF files. Adobe Acrobat provides an OCR tool, Page Capture, which converts a page of a document image in a PDF file to text format using document image processing, so it is selected for comparison with our proposed approach. In Page Capture, if a word object in a document image cannot be recognized, its original image is kept and inserted at the corresponding position in the text for display purposes. Fig. 8a shows the OCR result on part of a document image page selected from a student thesis, in which the words TMS320C30, passband, and stopband have not been successfully converted into text format. These words therefore still cannot be searched even though the OCR tool has been applied to the page. In comparison, Fig. 8b shows the search result with the query word band using the word searching tool based on our proposed word matching approach. All of the words containing the string band are successfully detected.

Fig. 8. Comparison of Adobe Acrobat OCR and the proposed approach. (a) OCR results by Adobe Acrobat. (b) Results of searching for the word band by our approach.

To compare word searching performance between Adobe Acrobat and our proposed approach, the NUSDL database and JPaper database are used. The UW database is not included in the comparison because Adobe Acrobat can work on document images packed in PDF files only. The precision and recall achieved by Adobe Acrobat are and percent, respectively. Our approach achieves a precision of percent and a recall of percent when the threshold is set at . It can be seen that the precision of Adobe Acrobat is very high, while its recall is a little lower compared with our approach. The reason is that the OCR tool in Adobe Acrobat is lexicon dependent. A lexicon may be built into its recognition engine, so it can achieve very high precision; but it does not handle well most words that are not included in the lexicon, especially certain technical terms in student theses and papers. On the other hand, our proposed approach does not use any language or lexical information at all. In addition, our experiments also show that our proposed approach is approximately 2.6 times faster than the OCR tool.

5 APPLICATION B: DOCUMENT SIMILARITY MEASUREMENT
Measuring the similarity between documents has practical applications in document image retrieval. For instance, a user may use an entire document, rather than a keyword, to retrieve documents whose content is similar to the query document. As we have mentioned before, the word, rather than the character, is the basic unit of meaning in document image retrieval; therefore, we use word objects in documents as our basic unit of information. In this section, we present another application of our proposed partial word matching to document image retrieval, i.e., similarity measurement between two document images. The system diagram is shown in Fig. 9.

Fig. 9. System diagram for measuring the similarity between two document images.

5.1 Document Vector and Similarity
The word objects in the images are first bounded based on the analysis of the connected components. Some common words (such as a, the, by, etc.), called stop words in linguistics, are eliminated according to their shapes, as they do not contribute much to the overall weight of the document. Although the width estimates of single words are unreliable for stop word identification, stop words tend to be short and composed of only a few characters. Suppose that w and h represent the width and height of a word object, respectively. If the aspect ratio w/h of a word object is less than a certain threshold, it is excluded from the calculation of document vectors. In our experiments, this threshold is set at 2.0. This process can also prevent a short word from being matched with a long word. For example, it can prevent the word the and the word weather from being assigned to the same class, because the word the has already been eliminated. The remaining word objects in each document are represented by means of their corresponding primitive feature strings. An unsupervised classifier is employed to assign all of the word objects (except the stop words) coming from both document images to K object classes. The number of classes is initialized to zero, and the first incoming word generates a new class automatically. Each subsequent incoming word is compared with the words stored in the existing classes, using the word matching method described above to measure the similarity between two word objects. If the similarity score of the two word objects is greater than a predefined threshold, we assign them to the same class; otherwise, the incoming word is assigned to a new class. The respective occurrence frequencies of all classes in a document are used to build the document's vector. Each occurrence frequency is normalized by dividing it by the total number of word objects in the document.
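The paragraph above fully specifies the clustering and vector-building procedure, and equation (14) below gives the cosine similarity between the resulting vectors; the sketch that follows strings the two together. It is an illustration under stated assumptions, not the authors' implementation: word objects are assumed to be (width, height, primitive string) triples, `word_similarity` is assumed to return the normalized matching score of Section 3, and the threshold values and names are placeholders.

```python
import math

def build_document_vector(word_objects, word_similarity, classes,
                          aspect_threshold=2.0, match_threshold=0.75):
    """Assign word objects to classes and return {class index: frequency}.

    `classes` is a shared, growing list of class-representative primitive
    strings, so two documents classified with the same list share indices.
    Frequencies are normalized by the number of word objects kept after
    stop-word removal.
    """
    counts, kept = {}, 0
    for width, height, primitives in word_objects:
        if width / height < aspect_threshold:
            continue                      # shape-based stop-word elimination
        kept += 1
        for k, representative in enumerate(classes):
            if word_similarity(primitives, representative) > match_threshold:
                counts[k] = counts.get(k, 0) + 1
                break
        else:                             # no existing class matched
            classes.append(primitives)
            counts[len(classes) - 1] = 1
    return {k: c / kept for k, c in counts.items()} if kept else {}

def document_similarity(q_vec, y_vec):
    """Cosine similarity between two document vectors, as in (14) below."""
    dot = sum(v * y_vec.get(k, 0.0) for k, v in q_vec.items())
    norm_q = math.sqrt(sum(v * v for v in q_vec.values()))
    norm_y = math.sqrt(sum(v * v for v in y_vec.values()))
    return dot / (norm_q * norm_y) if norm_q and norm_y else 0.0
```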

The similarity score between two document vectors is defined as their scalar product divided by their lengths, which is the same as in our previous work [34]. The scalar product is calculated by summing up the products of the corresponding elements, which is equivalent to the cosine of the angle between the two document vectors seen from the origin. So, the similarity between the query document image Q and the archived document image Y is:

\[ S(Q, Y) = \frac{\sum_{k=1}^{K} q_k y_k}{\sqrt{\sum_{k=1}^{K} q_k^2 \; \sum_{k=1}^{K} y_k^2}}, \tag{14} \]

where \(Q = (q_1, q_2, \ldots, q_K)^T\) is the document vector of the image Q, \(Y = (y_1, y_2, \ldots, y_K)^T\) is the document vector of the image Y, and K is the dimension of the document vectors, which equals the number of classes generated by the unsupervised classifier.

5.2 Experimental Results
Our experiments compute the similarity measurement between document images selected from the NUSDL database. These articles address nine different kinds of topics, and they are student theses pertaining to electrical engineering, chemical engineering, mechanical engineering, medical engineering, etc. The topics and the number of documents in each group used in the experiments are listed in Table 4.

TABLE 4
Corpus Used in the Experiments

The similarity between any two document images in the corpus is calculated, with the proposed word image matching approach using (10) applied.

Fig. 10. The similarity with document C5.

Fig. 10 illustrates the similarity measurement of the document C5 in group C with the first three documents taken from each group. If the document C5 is chosen as a query document, the documents in group C (i.e., C1, C2, and C3) can be successfully retrieved from the corpus, subject to a suitable threshold. The retrieval performance of each group is tabulated in Table 5, according to different similarity thresholds. The system achieves an average precision ranging from to percent, and an average recall 2 ranging from to percent, depending on the similarity threshold. A lower threshold allows more relevant documents to be retrieved (hence, the higher recall) but at the expense of false positives (hence, the lower precision); the reverse is true for a higher threshold. Thus, if the goal is to retrieve as many relevant documents as possible, allowing irrelevant ones to be rejected by manual reading, then a lower threshold should be set. On the other hand, for low tolerance to false positives, the threshold should be raised.

2. Precision is defined as the percentage of the number of correctly retrieved articles over the number of all retrieved articles, while recall is the percentage of the number of correctly retrieved articles over the number of articles in the category.

We compare system performance based on partial word image matching (the PWM method, using (10)) with that based on entire word image matching (the EWM method, using (11)). The maximum F_1 ratings with respect to each group are illustrated in Fig. 11. From the results, we can see that the PWM method outperforms the EWM method in general. The overall F_1 rating achieved by the PWM method improves to , compared to the EWM method's .

6 CONCLUSION AND FUTURE WORK
Document images have become a popular information source in our modern society, and information retrieval in document image databases is an important topic in knowledge and data engineering research. Document image

12 LU AND TAN: INFORMATION RETRIEVAL IN DOCUMENT IMAGE DATABASES 1409 TABLE 5 Retrieval Performance for Different Thresholds retrieval without OCR has its practical value, but it is also a challenging problem. In this paper, we have proposed a word image matching approach with the ability of matching partial words. To measure the similarity of two word images, each word image is represented by a primitive string. Then, an inexact string matching technique is utilized to measure the similarity between the two primitive strings generated from the two word images. Based on the matching score, we can estimate how one word image is relevant to the other and decide whether one is a portion of the other. The proposed word image approach has been applied to two document image retrieval problems, i.e., word spotting and document similarity measurement. Experiments have been carried out on various document images such as scanned books, students theses, and journal/conference papers downloaded from the Internet, as well as UW document images. The test results confirm the validity and feasibility of the applications of the proposed word matching approach to document image retrieval. The setting of matching scores between different primitive codes is beyond the scope of this paper and is left to future research. How to choose a suitable threshold for achieving maximum F 1 rating also requires further investigation in the future. ACKNOWLEDGMENTS This research is jointly supported by the Agency for Science, Technology and Research, and the Ministry of Education of Singapore under research grant R /303. The authors would also like to thank Mr. Ng Kok Koon and Mrs. Khoo Yee Hoon of the digital library of the National University of Singapore for providing us with the document image data of student theses. Fig. 11. F 1 rating comparison between PWM and EWM. REFERENCES [1] D. Doermann, The Indexing and Retrieval of Document Images: A Survey, Computer Vision and Image Understanding, vol. 70, no. 3, pp , [2] M. Mitra and B.B. Chaudhuri, Information Retrieval from Documents: A Survey, Information Retrieval, vol. 2, nos. 2/3, pp , [3] G. Salton, J. Allan, C. Buckley, and A. Singhal, Automatic Analysis, Theme Generation, and Summarization of Machine- Readable Text, Science, vol. 264, pp , [4] Y. Yang and X. Liu, A Re-Examination of Text Categorization Methods, Proc. 22th Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , [5] K. Tagvam, J. Borsack, A. Condir, and S. Erva, The Effects of Noisy Data on Text Retrieval, J. Am. Soc. for Information Science, vol. 45, no. 1, pp , 1994.

13 1410 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 11, NOVEMBER 2004 [6] Y. Ishitani, Model-Based Information Extraction Method Tolerant of OCR Errors for Document Images, Proc. Sixth Int l Conf. Document Analysis and Recognition, pp , [7] M. Ohtam, A. Takasu, and J. Adachi, Retrieval Methods for English Text with Misrecognized OCR Characters, Proc. Fourth Int l Conf. Document Analysis and Recognition, pp , [8] S.M. Harding, W.B. Croft, and C. Weir, Probabilistic Retrieval of OCR Degraded Text Using N-Grams, Proc. European Conf. Research and Advanced Technology for Digital Libraries (ECDL 97), pp , [9] P. Kantor and E. Voorhes, Report on the TREC-5 Confusion Track, Online Proc. TREC-5, NIST special publication , pp , [10] T. Kameshiro, T. Hirano, Y. Okada, and F. Yoda, A Document Image Retrieval Method Tolerating Recognition and Segmentation Errors of OCR Using Shape-feature and Multiple Candidates, Proc. Fifth Int l Conf. Document Analysis and Recognition, pp , [11] K. Katsuyama et al., Highly Accurate Retrieval of Japanese Document Images through a Combination of Morphological Analysis and OCR, Proc. SPIE, Document Recognition and Retrieval, vol. 4670, pp , [12] F.R. Chen and D.S. Bloomberg, Summarization of Imaged Documents without OCR, Computer Vision and Image Understanding, vol. 70, no. 3, pp , [13] D.S. Bloomberg and F.R. Chen, Document Image Summarization without OCR, Proc. Int l Conf. Image Processing, vol. 2, pp , [14] J. Liu and A.K. Jain, Image-Based Form Document Retrieval, Pattern Recognition, vol. 33, no. 3, pp , [15] D. Niyogi and S. Srihari, The Use of Document Structure Analysis to Retrieve Information from Documents in Digital Libraries, Proc. SPIE, Document Recognition IV, vol. 3027, pp , [16] Y.Y. Tang, C.D. Yan, and C.Y. Suen, Document Processing for Automatic Knowledge Acquisition, IEEE Trans. Knowledge and Data Eng., vol. 6, no. 1, pp. 3-21, Feb [17] Y. He, Z. Jiang, B. Liu, and H. Zhao, Content-Based Indexing and Retrieval Method of Chinese Document Images, Prof. Fifth Int l Conf. Document Analysis and Recognition (ICDAR 99), pp , [18] A.L. Spitz, Duplicate Document Detection, Proc. SPIE, Document Recognition IV, vol. 3027, pp , [19] A.F. Smeaton and A.L. Spitz, Using Character Shape Coding for Information Retrieval, Proc. Fourth Int l Conf. Document Analysis and Recognition, pp , [20] A.L. Spitz, Shape-Based Word Recognition, Int l J. Document Analysis and Recognition, vol. 1, no. 4, pp , [21] A.L. Spitz, Progress in Document Reconstruction, Proc. 16th Int l Conf. Pattern Recognition, vol. 1, pp , [22] Z. Yu and C.L. Tan, Image-Based Document Vectors for Text Retrieval, Proc. 15th Int l Conf. Pattern Recognition, vol. 4, pp , [23] C.L. Tan, W. Huang, Z. Yu, and Y. Xu, Imaged Document Text Retrieval without OCR, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp , June [24] T.K. Ho, J.J. Hull, and S.N. Srihari, A Word Shape Analysis Approach to Lexicon Based Word Recognition, Pattern Recognition Letters, vol. 13, pp , [25] T. Syeda-Mahmood, Indexing of Handwritten Document Images, Proc. Workshop Document Image Analysis, pp , [26] A. Kolcz, J. Alspector, M. Augusteijn, R. Carlson, and G.V. Popescu, A Line-Oriented Approach to Word Spotting in Handwritten Documents, Pattern Analysis and Applications, vol. 3, no. 2, pp , [27] R. Manmatha, C. Han, and E.M. Riseman, Word Spotting: A New Approach to Indexing Handwriting, Proc. Int l Conf. Computer Vision and Pattern Recognition, pp , [28] J. DeCurtins and E. 
Yue Lu received the BE and ME degrees in telecommunications and electronic engineering from Zhejiang University, Zhejiang, China, in 1990 and 1993, respectively, and the PhD degree in pattern recognition and intelligent systems from Shanghai Jiao Tong University, Shanghai, China. He is a professor in the Department of Computer Science and Technology, East China Normal University. From 2000 to 2004, he was a research fellow with the Department of Computer Science, National University of Singapore. From 1993 to 2000, he was an engineer at the Shanghai Research Institute of China State Post Bureau. For his outstanding contributions to the research and development of the automatic mail sorting system OVCS, he won Science and Technology Progress Awards issued by the Ministry of Posts and Telecommunications in 1995, 1997, and 1998. His research interests include document image processing, document retrieval, character recognition, and intelligent systems.

Chew Lim Tan received the BSc (Hons) degree in physics in 1971 from the University of Singapore, the MSc degree in radiation studies in 1973 from the University of Surrey, UK, and the PhD degree in computer science in 1986 from the University of Virginia. He is an associate professor in the Department of Computer Science, School of Computing, National University of Singapore. His research interests include document image and text processing, neural networks, and genetic programming. He has published more than 170 research publications in these areas.
He is an associate editor of Pattern Recognition and has served on the program committees of the International Conference on Pattern Recognition (ICPR) 2002, the Graphics Recognition Workshop (GREC) 2001 and 2003, the Web Document Analysis Workshop (WDA) 2001 and 2003, the Document Image Analysis and Retrieval Workshop (DIAR) 2003, the Document Image Analysis for Libraries Workshop (DIAL) 2004, the International Conference on Image Processing (ICIP) 2004, and the International Conference on Document Analysis and Recognition (ICDAR). He is a senior member of the IEEE and the IEEE Computer Society.
