Word extraction using irregular pyramid C. L. Tan a and P. K. Loo b


a School of Computing, National University of Singapore, Kent Ridge, Singapore
b Civil Engineering & Building Department, Singapore Polytechnic, 500 Dover Road, Singapore

ABSTRACT

This paper proposes a new algorithm for extracting text from document images, focusing on the extraction of word groups. An irregular pyramid structure is the basis of the algorithm. The algorithm is distinctive in that its analysis includes strategic background information, which most techniques discard: both foreground regions (i.e. text areas) and a portion of the background regions (i.e. white areas) are examined. The algorithm rests on the concept of closeness: text within a group is closer together, in terms of spatial distance, than it is to other text areas. The results are encouraging; the algorithm correctly groups words of different sizes, fonts, arrangements and orientations.

Keywords: Text extraction, Word group, Irregular pyramid, Image processing

1. INTRODUCTION

Text extraction from document images remains an important problem in image processing. Applications such as map interpretation, referencing systems for digitized manuscripts, and news article search on microfilm require some form of text extraction and the categorization of text images into logical groups. There have been many studies of text extraction, which can be divided into non-pyramid and pyramid techniques. Most non-pyramid techniques perform a detailed spatial analysis of the subject area and require assumptions about the physical spatial properties of the image. In some cases, text images must be aligned or grouped in a specific direction (i.e. horizontally or vertically).
The splitting and merging technique proposed by Wong, Casey and Wahl [6] requires text to be separable in the horizontal or vertical direction. Other techniques require a detailed analysis of inter-component spacing. The labeling algorithm [3] also requires the horizontal scale of the image to be altered to facilitate extraction. In the pyramid category, the majority of the proposed methods use a regular pyramid structure. Most of these studies require connected component analysis, and a strong assumption of disjoint components is needed to ensure that text images are extracted correctly [2]. The algorithm proposed here requires no connected component analysis. Pixels are aggregated into characters, and characters into words, through the natural grouping of pixels and regions. The algorithm makes no assumption about the size or orientation of text images; it handles text of any font, size, arrangement and orientation.

A regular pyramid shares with the irregular pyramid the benefit of image abstraction, which reduces computation cost and permits local analysis of image features, but it suffers from a rigid contraction scheme. Its processing can be parallel, but it is not strictly local. It is also shift dependent [8]. An irregular pyramid, in particular the stochastic pyramid [9], needs only local information for its decimation. It achieves a large decimation ratio and hence a faster pyramid construction time. The shift dependence problem no longer exists, because the structure is flexible enough to adapt to the input content: the content of the image controls the aggregation process. To date there are only a handful of studies on irregular pyramids. Some address the efficiency of building the pyramid by reducing the number of pyramid levels [5].
Others use the pyramid structure to perform region segmentation [1], edge detection [5], or connected component analysis [3]. No direct attempt has been made to use an irregular pyramid to extract logical text groups from a text image. This paper proposes an algorithm based on the irregular pyramid to perform such extraction, in particular the extraction of word groups.

2. FUNDAMENTAL CONCEPT OF THE ALGORITHM

Our proposed algorithm differs from many other text extraction techniques in two features. The first is the involvement of the non-text image area in the analysis. The second is the concept of closeness, which allows word groups to form.

2.1 Inclusion of background information

One of the unique features of the proposed algorithm is that it includes the background (i.e. non-text) image area in the analysis. Most proposed text extraction techniques consider background information unimportant and discard it, focusing the analysis only on the subject area (i.e. black pixels). Although this is a direct way to solve the problem at hand, the holistic view of the entire picture is lost. The problem with such an approach escalates when the extraction task goes beyond the character level and when text alignment and orientation are irregular. For word extraction, this approach may no longer be suitable. The motivation for including the non-text area comes from a simple observation: the proportional non-text area surrounding a character is smaller than the non-text area surrounding a word group. Crucial information about the spatial distance between text image objects is held in the non-text area, or white pixels.

Figure 1 Irregular regions containing word fragment.

With the background information included, we can view a text image as a combination of multiple smaller irregular regions, as illustrated in figure 1. Some regions contain text; others are empty. A potential word group can thus be viewed as a bigger region containing smaller regions that hold fragments of the word group, including empty regions. The concatenation of these region fragments forms a word group. Because the fragments can belong to different word groups, the problem arises of clustering each fragment into the right word group.
In our proposed algorithm, this problem is solved by introducing the concept of closeness.

2.2 Concept of closeness

A human reader can easily identify different logical text groups in a text image if the text within a group is closer together, in terms of spatial distance (i.e. in all directions), than it is to other text areas. The distance between characters within a word group is smaller than the distance between two separate word groups. If we view a text image as a combination of multiple regions, then a logical text group is a region enclosing various sub-regions that are close to each other. Some of these sub-regions contain text; others are empty. Two regions are considered close if each appears in the immediate surrounding of the other. No physical distance needs to be computed: to be close, regions only need to be in the same neighborhood. Where there are more than two regions, closeness is defined with respect to a reference point, the pivot region. A group of regions is considered close if all of them are close to a pivot region. The pivot region can also be viewed as a central force that pulls together all the surrounding regions considered close to it. An irregular pyramid structure makes it possible to implement this concept of closeness. Because the pyramid successively condenses the image resolution, closeness among local regions grows progressively, from level to level, into closeness among global regions.

Figure 2 Closeness between two regions (regions labeled r1 and r2).

Figure 3 Closeness among multiple regions (regions labeled r1 to r5 and a pivot region).
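The closeness relation of section 2.2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the adjacency table `layout`, the extra region `r6` and the function names `are_close` and `close_group` are hypothetical, chosen only to show that closeness is a pure neighborhood test with no distance computation.

```python
from typing import Dict, Set

Neighbours = Dict[str, Set[str]]


def are_close(a: str, b: str, neighbours: Neighbours) -> bool:
    """Two regions are close if one lies in the immediate surrounding of the other."""
    return b in neighbours.get(a, set())


def close_group(pivot: str, neighbours: Neighbours) -> Set[str]:
    """The pivot region together with every region in its immediate surrounding."""
    return {pivot} | neighbours.get(pivot, set())


# Hypothetical adjacency: r1..r5 touch the pivot region, r6 touches only r4.
layout: Neighbours = {
    "pivot": {"r1", "r2", "r3", "r4", "r5"},
    "r1": {"pivot", "r2", "r5"},
    "r2": {"pivot", "r1", "r3"},
    "r3": {"pivot", "r2", "r4"},
    "r4": {"pivot", "r3", "r5", "r6"},
    "r5": {"pivot", "r4", "r1"},
    "r6": {"r4"},
}

group = close_group("pivot", layout)
print(sorted(group))
```

In this toy layout `r6` is close to `r4` but not to the pivot, so it stays outside the pivot's group until a later, coarser level merges the regions between them.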

3. IRREGULAR PYRAMID

The basic pyramid structure used in this algorithm is based on the classical irregular pyramid model [9]. With various modifications and additional features, the new pyramid structure proves to yield good results for word extraction. The original design, which uses a uniformly distributed random number as the feature value for the decimation process, is replaced with a new scheme for assigning the feature value: the feature value is now the mass of a region. In the new pyramid structure, a single data point on any resolution level represents a region. Each data point has its own set of attributes: the total area (i.e. total number of black and white pixels), mass (i.e. number of black pixels enclosed by the region), density (i.e. mass/area), surviving value, list of neighbors, list of children and the parent link. Since the main aim is text extraction, the decimation process focuses on regions with mass. Empty regions are also processed, but with lower priority. Most text extraction algorithms discard such empty regions, whereas the proposed algorithm needs the involvement of all the pixel points.

Most pyramid construction processes continue until they reach the pyramid apex. Since the main purpose here is to extract word groups, the process stops only when there is no possibility of locating any other word group. A word group is identified when a region has no neighboring region with mass.

Another new feature is the use of region density. The density of a region is defined as the total mass enclosed by the region over the actual area of the region, i.e. the mass to area ratio. The density of a region that constitutes a word group reflects the size of the area enclosing the word's mass. All regions containing a word group will have a similar density value regardless of the region's mass.
A word group with bigger mass requires a larger area to enclose the word; a smaller mass has a smaller enclosing area. As a result, the density of a word group is relatively stable and is therefore a suitable stopping criterion for identifying word groups. To keep the density used as the stopping criterion accurate and up to date, each resolution level has its own target density. The target density of a resolution level is defined as the average density of all the word group regions on the lower resolution level. As more word groups form on each resolution level, the average density converges to a more accurate value, which becomes a better target density for the next higher resolution level. The idea is based on probability: if the majority of regions require a certain density to form a word group, the remaining regions should have the same ratio.

4. THE ALGORITHM

Three key components are involved in building a pyramid structure: the input image, a transformation function, and the output image of reduced resolution that the function produces. For a regular pyramid the function can be defined directly (e.g. the summation of four pixels into one) and is usually simpler. With an irregular pyramid the function is more complicated. In the proposed algorithm, transforming the current resolution level into a higher level involves four main steps: the identification of neighboring regions, the assignment of a surviving value to every region, the selection of the appropriate surviving regions, and finally the grouping of non-surviving regions as children of the surviving regions. Figure 4 below gives the pseudo code of the main algorithm.
 1: create_resolution_level(0, original_image)
 2: select_survivor(0)
 3: select_child(0, _)
 4: add_scope(0)
 5: done = false
 6: for (level = 1; pixel_number > 1 and !done; level++)
 7: {  number_word[level-1] = find_word(level-1)
 8:    target_density = compute_density(level-1)
 9:    if (level > 1 and number_word[level-2] == number_word[level-1]) done = true
10:    if (!done) { create_higher_level(level)
11:       update_neigbour_list(level)
12:       select_survivor(level)
13:       select_child(level, target_density)
14:       add_scope(level)
15:    }
16: }

Figure 4 Main algorithm.
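The target density that line 8 of figure 4 computes can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the `Region` class and the `target_density` function are hypothetical names, and the "average density of all the word group regions" is read here as the plain mean of the per-region densities (mass/area), which the paper does not spell out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Region:
    area: int  # total number of black and white pixels in the region
    mass: int  # number of black pixels enclosed by the region

    @property
    def density(self) -> float:
        """Mass to area ratio, as defined in section 3."""
        return self.mass / self.area


def target_density(word_regions: Iterable[Region]) -> Optional[float]:
    """Mean density of the word group regions found on the lower level.

    Returns None when no word group has been found yet (an assumption;
    the paper does not say what happens in that case).
    """
    regions = list(word_regions)
    if not regions:
        return None
    return sum(r.density for r in regions) / len(regions)


# Three hypothetical word group regions from the previous level.
words = [Region(area=100, mass=25), Region(area=40, mass=10), Region(area=20, mass=10)]
print(target_density(words))
```

Because a heavier word also occupies a larger enclosing area, the per-region densities stay in a narrow band, which is what makes their mean usable as a growth limit on the next level.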

The main algorithm is divided into two major sections. Lines 1 to 4 construct the base resolution level. From line 5 onwards the algorithm builds the higher resolution levels. This process repeats until only one pixel remains on the resolution level or there is no possibility of locating another word group.

4.1 Base resolution level

Pyramid construction begins at the base level (i.e. level 0). At this level, the input image is the original binary image in full resolution. Every region has an area of one and either one unit of mass (i.e. a black pixel) or no mass (i.e. a white pixel). The input image on this level is in its original square grid layout, so neighboring regions are easily identified by 8-connectivity. A region with mass is assigned a surviving value of one; a region with no mass has a surviving value of zero. This forces all survivors to be regions with mass. During the survivor selection process in line 2, each region is analyzed against all its neighboring regions by comparing their surviving values. A region survives if no neighboring region has a larger surviving value; such a region is a local maximum. Once a region is elected as a survivor, all its neighboring regions are prohibited from participating in any further evaluation with other regions. This ensures that no two survivors are neighbors. The final step, in line 3, is the child selection process, in which non-surviving regions are assigned to a surviving region. This step is essential, as it determines the grouping of potential word groups. Since this is only the base resolution level, locating a word group is very unlikely, so no special evaluation is required. A survivor claims every neighboring non-survivor that no other survivor is claiming.
Where two survivors claim the same non-survivor, the non-survivor is assigned to the survivor with the greater surviving value. On this resolution level the effect of this claiming rule is not obvious. Once regions grow to a bigger size on higher resolution levels, it becomes necessary to allow a region with bigger mass to claim more non-survivors than a region with less mass. A region with heavier mass is more likely to be the center of a word group, so allowing it to pull in more neighbors promotes the formation of a word group. After this process all non-survivors are assigned to a survivor and their parent-child links are updated.

Example: Figure 5 illustrates some of the key processes described above. There are seven regions on level i. Regions B and E are selected as the surviving regions, because they are the local maxima among their surrounding neighbors. During the child selection process, region A is readily selected as the child of region B, and regions D, F and G as the children of region E. Region C is grouped with region E because, of the two survivors, region E has the larger mass. On level i+1 the survivors B and E become the new members, re-labeled H and K respectively. Each inherits the total mass and area of its children. Since the area covered by region H is next to the area covered by region K on the lower resolution level i, H and K become neighbors on level i+1. Finally the parent-child links are established.

Figure 5 Irregular pyramid construction process. Level i holds regions A (2m), B (10m), C (6m), D (0m), E (15m), F (5m) and G (3m); level i+1 holds regions H (12m) and K (29m). The legend marks neighborhood relations, parent-child relations and the reference region.
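The figure 5 example can be reproduced with a short Python sketch of survivor selection, child selection and contraction. The masses are those given in figure 5; everything else is an assumption made for illustration: the adjacency table `neighbours` is hypothetical (the figure's edges did not survive in the text), the function names are invented, survivors are found by visiting regions in decreasing mass, and ties are broken by region name.

```python
from typing import Dict, List, Set

mass: Dict[str, int] = {"A": 2, "B": 10, "C": 6, "D": 0, "E": 15, "F": 5, "G": 3}
# Hypothetical neighborhood relation on level i.
neighbours: Dict[str, Set[str]] = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "E"}, "D": {"E"},
    "E": {"C", "D", "F", "G"}, "F": {"E"}, "G": {"E"},
}


def select_survivors(mass, neighbours) -> Set[str]:
    """Elect local maxima of mass; neighbours of a survivor are barred from election."""
    survivors, blocked = set(), set()
    for r in sorted(mass, key=lambda r: (-mass[r], r)):
        if r not in blocked:
            survivors.add(r)
            blocked |= {r} | neighbours[r]
    return survivors


def assign_children(survivors, mass, neighbours) -> Dict[str, List[str]]:
    """Give each non-survivor to its neighbouring survivor with the largest mass."""
    children: Dict[str, List[str]] = {s: [] for s in survivors}
    for r in sorted(mass):
        claimants = [s for s in neighbours[r] if s in survivors]
        if r not in survivors and claimants:
            children[max(claimants, key=lambda s: (mass[s], s))].append(r)
    return children


def lift_neighbours(children, neighbours) -> Dict[str, Set[str]]:
    """Two survivors become neighbours if the areas they cover touch on the lower level."""
    owner = {c: s for s, kids in children.items() for c in kids + [s]}
    lifted: Dict[str, Set[str]] = {s: set() for s in children}
    for r, adjacent in neighbours.items():
        for n in adjacent:
            if r in owner and n in owner and owner[r] != owner[n]:
                lifted[owner[r]].add(owner[n])
    return lifted


survivors = select_survivors(mass, neighbours)
children = assign_children(survivors, mass, neighbours)
new_mass = {s: mass[s] + sum(mass[c] for c in kids) for s, kids in children.items()}
new_neighbours = lift_neighbours(children, neighbours)
print(children, new_mass, new_neighbours)
```

With this adjacency the sketch elects B and E, gives the contested region C to the heavier survivor E, and yields masses of 12 and 29 for the two level i+1 regions, matching H and K in figure 5.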

Before proceeding to build the next higher resolution level, the algorithm executes a routine in line 4 called add_scope. This routine adds background information, or white area, just sufficient for the process to construct the next resolution level. Not all background information is used in pyramid construction: only the (non-text) area immediately surrounding a region with mass is added. This matches the local nature of pyramid construction, in which the process restricts its examination to the surrounding neighbors, so there is no need to include any background beyond the immediate surrounding. This greatly improves the efficiency of the algorithm and saves memory.

4.2 Higher resolution level

Once all processes on a resolution level are done, its output becomes the input to the construction of the next resolution level (i.e. line 10). The survivors obtained from the lower resolution level become the members of the higher resolution level, and the original resolution size is reduced by a certain factor. The child list of each member on the higher level represents the area that the member covers on the lower resolution level. The total area of all child regions, including the survivor itself, becomes the area of the new member on the higher level. Likewise, the new region inherits the mass of all its children and of the survivor. The new region is thus a condensed version of all the lower regions it covers. This exhibits one of the advantages of the irregular pyramid: the key image information of a group of pixels is condensed and represented in a single pixel. Unlike on the base level, on the higher levels the 8-connectivity property of a region no longer exists. To establish neighborhood links, the children of each region are analyzed.
This is done in line 11. If the children of any two regions are neighbors on the immediately lower level, the two regions become neighbors on the current level. In other words, two regions are neighbors if the areas they cover on a lower level are next to each other. Once the list of neighbors is determined, each region is again assigned a surviving value. As on the base resolution level, the surviving value of a region is its total mass. The mass varies with how wide an area the region covers on the lower resolution level. A heavier mass reflects a higher possibility of being the center of a word group; a lighter mass is more likely to be an edge or side of the word group. During the survivor selection process in line 12, heavier regions are preferred over lighter ones. This ensures that more edges of a heavier region can be grouped quickly to form a potential word. As on the base level, locating a word group on resolution level one is assumed to be unlikely, so child selection follows the normal process. Once the resolution level advances to level two and above, a more detailed child selection process is used.

4.3 Detail child selection

 1: if (exist mass in the survivor surrounding)
 2: {  for (all survivor neighbors with mass)
 3:    {  assign neighbor as survivor child  }
 4:    for (all survivor neighbors with no mass and new_density<target_density)
 5:    {  assign neighbor as survivor child  }
 6: }
 7: else
 8: {  for (all survivor neighbors and new_density<target_density)
 9:    {  assign neighbor as survivor child  }
10: }

Figure 6 Detail child selection algorithm.

A more complete child selection procedure is used from level two onwards. At these levels there are two categories of survivor. The first category consists of survivors that have neighbors with mass (i.e. lines 1-6). The second consists of survivors with no neighboring mass (i.e. lines 8-10). The two categories are mutually exclusive.
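The procedure of figure 6 can be sketched in Python under some loudly stated assumptions. The names `Region` and `detail_child_selection` are hypothetical. The density test follows the prose of this section, where blank regions are added until the target density is reached: since absorbing a blank region lowers mass/area, a blank neighbor is accepted here only while the density after absorbing it stays at or above the target. Figure 6 writes the test as new_density<target_density, so the direction of the comparison is an interpretation. Arbitration between competing survivors is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    name: str
    area: int  # black and white pixels covered
    mass: int  # black pixels covered


def detail_child_selection(
    survivor: Region, neighbours: List[Region], target_density: float
) -> Tuple[List[str], float]:
    """Return the names of the claimed children and the density of the grown region."""
    area, mass = survivor.area, survivor.mass
    children: List[str] = []

    # Lines 2-3: neighbours with mass are pulled in immediately.
    for n in neighbours:
        if n.mass > 0:
            children.append(n.name)
            area += n.area
            mass += n.mass

    # Lines 4-5 and 8-9: blank neighbours are added only while the grown
    # region stays at or above the target density (assumed direction).
    for n in neighbours:
        if n.mass == 0 and mass / (area + n.area) >= target_density:
            children.append(n.name)
            area += n.area

    return children, mass / area


survivor = Region("s", area=40, mass=20)
around = [Region("n1", 10, 5), Region("b1", 20, 0), Region("b2", 30, 0)]
claimed, density = detail_child_selection(survivor, around, target_density=0.3)
print(claimed, round(density, 3))
```

When no neighbor has mass, the first loop does nothing and the second loop alone covers the second category of survivor, which is why the sketch needs no explicit if/else.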
At any instance, a survivor falls into only one of the two categories. Conceptually, the first category covers survivors that still have pieces and parts of a word group in their neighborhood. The second covers survivors that may already have formed a word group, with no other mass in their surrounding. For the first category the algorithm is again divided into two parts. Unlike previously, these two parts are processed one after the other. The algorithm first focuses on the neighbors with mass (i.e. lines 2-3). Each such neighbor is pulled in as a child immediately. While doing so, the validation against other surviving regions, as on levels zero and one, is also carried out. Once all neighbors with mass have been examined, the algorithm goes on to process the remaining neighbors with zero mass (i.e. lines 4-5). Unlike on the previous two levels, where even blank regions are taken in with no questions asked, from this level onwards special care must be taken to avoid over-growing a region into another word group. Over-growth may cause word groups to overlap, so that two different word groups are wrongly merged. Density is used to control the growth. For each neighboring region added, the density of the new region is computed. Blank regions continue to be added until the target density of the resolution level is reached.

In the second category (i.e. lines 8-10), survivors are surrounded by blank regions. Since there is no other neighboring mass, such a surviving region may already have formed a word group. The majority of regions in this scenario are correct word groups, so no further pulling of neighbors is required. Nevertheless, a small percentage of regions are still only part of a word group. Density is again used to force these remaining regions to find the correct word group. As in the second part of the first category, the survivor continues to pull in blank regions until the new region reaches the target density of the resolution level.

5. RESULTS

One of the test cases used in our experiments is reported below. The original text image is shown in figure 7, which contains word groups of different sizes, fonts, arrangements and orientations. The image is 541 x 298 pixels and contains a total of 94 words. Figure 8 is the final output, with different shaded regions covering different word groups. In the actual pyramid structure, each shaded region is represented by one data point. The output shown in figure 8 is produced by traversing down to the base level using the parent-child links.
Figure 9 is an alternative representation of the output, in which rectangular boxes surround the extracted word groups.

Figure 7 Original text image (541 x 298 pixels).
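The two output representations, shaded regions obtained by following the parent-child links down to the base level and rectangular boxes around each word group, can be sketched in Python as below. The data layout is an assumption made for illustration: `children` maps each higher-level node to the nodes it covers one level down (including the survivor's own lower-level node), base-level pixels are `(x, y)` tuples, and the box is axis-aligned. The names `base_pixels` and `bounding_box` are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

Pixel = Tuple[int, int]


def base_pixels(region: Hashable, children: Dict[Hashable, List[Hashable]]) -> List[Pixel]:
    """Collect the base-level pixels covered by a region via its parent-child links."""
    pixels: List[Pixel] = []
    stack = [region]
    while stack:
        node = stack.pop()
        kids = children.get(node)
        if kids:
            stack.extend(kids)  # keep descending
        else:
            pixels.append(node)  # a node without children is a base pixel
    return pixels


def bounding_box(pixels: List[Pixel]) -> Tuple[int, int, int, int]:
    """Axis-aligned box (min_x, min_y, max_x, max_y) around a set of pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


# A toy three-level pyramid: word region "W" covers regions "P" and "Q".
children = {
    "W": ["P", "Q"],
    "P": [(2, 3), (3, 3)],
    "Q": [(4, 3), (4, 4)],
}
covered = base_pixels("W", children)
print(sorted(covered), bounding_box(covered))
```

An axis-aligned box is the simplest choice; for strongly rotated words a box fitted to the text direction would enclose less background, but the paper does not say which variant figure 9 uses.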

Figure 8 Output result showing various extracted word groups.

Figure 9 Different representation of the output result.

Figure 10 Output result at various resolution levels (table columns: level, number of extracted words, number of remaining pixels, density).

Figure 10 shows the details of the output at different resolution levels, including the number of word groups identified at each level. For levels 0, 1 and 2 the values reflect mere groupings of potential regions and not actual word groups. From level 3 onwards word groups are more likely to form, which is reflected in the slower reduction in the number of extracted word groups. Once there is no likelihood of forming a new word group, the process stops (i.e. at level 8). The density column shows that the density value re-adjusts itself from level to level as more and more word groups form. The number of remaining pixels includes the surviving pixels and the unprocessed background areas. Thus on level 8 there are 94 surviving pixels and 13,419 remaining white pixels.

6. CONCLUSIONS

This paper has reported an algorithm based on the irregular pyramid structure that extracts word groups of varying sizes, fonts, alignments and orientations. The proposed algorithm differs from other text extraction techniques in its inclusion of background information (i.e. non-text area) and its introduction of the closeness concept. Both text and non-text areas are used in the analysis. The algorithm views a text image as multiple irregular regions with or without text mass. A word group region is formed by concatenating smaller regions that are close to each other and therefore have high potential of being fragments of the word group. The algorithm uses the classical irregular pyramid model with various modifications. The region mass (i.e. total black pixels) is used as the feature value in the survivor selection process.
Unlike most pyramid structures, in which the final resolution level has only one pixel, the final level of our pyramid contains all the identified word groups, with no further possibility of locating more. The average density of all word groups on the previous level is used as the stopping criterion. The main steps of the algorithm are the identification of neighboring regions, the assignment of a surviving value to every region, the selection of surviving regions and finally the grouping of non-surviving regions as children of the surviving regions. The final step is the most essential, as it determines the clustering of the correct regions into word groups. All regions continue to grow into their immediate surrounding until the target density is reached. A region is identified as a word group when there is no other region with mass in its immediate surrounding. The final results in figures 8 and 9 show that the proposed technique can correctly identify word groups without any assumption about the physical text layout.

7. FUTURE WORKS

Our next stage of research is to extract multiple word groups. Writers normally place groups of words closer together, relative to other groups, to convey a similar message. For instance, all the words in this future works paragraph express what we are still experimenting with and what we will do in the future. The ability to identify and extract logical groups of words will greatly help in the analysis of document images. Extending the technique used in our word extraction, we are refining it to combine words into logical groups. An added feature that will help with this task is the computation of a word growing direction value, with which we are currently experimenting. As we examine the formation of a word region from multiple smaller regions, we observe continuity in the growing direction.
By aggregating the growing direction value from level to level, while absorbing more and more word fragments, we can determine which direction the algorithm should follow to combine neighboring words.

REFERENCES

1. A. Montanvert, P. Meer and A. Rosenfeld, Hierarchical image analysis using irregular tessellations, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 13, No. 4, (1991).
2. C. L. Tan and P. O. Ng, Text extraction using pyramid, Pattern Recognition, Vol. 31, No. 1, (1998).
3. G. Nagy and S. Seth, Hierarchical representation of optically scanned documents, In Proc. 7th Int. Conf. Patt. Recogn. (ICPR), (1984).
4. Hideyuki Negishi, Jien Kato, Hiroyuki Hase and Toyohide Watanabe, Character Extraction from Noisy Background for an Automatic Reference System, In Proc. 5th Int. Conf. on Document Analysis and Recogn. (ICDAR), (1999).
5. Horace H. S. Ip and Stephen W. C. Lam, Alternative strategies for irregular pyramid construction, Image and Vision Computing, 14, (1996).
6. K. Y. Wong, R. G. Casey and F. M. Wahl, Document analysis system, IBM J. Res. Development, Vol. 26, (1982).
7. P. Meer, Stochastic image pyramids, Computer Vision, Graphics and Image Processing, Vol. 45, No. 3, (1989).
8. W. G. Kropatsch and A. Montanvert, Irregular versus regular pyramid structures, In U. Eckhardt, A. Hubler, W. Nagel, and G. Werner, editors, Geometrical Problems of Image Processing, (1991).
9. W. G. Kropatsch, Irregular pyramids, Proceedings of the 15th OAGM meeting in Klagenfurt, (1991).
