A Computer Vision System for Graphical Pattern Recognition and Semantic Object Detection

Tudor Barbu
Institute of Computer Science, Iaşi, Romania

Abstract
We address a set of problems from the image pattern recognition and interpretation domain. The introduction describes the general aspects of this computer vision task. We then propose an approach for identifying image objects and a structure for these objects. A shape classification and a structure-based classification of the obtained image objects are then performed. Finally, after the extraction and recognition steps are completed, an object interpretation (semantic extraction) method is proposed. The paper ends with a conclusions section.

Keywords: pattern recognition, shape classification, graphical objects structure, image interpretation, semantic extraction, human interaction, image retrieval

1. Introduction
Our main task is to implement a computer vision system that detects, recognizes and interprets image objects. Any color or grayscale image contains a finite number of uniform and/or textured regions. We consider a graphical object to be a connected set of such image regions. Our proposed computer vision system performs the following steps:
1. A preprocessing stage
2. A region-based image segmentation
3. Graphical (image) object identification
4. Image object recognition
5. A semantic-based classification of the recognized objects
We will not insist here on the first two steps. Depending on the state of the image, several enhancement filtering operations are applied in the preprocessing stage: image smoothing, edge enhancement or contrast adjustment ([1],[2]). On the resulting enhanced image we then perform a region-based segmentation, using uniformity (or texture) similarity as the similarity criterion ([1],[4],[5]).
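The data flow through the five stages listed above can be organized as a small skeleton. The function below is our own illustration, not part of the described system; every stage is passed in as a function, since the concrete methods are only given in the sections that follow:

```python
def vision_pipeline(image, preprocess, segment, identify, classify, interpret):
    """Chain the five stages of the system. Each stage is a function
    standing in for a method described later in the paper: enhancement
    filtering, region segmentation, object extraction, recognition and
    semantic labeling."""
    enhanced = preprocess(image)   # 1. preprocessing (smoothing, contrast, ...)
    regions = segment(enhanced)    # 2. region-based segmentation
    objects = identify(regions)    # 3. graphical object identification
    classes = classify(objects)    # 4. shape + structure recognition
    return interpret(classes)      # 5. semantic-based classification
```

The skeleton only fixes the order of the stages; any concrete implementation of the five methods can be plugged in.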
In the third step we define a structure for the graphical objects, based on the image segments obtained in the second step. Then a recursive extraction method for image objects is proposed. At the fourth level, a moment-based shape feature extraction, supervised and unsupervised shape classification methods, and finally a structure-based clustering are provided. In the last step we give an image interpretation algorithm. We have implemented the described approaches in Matlab and obtained a set of satisfactory results.

2. A recursive image object detection approach
In this section we present a structuring method and an extraction approach for graphical objects. The clusters obtained from the previous image segmentation are analysed and labeled as background or foreground. We propose the following rule for foreground extraction: each cluster is analysed, and if the area of the smallest convex polygon enclosing that cluster is equal to or greater than 3/4 of the total image area, the cluster is considered background; otherwise it is a foreground segment. The computed foreground clusters become the primitives, our basic graphical units. A graphical primitive is no longer characterized by texture or homogeneity; shape is its only characteristic. Each image object is modeled as an n-order complex object, which represents a set of connected primitives. A first-order one is called a simple object. A complex object is characterized not only by its shape but also by its structure, which is related to the types of its components. Our image recognition approach is based on the following property: two objects are considered similar if they have similar shapes and similar structures. Since uniformity (texture) similarity is not compulsory, we do not consider the texture
(uniformity) as a primitive feature.

Let p_1, p_2, ..., p_N be the image primitives. An n-order object x of the input image I can be modeled as x = {p_i1, p_i2, ..., p_in}, where i_1, i_2, ..., i_n ∈ {1, ..., N}. An m-order subobject of x is a complex object y = {p_j1, ..., p_jm}, where j_1, ..., j_m ∈ {i_1, ..., i_n}. For x we define its shape, Shape(x), as the set of coordinates (X, Y) of its pixels, therefore having the relation:

   Shape(x) = Shape(p_i1) ∪ ... ∪ Shape(p_in)   (1)

We also express the structure of x, Struct(x), as the pair:

   Struct(x) = (S(x), R(x))   (2)

where S(x) = {i_1, i_2, ..., i_n} and R(x) = R(p_i1, ..., p_in) represents a relationship between primitives, related to the way they are combined in the body of x. There are many ways to express R, each of them describing the neighborhood relations in the set {p_i1, ..., p_in}. The basic form of R is that of an n × n neighborhood matrix, neigh_x^n, given by:

   neigh_x^n(a, b) = 1, if p_ia and p_ib are connected; 0, if p_ia and p_ib are unconnected   (3)

To obtain easier implementations of the matrices neigh_x^n, we consider the general neighborhood matrix of I, defined as:

   neigh_I(i, j) = 1, if p_i and p_j are connected; 0, otherwise   (4)

This matrix is computed by considering each pair of different simple objects (p_i, p_j), i ≠ j. These primitives are neighbors if there is at least one pair of connected pixels such that one pixel belongs to p_i and the other to p_j. We use 4-neighborhoods in this work. For each (p_i, p_j), i < j and i, j ∈ {1, ..., N}, all coordinate elements of Shape(p_i) and Shape(p_j) are searched until a pair (C_1, C_2), C_1 ∈ Shape(p_i) and C_2 ∈ Shape(p_j), is found with the property d(C_1, C_2) = 1 (d being the Euclidean distance between vectors). In this case p_i and p_j are neighbors; otherwise they are not.

Extracting an object x from the image I means computing Shape(x), S(x) and R(x). The next recursive algorithm computes S(x) only, the other two measures being obtained from (1) and (3).
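A minimal Python sketch of this construction, under the assumption that each primitive is given as a set of (x, y) pixel coordinates (function names and data layout are ours): `neighborhood_matrix` realizes relation (4) with the 4-neighborhood test just described, and `extract_objects` follows the recursive computation of the index sets S(x) given next.

```python
from itertools import combinations

def neighborhood_matrix(primitives):
    """Relation (4): neigh_I(i, j) = 1 iff some pixel of p_i is a
    4-neighbor of some pixel of p_j (Euclidean distance 1)."""
    n = len(primitives)
    neigh = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Scan the smaller primitive and probe its four neighbors
        # against the other one, rather than testing all pixel pairs.
        small, big = sorted((primitives[i], primitives[j]), key=len)
        if any((x + dx, y + dy) in big
               for (x, y) in small
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))):
            neigh[i][j] = neigh[j][i] = 1
    return neigh

def extract_objects(n, neigh):
    """Index sets S(x) of all n-order objects: every (n-1)-order
    object is extended with each primitive connected to it."""
    N = len(neigh)
    if n == 1:
        return [frozenset({i}) for i in range(N)]
    result = set()
    for s_x in extract_objects(n - 1, neigh):
        for s in range(N):
            if s not in s_x and any(neigh[s][t] for t in s_x):
                result.add(s_x | {s})
    return sorted(result, key=sorted)
```

Returning frozensets removes duplicates automatically: an n-order object reachable from several (n-1)-order objects is counted once.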
Our method first extracts the (n−1)-order objects, then adds to each of them the primitives from its neighborhood. The n-order objects are obtained this way.

1. If n = 1 then the n-order objects are x_i = {p_i} and S(x_i) = {i}, for i = 1, ..., N.
   Else {n > 1}
2. Find the (n−1)-order objects recursively, resulting x_i and S(x_i), i = 1, ..., K, with K < N.
   For i = 1 to K do
     For each p_s with s ∉ S(x_i) do
       If there exists t ∈ S(x_i) such that neigh_I(s, t) = 1 then
3.       Find the n-order object x, given by S(x) = S(x_i) ∪ {s};

All the complex objects of I can be identified by applying this procedure for each order n. Having all the graphical objects, we can now focus on their recognition. Every complex object x is given by the pair (Shape(x), Struct(x)) = (Shape(x), (S(x), R(x))); therefore the classification task has to be divided into two problems, related to Shape(x) and to (S(x), R(x)), respectively. Obviously, objects of different orders cannot belong to the same class, even if they have similar shapes, because of their structures. Since classification makes sense only for objects of the same order, we can consider each n as determining a category of image objects: the class of n-order objects (let us name it the n-class). Each n-class will be divided into subclasses on the principle of shape similarity and structure similarity.

3. A shape recognition approach
We formulate our shape recognition task as a pattern recognition problem with two steps: feature extraction and pattern classification. Considering a given shape as a pattern, a similar new shape could be obtained through geometric transforms, such as translation, rotation, scaling and reflection, or smooth deformation transforms, such as stretching, squeezing and dilation. For featuring a shape we therefore need a measure invariant to these transformations. There are many approaches in this direction, the best known being the Fourier descriptors ([6]), the moment-based methods ([1],[3]), Zernike polynomials or scalar methods.
We have used a moment-based approach for shape feature extraction. Testing first area moments, then perimeter moments, for shape description, we obtained similar feature vectors. Therefore we describe here only the area moments case, when all the pixels of a pattern are used in the computation. For an object o we consider all the elements of Shape(o), the (x, y) coordinate pairs corresponding to the pixels of o, and the image function

   f(x, y) = 1, if (x, y) ∈ Shape(o); 0, if (x, y) ∉ Shape(o).   (5)
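As an illustration, relation (5) makes the area moment sums reduce to simple iterations over the coordinate pairs of Shape(o); a minimal sketch (helper names are ours):

```python
def raw_moment(shape_o, p, q):
    """(p+q)-order area moment of object o. Since f(x, y) = 1 exactly
    on Shape(o) and 0 elsewhere, the double sum over the whole image
    reduces to a sum over the object's coordinate pairs."""
    return sum((x ** p) * (y ** q) for (x, y) in shape_o)

def centroid(shape_o):
    """Center of gravity (x_bar, y_bar) = (m10/m00, m01/m00);
    m00 is simply the number of pixels (the area) of o."""
    m00 = raw_moment(shape_o, 0, 0)
    return (raw_moment(shape_o, 1, 0) / m00,
            raw_moment(shape_o, 0, 1) / m00)
```

The central and normalized moments derived below are built from these same sums, taken with respect to the centroid.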
For this f, the (p+q)-order moment of o becomes:

   m_pq = Σ_x Σ_y x^p y^q f(x, y) = Σ_{(x,y) ∈ Shape(o)} x^p y^q   (6)

The center of gravity (centroid) of the object o has the coordinate pair (x̄, ȳ) given by:

   x̄ = m_10 / m_00,  ȳ = m_01 / m_00   (7)

The central moment of o is obtained by computing the pq moment with respect to the centroid:

   μ_pq = Σ_x Σ_y (x − x̄)^p (y − ȳ)^q   (8)

The obtained shape feature μ_pq is a translation-invariant moment. Using the standard deviations in the x and y directions, given by

   σ_x = sqrt(μ_20 / m_00),  σ_y = sqrt(μ_02 / m_00),   (9)

a stretching, squeezing and dilation invariant moment is computed by:

   μ_pq = Σ_x Σ_y ((x − x̄)/σ_x)^p ((y − ȳ)/σ_y)^q   (10)

To make it scaling invariant, this new central moment must be normalized as below:

   η_pq = μ_pq / μ_00^γ,  γ = (p + q)/2 + 1   (11)

Using the η_pq moments, we propose a feature vector v(o) = (f_1, f_2), where

   f_1 = (η_30 + η_12)^2 + (η_03 + η_21)^2,   (12)

a Hu moment ([3]) used in expressing the object's eccentricity, invariant to rotation and reflection, and

   f_2 = η_24 + η_42,   (13)

a rotation and reflection invariant moment-based feature too. We tested both v(o) = (f_1, f_2) and the scalar form v(o) = f_1 + f_2, and obtained similar results in the classification stage that follows the feature detection step.

For a set of shapes, supervised or unsupervised learning approaches can be applied, depending on the knowledge we have about the classes. Our supervised classification approach is based on a VQ (vector quantization) method ([1],[7]), implementing a variant of the k-means algorithm. The number of classes, K, is set by human interaction after displaying the shapes on the screen. To obtain a training set, the user selects, by right-clicking on the screen, K appropriately different shapes, their feature vectors becoming the training vectors. We propose the following k-means algorithm:

For i = 1 to K do
1. C_i := {t_i};  (t_i, i = 1, ..., K, being the training vectors)
Repeat
  For i = 1 to N do  (v_1, ..., v_N being all the feature vectors)
2.
Find j for which d(v_i, mean(C_j)) is minimal
     If v_i ∉ C_j then {position change}
3.   C_j := C_j ∪ {v_i};
4.   Find C_l, l ≠ j, such that v_i ∈ C_l;
5.   C_l := C_l \ {v_i};
Until no more position changes {optimization criterion}

Our optimization criterion requires that every feature vector is in the right place. The algorithm performs a supervised classification, the sets C_i being the resulting classes of vectors.

Unsupervised learning consists of a natural grouping in the feature space, no training sets being used ([1]). We propose a region-growing algorithm for unsupervised classification. In the next pseudocode, K, the number of classes, is supposed known (it can also be set interactively).

For i = 1 to N do
1. C_i := {V(x_i)};
2. C := {};  {C is a helping set}
Repeat
3. m := number of sets C_i (initially N); index := 1;
   For i = 1 to m − 1 do
     For j = i + 1 to m do
       If d(mean(C_i), mean(C_j)) = d_min {smallest overall distance} then
4.       C(index) := C_i ∪ C_j;  {region merging}
5.       inc(index);
   For i = 1 to index − 1 do
6. C_i := C(i);
Until m = K {optimization criterion}

At each loop the closest regions are unified into a new region. The process continues until the optimization criterion is satisfied: the number of regions equals the number of classes. We could obtain a more natural grouping if we eliminated the knowledge of K and therefore the human interaction. If the number of clusters is unknown, at the region merging step we introduce one more condition for merging two classes: C_i and C_j are unified if d(mean(C_i), mean(C_j)) < T,
where T is a threshold depending on the values from C_i and C_j. The optimization criterion is modified as: Repeat ... until d_min ≥ T. As threshold we have tested the value T = (1/α) · min(mean(C_i), mean(C_j)). The K classes of shapes being obtained, the recognition task is therefore solved.

4. Structure-based object classifications
Let us now present the complex object recognition solution. We have considered the following situations:
1. Perform the order classification of the objects, computing the n-classes of image I. For each n-class do the following steps:
2. n = 1: In this case only a shape classification is performed, as presented in the last section. The next figure describes the shape classification results for a segmented image: five classes, containing 3, 2 or 1 simple objects, distinguished by different colors (the background is represented in black). Then each primitive p_i receives its class value: Class1(i) = the number, between 1 and K, of the class of p_i.
Fig. 1
3. n = 2: The 2-order image objects are computed and classified by shape first. Then, for each shape class, an S-classification is performed. For each x belonging to a class and given by S(x) = {i_1, i_2}, the unordered set {Class1(i_1), Class1(i_2)} is considered. If no object processed before x is described by this set, x is inserted into a new S-class. Otherwise it is included in the S-class whose objects have that set. The obtained S-classes being final classes, a class value is associated with each object: Class2(i) = the class number of the 2-order object x_i. The results for our particular case are represented in the next figure; each image a)-h), representing a complex 2-order object x_i, is labeled with the pair (i, Class2(i)).
Fig. 2  a) (1,1) b) (2,2) c) (3,3) d) (4,4) e) (5,5) f) (6,1) g) (7,2) h) (8,3)
The reason for this treatment is that for 2-order objects the neighborhood notion has no relevance: if x is a 2-order object and p_i, p_j are its two primitives, then p_i and p_j are obviously connected.
4. n ≥ 3: A shape classification is done first. Then, for each shape class, an S-classification is performed over it. Finally, every S-class is divided into subclasses determined by R similarity. Obviously, the case n ≥ 3 implies a total structure classification (using both S and R), while the n = 2 case implies only a partial structure classification (using S only). If x is an n-order object with n ≥ 3, a primitive p_i may or may not be connected with another primitive p_j; therefore the neighborhood relation, given by R(x), is very important. The examples offered in Fig. 3 give a graphical explanation of the difference between the cases: the two 3-order objects displayed there are not similar, although they have similar shapes and primitives, because of their different R relationships. We may use R(x) = neigh_x^n, n ≥ 3, or we can obtain new forms of R by transforming the neighborhood matrix of x. Determining the 2-order subobjects of x and then classifying them is a good approach for the structure classification of x.
Fig. 3
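The S-classification of 2-order objects described above amounts to grouping objects by the unordered set of their primitives' shape classes; a minimal sketch (function and variable names are ours):

```python
def s_classify_2order(objects, class1):
    """Assign Class2 values to 2-order objects: two objects share an
    S-class exactly when the unordered sets {Class1(i1), Class1(i2)}
    of their primitives' shape classes coincide.

    objects : list of index pairs S(x) = (i1, i2)
    class1  : class1[i] = shape class of primitive p_i
    Returns one S-class number per object, numbered from 1 in
    order of first appearance.
    """
    seen = {}      # unordered class set -> S-class number
    class2 = []
    for i1, i2 in objects:
        key = frozenset({class1[i1], class1[i2]})
        if key not in seen:
            seen[key] = len(seen) + 1   # open a new S-class
        class2.append(seen[key])
    return class2
```

Using a frozenset as the dictionary key makes the pair order irrelevant, exactly as required for an unordered set.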
5. A semantic extraction method
From relation (3) results the equivalence between neigh_x^n and the set {(p_ia, p_ib) | a, b ∈ {1, ..., n}, a ≠ b, p_ia and p_ib being neighbors}. This set is equivalent to {(i_a, i_b) | p_ia connected with p_ib, i_a, i_b ∈ S(x)} = {S(x_2) | x_2 ⊂ x, x_2 a 2-order object}, the 2-order subobjects being identified this way. Two complex objects are considered similar if they have similar shapes and also similar 2-order subobjects. Therefore, for each object x of a shape class, we consider the set of the indices of its 2-order subobjects, S_2(x) = {j_1, ..., j_t}. We then compute the set {Class2(j_1), ..., Class2(j_t)} as a feature vector for the object x. Then we perform a classification of these vectors, in the S-classification manner previously described, the final classes being obtained. The results for the case n = 4 are represented in the next figure, each complex graphical object x_i being placed in an image labeled with the pair (i, index of the class of x_i).
Fig. 4  a) (1,1) b) (2,2) c) (3,3) d) (4,4) e) (5,5) f) (6,6) g) (7,7) h) (8,1)

The last processing operation implemented by our computer vision system is graphical object interpretation (understanding). As we have just described, a number of classes are obtained as the object recognition result. Each resulting class obviously has the following property: all its members are either semantic entities or non-semantic ones. We have to handle an interpretation-based clustering task: to separate the semantic classes from the meaningless ones. Of course, the term semantic is a relative one, depending on which graphical objects are of interest to us. Those become semantic objects and the others remain just syntactic ones. We can associate to the semantic classes a text description of the meaning of their objects, and to the non-semantic ones the label meaningless. For example, in Fig.
2 we could label the first class with the text House, because it contains two objects representing houses, and the 4th class with meaningless. We propose the following method for automatic interpretation-based classification: a graphical object recognition is performed on an input image, resulting in a number of classes; a set of text-labeled image objects representing entities of interest is created (houses, trees, cars, for example) and their orders are determined; each such labeled object is then compared with the classes having the same order, and if it can be inserted into such a class, that class becomes semantic and is labeled with the object's description; finally, all labeled classes are considered semantic classes and the unlabeled ones become non-semantic, being labeled with meaningless. The semantic extraction task is therefore solved.
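The interpretation method above can be sketched as follows; the matching predicate `fits`, which stands in for the shape/structure comparison between a labeled exemplar object and a class, is our placeholder:

```python
def label_classes(classes, exemplars, fits):
    """Interpretation-based classification: a class becomes semantic
    when a text-labeled exemplar object of the same order can be
    inserted into it; all other classes are labeled "meaningless".

    classes   : list of (order, class_id) pairs from the recognizer
    exemplars : list of (order, text_label) pairs for the entities
                of interest (houses, trees, cars, ...)
    fits      : predicate fits(class_id, text_label) deciding whether
                the exemplar belongs to that class
    Returns a dict {class_id: text label or "meaningless"}.
    """
    labels = {cid: "meaningless" for _, cid in classes}
    for order, text in exemplars:
        for c_order, cid in classes:
            if c_order == order and fits(cid, text):
                labels[cid] = text   # the class becomes semantic
    return labels
```

Only classes of the same order as an exemplar are ever compared with it, mirroring the order restriction of the recognition stage.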
6. Conclusions
Graphical recognition and interpretation approaches have been proposed in this paper. They have been implemented and their results have been presented. These results can be successfully applied in the image retrieval domain, more exactly in solving graphical object retrieval tasks. For example, considering an n-order object as an image query, we have to solve the task of finding it in an image database. The output must be composed of all the images which contain similar graphical objects. Using our image understanding approach, the image query can be replaced with a text query related to the object's interpretation. Given a text label as input, the database is searched and the images which contain graphical objects having a close description are retrieved. For solving this task we should extract, classify and text-label the complex objects of each image in the analyzed database and compare the obtained labels with the input query.

References
[1] Anil K. Jain, Fundamentals of Digital Image Processing, Prentice Hall International, Inc., 1989.
[2] Ian T. Young, Jan J. Gerbrands, Lucas J. van Vliet, Fundamentals of Image Processing, 1998.
[3] M. K. Hu, Visual pattern recognition by moment invariants, IRE Transactions on Information Theory, vol. IT-8, February 1962, pp. 179-187.
[4] B. S. Manjunath and W. Y. Ma, Texture Features for Browsing and Retrieval of Image Data, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, 1996.
[5] Andrew Laine and Jian Fan, An Adaptive Approach for Texture Segmentation by Multi-channel Wavelet Frames, Center for Computer Vision and Visualization, August 1993.
[6] Dengsheng Zhang and Guojun Lu, Enhanced generic Fourier descriptors for object-based image retrieval, International Conference on Acoustics, Speech and Signal Processing, 2002.
[7] Richard O. Duda, Pattern Recognition for HCI, Department of Electrical Engineering, San Jose State University, 1996-1997.