A New Implementation of the co-VAT Algorithm for Visual Assessment of Clusters in Rectangular Relational Data

Timothy C. Havens, James C. Bezdek, and James M. Keller
Department of Electrical and Computer Engineering, University of Missouri, Columbia, MO 65211, USA (havenst@gmail.com, jcbezdek@gmail.com, kellerj@missouri.edu)

Abstract. This paper presents a new implementation of the co-VAT algorithm. We assume we have an m x n matrix D, where the elements of D are pair-wise dissimilarities between m row objects O_r and n column objects O_c. The union of these disjoint sets comprises the N = m + n objects O. Clustering tendency assessment is the process by which a data set is analyzed to determine the number(s) of clusters present. In 2007, the co-Visual Assessment of Tendency (co-VAT) algorithm was proposed for rectangular data such as these. co-VAT is a visual approach that addresses four clustering tendency questions: i) How many clusters are in the row objects O_r? ii) How many clusters are in the column objects O_c? iii) How many clusters are in the union of the row and column objects O_r ∪ O_c? And iv) how many (co)-clusters are there that contain at least one object of each type? co-VAT first imputes pair-wise dissimilarity values among the row objects, yielding the square relational matrix D_r, and among the column objects, yielding the square relational matrix D_c, and then builds a larger square dissimilarity matrix D_r∪c. The first three clustering questions can then be addressed by using the VAT algorithm on D_r, D_c, and D_r∪c; D itself is reordered by shuffling the reordering indices of D_r∪c. The resulting co-VAT image of D may show tendency for co-clusters (problem iv). We first discuss a different way to construct this image, and then we extend a path-based distance transform, which is used in the iVAT algorithm, to co-VAT. The new algorithm, co-iVAT, shows dramatic improvement in the ability of co-VAT to show cluster tendency in rectangular dissimilarity data.
1 Introduction

Consider a set of objects O = {o_1, ..., o_N}. These objects can represent virtually anything: vintage bass guitars, pure-bred cats, cancer genes expressed in a microarray experiment, cake recipes, or web-pages. The object set O is unlabeled data; that is, each object has no associated class label. However, it is assumed that there are subsets of similar objects in O. These subsets are called clusters. Numerical object data are represented as X = {x_1, ..., x_N} ⊂ R^p, where each dimension of the vector x_i is a feature value of the associated object o_i. These features can be a veritable cornucopia of numerical descriptions, e.g., RGB values, gene expression, year of manufacture, number of stripes, etc.

Another way to represent the objects in O is with numerical relational data, which consist of N^2 values that represent the (dis)similarity between pairs of objects. These data are commonly represented by a relational matrix R = [r_ij = relation(o_i, o_j)], 1 ≤ i, j ≤ N. The relational matrix often takes the form of a dissimilarity matrix D; dissimilarity can be interpreted as a distance between objects. For instance, numerical data X can always be converted to D by d_ij = ||x_i - x_j|| (any vector norm on R^p). There are, however, similarity and dissimilarity relational data sets that do not begin as numerical object data; for these, there is no choice but to use a relational algorithm. Hence, relational data represent the most general form of input data.

An even more general form of relational data is rectangular. These data are represented by an m x n dissimilarity matrix D, where the entries are the pair-wise dissimilarity values between m row objects O_r and n column objects O_c. An example comes from web-document analysis, where the row objects are m web-pages, the column objects are n words, and the (dis)similarity entries are occurrence measures of words in web-pages [1]. The row and column objects are non-intersecting sets, so the pair-wise relations among the row (or column) objects themselves are unknown. Conventional relational clustering algorithms are ill-equipped to deal with rectangular data. Additionally, the definition of a cluster as a group of similar objects takes on a new meaning: there can be groups of similar objects composed of only row objects, of only column objects, or of mixed objects (often called co-clusters). In this paper we consider these four types of clusters in rectangular data.

Clustering is the process of grouping the objects in O in a sensible manner. This process is often performed to elucidate the similarity and dissimilarity among and between the grouped objects.
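As a minimal sketch (ours, not part of the paper), the conversion d_ij = ||x_i - x_j|| from object data X to a square dissimilarity matrix D can be written in a few lines of NumPy; the Euclidean norm is one admissible choice:

```python
import numpy as np

def object_data_to_dissimilarity(X):
    """Pairwise Euclidean dissimilarities d_ij = ||x_i - x_j|| for X of shape (N, p)."""
    X = np.asarray(X, dtype=float)
    # Broadcast to shape (N, N, p), then take the norm over the feature axis.
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
D = object_data_to_dissimilarity(X)
# D is symmetric with a zero diagonal; here d_01 = 5.0 (a 3-4-5 triangle).
```

Any other vector norm on R^p could be substituted for the Euclidean norm without changing the structure of the computation.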
Clustering has also been called unsupervised learning, typology, and partitioning [2]. Although clustering is typically thought of as only the act of separating objects into the proper groups, cluster analysis actually consists of three concise questions: i) cluster tendency (how many clusters are there?); ii) partitioning (which objects belong to which cluster, and to what degree?); and iii) cluster validity (are the partitions good?). The VAT [3] and co-VAT [4] algorithms address problem i).

1.1 VAT and iVAT Algorithms

The VAT algorithm displays an image of reordered and scaled dissimilarity data [3]. Each pixel of the grayscale VAT image I(D') displays the scaled dissimilarity value of two objects. White pixels represent high dissimilarity, while black represents low dissimilarity. Each object is exactly similar to itself, which results in zero-valued (black) diagonal elements of I(D'). The off-diagonal elements of I(D') are scaled to the range [0, 1]. A dark block along the diagonal of I(D') is a sub-matrix of similarly small dissimilarity values; hence, a dark block represents a cluster of objects that are relatively similar to each other. Thus, cluster tendency is shown by the number of dark blocks along the diagonal of the VAT image. The VAT algorithm is based on Prim's algorithm [5] for finding the minimum spanning tree (MST) of a weighted connected graph [3]. Algorithm 1 lists the steps of the VAT algorithm. The resulting VAT-reordered dissimilarity matrix D' can be normalized and mapped to a gray-scale image with black representing the minimum dissimilarity and white the maximum.

Algorithm 1: VAT Ordering Algorithm [3]
  Input: the n x n dissimilarity matrix D
  Data: K = {1, 2, ..., n}; I = J = ∅; P = (0, 0, ..., 0).
  Select (i, j) ∈ arg max_{p∈K, q∈K} D_pq.
  Set P(1) = i; I = {i}; and J = K - {i}.
  for r = 2, ..., n do
    Select (i, j) ∈ arg min_{p∈I, q∈J} D_pq.
    Set P(r) = j; replace I ← I ∪ {j} and J ← J - {j}.
  Obtain the ordered dissimilarity matrix D' using the ordering array P as
  D'_pq = D_{P(p),P(q)}, for 1 ≤ p, q ≤ n.
  Output: reordered dissimilarity matrix D'

Reference [6] proposed an improved VAT (iVAT) algorithm that uses a path-based distance measure from [7]. Consider D to represent the weights of the edges of a fully-connected graph. The path-based distance is defined as

  D''_ij = min_{p ∈ P_ij} max_{1 ≤ h < |p|} D_{p[h] p[h+1]},   (1)

where p ∈ P_ij is an acyclic path in the set P_ij of all acyclic paths between vertex i (o_i) and vertex j (o_j), p[h] is the index of the h-th vertex along path p, and |p| is the number of vertices along the path. Hence, D_{p[h] p[h+1]} is the weight of the h-th edge along path p. Essentially, the cost of each path p is the maximum weight of its edges, and the distance between i and j is the cost of the minimum-cost path in P_ij. The authors of [6] first transform D into D'' with (1), then use VAT on the transformed dissimilarity matrix. The iVAT images show considerable improvement over VAT images in showing the cluster tendency for tough cases.

Note that computing D'' exhaustively is very computationally expensive. We can show that i) the iVAT dissimilarity matrix D'' can be computed recursively from the VAT-reordered data D' (see Algorithm 2), and ii) the matrix D'' computed by Algorithm 2 is already in VAT order (we have an article in preparation that includes proofs of these assertions). We therefore write this matrix as D'' with the understanding that it is VAT-ordered. Essentially, iVAT is a distance transform that improves the visual contrast of the dark blocks along the VAT diagonal.
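A compact sketch (our code, not the authors') of the two building blocks above: Algorithm 1 as a Prim-like ordering, and the path-based distance of Eq. (1) computed directly by a Floyd-Warshall-style min-max closure. The paper's Algorithm 2 obtains the same transform more cheaply from the VAT-reordered matrix; the O(n^3) closure here is only meant to make the definition concrete:

```python
import numpy as np

def vat_order(D):
    """VAT ordering (Algorithm 1): start at a row of the largest dissimilarity,
    then repeatedly append the nearest not-yet-ordered object (Prim-like)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    i = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order, J = [i], set(range(n)) - {i}
    for _ in range(1, n):
        j = min(J, key=lambda q: min(D[p, q] for p in order))
        order.append(j)
        J.remove(j)
    P = np.array(order)
    return P, D[np.ix_(P, P)]          # permutation P and reordered matrix D'

def ivat_transform(D):
    """Path-based distance of Eq. (1): the min over paths of the max edge weight,
    via the update D''_ij = min(D''_ij, max(D''_ik, D''_kj)) over all k."""
    D2 = np.asarray(D, dtype=float).copy()
    for k in range(D2.shape[0]):
        D2 = np.minimum(D2, np.maximum(D2[:, k:k+1], D2[k:k+1, :]))
    return D2
```

For points 0, 1, 2, 10 on a line, the direct dissimilarity between 0 and 10 is 10, but the path through 1 and 2 has maximum edge weight 8, so the transform shrinks that entry to 8; this is the contrast sharpening iVAT exploits.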
1.2 co-VAT Algorithm

The co-VAT algorithm begins by creating a square matrix D_r∪c, part of which is composed of the rectangular dissimilarity matrix D. D_r∪c is created by first estimating the dissimilarity matrices D_r and D_c, which are, respectively, the square dissimilarity matrices that relate the objects in O_r and O_c to themselves, i.e., [D_r]_ij ≈ d(o_i, o_j) and [D_c]_ij ≈ d(o_{m+i}, o_{m+j}). D_r∪c is organized as in Eq. (2).
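As a quick sketch (ours; it assumes D_r and D_c have already been estimated, e.g. by Eqs. (3) and (4)), the block structure of D_r∪c can be assembled directly:

```python
import numpy as np

def build_union_matrix(D, Dr, Dc):
    """Assemble the (m+n) x (m+n) matrix [[D_r, D], [D^T, D_c]] of Eq. (2)."""
    return np.block([[Dr, D], [D.T, Dc]])

# Toy example with m = 2 row objects and n = 3 column objects.
D = np.arange(6, dtype=float).reshape(2, 3)   # rectangular dissimilarities
Dr = np.zeros((2, 2))                          # placeholder estimate of D_r
Dc = np.zeros((3, 3))                          # placeholder estimate of D_c
Druc = build_union_matrix(D, Dr, Dc)
# Druc has shape (5, 5) and is symmetric whenever D_r and D_c are symmetric.
```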

Algorithm 2: Recursive calculation of the iVAT image
  Input: D', the VAT-reordered dissimilarity matrix
  Data: D'' = [0]_{n x n}
  for r = 2, ..., n do
    1. j = arg min_{k = 1, ..., r-1} D'_rk
    2. D''_rc = D'_rc, for c = j
    3. D''_rc = max{D'_rj, D''_jc}, for c = 1, ..., r-1, c ≠ j
  D'' is symmetric; thus D''_rc = D''_cr.

The union matrix has the block form

  D_r∪c = [ D_r  D  ]
          [ D^T  D_c ],   (2)

whose (i, j)-th entry is (an estimate of) d(o_i, o_j) over all N = m + n objects: the upper-left m x m block is D_r, the lower-right n x n block is D_c, and the off-diagonal blocks are D and its transpose.

The elements of D_r and D_c are estimated from D using any vector norms on R^n and R^m:

  [D_r]_ij = λ_r ||d_i - d_j||, 1 ≤ i, j ≤ m,   (3)
  [D_c]_ij = λ_c ||d^i - d^j||, 1 ≤ i, j ≤ n,   (4)

where d_i is the i-th row of D, d^j is the j-th column of D, and λ_r and λ_c are scale factors chosen so that the means of the off-diagonal elements of D_r and D_c match the mean of D.

Section 2.1 presents a new method for finding the reordering of the rectangular dissimilarity matrix D. Section 2.2 adapts the iVAT distance transform to co-VAT, which shows improved performance over the standard co-VAT method. Section 3 presents a numerical example of the new implementations of co-VAT, and Section 4 concludes the paper.

2 New Implementations of co-VAT

2.1 Alternate co-VAT reordering scheme

The original co-VAT algorithm, outlined in Algorithm 3, reorders the rectangular matrix D by shuffling the VAT-reordering indices of D_r∪c. Thus, co-VAT is very dependent on the construction of D_r∪c. We have discovered that the original co-VAT fails to show cluster tendency in certain cases; we have an upcoming paper that discusses these cases in detail.

Algorithm 3: co-VAT Algorithm [4]
  Input: D, the m x n rectangular dissimilarity matrix
  Build estimates of D_r and D_c using Eqs. (3) and (4), respectively.
  Build D_r∪c using Eq. (2).
  Run VAT on D_r∪c, saving the permutation array P_r∪c = {P(1), ..., P(m+n)}.
  Initialize rc = cc = 0; RP = CP = 0.
  for t = 1, ..., m+n do
    if P(t) ≤ m then
      rc = rc + 1          // rc is the row component
      RP(rc) = P(t)        // RP holds the row indices
    else
      cc = cc + 1          // cc is the column component
      CP(cc) = P(t) - m    // CP holds the column indices
  Form the co-VAT-ordered rectangular dissimilarity matrix
  D' = [D'_ij] = [D_{RP(i),CP(j)}], 1 ≤ i ≤ m, 1 ≤ j ≤ n.
  Output: reordered dissimilarity matrices D', D'_r, D'_c, and D'_r∪c

Algorithm 4 presents a reordering scheme that does not depend on the reordering of D_r∪c; indeed, this matrix need not even be constructed. However, we still need the matrix of the union if we intend to assess cluster tendency in O_r∪c. Essentially, the reordering of the row indices of D is taken from the VAT-reordering of D_r, and the reordering of the column indices is taken from the VAT-reordering of D_c. Another advantage of this alternate reordering scheme is that the scale factors λ_r and λ_c in (3) and (4) can be ignored.

Algorithm 4: Alternate co-VAT Reordering Scheme
  Input: D, the m x n rectangular dissimilarity matrix
  Build estimates of D_r and D_c using Eqs. (3) and (4), respectively.
  1. Run VAT on D_r, saving the permutation array RP = {RP(1), ..., RP(m)}.
  2. Run VAT on D_c, saving the permutation array CP = {CP(1), ..., CP(n)}.
  3. Form the co-VAT-ordered rectangular dissimilarity matrix
     D' = [D'_ij] = [D_{RP(i),CP(j)}], 1 ≤ i ≤ m, 1 ≤ j ≤ n.
  Output: reordered dissimilarity matrices D', D'_r, and D'_c

2.2 Using the iVAT distance transform in co-VAT

We apply the iVAT distance transform in (1) to the three square co-VAT matrices, D_r, D_c, and D_r∪c, using the recursive formulation in Algorithm 2.
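A sketch of Eqs. (3)-(4) and the alternate reordering of Algorithm 4 (our code; the Euclidean norm and the argmax tie-breaking are assumptions): estimate D_r from the rows of D and D_c from its columns, VAT-reorder each, and index D with the two permutations. The scale factors λ_r and λ_c are included for completeness, although Algorithm 4 itself can ignore them:

```python
import numpy as np

def vat_order(D):
    """VAT ordering (Algorithm 1), repeated here so the sketch is self-contained."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    i = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order, J = [i], set(range(n)) - {i}
    for _ in range(1, n):
        j = min(J, key=lambda q: min(D[p, q] for p in order))
        order.append(j)
        J.remove(j)
    return np.array(order)

def estimate_dr_dc(D):
    """Eqs. (3)-(4): D_r from the rows of D, D_c from its columns, each scaled so
    its mean off-diagonal element matches the mean of D."""
    D = np.asarray(D, dtype=float)

    def scaled_pairwise(V):                       # V: one feature vector per row
        M = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
        off = M[~np.eye(M.shape[0], dtype=bool)]
        lam = D.mean() / off.mean() if off.mean() > 0 else 1.0   # scale factor
        return lam * M

    return scaled_pairwise(D), scaled_pairwise(D.T)

def covat_alternate(D):
    """Algorithm 4: reorder the rows of D by VAT on D_r, its columns by VAT on D_c."""
    Dr, Dc = estimate_dr_dc(D)
    RP, CP = vat_order(Dr), vat_order(Dc)
    return D[np.ix_(RP, CP)]
```

Note that only the rectangular D is ever reordered; D_r∪c is never built, which is the computational saving the alternate scheme claims.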
We denote the transformed matrices by D''_r, D''_c, and D''_r∪c (examples of these matrices are shown in Figs. 2(c), 2(d), and 2(e), respectively). Although, by definition, (1) could be applied to the rectangular dissimilarity matrix D directly by considering D to represent a partially-connected graph (edges exist only between row objects and column objects), applying this transform directly is computationally expensive. However, if we consider the elements of D''_r∪c that correspond to elements of the rectangular matrix, we can build the reordered rectangular matrix D'' from D''_r∪c. The rectangular co-iVAT image is created as follows:

1. Build D_r∪c and run VAT to produce D'_r∪c, where the reordering indices are P_r∪c = {P(1), ..., P(m+n)}.
2. Compute D''_r∪c from D'_r∪c using the recursive iVAT distance transform outlined in Algorithm 2.
3. Build the rectangular co-iVAT image D'' from the corresponding elements of D''_r∪c. First, create the reordering indices K and L, where K are the indices of the elements of P_r∪c that are ≤ m and L are the indices of the elements of P_r∪c that are > m (m is the number of row objects). Then create D'' by

  D'' = [D''_ij] = [(D''_r∪c)_{K(i),L(j)}],  1 ≤ i ≤ m, 1 ≤ j ≤ n.   (5)

We have also adapted iVAT to the new reordering scheme of co-VAT presented in Algorithm 4. This adaptation requires the construction of D''_r∪c, as above, from which the corresponding elements are extracted to create the reordered rectangular matrix D''. We refer to the co-VAT matrices built with the iVAT distance transform as co-iVAT images. Next we present an example that shows the effectiveness of the new implementations of co-VAT.

3 Numerical Example

Our numerical example is composed of 360 row objects and 360 column objects, as displayed in Fig. 1(a). Note that, although the number of row objects m equals the number of column objects n, this data set is rectangular because O_r ∩ O_c = ∅. The associated dissimilarity data, calculated using Euclidean distance, are shown in Fig. 1(b). The column objects (shown as green squares) comprise three groups: two groups of 50 objects located around coordinates (1.5, 3) and (4.5, 3), and 260 objects organized along the curved line extending from the upper-left to the lower-right.
The row objects (shown as blue circles) also comprise three groups: two groups of 50 objects located around coordinates (1.5, 0.5) and (3.5, 5.5), and 260 objects organized along the curved line extending from the upper-left to the lower-right. Hence, this example has a preferable cluster tendency of 3 clusters of row objects, 3 clusters of column objects, 5 clusters in the union of the row and column objects, and 1 co-cluster.

Fig. 1: co-VAT images of 360 row objects and 360 column objects represented by rectangular dissimilarity data D. Panels: (a) object data (row objects and column objects in X-Y coordinates); (b) dissimilarity data D; (c) co-VAT image D'; (d) new co-VAT image D'; (e) co-VAT image D'_r; (f) co-VAT image D'_c; (g) co-VAT image D'_r∪c.

Figure 1(c,d) shows that both co-VAT and the new co-VAT display the 1 co-cluster as a diagonal band in the upper-left of the image, with both giving approximately equally pleasing results. The co-VAT images of D'_r and D'_c in Figs. 1(e,f), respectively, clearly show the two smaller clusters in each of the row objects and the column objects as 2 smaller dark blocks in the lower-right of the image. Again the large co-cluster is shown as a dark diagonal band in the upper-left. The image of D'_r∪c is shown in Fig. 1(g); it shows the 5 clusters in O_r ∪ O_c as four dark blocks in the lower-right and the dark diagonal band in the upper-left. While we hesitate to say that co-VAT and the new co-VAT have failed for this example, we believe that the large diagonal band leads to ambiguity as to the cluster tendency of these data. Furthermore, the low contrast in Figs. 1(c,d) makes it difficult to determine the number of co-clusters.

Fig. 2: co-iVAT images of 360 row objects and 360 column objects represented by rectangular dissimilarity data D. Panels: (a) co-iVAT image D''; (b) new co-iVAT image D''; (c) co-iVAT image D''_r; (d) co-iVAT image D''_c; (e) co-iVAT image D''_r∪c.

Figure 2 shows the corresponding co-iVAT images of the rectangular dissimilarity data shown in Fig. 1. The co-iVAT images give a very clear view of the cluster tendency for each of the four types of clusters: D'' shows 1 co-cluster, D''_r shows 3 row-clusters, D''_c shows 3 column-clusters, and D''_r∪c shows 5 clusters in O_r ∪ O_c.

4 Conclusion

This paper presented a new implementation of the co-VAT algorithm with two innovations: a new reordering scheme was presented in Algorithm 4, and each co-VAT algorithm was adapted to include the distance transform in (1). Although the numerical example shown in Fig. 1 does not clearly show the strength of the alternate reordering scheme presented in Algorithm 4, we are currently preparing a paper that will present several examples, many of which show that the alternate reordering scheme performs well in cases where the original co-VAT fails. Moreover, we emphasize that the alternate reordering scheme is computationally less expensive, as D_r∪c does not need to be constructed or VAT-reordered. Figure 2 shows that the co-iVAT images are clearly superior to the original co-VAT images in showing the cluster tendency of the four types of clusters in the rectangular data. Additionally, due to our recursive implementation in Algorithm 2, these improved images come at very little additional computational cost.

We left many questions unanswered in this paper and are currently penning an article that describes, in detail, each of the following:

1. What are the advantages and disadvantages of the alternate reordering scheme presented in Section 2.1? We have discovered pure relational data (data for which object data X do not exist) on which the original co-VAT formulation fails and on which the alternate reordering scheme is successful. Additionally, we have revisited the normalization of D_r and D_c in (3) and (4) and have devised different ways to normalize these matrices that show improved performance.
2. We have developed proofs of our assertions on iVAT presented in Section 1.1 and of the recursive formulation of (1) outlined in Algorithm 2.
3. The path-based distance transform can be applied to the rectangular dissimilarity data directly. We are working on an algorithm, similar to Algorithm 2, that will compute the distance transform of the rectangular dissimilarity data without having to exhaustively search every path in the partially-connected graph.
4. The sco-VAT [8] algorithm performs the operations of co-VAT for very large (unloadable) data. We are extending the co-VAT implementation presented here to the sco-VAT algorithm.
5. As always, we wish to demonstrate the performance of our algorithms on real data. However, gold-standard rectangular data sets are not as ubiquitous as square relational data sets. Hence, we are identifying data on which we can validate our work.

Bibliography

[1] Dhillon, I.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proc. 7th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, San Francisco, CA (2001)
[2] Theodoridis, S., Koutroumbas, K.: Pattern Recognition. 3rd edn. Academic Press, San Diego, CA (2006)
[3] Bezdek, J., Hathaway, R.: VAT: A tool for visual assessment of (cluster) tendency. In: Proc. IJCNN 2002, Honolulu, HI (2002)
[4] Bezdek, J., Hathaway, R., Huband, J.: Visual assessment of clustering tendency for rectangular dissimilarity matrices. IEEE Trans. Fuzzy Systems 15(5) (October 2007)
[5] Prim, R.: Shortest connection networks and some generalisations. Bell System Tech. J. 36 (1957)
[6] Wang, L., Nguyen, T., Bezdek, J., Leckie, C., Ramamohanarao, K.: iVAT and aVAT: enhanced visual analysis for cluster tendency assessment. In review (2009)
[7] Fischer, B., Zöller, T., Buhmann, J.: Path based pairwise data clustering with application to texture segmentation. Energy Minimization Methods in Computer Vision and Pattern Recognition 2134 (2001)
[8] Park, L., Bezdek, J., Leckie, C.: Visualization of clusters in very large rectangular dissimilarity data. In: Gupta, G.S., Mukhopadhyay, S. (eds.): Proc. 4th Int. Conf. Autonomous Robots and Agents (February 2009)


CIS 121 Data Structures and Algorithms Minimum Spanning Trees

CIS 121 Data Structures and Algorithms Minimum Spanning Trees CIS 121 Data Structures and Algorithms Minimum Spanning Trees March 19, 2019 Introduction and Background Consider a very natural problem: we are given a set of locations V = {v 1, v 2,..., v n }. We want

More information

SELECTION OF THE OPTIMAL PARAMETER VALUE FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM. Olga Kouropteva, Oleg Okun and Matti Pietikäinen

SELECTION OF THE OPTIMAL PARAMETER VALUE FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM. Olga Kouropteva, Oleg Okun and Matti Pietikäinen SELECTION OF THE OPTIMAL PARAMETER VALUE FOR THE LOCALLY LINEAR EMBEDDING ALGORITHM Olga Kouropteva, Oleg Okun and Matti Pietikäinen Machine Vision Group, Infotech Oulu and Department of Electrical and

More information

The Ordered Covering Problem

The Ordered Covering Problem The Ordered Covering Problem Uriel Feige Yael Hitron November 8, 2016 Abstract We introduce the Ordered Covering (OC) problem. The input is a finite set of n elements X, a color function c : X {0, 1} and

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REVIEW ON CONTENT BASED IMAGE RETRIEVAL BY USING VISUAL SEARCH RANKING MS. PRAGATI

More information

Properly even harmonious labelings of disconnected graphs

Properly even harmonious labelings of disconnected graphs Available online at www.sciencedirect.com ScienceDirect AKCE International Journal of Graphs and Combinatorics 12 (2015) 193 203 www.elsevier.com/locate/akcej Properly even harmonious labelings of disconnected

More information

Rectangular Partitioning

Rectangular Partitioning Rectangular Partitioning Joe Forsmann and Rock Hymas Introduction/Abstract We will look at a problem that I (Rock) had to solve in the course of my work. Given a set of non-overlapping rectangles each

More information

Non-exhaustive, Overlapping k-means

Non-exhaustive, Overlapping k-means Non-exhaustive, Overlapping k-means J. J. Whang, I. S. Dhilon, and D. F. Gleich Teresa Lebair University of Maryland, Baltimore County October 29th, 2015 Teresa Lebair UMBC 1/38 Outline Introduction NEO-K-Means

More information

ON THE STRUCTURE OF SELF-COMPLEMENTARY GRAPHS ROBERT MOLINA DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE ALMA COLLEGE ABSTRACT

ON THE STRUCTURE OF SELF-COMPLEMENTARY GRAPHS ROBERT MOLINA DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE ALMA COLLEGE ABSTRACT ON THE STRUCTURE OF SELF-COMPLEMENTARY GRAPHS ROBERT MOLINA DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE ALMA COLLEGE ABSTRACT A graph G is self complementary if it is isomorphic to its complement G.

More information

Similarity-Driven Cluster Merging Method for Unsupervised Fuzzy Clustering

Similarity-Driven Cluster Merging Method for Unsupervised Fuzzy Clustering Similarity-Driven Cluster Merging Method for Unsupervised Fuzzy Clustering Xuejian Xiong, Kian Lee Tan Singapore-MIT Alliance E4-04-10, 4 Engineering Drive 3 Singapore 117576 Abstract In this paper, a

More information

A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH

A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH A REVIEW ON IMAGE RETRIEVAL USING HYPERGRAPH Sandhya V. Kawale Prof. Dr. S. M. Kamalapur M.E. Student Associate Professor Deparment of Computer Engineering, Deparment of Computer Engineering, K. K. Wagh

More information

Efficient Algorithms for Solving Shortest Paths on Nearly Acyclic Directed Graphs

Efficient Algorithms for Solving Shortest Paths on Nearly Acyclic Directed Graphs Efficient Algorithms for Solving Shortest Paths on Nearly Acyclic Directed Graphs Shane Saunders Tadao Takaoka Department of Computer Science and Software Engineering, University of Canterbury, Christchurch,

More information

ON THE STRONGLY REGULAR GRAPH OF PARAMETERS

ON THE STRONGLY REGULAR GRAPH OF PARAMETERS ON THE STRONGLY REGULAR GRAPH OF PARAMETERS (99, 14, 1, 2) SUZY LOU AND MAX MURIN Abstract. In an attempt to find a strongly regular graph of parameters (99, 14, 1, 2) or to disprove its existence, we

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Recognizing Interval Bigraphs by Forbidden Patterns

Recognizing Interval Bigraphs by Forbidden Patterns Recognizing Interval Bigraphs by Forbidden Patterns Arash Rafiey Simon Fraser University, Vancouver, Canada, and Indiana State University, IN, USA arashr@sfu.ca, arash.rafiey@indstate.edu Abstract Let

More information

Elastic Partial Matching of Time Series

Elastic Partial Matching of Time Series Elastic Partial Matching of Time Series L. J. Latecki 1, V. Megalooikonomou 1, Q. Wang 1, R. Lakaemper 1, C. A. Ratanamahatana 2, and E. Keogh 2 1 Computer and Information Sciences Dept. Temple University,

More information

The Clustering Validity with Silhouette and Sum of Squared Errors

The Clustering Validity with Silhouette and Sum of Squared Errors Proceedings of the 3rd International Conference on Industrial Application Engineering 2015 The Clustering Validity with Silhouette and Sum of Squared Errors Tippaya Thinsungnoen a*, Nuntawut Kaoungku b,

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

LATIN SQUARES AND THEIR APPLICATION TO THE FEASIBLE SET FOR ASSIGNMENT PROBLEMS

LATIN SQUARES AND THEIR APPLICATION TO THE FEASIBLE SET FOR ASSIGNMENT PROBLEMS LATIN SQUARES AND THEIR APPLICATION TO THE FEASIBLE SET FOR ASSIGNMENT PROBLEMS TIMOTHY L. VIS Abstract. A significant problem in finite optimization is the assignment problem. In essence, the assignment

More information

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p.

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. Title A fuzzy k-modes algorithm for clustering categorical data Author(s) Huang, Z; Ng, MKP Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p. 446-452 Issued Date 1999 URL http://hdl.handle.net/10722/42992

More information

Lorentzian Distance Classifier for Multiple Features

Lorentzian Distance Classifier for Multiple Features Yerzhan Kerimbekov 1 and Hasan Şakir Bilge 2 1 Department of Computer Engineering, Ahmet Yesevi University, Ankara, Turkey 2 Department of Electrical-Electronics Engineering, Gazi University, Ankara, Turkey

More information

Graphs, graph algorithms (for image segmentation),... in progress

Graphs, graph algorithms (for image segmentation),... in progress Graphs, graph algorithms (for image segmentation),... in progress Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 66 36 Prague 6, Jugoslávských

More information

Decreasing a key FIB-HEAP-DECREASE-KEY(,, ) 3.. NIL. 2. error new key is greater than current key 6. CASCADING-CUT(, )

Decreasing a key FIB-HEAP-DECREASE-KEY(,, ) 3.. NIL. 2. error new key is greater than current key 6. CASCADING-CUT(, ) Decreasing a key FIB-HEAP-DECREASE-KEY(,, ) 1. if >. 2. error new key is greater than current key 3.. 4.. 5. if NIL and.

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2 Graph Theory S I I S S I I S Graphs Definition A graph G is a pair consisting of a vertex set V (G), and an edge set E(G) ( ) V (G). x and y are the endpoints of edge e = {x, y}. They are called adjacent

More information

Pattern Recognition Methods for Object Boundary Detection

Pattern Recognition Methods for Object Boundary Detection Pattern Recognition Methods for Object Boundary Detection Arnaldo J. Abrantesy and Jorge S. Marquesz yelectronic and Communication Engineering Instituto Superior Eng. de Lisboa R. Conselheiro Emídio Navarror

More information

Texture Segmentation by Windowed Projection

Texture Segmentation by Windowed Projection Texture Segmentation by Windowed Projection 1, 2 Fan-Chen Tseng, 2 Ching-Chi Hsu, 2 Chiou-Shann Fuh 1 Department of Electronic Engineering National I-Lan Institute of Technology e-mail : fctseng@ccmail.ilantech.edu.tw

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Clustering. Supervised vs. Unsupervised Learning

Clustering. Supervised vs. Unsupervised Learning Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now

More information

An Introduction to Graph Theory

An Introduction to Graph Theory An Introduction to Graph Theory CIS008-2 Logic and Foundations of Mathematics David Goodwin david.goodwin@perisic.com 12:00, Friday 17 th February 2012 Outline 1 Graphs 2 Paths and cycles 3 Graphs and

More information

2.3 Algorithms Using Map-Reduce

2.3 Algorithms Using Map-Reduce 28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure

More information

Part 4. Decomposition Algorithms Dantzig-Wolf Decomposition Algorithm

Part 4. Decomposition Algorithms Dantzig-Wolf Decomposition Algorithm In the name of God Part 4. 4.1. Dantzig-Wolf Decomposition Algorithm Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Introduction Real world linear programs having thousands of rows and columns.

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 2, Issue 10, October 2012 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A New Efficient

More information

A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm

A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm IJCSES International Journal of Computer Sciences and Engineering Systems, Vol. 5, No. 2, April 2011 CSES International 2011 ISSN 0973-4406 A Novel Approach for Minimum Spanning Tree Based Clustering Algorithm

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD. Ertem Tuncel and Levent Onural

VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD. Ertem Tuncel and Levent Onural VIDEO OBJECT SEGMENTATION BY EXTENDED RECURSIVE-SHORTEST-SPANNING-TREE METHOD Ertem Tuncel and Levent Onural Electrical and Electronics Engineering Department, Bilkent University, TR-06533, Ankara, Turkey

More information

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019 CS 341: Algorithms Douglas R. Stinson David R. Cheriton School of Computer Science University of Waterloo February 26, 2019 D.R. Stinson (SCS) CS 341 February 26, 2019 1 / 296 1 Course Information 2 Introduction

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Algebraic Graph Theory- Adjacency Matrix and Spectrum

Algebraic Graph Theory- Adjacency Matrix and Spectrum Algebraic Graph Theory- Adjacency Matrix and Spectrum Michael Levet December 24, 2013 Introduction This tutorial will introduce the adjacency matrix, as well as spectral graph theory. For those familiar

More information

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms

Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California,

More information

Math 778S Spectral Graph Theory Handout #2: Basic graph theory

Math 778S Spectral Graph Theory Handout #2: Basic graph theory Math 778S Spectral Graph Theory Handout #: Basic graph theory Graph theory was founded by the great Swiss mathematician Leonhard Euler (1707-178) after he solved the Königsberg Bridge problem: Is it possible

More information

Image Classification through Dynamic Hyper Graph Learning

Image Classification through Dynamic Hyper Graph Learning Image Classification through Dynamic Hyper Graph Learning Ms. Govada Sahitya, Dept of ECE, St. Ann's College of Engineering and Technology,chirala. J. Lakshmi Narayana,(Ph.D), Associate Professor, Dept

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

Variable-resolution Velocity Roadmap Generation Considering Safety Constraints for Mobile Robots

Variable-resolution Velocity Roadmap Generation Considering Safety Constraints for Mobile Robots Variable-resolution Velocity Roadmap Generation Considering Safety Constraints for Mobile Robots Jingyu Xiang, Yuichi Tazaki, Tatsuya Suzuki and B. Levedahl Abstract This research develops a new roadmap

More information

Progress Report: Collaborative Filtering Using Bregman Co-clustering

Progress Report: Collaborative Filtering Using Bregman Co-clustering Progress Report: Collaborative Filtering Using Bregman Co-clustering Wei Tang, Srivatsan Ramanujam, and Andrew Dreher April 4, 2008 1 Introduction Analytics are becoming increasingly important for business

More information

Controlling the spread of dynamic self-organising maps

Controlling the spread of dynamic self-organising maps Neural Comput & Applic (2004) 13: 168 174 DOI 10.1007/s00521-004-0419-y ORIGINAL ARTICLE L. D. Alahakoon Controlling the spread of dynamic self-organising maps Received: 7 April 2004 / Accepted: 20 April

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Chapter 18 out of 37 from Discrete Mathematics for Neophytes: Number Theory, Probability, Algorithms, and Other Stuff by J. M. Cargal.

Chapter 18 out of 37 from Discrete Mathematics for Neophytes: Number Theory, Probability, Algorithms, and Other Stuff by J. M. Cargal. Chapter 8 out of 7 from Discrete Mathematics for Neophytes: Number Theory, Probability, Algorithms, and Other Stuff by J. M. Cargal 8 Matrices Definitions and Basic Operations Matrix algebra is also known

More information

Manuscript Click here to download Manuscript: Interval neutrosophic MST clustering algorithm and its an application to taxonomy.

Manuscript Click here to download Manuscript: Interval neutrosophic MST clustering algorithm and its an application to taxonomy. Manuscript Click here to download Manuscript: Interval neutrosophic MST clustering algorithm and its an application to taxonomy.pdf 0 0 0 0 0 INTERVAL NEUTROSOPHIC MST CLUSTERING ALGORITHM AND ITS AN APPLICATION

More information

Matrix Multiplication

Matrix Multiplication Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse

More information

Basic relations between pixels (Chapter 2)

Basic relations between pixels (Chapter 2) Basic relations between pixels (Chapter 2) Lecture 3 Basic Relationships Between Pixels Definitions: f(x,y): digital image Pixels: q, p (p,q f) A subset of pixels of f(x,y): S A typology of relations:

More information

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference Minh Dao 1, Xiang Xiang 1, Bulent Ayhan 2, Chiman Kwan 2, Trac D. Tran 1 Johns Hopkins Univeristy, 3400

More information

8. Clustering: Pattern Classification by Distance Functions

8. Clustering: Pattern Classification by Distance Functions CEE 6: Digital Image Processing Topic : Clustering/Unsupervised Classification - W. Philpot, Cornell University, January 0. Clustering: Pattern Classification by Distance Functions The premise in clustering

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Image Segmentation. Selim Aksoy. Bilkent University

Image Segmentation. Selim Aksoy. Bilkent University Image Segmentation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr Examples of grouping in vision [http://poseidon.csd.auth.gr/lab_research/latest/imgs/s peakdepvidindex_img2.jpg]

More information

18.S34 (FALL 2007) PROBLEMS ON HIDDEN INDEPENDENCE AND UNIFORMITY

18.S34 (FALL 2007) PROBLEMS ON HIDDEN INDEPENDENCE AND UNIFORMITY 18.S34 (FALL 2007) PROBLEMS ON HIDDEN INDEPENDENCE AND UNIFORMITY All the problems below (with the possible exception of the last one), when looked at the right way, can be solved by elegant arguments

More information

Digital Image Processing. Introduction

Digital Image Processing. Introduction Digital Image Processing Introduction Digital Image Definition An image can be defined as a twodimensional function f(x,y) x,y: Spatial coordinate F: the amplitude of any pair of coordinate x,y, which

More information

CS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003

CS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003 CS 664 Slides #11 Image Segmentation Prof. Dan Huttenlocher Fall 2003 Image Segmentation Find regions of image that are coherent Dual of edge detection Regions vs. boundaries Related to clustering problems

More information

Figure 2.1: A bipartite graph.

Figure 2.1: A bipartite graph. Matching problems The dance-class problem. A group of boys and girls, with just as many boys as girls, want to dance together; hence, they have to be matched in couples. Each boy prefers to dance with

More information

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Clustering of Data with Mixed Attributes based on Unified Similarity Metric Clustering of Data with Mixed Attributes based on Unified Similarity Metric M.Soundaryadevi 1, Dr.L.S.Jayashree 2 Dept of CSE, RVS College of Engineering and Technology, Coimbatore, Tamilnadu, India 1

More information

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data D.Radha Rani 1, A.Vini Bharati 2, P.Lakshmi Durga Madhuri 3, M.Phaneendra Babu 4, A.Sravani 5 Department

More information