XML Clustering by Bit Vector

Size: px

Start display at page:

Download "XML Clustering by Bit Vector"

Madeline Simmons
5 years ago
Views:

1 XML Clustering by Bit Vector WOOSAENG KIM Department of Computer Science Kwangwoon University 26 Kwangwoon St. Nowongu, Seoul KOREA Abstract: - XML is increasingly important in data exchange and information management. A large amount of efforts have been spent in developing efficient techniques for accessing, querying, and storing XML documents. In this paper, we propose a new method to cluster XML documents efficiently. A bit vector which represents a XML document is proposed to cluster the XML documents. The similarity between two XML documents is measured by a bit-wise AND operation between two corresponding bit vectors. The experiment shows that the clusters are formed well and efficiently when a bit vector is used for the feature of a XML document. Key-Words: - XML, XML clustering, bit vector, hierarchical clustering. 1 Introduction The growth of the Internet has greatly simplified access to existing information sources and spurred the creation of new sources. As the Internet continues to grow and evolve, more and more information is being placed in structurally rich documents such as XML. With the large number of documents on the Web, there is an increasing need to be able to automatically process those structurally rich documents for information retrieval and search applications. XML consists of elements and each element is a pair of matching start tags and end tags and all the text that appears between them. Since users can define tags freely, tags represent not only data itself but also data meaning. The elements of XML are organized in a nested structure such that XML can be modeled as an ordered labeled tree [1]. Each node in this tree corresponds to a tag in the document and is labeled with the tag's name. Each edge in this tree represents inclusion of the tag corresponding to the child node under the tag corresponding to the parent node in the XML. The general purpose of data clustering is to derive some relevant information from the various data for further data processing. The clustered data may show some tendency or regularity in the data and may even show some relevant knowledge worth noting. The clustering of XML documents is to group similar documents to facilitate searching because similar documents can be searched and processed within a specific category. The appropriate clustering of XML documents is also effective for systematic document management and the efficient storage of XML documents, and even for system protection purpose because unusual document can be discovered easily. This paper proposes a bit vector for the feature of a XML document to cluster the documents efficiently. A bit vector is constructed by the paths of the tree corresponding to a XML document. The similarity between two XML documents is measured by a bit-wise AND operation between two corresponding bit vectors. We regard the documents generated from the same DTDs as the similar documents to be clustered into one group. We propose two different methods to use bit vectors to cluster XML documents when their DTDs information is available and when it is not available. The experiment shows that the clusters are formed well and efficiently when a bit vector is used to cluster the XML documents. The rest of the paper is organized as follows. The related works are discussed in section 2. In section 3, we propose a method to construct a bit vector from a XML document. In section 3, we proposed a document bit vector and a DTD bit vector when the information about DTDs of the XML documents is available. In section 4, we propose a method to apply the well known hierarchical clustering algorithm with a document bit vector when the information about DTDs of the XML documents is not available. In section 5, we test and verify the effectiveness of our algorithm with several examples. Finally, we make a conclusion in section 6. ISSN: ISBN:

2 2 Related Works The need for organizing and clustering XML data has become challenging, due to the increase of heterogeneity of XML sources. Recently, several clustering techniques which consider the structure and/or the contents of XML documents are studied. Reference [2] applies a K-means clustering technique to XML documents represented in a vector-space model. In this representation, each document is represented by an N-dimensional vector, with N being the number of document features such as text features, tag features, and a combination of both in the collection. They only consider the contents of XML. In [3], [4], a new bitmap indexing based technique to cluster XML documents is described. A BitCube is presented in as a 3-dimensional bitmap index of triplets (document, XML-element path, word). BitCube indexes can be manipulated to partition documents into clusters by exploiting bit-wise distance and popularity measures. However, this method needs manual operations to create a bitmap index. Reference [5] devises features for XML data, focusing on content information extracted from textual elements and structure information derived from tag paths. They introduce the notion of tree tuple in the definition of an XML representation model that allows for mapping XML document trees into transactional data, i.e., variable length sequences of objects with categorical attributes. A partitional clustering approach has been developed and applied to the XML transactional domain. On the other hand, reference [6] transforms the structure of the XML document into a discrete function. The discrete function, then, is transformed into frequency domain by FFT. The result of FFT is a pair of complex numbers consisting of x and y values and considered to be a pair of n-dimensional vectors. The pairs of n-dimensional vectors are compared using a weighted Euclidean distance metric in an incremental and unsupervised fashion. This approach considers solely the structure of elements. Reference [7] transforms XML trees into vectors in a high dimensional Euclidean space based on the occurrences of the features in the documents. Next, they apply principal component analysis (PCA) to the matrix to reduce its dimensionality. Finally they use a K-means algorithm to cluster the vectors residing in the reduced dimensional space and place them in appropriate categories. However, their method only works for documents with the same DTD. Reference [8] proposes a method to extract representative paths by considering both the sequence and occurrence frequency of elements of XML tree by sequential pattern mining technique. Then the paths composed of items are clustered by the notion of large items, i.e., items contained in some minimum fraction of transactions in a cluster to measure the similarity of a cluster of transactions [9]. 3 Similar XML Document and a Bit Vector Since the XML documents generated by the same DTDs are similar ones, we try to cluster them into one group. For example, Fig. 2(a), 2(b), and 2(c) generated by the Actor DTD of Fig. 1 are similar documents. In this paper, a XML document is represented by all the paths from the root to the leaves of the tree corresponding to the document. Therefore, the document A1 of Fig. 2(a) is represented by the three paths of actor/name/lname, actor/filmography/movie/ title, and actor/filmography/movie/ year. On the other hand, the document A2 of Fig. 2(b) is represented by the four paths of actor/name/fname, actor/name/lname, actor/filmography/movie/title, and actor /filmography/ movie/year. While the structure of the document A3 of Fig. 2(c) is different from that of the document A2, the document A3 is represented by the same four paths of the document A2. Actor <!ELEMENT actor (name, filmography*)> <!ELEMENT name (fname?, lname)> <!ELEMENT fname (#PCDATA)> <!ELEMENT lname (#PCDATA)> <!ELEMENT filmography (movie)> <!ELEMENT movie (title, year)> <!ELEMENT title (#PCDATA)> <!ELEMENT year (#PCDATA)> Fig. 1. Actor DTD (a) (b) (c) Fig. 2. Trees correspond to Actor DTD When the four paths actor/name/fname, actor/ name/lname, actor/filmography /movie/title, actor/ filmography/movie/year generated by the documents A1, A2, and A3 are notated by P1, P2, P3, P4 ISSN: ISBN:

3 respectively, the document bit vectors constructed by these three documents are as shown in Table 1. In the following sections, we will explain how we can use the bit vectors to cluster XML documents. Table 1. Document Bit Vectors P1 P2 P3 P4 A A A Clustering Method when DTD Information is Available When we cluster the documents which have the DTDs information, we first get all the paths of the tree corresponding to each DTD. In this case, an element corresponding to a special symbol?, +, or * of a DTD is assumed to be occurred just one time. Therefore, the Actor DTD of Fig. 1 is represented by the four paths of actor/name/fname, actor/name/lname, actor/filmography/movie/title, and actor/filmography/movie/year. Then the DTD bit vector for the Actor DTD can be constructed by these paths. Let's suppose that the Star_Fan DTD of Fig. 3 is also available, then all the paths of the tree corresponding to the Star_Fan DTD are actor/name/fname, actor/name/lname, actor/club /name, actor/club/tel, actor/club/addr, actor/club/fan/ name, and actor /club/fan/tel. When these nine paths of actor/name/fname, actor/name/lname, actor/filmography/movie/title, actor/filmography/movie/year, actor/club/name, actor/club/tel, actor/club/addr, actor/club/fan/name, and actor/club/fan/tel are notated by P1, P2, P3, P4, P5, P6, P7, P8, and P9 respectively, the DTD bit vectors for the Actor DTD and Star_Fan DTD are shown in Table 2. Star_Fan <!ELEMENT actor (name, club*)> <!ELEMENT name (fname?, lname)> <!ELEMENT fname (#PCDATA)> <!ELEMENT lname (#PCDATA)> <!ELEMENT club (name, tel, addr, fan*)> <!ELEMENT name (#PCDATA)> <!ELEMENT tel (#PCDATA)> <!ELEMENT addr (#PCDATA)> <!ELEMENT fan (name, tel)> <!ELEMENT name (#PCDATA)> <!ELEMENT tel (#PCDATA)> Table 2. DTD Bit Vectors P1 P2 P3 P4 P5 P6 P7 P8 P9 Act S_F In order to cluster documents, their document bit vectors are constructed from the paths of the corresponding trees by the method illustrated in section 3. Then, the bit-wise AND operations are performed between all the DTD bit vectors and the document bit vector of the document to be clustered. A document is clustered into the DTD group where its bit-wise AND operation has the biggest value. For example, the document A1 of Fig. 2(a) is clustered in the following way. (i) The document bit vector of A1 which is comprised of 9 paths is constructed. Since the paths of A1 are P2, P3, and P4, 1 bit is assigned to these paths and 0 bit is assigned to other remaining paths for the document bit vector. Therefore, the document bit vector of A1 becomes (ii) the bit-wise AND operations are performed between the document bit vector of A1 and all the DTD bit vectors in TABEL 2. (iii) The document A1 is clustered into an Actor group because its bit-wise AND operation with the Actor DTD bit vector has greater value than its bit-wise AND operation with the Star_Fan DTD bit vector. 5 Clustering Method when DTD Information is not Available We use hierarchical clustering algorithm to cluster documents which do not have the DTDs information. The hierarchical clustering algorithms produce a hierarchy of nested clustering. A clustering R 1 containing k clusters is said to be nested in the clustering R 2, which contains r (<k) clusters, if each cluster in R 1 is proper subset of R 2. The hierarchical clustering algorithms are classified into two groups -agglomerative and divisive- in accordance with the building up direction of the clusters. The pseudo code of general agglomerative clustering algorithm is described in Fig. 4 when the total number of patterns is n. Fig. 3. Star_Fan DTD ISSN: ISBN:

4 1 Begin with n clusters, each consists of one pattern. 2 Repeat step 3 a total of n-1 times. 3 Find the most similar clusters C i and C j Fig. 4. Agglomerative clustering algorithm Different agglomerative clustering algorithms are obtained by using different methods to determine the similarity of clusters. The single-linkage algorithm is obtained by defining the distance between two clusters to be the smallest distance between two patterns such that one pattern is in each cluster. Therefore, if C i and C j are clusters, the distance between them is defined as in d ( C, C ) = min d ( X, Y ) (1) SL i j X C, Y C i j where d (a, b) denotes the distance between the samples X and Y. On the other hand, the complete-linkage algorithm is obtained by defining the distance between two clusters to be the largest distance between a pattern in one cluster and a pattern in the other cluster. Therefore, if C i and C j are clusters, the distance between them is defined as in d ( C, C ) = max d( X, Y ) (2) CL i j X C, Y C i j where d (a, b) denotes the distance between the samples X and Y. One of the important issues for clustering process is how a similarity measure between patterns is quantified. The most obvious measure of the similarity (or dissimilarity) between two patterns is the distance between them, especially Euclidean distance. Since we represent a document by a bit vector, we use a bit vector for a similarity measure between documents. The similar documents generated from the same DTDs have many common paths, therefore their bit-wise AND operations have a big value. Therefore, we quantify the similarity measure as the value of a bit-wise AND operation between two document bit vectors. The similarity between two XML documents f(x, Y) is defined as in (3) when the document bit vectors corresponding to two XML documents X and Y are notated by Vx and Vy. f(x,y) = bit-wise AND (V X,V Y ). (3) Let us assume that there are two Star_Fan documents S1 and S2 and their paths of the corresponding trees are as follows: S1= {actor/fname, actor/lname, actor/club/name, actor/club/tel, actor/ club/addr}. S2 = {actor/fname, actor/lname, actor/ club/name, actor/club/tel, actor/club/ addr, actor/club/ fan/name, actor/club/fan/tel}. The documents to be clustered are arrived in the order of A1, S1, A2, A3, and S2. Then the document bit vectors constructed by these five documents are shown in Table 3 where the paths generated from these documents are notated by A1={P1, P2, P3}, S1={P4, P1, P5, P6, P7}, A2 = {P4, P1, P2, P3}, A3 = { P4, P1, P2, P3}, and S2 = {P4, P1, P5, P6, P7, P8, P9}. Table 3. Document Bit Vectors P 1 P2 P3 P 4 P5 P6 P7 P 8 A S A A S In Table 3, the most similar two groups are S1 and S2 because their bit-wise AND operation has the biggest value 5. Therefore, S1 and S2 are clustered into one group. Next, A2 and A3 are clustered into one group because their bit-wise AND operation has the biggest value 4. In the last, A1 and {A2, A3} are clustered into one group because their bit-wise AND operation has the biggest value 3. Therefore, all the groups are clustered correctly as {A1, A2, A3} and {S1, S2} when two clusters are asked for five documents. 6 Clustering Experiments We use the data provided by the XML data bank of the University of Wisconsin to measure the effectiveness of our methods [10]. This data bank provides 8 DTDs such as bibliography, club, company profiles, stock quotes, department, personal information, movies, and actor. We test our methods with 80 XML documents generated by these 8 DTDs. P9 ISSN: ISBN:

5 6.1 Clustering Results when DTD Information is Available 8 DTD bit vectors corresponding to 8 DTDs and 80 document bit vectors corresponding to 80 documents are constructed to cluster 80 documents when their DTDs information is available. Fig. 5 shows the clustering result of 80 documents. The number of correctly clustered documents (NCC) and the number of incorrectly clustered documents (NIC) are counted. Fig. 5 shows that all the documents are correctly clustered into 8 groups. The result is justified in that a bit-wise AND operation between the document bit vector of a given document with the corresponding DTD bit vector always has the biggest value. Fig. 6. DTD and document bit vectors 6.2 Clustering Results when DTD Information is not Available 80 document bit vectors corresponding to 80 documents are constructed to cluster 80 documents when their DTDs information is not available. Fig. 7 shows the dendrogram when a single-linkage algorithms is applied to these 80 documents. All the documents are numbered to observe the result easily: Actor (1~5), Club (6~10), Bibliography (11~15), Movie (16~20), Department (21~25), Personal (26~30), Company (31~35), Stock (36~40). As we can see in the diagram, all the documents are clustered correctly. In this figure, we can observe that the Department documents (21~25) are clustered in the first. That's because a Department DTD has the most parsed character data (#PCDATA) so that its document bit vector has the most 1 bit. Fig. 5. Clustering number Fig. 6 shows the number of 1 bits of each DTD bit vector (NTB) and the average number of 1 bits of the document bit vectors (NDB) clustered into corresponding group. The reason why a Department DTD has the biggest value for NTB is that it has the most parsed character data (#PCDATA). In other words, the tree of a Department DTD has the most leaf nodes. In this figure, the value of NTB and NDB of Company or Stock DTD is the same. That's because all the elements included in the DTD of Company or Stock are also appeared in the corresponding document since the DTD of Company or Stock does not have special symbols such as '?' and '*'. Fig. 7. Dendrogram by single-linkage Fig. 8 is the dendrogram when a complete-linkage algorithm is applied to 80 documents. Although all the documents are clustered correctly as shown in Fig. 8, their clustering order is somewhat different from that of Fig. 7. For example, all the Department documents are clustered at first in Fig. 7. However, the Department document 21 is clustered with other Department documents after all the Company documents are clustered in Fig. 8. That's because the complete-linkage algorithm obtains the distance between two clusters to be the largest distance between a pattern in one cluster and a pattern in the other cluster. In case of a Department document, it has staff, undergraduate, and graduate data. However, in some cases a document can have either undergraduate or graduate data only. Thus, the Department ISSN: ISBN:

6 documents (23~25) are clustered first because they have both the undergraduate and graduate data so that many 1 bits are shared. However, the document 22 has only graduate data and the document 21 has only undergraduate data. Therefore, these documents only share staff data so that the clustering between the document 21 and the document 22 is delayed until all the Profile documents are clustered. Fig. 8. Dendrogram by complete-linkage 7 Conclusion We propose a bit vector as a feature of a XML document for clustering XML documents. The similarity between two XML documents is measured by a bit-wise AND operation between two corresponding bit vectors. We use not only a document bit vector but also a DTD bit vector to cluster XML documents when their DTD information is available. However, we use a hierarchical clustering algorithm with a document bit vector to cluster XML documents when their DTD information is not available. The experiment shows that the clusters are formed well and efficiently when a bit vector is used. Scientific and Statistical Database Management, Fairfax, Virginia, [4] J. Yoon, V. Raghavan, V.Chakilam, L. Kerschberg, BitCube: a 3-D bitmap indexing for XML documents, Journal of Intelligent Information Systems, Vol. 17, 2001, pp [5] A. Tagarelli, A. Greco, Toward semantic XML clustering, 6th SIAM International Conference on Data Mining (SDM 06), Bethesda, Maryland, USA, 2006, pp [6] H. Lee, An unsupervised clustering technique of XML documents based on function transform and FFT, Journal of Korea Information Processing Society, [7] J. Liu, T. Jason, L. Wang, W. Hsu, K.G. Herbert, XML clustering by principal component analysis, Proc. of the 26th IEEE International Conference on Tools with Artificial Intelligence, [8] J. Hwang, K. Ryu, XML document clustering based on sequential pattern, Journal of Korea Information Processing Society, [9] K. Wang, C. Xu, B. Liu, Clustering transactions using large items, Proc. of ACM CIKM-99, [10] Niagara Query Engine, References: [1] R. Behrens, A Grammar based model for XML schema integration, Proc. of the 17th British National Conf. on Databases, 2000, pp [2] A. Doucet, H. Ahonen-Myka, Naive clustering of a large XML document collection, Proc. 1st Annual Workshop of the Initiative for the Evaluation of XML retrieval(inex), Germany, 2002, pp [3] J. Yoon, V. Raghavan, V. Chakilam, BitCube: clustering and statistical analysis for XML documents, Proc. of the 13th Int. Conf. on ISSN: ISBN:

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group