OntoExtractor: A Fuzzy-Based Approach to Content and Structure-Based Metadata Extraction


Paolo Ceravolo, Ernesto Damiani, Marcello Leida, and Marco Viviani
Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione, via Bramante 65, 26013 Crema (CR), Italy
{ceravolo, damiani, leida, viviani}@dti.unimi.it
http://ra.crema.unimi.it/kiwi

Abstract. This paper describes OntoExtractor, a tool for extracting metadata from heterogeneous sources of information, producing a quick-and-dirty hierarchy of knowledge. The tool is specifically tailored for the quick classification of semi-structured data, which makes it well suited to web-based data sources.

1 Introduction

Typically, knowledge management techniques use metadata to specify the content, quality, type, creation, and context of a data item. A number of specialized formats for the creation of metadata exist; a typical example is the Resource Description Framework (RDF). But metadata can be stored in any format, such as free text, Extensible Markup Language (XML), or database entries. All of these formats must rely on a vocabulary, which can have different degrees of formality; if this vocabulary complies with a set of logical axioms, it is called an ontology. There are a number of well-known advantages in using information extracted from data instead of the data themselves. On the one hand, because of their small size compared to the data they describe, metadata are more easily shareable than data. Thanks to metadata sharing, information about data becomes readily available to anyone seeking it; thus, metadata make data discovery easier and reduce data duplication. On the other hand, some important drawbacks restrain the adoption of metadata formats. First of all, building a knowledge base is an onerous process: the domain analysis involves different activities that are often difficult to integrate, because they are usually performed by different professional roles.
In addition, the high cost of knowledge-base construction conflicts with important knowledge management principles. Any knowledge management activity needs to be configured for a given domain; but every domain evolves, and the knowledge bases related to it have to evolve as well. If the domain evolves rapidly, a misalignment may arise between the actual state of affairs in the domain and the knowledge base. Moreover, classical knowledge extraction technologies are not tailored to web-based data. These techniques have been widely tested with successful results, but they present some limitations. First, they need a large number of documents (typically many thousands) to work properly. Secondly, they hardly take document structure into account and are therefore unsuitable for the semi-structured document formats used on the Web.

R. Meersman, Z. Tari, P. Herrero et al. (Eds.): OTM Workshops 2006, LNCS 4278, pp. 1825-1834, 2006. © Springer-Verlag Berlin Heidelberg 2006

In this paper we present OntoExtractor, a tool supporting knowledge extraction in a web-based environment. OntoExtractor was designed to be part of a more general system aimed at managing the whole Ontology Life Cycle [5]. The classification produced as output is transformed into a standard metadata format and proposed to a community of users. Feedback from the community is collected in order to refine the classification, discarding metadata expressing irrelevant classes or misclassified documents [4]. In order to support continuous domain evolution, OntoExtractor is designed to quickly produce a preliminary classification of a knowledge base. The tool supports heterogeneous sources of information, including semi-structured data. A fuzzy representation of document vectors makes it possible to segment documents according to their structural topology, assigning a different relevance value to each segment. Another important feature of OntoExtractor is that it produces several classifications, organizing the classes of documents according to different degrees of cohesion. This allows the user to quickly discard any classification that is not coherent with his or her view of the domain.
The paper is organized as follows: Section 2 introduces the tool, Section 3 describes the format adopted for document representation, Section 4 explains the techniques used for the structural classification of documents, Section 5 explains the techniques used for content classification, and Section 6 draws conclusions.

2 OntoExtractor

OntoExtractor is a tool, developed in the context of the KIWI project 1, that extracts metadata from heterogeneous sources of information, producing a quick-and-dirty hierarchy of knowledge. The hierarchy is built bottom-up: starting from the heterogeneous document set, a clustering process groups documents into meaningful clusters. These clusters identify the backbone hierarchy of the ontology. Construction of the hierarchy is a three-step process, composed of the following phases:

1. Normalize the incoming documents into XML format [9].
2. Cluster the documents according to their structure, using a Fuzzy Bag representation of the XML tree [3] [6].

1 This work was partly funded by the Italian Ministry of Research Fund for Basic Research (FIRB) under project RBAU01CLNB 001 Knowledge Management for the Web Infrastructure (KIWI).

Fig. 1. Overview of the OntoExtractor process

3. Refine the structural clustering by analyzing the content of the documents, producing a semantic clustering of the documents.

3 Normalize the Knowledge Base

The first step of our process is choosing a common representation format for the information to be managed. Data may come from different and heterogeneous sources, including unstructured, semi-structured or structured information, such as textual documents, HTML files, XML files or records in a database. In order to reconcile these different data sources, we developed a set of wrapper applications transforming the most common document formats into a target XML representation. The wrapping process is shown in Figure 2: for semi-structured and structured sources the wrapper does not have much to do; all it has to perform is a mapping between the original data and elements of the target XML tree. Unstructured sources of information need additional processing aimed at extracting the hidden structure of the documents. This phase uses well-known text-segmentation techniques [9] in order to find relations among parts of a text. It is an iterative process that takes as input a text blob (a continuous flow of characters representing the whole content of a document) and gives as output a set of text segments identified by the text-segmentation process. The process stops when no text blob can be segmented further. At this point, a post-processing phase analyzes the resulting tree structure and generates the corresponding XML document. In the current version of the OntoExtractor software, a regular-expression matching approach is also available in order to discover regular patterns, such as section titles, in the documents, helping to control the text-segmentation process. This preliminary approach compares each row of the document with a regular expression; for instance, we used the expression [0-9]+(([.]?)([.]?[0-9]+))(\s+\w+)+ to match chapter, section and paragraph headlines, which are usually preceded by numbers separated by a dot.
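A heading-matching pass of this kind can be sketched in Python. The pattern below is only a reconstruction of the paper's expression (the hyphens of the character classes were lost in extraction), written to accept a section number such as "3", "3.2" or "3.2.1" followed by one or more words; treat it as illustrative rather than the tool's exact rule:

```python
import re

# Reconstructed heading pattern: a (possibly dotted) section number,
# then at least one word. The real OntoExtractor expression may differ.
HEADING = re.compile(r"^[0-9]+([.][0-9]+)*[.]?(\s+\w+)+\s*$")

lines = [
    "4 Clustering by Structure",
    "3.2 Normalize the Knowledge Base",
    "This is an ordinary body line.",
]
for line in lines:
    print(bool(HEADING.match(line)), line)
```

Rows that match are treated as section boundaries, which constrains where the text-segmentation step is allowed to cut.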

Fig. 2. Wrapping process

4 Clustering by Structure

The OntoExtractor tool uses a flat encoding for the internal representation of XML documents for processing and analysis purposes. Documents are represented as Fuzzy Bags, i.e. collections of elements that may contain duplicates. Since the importance of tags can differ, it is possible to assign a different weight (in the range from 0 to 1) to each tag in the document. In other words, for each element of the XML document d, the Fuzzy Bag encoding d contains a Fuzzy Element whose membership value is determined by the position of the tag in the document's structure or by other topological properties. The OntoExtractor tool currently provides two different algorithms to calculate the membership value of a Fuzzy Element:

1. Nesting: this is a lossy representation of the original document's topology, because the membership value does not keep track of which tag is the parent of the current tag, as shown in Figure 3. Given a vocabulary V = {R/1, a/0.9, b/0.8, d/0.6, e/0.4}, applying the nesting weighting function to a generic XML document, such as A.xml or B.xml, we obtain the fuzzy bag A = B = {R/1, a/0.3, a/0.225, b/0.2, d/0.3, e/0.2}. The membership value of each element is M = V_e / L, where M is the membership value, V_e is the weight of the tag in the vocabulary, and L is the nesting level of the tag, with L_root = 0.
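The nesting weighting can be sketched as follows. The (tag, level) pairs are an assumption: they are chosen so that the result reproduces the bag {R/1, a/0.3, a/0.225, b/0.2, d/0.3, e/0.2} from the paper's example, since the actual tree of Fig. 3 is not reproduced here:

```python
def nesting_membership(vocab, tag, level):
    """Nesting weighting M = V_e / L; the root (level 0) keeps its vocabulary weight."""
    return vocab[tag] if level == 0 else vocab[tag] / level

# Vocabulary V from the paper's example.
V = {"R": 1.0, "a": 0.9, "b": 0.8, "d": 0.6, "e": 0.4}

# Hypothetical (tag, nesting level) pairs consistent with the example bag;
# the real levels come from the tree in Fig. 3.
elements = [("R", 0), ("a", 3), ("a", 4), ("b", 4), ("d", 2), ("e", 2)]

bag = [(tag, round(nesting_membership(V, tag, lvl), 3)) for tag, lvl in elements]
print(bag)  # [('R', 1.0), ('a', 0.3), ('a', 0.225), ('b', 0.2), ('d', 0.3), ('e', 0.2)]
```

Note how the two occurrences of tag a get different memberships purely from their depth, while nothing records who their parents are: that is the lossiness the paper refers to.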

Fig. 3. Two generic XML documents A.xml and B.xml

2. MV: this is an experimental algorithm introduced by our group, which keeps memory of the parent tag. The membership value of each element is M = (V_e + M_p) / L, where M is the membership value, M_p is the membership value of the parent tag (with M_root = 0), V_e is the weight of the tag in the vocabulary, and L is the nesting level of the tag, with L_root = 0.

The MV membership value helps, in certain cases, to keep memory of the tree structure of the original document. Referring to Figure 3, using the same vocabulary V and applying the MV weighting function to the tree representations of the two XML documents A.xml and B.xml, we obtain A = {R/1, a/0.53, a/0.36, b/0.33, d/0.8, e/0.7} and B = {R/1, a/0.56, a/0.37, b/0.34, d/0.8, e/0.7}, which are different. Figure 4 shows the differences in processing an XML document coming from Amazon with the Nesting and MV algorithms.

In order to compare XML documents modeled as fuzzy bags, we use well-known similarity measures studied in [1] [2]. We privileged measures giving a higher similarity weight to bags whose shared elements (tags) are less nested. This is motivated by the fact that, if a tag is near the root, it seems reasonable to assume that it has a higher semantic value. In OntoExtractor the comparison between two Fuzzy Bags is computed using the Jaccard norm:

S(B1, B2) = Approx(|B1 ∩ B2| / |B1 ∪ B2|)

where B1 and B2 are the input fuzzy bags, ∩ is the intersection operator, ∪ is the union operator, | | is the cardinality operator, Approx() is the approximation operator, and S is the similarity value between B1 and B2.
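A minimal sketch of this comparison, under simplifying assumptions: bags are collapsed to tag → membership maps (duplicate tags reduced to one entry), intersection and union are taken as min and max, cardinality as the sigma-count (sum of memberships), and the Approx() step is omitted; the paper's actual operators are those of its references [3] and [6]:

```python
def fuzzy_jaccard(b1, b2):
    """Jaccard similarity of two fuzzy sets given as tag -> membership maps.

    Simplified instantiation: intersection = min, union = max,
    cardinality = sigma-count (sum of memberships).
    """
    tags = set(b1) | set(b2)
    inter = sum(min(b1.get(t, 0.0), b2.get(t, 0.0)) for t in tags)
    union = sum(max(b1.get(t, 0.0), b2.get(t, 0.0)) for t in tags)
    return inter / union if union else 1.0

# MV bags from the paper's example, keeping one membership per tag.
A = {"R": 1.0, "a": 0.53, "b": 0.33, "d": 0.8, "e": 0.7}
B = {"R": 1.0, "a": 0.56, "b": 0.34, "d": 0.8, "e": 0.7}
print(round(fuzzy_jaccard(A, B), 3))  # → 0.988
```

The two MV bags come out nearly, but not exactly, identical, which is the point: Nesting would have made them indistinguishable.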

Fig. 4. Fuzzy Bags generated by the Nesting and MV algorithms, and the XML representation of the document

For more theoretical information about this norm, and about how the union, intersection, approximation and cardinality operators are defined, please refer to [3] and [6]. Using this norm the tool can perform a partitioned clustering technique that is a hybrid between the K-means and K-NN clustering algorithms. OntoExtractor uses an alpha-cut value as a threshold for the clustering process, in order to avoid having to specify the initial number of clusters (k), thereby sidestepping some clustering problems related to the k-means algorithm. The clustering algorithm compares each document with the centroid of each cluster, considering only the highest resemblance value. If this value is greater than the given alpha, the document is inserted into that cluster; otherwise a new empty cluster is generated and the document is inserted into it. The OntoExtractor tool offers two different ways to calculate the centroid of each cluster. One method chooses the document that has the smallest representative Fuzzy Bag; in this method the centroid always corresponds to a real document. The other method generates a new Fuzzy Bag as the union of all the Fuzzy Bags in the cluster; in this case the generated Fuzzy Bag does not necessarily correspond to a real document.

5 Clustering by Content

The second clustering process that we propose is based on the content attached to leaf nodes. Content-based clustering is independent for each structural cluster

selected, so it is possible to apply different clustering criteria to each structural cluster, as shown in Figure 5. Note that users can select which clustering processes to perform; for instance, if there is no need for structural clustering, then only content-based clustering is performed.

Fig. 5. Domain class subdivision based on structure (a) and refinement based on content (b)

It is important to remember that our clustering technique works on XML documents that are somehow structured. Therefore we compute content-based similarity at the tag level, comparing the content of the same tag across different documents; we then compute content-based similarity at the document level by aggregating the tag-level similarity values.

Fig. 6. Tag-level comparison between data belonging to the same tag in different documents

Referring to Figure 6, it is necessary to choose two different functions: a function f to compare data belonging to tags with the same name in different documents:

f_a(a[data]_A; a[data]_B); f_b(b[data]_A; b[data]_B); f_c(c[data]_A; c[data]_B) (1)

and a function F to aggregate the individual f's: F(f_a, f_b, f_c). We have two possibilities for choosing the F function: F is a t-norm, i.e. a conjunction of the single values (f_a ∧ f_b ∧ f_c); or F is a t-conorm, i.e. a disjunction of the single values (f_a ∨ f_b ∨ f_c).

Fig. 7. A: comparison in case of null values. B: comparison in case of nested values.

Referring to Figure 7, it is evident that we also need to consider cases where a tag is not present in a document, and cases of documents having multiple instances of the same tag at different nesting levels. In the first case we have:

f_b(null; b[data]_B) = 0 (2)

and in the second case we evaluate the distance between the tags using the formula:

f_x = max_{p,k} [ (1 / (1 + Δ)) · f_{x_p,x_k}(x_p[data]_A; x_k[data]_B) ] (3)

Δ = |μ(x_p) − μ(x_k)| (4)

Occurrences of terms have distinct informative roles depending on the tags they belong to. So it is possible either to define a different function f for each group of data of the same tag in different documents, or to choose a function that considers the membership value μ(x_i) associated with the i-th tag. We represent the content of each tag (a[data], b[data], c[data], ... in (1)) with the well-known Vector Space Model, widely used in modern information retrieval systems. The vector space model (VSM) is an algebraic model used for information filtering and information retrieval. It represents natural-language documents in a formal manner by means of vectors in a multi-dimensional space. The vector space model usually builds a documents-terms matrix and processes it to generate the document-term vectors. Our approach is similar, but we generate one matrix for each tag in the documents; correspondingly, we generate tag-terms vectors. There are several methods to refine the tag-terms vectors, such as LSA (Latent Semantic Analysis [7]) based on SVD (Singular Value Decomposition), a well-known method of matrix reduction that adds latent semantic meaning to the vectors.
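The scheme of Eqs. (1)-(4) can be sketched as below. Everything concrete here is an assumption for illustration: the per-tag similarity is a toy word-overlap measure (not the paper's f), min/max stand in for the t-norm/t-conorm, and the 1/(1 + Δ) discount follows the reconstruction of Eq. (3) above:

```python
def compare_tag(values_a, values_b, sim):
    """Tag-level comparison f_x for multiple instances of the same tag.

    Each instance is a (data, membership) pair; every pair of instances is
    compared and the best match, discounted by 1 / (1 + |mu_p - mu_k|), is
    kept (Eqs. (3)-(4)). A tag missing from one document scores 0 (Eq. (2)).
    """
    if not values_a or not values_b:
        return 0.0
    return max(
        sim(da, db) / (1.0 + abs(ma - mb))
        for da, ma in values_a
        for db, mb in values_b
    )

def aggregate(scores, mode="t-norm"):
    """Document-level aggregation F: min as t-norm, max as t-conorm."""
    return min(scores) if mode == "t-norm" else max(scores)

# Toy per-tag similarity: word overlap between two sets of terms.
sim = lambda a, b: len(a & b) / len(a | b)

f_a = compare_tag([({"fuzzy", "bags"}, 0.9)], [({"fuzzy", "sets"}, 0.7)], sim)
f_b = compare_tag([], [({"ontology"}, 0.5)], sim)  # tag absent in one document
print(aggregate([f_a, f_b], "t-norm"), aggregate([f_a, f_b], "t-conorm"))
```

The t-norm is the strict reading (every shared tag must match), the t-conorm the lenient one (a single well-matching tag suffices), which is why the missing tag zeroes out the first aggregate but not the second.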
In OntoExtractor, generating the tag-terms vectors is a three-step process:

Generating the tag-terms matrix: for each tag in the documents, a documents-terms matrix is produced. It is important to remember that we do not consider the document as a unique text blob, but build the documents-terms matrix at the tag level. If a tag is not present in a document, a row of zeros is added to the matrix. Each entry of the matrix can be computed in several ways, by choosing one of the weighting methods implemented in the tool; at present it is possible to choose among tf-idf, tf-df, tf and raw term occurrence.

Transforming the matrix: once the matrix has been generated, we process it with some matrix transformations. We allow the user to choose between keeping the original matrix or transforming it by LSA via SVD. Suppose M is an m × n matrix whose entries come from the field K, either the real or the complex numbers. Then there exists a factorization of the form M = UΣV*, where U is an m × m unitary matrix over K, Σ is an m × n matrix with non-negative numbers on the diagonal and zeros off the diagonal, and V* denotes the conjugate transpose of V, an n × n unitary matrix over K. Such a factorization is called a singular-value decomposition of M. The matrix V contains a set of orthogonal input (or analysing) base-vector directions for M, the matrix U contains a set of orthogonal output base-vector directions for M, and Σ contains the singular values, which can be thought of as scalar gain controls by which each corresponding input is multiplied to give a corresponding output. After the matrix decomposition we generate a new m × n matrix using an r-reduction of the original SVD decomposition: M_r = U_r Σ_r V_r*.
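The rank-r reduction can be sketched with NumPy; the small matrix and the tf weighting are illustrative, not data from the paper:

```python
import numpy as np

# Toy tag-terms matrix (rows: documents, columns: terms), e.g. raw tf weights.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Thin SVD: M = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

r = 2  # keep only the r largest singular values
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print(M_r.shape)          # same shape as M, but rank <= r
print(np.round(M_r, 2))   # the two similar rows collapse toward [1.5, 1.5, 0, 0]
```

Dropping the smallest singular value merges the two near-duplicate documents into a common latent direction: the reconstruction is no longer sparse, which is exactly the "hidden semantic meaning" effect described above.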
Only the r column vectors of U and the r row vectors of V corresponding to the non-zero singular values in Σ_r are calculated. The resulting matrix is no longer sparse but densely populated with values carrying hidden semantic meaning.

Storing the vectors: each row of the matrix is stored in the associated tag of the document model as a new Fuzzy Bag, with the terms as elements and the vector entries as membership values. Tag contents are now represented by Fuzzy Bags, and we can compare them by means of different distance measures, such as the traditional cosine distance.

6 Conclusions and Further Work

In order to mitigate the synonymy and polysemy problems, the next versions of OntoExtractor will add new processors that use external ontologies to identify concepts. This approach, however, introduces other problems that have to be considered; one of these is Word Sense Disambiguation (WSD). The validity

of this tool must be evaluated within the complete system it is embedded in. Further work will provide a report on evaluations of the KIWI system.

References

1. B. Bouchon-Meunier, M. Rifqi, S. Bothorel: Towards general measures of comparison of objects. Fuzzy Sets and Systems, volume 84, pages 143-153, 1996.
2. P. Bosc, E. Damiani: Fuzzy Service Selection in a Distributed Object-Oriented Environment. IEEE Transactions on Fuzzy Systems, volume 9, no. 5, pages 682-698, 2001.
3. P. Ceravolo, M.C. Nocerino, M. Viviani: Knowledge extraction from semi-structured data based on fuzzy techniques. Knowledge-Based Intelligent Information and Engineering Systems, Proceedings of the 8th International Conference, KES 2004, Part III, pages 328-334, 2004.
4. P. Ceravolo, E. Damiani, M. Viviani: Adding a Peer-to-Peer Trust Layer to Metadata Generators. Lecture Notes in Computer Science, volume 3762, pages 809-815, 2005.
5. P. Ceravolo, A. Corallo, E. Damiani, G. Elia, M. Viviani, A. Zilli: Bottom-up extraction and maintenance of ontology-based metadata. Fuzzy Logic and the Semantic Web, Computer Intelligence, Elsevier, 2006.
6. E. Damiani, M.C. Nocerino, M. Viviani: Knowledge extraction from an XML data flow: building a taxonomy based on clustering technique. Current Issues in Data and Knowledge Engineering, Proceedings of EUROFUSE 2004: 8th Meeting of the EURO Working Group on Fuzzy Sets, pages 133-142, 2004.
7. T.K. Landauer, P.W. Foltz, D. Laham: Introduction to Latent Semantic Analysis. Discourse Processes, 25, pages 259-284, 1998.
8. G. Salton, C. Buckley: Term Weighting Approaches in Automatic Text Retrieval. Technical Report TR87-881, Cornell University, 1987.
9. G. Salton, A. Singhal, C. Buckley, M. Mitra: Automatic Text Decomposition Using Text Segments and Text Themes. Conference on Hypertext, pages 53-65, 1996.