SIMON: A Multi-strategy Classification Approach Resolving Ontology Heterogeneity - The P2P Meets the Semantic Web *

Leyun Pan, Liang Zhang, and Fanyuan Ma

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030 Shanghai, China
{pan-ly, zhangliang}@cs.sjtu.edu.cn, fyma@sjtu.edu.cn

Abstract. Semantic web technology is seen as a key to realizing peer-to-peer resource discovery and service combination in the ubiquitous communication environment. However, in a peer-to-peer environment we must face the situation where individual peers maintain their own view of the domain in terms of the organization of their local information sources. Ontology heterogeneity among individual peers is becoming an ever more important issue. In this paper, we propose a multi-strategy learning approach to resolve the problem. We describe the SIMON (Semantic Interoperation by Matching between ONtologies) system, which applies multiple classification methods to learn the matching between ontologies. We use a general statistical classification method to discover category features in data instances and use the first-order learning algorithm FOIL to exploit the semantic relations among data instances. The system then combines the prediction results of the individual methods using our matching committee rule, called the Best Outstanding Champion. Experiments show that the SIMON system achieves high accuracy on a real-world domain.

1 Introduction

Today's P2P solutions support only limited update, search and retrieval functionality, which makes current P2P systems unsuitable for knowledge sharing purposes. Metadata plays a central role in the effort to provide search techniques that go beyond string matching. Ontology-based metadata facilitates access to domain knowledge. Furthermore, it enables the construction of semantic queries [1].
Existing approaches to ontology-based information access almost always assume a setting where information providers share an ontology that is used to access the information. In a P2P setting, however, we instead face the situation where individual peers maintain their own view of the domain in terms of the organization of the local file system and other information sources. Enforcing the use of a global ontology in such an environment would mean giving up the benefits of the P2P approach mentioned above.

* Research described in this paper is supported by the Science & Technology Committee of Shanghai Municipality Key Project Grant 02DJ14045 and by the Science & Technology Committee of Shanghai Municipality Key Technologies R&D Project Grant 03dz15027.

M. Li et al. (Eds.): GCC 2003, LNCS 3033, pp. 744-751, 2004. © Springer-Verlag Berlin Heidelberg 2004

We can consider the process of addressing semantic heterogeneity as the process of ontology matching (ontology mapping) [2]. Matching processes typically involve analyzing the data instances associated with ontologies and comparing them to determine the correspondence among concepts. Given two ontologies in the same domain, we can find the most similar concept node in one ontology for each concept node in the other. However, at Internet scale, finding such mappings manually is tedious, error-prone, and clearly impractical. It cannot satisfy the need for online exchange of ontologies between two peers that are not in agreement. Hence, we must find approaches that assist the ontology matching process (semi-)automatically.

In this paper, we discuss the use of data instances associated with the ontology for addressing semantic heterogeneity. We propose the SIMON (Semantic Interoperation by Matching between ONtologies) system, which applies multiple classification methods to learn the matching between a pair of ontologies that are homogeneous and whose elements overlap significantly. Given the source ontology B and the target ontology A, for each concept node in target ontology A, we can find the most similar concept node from source ontology B. SIMON considers the ontology A and its data instances as the learning resource. All concept nodes in ontology A are the classification categories, and the relevant data instances of each concept are the labeled learning samples in a classification process. The data instances of concept nodes in ontology B are unseen samples. SIMON classifies the instances of each node in ontology B into the categories of ontology A according to the classifiers trained for A. SIMON uses multiple learning strategies, namely multiple classifiers. Each classifier exploits a different type of information, either in the data instances or in the semantic relations among these data instances.
Using an appropriate matching committee method, we can get a better result than with a single classifier.

This paper is organized as follows. In the next section, we give an overview of the ontology matching system. In Section 3, we discuss multi-strategy classification for ontology matching. Section 4 presents the experimental results obtained with our SIMON system. Section 5 reviews related work. We give the conclusion and future work in Section 6.

2 Overview of the Ontology Matching System

The ontology matching system is trained to compare two ontologies and to find the correspondence among concept nodes. An example of such a task is illustrated in Figure 1 and Figure 2, which show two ontologies of movie databases. When a software agent wants to collect information about movies, it accesses a P2P system for movies. The movie information on individual peers will be marked up using some ontology such as that of Figure 1 or Figure 2. Here the data is organized into a hierarchical structure that includes movie, person, company, awards and so on. Movies have attributes such as title, language, cast&crew, production company, genre and so on. Some classes link to each other through attributes, shown in italics in the figures. However, because each peer may use a different ontology, it is difficult for an agent that masters only one ontology to integrate all the data completely. For example, an agent may consider that Movie in AllMovie is equivalent to Movie in IMDB. In fact, however, Movie in IMDB is just an empty ontology node, and MainMovieInfo in IMDB is the most similar to Movie in AllMovie. The mismatch may also occur between MoviePerson and Person, between GenreInstance and Genre, and between Awards and Nominations and Awards.

Fig. 1. Ontology of movie database IMDB [figure not reproduced; node labels recoverable from it: IMDB, Movie, MainMovieInfo, Awards and Nominations, Company, MoviePerson, Music, GenreInstance, Actor, Director]

Fig. 2. Ontology of movie database AllMovie [figure not reproduced; node labels recoverable from it: AllMovie, Movie, Company, Person, Music, Genre, Awards, Player, Director]

SIMON uses multi-strategy learning methods including both statistical and first-order learning techniques. Each base learner exploits a certain type of information from the training instances to build matching hypotheses. We use a statistical bag-of-words approach to classify the pure text instances. Furthermore, the relations among concepts can help in learning the classifier. The system then combines the prediction results of the individual methods using our matching committee rule, called the Best Outstanding Champion, which is a weighted voting committee. In this way, we can achieve higher matching accuracy than with any single base classifier alone.
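To make the classification-based matching flow more concrete, the following is a minimal sketch, not the SIMON implementation. It trains a bag-of-words classifier on labeled instances of the target ontology, classifies the instances of one source category, and reports how that category's instances are distributed over the target categories. The classifier is a small naive Bayes with add-one smoothing, chosen here only for brevity. The function names (`train_naive_bayes`, `classify`, `match_column`) and the toy instances are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Train on (category, text) pairs taken from the target ontology."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of training instances
    vocab = set()
    for category, text in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the most probable target category for one instance (log space)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing is a stand-in assumption for this sketch.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def match_column(model, source_instances):
    """Fraction of one source category's instances assigned to each target category."""
    votes = Counter(classify(model, text) for text in source_instances)
    return {category: n / len(source_instances) for category, n in votes.items()}


# Hypothetical toy data: target ontology (IMDB-like) training instances.
target_training = [
    ("MainMovieInfo", "title plot language release minute"),
    ("MainMovieInfo", "plot cast crew release"),
    ("MoviePerson", "name biography birth filmography"),
    ("MoviePerson", "biography birth country"),
]
model = train_naive_bayes(target_training)

# Hypothetical instances of the source category Movie (AllMovie-like).
movie_instances = ["title language release", "cast crew minute", "name birth"]
column = match_column(model, movie_instances)
champion = max(column, key=column.get)
```

In this toy run, two of the three source instances fall into MainMovieInfo, so that category leads the column for the source category Movie.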
3 Multi-strategy Learning for Ontology Matching

3.1 Statistical Text Classification

One of the methods that we use for text classification is naive Bayes, a probabilistic model that ignores word order and naively assumes that the presence of each word in a document is conditionally independent of all other words in the document. Naive Bayes for text classification can be formulated as follows. Given a set of classes $C = \{c_1, \ldots, c_n\}$ and a document consisting of $k$ words, $\{w_1, \ldots, w_k\}$, we classify the document as a member of the class $c^*$ that is most probable given the words in the document:

$$c^* = \arg\max_{c} \Pr(c \mid w_1, \ldots, w_k) \qquad (1)$$

$\Pr(c \mid w_1, \ldots, w_k)$ can be transformed into a computable expression by applying Bayes' rule (Eq. 2); rewriting the expression using the product rule and dropping the denominator, since this term is constant across all classes (Eq. 3); and assuming that words are independent of each other (Eq. 4).

$$\Pr(c \mid w_1, \ldots, w_k) = \frac{\Pr(c)\,\Pr(w_1, \ldots, w_k \mid c)}{\Pr(w_1, \ldots, w_k)} \qquad (2)$$

$$\propto \Pr(c) \prod_{i=1}^{k} \Pr(w_i \mid c, w_1, \ldots, w_{i-1}) \qquad (3)$$

$$= \Pr(c) \prod_{i=1}^{k} \Pr(w_i \mid c) \qquad (4)$$

$\Pr(c)$ is estimated as the proportion of training instances that belong to $c$. The key step in implementing naive Bayes is therefore estimating the word probabilities $\Pr(w_i \mid c)$. We use Witten-Bell smoothing [3], which depends on the relationship between the number of unique words and the total number of word occurrences in the training data for the class: if most of the word occurrences are unique words, the prior is stronger; if words are often repeated, the prior is weaker.

3.2 First-Order Text Classification

As mentioned above, the data instances under an ontology are richly structured datasets, best described by a graph in which the nodes are objects and the edges are links or relations between objects. The methods for classifying data instances discussed in the previous section consider only the words in a single node of the graph. They cannot learn models that take into account features such as the pattern of connectivity around a given instance, or the words occurring in the instances of neighboring nodes. For example, we can learn a rule such as "A data instance belongs to movie if it contains the words minute and release and is linked to an instance that contains the word birth." Clearly, rules of this type, which can represent general characteristics of a graph, can be exploited to improve the predictive accuracy of the learned models.

Rules of this kind can be concisely represented using a first-order representation. We can learn to classify text instances using a learner that is able to induce first-order rules. The learning algorithm that we use in our system is Quinlan's FOIL algorithm [4].
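As an illustration of the kind of rule quoted above, the sketch below evaluates that single rule by hand on a tiny instance graph. It only checks an already written rule and does not learn anything. The instance identifiers, word sets, and the helper names `has_word` and `is_movie` are hypothetical.

```python
def has_word(instances, instance_id, word):
    """True if the given word occurs in the text of the instance."""
    return word in instances[instance_id]


def is_movie(instances, links, x):
    """Hand-coded form of the example rule:
    movie(X) :- has_minute(X), has_release(X), linkto(X, Y), has_birth(Y).
    """
    if not (has_word(instances, x, "minute") and has_word(instances, x, "release")):
        return False
    # The rule also needs at least one linked neighbor Y that contains "birth".
    return any(has_word(instances, y, "birth") for (src, y) in links if src == x)


# Hypothetical toy graph: instance id -> set of words occurring in it.
instances = {
    "m1": {"minute", "release", "drama"},
    "m2": {"minute", "release"},
    "p1": {"birth", "biography"},
    "c1": {"address", "createdyear"},
}
# Directed links between instances (for example, a movie linking to a cast member).
links = {("m1", "p1"), ("m2", "c1")}

results = {i: is_movie(instances, links, i) for i in instances}
```

Here only `m1` satisfies the rule: `m2` contains both words but links to an instance without the word birth, which is exactly the neighborhood information a single-node classifier cannot see.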
FOIL is a greedy covering algorithm for learning function-free Horn clause definitions of a relation in terms of itself and other relations. FOIL induces each Horn clause by beginning with an empty tail and using a hill-climbing search to add literals to the tail until the clause covers only positive instances. When FOIL is used as a classification method, the input file for learning a category consists of the following relations:

1. category(instance): This is the target relation that will be learned from the other background relations. Each learned target relation represents a classification rule for a category.
2. has_word(instance): This set of relations indicates which words occur in which instances. There is one has_word relation per word; the samples belonging to a specific has_word relation are the instances in which that word occurs.
3. linkto(instance, instance): This relation represents the semantic relations between two data instances.
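The three relations listed above can be assembled from labeled instances along the following lines. This is a sketch under stated assumptions: it builds an in-memory representation only and does not reproduce FOIL's actual input file syntax, and the function name `build_relations` and the toy data are hypothetical.

```python
from collections import defaultdict


def build_relations(instances, links, category):
    """Build the background and target relations for one category.

    instances: dict of instance id -> (category label, text)
    links: iterable of (instance id, instance id) pairs
    """
    # Target relation: the instances labeled with the category being learned.
    target = sorted(i for i, (label, _) in instances.items() if label == category)

    # One has_<word> relation per word: the instances in which the word occurs.
    has_word = defaultdict(set)
    for i, (_, text) in instances.items():
        for word in set(text.lower().split()):
            has_word[word].add(i)

    return {
        "category": target,
        "has_word": {word: sorted(ids) for word, ids in has_word.items()},
        "linkto": sorted(set(links)),
    }


# Hypothetical toy data.
instances = {
    "m1": ("Movie", "release minute drama"),
    "p1": ("Person", "birth biography"),
}
links = [("m1", "p1")]
relations = build_relations(instances, links, "Movie")
```

A learner would then search for clauses over `has_word` and `linkto` that cover the tuples in `category` and exclude the remaining instances.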

We apply FOIL to learn a separate set of clauses for every concept node in the ontology. When classifying the other ontology's data instances, if an instance does not match any clause of any category, we treat it as an instance of the "other" category.

3.3 Evaluation of Classifiers for Matching and Matching Committees

The method of committees (a.k.a. ensembles) is based on the idea that, given a task that requires expert knowledge to perform, k experts may be better than one if their individual judgments are appropriately combined [7]. For obtaining the matching result, there are two different matching committee methods, according to whether a classifier committee is used:

microcommittees: The system first uses a classifier committee. The classifier committee negotiates the category of each unseen data instance. The system then makes the matching decision on the basis of a single classification result.

macrocommittees: The system does not use a classifier committee. Each classifier individually decides the category of each unseen data instance. The system then negotiates the matching on the basis of multiple classification results.

To optimize the result of the combination, we generally wish to give each committee member a weight reflecting its expected relative effectiveness. There are some differences between the evaluation of text classification and that of ontology matching. In text classification, the initial corpus can easily be split into two sets: a training(-and-validation) set and a test set. However, the boundary among the training set, the test set and the unseen data instance set in the ontology matching process is not obvious. First, the test set is absent in the ontology matching process, in which the instances of the target ontology are regarded as the training set and the instances of the source ontology are regarded as unseen samples.
Second, the unseen data instances are not completely unseen, because the instances of the source ontology all have labels; we just do not know what each label means. Because of the absence of a test set, it is difficult to evaluate the classifiers in microcommittees. Microcommittees can only rely on prior experience and set the classifier weights manually, as was done in [2].

We adopt macrocommittees in our ontology matching system. Note that the instances of the source ontology are only relatively unseen. When these instances are classified, the unit is not a single instance but a category, so we can observe the distribution of a category of instances. Each classifier will find a champion, the category of the target ontology that gains the maximal similarity degree. Among these champions, some may have an obvious predominance, while others may lead the other nodes only slightly. Generally, the more outstanding a champion is, the more we believe it. Thus we can adopt the degree of outstandingness of a candidate as the evaluation of the effectiveness of each classifier. The degree of outstandingness can be observed from the classification results and need not be adjusted and optimized on a validation set.

We propose a matching committee rule called the Best Outstanding Champion, which means that the system chooses the final champion with the maximal accumulated degree of outstandingness among the champion candidates. The method can be regarded as a weighted voting committee. Each classifier casts a vote for the most similar node according to its judgment. However, each vote has a different weight, measured by the degree of its champion's outstandingness. We define the degree of outstandingness as the ratio of the champion's similarity degree to that of the second-ranked node.
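The committee rule described above can be sketched as follows, assuming each classifier reports, for one source category, a similarity degree per target category (for instance, the fraction of the source category's instances assigned to it). The function name `best_outstanding_champion`, the handling of a zero runner-up, and all numbers below are illustrative assumptions rather than values from the experiments.

```python
def best_outstanding_champion(classifier_results):
    """Combine per-classifier results for one source category.

    classifier_results: list of dicts, target category -> similarity degree.
    Returns (final champion, accumulated outstandingness per candidate).
    """
    accumulated = {}
    for scores in classifier_results:
        if not scores:
            continue  # a classifier with no result casts no vote
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        champion, top = ranked[0]
        second = ranked[1][1] if len(ranked) > 1 else 0.0
        # Degree of outstandingness: champion's score over the runner-up's score.
        # Assumption: a zero runner-up is treated as infinitely outstanding.
        weight = top / second if second > 0 else float("inf")
        accumulated[champion] = accumulated.get(champion, 0.0) + weight
    return max(accumulated, key=accumulated.get), accumulated


# Hypothetical results for the source category Player.
statistical = {"Director": 0.36, "MoviePerson": 0.34, "Actor": 0.30}
first_order = {"Actor": 0.60, "Director": 0.20, "MoviePerson": 0.20}

winner, weights = best_outstanding_champion([statistical, first_order])
```

In this example the statistical classifier's champion, Director, leads only narrowly (a ratio of about 1.06), while the first-order classifier's champion, Actor, leads by a ratio of about 3, so the committee selects Actor.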

4 Experiments

We take movies as our experimental domain. We choose as our experimental objects the first three movie websites ranked in the Google directory Arts > Movies > Databases: IMDB, AllMovie and Rotten Tomatoes. We manually match the three ontologies to each other to measure the matching accuracy, defined as the percentage of the manual mappings that the machine predicted correctly. We found about 150 movies on each website. Then we exchanged the keywords and found a further 300 movies. Thus each ontology holds about 400 movie data instances after removing repetitions.

We use a three-fold cross-matching methodology to evaluate our algorithms. We conduct three runs, in each of which we perform two experiments that map a pair of ontologies to each other. In each experiment, we train classifiers using the data instances of the target ontology and classify the data instances of the source ontology to find the matching pairs from source ontology to target ontology.

Table 1. Result matrices of the statistical classifier and the First-Order classifier

Table 1 shows the classification result matrices for some of the categories in the AllMovie-IMDB experiment, for the statistical classifier and the First-Order classifier respectively (the numbers in parentheses are the results of the First-Order classifier). Each column of the matrix represents one category of the source ontology AllMovie and shows how the instances of this category are classified into the categories of the target ontology IMDB. Boldface indicates the leading candidate in each column. These matrices illustrate several interesting results. First, note that for most classes the coverage of the champion is high enough for a matching judgment. For example, 63% of the Movie column in the statistical classifier and 56% of the Player column in the First-Order classifier are correctly classified.
Second, there are notable exceptions to this trend: Player and Director in the statistical classifier, and Movie and Person in the First-Order classifier. The results of the Player column in the statistical classifier would lead to a wrong matching decision, in which Player in AllMovie is matched not to Actor but to Director in IMDB. In the other columns, the first and second candidates are so close that we cannot fully trust the matching results based on these classification results. The low classification coverage of the champion for Player and Director is explained by a characteristic of the categories: the two categories lack distinguishing feature properties.

For this reason, many of the instances of the two categories are classified into many other categories. However, our First-Order classifier can repair this shortcoming. By mining the information in neighboring instances (awards and nominations), we can learn rules for the two categories and classify most instances into the proper categories, because a Player often wins best actor awards and vice versa.

The neighboring instances do not always provide correct evidence for classification. The Movie column and the Person column in Table 1 belong to this situation. Because many data instances of these two categories link to each other, the effectiveness of the learned rules decreases. Fortunately, in the statistical classifier, the classification results for these two categories are ideal. By using our matching committee rule, we can easily integrate the preferable classification results of both classifiers. After calculating and comparing the degrees of outstandingness, we place more trust in the matching results for Movie and Person from the statistical classifier and for Player and Director from the First-Order classifier.

Fig. 3. Ontology matching accuracy [bar chart not reproduced; vertical axis from 0 to 100; bars for the statistical learner, the First-Order learner, and the matching committee; groups: AllMovie to IMDB, IMDB to AllMovie, RT to IMDB, IMDB to RT, RT to AllMovie, AllMovie to RT]

Figure 3 shows three runs and six groups of experimental results. We match two ontologies to each other in each run, with only a small difference between the two experimental results. The three bars in each experiment represent the matching accuracy produced by: (1) the statistical learner alone, (2) the First-Order learner alone, and (3) the matching committee using the previous two learners.

5 Related Work

From the perspective of ontology matching using data instances, some works are related to our system.
In [2], some strategies classify the data instances, and another strategy, the Relaxation Labeler, searches for the mapping configuration that best satisfies the given domain constraints and heuristic knowledge. In contrast, automated text classification is the core of our system. We focus on fully mining the data instances for automated classification and ontology matching. By constructing the classification samples according to the feature property set and exploiting the classification features in or among data instances, we can make the fullest use of text classification methods.

Furthermore, as regards the combination of multiple learning strategies, [2] uses microcommittees and manually evaluates the classifier weights. In our system, by contrast, we adopt the degree of outstandingness as the weights of the classifiers, which can be computed from the classification results. Without using any domain or heuristic knowledge, our system can automatically achieve matching accuracy similar to that in [2]. [5] also compares ontologies using similarity measures, but computes the similarity between lexical entries. [6] describes the use of the FOIL algorithm in classification and extraction for constructing knowledge bases from the web.

6 Conclusions

The completely distributed nature and the high degree of autonomy of individual peers in a P2P system bring new challenges for the use of semantic descriptions. We propose a multi-strategy learning approach for resolving ontology heterogeneity in P2P systems. In this paper, we introduce the SIMON system and describe its key techniques. We take movies as our experimental domain and extract the ontologies and the data instances from three different movie database websites. We use a general statistical classification method to discover category features in data instances and use the first-order learning algorithm FOIL to exploit the semantic relations among data instances. The system combines their outcomes using our matching committee rule, called the Best Outstanding Champion. A series of experimental results shows that our approach can achieve higher accuracy on a real-world domain.

References

1. J. Broekstra, M. Ehrig, P. Haase. A Metadata Model for Semantics-Based Peer-to-Peer Systems. In Proceedings of SemPGRID '03, 1st Workshop on Semantics in Peer-to-Peer and Grid Computing.
2. A. Doan, J. Madhavan, P. Domingos, A. Halevy. Learning to Map between Ontologies on the Semantic Web. In Proceedings of the World Wide Web Conference (WWW-2002).
3. I. H. Witten, T. C. Bell. The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Text Compression. IEEE Transactions on Information Theory, 37(4), July 1991.
4. J. R. Quinlan, R. M. Cameron-Jones. FOIL: A Midterm Report. In Proceedings of the European Conference on Machine Learning, pages 3-20, Vienna, Austria, 1993.
5. A. Maedche, S. Staab. Comparing Ontologies - Similarity Measures and a Comparison Study. Internal Report No. 408, Institute AIFB, University of Karlsruhe, March 2001.
6. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, Elsevier, 1999.
7. F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, March 2002.