Mining Relational Databases with Multi-view Learning

Hongyu Guo
School of Information Technology and Engineering, University of Ottawa
800 King Edward Road, Ottawa, Ontario, Canada, K1N 6N5

Herna L. Viktor
School of Information Technology and Engineering, University of Ottawa
800 King Edward Road, Ottawa, Ontario, Canada, K1N 6N5

ABSTRACT
Most of today's structured data resides in relational databases, where multiple relations are formed by foreign key joins. In recent years, the field of data mining has played a key role in helping humans analyze and explore large databases. Unfortunately, most methods only utilize flat data representations. Thus, to apply these single-table data mining techniques, we are forced to incur a computational penalty by first converting the data into this flat form. As a result of this transformation, the data not only loses its compact representation, but the semantic information present in the relations is reduced or eliminated. In this paper, we describe a classification approach which addresses this issue by operating directly on relational databases. The approach, called MVC (Multi-View Classification), is based on a multi-view learning framework. In this framework, the target concept is represented in different views and then independently learned using single-table data mining techniques. After constructing multiple classifiers for the target concept in each view, the learners are validated and combined by a meta-learning algorithm. Two methods are employed in the MVC approach, namely (1) target concept propagation and (2) multi-view learning. The propagation method constructs training sets directly from relational databases for use by the multi-view learners. The learning method employs traditional single-table mining techniques to mine data straight from a multi-relational database. Our experiments on benchmark real-world databases show that the MVC method achieves promising results in terms of overall accuracy and run time, when compared with the FOIL and CrossMine learning methods.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

Keywords
Data Mining, Multi-view Classification, Relational Database, Multi-relational Data Mining, Aggregation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MRDM'05, August 21, 2005, Chicago, Illinois, USA. Copyright 2005 ACM.

1. INTRODUCTION
Most real-world data storage takes advantage of the structure of relational databases, and thus information is held in relations specified by foreign key joins. We are therefore faced with the challenge of mining data represented in this type of structure. Recent advances in data mining have played a key role in helping humans analyze and explore large databases. However, most data mining methods work only with flat data representations. These methods include classification algorithms such as C4.5 [22], neural networks [16], and support vector machines [4]. It usually takes considerable time and effort to convert the relations present into the required flat format. Many methods, such as LINUS [15], RELAGGS [13], and Polka [14], have been developed to perform this translation.
The purpose of these so-called propositionalization methods is to bridge the gap between relational databases and traditional data mining approaches. A drawback of these approaches is that multiple relations lose their compact representation and normalized structure when they are converted into flat form. The process also leads to a loss of information and creates a large amount of redundant data. A promising new approach, multi-view learning, has also received significant attention in the current literature [2, 10, 17, 21, 26]. The framework has been applied successfully to many real-world applications such as information extraction [2], text classification [10], and face recognition [26]. We argue that this approach offers a technique which can be used to learn concepts in relational databases. In the multi-view setting, each view can be expressed in terms of a set of disjoint features of the training data [17]. The target concept is learned on each of these separate views using the features present. The results from the views are then combined, each contributing to the learning task. A simple example of multi-view learning is as follows. Consider an algorithm that will learn to classify emails. Each of the different learners uses either the subject or the content of the emails to construct its model. The concepts learned from each view can then be combined to perform the final classification.
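To make the email example concrete, consider the following sketch of a two-view classifier. This is a minimal illustration only (scikit-learn models and invented toy messages, not part of the MVC implementation), and it combines the views by simply averaging their class-probability estimates, rather than by the meta-learning used later in this paper:

# Two disjoint views of the same emails: subject words and body words.
# Toy data and scikit-learn models are illustrative stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

subjects = ["cheap meds now", "meeting agenda", "win a prize", "project update"]
bodies = ["buy pills cheap", "see attached agenda", "click to claim prize", "status of the project"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

views = []
for texts in (subjects, bodies):  # one learner per disjoint feature set
    vectorizer = CountVectorizer().fit(texts)
    learner = MultinomialNB().fit(vectorizer.transform(texts), labels)
    views.append((vectorizer, learner))

def classify(subject, body):
    # Average the per-view class probabilities and pick the best class.
    probs = [learner.predict_proba(vec.transform([text]))[0]
             for (vec, learner), text in zip(views, (subject, body))]
    averaged = [(p + q) / 2.0 for p, q in zip(*probs)]
    return max(range(len(averaged)), key=averaged.__getitem__)

print(classify("win cheap pills", "claim your prize now"))  # expected: 1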

We argue that a multi-view learning problem with n views can be seen as a problem over n inter-dependent relations, and that multi-view learning is thus applicable to multi-relational learning. Multi-relational data mining involves learning patterns from multiple relations [8]. Each relation usually has various attributes (information) contributing to the target concepts to be learned. As another example, consider the loan problem in the PKDD'99 discovery challenge [5], where the banking database consists of eight relations. Each relation describes different characteristics of a client. For example, the Client relation contains a customer's age, the Account relation identifies a customer's banking account information, and the Card relation refers to the customer's credit card details. In other words, each relation from this database provides a different type of information, or view, of the concept to be learned, i.e. whether the customer is good or not. This problem is therefore a perfect candidate for multi-view learning.

The objective of this paper is to shed light on the role of multi-view learning in relational databases. In this paper, we propose the MVC (Multi-View Classification) algorithm. The MVC approach employs the multi-view learning framework to operate directly on multi-relational databases with conventional data mining methods. The approach works as follows. Firstly, the tuple IDs (identifiers for each tuple in the relation) and the target concepts from the target relation (a relation in which each tuple is associated with a class label) are propagated to the other relations (background relations), based on the foreign links (key references) between them. Secondly, aggregation functions are applied to each background relation (to which the tuple IDs and target concepts were propagated) in order to handle the one-to-many relationships. Each background relation is then used as training data for a particular multi-view learner. Thirdly, conventional single-table data mining methods are used in order to learn the target concept from each view of the data separately. Lastly, the trained multi-view learners are validated and incorporated into a meta-learner to construct the final model. Since the MVC algorithm is based on the multi-view learning framework, it is able to use any conventional method to mine data from relational databases.

The paper is organized as follows. Section 2 introduces related work. Section 3 presents the concept of multi-relational data mining and the multi-view learning framework. Next, a detailed discussion of the MVC algorithm is provided in Section 4. In Section 5, we describe and discuss the comparative evaluation performed on the MVC algorithm. Finally, Section 6 concludes this paper and outlines our future work.

2. RELATED WORK
Over the past few years, three categories of approaches in relational data mining have been actively investigated. Besides propositionalization methods, another popular approach to relational data mining is to use Inductive Logic Programming (ILP) based methods [7], such as FOIL [23], TILDE [1], Golem [18], and Progol [19]. Traditional ILP methods learn concepts from relational data within a logic programming framework. The third category of approaches, SRL (statistical relational learning) models, has also attracted significant attention in the current literature. One of the best-known SRL systems is the PRM (probabilistic relational model) method [12], which extends Bayesian networks to deal with relational data.

Learning from multiple views is a new problem setting. This non-standard approach has drawn significant attention in the current literature through several interesting applications [2, 10, 26]. From a general point of view, multi-view learning is closely related to ensemble learning approaches. However, from the authors' point of view, popular ensemble methods such as boosting, bagging and stacking focus on constructing different hypotheses from subsets of the learning instances.
In contrast, an important characteristic of multi-view learning is that this approach is more interested in learning independent models from disjoint features of the training data. That is, multi-view learning learns separate models from various disjoint-feature-based aspects of the data, leading to independent views on the target concept. The next section introduces multi-relational data mining within the framework of multi-view learning.

3. Multi-Relational Data Mining
In this section, we describe the concept of relational classification. Also, this section explains how the multi-view learning framework can be used to directly mine concepts from multi-relational databases.

3.1 Relational Databases
A schema for a relational database describes a set of tables DB = {T1, T2, ..., Tn} and a set of relationships between pairs of tables. Each row in a table represents a tuple. Each table has at least one primary key attribute, denoted by T.key, which has a unique value for each tuple in the table. The other attributes are either descriptive attributes or foreign key attributes. Foreign key attributes link to attributes of other tables (foreign links or key references). A foreign key attribute of table T2 referencing table T1 is denoted as T2.T1_key. Key attributes in a database include foreign key attributes and primary key attributes. Figure 1 gives a simple example of a relational database schema with eight tables. All relationships among tables are represented with arrows. The arrows show the foreign links, which relate the key attributes of one table to the corresponding key attributes of another table. For example, the Account table has a foreign key attribute account-id which is linked to the key attribute account-id in the Loan table.

Figure 1. A simple database from PKDD'99 [5]
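The notation above is easy to capture programmatically. As a small illustration (ours, not code from the paper; the exact attribute names are assumptions based on Figure 1), a fragment of the schema can be recorded as plain data, one entry per foreign link:

# A fragment of the Figure 1 schema, recorded as plain data (illustrative only).
# Each foreign link relates a key attribute of one table to the corresponding
# key attribute of another, i.e. T2.T1_key in the notation of Section 3.1.
foreign_links = [
    ("Account.account-id", "Loan.account-id"),
    ("Order.account-id", "Loan.account-id"),
    ("Disposition.account-id", "Loan.account-id"),
    ("Transaction.account-id", "Loan.account-id"),
    ("Card.disp-id", "Disposition.disp-id"),
    ("Client.client-id", "Disposition.client-id"),
]

def directly_linked(table, target="Loan"):
    # True when the table holds a key attribute referencing the target table.
    return any(src.startswith(table + ".") and dst.startswith(target + ".")
               for src, dst in foreign_links)

print(directly_linked("Order"), directly_linked("Client"))  # True False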

3.2 Relational Classification
The objective of this paper is to undertake relational classification tasks. Suppose we have a database DB and a target relation T_target which includes a target variable Y. The relational classification task is to find a function f(x) which maps each tuple of the target table T_target to the category Y. The database DB can present various complexities in terms of the number of tables, the number of relationships between tables, and the number of attributes in the tables. In the above scenario, the target table is considered the target relation while all others are background relations. In the example database (Figure 1), the Loan table is the target table and the target variable is the status attribute. This target variable indicates whether a tuple in the target table is good or bad. All other tables, i.e. the tables Account, Order, Transaction, Card, Disposition, Client, and District, are background relations. All background relations are associated with the target relation by foreign key chains. We define a directed foreign key chain as the relationship where a foreign link goes from a background relation directly to the target relation. Otherwise, a relationship is defined as an undirected foreign key chain. For example, in Figure 1, the relations Order, Transaction, Account and Disposition have directed foreign key chains because they are directly linked to the target relation Loan by foreign keys. The relations Card, District, and Client are held in undirected foreign key chains since they do not refer to the target relation directly.

3.3 Multiple View Learning in Relational Databases
In a multi-view learning setting, each learner is assigned a set of training data, from which each should be able to learn the target concept. To apply this setting to relational classification tasks, consider a target concept, represented by a variable Y that resides in the target table T_target. The multi-view classification algorithm has to first correctly propagate the target variable Y to all other background relations, i.e. the other tables in the database. The MVC approach uses foreign key chains to implement this propagation process. The tuple IDs from the target table, which will be used by the aggregation functions, must also be propagated to the background tables. This procedure is formally defined as follows.

Definition 1 (directed foreign key chain propagation): Given a database DB = {T_target, T1, T2, ..., Tn}, where T_target is the target table with primary key T_target.key and target concept Y, consider a background table Ti, where i is in {1, 2, ..., n}, with an attribute set A(Ti) and a foreign key Ti.T_target_key that links to the target table T_target. If a tuple in table Ti has a key value referencing a tuple in the target table through the foreign key Ti.T_target_key, we say that this tuple is joinable. All joinable tuples in table Ti are selected to form a new relation Ti_new. The attributes of table Ti_new are described by the expression A(Ti_new) = T_target.key + Y + A(Ti) - Akey(Ti), where Akey(Ti) are the key attributes of table Ti.

Figure 2. A directed foreign key chain

Let us use the sample database in Figure 1 to demonstrate these two definitions. The target table Loan has attributes loan-id, account-id, date, amount, duration, payment, and status. Here, the tuple ID is the attribute loan-id and the target concept is from the attribute status. For the background table Order, training data from this view will be created. The background relation Order has a reference key linking to the target relation Loan, as shown in Figure 2. By Definition 1, we know that this relation follows the directed foreign key chain propagation scenario. As such, the training data for the view Order consists of all tuples which have an account-id linking to the account-id in the target table. Each of these tuples has a loan-id, i.e. a tuple ID, and a status from the target table, along with the attributes to-bank, to-account, amount, and type from the table Order. The above construction procedure can be converted into the following SQL query for a MySQL database:

CREATE TABLE order_new
SELECT L.loan-id, L.status, A.to-bank, A.to-account, A.amount, A.type
FROM Loan L, Order A
WHERE L.account-id = A.account-id;

Definition 2 (undirected foreign key chain propagation): Given a database DB = {T_target, T1, T2, ..., Tn}, where T_target is the target table with primary key T_target.key and target concept Y, consider a background table Ti, where i is in {1, 2, ..., n}, with a foreign key Ti.T_target_key referencing the target table T_target. Next consider a background table Tk, where k differs from i and k is in {1, 2, ..., n}, with attribute set A(Tk) and a foreign key Tk.Ti_key that links to table Ti. If a tuple in table Tk has a key value linking to a tuple in table Ti, and this tuple in turn has a link to a tuple in the target table T_target, we say that the tuple from Tk is joinable to the target table T_target. All joinable tuples in table Tk are selected to form a new relation Tk_new. We define the set of attributes for table Tk_new as A(Tk_new) = T_target.key + Y + A(Tk) - Akey(Tk), where Akey(Tk) are the key attributes of table Tk. Note that we here only describe an undirected foreign key chain with two background relations involved. This definition can, however, easily be extended to foreign key chains with more than two background relations.

Figure 3. An undirected foreign key chain
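Both propagation definitions can be reproduced end to end with ordinary SQL. The sketch below demonstrates the directed case of Definition 1; it is our illustration only: SQLite stands in for MySQL, the table Order is renamed Orders (ORDER is a reserved word), hyphenated column names become underscored, and the rows are invented for demonstration:

# Directed foreign key chain propagation (Definition 1) on a toy fragment.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Loan (loan_id INTEGER PRIMARY KEY, account_id INTEGER, status TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, account_id INTEGER,
                     to_bank TEXT, to_account TEXT, amount REAL, type TEXT);
INSERT INTO Loan VALUES (1, 10, 'GOOD'), (2, 20, 'BAD');
INSERT INTO Orders VALUES
  (100, 10, 'AB', 'X1', 50.0, 'SIPO'),
  (101, 10, 'CD', 'X2', 75.0, 'UVER'),
  (102, 30, 'EF', 'X3', 20.0, 'SIPO');
-- account 30 above references no loan, so its tuple is not joinable
""")

# The tuple ID (loan_id) and class label (status) are propagated to the
# background relation; only joinable tuples survive the join.
con.execute("""
CREATE TABLE order_new AS
SELECT L.loan_id, L.status, O.to_bank, O.to_account, O.amount, O.type
FROM Loan L JOIN Orders O ON L.account_id = O.account_id
""")
for row in con.execute("SELECT * FROM order_new"):
    print(row)  # two rows, both carrying loan_id 1 and status GOOD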

Consider now the situation presented in Figure 3, which is taken from Figure 1. For the background table Client, the training data from this view is created as follows. First we notice that the relation Client has no directed foreign key referencing the target relation Loan. However, it does have a reference key linked to the relation Disposition, which in turn has a foreign key referencing the target relation Loan. By Definition 2, this relation follows the undirected foreign key chain propagation scenario. Following this definition, the tuples from table Client which have a client-id linked to the client-id attribute in table Disposition, and the tuples from table Disposition with an account-id linked to the target table, are selected to construct the training data. Since the table Client has attributes client-id, birthday, gender, and district-id, the final training data consists of four attributes, namely birthday and gender from the relation Client and loan-id and status from the target relation Loan. The above construction procedure can also be converted into the following SQL query for a MySQL database:

CREATE TABLE client_new
SELECT L.loan-id, L.status, C.birthdate, C.gender
FROM Loan L, Disposition D, Client C
WHERE L.account-id = D.account-id AND D.client-id = C.client-id;

Aggregation functions are then applied to the newly generated tables Ti_new to construct the training data sets for the multi-view learners. This aggregation procedure is described next, in Section 3.4.

3.4 Aggregation-based Feature Generation
Relational data mining approaches have to handle the difficulties inherent with one-to-many and many-to-many associations between relations [8, 27]. The relations Ti_new obtained in Section 3.3 contain one-to-many relationships with respect to the primary key value T_target.key (i.e. the tuple ID) as propagated from the target relation T_target. This kind of indeterminate relationship has a significant impact on predictive performance and has been studied by several researchers [13, 14, 20]. To address this problem, the MVC method applies aggregation functions taken from the field of relational databases. Aggregation has been widely used to summarize the properties of a set of related objects. In the MVC algorithm, the aggregation functions squeeze related tuples into a single tuple using the primary key from the target relation. Each new table Ti_new is aggregated to form a multi-view training set Ti_view. Each Ti_view is then used to train a separate multi-view learner. For nominal and binary attributes, the MVC method applies the aggregation function COUNT. For numeric attributes, the functions SUM, AVG, MIN, MAX, STDDEV, and COUNT are employed. The aggregation-based feature generation procedure is formally defined as follows. Consider a relation Ti with tuple ID attribute T_target.key. For each attribute Ai in Ti, except the attribute T_target.key, new attributes are generated to replace the attribute Ai based on the following constraints.

1. Nominal Attribute. If Ai is a nominal attribute with n possible different values (n > 2), the aggregation-based feature generation procedure produces a new attribute Ψ(Ai), in which the total number of occurrences of values of attribute Ai is stored. The value is calculated using the COUNT() function in the database.

2. Binary Attribute. If Ai is a binary attribute, in which only two different values are possible, the aggregation procedure creates two new attributes Ψ1(Ai) and Ψ2(Ai). The attribute Ψ1(Ai) stores the number of occurrences of one of the values of Ai, and Ψ2(Ai) stores the number of occurrences of the other. Each value is calculated by counting occurrences in the database.

3. Numeric Attribute. If Ai is a numeric attribute with a set of values m, six new attributes Ψ1-6(Ai) are generated. In these attributes, the values of SUM(m), AVG(m), MIN(m), MAX(m), STDDEV(m), and COUNT(m) are stored, respectively.

In order to achieve high efficiency, the steps described in Section 3.3 and Section 3.4 are performed simultaneously. Let us go back to the two examples we discussed earlier, namely the order_new and client_new examples, and consider the combination of the propagation and aggregation methods. The table order_new has nominal attributes to-bank, to-account and type, and a numeric attribute amount. Three aggregation attributes are generated to replace the nominal attributes to-bank, to-account and type, respectively: COUNT(to-bank), COUNT(to-account) and COUNT(type). Six new aggregation attributes are created to replace the numeric attribute amount: SUM(amount), AVG(amount), MIN(amount), MAX(amount), STDDEV(amount), and COUNT(amount). Combining this procedure with the propagation procedure of Section 3.3 results in the following SQL query:

CREATE TABLE order_view
SELECT L.loan-id, L.status, COUNT(A.to-bank), COUNT(A.to-account),
  SUM(A.amount), AVG(A.amount), MIN(A.amount), MAX(A.amount),
  STDDEV(A.amount), COUNT(A.amount), COUNT(A.type)
FROM Loan L, Order A
WHERE L.account-id = A.account-id
GROUP BY L.loan-id;

In the case of the client_new table, we have a nominal attribute birthday and a binary attribute gender (male or female). So one aggregation attribute, COUNT(birthday), is produced to replace the birthday attribute, and two new attributes, gender_male and gender_female, are created to replace the gender attribute. This procedure can also be converted into the following SQL query:

CREATE TABLE client_view
SELECT L.loan-id, L.status, COUNT(C.birthdate),
  SUM(IF(C.gender = 'male', 1, 0)) AS gender_male,
  SUM(IF(C.gender = 'female', 1, 0)) AS gender_female
FROM Loan L, Disposition D, Client C
WHERE L.account-id = D.account-id AND D.client-id = C.client-id
GROUP BY L.loan-id;
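Because the three rules above are purely type-driven, the aggregation part of such queries can be generated mechanically from an attribute catalogue. The following sketch is our illustration (the catalogue entries are assumptions based on Figure 1, and STDDEV is the MySQL spelling of the standard deviation function); it emits the SELECT clauses for the order view:

# Generate aggregation clauses from the per-type rules of Section 3.4.
NUMERIC_FUNCS = ["SUM", "AVG", "MIN", "MAX", "STDDEV", "COUNT"]

def aggregation_clauses(attributes):
    """attributes: list of (name, kind, values), kind in {nominal, binary, numeric}."""
    clauses = []
    for name, kind, values in attributes:
        if kind == "numeric":      # rule 3: six aggregates per numeric attribute
            clauses += ["%s(%s)" % (func, name) for func in NUMERIC_FUNCS]
        elif kind == "binary":     # rule 2: one occurrence counter per value
            clauses += ["SUM(IF(%s = '%s', 1, 0)) AS %s_%s" % (name, v, name, v)
                        for v in values]
        else:                      # rule 1: a single COUNT for nominal attributes
            clauses.append("COUNT(%s)" % name)
    return clauses

order_attributes = [("to_bank", "nominal", None), ("to_account", "nominal", None),
                    ("amount", "numeric", None), ("type", "nominal", None)]
print(",\n".join(aggregation_clauses(order_attributes)))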

Note that the construction and aggregation procedures are performed in the relational database management system (RDBMS). With the optimized query-processing techniques implemented in the RDBMS, these procedures are highly efficient. Another benefit is that these procedures can easily be adapted for other database management systems such as Oracle and DB2.

4. MVC ALGORITHM
In this section, we present the MVC algorithm. This approach, which is based on the multi-view learning framework, is able to operate directly on multi-relational databases. It also allows any traditional single-table learning algorithm developed for flat data representations to be used in the multi-view learning setting. The MVC algorithm, as shown in Figure 4, consists of the following three stages. In the first stage, multi-view training data sets are constructed straight from the relational database. In this step, tuple IDs and class labels from the target table are propagated to all background relations through key references. Also as part of this stage, aggregation functions are applied to each background relation to handle the one-to-many associations. In the second stage, multi-view learners are initiated to learn the target concept in each of the constructed training data sets. In this process, conventional single-table learning algorithms are applied. In the third stage of the algorithm, the trained multi-view learners from the second stage are validated and then used in a meta-learning algorithm to construct the final classification model. The pseudo-code of the MVC algorithm is shown below and discussed in detail in the following four sections.

MVC ALGORITHM
Input:
  A relational database DB = {T_target, T1, T2, ..., Tn}, consisting of a target relation T_target (with target variable Y and primary key ID) and a set of background relations Ti. The value n denotes the number of background tables, while w denotes the number of possible values of Y.
  A conventional single-table learning algorithm L.
Output:
  A final model for predicting the class labels of target tuples.
Procedure:
i.   Divide the tuples of T_target into sets S_train(k) and S_test(k), using k-fold cross validation.
ii.  Do for each fold k of the k-fold cross validation
iii.   Do for i = 1, 2, ..., n
         1. Propagate Y and the ID of each tuple in S_train(k) to every background relation Ti, using the shortest foreign key chain which references S_train(k), in order to form the data set Di.
         2. Remove the key attributes from Di.
         3. Apply the aggregation functions to Di, grouping by ID.
         4. Call algorithm L, providing it with Di, obtaining observation Vi for the target concept.
         5. Calculate the error for Vi, denoted as ei. If ei > 0.5, discard Vi.
       End do
iv.    Call algorithm L, providing it with S_train(k), obtaining observation V0.
v.     Do for x = 1, 2, ..., m, where m is the number of tuples in S_test(k)
         Do for s = 0, 1, ..., p, where p is the number of observations learned and accepted in the validation of step iii.5
           If (s == 0)
             Call V0, providing it with tuple x. Return the predictions {P_Yq|V0(x)}, where q is in {1, 2, ..., w}.
           Else
             Retrieve the corresponding tuples from the associated background table, denoted as R, following the foreign key chains.
             If (R is empty)
               Return {P_Yq|Vs(x) = 1.0 / w}, where q is in {1, 2, ..., w}.
             Else
               Apply the aggregation functions to R, grouping by ID, yielding R_A. Call Vs, providing it with R_A, and return the predictions {P_Yq|Vs(x)} for R_A, where q is in {1, 2, ..., w}.
             End if
           End if
         End do
         Return X = < P_Y1|V0(x), ..., P_Yw|V0(x), P_Y1|V1(x), ..., P_Yw|V1(x), ..., P_Y1|Vp(x), ..., P_Yw|Vp(x), Y > as the x-th instance of the meta data set D_M(k).
       End do
     End do
vi.  Call algorithm L, providing it with the union of the meta data sets D_M(j), j = 1, ..., k, forming the meta-learner H_M.
vii. Retrain all multi-view learners Vi, providing them with T_target rather than S_train(k), obtaining the hypothesis Hi of each view.
viii. Return H_M and Hi, where i = 0, 1, ..., p, as the final model.
End Procedure

Figure 4: Pseudo-code for the MVC algorithm
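The three stages of Figure 4 translate into a compact training loop. The sketch below is our condensed reading of the pseudo-code, with two labeled simplifications: scikit-learn decision trees stand in for Weka's C4.5, and the meta features are built from training-set predictions for brevity, whereas the algorithm above generates them with k-fold cross validation. Each element of views is assumed to be an already propagated and aggregated (X, y) pair whose rows are aligned by tuple ID:

# A condensed sketch of the MVC stages (illustrative; see caveats above).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_mvc(views, y):
    learners, blocks = [], []
    for X_view, y_view in views:
        learner = DecisionTreeClassifier().fit(X_view, y_view)
        if learner.score(X_view, y_view) < 0.5:       # step iii.5: drop weak views
            continue
        learners.append(learner)
        blocks.append(learner.predict_proba(X_view))  # per-view P(Yq | x)
    meta_X = np.hstack(blocks)                        # one probability block per view
    meta_learner = DecisionTreeClassifier().fit(meta_X, y)  # step vi
    return learners, meta_learner

def predict_mvc(learners, meta_learner, view_rows):
    # view_rows: one aggregated feature row per accepted view, aligned with learners.
    meta_x = np.hstack([learner.predict_proba(np.asarray(row).reshape(1, -1))
                        for learner, row in zip(learners, view_rows)])
    return meta_learner.predict(meta_x)[0]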

4.1 Construct Multi-view Training Data Sets
The first step of the MVC algorithm is to construct, directly from the database, the multi-view training data sets for use by the multi-view learners. The algorithm constructs one multi-view training set from each background relation, and one from the target table. In the multi-view learning framework, it is important to provide each multi-view learner with sufficient knowledge for it to learn the target concept. To achieve this goal, we propagate the class labels and the tuple IDs in the target relation to all other background relations. The tuple IDs identify each tuple in the target relation. These steps are described in step iii of Figure 4. After grouping by tuple ID, the aggregation functions explained in Section 3.4 are applied to deal with the one-to-many associations between background and target tables. For example, the database in Figure 1 has seven background relations and one target table. Thus, the MVC approach constructs eight multi-view training data sets. One data set is from the target relation Loan and the seven others are from the background relations. A summary of the characteristics of each of the eight multi-view training data sets is shown in Table 1. The original characteristics of the attributes are provided in Figure 1. In this construction process, all joinable tuples from the original tables, as described in Definitions 1 and 2, are selected and used to form the training data sets. In other words, only tuples having foreign key values (through either a directed or an undirected foreign key chain) which reference key values in the target table are selected. Note that it is possible that the resulting view data set may have fewer tuples than the original table. For example, the training data set from the background relation Card has only 36 tuples, as shown in Table 1. This is due to the fact that most of the foreign key values for the tuples from the original table do not link to tuples in the target table Loan.

Table 1. Summary of the characteristics of the multi-view training data sets for the example database (after aggregation). Shown are the number of tuples and attributes as well as the class distribution for each relation.

Training data   # tuples   # attributes   Class distribution
Loan            682                       606:76
Account                                   :76
Order                                     :76
Disposition                               :76
Card            36                        36:0
Client                                    :56
District                                  :76
Transaction                               :

4.2 Yield Multiple-view Hypotheses
After constructing the multi-view training data sets, the MVC algorithm calls upon the multi-view learners to learn the target concept from each of these sets. Each learner constructs a different hypothesis based on the data it is given. Any traditional single-table learning algorithm, such as C4.5, SVMs or neural networks, can be applied. This procedure is explained in step iii of Figure 4. In this way, all multi-view learners make different observations on the target concept, based on their own perspective. The results from the learners are validated and combined to construct the final classification model. This is discussed in the next section, namely Section 4.3.

4.3 Validate Multiple Views
The multi-view learners trained in Section 4.2 need to be validated before being used by the meta-learner. This step is needed to ensure that they are sufficiently able to learn the target concept on their respective training sets. The view validation has to be able to differentiate the strong learners from the weak ones.
In other words, it must identify those learners which are only able to learn concepts that are strictly more general or more specific than the target one [17]. In the MVC method, we use the error rate to evaluate the quality of a multi-view learner. Learners with a training error greater than 50% are discarded. In other words, only learners with predictive performance better than random guessing are used to construct the final model. The method for the construction of this model is discussed in the next section.

4.4 Combine Multi-view Learners
The last step of the MVC approach is to construct the final classification model, using the trained multi-view learners. Approaches for combining models have been investigated by many researchers [3, 9, 11, 28, 30]. The most popular are bagging [3], boosting [9] and stacking [30]. In meta-learning schemes, a learning algorithm is used to construct the function that combines the predictions of the individual learners. The MVC algorithm applies this scheme in order to combine the classifiers from the multi-view learners. After training, the multi-view learners are used to construct training instances (meta data) for the meta-learning algorithm (the meta-learner). Each instance of the meta data consists of the predictions made by the learners on a specific training example, along with the original class label for that example. This process is described in steps v and vi of Figure 4. The details of this procedure are as follows. Firstly, for each training tuple x from the target table, each multi-view learner Vi retrieves the tuples from its background table Ti corresponding to the key reference (a directed or undirected foreign key chain). If corresponding tuples are found, the learner is called to produce the predictions {P_Yq|Vi(x)} (q in {1, 2, ..., w}). Here, P_Yq|Vi(x) denotes the probability that instance x belongs to class Yq (Y has w different values), as estimated by multi-view learner Vi. Otherwise, namely if there is no corresponding tuple in table Ti, an equal probability is assigned to each class label, i.e. P_Yq|Vi(x) values equal to 1/w are returned. Following these steps, a feature vector X, consisting of the {P_Yq|Vi(x)} along with the original class label Y, is built and used as meta data. The feature vector used in the training of the meta-learner can be formulated as follows:

X = < P_Y1|V0(x), ..., P_Yw|V0(x), P_Y1|V1(x), ..., P_Yw|V1(x), ..., P_Y1|Vp(x), ..., P_Yw|Vp(x), Y >
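One subtlety in this construction is the handling of target tuples that have no joinable tuples in a given view. The sketch below is our illustration (rows_for is a hypothetical lookup returning the aggregated feature row for a tuple ID, or None when no related tuples exist); it builds a single meta-data instance as described above, padding missing views with the uniform probability 1/w:

# Building one meta-data instance: per-view class probabilities, with a
# uniform 1/w fallback when the tuple has no related rows in a view.
import numpy as np

def meta_instance(learners, view_names, tuple_id, w, rows_for):
    features = []
    for learner, view in zip(learners, view_names):
        row = rows_for(view, tuple_id)        # aggregated row, or None
        if row is None:
            features.extend([1.0 / w] * w)    # no joinable tuples: uniform P(Yq)
        else:
            features.extend(
                learner.predict_proba(np.asarray(row).reshape(1, -1))[0])
    return features                           # the true label Y is appended last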

In the context of the example database, 682 training instances are generated to train the meta-learner. Each instance consists of 17 attributes: one is the original class label (the actual class) and the other 16 are the eight pairs of predicted probabilities obtained for the example, one pair per view. The probability values describe the likelihood that the example belongs to the good and bad classes, so each meta-data instance is a vector of sixteen probability estimates followed by the label GOOD or BAD. After building the meta data, the MVC algorithm calls on a meta-learner to produce a function that controls how the classifiers are used to classify a particular example. The objective of this meta-learner is to achieve maximum classification accuracy from this combination of classifiers. This function, along with the hypotheses constructed by each of the multi-view learners, constitutes the final classification model.

5. EXPERIMENTAL RESULTS
This section presents the results obtained for the MVC algorithm on different standard databases. These results are presented in comparison with two other well-known systems for multi-relational data mining, namely the FOIL [23] algorithm and a more recent method, CrossMine [31]. The FOIL algorithm uses a top-down approach to build rules. This algorithm is commonly used in benchmarking the performance of relational data mining methods. The CrossMine method is a recent addition. This algorithm extends the FOIL approach with virtual joins, in order to avoid the high cost of producing physical joins in a database [31]. Using tuple ID propagation and a selective sampling method, this approach demonstrates high scalability and accuracy when mining relational databases [31].

5.1 Methodology
Six learning tasks derived from three standard real-world databases were used to compare the predictive performance of the following methods: MVC, FOIL and CrossMine. We implemented the MVC algorithm using Weka [29], a Java-based knowledge learning and analysis environment developed at the University of Waikato in New Zealand. In our experiments, C4.5 [22] decision trees were applied in each of the multi-view learners and for the meta-learner. Each of these experiments produces results using ten-fold cross validation. We report the average running time per fold for the three methods. The MVC algorithm and the CrossMine method were run on a 3 GHz Pentium 4 PC with 1 GByte of RAM running Windows XP and MySQL. The FOIL approach was run on a Sun4u machine with 4 CPUs.

5.2 Financial Database
In our first experiment, we used the Financial database, which was used in the PKDD 1999 discovery challenge [5]. The database was offered by a Czech bank and contains typical business data. The schema of the Financial database is shown in Figure 1. The original database is composed of eight tables. The target table, i.e. the Loan table, consists of 682 records, including a class attribute status which indicates the status of the loan, i.e. A (finished and good), B (finished but bad), C (good but not finished), or D (bad and not finished). The background information for each loan is stored in the relations Account, Client, Order, Transaction, Card, Disposition and District. All background relations relate to the target table through directed or undirected foreign key chains, as shown in Figure 1. This database provides us with three different learning problems.
Our first learning task is to learn whether a loan is good or bad from the 234 finished tuples. The second learning problem attempts to classify whether a loan is good or bad from all 682 instances, regardless of whether the loan is finished or not. Our third experimental task uses the Financial database as prepared in [31], which has only 400 examples in the target table. The authors sampled the Transaction relation, since it contains an extremely large number of instances, and discarded some positive examples from the target relation to make the learning problem more balanced [31]. A summary of the three learning data sets is presented in Table 2, where F234AC, F682AC and F400AC denote the first, second and third learning tasks respectively. For this database, eight views, namely loan, account, client, order, transaction, card, disposition, and district, were constructed for each task.

Table 2. Summary of the data sets used. Shown is the number of tuples in the target relation, the number of background relations, and the target class distribution.

Data sets   # tuples in the target   # background relations   Target class distribution
F400AC      400                      7                        324:76
F234AC      234                      7                        203:31
F682AC      682                      7                        606:76
M188        188                      2                        125:63
M230        230                      2                        154:76
Throm       770                      4

5.3 Mutagenesis Database
Our second experiment was conducted on the Mutagenesis data set [25]. This benchmark data set is composed of the structural descriptions of 188 regression friendly and 42 regression unfriendly molecules that are to be classified as mutagenic or not. Two learning problems can be constructed from this data. The first problem is to classify the 188 regression friendly examples. Of these 188 instances, 125 tuples are positive and 63 are negative. The second task is the classification of all the molecules, regardless of whether they are regression friendly or unfriendly. This task has 230 learning examples, of which 154 tuples are positive and 76 are negative. The background relations consist of descriptions of the atoms and bonds that make up the molecules. A summary of the characteristics of the two learning data sets is also given in Table 2, where M188 and M230 denote the first and second tasks respectively.

For this database, three views, namely mole, bond, and atom, were constructed for each learning task.

5.4 Thrombosis Database
Our third experiment uses the Thrombosis database from the PKDD 2001 Discovery Challenge [6]. This database is organized using seven relations. We set the Antibody_exam relation as the target table, and Thrombosis as the target concept. This concept describes the different degrees of thrombosis, i.e. None, Most Severe, Severe, and Mild. The target table has 770 records describing the results of special laboratory examinations performed on patients. Our task here is to determine whether or not a patient is thrombosis free. For this task, we include four relations as our background knowledge, namely Patient_info, Diagnosis, Ana_pattern, and Thrombosis. All four background relations are linked to the target table by foreign keys. Therefore, for this data set, five different views are constructed by the MVC algorithm. A summary of this data set is shown as part of Table 2, identified by the index Throm.

Table 3. Accuracy obtained for the six learning tasks using FOIL, CrossMine and MVC.

Data sets   FOIL     CM       MVC
F682AC      83.9 %   90.3 %   92.5 %
F400AC      72.8 %   85.8 %   87.8 %
F234AC      74.4 %   88.0 %   88.0 %
M230                 83.9 %   83.0 %
M188                 85.7 %   86.7 %
Throm       100 %    90.0 %   100 %

Table 4. Execution time needed for the six learning tasks using FOIL, CrossMine and MVC.

Data sets   FOIL      CM         MVC
F682AC                11.6 sec   5.6 sec
F400AC                8.1 sec    3.0 sec
F234AC                5.0 sec    2.0 sec
M230                  1.6 sec    3.8 sec
M188                  1.0 sec    3.0 sec
Throm       0.7 sec   0.9 sec    1.5 sec

5.5 Results
We present the predictive accuracy obtained for each of the six learning tasks in Table 3 (CM denotes the CrossMine method). The results obtained using our method, the MVC algorithm, demonstrate its ability to yield better or comparable accuracy when compared to the other popular methods, FOIL and CrossMine. For example, for the F682AC, F400AC, F234AC, M188, and Throm learning tasks, the MVC approach achieved the highest accuracy, with 92.5%, 87.8%, 88.0%, 86.7% and 100% being reported respectively. For the M230 learning problem, the MVC algorithm produced a slightly lower (0.9%) accuracy value than FOIL and CrossMine. To evaluate the performance of the MVC approach in terms of run time, we also provide the execution time needed for each of the six learning tasks in Table 4. These results show that the execution time of the MVC algorithm is comparable to the other two approaches. Promisingly, these results indicate that the MVC algorithm is very efficient for complex database schemas. For example, in the F682AC, F400AC and F234AC learning tasks, which have the most complex structures in this experimental setting, the MVC algorithm was at least twice as fast as the other two methods. In summary, the results shown in Table 3 and Table 4 indicate that the performance achieved by the MVC approach is comparable to that of the other techniques, when evaluated in terms of overall accuracy. In particular, the experimental findings for running time are encouraging. The MVC algorithm was able to successfully learn the target concept in seconds for each of the learning tasks, even for the more complex structures presented in F682AC, F400AC and F234AC.

5.6 Multiple Views Contribution Analysis
To better understand how each multi-view learner contributes to the final model, we present the individual performance of each in this section.
In this experiment we chose to use the Thrombosis database, since it has a moderate number of tuples and background relations. This data, as described in Section 5.4, is composed of five tables. A summary of this data set is provided in Table 5. In order to get a good idea of the contribution of each learner, we present its individual predictive performance, i.e. the overall accuracy and F-Measures [24], on various classification tasks. This data set consists of two classes, which we will refer to as the majority and minority class respectively. The F-Measure incorporates the recall and precision into a single number to measure the predictive performance of an algorithm for different threshold values. The values for Precision and Recall are calculated by the expressions TP / (TP + FP) and TP / (TP + FN), respectively. The F-Measure is formally defined by the following equation [11]:

F-Measure = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)

where β corresponds to the relative importance of precision to recall and is assigned the value of 1. We display in Table 6 the training accuracy, the number of tuples in the training set, and the F-Measure for the minority and majority classes for each multi-view learner. We also provide the predictive performance of the final model in Table 7 so that these values can be compared.
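As a quick arithmetic check of this definition, note that with β = 1 the F-Measure reduces to the harmonic mean of precision and recall. A one-line implementation (ours, for illustration):

def f_measure(precision, recall, beta=1.0):
    # ((1 + b^2) * Recall * Precision) / (b^2 * Recall + Precision)
    b2 = beta ** 2
    return (1 + b2) * recall * precision / (b2 * recall + precision)

print(round(f_measure(0.8, 0.5), 3))  # 2 * 0.4 / 1.3 = 0.615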

Table 5. Summary of the Thrombosis data set. Shown are the number of tuples and attributes for each relation.

Relations                # tuples   # attributes
Antibody_exam (target)   770
Patient_info
Diagnosis
Ana_pattern
Thrombosis

From the results provided in Tables 6 and 7, one can conclude that the MVC algorithm is able to efficiently construct multi-view learners and then successfully combine them to form a strong final model. Table 6 shows that the multi-view learners yield various accuracy values when learning the target concept. For example, Learner 5 obtains perfect performance on the minority class, as indicated by an F-Measure of 1, but obtains a score of zero on the majority class. In contrast, Learner 1 is good at predicting the majority class (an F-Measure of 0.949), but yields poor performance on the minority class (an F-Measure of zero). However, by combining these learners with the meta-learner, the final model is able to achieve perfect predictive performance. As shown in Table 7, the final model achieves 100% accuracy and perfect F-Measure scores for all classes of the target concept.

Table 6. Summary of performance for the multi-view learners. Shown are the names of the training data used along with the number of tuples. Next, the F-Measure scores for the minority and majority classes as well as the overall accuracy on the training data are given.

Multi-view Learners   Training data    # tuples   F.min   F.maj   Acc.
Learner1              Antibody_exam               0       0.949
Learner2              Patient_info
Learner3              Diagnosis
Learner4              Ana_pattern
Learner5              Thrombosis                  1       0

Table 7. Performance summary for the final model, which combines the five multi-view learners. Shown are the number of testing tuples, the F-Measure scores of the minority and majority classes, and the overall accuracy.

              # Tuples   F.min   F.maj   Acc.
FINAL MODEL              1       1       100 %

6. CONCLUSIONS AND FUTURE WORK
This paper introduced a novel approach for learning directly from multi-relational databases using traditional single-table data mining algorithms. The approach is based on the multi-view learning framework, which has been successfully applied in many earlier applications of semi-supervised learning, where it was used to boost the accuracy of a classifier trained on few labeled examples. In our approach, concepts, along with the essential information present in the target relation, are propagated to the background relations, where aggregation functions are applied to yield training data sets for a set of multi-view learners. Target concepts are then learned from each of these views independently, using a conventional single-table learning algorithm. Finally, the results from all views satisfying a validation procedure are combined by a meta-learner to form the final classification model. The MVC algorithm was evaluated on six real-world learning problems. The results obtained indicate that the MVC approach performs well with complex database schemas. It achieved comparable accuracy and high efficiency in terms of running time, when compared with well-known multi-relational learning algorithms. The important finding here is that the MVC algorithm is able to work efficiently when applied directly to a multi-relational database. This paper therefore demonstrates that mining relational databases benefits, in both accuracy and efficiency, from the multi-view learning framework on which the MVC algorithm is built. Another important finding is that this framework allows the MVC method to make good use of any traditional single-table data mining approach.
In conclusion, these results indicate that it is possible to directly mine multi-relational databases using a multi-view learning framework. The results from this paper indicate four advantages of the developed method. The first is that the relational database is able to keep its compact representation and normalized structure. The second is that the method uses a framework that is able to directly incorporate any traditional single-table data mining algorithm. The third is that the multi-view learning framework is highly efficient for mining relational databases in terms of running time. The last advantage is that with this approach it is very easy to integrate advanced techniques developed by the ensemble and heterogeneous learning communities. Our future work will study several issues that need to be addressed in order to extend the MVC approach. These include: further investigation of the method used to construct the training data sets from multi-relational databases; consideration of how to use all join information among relations; deeper research into the dependencies between multiple views; further evaluation of the approach's efficiency and scalability, also in terms of a quantitative evaluation; and experimentation with complex aggregation functions and different view combination techniques, such as majority voting and weighted voting. The idea of using heterogeneous view learners will also need to be investigated. Future work will also include applying this approach to more real-world applications.

7. REFERENCES
[1] Blockeel, H., De Raedt, L. and Ramon, J. Top-down induction of logical decision trees. In Proceedings of the 1998 International Conference on Machine Learning (ICML'98), Madison, WI, 1998.
[2] Blum, A. and Mitchell, T.M. Combining labeled and unlabeled data with co-training. In Proceedings of COLT-98, 1998.
[3] Breiman, L. Bagging predictors. Machine Learning, 24(2), 1996.
[4] Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 1998.
[5] Berka, P. Guide to the financial data set. In Siebes, A. and Berka, P., editors, The ECML/PKDD 2000 Discovery Challenge, 2000.
[6] Coursac, I., Duteil, N., and Lucas, N. PKDD 2001 Discovery Challenge, Medical Domain. The PKDD Discovery Challenge 2001, 3(2), 2001.
[7] De Raedt, L. and Muggleton, S.H. Inductive logic programming: theory and methods. The Journal of Logic Programming, 19&20, 1994.
[8] Dzeroski, S. Multi-relational data mining: an introduction. ACM SIGKDD Explorations, 5(1), 2003.
[9] Freund, Y. and Schapire, R. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 1996.
[10] Ghani, R. Combining labeled and unlabeled data for multiclass text classification. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002.
[11] Guo, H. and Viktor, H.L. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explorations, 6(1), 2004.
[12] Getoor, L., Friedman, N., Koller, D. and Taskar, B. Learning probabilistic models of relational structure. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), Williams College, 2001.
[13] Krogel, M.A. and Wrobel, S. Transformation-based learning using multirelational aggregation. In Rouveirol, C. and Sebag, M., editors, Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP), LNAI 2157, Springer, 2001.
[14] Knobbe, A.J., de Haas, M. and Siebes, A. Propositionalisation and aggregates. In De Raedt, L. and Siebes, A., editors, Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), LNAI 2168, Springer, 2001.
[15] Lavrac, N. and Dzeroski, S. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
[16] Mitchell, T.M. Machine Learning. McGraw Hill, 1997.
[17] Muslea, I. Active Learning with Multiple Views. Ph.D. dissertation, Department of Computer Science, University of Southern California, 2002.
[18] Muggleton, S. and Feng, C. Efficient induction of logic programs. In Proceedings of the 1st Conference on Algorithmic Learning Theory, Ohmsha, Tokyo, Japan, 1990.
[19] Muggleton, S. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 1995.
[20] Perlich, C. and Provost, F. Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., 2003.
[21] Pierce, D. and Cardie, C. Limitations of co-training for natural language learning from large datasets. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP-2001), 2001.
[22] Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[23] Quinlan, J.R. and Cameron-Jones, R.M. FOIL: a midterm report. In Proceedings of the 1993 European Conference on Machine Learning, Vienna, Austria, 1993.
[24] van Rijsbergen, C.J. Information Retrieval. Butterworths, London, 1979.
[25] Srinivasan, A., King, R. and Bristol, D. An assessment of ILP-assisted models for toxicology and the PTE-3 experiment. In Proceedings of the Ninth International Workshop on Inductive Logic Programming, LNAI 1634, Springer-Verlag, 1999.
[26] de Sa, V. and Ballard, D. Category learning from multimodality. Neural Computation, 10(5), 1998.
[27] Vens, C., Van Assche, A., Blockeel, H., and Dzeroski, S. First order random forests with complex aggregates. In Camacho, R., King, R., and Srinivasan, A., editors, Proceedings of the Fourteenth International Conference on Inductive Logic Programming (ILP), LNAI 3194, Springer-Verlag, 2004.
[28] Viktor, H.L. The CILT multi-agent learning system. South African Computer Journal (SACJ), 24, 1999.
[29] Witten, I.H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
[30] Wolpert, D. Stacked generalization. Neural Networks, 5, 1992.
[31] Yin, X., Han, J., Yang, J., and Yu, P.S. CrossMine: efficient classification across multiple database relations. In Proceedings of the 2004 International Conference on Data Engineering (ICDE'04), Boston, MA, 2004.

Efficient Multi-relational Classification by Tuple ID Propagation

Efficient Multi-relational Classification by Tuple ID Propagation Efficient Multi-relational Classification by Tuple ID Propagation Xiaoxin Yin, Jiawei Han and Jiong Yang Department of Computer Science University of Illinois at Urbana-Champaign {xyin1, hanj, jioyang}@uiuc.edu

More information

Stochastic propositionalization of relational data using aggregates

Stochastic propositionalization of relational data using aggregates Stochastic propositionalization of relational data using aggregates Valentin Gjorgjioski and Sašo Dzeroski Jožef Stefan Institute Abstract. The fact that data is already stored in relational databases

More information

Multi-relational Decision Tree Induction

Multi-relational Decision Tree Induction Multi-relational Decision Tree Induction Arno J. Knobbe 1,2, Arno Siebes 2, Daniël van der Wallen 1 1 Syllogic B.V., Hoefseweg 1, 3821 AE, Amersfoort, The Netherlands, {a.knobbe, d.van.der.wallen}@syllogic.com

More information

A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA

A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA Rayner Alfred 1,2, Dimitar Kazakov 1 1 University of York, Computer Science Department, Heslington,

More information

CrossMine: Efficient Classification Across Multiple Database Relations

CrossMine: Efficient Classification Across Multiple Database Relations CrossMine: Efficient Classification Across Multiple Database Relations Xiaoxin Yin UIUC xyin1@uiuc.edu Jiawei Han UIUC hanj@cs.uiuc.edu Jiong Yang UIUC jioyang@cs.uiuc.edu Philip S. Yu IBM T. J. Watson

More information

Experiments with MRDTL -- A Multi-relational Decision Tree Learning Algorithm

Experiments with MRDTL -- A Multi-relational Decision Tree Learning Algorithm Experiments with MRDTL -- A Multi-relational Decision Tree Learning Algorithm Hector Leiva, Anna Atramentov and Vasant Honavar 1 Artificial Intelligence Laboratory Department of Computer Science and Graduate

More information

Aggregation and Selection in Relational Data Mining

Aggregation and Selection in Relational Data Mining in Relational Data Mining Celine Vens Anneleen Van Assche Hendrik Blockeel Sašo Džeroski Department of Computer Science - K.U.Leuven Department of Knowledge Technologies - Jozef Stefan Institute, Slovenia

More information

Classifying Relational Data with Neural Networks

Classifying Relational Data with Neural Networks Classifying Relational Data with Neural Networks Werner Uwents and Hendrik Blockeel Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, B-3001 Leuven {werner.uwents, hendrik.blockeel}@cs.kuleuven.be

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

A MODULAR APPROACH TO RELATIONAL DATA MINING

A MODULAR APPROACH TO RELATIONAL DATA MINING Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2002 Proceedings Americas Conference on Information Systems (AMCIS) December 2002 A MODULAR APPROACH TO RELATIONAL DATA MINING Claudia

More information

Reducing the size of databases for multirelational classification: a subgraph-based approach. Hongyu Guo, Herna L. Viktor, and Eric Paquet. J Intell Inf Syst (2013) 40:349-374. DOI 10.1007/s10844-012-0229-0.

An Efficient Approximation to Lookahead in Relational Learners. Jan Struyf, Jesse Davis, and David Page. Katholieke Universiteit Leuven, Dept. of Computer Science, Celestijnenlaan 200A, 3001 Leuven.

Rank Measures for Ordering. Jin Huang and Charles X. Ling, Department of Computer Science, The University of Western Ontario, London, Ontario, Canada N6A 5B7.

A Bagging Method using Decision Trees in the Role of Base Classifiers. Kristína Machová, František Barčák, and Peter Bednár, Department of Cybernetics and Artificial Intelligence, Technical University.

Trends in Multi-Relational Data Mining Methods. CH. Madhusudhan and K. Mrithyunjaya Rao. IJCSNS International Journal of Computer Science and Network Security, Vol. 15, No. 8, August 2015.

Performance Analysis of Data Mining Classification Techniques. Tejas Mehta and Dhaval Kathiriya. School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India.

Genetic Programming for Data Classification: Partitioning the Search Space. Jeroen Eggermont, Joost N. Kok, and Walter A. Kosters.

An Empirical Study of Lazy Multilabel Classification Algorithms. E. Spyromitros, G. Tsoumakas, and I. Vlahavas, Department of Informatics, Aristotle University of Thessaloniki, Greece.

An Empirical Study of Hoeffding Racing for Model Selection in k-Nearest Neighbor Classification. Flora Yu-Hui Yeh and Marcus Gallagher, School of Information Technology and Electrical Engineering.

Naïve Bayes for text classification. Lecture slides; the road map covers basic concepts, decision tree induction, evaluation of classifiers, rule induction, classification using association rules, naïve Bayesian classification, and support vector machines.

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics. Helmut Berger and Dieter Merkl, Faculty of Information Technology, University of Technology, Sydney, NSW, Australia.

KBSVM: KMeans-based SVM for Business Intelligence. AMCIS 2004 Proceedings, Americas Conference on Information Systems, AIS Electronic Library (AISeL), December 2004.

Data Mining: Introduction. Hamid Beigy, Sharif University of Technology, Fall 1395. Lecture slides.

Comparative Evaluation of Approaches to Propositionalization. M.-A. Krogel, S. Rawles, F. Železný, P. A. Flach, N. Lavrač, and S. Wrobel. Otto-von-Guericke-Universität, Magdeburg, Germany.

OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM. Xiao-Dong Zeng, Sam Chao, and Fai Wong, Faculty of Science and Technology, University of Macau, Macau, China.

Data Mining: Introduction. Hamid Beigy, Sharif University of Technology, Fall 1394. Lecture slides.

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets. Md Nasim Adnan and Md Zahidul Islam, Centre for Research in Complex Systems (CRiCS).

Relational Learning. Jan Struyf and Hendrik Blockeel, Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium.

Graph Propositionalization for Random Forests. Thashmee Karunaratne, Dept. of Computer and Systems Sciences, Stockholm University, Kista, Sweden; Henrik Boström.

Mr-SBC: A Multi-relational Naïve Bayes Classifier. Michelangelo Ceci, Annalisa Appice, and Donato Malerba, Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70126 Bari, Italy ({ceci, appice, malerba}@di.uniba.it).

Discretization Numerical Data for Relational Data with One-to-Many Relations. Rayner Alfred. Journal of Computer Science 5 (7): 519-528, 2009. ISSN 1549-3636.

A Two-level Learning Method for Generalized Multi-instance Problems. Nils Weidmann, Eibe Frank, and Bernhard Pfahringer. Department of Computer Science, University of Freiburg, Germany.

Comparison of various classification models for making financial decisions. Vaibhav Mohan, Computer Science Department, Johns Hopkins University, Baltimore, MD 21218, USA.

DiscoTEX: A Framework of Combining IE and KDD for Text Mining. Ritesh Kumar, Research Scholar, Singhania University, Pacheri Beri, Rajasthan. (Fig. 1 of the paper gives an overview of the IE-based text mining framework.)

Hybrid Feature Selection for Modeling Intrusion Detection Systems. Srilatha Chebrolu, Ajith Abraham, and Johnson P. Thomas, Department of Computer Science, Oklahoma State University, USA.

Learning Statistical Models From Relational Data. Lise Getoor, University of Maryland, College Park. Presentation slides (subset), including work by Nir Friedman (Hebrew U.) and Daphne Koller.

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS (Chapter 4).

Bias-free Hypothesis Evaluation in Multirelational Domains. Christine Körner, Fraunhofer Institut AIS, Germany; Stefan Wrobel, Fraunhofer Institut AIS.

Analysis about Classification Techniques on Categorical Data in Data Mining. P. Meena, Department of Computer Science, Adhiyaman Arts and Science College for Women, Uthangarai, Krishnagiri. International Journal of Scientific Research & Engineering Trends, Volume 4, Issue 6, Nov-Dec 2018.

Information Management course. Alberto Ceselli, Master Degree in Computer Science, Università degli Studi di Milano. Lecture 20 (10/12/2015), based on Data Mining: Concepts and Techniques (3rd ed.).

Efficient Case Based Feature Construction. Ingo Mierswa and Michael Wurst, Artificial Intelligence Unit, Department of Computer Science, University of Dortmund, Germany.

Link Prediction for Social Network. Ning Lin, Computer Science and Engineering, University of California, San Diego.

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data. Nian Zhang and Lara Thompson, Department of Electrical and Computer Engineering.

Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain. M. Vaijayanthi and M. Nithya, Department of Computer Science and Engineering, SNS College of Technology, Coimbatore, India.

Redundant Feature Elimination for Multi-Class Problems. Annalisa Appice and Michelangelo Ceci, Dipartimento di Informatica, Università degli Studi di Bari, Via Orabona 4, 70126 Bari, Italy.

A General Greedy Approximation Algorithm with Applications. Tong Zhang, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598.

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique. Anotai Siltepavet, Sukree Sinthupinyo, and Prabhas Chongstitvatana, Computer Engineering, Chulalongkorn University.

Classification Using Unstructured Rules and Ant Colony Optimization. Negar Zakeri Nejad, Amir H. Bakhtiary, and Morteza Analoui.

Random Forest. A. Fornaser (alberto.fornaser@unitn.it). Lecture slides; sources include Richard E. Turner's lecture on decision trees, information theory and random forests, and Adele Cutler's Trees and Random Forests (Utah State University).

CS145: Introduction to Data Mining, Lecture 08: Classification Evaluation and Practical Issues. Instructor: Yizhou Sun (yzsun@cs.ucla.edu), October 24, 2017.

Uplift Modeling with ROC: An SRL Case Study. Houssam Nassif, Finn Kuusisto, et al. In Proceedings of the International Conference on Inductive Logic Programming (ILP 13), Rio de Janeiro, Brazil, 2013.

Web page recommendation using a stochastic process model. B. J. Park, W. Choi, and S. H. Noh. Data Mining VII: Data, Text and Web Mining and their Business Applications, p. 233.

Leveraging Transitive Relations for Crowdsourced Joins. Jiannan Wang, Guoliang Li, Tim Kraska, Michael J. Franklin, and Jianhua Feng. Department of Computer Science, Tsinghua University; Brown University.

A Survey on Postive and Unlabelled Learning. Gang Li, Computer & Information Sciences, University of Delaware.

Comparative Study of Subspace Clustering Algorithms. S. Chitra Nayagam, Dept. of Computer Applications, Don Bosco College, Panjim, Goa.

CS249: Advanced Data Mining, Classification Evaluation and Practical Issues. Instructor: Yizhou Sun (yzsun@cs.ucla.edu), April 24, 2017.

DRILA: A Distributed Relational Inductive Learning Algorithm. Saleh M. Abu-Soud, Computer Science Department, New York Institute of Technology, Amman Campus, P.O. Box 1202, Amman 11941, Jordan.

Bagging Is A Small-Data-Set Phenomenon. Nitesh Chawla, Thomas E. Moore, Jr., Kevin W. Bowyer, Lawrence O. Hall, Clayton Springer, and Philip Kegelmeyer, Department of Computer Science and Engineering.

Data Mining: Concepts and Techniques, Classification and Prediction (Chapter 6.7). CSE-4412: Data Mining, March 1, 2007.

Mining High Order Decision Rules. Y. Y. Yao, Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2.

HALF&HALF BAGGING AND HARD BOUNDARY POINTS. Leo Breiman, Statistics Department, University of California, Berkeley, CA 94720. Technical Report 534, September 1998.

A study of classification algorithms using Rapidminer. J. Arunadevi, S. Ramya, and M. Ramesh Raja. International Journal of Pure and Applied Mathematics, Volume 119, No. 12, 2018, pp. 15977-15988. ISSN: 1314-3395 (online), http://www.ijpam.eu.

A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. Anna Atramentov, Hector Leiva, and Vasant Honavar, Artificial Intelligence Research Laboratory, Computer Science Department.

Individualized Error Estimation for Classification and Regression Models. Krisztian Buza, Alexandros Nanopoulos, and Lars Schmidt-Thieme.

Chapter 3: Supervised Learning. Lecture slides; the road map covers basic concepts, evaluation of classifiers, classification using association rules, naïve Bayesian classification, and naïve Bayes for text classification.

Weka (http://www.cs.waikato.ac.nz/ml/weka/). The phases of classifier design are reflected in WEKA's Explorer structure, beginning with data pre-processing (filtering) and representation, followed by supervised learning.

Multi-Relational Data Mining. Outline [Dzeroski 2003; Dzeroski & De Raedt 2003]: introduction, inductive logic programming (ILP), relational association rules, relational decision trees, and relational distance-based methods.

Noise-based Feature Perturbation as a Selection Method for Microarray Data. Li Chen, Dmitry B. Goldgof, and Lawrence O. Hall, Department of Computer Science and Engineering; Steven A. Eschrich.

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA. Metanat Hooshsadat, Samaneh Bayat, Parisa Naeimi, Mahdieh S. Mirian, and Osmar R. Zaïane, Computing Science Department.

XML Document Classification with Co-training. Xiaobing Jemma Wu, CSIRO ICT Centre, Australia.

Fast or furious? - User analysis of SF Express Inc. Gege Wen, Yiyuan Zhang, and Kezhen Zhao. CS 229 project, December 2017.

NON-CENTRALIZED DISTINCT L-DIVERSITY. Chi Hong Cheong, Dan Wu, and Man Hon Wong. Department of Computer Science and Engineering, The Chinese University of Hong Kong.

Fault Class Prioritization in Boolean Expressions. Ziyuan Wang, Zhenyu Chen, Tsong-Yueh Chen, and Baowen Xu. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093.

Network Traffic Measurements and Analysis. DEIB - Politecnico di Milano, Fall 2017. Sources include Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, and James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning.

Chapter 5: Summary and Conclusion. Thesis chapter; the introduction describes data mining as extracting hidden, potentially useful, and valuable information from very large amounts of data.

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS. J. I. Serrano and M. D. Del Castillo, Instituto de Automática Industrial, CSIC, Arganda del Rey, Spain.

Patent Classification Using Ontology-Based Patent Network Analysis. PACIS 2010 Proceedings, Pacific Asia Conference on Information Systems, AIS Electronic Library (AISeL), 2010.

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, and Hang Li. The main tasks in many applications, such as search and recommendation, can be formalized as matching between heterogeneous objects.

Classification using Weka (Brain, Computation, and Neural Learning). Jung-Woo Ha. Agenda: general concepts and terminology of classification, an introduction to Weka, and classification practice with Weka.

Multi-label classification using rule-based classifier systems. Shabnam Nazmi (PhD candidate), Department of Electrical and Computer Engineering, North Carolina A&T State University. Advisor: Dr. A. Homaifar.

Using Text Learning to help Web browsing. Dunja Mladenić, J. Stefan Institute, Ljubljana, Slovenia, and Carnegie Mellon University, Pittsburgh, PA, USA.

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation. Daniel Lowd, January 14, 2004.

3-D MRI Brain Scan Classification Using A Point Series Based Representation. Akadej Udomchaiporn, Frans Coenen, Marta García-Fiñana, and Vanessa Sluming, Department of Computer Science.

A new Approach for Handling Numeric Ranges for Graph-Based Knowledge Discovery. Oscar E. Romero A., Lawrence B. Holder, and Jesus A. Gonzalez B. Computer Science Department, INAOE, San Andres Cholula.

CSI5387: Data Mining Project. Terri Oda, April 14, 2008. Web pages have become more like applications than documents.

Temporal Weighted Association Rule Mining for Classification. Purushottam Sharma and Kanak Saxena.

Inductive Logic Programming in Clementine. Sam Brewer and Tom Khabaza, Advanced Data Mining Group, SPSS (UK) Ltd, Woking, Surrey, UK.

A Framework for Web Host Quality Detection. A. Aycan Atak and Şule Gündüz Ögüdücü.

Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine. Manjuparkavi A and Arokiamuthu M, Dr. Pauls Engineering College, Villupuram, India. Published by Pioneer Research & Development Group.

A Comparative Study of Selected Classification Algorithms of Data Mining. IJCSMC, Vol. 4, Issue 6, June 2015, pg. 220. Available online at www.ijcsmc.com.

Evaluation Measures. Sebastian Pölsterl, Computer Aided Medical Procedures, Technische Universität München, April 28, 2015. Outline: classification (confusion matrix, receiver operating characteristics).

Credit card Fraud Detection using Predictive Modeling: a Review. Varre Perantalu and K. BhargavKiran, Vishnu Institute of Technology, Bhimavaram. IJIRT, Volume 3, Issue 9, February 2017.

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis (Chapter 3).

ReMauve: A Relational Model Tree Learner. Celine Vens, Jan Ramon, and Hendrik Blockeel, Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, 3001 Leuven, Belgium.

Using Google's PageRank Algorithm to Identify Important Attributes of Genes. Golam Morshed Osmani, Dept. of Computer Science, North Dakota State University, Fargo, ND 58105.

First order random forests: Learning relational classifiers with complex aggregates. Anneleen Van Assche, Celine Vens, Hendrik Blockeel, and Sašo Džeroski. Mach Learn (2006) 64:149-182. DOI 10.1007/s10994-006-8713-9.

Generating the Reduced Set by Systematic Sampling. Chien-Chung Chang and Yuh-Jye Lee, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology.

Dynamic Ensemble Construction via Heuristic Optimization. Şenay Yaşar Sağlam and W. Nick Street, Department of Management Sciences, The University of Iowa.

Simple Decision Forests for Multi-Relational Classification. Bahareh Bina, Oliver Schulte, Branden Crawford, Zhensong Qian, and Yi Xiong, School of Computing Science, Simon Fraser University, Burnaby, B.C.

Feature-weighted k-Nearest Neighbor Classifier. Diego P. Vivencio and Estevam R. Hruschka. In Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007).