Mining Relational Databases with Multi-view Learning

Hongyu Guo
School of Information Technology and Engineering, University of Ottawa
800 King Edward Road, Ottawa, Ontario, Canada, K1N 6N5

Herna L. Viktor
School of Information Technology and Engineering, University of Ottawa
800 King Edward Road, Ottawa, Ontario, Canada, K1N 6N5

ABSTRACT
Most of today's structured data resides in relational databases, where multiple relations are formed by foreign key joins. In recent years, the field of data mining has played a key role in helping humans analyze and explore large databases. Unfortunately, most methods only utilize flat data representations. Thus, to apply these single-table data mining techniques, we are forced to incur a computational penalty by first converting the data into this flat form. As a result of this transformation, the data not only loses its compact representation, but the semantic information present in the relations is reduced or eliminated. In this paper, we describe a classification approach which addresses this issue by operating directly on relational databases. The approach, called MVC (Multi-View Classification), is based on a multi-view learning framework. In this framework, the target concept is represented in different views and then independently learned using single-table data mining techniques. After constructing multiple classifiers for the target concept in each view, the learners are validated and combined by a meta-learning algorithm. Two methods are employed in the MVC approach, namely (1) target concept propagation and (2) multi-view learning. The propagation method constructs training sets directly from relational databases for use by the multi-view learners. The learning method employs traditional single-table mining techniques to mine data straight from a multi-relational database. Our experiments on benchmark real-world databases show that the MVC method achieves promising results in terms of overall accuracy and run time, when compared with the FOIL and CrossMine learning methods.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

Keywords
Data Mining, Multi-view Classification, Relational Database, Multi-relational Data Mining, Aggregation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MRDM'05, August 21, 2005, Chicago, Illinois, USA. Copyright 2005 ACM.

1. INTRODUCTION
Most real-world data storage takes advantage of the structure of relational databases, and thus information is held in relations specified by foreign key joins. We are therefore faced with the challenge of mining data represented in this type of structure. Recent advances in data mining have played a key role in helping humans analyze and explore large databases. However, most data mining methods work only with flat data representations. These methods include classification algorithms such as C4.5 [22], neural networks [16], and support vector machines [4]. It usually takes considerable time and effort to convert the relations present into the required flat format. Many methods, such as LINUS [15], RELAGGS [13], and Polka [14], have been developed to perform this translation.
The purpose of these so-called propositionalization methods is to bridge the gap between relational databases and traditional data mining approaches. A drawback of these approaches is that multiple relations lose their compact representation and normalized structure when they are converted into flat form. The process also leads to a loss of information and creates a large amount of redundant data. A promising new approach, multi-view learning, has also received significant attention in the current literature [2, 10, 17, 21, 26]. The framework has been applied successfully to many real-world applications such as information extraction [2], text classification [10], and face recognition [26]. We argue that this approach offers a technique which can be used to learn concepts in relational databases. In the multi-view setting, each view can be expressed in terms of a set of disjoint features of the training data [17]. The target concept is learned on each of these separate views using the features present. The results from the views are then combined, each contributing to the learning task. A simple example of multi-view learning is as follows. Consider an algorithm that will learn to classify emails. Each of the different learners uses either the subject or the content of the emails to construct its model. The concepts learned from each view can then be combined to perform the final classification.
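To make the email example concrete, consider the following sketch of a two-view classifier. This is a minimal illustration only (scikit-learn models and invented toy messages, not part of the MVC implementation), and it combines the views by simply averaging their class-probability estimates, rather than by the meta-learning used later in this paper:

# Two disjoint views of the same emails: subject words and body words.
# Toy data and scikit-learn models are illustrative stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

subjects = ["cheap meds now", "meeting agenda", "win a prize", "project update"]
bodies = ["buy pills cheap", "see attached agenda", "click to claim prize", "status of the project"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate

views = []
for texts in (subjects, bodies):  # one learner per disjoint feature set
    vectorizer = CountVectorizer().fit(texts)
    learner = MultinomialNB().fit(vectorizer.transform(texts), labels)
    views.append((vectorizer, learner))

def classify(subject, body):
    # Average the per-view class probabilities and pick the best class.
    probs = [learner.predict_proba(vec.transform([text]))[0]
             for (vec, learner), text in zip(views, (subject, body))]
    averaged = [(p + q) / 2.0 for p, q in zip(*probs)]
    return max(range(len(averaged)), key=averaged.__getitem__)

print(classify("win cheap pills", "claim your prize now"))  # expected: 1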

We argue that a multi-view learning problem with n views can be seen as a problem over n inter-dependent relations, and that multi-view learning is thus applicable to multi-relational learning. Multi-relational data mining involves learning patterns from multiple relations [8]. Each relation usually has various attributes (information) contributing to the target concepts to be learned. As another example, consider the loan problem in the PKDD'99 discovery challenge [5], where the banking database consists of eight relations. Each relation describes different characteristics of a client. For example, the Client relation contains a customer's age, the Account relation identifies a customer's banking account information, and the Card relation refers to the customer's credit card details. In other words, each relation from this database provides a different type of information, or view, of the concept to be learned, i.e. whether the customer is good or not. This problem is therefore a perfect candidate for multi-view learning.

The objective of this paper is to shed light on the role of multi-view learning in relational databases. In this paper, we propose the MVC (Multi-View Classification) algorithm. The MVC approach employs the multi-view learning framework to operate directly on multi-relational databases with conventional data mining methods. The approach works as follows. Firstly, the tuple IDs (identifiers for each tuple in the relation) and the target concepts from the target relation (a relation in which each tuple is associated with a class label) are propagated to the other relations (background relations), based on the foreign links (key references) between them. Secondly, aggregation functions are applied to each background relation (to which the tuple IDs and target concepts were propagated) in order to handle the one-to-many relationships. Each background relation is then used as training data for a particular multi-view learner. Thirdly, conventional single-table data mining methods are used in order to learn the target concept from each view of the data separately. Lastly, the trained multi-view learners are validated and incorporated into a meta-learner to construct the final model. Since the MVC algorithm is based on the multi-view learning framework, it is able to use any conventional method to mine data from relational databases.

The paper is organized as follows. Section 2 introduces related work. Section 3 presents the concept of multi-relational data mining and the multi-view learning framework. Next, a detailed discussion of the MVC algorithm is provided in Section 4. In Section 5, we describe and discuss the comparative evaluation performed on the MVC algorithm. Finally, Section 6 concludes this paper and outlines our future work.

2. RELATED WORK
Over the past few years, three categories of approaches in relational data mining have been actively investigated. Besides propositionalization methods, another popular approach to relational data mining is to use Inductive Logic Programming (ILP) based methods [7], such as FOIL [23], TILDE [1], Golem [18], and Progol [19]. Traditional ILP methods learn concepts from relational data within a logic programming framework. The third category of approaches, SRL (statistical relational learning) models, has also attracted significant attention in the current literature. One of the best-known SRL systems is the PRM (probabilistic relational model) method [12], which extends Bayesian networks to deal with relational data.

Learning from multiple views is a new problem setting. This non-standard approach has drawn significant attention in the current literature through several interesting applications [2, 10, 26]. From a general point of view, multi-view learning is closely related to ensemble learning approaches. However, from the authors' point of view, popular ensemble methods such as boosting, bagging and stacking focus on constructing different hypotheses from subsets of the learning instances.
In contrast, an important characteristic of multi-view learning is that this approach is more interested in learning independent models from disjoint features of the training data. That is, multi-view learning learns separate models from various disjoint-feature-based aspects of the data, leading to independent views on the target concept. The next section introduces multi-relational data mining within the framework of multi-view learning.

3. Multi-Relational Data Mining
In this section, we describe the concept of relational classification. Also, this section explains how the multi-view learning framework can be used to directly mine concepts from multi-relational databases.

3.1 Relational Databases
A schema for a relational database describes a set of tables DB = {T1, T2, ..., Tn} and a set of relationships between pairs of tables. Each row in a table represents a tuple. Each table has at least one primary key attribute, denoted by T.key, which has a unique value for each tuple in the table. The other attributes are either descriptive attributes or foreign key attributes. Foreign key attributes link to attributes of other tables (foreign links or key references). A foreign key attribute of table T2 referencing table T1 is denoted as T2.T1_key. Key attributes in a database include foreign key attributes and primary key attributes. Figure 1 gives a simple example of a relational database schema with eight tables. All relationships among tables are represented with arrows. The arrows show the foreign links, which relate the key attributes of one table to the corresponding key attributes of another table. For example, the Account table has a foreign key attribute account-id which is linked to the key attribute account-id in the Loan table.

Figure 1. A simple database from PKDD'99 [5]
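The notation above is easy to capture programmatically. As a small illustration (ours, not code from the paper; the exact attribute names are assumptions based on Figure 1), a fragment of the schema can be recorded as plain data, one entry per foreign link:

# A fragment of the Figure 1 schema, recorded as plain data (illustrative only).
# Each foreign link relates a key attribute of one table to the corresponding
# key attribute of another, i.e. T2.T1_key in the notation of Section 3.1.
foreign_links = [
    ("Account.account-id", "Loan.account-id"),
    ("Order.account-id", "Loan.account-id"),
    ("Disposition.account-id", "Loan.account-id"),
    ("Transaction.account-id", "Loan.account-id"),
    ("Card.disp-id", "Disposition.disp-id"),
    ("Client.client-id", "Disposition.client-id"),
]

def directly_linked(table, target="Loan"):
    # True when the table holds a key attribute referencing the target table.
    return any(src.startswith(table + ".") and dst.startswith(target + ".")
               for src, dst in foreign_links)

print(directly_linked("Order"), directly_linked("Client"))  # True False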

3.2 Relational Classification
The objective of this paper is to undertake relational classification tasks. Suppose we have a database DB and a target relation T_target which includes a target variable Y. The relational classification task is to find a function f(x) which maps each tuple of the target table T_target to the category Y. The database DB can present various complexities in terms of the number of tables, the number of relationships between tables, and the number of attributes in the tables. In the above scenario, the target table is considered the target relation while all others are background relations. In the example database (Figure 1), the Loan table is the target table and the target variable is the status attribute. This target variable indicates whether a tuple in the target table is good or bad. All other tables, i.e. the tables Account, Order, Transaction, Card, Disposition, Client, and District, are background relations. All background relations are associated with the target relation by foreign key chains. We define a directed foreign key chain as the relationship where a foreign link goes from a background relation directly to the target relation. Otherwise, a relationship is defined as an undirected foreign key chain. For example, in Figure 1, the relations Order, Transaction, Account and Disposition have directed foreign key chains because they are directly linked to the target relation Loan by foreign keys. The relations Card, District, and Client are held in undirected foreign key chains since they do not refer to the target relation directly.

3.3 Multiple View Learning in Relational Databases
In a multi-view learning setting, each learner is assigned a set of training data, from which each should be able to learn the target concept. To apply this setting to relational classification tasks, consider a target concept, represented by a variable Y that resides in the target table T_target. The multi-view classification algorithm has to first correctly propagate the target variable Y to all other background relations, i.e. the other tables in the database. The MVC approach uses foreign key chains to implement this propagation process. The tuple IDs from the target table, which will be used by the aggregation functions, must also be propagated to the background tables. This procedure is formally defined as follows.

Definition 1 (directed foreign key chain propagation): Given a database DB = {T_target, T1, T2, ..., Tn}, where T_target is the target table with primary key T_target.key and target concept Y, consider a background table Ti, where i is in {1, 2, ..., n}, with an attribute set A(Ti) and a foreign key Ti.T_target_key that links to the target table T_target. If a tuple in table Ti has a key value referencing a tuple in the target table through the foreign key Ti.T_target_key, we say that this tuple is joinable. All joinable tuples in table Ti are selected to form a new relation Ti_new. The attributes of table Ti_new are described by the expression A(Ti_new) = T_target.key + Y + A(Ti) - Akey(Ti), where Akey(Ti) are the key attributes of table Ti.

Figure 2. A directed foreign key chain

Let us use the sample database in Figure 1 to demonstrate these two definitions. The target table Loan has attributes loan-id, account-id, date, amount, duration, payment, and status. Here, the tuple ID is the attribute loan-id and the target concept is from the attribute status. For the background table Order, training data from this view will be created. The background relation Order has a reference key linking to the target relation Loan, as shown in Figure 2. By Definition 1, we know that this relation follows the directed foreign key chain propagation scenario. As such, the training data for the view Order consists of all tuples which have an account-id linking to the account-id in the target table. Each of these tuples has a loan-id, i.e. a tuple ID, and a status from the target table, along with the attributes to-bank, to-account, amount, and type from the table Order. The above construction procedure can be converted into the following SQL query for a MySQL database:

CREATE TABLE order_new
SELECT L.loan-id, L.status, A.to-bank, A.to-account, A.amount, A.type
FROM Loan L, Order A
WHERE L.account-id = A.account-id;

Definition 2 (undirected foreign key chain propagation): Given a database DB = {T_target, T1, T2, ..., Tn}, where T_target is the target table with primary key T_target.key and target concept Y, consider a background table Ti, where i is in {1, 2, ..., n}, with a foreign key Ti.T_target_key referencing the target table T_target. Next consider a background table Tk, where k differs from i and k is in {1, 2, ..., n}, with attribute set A(Tk) and a foreign key Tk.Ti_key that links to table Ti. If a tuple in table Tk has a key value linking to a tuple in table Ti, and this tuple in turn has a link to a tuple in the target table T_target, we say that the tuple from Tk is joinable to the target table T_target. All joinable tuples in table Tk are selected to form a new relation Tk_new. We define the set of attributes for table Tk_new as A(Tk_new) = T_target.key + Y + A(Tk) - Akey(Tk), where Akey(Tk) are the key attributes of table Tk. Note that we here only describe an undirected foreign key chain with two background relations involved. This definition can, however, easily be extended to foreign key chains with more than two background relations.

Figure 3. An undirected foreign key chain
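Both propagation definitions can be reproduced end to end with ordinary SQL. The sketch below demonstrates the directed case of Definition 1; it is our illustration only: SQLite stands in for MySQL, the table Order is renamed Orders (ORDER is a reserved word), hyphenated column names become underscored, and the rows are invented for demonstration:

# Directed foreign key chain propagation (Definition 1) on a toy fragment.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Loan (loan_id INTEGER PRIMARY KEY, account_id INTEGER, status TEXT);
CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, account_id INTEGER,
                     to_bank TEXT, to_account TEXT, amount REAL, type TEXT);
INSERT INTO Loan VALUES (1, 10, 'GOOD'), (2, 20, 'BAD');
INSERT INTO Orders VALUES
  (100, 10, 'AB', 'X1', 50.0, 'SIPO'),
  (101, 10, 'CD', 'X2', 75.0, 'UVER'),
  (102, 30, 'EF', 'X3', 20.0, 'SIPO');
-- account 30 above references no loan, so its tuple is not joinable
""")

# The tuple ID (loan_id) and class label (status) are propagated to the
# background relation; only joinable tuples survive the join.
con.execute("""
CREATE TABLE order_new AS
SELECT L.loan_id, L.status, O.to_bank, O.to_account, O.amount, O.type
FROM Loan L JOIN Orders O ON L.account_id = O.account_id
""")
for row in con.execute("SELECT * FROM order_new"):
    print(row)  # two rows, both carrying loan_id 1 and status GOOD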

Consider now the situation presented in Figure 3, which is taken from Figure 1. For the background table Client, the training data from this view is created as follows. First we notice that the relation Client has no directed foreign key referencing the target relation Loan. However, it does have a reference key linked to the relation Disposition, which in turn has a foreign key referencing the target relation Loan. By Definition 2, this relation follows the undirected foreign key chain propagation scenario. Following this definition, the tuples from table Client which have a client-id linked to the client-id attribute in table Disposition, and the tuples from table Disposition with an account-id linked to the target table, are selected to construct the training data. Since the table Client has attributes client-id, birthday, gender, and district-id, the final training data consists of four attributes, namely birthday and gender from the relation Client and loan-id and status from the target relation Loan. The above construction procedure can also be converted into the following SQL query for a MySQL database:

CREATE TABLE client_new
SELECT L.loan-id, L.status, C.birthdate, C.gender
FROM Loan L, Disposition D, Client C
WHERE L.account-id = D.account-id AND D.client-id = C.client-id;

Aggregation functions are then applied to the newly generated tables Ti_new to construct the training data sets for the multi-view learners. This aggregation procedure is described next, in Section 3.4.

3.4 Aggregation-based Feature Generation
Relational data mining approaches have to handle the difficulties inherent with one-to-many and many-to-many associations between relations [8, 27]. The relations Ti_new obtained in Section 3.3 contain one-to-many relationships with respect to the primary key value T_target.key (i.e. the tuple ID) as propagated from the target relation T_target. This kind of indeterminate relationship has a significant impact on predictive performance and has been studied by several researchers [13, 14, 20]. To address this problem, the MVC method applies aggregation functions taken from the field of relational databases. Aggregation has been widely used to summarize the properties of a set of related objects. In the MVC algorithm, the aggregation functions squeeze related tuples into a single tuple using the primary key from the target relation. Each new table Ti_new is aggregated to form a multi-view training set Ti_view. Each Ti_view is then used to train a separate multi-view learner. For nominal and binary attributes, the MVC method applies the aggregation function COUNT. For numeric attributes, the functions SUM, AVG, MIN, MAX, STDDEV, and COUNT are employed. The aggregation-based feature generation procedure is formally defined as follows. Consider a relation Ti with tuple ID attribute T_target.key. For each attribute Ai in Ti, except the attribute T_target.key, new attributes are generated to replace the attribute Ai based on the following constraints.

1. Nominal Attribute. If Ai is a nominal attribute with n possible different values (n > 2), the aggregation-based feature generation procedure produces a new attribute Ψ(Ai), in which the total number of occurrences of values of attribute Ai is stored. The value is calculated using the COUNT() function in the database.

2. Binary Attribute. If Ai is a binary attribute, in which only two different values are possible, the aggregation procedure creates two new attributes Ψ1(Ai) and Ψ2(Ai). The attribute Ψ1(Ai) stores the number of occurrences of one of the values of Ai, and Ψ2(Ai) stores the number of occurrences of the other. Each value is calculated by counting occurrences in the database.

3. Numeric Attribute. If Ai is a numeric attribute with a set of values m, six new attributes Ψ1-6(Ai) are generated. In these attributes, the values of SUM(m), AVG(m), MIN(m), MAX(m), STDDEV(m), and COUNT(m) are stored, respectively.

In order to achieve high efficiency, the steps described in Section 3.3 and Section 3.4 are performed simultaneously. Let us go back to the two examples we discussed earlier, namely the order_new and client_new examples, and consider the combination of the propagation and aggregation methods. The table order_new has nominal attributes to-bank, to-account and type, and a numeric attribute amount. Three aggregation attributes are generated to replace the nominal attributes to-bank, to-account and type, respectively: COUNT(to-bank), COUNT(to-account) and COUNT(type). Six new aggregation attributes are created to replace the numeric attribute amount: SUM(amount), AVG(amount), MIN(amount), MAX(amount), STDDEV(amount), and COUNT(amount). Combining this procedure with the propagation procedure of Section 3.3 results in the following SQL query:

CREATE TABLE order_view
SELECT L.loan-id, L.status, COUNT(A.to-bank), COUNT(A.to-account),
  SUM(A.amount), AVG(A.amount), MIN(A.amount), MAX(A.amount),
  STDDEV(A.amount), COUNT(A.amount), COUNT(A.type)
FROM Loan L, Order A
WHERE L.account-id = A.account-id
GROUP BY L.loan-id;

In the case of the client_new table, we have a nominal attribute birthday and a binary attribute gender (male or female). So one aggregation attribute, COUNT(birthday), is produced to replace the birthday attribute, and two new attributes, gender_male and gender_female, are created to replace the gender attribute. This procedure can also be converted into the following SQL query:

CREATE TABLE client_view
SELECT L.loan-id, L.status, COUNT(C.birthdate),
  SUM(IF(C.gender = 'male', 1, 0)) AS gender_male,
  SUM(IF(C.gender = 'female', 1, 0)) AS gender_female
FROM Loan L, Disposition D, Client C
WHERE L.account-id = D.account-id AND D.client-id = C.client-id
GROUP BY L.loan-id;
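Because the three rules above are purely type-driven, the aggregation part of such queries can be generated mechanically from an attribute catalogue. The following sketch is our illustration (the catalogue entries are assumptions based on Figure 1, and STDDEV is the MySQL spelling of the standard deviation function); it emits the SELECT clauses for the order view:

# Generate aggregation clauses from the per-type rules of Section 3.4.
NUMERIC_FUNCS = ["SUM", "AVG", "MIN", "MAX", "STDDEV", "COUNT"]

def aggregation_clauses(attributes):
    """attributes: list of (name, kind, values), kind in {nominal, binary, numeric}."""
    clauses = []
    for name, kind, values in attributes:
        if kind == "numeric":      # rule 3: six aggregates per numeric attribute
            clauses += ["%s(%s)" % (func, name) for func in NUMERIC_FUNCS]
        elif kind == "binary":     # rule 2: one occurrence counter per value
            clauses += ["SUM(IF(%s = '%s', 1, 0)) AS %s_%s" % (name, v, name, v)
                        for v in values]
        else:                      # rule 1: a single COUNT for nominal attributes
            clauses.append("COUNT(%s)" % name)
    return clauses

order_attributes = [("to_bank", "nominal", None), ("to_account", "nominal", None),
                    ("amount", "numeric", None), ("type", "nominal", None)]
print(",\n".join(aggregation_clauses(order_attributes)))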

Note that the construction and aggregation procedures are performed in the relational database management system (RDBMS). With the optimized query-processing techniques implemented in the RDBMS, these procedures are highly efficient. Another benefit is that these procedures can easily be adapted for other database management systems such as Oracle and DB2.

4. MVC ALGORITHM
In this section, we present the MVC algorithm. This approach, which is based on the multi-view learning framework, is able to operate directly on multi-relational databases. It also allows any traditional single-table learning algorithm developed for flat data representations to be used in the multi-view learning setting. The MVC algorithm, as shown in Figure 4, consists of the following three stages. In the first stage, multi-view training data sets are constructed straight from the relational database. In this step, tuple IDs and class labels from the target table are propagated to all background relations through key references. Also as part of this stage, aggregation functions are applied to each background relation to handle the one-to-many associations. In the second stage, multi-view learners are initiated to learn the target concept in each of the constructed training data sets. In this process, conventional single-table learning algorithms are applied. In the third stage of the algorithm, the trained multi-view learners from the second stage are validated and then used in a meta-learning algorithm to construct the final classification model. The pseudo-code of the MVC algorithm is shown below and discussed in detail in the following four sections.

MVC ALGORITHM
Input:
  A relational database DB = {T_target, T1, T2, ..., Tn}, consisting of a target relation T_target (with target variable Y and primary key ID) and a set of background relations Ti. The value n denotes the number of background tables, while w denotes the number of possible values of Y.
  A conventional single-table learning algorithm L.
Output:
  A final model for predicting the class labels of target tuples.
Procedure:
i.   Divide the tuples of T_target into sets S_train(k) and S_test(k), using k-fold cross validation.
ii.  Do for each fold k of the k-fold cross validation
iii.   Do for i = 1, 2, ..., n
         1. Propagate Y and the ID of each tuple in S_train(k) to every background relation Ti, using the shortest foreign key chain which references S_train(k), in order to form the data set Di.
         2. Remove the key attributes from Di.
         3. Apply the aggregation functions to Di, grouping by ID.
         4. Call algorithm L, providing it with Di, obtaining observation Vi for the target concept.
         5. Calculate the error for Vi, denoted as ei. If ei > 0.5, discard Vi.
       End do
iv.    Call algorithm L, providing it with S_train(k), obtaining observation V0.
v.     Do for x = 1, 2, ..., m, where m is the number of tuples in S_test(k)
         Do for s = 0, 1, ..., p, where p is the number of observations learned and accepted in the validation of step iii.5
           If (s == 0)
             Call V0, providing it with tuple x. Return the predictions {P_Yq|V0(x)}, where q is in {1, 2, ..., w}.
           Else
             Retrieve the corresponding tuples from the associated background table, denoted as R, following the foreign key chains.
             If (R is empty)
               Return {P_Yq|Vs(x) = 1.0 / w}, where q is in {1, 2, ..., w}.
             Else
               Apply the aggregation functions to R, grouping by ID, yielding R_A. Call Vs, providing it with R_A, and return the predictions {P_Yq|Vs(x)} for R_A, where q is in {1, 2, ..., w}.
             End if
           End if
         End do
         Return X = < P_Y1|V0(x), ..., P_Yw|V0(x), P_Y1|V1(x), ..., P_Yw|V1(x), ..., P_Y1|Vp(x), ..., P_Yw|Vp(x), Y > as the x-th instance of the meta data set D_M(k).
       End do
     End do
vi.  Call algorithm L, providing it with the union of the meta data sets D_M(j), j = 1, ..., k, forming the meta-learner H_M.
vii. Retrain all multi-view learners Vi, providing them with T_target rather than S_train(k), obtaining the hypothesis Hi of each view.
viii. Return H_M and Hi, where i = 0, 1, ..., p, as the final model.
End Procedure

Figure 4: Pseudo-code for the MVC algorithm
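The three stages of Figure 4 translate into a compact training loop. The sketch below is our condensed reading of the pseudo-code, with two labeled simplifications: scikit-learn decision trees stand in for Weka's C4.5, and the meta features are built from training-set predictions for brevity, whereas the algorithm above generates them with k-fold cross validation. Each element of views is assumed to be an already propagated and aggregated (X, y) pair whose rows are aligned by tuple ID:

# A condensed sketch of the MVC stages (illustrative; see caveats above).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_mvc(views, y):
    learners, blocks = [], []
    for X_view, y_view in views:
        learner = DecisionTreeClassifier().fit(X_view, y_view)
        if learner.score(X_view, y_view) < 0.5:       # step iii.5: drop weak views
            continue
        learners.append(learner)
        blocks.append(learner.predict_proba(X_view))  # per-view P(Yq | x)
    meta_X = np.hstack(blocks)                        # one probability block per view
    meta_learner = DecisionTreeClassifier().fit(meta_X, y)  # step vi
    return learners, meta_learner

def predict_mvc(learners, meta_learner, view_rows):
    # view_rows: one aggregated feature row per accepted view, aligned with learners.
    meta_x = np.hstack([learner.predict_proba(np.asarray(row).reshape(1, -1))
                        for learner, row in zip(learners, view_rows)])
    return meta_learner.predict(meta_x)[0]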

4.1 Construct Multi-view Training Data Sets
The first step of the MVC algorithm is to construct, directly from the database, the multi-view training data sets for use by the multi-view learners. The algorithm constructs one multi-view training set from each background relation, and one from the target table. In the multi-view learning framework, it is important to provide each multi-view learner with sufficient knowledge for it to learn the target concept. To achieve this goal, we propagate the class labels and the tuple IDs in the target relation to all other background relations. The tuple IDs identify each tuple in the target relation. These steps are described in step iii of Figure 4. After grouping by tuple ID, the aggregation functions explained in Section 3.4 are applied to deal with the one-to-many associations between background and target tables. For example, the database in Figure 1 has seven background relations and one target table. Thus, the MVC approach constructs eight multi-view training data sets. One data set is from the target relation Loan and the seven others are from the background relations. A summary of the characteristics of each of the eight multi-view training data sets is shown in Table 1. The original characteristics of the attributes are provided in Figure 1. In this construction process, all joinable tuples from the original tables, as described in Definitions 1 and 2, are selected and used to form the training data sets. In other words, only tuples having foreign key values (through either a directed or an undirected foreign key chain) which reference key values in the target table are selected. Note that it is possible that the resulting view data set may have fewer tuples than the original table. For example, the training data set from the background relation Card has only 36 tuples, as shown in Table 1. This is due to the fact that most of the foreign key values for the tuples from the original table do not link to tuples in the target table Loan.

Table 1. Summary of the characteristics of the multi-view training data sets for the example database (after aggregation). Shown are the number of tuples and attributes as well as the class distribution for each relation.

Training data   # tuples   # attributes   Class distribution
Loan            682                       606:76
Account                                   :76
Order                                     :76
Disposition                               :76
Card            36                        36:0
Client                                    :56
District                                  :76
Transaction                               :

4.2 Yield Multiple-view Hypotheses
After constructing the multi-view training data sets, the MVC algorithm calls upon the multi-view learners to learn the target concept from each of these sets. Each learner constructs a different hypothesis based on the data it is given. Any traditional single-table learning algorithm, such as C4.5, SVMs or neural networks, can be applied. This procedure is explained in step iii of Figure 4. In this way, all multi-view learners make different observations on the target concept, based on their own perspective. The results from the learners are validated and combined to construct the final classification model. This is discussed in the next section, namely Section 4.3.

4.3 Validate Multiple Views
The multi-view learners trained in Section 4.2 need to be validated before being used by the meta-learner. This step is needed to ensure that they are sufficiently able to learn the target concept on their respective training sets. The view validation has to be able to differentiate the strong learners from the weak ones.
In other words, it must identify those learners which are only able to learn concepts that are strictly more general or more specific than the target one [17]. In the MVC method, we use the error rate to evaluate the quality of a multi-view learner. Learners with a training error greater than 50% are discarded. In other words, only learners with predictive performance better than random guessing are used to construct the final model. The method for the construction of this model is discussed in the next section.

4.4 Combine Multi-view Learners
The last step of the MVC approach is to construct the final classification model, using the trained multi-view learners. Approaches for combining models have been investigated by many researchers [3, 9, 11, 28, 30]. The most popular are bagging [3], boosting [9] and stacking [30]. In meta-learning schemes, a learning algorithm is used to construct the function that combines the predictions of the individual learners. The MVC algorithm applies this scheme in order to combine the classifiers from the multi-view learners. After training, the multi-view learners are used to construct training instances (meta data) for the meta-learning algorithm (the meta-learner). Each instance of the meta data consists of the predictions made by the learners on a specific training example, along with the original class label for that example. This process is described in steps v and vi of Figure 4. The details of this procedure are as follows. Firstly, for each training tuple x from the target table, each multi-view learner Vi retrieves the tuples from its background table Ti corresponding to the key reference (a directed or undirected foreign key chain). If corresponding tuples are found, the learner is called to produce the predictions {P_Yq|Vi(x)} (q in {1, 2, ..., w}). Here, P_Yq|Vi(x) denotes the probability that instance x belongs to class Yq (Y has w different values), as estimated by multi-view learner Vi. Otherwise, namely if there is no corresponding tuple in table Ti, an equal probability is assigned to each class label, i.e. P_Yq|Vi(x) values equal to 1/w are returned. Following these steps, a feature vector X, consisting of the {P_Yq|Vi(x)} along with the original class label Y, is built and used as meta data. The feature vector used in the training of the meta-learner can be formulated as follows:

X = < P_Y1|V0(x), ..., P_Yw|V0(x), P_Y1|V1(x), ..., P_Yw|V1(x), ..., P_Y1|Vp(x), ..., P_Yw|Vp(x), Y >
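One subtlety in this construction is the handling of target tuples that have no joinable tuples in a given view. The sketch below is our illustration (rows_for is a hypothetical lookup returning the aggregated feature row for a tuple ID, or None when no related tuples exist); it builds a single meta-data instance as described above, padding missing views with the uniform probability 1/w:

# Building one meta-data instance: per-view class probabilities, with a
# uniform 1/w fallback when the tuple has no related rows in a view.
import numpy as np

def meta_instance(learners, view_names, tuple_id, w, rows_for):
    features = []
    for learner, view in zip(learners, view_names):
        row = rows_for(view, tuple_id)        # aggregated row, or None
        if row is None:
            features.extend([1.0 / w] * w)    # no joinable tuples: uniform P(Yq)
        else:
            features.extend(
                learner.predict_proba(np.asarray(row).reshape(1, -1))[0])
    return features                           # the true label Y is appended last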

In the context of the example database, 682 training instances are generated to train the meta-learner. Each instance consists of 17 attributes: one is the original class label (the actual class) and the other 16 are the eight pairs of predicted probabilities obtained for the example, one pair per view. The probability values describe the likelihood that the example belongs to the good and bad classes, so each meta-data instance is a vector of sixteen probability estimates followed by the label GOOD or BAD. After building the meta data, the MVC algorithm calls on a meta-learner to produce a function that controls how the classifiers are used to classify a particular example. The objective of this meta-learner is to achieve maximum classification accuracy from this combination of classifiers. This function, along with the hypotheses constructed by each of the multi-view learners, constitutes the final classification model.

5. EXPERIMENTAL RESULTS
This section presents the results obtained for the MVC algorithm on different standard databases. These results are presented in comparison with two other well-known systems for multi-relational data mining, namely the FOIL [23] algorithm and a more recent method, CrossMine [31]. The FOIL algorithm uses a top-down approach to build rules. This algorithm is commonly used in benchmarking the performance of relational data mining methods. The CrossMine method is a recent addition. This algorithm extends the FOIL approach with virtual joins, in order to avoid the high cost of producing physical joins in a database [31]. Using tuple ID propagation and a selective sampling method, this approach demonstrates high scalability and accuracy when mining relational databases [31].

5.1 Methodology
Six learning tasks derived from three standard real-world databases were used to compare the predictive performance of the following methods: MVC, FOIL and CrossMine. We implemented the MVC algorithm using Weka [29], a Java-based knowledge learning and analysis environment developed at the University of Waikato in New Zealand. In our experiments, C4.5 [22] decision trees were applied in each of the multi-view learners and for the meta-learner. Each of these experiments produces results using ten-fold cross validation. We report the average running time per fold for the three methods. The MVC algorithm and the CrossMine method were run on a 3 GHz Pentium 4 PC with 1 GByte of RAM running Windows XP and MySQL. The FOIL approach was run on a Sun4u machine with 4 CPUs.

5.2 Financial Database
In our first experiment, we used the Financial database, which was used in the PKDD 1999 discovery challenge [5]. The database was offered by a Czech bank and contains typical business data. The schema of the Financial database is shown in Figure 1. The original database is composed of eight tables. The target table, i.e. the Loan table, consists of 682 records, including a class attribute status which indicates the status of the loan, i.e. A (finished and good), B (finished but bad), C (good but not finished), or D (bad and not finished). The background information for each loan is stored in the relations Account, Client, Order, Transaction, Card, Disposition and District. All background relations relate to the target table through directed or undirected foreign key chains, as shown in Figure 1. This database provides us with three different learning problems.
Our first learning task is to learn whether a loan is good or bad from the 234 finished tuples. The second learning problem attempts to classify whether a loan is good or bad from all 682 instances, regardless of whether the loan is finished or not. Our third experimental task uses the Financial database as prepared in [31], which has only 400 examples in the target table. The authors sampled the Transaction relation, since it contains an extremely large number of instances, and discarded some positive examples from the target relation to make the learning problem more balanced [31]. A summary of the three learning data sets is presented in Table 2, where F234AC, F682AC and F400AC denote the first, second and third learning tasks respectively. For this database, eight views, namely loan, account, client, order, transaction, card, disposition, and district, were constructed for each task.

Table 2. Summary of the data sets used. Shown is the number of tuples in the target relation, the number of background relations, and the target class distribution.

Data sets   # tuples in the target   # background relations   Target class distribution
F400AC      400                      7                        324:76
F234AC      234                      7                        203:31
F682AC      682                      7                        606:76
M188        188                      2                        125:63
M230        230                      2                        154:76
Throm       770                      4

5.3 Mutagenesis Database
Our second experiment was conducted on the Mutagenesis data set [25]. This benchmark data set is composed of the structural descriptions of 188 regression friendly and 42 regression unfriendly molecules that are to be classified as mutagenic or not. Two learning problems can be constructed from this data. The first problem is to classify the 188 regression friendly examples. Of these 188 instances, 125 tuples are positive and 63 are negative. The second task is the classification of all the molecules, regardless of whether they are regression friendly or unfriendly. This task has 230 learning examples, of which 154 tuples are positive and 76 are negative. The background relations consist of descriptions of the atoms and bonds that make up the molecules. A summary of the characteristics of the two learning data sets is also given in Table 2, where M188 and M230 denote the first and second tasks respectively.

For this database, three views, namely mole, bond, and atom, were constructed for each learning task.

5.4 Thrombosis Database
Our third experiment uses the Thrombosis database from the PKDD 2001 Discovery Challenge [6]. This database is organized using seven relations. We set the Antibody_exam relation as the target table, and Thrombosis as the target concept. This concept describes the different degrees of thrombosis, i.e. None, Most Severe, Severe, and Mild. The target table has 770 records describing the results of special laboratory examinations performed on patients. Our task here is to determine whether or not a patient is thrombosis free. For this task, we include four relations as our background knowledge, namely Patient_info, Diagnosis, Ana_pattern, and Thrombosis. All four background relations are linked to the target table by foreign keys. Therefore, for this data set, five different views are constructed by the MVC algorithm. A summary of this data set is shown as part of Table 2, identified by the index Throm.

Table 3. Accuracy obtained for the six learning tasks using FOIL, CrossMine and MVC.

Data sets   FOIL     CM       MVC
F682AC      83.9 %   90.3 %   92.5 %
F400AC      72.8 %   85.8 %   87.8 %
F234AC      74.4 %   88.0 %   88.0 %
M230                 83.9 %   83.0 %
M188                 85.7 %   86.7 %
Throm       100 %    90.0 %   100 %

Table 4. Execution time needed for the six learning tasks using FOIL, CrossMine and MVC.

Data sets   FOIL      CM         MVC
F682AC                11.6 sec   5.6 sec
F400AC                8.1 sec    3.0 sec
F234AC                5.0 sec    2.0 sec
M230                  1.6 sec    3.8 sec
M188                  1.0 sec    3.0 sec
Throm       0.7 sec   0.9 sec    1.5 sec

5.5 Results
We present the predictive accuracy obtained for each of the six learning tasks in Table 3 (CM denotes the CrossMine method). The results obtained using our method, the MVC algorithm, demonstrate its ability to yield better or comparable accuracy when compared to the other popular methods, FOIL and CrossMine. For example, for the F682AC, F400AC, F234AC, M188, and Throm learning tasks, the MVC approach achieved the highest accuracy, with 92.5%, 87.8%, 88.0%, 86.7% and 100% being reported respectively. For the M230 learning problem, the MVC algorithm produced a slightly lower (0.9%) accuracy value than FOIL and CrossMine. To evaluate the performance of the MVC approach in terms of run time, we also provide the execution time needed for each of the six learning tasks in Table 4. These results show that the execution time of the MVC algorithm is comparable to the other two approaches. Promisingly, these results indicate that the MVC algorithm is very efficient for complex database schemas. For example, in the F682AC, F400AC and F234AC learning tasks, which have the most complex structures in this experimental setting, the MVC algorithm was at least twice as fast as the other two methods. In summary, the results shown in Table 3 and Table 4 indicate that the performance achieved by the MVC approach is comparable to that of the other techniques, when evaluated in terms of overall accuracy. In particular, the experimental findings for running time are encouraging. The MVC algorithm was able to successfully learn the target concept in seconds for each of the learning tasks, even for the more complex structures presented in F682AC, F400AC and F234AC.

5.6 Multiple Views Contribution Analysis
To better understand how each multi-view learner contributes to the final model, we present the individual performance of each in this section.
In this experiment we chose to use the Thrombosis database, since it has a moderate number of tuples and background relations. This data, as described in Section 5.4, is composed of five tables. A summary of this data set is provided in Table 5. In order to get a good idea of the contribution of each learner, we present its individual predictive performance, i.e. the overall accuracy and F-Measures [24], on various classification tasks. This data set consists of two classes, which we will refer to as the majority and minority class respectively. The F-Measure incorporates the recall and precision into a single number to measure the predictive performance of an algorithm for different threshold values. The values for Precision and Recall are calculated by the expressions TP / (TP + FP) and TP / (TP + FN), respectively. The F-Measure is formally defined by the following equation [11]:

F-Measure = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)

where β corresponds to the relative importance of precision to recall and is assigned the value of 1. We display in Table 6 the training accuracy, the number of tuples in the training set, and the F-Measure for the minority and majority classes for each multi-view learner. We also provide the predictive performance of the final model in Table 7 so that these values can be compared.
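As a quick arithmetic check of this definition, note that with β = 1 the F-Measure reduces to the harmonic mean of precision and recall. A one-line implementation (ours, for illustration):

def f_measure(precision, recall, beta=1.0):
    # ((1 + b^2) * Recall * Precision) / (b^2 * Recall + Precision)
    b2 = beta ** 2
    return (1 + b2) * recall * precision / (b2 * recall + precision)

print(round(f_measure(0.8, 0.5), 3))  # 2 * 0.4 / 1.3 = 0.615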

Table 5. Summary of the Thrombosis data set. Shown are the number of tuples and attributes for each relation.

Relations                # tuples   # attributes
Antibody_exam (target)   770
Patient_info
Diagnosis
Ana_pattern
Thrombosis

From the results provided in Tables 6 and 7, one can conclude that the MVC algorithm is able to efficiently construct multi-view learners and then successfully combine them to form a strong final model. Table 6 shows that the multi-view learners yield various accuracy values when learning the target concept. For example, Learner 5 obtains perfect performance on the minority class, as indicated by an F-Measure of 1, but obtains a score of zero on the majority class. In contrast, Learner 1 is good at predicting the majority class (an F-Measure of 0.949), but yields poor performance on the minority class (an F-Measure of zero). However, by combining these learners with the meta-learner, the final model is able to achieve perfect predictive performance. As shown in Table 7, the final model achieves 100% accuracy and perfect F-Measure scores for all classes of the target concept.

Table 6. Summary of performance for the multi-view learners. Shown are the names of the training data used along with the number of tuples. Next, the F-Measure scores for the minority and majority classes as well as the overall accuracy on the training data are given.

Multi-view Learners   Training data    # tuples   F.min   F.maj   Acc.
Learner1              Antibody_exam               0       0.949
Learner2              Patient_info
Learner3              Diagnosis
Learner4              Ana_pattern
Learner5              Thrombosis                  1       0

Table 7. Performance summary for the final model, which combines the five multi-view learners. Shown are the number of testing tuples, the F-Measure scores of the minority and majority classes, and the overall accuracy.

              # Tuples   F.min   F.maj   Acc.
FINAL MODEL              1       1       100 %

6. CONCLUSIONS AND FUTURE WORK
This paper introduced a novel approach for learning directly from multi-relational databases using traditional single-table data mining algorithms. The approach is based on the multi-view learning framework, which has been successfully applied in many earlier applications of semi-supervised learning, where it was used to boost the accuracy of a classifier trained on few labeled examples. In our approach, concepts, along with the essential information present in the target relation, are propagated to the background relations, where aggregation functions are applied to yield training data sets for a set of multi-view learners. Target concepts are then learned from each of these views independently, using a conventional single-table learning algorithm. Finally, the results from all views satisfying a validation procedure are combined by a meta-learner to form the final classification model. The MVC algorithm was evaluated on six real-world learning problems. The results obtained indicate that the MVC approach performs well with complex database schemas. It achieved comparable accuracy and high efficiency in terms of running time, when compared with well-known multi-relational learning algorithms. The important finding here is that the MVC algorithm is able to work efficiently when applied directly to a multi-relational database. This paper therefore demonstrates that mining relational databases benefits, in both accuracy and efficiency, from the multi-view learning framework on which the MVC algorithm is built. Another important finding is that this framework allows the MVC method to make good use of any traditional single-table data mining approach.
In conclusion, these results indicate that it is possible to directly mine multi-relational databases using a multi-view learning framework. The results from this paper indicate four advantages of the developed method. The first is that the relational database is able to keep its compact representation and normalized structure. The second is that the method uses a framework that is able to directly incorporate any traditional single-table data mining algorithm. The third is that the multi-view learning framework is highly efficient for mining relational databases in terms of running time. The last advantage is that with this approach it is very easy to integrate advanced techniques developed by the ensemble and heterogeneous learning communities. Our future work will study several issues that need to be addressed in order to extend the MVC approach. These include: further investigation of the method used to construct the training data sets from multi-relational databases; consideration of how to use all join information among relations; deeper research into the dependencies between multiple views; further evaluation of the approach's efficiency and scalability, also in terms of a quantitative evaluation; and experimentation with complex aggregation functions and different view combination techniques, such as majority voting and weighted voting. The idea of using heterogeneous view learners will also need to be investigated. Future work will also include applying this approach to more real-world applications.

7. REFERENCES
[1] Blockeel, H., De Raedt, L. and Ramon, J. Top-down induction of logical decision trees. In Proceedings of the 1998 International Conference on Machine Learning (ICML'98), Madison, WI, 1998.
[2] Blum, A. and Mitchell, T.M. Combining labeled and unlabeled data with co-training. In Proceedings of COLT-98, 1998.
[3] Breiman, L. Bagging predictors. Machine Learning, 24(2), 1996.
[4] Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 1998.
[5] Berka, P. Guide to the financial data set. In Siebes, A. and Berka, P., editors, The ECML/PKDD 2000 Discovery Challenge, 2000.
[6] Coursac, I., Duteil, N., and Lucas, N. PKDD 2001 Discovery Challenge, Medical Domain. The PKDD Discovery Challenge 2001, 3(2), 2001.
[7] De Raedt, L. and Muggleton, S.H. Inductive logic programming: theory and methods. The Journal of Logic Programming, 19&20, 1994.
[8] Dzeroski, S. Multi-relational data mining: an introduction. ACM SIGKDD Explorations, 5(1), 2003.
[9] Freund, Y. and Schapire, R. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 1996.
[10] Ghani, R. Combining labeled and unlabeled data for multiclass text classification. In Proceedings of the 19th International Conference on Machine Learning (ICML-2002), 2002.
[11] Guo, H. and Viktor, H.L. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explorations, 6(1), 2004.
[12] Getoor, L., Friedman, N., Koller, D. and Taskar, B. Learning probabilistic models of relational structure. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), Williams College, 2001.
[13] Krogel, M.A. and Wrobel, S. Transformation-based learning using multirelational aggregation. In Rouveirol, C. and Sebag, M., editors, Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP), LNAI 2157, Springer, 2001.
[14] Knobbe, A.J., de Haas, M. and Siebes, A. Propositionalisation and aggregates. In De Raedt, L. and Siebes, A., editors, Proceedings of the Fifth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), LNAI 2168, Springer, 2001.
[15] Lavrac, N. and Dzeroski, S. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.
[16] Mitchell, T.M. Machine Learning. McGraw Hill, 1997.
[17] Muslea, I. Active Learning with Multiple Views. Ph.D. dissertation, Department of Computer Science, University of Southern California, 2002.
[18] Muggleton, S. and Feng, C. Efficient induction of logic programs. In Proceedings of the 1st Conference on Algorithmic Learning Theory, Ohmsha, Tokyo, Japan, 1990.
[19] Muggleton, S. Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 1995.
[20] Perlich, C. and Provost, F. Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, D.C., 2003.
[21] Pierce, D. and Cardie, C. Limitations of co-training for natural language learning from large datasets. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP-2001), 2001.
[22] Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[23] Quinlan, J.R. and Cameron-Jones, R.M. FOIL: a midterm report. In Proceedings of the 1993 European Conference on Machine Learning, Vienna, Austria, 1993.
[24] van Rijsbergen, C.J. Information Retrieval. Butterworths, London, 1979.
[25] Srinivasan, A., King, R. and Bristol, D. An assessment of ILP-assisted models for toxicology and the PTE-3 experiment. In Proceedings of the Ninth International Workshop on Inductive Logic Programming, LNAI 1634, Springer-Verlag, 1999.
[26] de Sa, V. and Ballard, D. Category learning from multimodality. Neural Computation, 10(5), 1998.
[27] Vens, C., Van Assche, A., Blockeel, H., and Dzeroski, S. First order random forests with complex aggregates. In Camacho, R., King, R., and Srinivasan, A., editors, Proceedings of the Fourteenth International Conference on Inductive Logic Programming (ILP), LNAI 3194, Springer-Verlag, 2004.
[28] Viktor, H.L. The CILT multi-agent learning system. South African Computer Journal (SACJ), 24, 1999.
[29] Witten, I.H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
[30] Wolpert, D. Stacked generalization. Neural Networks, 5, 1992.
[31] Yin, X., Han, J., Yang, J., and Yu, P.S. CrossMine: efficient classification across multiple database relations. In Proceedings of the 2004 International Conference on Data Engineering (ICDE'04), Boston, MA, 2004.

Efficient Multi-relational Classification by Tuple ID Propagation

Efficient Multi-relational Classification by Tuple ID Propagation Efficient Multi-relational Classification by Tuple ID Propagation Xiaoxin Yin, Jiawei Han and Jiong Yang Department of Computer Science University of Illinois at Urbana-Champaign {xyin1, hanj, jioyang}@uiuc.edu

More information

Stochastic propositionalization of relational data using aggregates

Stochastic propositionalization of relational data using aggregates Stochastic propositionalization of relational data using aggregates Valentin Gjorgjioski and Sašo Dzeroski Jožef Stefan Institute Abstract. The fact that data is already stored in relational databases

More information

Multi-relational Decision Tree Induction

Multi-relational Decision Tree Induction Multi-relational Decision Tree Induction Arno J. Knobbe 1,2, Arno Siebes 2, Daniël van der Wallen 1 1 Syllogic B.V., Hoefseweg 1, 3821 AE, Amersfoort, The Netherlands, {a.knobbe, d.van.der.wallen}@syllogic.com

More information

A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA

A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA A Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA Rayner Alfred 1,2, Dimitar Kazakov 1 1 University of York, Computer Science Department, Heslington,

More information

CrossMine: Efficient Classification Across Multiple Database Relations

CrossMine: Efficient Classification Across Multiple Database Relations CrossMine: Efficient Classification Across Multiple Database Relations Xiaoxin Yin UIUC xyin1@uiuc.edu Jiawei Han UIUC hanj@cs.uiuc.edu Jiong Yang UIUC jioyang@cs.uiuc.edu Philip S. Yu IBM T. J. Watson

More information

Experiments with MRDTL -- A Multi-relational Decision Tree Learning Algorithm

Experiments with MRDTL -- A Multi-relational Decision Tree Learning Algorithm Experiments with MRDTL -- A Multi-relational Decision Tree Learning Algorithm Hector Leiva, Anna Atramentov and Vasant Honavar 1 Artificial Intelligence Laboratory Department of Computer Science and Graduate

More information

Aggregation and Selection in Relational Data Mining

Aggregation and Selection in Relational Data Mining in Relational Data Mining Celine Vens Anneleen Van Assche Hendrik Blockeel Sašo Džeroski Department of Computer Science - K.U.Leuven Department of Knowledge Technologies - Jozef Stefan Institute, Slovenia

More information

Classifying Relational Data with Neural Networks

Classifying Relational Data with Neural Networks Classifying Relational Data with Neural Networks Werner Uwents and Hendrik Blockeel Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, B-3001 Leuven {werner.uwents, hendrik.blockeel}@cs.kuleuven.be

More information

International Journal of Software and Web Sciences (IJSWS)

International Journal of Software and Web Sciences (IJSWS) International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

A MODULAR APPROACH TO RELATIONAL DATA MINING

A MODULAR APPROACH TO RELATIONAL DATA MINING Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2002 Proceedings Americas Conference on Information Systems (AMCIS) December 2002 A MODULAR APPROACH TO RELATIONAL DATA MINING Claudia

More information

Reducing the size of databases for multirelational classification: a subgraph-based approach. Hongyu Guo, Herna L. Viktor, and Eric Paquet. J Intell Inf Syst (2013) 40:349-374. DOI 10.1007/s10844-012-0229-0.

An Efficient Approximation to Lookahead in Relational Learners. Jan Struyf, Jesse Davis, and David Page. Katholieke Universiteit Leuven, Dept. of Computer Science, Celestijnenlaan 200A, 3001 Leuven.

Rank Measures for Ordering. Jin Huang and Charles X. Ling, Department of Computer Science, The University of Western Ontario, London, Ontario, Canada N6A 5B7.

A Bagging Method using Decision Trees in the Role of Base Classifiers. Kristína Machová, František Barčák, and Peter Bednár, Department of Cybernetics and Artificial Intelligence, Technical University.

Trends in Multi-Relational Data Mining Methods. CH. Madhusudhan and K. Mrithyunjaya Rao. IJCSNS International Journal of Computer Science and Network Security, Vol. 15, No. 8, August 2015.

Performance Analysis of Data Mining Classification Techniques. Tejas Mehta and Dhaval Kathiriya. School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India.

Genetic Programming for Data Classification: Partitioning the Search Space. Jeroen Eggermont, Joost N. Kok, and Walter A. Kosters.

An Empirical Study of Lazy Multilabel Classification Algorithms. E. Spyromitros, G. Tsoumakas, and I. Vlahavas, Department of Informatics, Aristotle University of Thessaloniki, Greece.

An Empirical Study of Hoeffding Racing for Model Selection in k-Nearest Neighbor Classification. Flora Yu-Hui Yeh and Marcus Gallagher, School of Information Technology and Electrical Engineering.

Naïve Bayes for text classification. Lecture slides; the road map covers basic concepts, decision tree induction, evaluation of classifiers, rule induction, classification using association rules, naïve Bayesian classification, and support vector machines.

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics. Helmut Berger and Dieter Merkl, Faculty of Information Technology, University of Technology, Sydney, NSW, Australia.

KBSVM: KMeans-based SVM for Business Intelligence. AMCIS 2004 Proceedings, Americas Conference on Information Systems, AIS Electronic Library (AISeL), December 2004.

Data Mining: Introduction. Hamid Beigy, Sharif University of Technology, Fall 1395. Lecture slides.

Comparative Evaluation of Approaches to Propositionalization. M.-A. Krogel, S. Rawles, F. Železný, P. A. Flach, N. Lavrač, and S. Wrobel. Otto-von-Guericke-Universität, Magdeburg, Germany.

OPTIMIZATION OF BAGGING CLASSIFIERS BASED ON SBCB ALGORITHM. Xiao-Dong Zeng, Sam Chao, and Fai Wong, Faculty of Science and Technology, University of Macau, Macau, China.

Data Mining: Introduction. Hamid Beigy, Sharif University of Technology, Fall 1394. Lecture slides.

Improving the Random Forest Algorithm by Randomly Varying the Size of the Bootstrap Samples for Low Dimensional Data Sets. Md Nasim Adnan and Md Zahidul Islam, Centre for Research in Complex Systems (CRiCS).

Relational Learning. Jan Struyf and Hendrik Blockeel, Dept. of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium.

Graph Propositionalization for Random Forests. Thashmee Karunaratne, Dept. of Computer and Systems Sciences, Stockholm University, Kista, Sweden; Henrik Boström.

Mr-SBC: A Multi-relational Naïve Bayes Classifier. Michelangelo Ceci, Annalisa Appice, and Donato Malerba, Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70126 Bari, Italy ({ceci, appice, malerba}@di.uniba.it).

Discretization Numerical Data for Relational Data with One-to-Many Relations. Rayner Alfred. Journal of Computer Science 5 (7): 519-528, 2009. ISSN 1549-3636.

A Two-level Learning Method for Generalized Multi-instance Problems. Nils Weidmann, Eibe Frank, and Bernhard Pfahringer. Department of Computer Science, University of Freiburg, Germany.

Comparison of various classification models for making financial decisions. Vaibhav Mohan, Computer Science Department, Johns Hopkins University, Baltimore, MD 21218, USA.

DiscoTEX: A Framework of Combining IE and KDD for Text Mining. Ritesh Kumar, Research Scholar, Singhania University, Pacheri Beri, Rajasthan. (Fig. 1 of the paper gives an overview of the IE-based text mining framework.)

Hybrid Feature Selection for Modeling Intrusion Detection Systems. Srilatha Chebrolu, Ajith Abraham, and Johnson P. Thomas, Department of Computer Science, Oklahoma State University, USA.

Learning Statistical Models From Relational Data. Lise Getoor, University of Maryland, College Park. Presentation slides (subset), including work by Nir Friedman (Hebrew U.) and Daphne Koller.

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS (Chapter 4).

Bias-free Hypothesis Evaluation in Multirelational Domains. Christine Körner, Fraunhofer Institut AIS, Germany; Stefan Wrobel, Fraunhofer Institut AIS.

Analysis about Classification Techniques on Categorical Data in Data Mining. P. Meena, Department of Computer Science, Adhiyaman Arts and Science College for Women, Uthangarai, Krishnagiri. International Journal of Scientific Research & Engineering Trends, Volume 4, Issue 6, Nov-Dec 2018.

Information Management course. Alberto Ceselli, Master Degree in Computer Science, Università degli Studi di Milano. Lecture 20 (10/12/2015), based on Data Mining: Concepts and Techniques (3rd ed.).

Efficient Case Based Feature Construction. Ingo Mierswa and Michael Wurst, Artificial Intelligence Unit, Department of Computer Science, University of Dortmund, Germany.

Link Prediction for Social Network. Ning Lin, Computer Science and Engineering, University of California, San Diego.

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data. Nian Zhang and Lara Thompson, Department of Electrical and Computer Engineering.

Support Vector Machine: An efficient classifier for Method Level Bug Prediction using Information Gain. M. Vaijayanthi and M. Nithya, Department of Computer Science and Engineering, SNS College of Technology, Coimbatore, India.

Redundant Feature Elimination for Multi-Class Problems. Annalisa Appice and Michelangelo Ceci, Dipartimento di Informatica, Università degli Studi di Bari, Via Orabona 4, 70126 Bari, Italy.

A General Greedy Approximation Algorithm with Applications. Tong Zhang, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598.

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique. Anotai Siltepavet, Sukree Sinthupinyo, and Prabhas Chongstitvatana, Computer Engineering, Chulalongkorn University.

Classification Using Unstructured Rules and Ant Colony Optimization. Negar Zakeri Nejad, Amir H. Bakhtiary, and Morteza Analoui.

Random Forest. A. Fornaser (alberto.fornaser@unitn.it). Lecture slides; sources include Richard E. Turner's lecture on decision trees, information theory and random forests, and Adele Cutler's Trees and Random Forests (Utah State University).

CS145: Introduction to Data Mining, Lecture 08: Classification Evaluation and Practical Issues. Instructor: Yizhou Sun (yzsun@cs.ucla.edu), October 24, 2017.

Uplift Modeling with ROC: An SRL Case Study. Houssam Nassif, Finn Kuusisto, et al. In Proceedings of the International Conference on Inductive Logic Programming (ILP 13), Rio de Janeiro, Brazil, 2013.

Web page recommendation using a stochastic process model. B. J. Park, W. Choi, and S. H. Noh. Data Mining VII: Data, Text and Web Mining and their Business Applications, p. 233.

Leveraging Transitive Relations for Crowdsourced Joins. Jiannan Wang, Guoliang Li, Tim Kraska, Michael J. Franklin, and Jianhua Feng. Department of Computer Science, Tsinghua University; Brown University.

A Survey on Postive and Unlabelled Learning. Gang Li, Computer & Information Sciences, University of Delaware.

Comparative Study of Subspace Clustering Algorithms. S. Chitra Nayagam, Dept. of Computer Applications, Don Bosco College, Panjim, Goa.

CS249: Advanced Data Mining, Classification Evaluation and Practical Issues. Instructor: Yizhou Sun (yzsun@cs.ucla.edu), April 24, 2017.

DRILA: A Distributed Relational Inductive Learning Algorithm. Saleh M. Abu-Soud, Computer Science Department, New York Institute of Technology, Amman Campus, P.O. Box 1202, Amman 11941, Jordan.

Bagging Is A Small-Data-Set Phenomenon. Nitesh Chawla, Thomas E. Moore, Jr., Kevin W. Bowyer, Lawrence O. Hall, Clayton Springer, and Philip Kegelmeyer, Department of Computer Science and Engineering.

Data Mining: Concepts and Techniques, Classification and Prediction (Chapter 6.7). CSE-4412: Data Mining, March 1, 2007.

Mining High Order Decision Rules. Y. Y. Yao, Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2.

HALF&HALF BAGGING AND HARD BOUNDARY POINTS. Leo Breiman, Statistics Department, University of California, Berkeley, CA 94720. Technical Report 534, September 1998.

A study of classification algorithms using Rapidminer. J. Arunadevi, S. Ramya, and M. Ramesh Raja. International Journal of Pure and Applied Mathematics, Volume 119, No. 12, 2018, pp. 15977-15988. ISSN: 1314-3395 (online), http://www.ijpam.eu.

A Multi-Relational Decision Tree Learning Algorithm - Implementation and Experiments. Anna Atramentov, Hector Leiva, and Vasant Honavar, Artificial Intelligence Research Laboratory, Computer Science Department.

Individualized Error Estimation for Classification and Regression Models. Krisztian Buza, Alexandros Nanopoulos, and Lars Schmidt-Thieme.

Chapter 3: Supervised Learning. Lecture slides; the road map covers basic concepts, evaluation of classifiers, classification using association rules, naïve Bayesian classification, and naïve Bayes for text classification.

Weka (http://www.cs.waikato.ac.nz/ml/weka/). The phases of classifier design are reflected in WEKA's Explorer structure, beginning with data pre-processing (filtering) and representation, followed by supervised learning.

Multi-Relational Data Mining. Outline [Dzeroski 2003; Dzeroski & De Raedt 2003]: introduction, inductive logic programming (ILP), relational association rules, relational decision trees, and relational distance-based methods.

Noise-based Feature Perturbation as a Selection Method for Microarray Data. Li Chen, Dmitry B. Goldgof, and Lawrence O. Hall, Department of Computer Science and Engineering; Steven A. Eschrich.

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA. Metanat Hooshsadat, Samaneh Bayat, Parisa Naeimi, Mahdieh S. Mirian, and Osmar R. Zaïane, Computing Science Department.

XML Document Classification with Co-training. Xiaobing Jemma Wu, CSIRO ICT Centre, Australia.

Fast or furious? - User analysis of SF Express Inc. Gege Wen, Yiyuan Zhang, and Kezhen Zhao. CS 229 project, December 2017.

NON-CENTRALIZED DISTINCT L-DIVERSITY. Chi Hong Cheong, Dan Wu, and Man Hon Wong. Department of Computer Science and Engineering, The Chinese University of Hong Kong.

Fault Class Prioritization in Boolean Expressions. Ziyuan Wang, Zhenyu Chen, Tsong-Yueh Chen, and Baowen Xu. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093.

Network Traffic Measurements and Analysis. DEIB - Politecnico di Milano, Fall 2017. Sources include Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, and James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning.

Chapter 5: Summary and Conclusion. Thesis chapter; the introduction describes data mining as extracting hidden, potentially useful, and valuable information from very large amounts of data.

MODELLING DOCUMENT CATEGORIES BY EVOLUTIONARY LEARNING OF TEXT CENTROIDS. J. I. Serrano and M. D. Del Castillo, Instituto de Automática Industrial, CSIC, Arganda del Rey, Spain.

Patent Classification Using Ontology-Based Patent Network Analysis. PACIS 2010 Proceedings, Pacific Asia Conference on Information Systems, AIS Electronic Library (AISeL), 2010.

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, and Hang Li. The main tasks in many applications, such as search and recommendation, can be formalized as matching between heterogeneous objects.

Classification using Weka (Brain, Computation, and Neural Learning). Jung-Woo Ha. Agenda: general concepts and terminology of classification, an introduction to Weka, and classification practice with Weka.

Multi-label classification using rule-based classifier systems. Shabnam Nazmi (PhD candidate), Department of Electrical and Computer Engineering, North Carolina A&T State University. Advisor: Dr. A. Homaifar.

Using Text Learning to help Web browsing. Dunja Mladenić, J. Stefan Institute, Ljubljana, Slovenia, and Carnegie Mellon University, Pittsburgh, PA, USA.

Probabilistic Abstraction Lattices: A Computationally Efficient Model for Conditional Probability Estimation. Daniel Lowd, January 14, 2004.

3-D MRI Brain Scan Classification Using A Point Series Based Representation. Akadej Udomchaiporn, Frans Coenen, Marta García-Fiñana, and Vanessa Sluming, Department of Computer Science.

A new Approach for Handling Numeric Ranges for Graph-Based Knowledge Discovery. Oscar E. Romero A., Lawrence B. Holder, and Jesus A. Gonzalez B. Computer Science Department, INAOE, San Andres Cholula.

CSI5387: Data Mining Project. Terri Oda, April 14, 2008. Web pages have become more like applications than documents.

Temporal Weighted Association Rule Mining for Classification. Purushottam Sharma and Kanak Saxena.

Inductive Logic Programming in Clementine. Sam Brewer and Tom Khabaza, Advanced Data Mining Group, SPSS (UK) Ltd, Woking, Surrey, UK.

A Framework for Web Host Quality Detection. A. Aycan Atak and Şule Gündüz Ögüdücü.

Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine. Manjuparkavi A and Arokiamuthu M, Dr. Pauls Engineering College, Villupuram, India. Published by Pioneer Research & Development Group.

A Comparative Study of Selected Classification Algorithms of Data Mining. IJCSMC, Vol. 4, Issue 6, June 2015, pg. 220. Available online at www.ijcsmc.com.

Evaluation Measures. Sebastian Pölsterl, Computer Aided Medical Procedures, Technische Universität München, April 28, 2015. Outline: classification (confusion matrix, receiver operating characteristics).

Credit card Fraud Detection using Predictive Modeling: a Review. Varre Perantalu and K. BhargavKiran, Vishnu Institute of Technology, Bhimavaram. IJIRT, Volume 3, Issue 9, February 2017.

Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis (Chapter 3).

ReMauve: A Relational Model Tree Learner. Celine Vens, Jan Ramon, and Hendrik Blockeel, Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, 3001 Leuven, Belgium.

Using Google's PageRank Algorithm to Identify Important Attributes of Genes. Golam Morshed Osmani, Dept. of Computer Science, North Dakota State University, Fargo, ND 58105.

First order random forests: Learning relational classifiers with complex aggregates. Anneleen Van Assche, Celine Vens, Hendrik Blockeel, and Sašo Džeroski. Mach Learn (2006) 64:149-182. DOI 10.1007/s10994-006-8713-9.

Generating the Reduced Set by Systematic Sampling. Chien-Chung Chang and Yuh-Jye Lee, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology.

Dynamic Ensemble Construction via Heuristic Optimization. Şenay Yaşar Sağlam and W. Nick Street, Department of Management Sciences, The University of Iowa.

Simple Decision Forests for Multi-Relational Classification. Bahareh Bina, Oliver Schulte, Branden Crawford, Zhensong Qian, and Yi Xiong, School of Computing Science, Simon Fraser University, Burnaby, B.C.

Feature-weighted k-Nearest Neighbor Classifier. Diego P. Vivencio and Estevam R. Hruschka. In Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007).