QAR-CIP-NSGA-II: A New Multi-Objective Evolutionary Algorithm to Mine Quantitative Association Rules


D. Martín a, A. Rosete a, J. Alcalá-Fdez b, F. Herrera b

a Dept. Artificial Intelligence and Infrastructure of Informatic Systems, Higher Polytechnic Institute José Antonio Echeverría, Cujae, La Habana, Cuba
b Department of Computer Science and Artificial Intelligence, University of Granada, CITIC-UGR, Granada, Spain

Abstract

Some researchers have framed the extraction of association rules as a multi-objective problem, jointly optimizing several measures to obtain a set of more interesting and accurate rules. In this paper, we propose a new multi-objective evolutionary model which maximizes three objectives (comprehensibility, interestingness and performance) in order to mine a set of quantitative association rules with a good trade-off between interpretability and accuracy. To accomplish this, the model extends the well-known Multi-objective Evolutionary Algorithm Non-dominated Sorting Genetic Algorithm II to perform an evolutionary learning of the intervals of the attributes and a condition selection for each rule. Moreover, this proposal introduces an external population and a restarting process to the evolutionary model in order to store all the nondominated rules found and improve the diversity of the rule set obtained. The results obtained over real-world datasets demonstrate the effectiveness of the proposed approach.

Keywords: Data Mining, Quantitative Association Rules, Multi-Objective Evolutionary Algorithms, NSGA-II

1. Introduction

Association discovery is one of the most common Data Mining (DM) techniques used to extract interesting knowledge from large datasets [34]. Association rules identify dependencies between items in a dataset [65] and are defined as an expression of the type X → Y, where X and Y are sets of items and X ∩ Y = ∅ [1, 2].
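As a minimal illustration of this definition, an association rule can be represented as a pair of disjoint item sets; the representation and function name below are ours, not part of any algorithm in this paper.

```python
# A rule X -> Y as a pair of disjoint item sets (illustrative sketch).

def make_rule(antecedent, consequent):
    """Build a rule (X, Y) enforcing X ∩ Y = ∅."""
    X, Y = frozenset(antecedent), frozenset(consequent)
    if X & Y:
        raise ValueError("antecedent and consequent must be disjoint")
    return (X, Y)

rule = make_rule({"bread", "butter"}, {"milk"})
print(sorted(rule[0]), "->", sorted(rule[1]))
```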
Many previous studies for mining association rules focused on datasets with binary or discrete values; however, the data in real-world applications usually consist of quantitative values. Thus, designing DM algorithms able to deal with various types of data is a challenge in this field [6, 13, 36, 56, 61]. A commonly used method to handle continuous domains in the extraction of association rules is to partition the domains of the attributes into intervals. For instance, an association rule could be Income ∈ [1200, 2000] → MortgageExpenses ∈ [360, 600]. These kinds of rules are known as quantitative association rules (QARs) [56]. In recent years, Evolutionary Algorithms (EAs), particularly Genetic Algorithms (GAs) [23], have been used by many researchers to mine QARs from datasets with quantitative values [4, 8]. The main motivation for applying GAs to knowledge extraction tasks is that they are robust and adaptive search algorithms that perform a global search in the space of candidate solutions (for instance, rules or other forms of knowledge representation). Recently, some researchers have presented the extraction of association rules as a multi-objective problem (instead of a single-objective one), removing some of the limitations of the current approaches. Several objectives are considered in the process of extracting association rules, obtaining a set of more interesting and accurate rules [5, 33]. In this way, we can jointly optimize measures such as support, confidence, and so on, which can present different degrees of trade-off depending on the dataset used and the type of information that can be extracted from it. Since this approach

Corresponding author. E-mail addresses: dmartin@ceis.cujae.edu.cu (D. Martín), rosete@ceis.cujae.edu.cu (A. Rosete), jalcala@decsai.ugr.es (J. Alcalá-Fdez), herrera@decsai.ugr.es (F. Herrera)

Preprint submitted to Information Sciences, September 5, 2013

has a multi-objective nature, the use of Multi-Objective Evolutionary Algorithms (MOEAs) [14, 18] to obtain a set of solutions with different degrees of trade-off between the different measures could represent an interesting way of working (by considering these measures as objectives). In this paper, we propose a new multi-objective evolutionary model to mine a set of QARs with a good trade-off between interpretability and accuracy which maximizes three objectives: comprehensibility, interestingness and performance, where performance is understood as the product of the Certainty Factor (CF) [54] and support. To accomplish this, the model (called QAR-CIP) extends the well-known MOEA Non-dominated Sorting Genetic Algorithm II (NSGA-II) [19] to perform an evolutionary learning of the intervals of the attributes and a condition selection for each rule; the algorithm is therefore called QAR-CIP-NSGA-II. Moreover, this proposal introduces an external population (EP) and a restarting process to the evolutionary model in order to store all the nondominated rules found and promote diversity in the population. Notice that this proposal follows a dataset-independent approach which does not rely on the minimum support (minsup) and minimum confidence (minconf) thresholds, which are hard to determine for each dataset. In order to assess the performance of the proposed approach, we have presented an experimental study using 9 real-world datasets. We have carried out the following studies. First, we have compared our approach with the original evolutionary model of NSGA-II in order to analyze the performance of the new components introduced. Second, we have compared the performance of our approach with four mono-objective approaches and three MOEAs to mine QARs. Third, we have shown the results obtained from the comparison with two other classical approaches for mining association rules.
Furthermore, in these studies, we have made use of some nonparametric statistical tests for the pairwise and multiple comparison [21, 30, 29, 31] of the performance of these approaches over 22 real-world datasets. Finally, we have analyzed the scalability of the proposed approach. This paper is arranged as follows. Section 2 introduces a brief study of the existing MOEAs for general purposes [67], some basic definitions of QARs and some quality measures. Section 3 details the evolutionary learning components proposed to mine a set of high quality QARs. Section 4 shows and discusses the results that are obtained over 9 real-world datasets. Section 5 presents some concluding remarks. Finally, Appendix A shows the results obtained by the analyzed algorithms in the 22 real-world datasets considered for the statistical analysis.

2. Preliminaries

In this section, we first introduce the basic definitions of QARs and some quality measures. Then, we present a brief study of MOEAs.

2.1. Quantitative association rules

Association rules are used to represent and identify dependencies between items in a dataset [34, 65]. As we mentioned above, they are expressions of the type X → Y, where X and Y are sets of items, and X ∩ Y = ∅. There are many previous studies of mining association rules that are focused on datasets with binary or discrete values; however, data in real-world applications usually consist of quantitative values. When the domain is continuous, the association rules are known as QARs, in which each item is an attribute-interval pair. For instance, a QAR could be Age ∈ [30, 52] and Salary ∈ [3000, 3500] → NumCars ∈ [3, 4]. Support and Confidence are the most common measures to assess association rules.
These measures for a rule X → Y are defined as:

Support(X → Y) = SUP(XY) / |D|    (1)

Confidence(X → Y) = SUP(XY) / SUP(X)    (2)

where SUP(XY) is the number of patterns of the dataset covered by the antecedent and consequent of the rule, SUP(X) is the number of patterns of the dataset covered by the antecedent of the rule and |D| is the number of patterns in the dataset. The classic techniques for mining association rules attempt to discover rules whose support and confidence are greater than the user-defined thresholds minsup and minconf. However, several authors have pointed out some

drawbacks of this framework that lead it to find many more rules than it should [10, 12, 55]. For instance, confidence is unable to detect statistical independence or negative dependence between items because it does not take into account the support of the consequent. Moreover, itemsets with very high support are a source of misleading rules because they appear in most of the transactions, and hence any itemset (whatever its meaning) seems to be a good predictor of the presence of the high-support itemset. In recent years, several researchers have proposed other measures to select and rank patterns according to their potential interest to the user [3, 12, 32, 48, 50, 54]. We briefly describe some of them below.

The conviction [12] measure analyzes the dependency between X and ¬Y, where ¬Y means the absence of Y. Its domain is [0, ∞), where values less than 1 represent negative dependence, 1 represents independence and values higher than 1 represent positive dependence. However, it is not easy to compare the conviction of rules because its domain is not bounded, making it difficult to define a conviction threshold. Conviction for a rule X → Y is defined as:

Conviction(X → Y) = SUP(X) SUP(¬Y) / SUP(X¬Y)    (3)

The lift [50] measure represents the ratio between the confidence of the rule and the expected confidence of the rule. Its domain is [0, ∞), where values less than 1 imply negative dependence, 1 implies independence and values higher than 1 imply positive dependence. As with conviction, its range is not bounded, which makes it difficult to define a lift threshold. Lift for a rule X → Y is defined as:

Lift(X → Y) = Confidence(X → Y) / SUP(Y) = SUP(XY) / (SUP(X) SUP(Y))    (4)

CF [54] is interpreted as a measure of the variation of the probability that Y is in a transaction when we consider only those transactions where X is present. The use of this measure prevents the discovery of misleading rules that are not detected by confidence.
Its domain is [-1, 1], where values less than 0 represent negative dependence, 0 represents independence and values higher than 0 represent positive dependence. CF for a rule X → Y is defined in three ways, depending on whether Confidence(X → Y) is greater than, less than or equal to SUP(Y):

If Confidence(X → Y) > SUP(Y):

CF(X → Y) = (Confidence(X → Y) − SUP(Y)) / (1 − SUP(Y))    (5)

If Confidence(X → Y) < SUP(Y):

CF(X → Y) = (Confidence(X → Y) − SUP(Y)) / SUP(Y)    (6)

Otherwise:

CF(X → Y) = 0    (7)

The netconf [3] measure evaluates the interest of association rules based on the support of the rule and the supports of its antecedent and consequent. As with the CF measure, netconf can detect misleading rules that are not detected by confidence. This measure obtains values in [-1, 1], where positive values represent positive dependence, negative values represent negative dependence, and 0 represents independence. Netconf for a rule X → Y is defined as:

Netconf(X → Y) = (SUP(XY) − SUP(X) SUP(Y)) / (SUP(X) (1 − SUP(X)))    (8)

2.2. Multi-objective evolutionary algorithms for general purposes

EAs simultaneously deal with a set of possible solutions (the so-called population), which enables them to find several members of the Pareto optimal set in a single run of the algorithm. Additionally, they are not too susceptible to the shape or continuity of the Pareto front (e.g., they can easily deal with discontinuous and concave Pareto fronts) [58, 67].
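As an illustration of the quality measures defined in subsection 2.1 above, the following sketch computes lift, conviction, CF and netconf from relative supports in [0, 1]; the function names and signatures are our own conventions, not the authors' implementation.

```python
# Interest measures from relative supports sup_x = SUP(X), sup_y = SUP(Y),
# sup_xy = SUP(XY), all in [0, 1] (illustrative sketch).

def lift(sup_x, sup_y, sup_xy):
    # Eq. 4: confidence of the rule over the consequent's support.
    return (sup_xy / sup_x) / sup_y

def conviction(sup_x, sup_y, sup_xy):
    # Eq. 3: SUP(X)SUP(~Y) / SUP(X~Y), with SUP(X~Y) = SUP(X) - SUP(XY).
    sup_x_not_y = sup_x - sup_xy
    return float("inf") if sup_x_not_y == 0 else sup_x * (1 - sup_y) / sup_x_not_y

def certainty_factor(sup_x, sup_y, sup_xy):
    # Eqs. 5-7: variation of the probability of Y when X is present.
    conf = sup_xy / sup_x
    if conf > sup_y:
        return (conf - sup_y) / (1 - sup_y)
    if conf < sup_y:
        return (conf - sup_y) / sup_y
    return 0.0

def netconf(sup_x, sup_y, sup_xy):
    # Eq. 8.
    return (sup_xy - sup_x * sup_y) / (sup_x * (1 - sup_x))

# Statistically independent items (sup_xy = sup_x * sup_y) score as
# "no dependence" under all three bounded-or-ratio measures:
print(lift(0.5, 0.4, 0.2))
print(certainty_factor(0.5, 0.4, 0.2))
print(netconf(0.5, 0.4, 0.2))
```

Note how, unlike confidence (here 0.4), all three measures flag the independent case, which is the drawback of confidence discussed above.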

Reference   MOEA                          Generation
[26]        MOGA                          1st
[37]        NPGA                          1st
[57]        NSGA                          1st
[19]        NSGA-II                       2nd
[23]        NPGA 2                        2nd
[40]        PAES                          2nd
[16, 17]    PESA & PESA-II                2nd
[69, 70]    SPEA & SPEA2                  2nd
[15]        Micro-GA                      2nd
[66, 43]    MOEA/D & MOEA/D-DE            2nd
[22]        Hybrid MOEAs                  2nd
[68]        Indicator-based MOEAs         2nd
[41]        Memetic MOEAs                 2nd
[20]        MOEAs based on coevolution    2nd
[26, 59]    MOEAs based on reference      2nd

Table 1: Classification of MOEAs

The first work to hint at the possibility of using EAs to solve a multi-objective problem was a Ph.D. thesis of 1967 [51] in which, however, no actual MOEA was developed (the multi-objective problem was restated as a single-objective problem and solved with a GA). David Schaffer is normally considered to be the first to have designed an MOEA, in the mid-1980s [52]. Schaffer's approach, called the Vector Evaluated Genetic Algorithm (VEGA), consists of a simple GA with a modified selection mechanism. However, VEGA had a number of problems, the main one being its inability to retain solutions with acceptable performance in all objectives; although perhaps above average, they were not outstanding for any of the objective functions. After VEGA, researchers designed a first generation of MOEAs characterized by their simplicity, whereby the main lesson learned was that successful MOEAs had to combine a good mechanism to select nondominated individuals (perhaps, but not necessarily, based on the concept of Pareto optimality) with a good mechanism to maintain diversity (fitness sharing was one possibility, but not the only one). The most representative MOEAs of this generation are the following: the Nondominated Sorting Genetic Algorithm (NSGA) [57], the Niched-Pareto Genetic Algorithm (NPGA) [37] and the Multi-Objective Genetic Algorithm (MOGA) [26]. A second generation of MOEAs began when elitism became a standard mechanism. In fact, the use of elitism is a theoretical requirement in order to guarantee the convergence of an MOEA. Many MOEAs have been proposed during the second generation.
However, most researchers would agree that few of these approaches have been adopted as a reference or have been used by others. In this way, the Strength Pareto Evolutionary Algorithm 2 (SPEA2) [69] and NSGA-II [19] can be considered the most representative MOEAs of the second generation, with others, such as the Pareto Archived Evolution Strategy (PAES) [40], the Multi-objective Evolutionary Algorithm Based on Decomposition (MOEA/D and MOEA/D-DE) [66, 43], MOEAs based on reference [26, 59], indicator-based MOEAs [68], hybrid MOEAs [22], memetic MOEAs [41] and MOEAs based on coevolution [20], also of interest. Table 1 shows a summary of the most representative MOEAs of both generations. Finally, it should be noted that today NSGA-II is a paradigm within the MOEA research community, as the powerful crowding operator that this algorithm uses usually enables it to obtain the widest Pareto sets in a wide variety of problems, a much appreciated property within this framework.

3. A new multi-objective algorithm to mine quantitative association rules: QAR-CIP-NSGA-II

In this section, we will describe our proposal to obtain a set of QARs with a good trade-off between interpretability and accuracy considering three objectives: comprehensibility, interestingness and performance. This model considers the use of the NSGA-II algorithm [19] to perform an evolutionary learning of the rules and introduces two new components to its evolutionary model: an EP and a restarting process. In the following, we will explain all of their characteristics in detail (see subsections 3.1 to 3.6) and present a flowchart of the algorithm (see subsection 3.7).

3.1. Evolutionary multi-objective model

This proposal extends the well-known NSGA-II algorithm and introduces an EP and a restarting process to the evolutionary model in order to store all the nondominated rules found, to promote diversity in the population and to improve the coverage of the datasets. The EP keeps all the nondominated rules found and is updated at the end of each generation with the nondominated rules of the current population. Redundant nondominated rules are removed from the EP in order to avoid overlapping rules. A rule is considered redundant if the intervals of all its variables are contained within the intervals of the variables of another rule. Moreover, the size of the EP is not limited, in order to obtain a larger number of rules of the Pareto front and to allow a smaller population size (independently of the size of the problem), which helps to better control the method's convergence. In order to promote diversity in the population, the restarting process is applied when the number of new individuals of the population in one generation is less than α% of the size of the current population (with α determined by the user, usually 5%). In this case, the examples covered by the rules in the EP are marked and the initialization process of the population is applied again using the uncovered examples (see subsection 3.6), allowing us to perform a good exploration of the search space. Finally, the EP is updated with the new population. Notice that both components are complementary. The restarting process uses the examples uncovered by the rules in the EP to generate the new population. In addition, the EP keeps all the nondominated rules found until the last moment, preventing the rules from being removed when the restarting process restarts the whole population. With these modifications, the evolutionary model would be as follows.
First, our proposal generates an initial population and initializes the EP with the nondominated rules from the initial population. Then an offspring population is generated from the current population by selection, crossover and mutation. The next population is constructed from the current and offspring populations, the EP is updated with the current population and, when the number of new individuals in the next population is less than α%, the restarting process is applied. This process is iterated until a stopping condition is satisfied. The NSGA-II algorithm has two features which make it a high-performance MOEA. One is the fitness evaluation of each solution based on the Pareto ranking and a crowding measure, and the other is an elitist generation update procedure. Each solution in the current population is evaluated in the following manner. First, Rank 1 is assigned to all nondominated solutions in the current population. All solutions with Rank 1 are tentatively removed from the current population. Next, Rank 2 is assigned to all nondominated solutions in the reduced current population. All solutions with Rank 2 are tentatively removed from the reduced current population. This procedure is iterated until all solutions are tentatively removed from the current population (i.e., until ranks are assigned to all solutions). As a result, a rank is assigned to each solution. Solutions with smaller ranks are viewed as being better than those with larger ranks. Among solutions with the same rank, an additional criterion called a crowding measure is taken into account. The crowding measure for a solution calculates the distance between the adjacent solutions with the same rank in the objective space. Less crowded solutions with larger values of the crowding measure are viewed as being better than more crowded solutions with smaller values of the crowding measure.
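The ranking-plus-crowding evaluation described above can be sketched as follows, assuming all objectives are maximized as in this proposal; this is a minimal illustration, not the authors' implementation.

```python
# Iterative Pareto ranking and crowding distance (illustrative sketch;
# solutions are tuples of maximized objective values).

def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(points):
    """Assign Rank 1 to nondominated points, remove them, and repeat."""
    ranks, remaining, rank = {}, set(range(len(points))), 1
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks

def crowding_distance(front_points):
    """Per-objective distance between each solution's neighbours in one front."""
    n, m = len(front_points), len(front_points[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front_points[i][k])
        span = front_points[order[-1]][k] - front_points[order[0]][k] or 1.0
        dist[order[0]] = dist[order[-1]] = float("inf")  # boundary solutions
        for pos in range(1, n - 1):
            dist[order[pos]] += (front_points[order[pos + 1]][k]
                                 - front_points[order[pos - 1]][k]) / span
    return dist

pts = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4)]
print(pareto_ranks(pts))  # (0.4, 0.4) is dominated by (0.5, 0.5)
```

Boundary solutions of each front receive an infinite crowding distance, so the extremes of the front are always preferred over interior, crowded solutions.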
A pair of parent solutions is selected from the current population by binary tournament selection based on the Pareto ranking and the crowding measure. When the next population is to be constructed, the current and offspring populations are combined into a merged population. Each solution in the merged population is evaluated in the same manner as in the selection phase of parent solutions, using the Pareto ranking and the crowding measure. The next population is constructed by choosing a specified number (i.e., the population size) of the best solutions from the merged population.

3.2. Coding scheme and initial gene pool

Each chromosome is a vector of genes that represent the attributes and intervals of the rule. We have used a positional encoding, where the i-th attribute is encoded in the i-th gene. To combine the condition selection with the learning of the intervals, each gene consists of three parts. The first part (ac) indicates whether and how the attribute is involved in the rule: when this part is -1, the attribute is not involved in the rule, and when it is 0 or 1 the attribute is part of the antecedent or consequent of the rule, respectively. All genes that have 0 in their first part will form the antecedent of the rule, while genes that have 1 will form the consequent of the rule.

The second part represents the lower bound (lb) of the interval of the attribute. The third part represents the upper bound (ub) of the interval of the attribute. Notice that lb and ub will be equal in the intervals of nominal attributes. Finally, a chromosome C_T is coded in the following way, where m is the number of attributes in the dataset:

Gene_i = (ac_i, lb_i, ub_i), i = 1, ..., m
C_T = Gene_1 Gene_2 ... Gene_m

Figure 1 shows the scheme of a chromosome.

Figure 1: Scheme of a chromosome

In order to prevent the intervals from growing until they span the whole domain, we define amplitude as the maximum size that the interval of a given attribute can attain. Thus, the amplitude of an attribute i is defined as:

Amplitude_i = (Max_i − Min_i) / δ    (9)

where Max_i and Min_i are the maximum and minimum values of the domain of attribute i, respectively, and δ is a value given by the system expert that determines the trade-off between the generalization and specificity of the rules. The initial population will consist of a rule set with good coverage of the dataset and with only one attribute in the consequent, since in this paper we only consider these kinds of rules (although this coding scheme allows us to deal with more than one attribute in the consequent). To generate the initial population, first we select at random the attributes that will be part of the antecedent and consequent of the rules (at least one attribute will be selected for each). Then we select at random an unmarked pattern from the dataset and generate the interval of each attribute with a size equal to 50% of the Amplitude_i of each attribute and with the values of the selected pattern in the center of each interval. If some bound of an interval exceeds the domain of the attribute, it is replaced by the corresponding bound of the domain.
Finally, the patterns covered by this rule are marked in the dataset. This process is repeated until the initial population is completed. Notice that, if all the patterns are marked and the initial population is not completed, then all the patterns will be unmarked again and the process will be repeated until the initial population is completed. For instance, let us consider a simple dataset with three attributes X_1, X_2 and X_3, six training patterns and δ = 2. Table 2 shows the six training patterns, the lower and upper bounds of the domain of each attribute and 50% of the Amplitude_i of each attribute.

Table 2: The six patterns in this example (rows ID1 to ID6 with their values for X_1, X_2 and X_3, together with the lower and upper bounds of each attribute's domain and 50% of its Amplitude_i)

Let us suppose that we select at random the pattern ID3 and the attributes X_1 and X_2 for

the antecedent and consequent of the rule, respectively. In this iteration, the rule X_1 ∈ [0.77, 1.0] → X_2 ∈ [10, 17] is generated, calculating its intervals as

lb_i = max{v_i − A_i/2, Min_i}    ub_i = min{v_i + A_i/2, Max_i}

where v_i is the value of pattern ID3 for attribute i and A_i is 50% of the Amplitude_i of attribute i, which gives lb_1 = 0.77, ub_1 = 1.0, lb_2 = 10, ub_2 = 17 and lb_3 = 3.2. The 50% of Amplitude_i is divided by 2 in order to set the value of the pattern ID3 in the center of the generated intervals. Notice that ub_1 is 1.0 because, when a bound of an interval exceeds the domain of the attribute, it is replaced by the upper/lower bound of the domain. Fig. 2 shows the whole chromosome generated in this example.

Figure 2: Chromosome obtained for the example

This rule covers the training patterns ID1 and ID3 in Table 2. In this situation, these patterns are marked in the dataset. The EP is initialized with the nondominated rules of the initial population.

3.3. Objectives

Three objectives are maximized for this problem: interestingness, comprehensibility and performance. Performance is the product of CF and support (see subsection 2.1), which allows us to obtain accurate rules with a good trade-off between local and general rules. Notice that we are interested only in very strong rules [10], which represent a positive dependence between items and avoid the support drawback. Thus, a rule X → Y must verify:

CF(X → Y) > 0
Support(X → Y) > minsup
Support(X → Y) ≤ (1 − minsup)

This measure can obtain values in the interval [0, 1]. A rule with a performance value near 1 may be more useful to the user. Interestingness measures how interesting the rule is, which allows us to extract only those rules that may be of interest to the users. In this case, we have used the interestingness measure lift [50] (see subsection 2.1). Finally, comprehensibility tries to quantify the ease with which the rule can be understood [24].
The generated rules may involve a large number of attributes, making them difficult to understand. If the generated rules are not understandable, the user will never use them. Here, the comprehensibility of a rule X → Y is measured by the number of attributes involved in the rule and is defined as:

Comprehensibility(X → Y) = 1 / Attr(X → Y)    (10)

where Attr(X → Y) is the number of attributes involved in the antecedent of the rule, since in this paper we only consider rules with one attribute in the consequent.
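The interval seeding of subsection 3.2 (Eq. 9) and the three objectives of this subsection can be sketched together as follows. The function names, the use of relative supports in [0, 1] and the reading of the very-strong-rule filter as Support ≤ 1 − minsup are our own assumptions, not the authors' code.

```python
# Interval seeding (Eq. 9) and rule evaluation (illustrative sketch).

def init_gene(ac, value, min_i, max_i, delta):
    """Interval of size 50% of Amplitude_i centred on a seed pattern value."""
    half = 0.5 * ((max_i - min_i) / delta) / 2  # Amplitude_i = (Max - Min)/delta
    return (ac, max(value - half, min_i), min(value + half, max_i))

def certainty_factor(conf, sup_y):
    # Eqs. 5-7.
    if conf > sup_y:
        return (conf - sup_y) / (1 - sup_y)
    if conf < sup_y:
        return (conf - sup_y) / sup_y
    return 0.0

def objectives(sup_x, sup_y, sup_xy, n_antecedent_attrs):
    """(performance, interestingness, comprehensibility) for one rule."""
    conf = sup_xy / sup_x
    performance = certainty_factor(conf, sup_y) * sup_xy  # CF * support
    interestingness = conf / sup_y                        # lift
    comprehensibility = 1.0 / n_antecedent_attrs          # Eq. 10
    return performance, interestingness, comprehensibility

def is_very_strong(sup_x, sup_y, sup_xy, minsup):
    """Our reading of the very-strong-rule filter (an assumption)."""
    conf = sup_xy / sup_x
    return certainty_factor(conf, sup_y) > 0 and minsup < sup_xy <= 1 - minsup

# Seeding near the top of a [0.0, 1.0] domain clips the upper bound,
# as in the worked example of subsection 3.2:
print(init_gene(0, 0.9, 0.0, 1.0, 2))  # (0, 0.775, 1.0)
print(objectives(0.5, 0.4, 0.4, 2))
```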

3.4. Genetic operators

The crossover operator generates two offspring by randomly interchanging the genes of the parents (exploration). Figure 3 shows a simple example of the performance of this operator.

Figure 3: A simple example of the crossover operator

The mutation operator randomly modifies the interval (lb and ub) and the ac part of a gene selected at random. It selects one of the bounds of the interval at random and randomly increases or decreases its value, taking particular care not to surpass the fixed value of the amplitude. In that sense, the interval is modified in a way similar to that used in the initialization process. The value for ac is randomly selected from the set {-1, 0, 1}.

3.5. Repairing operator

After the mutation operator is applied, if any rule has no antecedent or consequent, or has more than one attribute in the consequent, a repairing operator is used to modify it. If there is more than one attribute in the consequent, one of them is randomly selected as the consequent and the remaining attributes are passed to the antecedent, in order to maintain rules with only one attribute in the consequent. If there is no attribute in the antecedent and/or consequent, these are randomly selected from among the attributes not involved in the rule. Finally, the sizes of the intervals are decreased until the number of patterns covered would become smaller than the number of patterns covered by the original intervals, in order to obtain simpler rules.

3.6. Restarting process

To get away from local optima, this algorithm uses a restarting process. This process marks all the patterns covered by the rules in the EP and applies the initialization process to the population again (see subsection 3.2) in order to restart the population from the examples uncovered by the rules in the EP. Moreover, the EP is updated with the new population following the non-dominance criteria.
This restarting process is applied when the number of new individuals of the population in one generation is less than α% of the size of the current population.

3.7. Flowchart of the algorithm

According to the above description, the proposed algorithm for mining QARs can be summarized in the following flowchart.

Input: population size N, number of evaluations ntrials, probability of mutation Pmut, factor of amplitude for each attribute of the dataset δ, difference threshold α.
Output: EP

Step 1: Initialize.
(a) Generate the initial population (Pt) with N chromosomes.
(b) Evaluate the initial population.
(c) Generate all non-dominated fronts F = (F1, F2, ...) of the initial population and calculate the crowding distance in each Fi.
(d) Initialize the EP.

Step 2: Generate the offspring population (Qt) as follows.
(a) Select a pair of parent solutions by binary tournament selection based on the Pareto ranking and the crowding measure.

(b) Each pair is crossed, generating two offspring; this operator interchanges the genes of the parents. Next, the mutation and repairing operators are applied to the two offspring.
(c) Evaluate the new individuals.

Step 3: Generate the next population (Pt+1) as follows.
(a) Create the merged population from Pt and Qt.
(b) Generate all non-dominated fronts F = (F1, F2, ...) of the merged population and calculate the crowding distance in each Fi.
(c) Create Pt+1 with the best chromosomes from the merged population using the non-dominated fronts and crowding distance:
    Include the i-th non-dominated front in Pt+1.
    Check the next front for inclusion.
    Sort in descending order using the crowded-comparison operator.
    Choose the first (N − |Pt+1|) elements of Fi.

Step 4: Update the EP, following the non-dominance criteria.

Step 5: If the difference between the current population and the previous population is less than α%, restart the population.

Step 6: If the maximum number of evaluations is not reached, go to Step 2.

Step 7: Remove redundancy in the EP: delete the chromosomes that are subchromosomes of others. A subchromosome is one in which the intervals of all its genes are contained within the intervals of the genes of another chromosome.

Step 8: The EP is returned.

4. Experimental analysis

Several experiments have been carried out in this paper to analyze the performance of our proposal. In order to present them, this section is organized as follows:

1. In subsection 4.1, we describe the real-world datasets that are used in these experiments.
2. In subsection 4.2, we introduce a brief description of the algorithms considered for comparison and we show the configuration of the algorithms (determining all the parameters used).
3. In subsection 4.3, we show an analysis of the new components introduced to the NSGA-II [19] evolutionary model.
4.
In subsection 4.4, we compare the performance of our approach with four mono-objective evolutionary approaches (EARMGA [63], GAR [46], GENAR [45] and Alatasetal [4]) and three MOEAs (ARMMGA [49], MODENAR [5] and MOEA Ghosh [33]) for mining QARs.
5. In subsection 4.5, we compare our approach with two classical association rule extraction algorithms: Apriori [11, 56] and Eclat [64].
6. In subsection 4.6, we analyze the scalability of the proposed approach.
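Before turning to the experiments, Step 7 of the flowchart in subsection 3.7 (subchromosome-based redundancy removal in the EP) can be sketched as follows; the list-of-genes representation and function names are illustrative assumptions.

```python
# Redundancy removal in the EP: a chromosome is a subchromosome of another
# if, for every gene, the roles match and its interval is contained within
# the other's interval (illustrative sketch).

def is_subchromosome(a, b):
    """True if rule `a` is redundant with respect to rule `b`."""
    return all(ac_a == ac_b and lb_b <= lb_a and ub_a <= ub_b
               for (ac_a, lb_a, ub_a), (ac_b, lb_b, ub_b) in zip(a, b))

def remove_redundant(ep):
    """Keep only chromosomes that are not subchromosomes of any other."""
    return [r for i, r in enumerate(ep)
            if not any(i != j and is_subchromosome(r, other)
                       for j, other in enumerate(ep))]

ep = [[(0, 0.2, 0.4), (1, 10, 20)],   # contained in the rule below
      [(0, 0.1, 0.5), (1, 10, 25)],
      [(0, 0.6, 0.9), (1, 10, 20)]]
print(len(remove_redundant(ep)))  # 2
```

Note that, as written, two identical chromosomes would eliminate each other; a full implementation would keep one copy of each duplicate.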

Names                 Attributes (R/I/N)   Patterns
Balance Scale (bal)    4 (4/0/0)             625
Basketball (bask)      5 (3/2/0)              96
Bolts (bol)            8 (2/6/0)              40
House 16H (hh)        17 (10/7/0)
Solar Flare (fla)     11 (0/0/11)            1066
Pollution (pol)       16 (16/0/0)              60
Quake (qua)            4 (3/1/0)             2178
Stock Price (sto)     10 (10/0/0)             950
Stulong (stu)          5 (5/0/0)             1419

Table 3: Datasets considered for the experimental study

4.1. Datasets

In order to analyze the performance of the proposed approach, we have considered 9 real-world datasets. Table 3 summarizes the main characteristics of the 9 datasets, which are available in the KEEL-dataset repository [7], from which they can be downloaded, where Attributes (R/I/N) is the number of (Real/Integer/Nominal) attributes in the data and Patterns is the number of patterns. To develop the different experiments, we consider the average results of 5 runs for each dataset.

4.2. Association rule extraction algorithms considered for the comparison and parameters of the algorithms

In these experiments, we compare the proposed approach with nine other algorithms, which are available in the KEEL software tool [9]. A brief description of these algorithms follows.

1. Apriori [11, 56]: The main aim of Apriori is to exploit the search space by means of the downward closure property. The latter states that any subset of a frequent itemset must also be frequent. As a consequence, it generates candidates for the current iteration by means of the itemsets considered frequent at the previous iteration. Then it enumerates all the subsets for each transaction and increments the support of the candidates matching them. Those reaching the user-specified minsup are marked as frequent for the next iteration. This process is repeated until all the frequent itemsets have been found. Thus, Apriori follows a breadth-first strategy to generate candidates. Finally, Apriori uses the frequent itemsets to generate rules with a confidence greater than minconf.

2. Eclat [64]: Eclat employs a depth-first strategy.
It generates candidates by extending the prefixes of an itemset until an infrequent one is found, in which case it simply backtracks to the previous prefix and recursively applies the above procedure. Unlike Apriori, the support counting is achieved by adopting a vertical layout: for each item in the dataset, it first constructs the list of identifiers of the transactions containing that item (its tid-list). It then counts the support by intersecting two or more tid-lists and checking whether they have transactions in common; if this is the case, the support is equal to the size of the resulting set. The process for generating the rules is the same as in Apriori.

3. Evolutionary Association Rules Mining with Genetic Algorithm (EARMGA) [63]: This algorithm uses a GA to identify QARs and does not require a user-specified threshold for minsup. Each chromosome encodes a generalized k-rule, where k indicates the desired length. The most interesting rules are returned according to the interestingness measure defined by the fitness function, which is based on the support of the rule and the supports of its antecedent and consequent.

4. GENetic Association Rules (GENAR) [45]: This algorithm mines association rules in numeric datasets by using a GA. Each chromosome encodes an association rule, containing the maximum and minimum interval bounds of each numeric attribute. The length of the rules is always fixed to the number of attributes, and only the last attribute forms the consequent. The objective function considers the number of records covered by the rule and penalizes rules that cover records already covered by other rules in the dataset.

5. Genetic Association Rules (GAR) [46]: This algorithm is an extension of GENAR [45] that searches for frequent itemsets in numeric datasets without needing to discretize the attributes. Each chromosome is a k-itemset, where each gene represents the maximum and minimum values of an attribute belonging to the k-itemset. Since this algorithm only finds frequent itemsets, another procedure must be run afterwards in order to generate the association rules.

6. Genetic algorithm for automated mining of both positive and negative QARs (Alatasetal) [4]: This algorithm designs a GA to simultaneously search for intervals of quantitative attributes and to discover, in a single run, the positive and negative QARs that these intervals conform to. The chromosomes represent rules, in which each gene has four parts: the first indicates whether the attribute belongs to the antecedent or the consequent of the rule, the second whether the rule is a positive or a negative AR, and the third and fourth represent the lower and upper bounds of the item interval, respectively. The proposed GA follows a dataset-independent approach that does not rely upon the minsup and minconf thresholds.

7. Multi-objective association rules with genetic algorithms (ARMMGA) [49]: This algorithm is an MOEA based on the EARMGA algorithm that mines QARs without taking minsup and minconf into account. According to its authors, the most important aspect of this algorithm is that its fitness function only specifies the order of the chromosomes in the population and has no other effect on the GA operators; this order is used as a selection criterion. The algorithm returns the rules found with the best correlation between support and confidence.

8. Multi-objective differential evolution algorithm for mining numeric association rules (MODENAR) [5]: This algorithm uses a multi-objective differential evolution algorithm based on Alatasetal [4] to mine accurate and comprehensible QARs without specifying minsup and minconf.
It uses the same coding scheme for the chromosomes as Alatasetal, but without the second part. MODENAR weights four objectives that assess the quality of the rules: support, confidence, comprehensibility and the amplitude of the intervals of the attributes.

9. Multi-objective rule mining using genetic algorithms (MOEA Ghosh) [33]: This algorithm uses a Pareto-based GA to extract useful and interesting rules from any dataset. Each chromosome represents an association rule, in which the bits associated with each attribute indicate whether the attribute belongs to the antecedent or the consequent, whether the attribute is present or absent, and the relational operators involved with the attribute. It uses three measures, comprehensibility, interestingness and predictive accuracy, to solve the multi-objective rule mining problem. A separate population is maintained, containing the chromosomes that are non-dominated amongst both the current population and the non-dominated solutions of the previous generation.

The parameters of the analyzed algorithms are presented in Table 4. For our proposal, we have tried to facilitate comparisons by selecting standard common parameters that work well in most cases instead of searching for very specific values; in particular, we have set δ to 2.0 for all datasets rather than tuning it for each one. The parameters of the remaining algorithms were selected according to the recommendations of the corresponding authors of each proposal, which are the default parameter settings included in the KEEL software tool [9]. Notice that the length of the rules for EARMGA and ARMMGA is higher in the dataset House 16H, because the number of attributes and transactions is higher in this problem.
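The two support-counting strategies described above for Apriori and Eclat can be contrasted in a few lines. The sketch below, on a toy transaction set (not one of the benchmark datasets), counts the support of an itemset horizontally by scanning transactions and vertically by intersecting tid-lists, and derives a rule's confidence from the two supports:

```python
# Toy transaction database: each transaction is a set of items.
transactions = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c"},
]

def support_horizontal(itemset, transactions):
    """Apriori-style counting: scan every transaction and count matches."""
    return sum(1 for t in transactions if itemset <= t)

def tid_list(item, transactions):
    """Vertical layout: the set of ids of the transactions containing the item."""
    return {tid for tid, t in enumerate(transactions) if item in t}

def support_vertical(itemset, transactions):
    """Eclat-style counting: intersect the tid-lists of the items."""
    tids = None
    for item in itemset:
        cur = tid_list(item, transactions)
        tids = cur if tids is None else tids & cur
    return len(tids)

# Both layouts yield the same support count.
itemset = {"a", "c"}
assert support_horizontal(itemset, transactions) == support_vertical(itemset, transactions)

# Confidence of the rule {a} -> {c}: sup({a, c}) / sup({a}) = 2/3.
conf = support_vertical({"a", "c"}, transactions) / support_vertical({"a"}, transactions)
print(conf)
```

Either counting scheme feeds the same rule-generation step: a rule is kept when its confidence exceeds minconf.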
Finally, for all the experiments conducted in this study, the results shown in the tables for the multi-objective algorithms always refer to non-dominated association rules.

4.3. Analysis of the new components introduced to the evolutionary multi-objective algorithm

In this section we study the performance of our proposal against the classical NSGA-II approach (QAR-CIP-NSGA-II Classic) in order to analyze the performance of the new components introduced to the NSGA-II evolutionary model: the EP and the restarting process. The results obtained by the analyzed algorithms are shown in Table 5, where #R is the number of association rules generated; Av_Sup and Av_Conf are, respectively, the average support and the average confidence of the rules; Av_Lift, Av_Conv, Av_CF and Av_NetConf are the average values of the measures lift, conviction, CF and netconf of the rules; Av_Amp is the average length of the rules in terms of the attributes involved; Av_HyperVol is the average value of the hypervolume measure [71]; and %Tran is the percentage of the transactions in the dataset covered by the rules. The hypervolume values have been calculated using the R package emoa¹ [47], which implements the algorithm proposed by Fonseca et al. in [27]. The values shown in the table represent the maximum value for these measures.

Algorithm         Parameters
Apriori           minsup = 0.1, minconf = 0.8
Eclat             minsup = 0.1, minconf = 0.8
Alatasetal        N_eval = 50000, nInitialRandomChromo = 12, r = 3, TournamentSize = 10, P_sel = 0.25, P_cro = 0.7, P_mut_min = 0.05, P_mut_max = 0.9, W_sup = 5, W_conf = 20, W_amplRule = 0.05, W_amplInterv = 0.02, W_covered = 0.01
EARMGA            PopSize = 100, N_eval = 50000, k = 2 (3 with HH), P_sel = 0.75, P_cro = 0.7, P_mut = 0.1, α = 0.01
GAR               PopSize = 100, nItemset = 100, N_eval = 50000, P_sel = 0.25, P_cro = 0.7, P_mut = 0.1, ω = 0.4, Ψ = 0.7, µ = 0.5, minsup = 0.1, minconf = 0.8
GENAR             PopSize = 100, N_eval = 50000, P_sel = 0.25, P_cro = 0.7, P_mut = 0.1, nRules = 30, FP = 0.7, AF = 0.2
ARMMGA            PopSize = 100, N_eval = 50000, k = 2 (3 with HH), P_sel = 0.95, P_cro = 0.85, P_mut = 0.01, db = 0.01
MODENAR           PopSize = 100, N_eval = 50000, Threshold = 60, CR = 0.3, W_sup = 0.8, W_conf = 0.2, W_comp = 0.1, W_amplInterv = 0.4
MOEA Ghosh        PopSize = 100, N_eval = 50000, PointCrossover = 2, P_cro = 0.8, P_mut = 0.02
QAR-CIP-NSGA-II   PopSize = 100, N_eval = 50000, P_mut = 0.1, δ = 2, α = 5

Table 4: Parameters considered for the comparison
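In the study, hypervolume is computed with the R package emoa. As a concrete illustration of what the indicator measures, the two-objective minimization case reduces to a simple sweep over the front sorted by the first objective; the sketch below uses an invented toy front and reference point, and is not the n-dimensional algorithm of Fonseca et al. applied in the paper:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective minimization front with respect to a
    reference point ref: the area dominated by the front and bounded by ref.
    Assumes every point lies within the box defined by ref."""
    hv = 0.0
    prev_f2 = ref[1]
    # Sweep in increasing f1; dominated points contribute no new area.
    for f1, f2 in sorted(front):
        if f2 < prev_f2:
            hv += (ref[0] - f1) * (prev_f2 - f2)  # new rectangular strip
            prev_f2 = f2
    return hv

# Toy front of three non-dominated points, reference point (5, 5).
front = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, ref=(5.0, 5.0)))  # 12.0
```

A front whose points push further toward the ideal corner dominates a larger area, which is why the higher hypervolume values reported for QAR-CIP-NSGA-II indicate a better non-dominated set.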
We can present the following conclusions based on an analysis of the results presented in Table 5:

- The EP allows us to obtain a greater number of non-dominated rules of the Pareto front, since the number of rules is not limited by the size of the current population, with each rule providing interesting knowledge about the dataset.
- The restarting process together with the EP allows us to perform a good exploration of the search space, improving the coverage of the datasets.
- QAR-CIP-NSGA-II presents higher hypervolume values than the classic approach, obtaining a greater non-dominated area. Notice that these values are very high because the range of the measure lift is not bounded.
- The rules obtained by our proposal present improvements in almost all the interestingness measures and a similar or higher coverage in all the datasets, showing a positive synergy between the new components.

In order to assess whether significant differences exist among the results, we adopt statistical analysis [30, 29, 31] and, in particular, nonparametric tests, according to the recommendations made in [21]. To do this, we have considered 22 real-world datasets. The main characteristics of these datasets and the results obtained by the analyzed algorithms are presented in Appendix A. We decided to apply the statistical tests to the average results obtained for the interestingness measures lift, CF and netconf. Notice that we have not used the conviction measure because the algorithms obtain infinity in most of the datasets. In order to compare the two algorithms we use Wilcoxon's Signed-Ranks test [53, 62]. Wilcoxon's test is based on computing the differences between two sample means (typically, mean test errors obtained by a pair of different algorithms on different datasets). In the classification framework these differences are well defined since the errors are in the same domain.
In our case, in order to have well-defined differences in the interestingness measures used, we transform the mean value obtained for each measure into a normalized score MeanS, defined for each measure as follows.

¹ The emoa package is available from the Comprehensive R Archive Network (CRAN).

[Table 5: Results for all datasets in the comparison with classical NSGA-II. For each of the nine datasets it reports #R, Av_Sup, Av_Conf, Av_Lift, Av_Conv, Av_CF, Av_NetConf, Av_Amp, Av_HyperVol and %Tran for QAR-CIP-NSGA-II Classic and for QAR-CIP-NSGA-II; the numeric entries are omitted here.]

For the measures CF and netconf:

    MeanS = meanvalue²    if meanvalue < 0
    MeanS = meanvalue     otherwise

For the measure lift:

    MeanS = 1 - 1/meanvalue     if meanvalue > 1
    MeanS = (1 - meanvalue)²    if 0 <= meanvalue <= 1

where meanvalue represents the mean value obtained for each measure in a dataset. MeanS takes values in [0, 1], where the worst value (0) corresponds to the independence value of each measure (see subsection 2.1), since a rule at independence does not provide new knowledge to the user.

Table 6 shows the results of Wilcoxon's test for the three measures. The hypothesis of equality has been rejected in all cases with a very small p-value, and our proposal has achieved the highest rankings.

[Table 6: Results of Wilcoxon's Test (α = 0.05) in the comparison with classical NSGA-II. For each of the measures CF, netconf and lift, the comparison QAR-CIP-NSGA-II vs. QAR-CIP-NSGA-II Classic rejects the equality hypothesis; the R+, R- and p-value entries are omitted here.]
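Wilcoxon's Signed-Ranks test ranks the absolute per-dataset differences between the two paired samples (dropping zero differences and averaging ranks on ties) and sums the ranks of the positive and negative differences into R+ and R-. A minimal sketch of that ranking step, using invented MeanS values for five hypothetical datasets rather than the paper's results:

```python
def wilcoxon_ranks(x, y):
    """R+ and R- of Wilcoxon's signed-ranks test for two paired samples.
    Zero differences are dropped; tied absolute differences share the
    average of the ranks they span."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1                      # extend the group of tied |differences|
        avg = (i + j) / 2 + 1           # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return r_plus, r_minus

# Hypothetical MeanS values of two algorithms on five datasets
# (values chosen so the differences are exact in binary floating point).
a = [0.75, 0.875, 0.5, 1.0, 0.25]
b = [0.5, 0.75, 0.625, 0.25, 0.25]
print(wilcoxon_ranks(a, b))  # (8.5, 1.5)
```

A large imbalance between R+ and R-, as reported for our proposal, is what drives the rejection of the equality hypothesis at a small p-value.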

Figure 4: Pareto fronts obtained at different times of the evolutionary process in two datasets

Fig. 4 shows the Pareto fronts obtained by our proposal when the restarting process is applied at different times of the evolutionary process in two datasets (the solutions of a single trial). In this figure, we plot the solutions from QAR-CIP-NSGA-II in 3-D, together with their projections on all the possible objective planes. We have modified the objectives so that they can all be plotted as minimization objectives. In order to retain all the information, the dominated solutions that arise from the projections have not been removed. We can see how the Pareto front improves with successive applications of the restarting process. Moreover, it can easily be seen from these figures how the EP and the restarting process allow us to increase the number of non-dominated solutions after each restart.

4.4. Comparison with other mono-objective and multi-objective evolutionary approaches

This section analyzes the performance of our algorithm in comparison with four mono-objective algorithms (EARMGA [63], GAR [46], GENAR [45], Alatasetal [4]) and three MOEAs for mining QARs (ARMMGA [49], MODENAR [5], MOEA Ghosh [33]). The results obtained by the analyzed algorithms are shown in Tables 7 and 8 (this kind of table was described in subsection 4.3). From the analysis of the results presented in these tables, we can highlight the following facts:

- The values obtained by our proposal for the measures lift, CF and netconf are better than the values obtained by the analyzed algorithms in all the datasets, with values close to the best possible value that these measures can achieve, allowing us to obtain an interesting set of association rules.
- The proposed algorithm presents a good trade-off between all the measures that have been analyzed.
Moreover, the rule sets obtained have a low number of attributes, which makes them easier to understand from a user's perspective, together with a high coverage of the dataset.
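The non-dominated filtering that underlies both the EP and the reported rule sets reduces to a Pareto-dominance test over the objective vectors of the rules. A minimal sketch for three maximization objectives, with invented objective vectors for four hypothetical rules:

```python
def dominates(u, v):
    """u dominates v (maximization): u is no worse in every objective
    and strictly better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def non_dominated(points):
    """Return the points of the list not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Invented (comprehensibility, interestingness, performance) vectors
# for four candidate rules.
rules = [(0.9, 0.2, 0.5), (0.6, 0.6, 0.6), (0.5, 0.5, 0.5), (0.1, 0.9, 0.4)]
print(non_dominated(rules))  # (0.5, 0.5, 0.5) is dominated by (0.6, 0.6, 0.6)
```

In the actual algorithm this filter is applied incrementally as new rules enter the external population, so the EP always holds the current Pareto front without being capped at the population size.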

[Table 7: Results for the datasets Balance, Basketball, Bolts, Flare and House16H. For each dataset it reports #R, Av_Sup, Av_Conf, Av_Lift, Av_Conv, Av_CF, Av_NetConf, Av_Amp and %Tran for EARMGA, GAR, GENAR, Alatasetal, ARMMGA, MODENAR, MOEA Ghosh and QAR-CIP-NSGA-II; the numeric entries are omitted here.]

[Table 8: Results for the datasets Pollution, Quake, Stock and Stulong. For each dataset it reports #R, Av_Sup, Av_Conf, Av_Lift, Av_Conv, Av_CF, Av_NetConf, Av_Amp and %Tran for EARMGA, GAR, GENAR, Alatasetal, ARMMGA, MODENAR, MOEA Ghosh and QAR-CIP-NSGA-II; the numeric entries are omitted here.]


More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

METAHEURISTICS Genetic Algorithm

METAHEURISTICS Genetic Algorithm METAHEURISTICS Genetic Algorithm Jacques A. Ferland Department of Informatique and Recherche Opérationnelle Université de Montréal ferland@iro.umontreal.ca Genetic Algorithm (GA) Population based algorithm

More information

Data Mining Concepts

Data Mining Concepts Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms Sequential

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

minimizing minimizing

minimizing minimizing The Pareto Envelope-based Selection Algorithm for Multiobjective Optimization David W. Corne, Joshua D. Knowles, Martin J. Oates School of Computer Science, Cybernetics and Electronic Engineering University

More information

Genetic Algorithms Variations and Implementation Issues

Genetic Algorithms Variations and Implementation Issues Genetic Algorithms Variations and Implementation Issues CS 431 Advanced Topics in AI Classic Genetic Algorithms GAs as proposed by Holland had the following properties: Randomly generated population Binary

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 4: Mining Frequent Patterns, Associations and Correlations Chapter 4: Mining Frequent Patterns, Associations and Correlations 4.1 Basic Concepts 4.2 Frequent Itemset Mining Methods 4.3 Which Patterns Are Interesting? Pattern Evaluation Methods 4.4 Summary Frequent

More information

Multi-Objective Pipe Smoothing Genetic Algorithm For Water Distribution Network Design

Multi-Objective Pipe Smoothing Genetic Algorithm For Water Distribution Network Design City University of New York (CUNY) CUNY Academic Works International Conference on Hydroinformatics 8-1-2014 Multi-Objective Pipe Smoothing Genetic Algorithm For Water Distribution Network Design Matthew

More information

A Parallel Evolutionary Algorithm for Discovery of Decision Rules

A Parallel Evolutionary Algorithm for Discovery of Decision Rules A Parallel Evolutionary Algorithm for Discovery of Decision Rules Wojciech Kwedlo Faculty of Computer Science Technical University of Bia lystok Wiejska 45a, 15-351 Bia lystok, Poland wkwedlo@ii.pb.bialystok.pl

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information

Multi-Objective Optimization using Evolutionary Algorithms

Multi-Objective Optimization using Evolutionary Algorithms Multi-Objective Optimization using Evolutionary Algorithms Kalyanmoy Deb Department of Mechanical Engineering, Indian Institute of Technology, Kanpur, India JOHN WILEY & SONS, LTD Chichester New York Weinheim

More information

2 CONTENTS

2 CONTENTS Contents 5 Mining Frequent Patterns, Associations, and Correlations 3 5.1 Basic Concepts and a Road Map..................................... 3 5.1.1 Market Basket Analysis: A Motivating Example........................

More information

Analysis of Measures of Quantitative Association Rules

Analysis of Measures of Quantitative Association Rules Analysis of Measures of Quantitative Association Rules M. Martínez-Ballesteros and J.C. Riquelme Department of Computer Science, University of Seville, Spain {mariamartinez,riquelme}@us.es Abstract. This

More information

Improved S-CDAS using Crossover Controlling the Number of Crossed Genes for Many-objective Optimization

Improved S-CDAS using Crossover Controlling the Number of Crossed Genes for Many-objective Optimization Improved S-CDAS using Crossover Controlling the Number of Crossed Genes for Many-objective Optimization Hiroyuki Sato Faculty of Informatics and Engineering, The University of Electro-Communications -5-

More information

The Genetic Algorithm for finding the maxima of single-variable functions

The Genetic Algorithm for finding the maxima of single-variable functions Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 46-54 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.com The Genetic Algorithm for finding

More information

MULTI-OBJECTIVE EVOLUTIONARY ALGORITHMS FOR ENERGY-EFFICIENCY IN HETEROGENEOUS WIRELESS SENSOR NETWORKS

MULTI-OBJECTIVE EVOLUTIONARY ALGORITHMS FOR ENERGY-EFFICIENCY IN HETEROGENEOUS WIRELESS SENSOR NETWORKS MULTI-OBJECTIVE EVOLUTIONARY ALGORITHMS FOR ENERGY-EFFICIENCY IN HETEROGENEOUS WIRELESS SENSOR NETWORKS José M. Lanza-Gutiérrez, Juan A. Gómez-Pulido, Miguel A. Vega- Rodríguez, Juan M. Sánchez University

More information

INTERACTIVE MULTI-OBJECTIVE GENETIC ALGORITHMS FOR THE BUS DRIVER SCHEDULING PROBLEM

INTERACTIVE MULTI-OBJECTIVE GENETIC ALGORITHMS FOR THE BUS DRIVER SCHEDULING PROBLEM Advanced OR and AI Methods in Transportation INTERACTIVE MULTI-OBJECTIVE GENETIC ALGORITHMS FOR THE BUS DRIVER SCHEDULING PROBLEM Jorge PINHO DE SOUSA 1, Teresa GALVÃO DIAS 1, João FALCÃO E CUNHA 1 Abstract.

More information

Approximation-Guided Evolutionary Multi-Objective Optimization

Approximation-Guided Evolutionary Multi-Objective Optimization Approximation-Guided Evolutionary Multi-Objective Optimization Karl Bringmann 1, Tobias Friedrich 1, Frank Neumann 2, Markus Wagner 2 1 Max-Planck-Institut für Informatik, Campus E1.4, 66123 Saarbrücken,

More information

Association Rules. Berlin Chen References:

Association Rules. Berlin Chen References: Association Rules Berlin Chen 2005 References: 1. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 8 2. Data Mining: Concepts and Techniques, Chapter 6 Association Rules: Basic Concepts A

More information

REAL-CODED GENETIC ALGORITHMS CONSTRAINED OPTIMIZATION. Nedim TUTKUN

REAL-CODED GENETIC ALGORITHMS CONSTRAINED OPTIMIZATION. Nedim TUTKUN REAL-CODED GENETIC ALGORITHMS CONSTRAINED OPTIMIZATION Nedim TUTKUN nedimtutkun@gmail.com Outlines Unconstrained Optimization Ackley s Function GA Approach for Ackley s Function Nonlinear Programming Penalty

More information

The k-means Algorithm and Genetic Algorithm

The k-means Algorithm and Genetic Algorithm The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective

More information

Genetic Algorithms. Kang Zheng Karl Schober

Genetic Algorithms. Kang Zheng Karl Schober Genetic Algorithms Kang Zheng Karl Schober Genetic algorithm What is Genetic algorithm? A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization

More information

Data mining, 4 cu Lecture 6:

Data mining, 4 cu Lecture 6: 582364 Data mining, 4 cu Lecture 6: Quantitative association rules Multi-level association rules Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Data mining, Spring 2010 (Slides adapted

More information

Multi-Objective Optimization using Evolutionary Algorithms

Multi-Objective Optimization using Evolutionary Algorithms Multi-Objective Optimization using Evolutionary Algorithms Kalyanmoy Deb Department ofmechanical Engineering, Indian Institute of Technology, Kanpur, India JOHN WILEY & SONS, LTD Chichester New York Weinheim

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Chapter 2 Some Single- and Multiobjective Optimization Techniques 2.1 Introduction

Chapter 2 Some Single- and Multiobjective Optimization Techniques 2.1 Introduction Chapter 2 Some Single- and Multiobjective Optimization Techniques 2.1 Introduction Optimization deals with the study of those kinds of problems in which one has to minimize or maximize one or more objectives

More information

Assessing the Convergence Properties of NSGA-II for Direct Crashworthiness Optimization

Assessing the Convergence Properties of NSGA-II for Direct Crashworthiness Optimization 10 th International LS-DYNA Users Conference Opitmization (1) Assessing the Convergence Properties of NSGA-II for Direct Crashworthiness Optimization Guangye Li 1, Tushar Goel 2, Nielen Stander 2 1 IBM

More information

Mechanical Component Design for Multiple Objectives Using Elitist Non-Dominated Sorting GA

Mechanical Component Design for Multiple Objectives Using Elitist Non-Dominated Sorting GA Mechanical Component Design for Multiple Objectives Using Elitist Non-Dominated Sorting GA Kalyanmoy Deb, Amrit Pratap, and Subrajyoti Moitra Kanpur Genetic Algorithms Laboratory (KanGAL) Indian Institute

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding e Scientific World Journal, Article ID 746260, 8 pages http://dx.doi.org/10.1155/2014/746260 Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding Ming-Yi

More information

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM G.Amlu #1 S.Chandralekha #2 and PraveenKumar *1 # B.Tech, Information Technology, Anand Institute of Higher Technology, Chennai, India

More information

Pseudo-code for typical EA

Pseudo-code for typical EA Extra Slides for lectures 1-3: Introduction to Evolutionary algorithms etc. The things in slides were more or less presented during the lectures, combined by TM from: A.E. Eiben and J.E. Smith, Introduction

More information

Data Mining Part 3. Associations Rules

Data Mining Part 3. Associations Rules Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts Chapter 28 Data Mining Concepts Outline Data Mining Data Warehousing Knowledge Discovery in Databases (KDD) Goals of Data Mining and Knowledge Discovery Association Rules Additional Data Mining Algorithms

More information

THIS PAPER proposes a hybrid decoding to apply with

THIS PAPER proposes a hybrid decoding to apply with Proceedings of the 01 Federated Conference on Computer Science and Information Systems pp. 9 0 Biased Random Key Genetic Algorithm with Hybrid Decoding for Multi-objective Optimization Panwadee Tangpattanakul

More information

CS348 FS Solving NP-Complete Light Up Puzzle

CS348 FS Solving NP-Complete Light Up Puzzle CS348 FS2013 - Solving NP-Complete Light Up Puzzle Daniel Tauritz, Ph.D. October 7, 2013 Synopsis The goal of this assignment set is for you to become familiarized with (I) representing problems in mathematically

More information

Multiobjective Optimization

Multiobjective Optimization Multiobjective Optimization Concepts, Algorithms and Performance Measures Joshua Knowles School of Computer Science The University of Manchester COMP60342 - Week 5 2.15, 2 May 2014 Introducing Multiobjective

More information

Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm

Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm Classification of Concept-Drifting Data Streams using Optimized Genetic Algorithm E. Padmalatha Asst.prof CBIT C.R.K. Reddy, PhD Professor CBIT B. Padmaja Rani, PhD Professor JNTUH ABSTRACT Data Stream

More information

Discovering Knowledge Rules with Multi-Objective Evolutionary Computing

Discovering Knowledge Rules with Multi-Objective Evolutionary Computing 2010 Ninth International Conference on Machine Learning and Applications Discovering Knowledge Rules with Multi-Objective Evolutionary Computing Rafael Giusti, Gustavo E. A. P. A. Batista Instituto de

More information

Multiobjective Optimization Using Adaptive Pareto Archived Evolution Strategy

Multiobjective Optimization Using Adaptive Pareto Archived Evolution Strategy Multiobjective Optimization Using Adaptive Pareto Archived Evolution Strategy Mihai Oltean Babeş-Bolyai University Department of Computer Science Kogalniceanu 1, Cluj-Napoca, 3400, Romania moltean@cs.ubbcluj.ro

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

Evolutionary Algorithm for Embedded System Topology Optimization. Supervisor: Prof. Dr. Martin Radetzki Author: Haowei Wang

Evolutionary Algorithm for Embedded System Topology Optimization. Supervisor: Prof. Dr. Martin Radetzki Author: Haowei Wang Evolutionary Algorithm for Embedded System Topology Optimization Supervisor: Prof. Dr. Martin Radetzki Author: Haowei Wang Agenda Introduction to the problem Principle of evolutionary algorithm Model specification

More information

DEMO: Differential Evolution for Multiobjective Optimization

DEMO: Differential Evolution for Multiobjective Optimization DEMO: Differential Evolution for Multiobjective Optimization Tea Robič and Bogdan Filipič Department of Intelligent Systems, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia tea.robic@ijs.si

More information

An Interactive Evolutionary Multi-Objective Optimization Method Based on Progressively Approximated Value Functions

An Interactive Evolutionary Multi-Objective Optimization Method Based on Progressively Approximated Value Functions An Interactive Evolutionary Multi-Objective Optimization Method Based on Progressively Approximated Value Functions Kalyanmoy Deb, Ankur Sinha, Pekka Korhonen, and Jyrki Wallenius KanGAL Report Number

More information