Efficient Mining of Generalized Negative Association Rules

Similar documents
Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

A Decremental Algorithm for Maintaining Frequent Itemsets in Dynamic Databases *

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

Efficient Remining of Generalized Multi-supported Association Rules under Support Update

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

Mining Quantitative Association Rules on Overlapped Intervals

CS570 Introduction to Data Mining

Appropriate Item Partition for Improving the Mining Performance

A mining method for tracking changes in temporal association rules from an encoded database

Performance Analysis of Apriori Algorithm with Progressive Approach for Mining Data


Association Rule Mining. Introduction 46. Study core 46

Associating Terms with Text Categories

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

WIP: mining Weighted Interesting Patterns with a strong weight and/or support affinity

An Efficient Algorithm for Finding the Support Count of Frequent 1-Itemsets in Frequent Pattern Mining

Generating Cross level Rules: An automated approach

Improved Frequent Pattern Mining Algorithm with Indexing

Discovering interesting rules from financial data

Product presentations can be more intelligently planned

Salah Alghyaline, Jun-Wei Hsieh, and Jim Z. C. Lai

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

Association Pattern Mining. Lijun Zhang

Association Rule Mining from XML Data

AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Association Rules. Berlin Chen References:

Algorithm for Efficient Multilevel Association Rule Mining

An Algorithm for Frequent Pattern Mining Based On Apriori

A Further Study in the Data Partitioning Approach for Frequent Itemsets Mining

Web page recommendation using a stochastic process model

Data Structure for Association Rule Mining: T-Trees and P-Trees

An Efficient Reduced Pattern Count Tree Method for Discovering Most Accurate Set of Frequent itemsets

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Generation of Potential High Utility Itemsets from Transactional Databases

Temporal Weighted Association Rule Mining for Classification

Research of Improved FP-Growth (IFP) Algorithm in Association Rules Mining

Concurrent Processing of Frequent Itemset Queries Using FP-Growth Algorithm

Using Association Rules for Better Treatment of Missing Values

A Graph-Based Approach for Mining Closed Large Itemsets

Graph Based Approach for Finding Frequent Itemsets to Discover Association Rules

An Efficient Algorithm for finding high utility itemsets from online sell

Maintenance of the Prelarge Trees for Record Deletion

620 HUANG Liusheng, CHEN Huaping et al. Vol.15 this itemset. Itemsets that have minimum support (minsup) are called large itemsets, and all the others

Item Set Extraction of Mining Association Rule

Mining Temporal Association Rules in Network Traffic Data

An Algorithm for Mining Frequent Itemsets from Library Big Data

Comparing the Performance of Frequent Itemsets Mining Algorithms

Mining High Average-Utility Itemsets

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

A NEW ASSOCIATION RULE MINING BASED ON FREQUENT ITEM SET

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

ETP-Mine: An Efficient Method for Mining Transitional Patterns

AN IMPROVED GRAPH BASED METHOD FOR EXTRACTING ASSOCIATION RULES

Maintenance of fast updated frequent pattern trees for record deletion

Efficient Tree Based Structure for Mining Frequent Pattern from Transactional Databases

An Approximate Approach for Mining Recently Frequent Itemsets from Data Streams *

An Improved Apriori Algorithm for Association Rules

Mining Frequent Patterns with Counting Inference at Multiple Levels

Association Rules Mining using BOINC based Enterprise Desktop Grid

Mining of Web Server Logs using Extended Apriori Algorithm

DMSA TECHNIQUE FOR FINDING SIGNIFICANT PATTERNS IN LARGE DATABASE

Mining for Mutually Exclusive Items in. Transaction Databases

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

Study on Mining Weighted Infrequent Itemsets Using FP Growth

A Hierarchical Document Clustering Approach with Frequent Itemsets

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

Available online at ScienceDirect. Procedia Computer Science 45 (2015 )

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

EFFICIENT TRANSACTION REDUCTION IN ACTIONABLE PATTERN MINING FOR HIGH VOLUMINOUS DATASETS BASED ON BITMAP AND CLASS LABELS

A Modern Search Technique for Frequent Itemset using FP Tree

Mining Quantitative Maximal Hyperclique Patterns: A Summary of Results

Applying Objective Interestingness Measures. in Data Mining Systems. Robert J. Hilderman and Howard J. Hamilton. Department of Computer Science

Mining Negative Rules using GRD

An Algorithm for Interesting Negated Itemsets for Negative Association Rules from XML Stream Data

CSCI6405 Project - Association rules mining

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

SQL Based Frequent Pattern Mining with FP-growth

Tadeusz Morzy, Maciej Zakrzewicz

Discovery of Multi Dimensional Quantitative Closed Association Rules by Attributes Range Method

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint

Fast Algorithm for Mining Association Rules

A Technical Analysis of Market Basket by using Association Rule Mining and Apriori Algorithm

Mining Frequent Itemsets Along with Rare Itemsets Based on Categorical Multiple Minimum Support

Efficient Updating of Discovered Patterns for Text Mining: A Survey

FREQUENT PATTERN MINING IN BIG DATA USING MAVEN PLUGIN. School of Computing, SASTRA University, Thanjavur , India

STUDY ON FREQUENT PATTEREN GROWTH ALGORITHM WITHOUT CANDIDATE KEY GENERATION IN DATABASES

A Fast Algorithm for Mining Rare Itemsets

Interestingness Measurements

Mining Generalised Emerging Patterns

Parallel Mining of Maximal Frequent Itemsets in PC Clusters

AN ENHANCED SEMI-APRIORI ALGORITHM FOR MINING ASSOCIATION RULES

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

Transcription:

2010 IEEE International Conference on Granular Computing

Efficient Mining of Generalized Negative Association Rules

Li-Min Tsai, Shu-Jing Lin, and Don-Lin Yang
Dept. of Information Engineering and Computer Science, Feng Chia University, Taichung 407, Taiwan
debbi100@ms54.hinet.net, lin.shu.jing@gmail.com, dlyang@fcu.edu.tw

Abstract - Most association rule mining research focuses on finding positive relationships between items. However, many studies in intelligent data analysis indicate that negative association rules are as important as positive ones. We therefore propose a method that improves on traditional negative association rule mining. Our method greatly reduces the computing cost of mining negative association rules and prunes most non-interesting negative rules. By using a taxonomy tree obtained in advance, we reduce computing costs; through negative interestingness measures, we can quickly extract negative associations from the database.

Keywords: data mining; negative association rule; concept hierarchy; taxonomy; negative interestingness

I. INTRODUCTION

Association rule mining is one of the most important topics in data mining research. Most association rule algorithms focus on finding positive association rules, yet much of the literature on intelligent data analysis shows that negative association rules are just as important. In particular, negative association rule mining can be applied to domains with many types of factors, where negative rules help users quickly decide which factors are important instead of checking too many rules. For example, in bioinformatics we may find a negative association rule such as "if protein A appears, then protein B and protein C will not appear"; this kind of rule is useful for biologists studying diseases or developing drugs. In the traditional approach, finding negative association rules involves a large search space and generates too many non-interesting rules, so an efficient and useful algorithm for finding negative association rules is very valuable. Our research focuses on reducing computing time and finding interesting negative association rules. The proposed algorithm speeds up computation, and through the domain taxonomy tree we can find interesting negative association rules more easily.

II. RELATED WORK

The Apriori algorithm [1] is the basic algorithm for mining association rules, and many improved algorithms have been proposed on top of it. For example, the partition algorithm [2] reduces the number of database scans; the FP-growth algorithm [3] speeds up computation; generalized association rules [4] extend association rules so that, for a rule X → Y, no item in Y is an ancestor of any item in X. Nevertheless, most improved association rule algorithms still focus on positive association rules or on mining rare association rules [5]. Mining negative association rules is another issue that has attracted some researchers' attention [6]. For example, when determining strategies for product placement and purchase analysis, there are many factors whose pros and cons must be weighed. To minimize negative impacts and increase possible benefits [7], managers must consider which side effects are unlikely to occur when an expected advantage factor is selected. In such a situation, a negative association rule like X → ¬Y would be useful, because it tells us that Y (e.g., a disadvantage factor) does not occur, or rarely occurs, when X (e.g., an advantage factor) shows up.
Algorithms for discovering negative association rules have not been widely discussed [8, 9]. Their discovery procedure can be decomposed into three stages: (1) find a set of positive rules; (2) generate negative rules based on the existing positive rules and domain knowledge; (3) prune the redundant rules.

Savasere et al. [8] generate negative association rules based on a measure computed over rule parts. A negative association rule in [8] is an implication of the form X → ¬Y, where X ∩ Y = ∅, X is called the antecedent, and Y is the consequent of the rule. Every negative association rule has a rule interest measure RI, defined as

    RI = (ε[support(X ∪ Y)] − support(X ∪ Y)) / support(X)    (1)

where ε[support(X)] is the expected support of an itemset X. The rule interest RI is negatively related to the actual support of the itemset X ∪ Y: it is highest when the actual support is zero, and zero when the actual support equals the expected support.
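To make the RI measure concrete, here is a minimal Python sketch (ours, not taken from [8]) that computes RI for a candidate negative rule; the expected support of X ∪ Y is estimated here under a simple independence assumption, support(X) · support(Y), which may differ from the estimator used in [8].

```python
def rule_interest(sup_x, sup_y, sup_xy):
    """Rule interest RI of Eq. (1): (expected - actual) support of X U Y,
    normalized by support(X).  The expected support is estimated under an
    independence assumption, sup(X) * sup(Y); [8] may estimate it differently."""
    expected_xy = sup_x * sup_y
    return (expected_xy - sup_xy) / sup_x

# Example: X appears in 40% of transactions and Y in 50%, but they co-occur
# in only 5% -- far below the 20% expected under independence.
print(rule_interest(0.40, 0.50, 0.05))   # 0.375, hinting at a rule X -> not-Y
```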

Yuan et al. [9] generate negative association rules using a concept similar to that of [8]. Three measures are used in their algorithm: minimum support, minimum confidence, and SM (salience measure). A negative association rule in [9] is an implication of the form X → ¬Y (or ¬X → Y), where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The SM provides clues to potentially useful negative rules and is defined as

    SM = conf(r') − E(conf(r'))    (2)

where conf(r') is the actual confidence of rule r'. A large value of SM is evidence for accepting the hypothesis that X' → Y is false, that is, that X' → ¬Y may be true. In brief, to qualify as a negative rule, a candidate must satisfy two conditions: first, there must be a large deviation between the estimated and actual confidence values; second, its support and confidence must exceed the required minimums.
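As a concrete reading of this two-condition test (our own sketch, not code from [9]), the following Python function flags a candidate negative rule; how the expected confidence E(conf(r')) is estimated is deliberately left to the caller, since [9] defines its own estimator.

```python
def qualifies_as_negative_rule(actual_conf, expected_conf, support,
                               min_sup, min_conf, min_sm):
    """Two-condition test paraphrased from [9]: (1) the salience measure
    SM = conf(r') - E(conf(r')) of Eq. (2) shows a large deviation, and
    (2) the rule's support and confidence exceed the user-given minimums.
    The expected-confidence estimate is supplied by the caller."""
    sm = actual_conf - expected_conf
    return sm >= min_sm and support >= min_sup and actual_conf >= min_conf

# Hypothetical numbers: actual confidence 0.9 against an expected 0.5.
print(qualifies_as_negative_rule(0.9, 0.5, 0.12, 0.05, 0.6, 0.3))   # True
```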
In this paper, we focus on efficiently mining negative association rules. Two observations motivate us: (1) negative association rules are as important as positive rules, and (2) traditional approaches lead to a very large number of rules and expensive computing costs. On account of these motivations, we developed a method that addresses both problems: it speeds up computation and finds the interesting negative rules according to the user's requirements. The rest of the paper is organized as follows. Section III gives the detailed process of the proposed algorithm. The experimental results and their discussion are presented in Section IV. Finally, Section V concludes the paper.

III. PROPOSED METHOD

Negative association rule discovery faces such a large search space that it may require more computing time than traditional positive rule discovery with intuitive mining algorithms like Apriori. We therefore propose an improved approach called the Generalized Negative Association Rule (GNAR) algorithm. For efficiency, we scan the database once and transform the transactions into a space-reduced structure, the vertical TID table, stored in main memory. We assume that the taxonomy tree is available in advance. The taxonomy tree assists in creating the vertical TID table: through it, we can filter out transactions that do not belong to the domain and make no contribution to the end result. Besides eliminating a large number of useless transactions, the information in the taxonomy tree is also used to mine negative association rules. In the mining steps of GNAR, we use negative interestingness and negative confidence to increase the accuracy of the mined results, and pruning techniques remove non-interesting negative association rules.

A. The Concepts of GNAR

The concepts used in GNAR fall into two parts: concept hierarchy and negative interestingness. Since the search space of mining negative association rules is extremely large, a concise representation of negative association rules is needed; a concept hierarchy, or taxonomy, serves this purpose. The second part is negative interestingness. As mentioned above, negative association rules can be thought of as a complement of positive association rules, but their nature is totally different, so the traditional measures of support and confidence used for positive association rules are no longer appropriate, and suitable measures for mining negative association rules are needed. More detailed descriptions of these two concepts follow.

Concept hierarchy: A concept hierarchy defines a series of mappings from a set of low-level concepts to higher-level, more general concepts. It is a useful form of background knowledge in that it allows raw data to be represented at generalized levels of abstraction. Generalization of the data, or rolling up, is achieved by replacing primitive-level data with higher-level concepts. By using a concept hierarchy, we can condense the negative association rules into a more succinct form.

Figure 1. A concept hierarchy for the dimension "snacks"

Fig. 1 shows a concept hierarchy for the dimension "snacks". In this paper, concept hierarchy and taxonomy (tree) are used interchangeably. Taking Fig. 1 as an example, if a rule of the form R: Pepsi → Brand B cracker is generated, then the rule R1: Soft Drink → Cracker would also be generated and would hold a larger support than R. This kind of concept is well suited to mining negative association rules, because the number of generated negative association rules is greater than the number of generated positive ones.
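As an informal illustration of this rolling-up step (our own sketch; the item names follow Fig. 1, and the parent map is assumed), a rule can be generalized simply by replacing each item with its higher-level concept:

```python
# Hypothetical slice of the Fig. 1 taxonomy: each item maps to its parent concept.
PARENT = {
    "Pepsi": "Soft Drink",
    "Coke": "Soft Drink",
    "Brand A cracker": "Cracker",
    "Brand B cracker": "Cracker",
    "Soft Drink": "Snacks",      # internal nodes roll up to the root
    "Cracker": "Snacks",
}

def roll_up(item):
    """Replace a primitive-level item by its higher-level concept."""
    return PARENT.get(item, item)

# Rolling up R: Pepsi -> Brand B cracker yields R1: Soft Drink -> Cracker,
# whose support is at least as large as that of R.
antecedent, consequent = "Pepsi", "Brand B cracker"
print(roll_up(antecedent), "->", roll_up(consequent))   # Soft Drink -> Cracker
```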

Therefore, a negative association rule is more easily understood when presented with a concept hierarchy, which also lets users view the data at more meaningful and explicit levels of abstraction. Fig. 1 shows three kinds of nodes with different uses: only leaf-node items appear in the database, while the other two kinds (root and internal nodes) are used for the concept hierarchy presentation. Three types of generalized negative association rule are mined by our method:

    [¬]Coke → [¬]Brand A            (leaf → leaf)
    [¬]Soft Drink → [¬]Brand B      (internal node → leaf)
    [¬]Soft Drink → [¬]Cracker      (internal node → internal node)

In our proposed method, we assume that this kind of taxonomy tree is provided in advance. Through the taxonomy tree, we first eliminate transactions that do not belong to the domain or that contain user-specified items. After counting the support of each item, the taxonomy tree is further pruned into a smaller one, and the taxonomy information is kept for the subsequent negative association rule mining process.

Negative interestingness: Before introducing negative interestingness, we first discuss the relationships between items; here we consider only binary relationships. Table 1 shows the four possible states for two different items X and Y in a database, each of which can be present or absent in a transaction.

TABLE 1. Binary relations between items X and Y (transaction counts)

                 Y present    Y absent
    X present        a            b
    X absent         c            d

From Table 1, we can easily derive the support and confidence used in traditional association rule mining: the support of the rule X → Y is a / (a + b + c + d), and its confidence is a / (a + b). To extract interesting negative association rules from large databases, we must define a proper measure for negative association rule mining. In Table 1, the count a corresponds to the case where X and Y occur at the same time; the other three counts involve at least one absent item. Therefore, instead of using the traditional measures a / (a + b + c + d) for support and a / (a + b) for confidence, we define the following measure for mining interesting negative association rules:

    Negative interestingness = w4 · ((w1·b + w2·c + w3·d) / (a + b + c + d))^w5    (3)

To the best of our knowledge, this measure is a general case covering most dissimilarity measures; for example, the binary pattern difference, average squared, and binary Euclidean dissimilarity measures are special cases of negative interestingness. Users may adjust the flexible parameters w1 to w5 according to their applications and specific demands. Moreover, confidence for negative association rules can also be defined easily from the four-state table of Table 1. The three types of negative association rules and their negative confidences are:

    X → ¬Y : b / (a + b)
    ¬X → Y : c / (c + d)
    ¬X → ¬Y : d / (c + d)
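To make the four counts and the two measures concrete, here is a small Python sketch (our illustration, not the authors' Visual C++ implementation) that computes negative interestingness per Eq. (3) and the negative confidence of each rule type from the counts a, b, c, d of Table 1.

```python
def negative_interestingness(a, b, c, d, w1=1, w2=1, w3=1, w4=1, w5=1):
    """Eq. (3): w4 * ((w1*b + w2*c + w3*d) / (a + b + c + d)) ** w5,
    where a = X and Y both present, b = only X, c = only Y, d = neither.
    With all weights set to 1 this is simply the fraction of transactions
    in which at least one of X, Y is absent."""
    return w4 * ((w1 * b + w2 * c + w3 * d) / (a + b + c + d)) ** w5

def negative_confidence(a, b, c, d, rule="X->notY"):
    """Negative confidence of the three rule types derived from Table 1."""
    if rule == "X->notY":
        return b / (a + b)
    if rule == "notX->Y":
        return c / (c + d)
    if rule == "notX->notY":
        return d / (c + d)
    raise ValueError(f"unknown rule type: {rule}")

# Hypothetical counts over 100 transactions.
a, b, c, d = 5, 35, 40, 20
print(negative_interestingness(a, b, c, d))          # 0.95
print(negative_confidence(a, b, c, d, "X->notY"))    # 0.875
```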
B. The Process of GNAR

This section describes the proposed GNAR algorithm in the following three steps.

(1) First, we scan the database into a vertical TID table in main memory. The vertical TID table is a memory-space-reduced structure that transforms the transactions into a bit-map string form chosen according to the data distribution of the original database. If the original database is dense (most items occur in more than half of the transactions), the vertical TID table instead records, for each item, the TIDs in which the item does not occur; if the original database is sparse, it records only the TIDs in which each item does occur. Because GNAR is a memory-based algorithm, memory usage must be considered carefully, and the vertical TID table can be applied to both dense and sparse databases.
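A minimal Python sketch of such a vertical TID layout (our illustration; the authors' in-memory bit-map representation may differ in detail) follows.

```python
from collections import defaultdict

def build_vertical_tid(transactions):
    """Build a vertical TID table: one TID set per item instead of one item
    list per transaction.  Following the description above, if the database
    is dense (most items occur in more than half of the transactions) we
    store, per item, the TIDs in which it does NOT occur; otherwise we store
    the TIDs in which it does occur."""
    occ = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            occ[item].add(tid)
    n = len(transactions)
    dense = sum(len(t) > n / 2 for t in occ.values()) > len(occ) / 2
    if dense:
        all_tids = set(range(n))
        return "absent", {item: all_tids - t for item, t in occ.items()}
    return "present", dict(occ)

# A tiny, sparse toy database.
db = [{"Coke", "Brand A cracker"}, {"Pepsi"}, {"Coke"}, {"Coke", "Pepsi"}]
mode, table = build_vertical_tid(db)
print(mode, table["Pepsi"])   # present {1, 3}
```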

(2) Second, we assume that the taxonomy tree is always available. Using this taxonomy tree, we eliminate items and transactions that do not belong to the domain. Then, with a minimum support, we find L1 from the vertical TID table and calculate the support of each internal node of the taxonomy tree. In this step, the support of each internal node and of the root node can be computed with the OR operation. In the GNAR process, negative interestingness, as defined above, replaces the support measure in every step except the formation of L1.

(3) After calculating the supports of all internal nodes of the taxonomy tree, we generate the frequent taxonomy itemsets T. We then generate C2 from L1. When counting the support for L2, we use negative interestingness as the threshold and apply a pruning technique: items in C2 that belong to the same parent node of the taxonomy tree are pruned. From L2, we generate R2 with another pruning technique: for a rule of the form [¬]I1 → [¬]I2 with I1 ∩ I2 = ∅, no item in I2 may be an ancestor of any item in I1. After that, we construct an association graph based on L1, L2, and the frequent taxonomy items T. The association graph joins the frequent taxonomy items with the original large items of the database and keeps the taxonomy information for the subsequent mining process. Based on the association graph, we can produce k-generalized negative association rules. In our GNAR algorithm, we consider only generalized negative association rules of the form [¬]{Itemset A} → [¬]{Itemset B}, where the items inside each set of braces (Itemset A or Itemset B) are positively associated among themselves. Fig. 2 shows our GNAR algorithm; negative confidence is used to extract the three types of rules in each rule-generation step.

Figure 2. The GNAR algorithm
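The following Python sketch (ours; the paper's Fig. 2 pseudocode and Visual C++ implementation are not reproduced in this transcription) illustrates two of the mechanics described above under stated assumptions: computing an internal node's TID set as the OR (set union) over its children, and pruning C2 candidates whose two items share the same parent.

```python
def internal_node_tids(node, children, item_tids):
    """Support TIDs of an internal taxonomy node: union (OR) of the TID
    sets of its descendant leaf items, as in step (2)."""
    tids = set()
    for child in children.get(node, []):
        if child in children:                      # the child is itself internal
            tids |= internal_node_tids(child, children, item_tids)
        else:
            tids |= item_tids.get(child, set())
    return tids

def prune_same_parent(c2, parent):
    """Step (3) pruning: drop candidate 2-itemsets whose items share a parent."""
    return [(x, y) for x, y in c2 if parent.get(x) != parent.get(y)]

# Hypothetical data following the Fig. 1 taxonomy.
children = {"Soft Drink": ["Coke", "Pepsi"], "Cracker": ["Brand A", "Brand B"]}
parent = {"Coke": "Soft Drink", "Pepsi": "Soft Drink",
          "Brand A": "Cracker", "Brand B": "Cracker"}
item_tids = {"Coke": {0, 2}, "Pepsi": {1}, "Brand A": {0}, "Brand B": {3}}

print(internal_node_tids("Soft Drink", children, item_tids))     # {0, 1, 2}
print(prune_same_parent([("Coke", "Pepsi"), ("Coke", "Brand B")], parent))
# [('Coke', 'Brand B')] -- Coke and Pepsi share the parent Soft Drink
```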

IV. EXPERIMENT RESULTS AND DISCUSSION

We implemented the GNAR algorithm in Visual C++ and performed the experiments on a personal computer with an Intel Pentium 4 processor clocked at 2.4 GHz and 512 MB of DDR266 main memory. The test data were produced with the IBM dataset generator [11].

A. Experimental Parameters

Table 2 shows the parameter settings used to generate the three test databases: T is the average number of items per transaction, I is the average length of the maximal frequent patterns, and D is the total number of transactions. In addition to comparing with the traditional negative association rule algorithm, we also ran our algorithm with different parameter settings (test1 to test5 in Table 3).

TABLE 2. Test databases
TABLE 3. Test parameters

B. Experiment Results and Discussion

We ran both GNAR and the traditional algorithm on the three datasets with the parameter settings test1 to test5 of Table 3. The average number of levels of the taxonomy data is set to 6, with 11 categories. We use the dataset T10I6D10K for the first experiment, with the GNAR weight parameters w1 = w2 = w3 = w4 = w5 = 1 and negative interestingness = negative confidence = 0.6.

Figure 3. Execution time experiments on T10I6D10K

Fig. 3 shows the result of the experiment on database T10I6D10K. The X-axis represents the initial support of GNAR and the support of the traditional negative association rule algorithm, ranging from 0.5 to 1.0; the Y-axis represents the execution time of the two algorithms at each support value. From Fig. 3, GNAR spends less time than the traditional algorithm in most cases. When the support is close to 0.5, GNAR performs much better than the traditional negative association rule algorithm; when the support is close to 1, the performance of the two algorithms is similar.

We use the dataset T15I12D100K in the second experiment to analyze the performance of the algorithms when the average length of the maximal frequent patterns is long. In this experiment, we set the GNAR parameters to w1 = w2 = w3 = w4 = w5 = 1 and negative interestingness = 0.6, with different Ini_Sup values and different supports for the traditional algorithm. We found the traditional algorithm inefficient, especially when the average size of the maximal potentially large itemsets is doubled from 6 to 12. Fig. 4 shows that the traditional algorithm spends more time generating negative association rules than GNAR when the support is close to 0.5; when the support is close to 1, the performance of GNAR and the traditional algorithm is almost the same.

Figure 4. T15I12D100K

In the third experiment, we use the different parameter settings test1 to test5 of Table 3 to analyze the negative association rules generated for the dataset T12I8D50K. The taxonomy data used here are set to an average of 6 levels and 11 categories. Fig. 5 shows the result. Test4 generated the largest number of negative association rules, because the denominator of negative interestingness decreases, so that many rules can be extracted by the negative interestingness measure. On the other hand, test3 and test5 generated far fewer rules than the other tests, because the tested database has fewer negative relationships.

Figure 5. Generated negative association rules using the testing settings test1 to test5

In the last experiment, we use different taxonomy data to compare the behavior of our GNAR algorithm under different taxonomic structures.

In this experiment, we set the GNAR parameters to w1 = w2 = w3 = w4 = w5 = 1, negative interestingness = 0.6, Ini_Sup = 0.6, and negative confidence = 0.6. Two taxonomic structures are used for comparison: Taxonomy1 has an average of 3 levels and 11 categories, and Taxonomy2 has an average of 9 levels and 11 categories. The tested dataset is T12I8D50K. Fig. 6 shows the execution time for the two taxonomies. From Fig. 6, our method is more efficient when the taxonomy has more levels. The reason is that the fan-out of the taxonomy strongly affects the performance of GNAR: since Taxonomy2 has more levels than Taxonomy1, it has a smaller fan-out, and the GNAR algorithm is therefore more efficient at mining negative association rules with a deeper taxonomy.

Figure 6. Comparison of different taxonomies

V. CONCLUSION

A considerable body of work has addressed positive association rule mining, but negative association rule mining has received very little attention. Negative association rule mining can be applied to domains with various types of factors, and it can help users quickly decide which factors are important instead of checking too many rules. In this paper, we proposed an efficient method for mining generalized negative association rules. Instead of mining negative association rules with an intuitive method, we use negative interestingness to characterize the properties of negative association rules and justify its effectiveness. With taxonomy tree information, we reduce the search space of the mining process, and a useful representation of generalized negative association rules is proposed. Mining sequential patterns with negative conclusions and developing scalable parallel algorithms are two major directions of our future research.

ACKNOWLEDGMENT

This research was supported by the National Science Council, Taiwan, under grants NSC 98-2221-E-035-059-MY2 and NSC 98-2218-E-007-005.

REFERENCES

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proceedings of the 20th International Conference on Very Large Databases, pp. 487-499, Santiago, Chile, 1994.
[2] A. Savasere, E. Omiecinski, and S. B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proceedings of the 21st International Conference on Very Large Databases, pp. 432-444, Zurich, Switzerland, 1995.
[3] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 486-493, May 2000.
[4] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proceedings of the 21st International Conference on Very Large Databases (VLDB '95), pp. 407-419, Zurich, Switzerland, 1995.
[5] Y. S. Koh and N. Rountree, Rare Association Rule Mining and Knowledge Discovery: Technologies for Infrequent and Critical Event Detection, Information Science Reference, August 2009.
[6] X. Wu, C. Zhang, and S. Zhang, "Efficient Mining of Both Positive and Negative Association Rules," ACM Transactions on Information Systems, Vol. 22, No. 3, pp. 381-405, July 2004.
[7] Chengqi Zhang and Shichao Zhang, Association Rule Mining: Models and Algorithms (Lecture Notes in Artificial Intelligence), Springer-Verlag, July 2002.
[8] A. Savasere, E. Omiecinski, and S. Navathe, "Mining for Strong Negative Associations in a Large Database of Customer Transactions," Proceedings of the International Conference on Data Engineering, pp. 494-502, February 1998.
[9] X. Yuan, B. Buckles, Z. Yuan, and J. Zhang, "Mining Negative Association Rules," Proceedings of the Seventh IEEE International Symposium on Computers and Communications (ISCC '02), pp. 623-628, 2002.
[10] C. Cornelis, P. Yan, X. Zhang, and G. Chen, "Mining Positive and Negative Association Rules from Large Databases," IEEE Conference on Cybernetics and Intelligent Systems, pp. 613-618, 2006.
[11] IBM Almaden Research Center, Synthetic Data Generation Code, http://www.almaden.ibm.com/software/quest/, 2006.