A Survey of Frequent and Infrequent Association Rule Mining Models


Sujatha Kamepalli 1, Dr. Raja Sekhara Rao Kurra 2, Dr. Sundara Krishna Y.K. 3

1 Research Scholar, CSE Department, Krishna University, Machilipatnam, Andhra Pradesh, INDIA. sujatha101012@gmail.com
2 Professor in C.S.E and Director, Usha Rama College of Engineering and Technology, Telaprolu, Andhra Pradesh, INDIA. krr_it@yahoo.co.in
3 Professor, CSE Department, Krishna University, Machilipatnam, Andhra Pradesh, INDIA. yksk2010@gmail.com

Abstract

Mining frequent and infrequent patterns is a major task of association rule mining research. Traditional association rule mining models identify correlated and frequent patterns from the large candidate sets of transaction databases. However, mining infrequent patterns has also attracted the attention of many researchers in different domains. Traditional frequent and infrequent pattern models use two measures, support and confidence, as thresholds to classify itemsets as positive or negative. Itemsets exceeding the minimum thresholds are selected as interesting itemsets, and the rest are considered infrequent patterns. The main issues of the traditional association rule mining models are high computational time, repeated I/O scans and the large storage space required for computation. In this paper, we study and analyze various frequent and infrequent association rule mining models, discuss the pros and cons of each model, and present a comparative analysis.

Keywords: Infrequent patterns, Support.

I. INTRODUCTION

Data mining is a specialized discipline that aims at the discovery of hidden patterns in large databases. Mining association patterns from different types of databases has received much attention recently. The purpose of association rule mining algorithms is to find relationships between the items and itemsets in a dataset.
Various association rule mining models have been presented over the past decades to identify and extract hidden patterns in transactional databases. Significant and widely used applications include telecommunication, market basket analysis and risk management. Association rule mining is performed in two steps, followed by rule generation. In the first step, itemsets are processed to filter the candidate sets. In the second step, frequent and infrequent patterns are separated using statistical measures. The frequent itemsets are then used to generate basic association rules. Let I = {I1, I2, ..., Ik} be a frequent itemset. An association rule is represented as {I1, I2, ..., Ik-1} → {Ik}. The confidence of a rule, that is, the fraction of transactions containing the antecedent that also contain the consequent, signifies the interestingness of that rule. Every rule is generated in this way, and its interestingness is evaluated by its confidence value. Itemsets whose support value is greater than the minimum threshold are defined as frequent itemsets. Efficiency and performance increase significantly if the number of association rules is reduced. Strategies that reduce the number of association rules include generating rules based on interestingness, non-redundancy, coverage, leverage, and so on. A frequent itemset X is a maximal frequent itemset if there exists no other frequent itemset Y such that X is a proper subset of Y. Since the total number of maximal frequent patterns is smaller than the number of frequent patterns, interesting patterns are identified using several traditional models such as [1-5]. Infrequent pattern discovery is applicable to data from different domains such as risk assessment in census data, anomaly detection, cloud usage analysis, network intrusion detection, stock markets and disease prediction. However, traditional infrequent mining models still suffer from local item interestingness during the pattern discovery process.
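The definitions above (support, confidence, rules of the form {I1, ..., Ik-1} → {Ik}, and maximal frequent itemsets) can be illustrated with a minimal Python sketch. The transaction data, the function names and the brute-force enumeration are assumptions made for illustration only; they are not taken from any of the surveyed models, and the enumeration is only practical for very small item universes.

```python
from itertools import combinations

# Hypothetical toy transaction database, used only for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in db if items <= t)


def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, db) / len(db)


def confidence(antecedent, consequent, db):
    """count(antecedent ∪ consequent) / count(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return count(both, db) / count(antecedent, db)


def frequent_itemsets(db, min_support):
    """Brute-force enumeration of itemsets with support above the threshold."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(combo, db)
            if s > min_support:
                result[frozenset(combo)] = s
    return result


def maximal_itemsets(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return [x for x in frequent if not any(x < y for y in frequent)]


def rules(itemset, db):
    """Yield (antecedent, consequent, confidence) with a single-item consequent."""
    for item in sorted(itemset):
        antecedent = frozenset(itemset) - {item}
        if antecedent:
            yield antecedent, frozenset({item}), confidence(antecedent, {item}, db)


frequent = frequent_itemsets(TRANSACTIONS, min_support=0.5)
maximal = maximal_itemsets(frequent)
for x in maximal:
    for a, c, conf in rules(x, TRANSACTIONS):
        print(sorted(a), "->", sorted(c), f"confidence={conf:.2f}")
```

Counting transactions once and deriving both support and confidence from the counts keeps the two measures consistent. A practical miner would replace the exhaustive loop with candidate pruning, for example Apriori-style pruning based on downward closure.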
The basic conventional algorithm has several shortcomings. It can be improved by reducing the computation cost and the number of database passes, and by applying sampling, additional constraints and parallelization. Some major issues of association rule mining are listed below:

1. In large databases, a large number of association rules is produced, both relevant and non-relevant. Support and confidence are two important measures for evaluating interestingness. Chi-square, Laplace and the correlation coefficient are some other approaches for the evaluation of interestingness. By applying additional metrics such as lift and conditional dependency, the number of association rules can be reduced.
2. The quality of the rules is as essential as their number, and it strongly affects the usefulness of association rule algorithms.
3. Association mining algorithms depend on their parameter configuration, and the parameters must be defined before execution. All association rules exceeding the basic statistical measures are considered valid. The choice of thresholds is based on the user's prior knowledge and experience in the specific domain. This issue can be addressed by algorithms that have few or no parameters.

Frequent Itemset Mining Models:

i) Uniform Distribution Model: This is a frequently used data mining approach in which all rules are based on frequent itemsets. Items exceeding the minimum support and minimum confidence are selected, and association algorithms produce rules from them. The model assumes that the items in a dataset are uniformly distributed with respect to support, which is its major weakness.
ii) Significance of Item Model: This model introduces the concept of weight. Weights are added to every transaction to assign priority to the items. The major problem of this model is that the weights are applied only during itemset processing and pattern generation; the mining process itself does not use any weight.
iii) Weighted Association Rule Mining: This model applies weights to both rule generation and the mining process. The approach resolves the issue with the downward closure property.
iv) Data Trimming Model: This is a probabilistic approach for mining frequent itemsets. A modified version of the Apriori algorithm, U-Apriori, is implemented in this model. The computational issue of U-Apriori is handled by the mining algorithm.
Another technique, known as LGS-trimming, is applied on the framework in order to achieve high performance.

Infrequent Itemset Mining:

i) Positive and Negative Association Rule Model: This model mines both positive and negative association rules in databases. It helps to decrease the search space and significantly improves the degree of conditional probability, which plays a vital role in evaluating confidence for both positive and negative association rules.
ii) Minimal Infrequent Itemset Mining Model: This model evaluates minimal τ-infrequent or minimal τ-concurrent itemsets in a database. First, an item ranking algorithm organizes the items in ascending order of their ranks. Minimal τ-infrequent itemsets are then evaluated by processing the items in ranking sequence.
iii) Rare Association Rules Generation Model: This model generates rare association rules out of infrequent itemsets. It picks the hidden rare association rules from the frequent itemsets. These rare association rules are termed mRI (minimal rare itemset) rules. The Apriori algorithm is used to calculate the support and the mRIs, and the rare itemsets can be regenerated from the mRIs.
iv) Pattern-Growth Paradigm and Residual Trees Model: This model finds minimal infrequent itemsets. It uses the IFP min algorithm to generate infrequent itemsets. The algorithm evaluates items using the residual tree concept (i.e., an inverse FP-tree).

All these algorithms mostly mine frequent and infrequent association rules from frequent itemsets. Negative association rules from infrequent itemsets are not considered in the pattern discovery process.

II. RELATED WORKS

P. Kishor and S. Porika implemented a new mining approach to generate positive and negative association rules for huge transactional databases [1]. This scheme generates strong positive and negative association rules. The model computes Yule's correlation coefficients on the input items and generates frequent patterns. It predicts negative constraints by considering the positive constraints. The researchers evaluated their approach on real datasets and compared it with other existing techniques. They reported that their method is better than conventional approaches in terms of frequent patterns and statistical measures. In future, this model can be enhanced by implementing Self Organizing Maps (SOM) for frequent itemsets.

Y. S. Koh and R. Pears implemented a novel negative association model based on chance thresholds [2]. Compared to positive association rules, negative association rules are harder to predict on high-dimensional databases. In traditional negative association rule mining models, an arbitrary threshold value is chosen and assumed to be constant throughout the whole mining process. The authors presented a new algorithm, termed MINR, which generates negative association rules without a user-specified threshold value. They evaluated Pearson's φ correlation as the interestingness measure for comparison. They reported that their algorithm generates a more concise set of constraints than Pearson's φ correlation. The approach also eliminates a known issue of Pearson's φ correlation, namely datasets whose itemsets have widely varying support. The method not only shows better performance but also provides significant precision and recall. Its performance does not depend on the number of negative itemsets or the size of the itemsets. Because the algorithm produces rules with a large number of terms, it can be applied in complex scenarios.

J. Agrawal et al. identified the issues of positive association rule mining and developed a new method for negative ARM [3]. They applied this approach to market basket analysis. They developed a model known as SARIC, a special type of swarm optimization technique that generates association rules through itemset range and correlation coefficient. Both positive and negative attributes are considered. They performed experiments and evaluated the technique with different statistical measures. The results were compared with other approaches (Apriori, Eclat, HMINE and Genetic Algorithm). Further research can be carried out to decrease the space and computational time. Efforts are also needed to apply this method to spatial and temporal databases.

K. Amphawan et al. presented a new approach for mining top-k frequent-regular closed patterns [4], known as TFRC-mining. The approach is applicable to real-world applications. Users are responsible for assigning the parameter values (k, the regularity threshold and the minimum length). This technique does not generate a large number of patterns. It also eliminates redundant patterns and selects the frequent patterns. The method decreases the number of patterns and improves their quality as the size of the database increases. The authors used a new compressed bit-vector approach for candidate pruning, which makes the evaluation of support and regularity faster than in other methods. In future, this model can be extended to incremental databases, data streams and distributed TFRC-mining.

J. K. Jain et al. designed and implemented a novel approach for mining positive and negative association rules using a genetic algorithm [5]. They used the genetic algorithm as an optimizer for pattern discovery. The IMLMS approach is integrated with the genetic algorithm in order to form the association patterns. Changing the distance weight generates a large number of association rules.
Since the weight depends on support and confidence, some of the interesting rules are discarded. The authors established a theoretical relationship between locally large and globally large patterns. These patterns help in local pruning, which reduces the number of candidates. MLMS generates a larger number of frequent rules than IMLMS, but the proposed algorithm generates far fewer association rules than either MLMS or IMLMS.

D. Martín et al. implemented a new approach based on a Multi-Objective Evolutionary Algorithm (MOEA) for mining positive and negative association rules [6]. This model is known as MOPNAR, an extended version of MOEA. The mining algorithm produces reduced sets of positive and negative association rules. It optimizes three main objectives: comprehensibility, interestingness and performance. The algorithm uses an external population (EP) to support the evolutionary model and to store the non-dominated constraints, which enhances the diversity of the constraints. These constraints represent a dependency among items. Reduced sets of attributes make the algorithm simple and easy for users. The authors report that the algorithm reduces computational cost and problem size as the input grows.
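The idea of storing only non-dominated rules, as described for MOPNAR above, can be sketched as a small archive update in Python. This is a minimal illustration under stated assumptions, not the MOPNAR implementation: the `Rule` type, the objective values and the `update_archive` helper are hypothetical, and all three objectives are assumed to be maximised.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    # Hypothetical rule record; the objective values below are invented.
    name: str
    comprehensibility: float
    interestingness: float
    performance: float


def dominates(a: Rule, b: Rule) -> bool:
    """True if `a` is no worse than `b` on every objective and better on one."""
    a_obj, b_obj = a[1:], b[1:]
    no_worse = all(x >= y for x, y in zip(a_obj, b_obj))
    better = any(x > y for x, y in zip(a_obj, b_obj))
    return no_worse and better


def update_archive(archive: List[Rule], candidate: Rule) -> List[Rule]:
    """Insert `candidate` unless it is dominated; evict rules it dominates."""
    if any(dominates(kept, candidate) for kept in archive):
        return archive
    survivors = [kept for kept in archive if not dominates(candidate, kept)]
    return survivors + [candidate]


candidates = [
    Rule("r1", 0.9, 0.2, 0.5),
    Rule("r2", 0.5, 0.8, 0.6),
    Rule("r3", 0.4, 0.7, 0.5),
    Rule("r4", 0.9, 0.3, 0.5),
]

archive: List[Rule] = []
for rule in candidates:
    archive = update_archive(archive, rule)

print([r.name for r in archive])
```

Because a rule is kept whenever no other rule beats it on all objectives at once, the archive retains trade-offs (for example, a very comprehensible rule next to a very interesting one), which is one way to read the survey's remark that diversity is enhanced.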

C. Pasquier et al. developed a new algorithm and a framework for mining frequent patterns using pattern trees [7]. Three different pattern types are considered: atrees, closed trees and c-closed atrees. The authors developed three algorithms for mining these trees. C-closed atrees are a slightly modified version of atrees that achieve better performance and succinctness. The authors identified two issues in atrees and closed trees: 1) the mining process of atrees returns a large number of patterns; 2) closed tree mining requires more time because of the sub-tree isomorphism evaluation process. They evaluated the performance of their algorithm and compared it with other algorithms. In future, this approach can be extended to mining frequent closed patterns. Efforts are also needed for mining complex and attributed graphs.

A. Y. Rodríguez-González et al. focus on similarities for mining frequent patterns and associations [8]. This model helps to prune the search space of patterns using similarities. The authors used a special data structure to store the details of the instances and their degrees of similarity. They implemented the GenRules algorithm to produce association rules from frequent patterns. When an equality index is used for rule generation instead of a similarity index, some meaningful rules may be discarded and false rules may be formed. A classification algorithm is then applied to analyze the quality of the frequent similar patterns. The authors showed that their approach yields better frequent patterns. The algorithm satisfies the downward closure property. Further work may be done to reduce the frequent patterns and association rules without any loss of data.

Y. Shen, J. Liu and J. Shen extended the features, functions and mining patterns of Weka [9].
They identified the problems in the association rules of Weka ARM and proposed a new algorithm for mining positive and negative association rules. They also extended the association rule algorithm for the purpose of further development. The modified algorithm was then compared with the original algorithm. It helps to retrieve and mine explicit association rules. The authors evaluated their technique on a public database and observed improved results, along with better adaptability and scalability.

K. Subramanian et al. developed a high utility pattern mining technique for negative association rule mining [10]. They surveyed the current methods for high utility pattern mining and analyzed pruning methods and utility measures. The developed algorithm is named UP-GNIV and is implemented on transactional databases with negative utilities. The authors applied RNU and PNI to discard these negative utilities. The approach uses UP-trees to produce PHUIs. The researchers conducted experiments and evaluated the effectiveness of their technique. The algorithm limits the search space and provides better scaling than HUINIV-mining. In future, objective measures and semantic measures can be integrated thoroughly. The present work can be extended to parallel and distributed mining of high utility datasets. The algorithm can also be optimized for transactional databases, as there is no such algorithm.

The Intention-Purchase Relationship has attracted a number of empirical studies highlighting significant inconsistencies between purchase intention and purchase behaviour. In the modern world, consumers' purchase intentions are increasing from time to time. The purchase intentions for different products form a queue in the mind of the consumer. Purchasing is the service given to the members of the queue. Not all purchase intentions result in an actual purchase.
This paper describes the movement of the desired products to both the situations by a three-stage compartmental model. The first compartment includes all products that the consumer desires to buy and the second the real purchase items. The third compartment includes those items which are not decided to be purchased or are postponed to purchase on another occasion.

| Sl. No. | Author(s) | Pros | Cons |
|---|---|---|---|
| 1 | P. Kishor, S. Porika | More valid rules are generated for different support and confidence. No additional measure and extra database scans are required. | No use of SOM for the generation of frequent patterns. |
| 2 | Y. S. Koh and R. Pears | Provides better performance, precision and recall than Pearson's φ correlation. Eliminates the issue of Pearson's φ correlation, i.e., datasets having variational support itemsets. The performance is independent of the number of negative itemsets and the size of itemsets. | It produces rules with huge numbers of terms. |
| 3 | J. Agrawal, S. Agrawal, A. Singhai and S. Sharma | Better efficiency and performance compared to Apriori, Eclat, HMINE and Genetic Algorithm. | Performance can still be improved. Space and computation time is too high. |
| 4 | K. Amphawan and P. Lenca | Decreases numbers of patterns and improves quality. Faster computation of support and regularity. | Less gain for dense databases. |
| 5 | J. K. Jain, N. Tiwari and M. Ramaiya | Achieves local pruning. Few numbers of association rules will be generated. | - |
| 6 | D. Martín, A. Rosete, J. Alcalá-Fdez and F. Herrera | Improved diversity of association rules. Decreased computational cost and problem size. | There exists dependency among items. |
| 7 | C. Pasquier, J. Sanhes, F. Flouvat and N. Selmaoui-Folcher | Achieves better performance and succinctness. | Mining process of atrees returns large numbers of patterns. Closed trees mining requires more time because of sub-trees isomorphism evaluation process. |
| 8 | A. Y. Rodríguez-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa and J. Ruiz-Shulcloper | Produces enhanced frequent patterns. Satisfies downward closure property. | Reduction of association rules and frequent patterns leads to loss of data. |
| 9 | Y. Shen, J. Liu and J. Shen | Extracts explicit association rules. Better adaptability and scalability. | - |
| 10 | K. Subramanian and P. Kandhasamy | Limits the search space. Provides better scaling. | No connection between objective and semantic measures. Results are not optimized. |

References

[1] P. Kishor and S. Porika, "An Efficient Approach for Mining Positive and Negative Association Rules from Large Transactional Databases," Inventive Computation Technologies (ICICT), International Conference on, Vol. 1, IEEE, pp. 1-5.
[2] Y. S. Koh and R. Pears, "Efficient negative association rule mining based on chance thresholds," Intelligent Data Analysis 18 (2014).
[3] J. Agrawal, S. Agrawal, Ankita Singhai and S. Sharma, "SET-PSO-based approach for mining positive and negative association rules," Knowledge and Information Systems 45.2 (2015).
[4] K. Amphawan and P. Lenca, "Mining top-k frequent-regular closed patterns," Expert Systems with Applications 42.
[5] J. K. Jain, N. Tiwari and M. Ramaiya, "Mining Positive and Negative Association Rules from Frequent and Infrequent Pattern Using Improved Genetic Algorithm," 5th International Conference on Computational Intelligence and Communication Networks.
[6] D. Martín, A. Rosete, J. Alcalá-Fdez and F. Herrera, "A New Multiobjective Evolutionary Algorithm for Mining a Reduced Set of Interesting Positive and Negative Quantitative Association Rules," IEEE Transactions on Evolutionary Computation, vol. 18, no. 1, February 2014, pp. 54-69.
[7] C. Pasquier, J. Sanhes, F. Flouvat and N. Selmaoui-Folcher, "Frequent pattern mining in attributed trees: algorithms and applications," Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 26-37.
[8] A. Y. Rodríguez-González, J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa and J. Ruiz-Shulcloper, "Mining frequent patterns and association rules using similarities," Expert Systems with Applications 40 (2013).
[9] Y. Shen, J. Liu and J. Shen, "The Further Development of Weka Base on Positive and Negative Association Rules," 2010 International Conference on Intelligent Computation Technology and Automation.
[10] K. Subramanian and P. Kandhasamy, "UP-GNIV: an expeditious high utility pattern mining algorithm for itemsets with negative utility values," International Journal of Information Technology and Management, Vol. 14, No. 1, 2015, pp. 26-42.
