A new approach of Association rule mining algorithm with error estimation techniques for validate the rules


E. Ramaraj (1), N. Venkatesan (2), K. Rameshkumar (3)

(1) Director, Computer Centre, Alagappa University, Karaikudi, Tamilnadu. dr_ramaraj@yahoo.co.in
(2) Asst. Prof. and Head, Dept. of IT, Bharathiyar College of Engg. and Technology, Karaikal, Pondicherry. envenki@sify.com
(3) Full-time Ph.D. Scholar, Dept. of CSE, Alagappa University, Karaikudi, Tamilnadu. rameshkumar_phd@yahoo.co.in

Abstract

Data mining is an emerging field in the recent research literature. It extracts knowledge, including hidden information, from large databases. Association analysis is one of the major tasks in the data mining process and is conventionally a two-step process. The first step finds the frequent itemsets in the database; it produces a huge quantity of itemsets, most of which are irrelevant to the transactions. The second step constructs association rules from the frequent itemsets. This paper describes an approach that keeps only strong, validated association rules. Rule validation removes irrelevant rules from the newly constructed rules. We use an error estimation technique, the train-test approach, to filter out the irrelevant rules. The approach achieves low execution time and produces more informative rules.

Keywords: Association rule, Train-Test approaches, Rule validation techniques, Data mining.

1. Introduction

Data mining is an Artificial Intelligence (AI) powered tool that discovers useful information within a database, which can then be used to improve actions [1]. Data mining is also an essential step in the process of knowledge discovery in databases, in which intelligent methods are applied to extract patterns. Other steps in the knowledge discovery process include pre-mining tasks such as data cleaning and data integration, and post-mining tasks such as pattern evaluation and knowledge presentation [1].

Many types of interesting patterns have been identified in the research literature, and association rules constitute one such type. Data mining tasks that find these patterns include characterization, discrimination, association analysis, classification and regression, cluster analysis, outlier analysis, and evolution analysis. Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association rule mining finds interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their databases. These rules have applications in areas ranging from e-commerce to sports, census analysis, and medical diagnosis. Association rules are classified by criteria such as the kinds of values handled, the dimensions of the data, the levels of abstraction, and various extensions.

Most previous research [5][6][7][8] decomposes the problem of discovering association rules into two subproblems: first, finding all frequent itemsets in the database; second, constructing association rules from the frequent itemsets. This process produces a large number of frequent itemsets and association rules, and most of the rules are irrelevant to the database at hand. Avoiding irrelevant rules is essential when applying association rules to a database, and it is the main motivation of this work.

Our proposed approach decomposes the problem of discovering association rules into three steps. The first step finds the frequent itemsets in the database, the second constructs the association rules, and the third validates the rules. For rule validation we introduce error rate estimation techniques from machine learning, which are normally used to measure the accuracy of a rule. Many such techniques are available, including train-test, train/validate/test, k-fold cross-validation, stratified cross-validation, leave-one-out cross-validation, bagging (bootstrap aggregation), and boosting. Our aim is to find a validation method for association rules that is more efficient than previous approaches. We present an algorithm that finds valid association rules and validates them with the well-known train-test approach [2][3][4] to obtain rules with high accuracy. Support, confidence, and lift are the metrics used to evaluate the reliability of association rules. The proposed algorithm produces a set of rules that remain valid over several training/test cycles.

This paper is organized as follows. Section 2 introduces association rule mining. Section 3 explains the problem definition.
Section 4 presents the basic machine learning concepts behind the train-test approach. Section 5 presents the new algorithmic approach for strong association rules. Section 6 discusses implementation and performance analysis. Section 7 concludes the study and outlines future work.

2. Association Rule Mining

Association rule mining finds interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored, many industries are becoming interested in mining association rules from their databases.

Let $D$ be a set of $n$ transactions, $D = \{T_1, T_2, T_3, \ldots, T_n\}$, where $T_i \subseteq I$ and $I = \{i_1, i_2, i_3, \ldots, i_m\}$ is a set of items. A subset of $I$ containing $k$ items is called a $k$-itemset. Let $X$ and $Y$ be two itemsets such that $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. An association rule is an implication denoted by $X \Rightarrow Y$, where $X$ is called the antecedent and $Y$ is called the consequent.

We now define the association rule metrics. Given an itemset $X$, the support $s(X)$ is defined as the fraction of transactions $T_i \in D$ such that $X \subseteq T_i$. Let $P(X)$ be the probability of appearance of $X$ in $D$, and $P(Y \mid X)$ the conditional probability of appearance of $Y$ given $X$. $P(X)$ can be estimated as $P(X) = s(X)$. The support of a rule $X \Rightarrow Y$ is defined as $s(X \Rightarrow Y) = s(X \cup Y)$. An association rule $X \Rightarrow Y$ has a measure of reliability called the confidence, defined as $c(X \Rightarrow Y) = s(X \Rightarrow Y) / s(X)$. Confidence can be used to estimate $P(Y \mid X)$: $P(Y \mid X) = P(X \cup Y) / P(X) = c(X \Rightarrow Y)$. We use a third metric called lift, defined as $l(X \Rightarrow Y) = P(X \cup Y) / (P(X) P(Y)) = c(X \Rightarrow Y) / s(Y)$.

Lift quantifies the relationship between $X$ and $Y$. In general, a lift value greater than 1 provides strong evidence that $X$ and $Y$ depend on each other. A lift value below 1 indicates that $X$ depends on the absence of $Y$, or vice versa. A lift value close to 1 indicates that $X$ and $Y$ are independent.

The problem of mining association rules is defined as finding the set of all rules $X \Rightarrow Y$ such that $s(X \Rightarrow Y) \geq \psi$ and $c(X \Rightarrow Y) \geq \alpha$, given a support threshold $\psi$ and a confidence threshold $\alpha$. An itemset $X$ such that $s(X) \geq \psi$ is called frequent.

3. Problem Definition

Association rule mining is an important part of data mining. Normally the process is decomposed into two main steps:

1. Find the frequent itemsets in the database using support.
2. Construct the association rules from the frequent itemsets with the specified confidence and lift metrics.

Methods of this kind produce a large quantity of rules, so reducing the rule set is important. Many of the rules produced are irrelevant, and generating and examining them wastes processing effort and time. Our approach addresses this problem by decomposing the process into three steps:

1. Find the frequent itemsets in the database using support.
2. Construct the association rules from the frequent itemsets with the specified confidence and lift metrics.
3. Validate the constructed rules.

4. Train Test Approaches

The essential idea is this: a sample of data (the training data) is given to enable a classification rule to be set up. What we would like to know is the proportion of errors made by this rule when it is up and running, classifying new observations without the benefit of knowing the true classifications. To do this, we test the rule on a second, independent sample of new observations (the test data) whose true classifications are known but are not told to the classifier.
The predicted and true classifications on the test data give an unbiased estimate of the error rate of the classifier. To carry out this procedure on a given set of data, a proportion of the data (usually about 20-30%) is selected at random and used as the test data. The classifier is trained on the remaining data and then tested on the test data. There is a slight loss of efficiency because the full sample is not used to train the decision rule, but with very large datasets this is not a major problem. We adopted this procedure when the number of examples was much larger than 1000 (which allowed the use of a test sample of size 1000 or so). We often refer to this method as one-shot train-and-test.

In machine learning, it is customary to collect disjoint (independent) samples from a base data set to build and tune predictive (supervised) models [18]. The most common approach is called train and test. The basic idea is to build a predictive model with a training sample and then validate the model using an independent test sample.
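
As a minimal sketch of the one-shot train-and-test procedure described above, the following Python code randomly holds out a fraction of the data as a test set, fits a classifier on the remainder, and reports the error rate on the held-out part. The function names (`holdout_split`, `train_majority_classifier`, `error_rate`), the toy majority-class classifier, the synthetic data, and the 30% test fraction are all illustrative assumptions and do not come from the paper.

```python
import random
from collections import Counter

def holdout_split(records, test_fraction=0.3, seed=0):
    """Randomly partition records into disjoint (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_classifier(train):
    """Toy classifier (assumption): always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority

def error_rate(classifier, test):
    """Proportion of test records whose predicted label differs from the true label."""
    errors = sum(1 for features, label in test if classifier(features) != label)
    return errors / len(test)

# Hypothetical data: (features, label) pairs, 75 "pos" and 25 "neg".
data = [((i,), "pos" if i % 4 else "neg") for i in range(100)]
train, test = holdout_split(data, test_fraction=0.3, seed=42)
clf = train_majority_classifier(train)
print(len(train), len(test), error_rate(clf, test))
```

The test labels are used only inside `error_rate`, never during training, which is what makes the resulting figure an estimate of the error on unseen observations.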

The holdout method proceeds in two steps.

Step 1: The given data are randomly partitioned into two independent sets, a training set and a test set. Two thirds of the data are allocated to the training set and one third to the test set.

Step 2: The training set is used to derive the classifier, whose accuracy is estimated with the test set. The estimate is pessimistic, since only a portion of the initial data is used to derive the classifier.

Random subsampling is a variation of the holdout method in which the holdout method is repeated k times.

5. Algorithm To Find The Strong Association Rule

This section summarizes the input and output of the algorithm. The main input parameters are $k$ (maximum rule size) and $t$ (number of train/test cycles), together with the support, confidence, and lift thresholds. In general, the training sample fraction is $\tau = 50\%$. The output is a set of rules $R$ that is valid on all $t$ test sets. Building the training and test samples is repeated several times. The association rule algorithm produces different sets of rules with different training/test samples, and each set of rules has slightly lower or higher metrics. We want to find the rules that are valid on both $D_{Train}$ and $D_{Test}$ in general. This motivates repeating the train/test process $t$ times to achieve basic cross-validation and to compute averages of the rule metrics.

Algorithm
Input: support threshold $\psi$, confidence threshold $\alpha$, lift threshold $\lambda$
Parameters: number of train/test cycles ($t$), training sample fraction ($\tau$)
Output: valid rule set $R$
Step 1: For $I = 1$ to $t$ do
    Step 2: Partition $D$ into $D_{Train}$ and $D_{Test}$ based on $\tau$.
    Step 3: Generate the 1-itemsets. Search for frequent itemsets of sizes $1, 2, \ldots, k$ on $D_{Train}$ using $\psi$.
    Step 4: Generate the rules using the minimum confidence $\alpha$ and the minimum lift $\lambda$. Let the rule set be $R_{Train}$.
    Step 5: Validate the rules $R_{Train}$ on $D_{Test}$. Set $R_{Test} = R_{Train}$.
        For each rule $X \Rightarrow Y \in R_{Test}$:
            Compute the test support $s(X \Rightarrow Y)$ on $D_{Test}$.
            Compute the test confidence $c(X \Rightarrow Y)$ on $D_{Test}$.
            Compute the test lift $l(X \Rightarrow Y)$ on $D_{Test}$.
        Eliminate rules from $R_{Test}$ such that $s(X \Rightarrow Y) < \psi$ or $c(X \Rightarrow Y) < \alpha$ or $l(X \Rightarrow Y) < \lambda$.
Step 6: Get the intersection of the $t$ rule sets and produce the rule metrics with (1), (2), (3).
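
The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: frequent itemsets are found by brute-force enumeration rather than an Apriori-style search, transactions are assumed to be `frozenset` objects, $t \geq 1$ is assumed, and the names `support`, `rule_metrics`, `mine_rules`, and `validated_rules` are hypothetical. The rule metrics follow the definitions of Section 2, and the returned values are the per-rule averages of the test metrics over the $t$ runs.

```python
import random
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_metrics(x, y, transactions):
    """Return (support, confidence, lift) of the rule X => Y; zeros if undefined."""
    s_xy = support(x | y, transactions)
    s_x, s_y = support(x, transactions), support(y, transactions)
    if s_x == 0 or s_y == 0:
        return 0.0, 0.0, 0.0
    conf = s_xy / s_x
    return s_xy, conf, conf / s_y

def mine_rules(train, k, psi, alpha, lam):
    """Steps 3-4 by brute force (assumption: not Apriori): rules of size <= k on train."""
    items = sorted(set().union(*train))
    rules = set()
    for size in range(2, k + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            if support(itemset, train) < psi:
                continue  # itemset is not frequent
            for r in range(1, size):
                for antecedent in combinations(combo, r):
                    x = frozenset(antecedent)
                    y = itemset - x
                    _, conf, lift = rule_metrics(x, y, train)
                    if conf >= alpha and lift >= lam:
                        rules.add((x, y))
    return rules

def validated_rules(D, k, t, psi, alpha, lam, tau=0.5, seed=0):
    """Rules valid on all t test sets, mapped to averaged (support, confidence, lift)."""
    rng = random.Random(seed)
    surviving, history = None, []
    for _ in range(t):                       # Step 1
        shuffled = list(D)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * tau)   # Step 2
        train, test = shuffled[:n_train], shuffled[n_train:]
        kept = {}
        for x, y in mine_rules(train, k, psi, alpha, lam):
            s, c, l = rule_metrics(x, y, test)   # Step 5: recompute on the test set
            if s >= psi and c >= alpha and l >= lam:
                kept[(x, y)] = (s, c, l)
        history.append(kept)
        surviving = set(kept) if surviving is None else surviving & set(kept)
    # Step 6: intersection over the t runs, with metrics averaged over the test sets
    return {rule: tuple(sum(run[rule][i] for run in history) / t for i in range(3))
            for rule in surviving}

# Hypothetical toy database of 40 transactions.
D = ([frozenset(["bread", "milk"])] * 20
     + [frozenset(["bread", "milk", "eggs"])] * 10
     + [frozenset(["eggs"])] * 10)
rules = validated_rules(D, k=2, t=3, psi=0.1, alpha=0.7, lam=1.0, seed=1)
for (x, y), (s, c, l) in sorted(rules.items(), key=lambda kv: sorted(kv[0][0])):
    print(sorted(x), "=>", sorted(y), round(s, 3), round(c, 3), round(l, 3))
```

Brute-force enumeration keeps the sketch short but grows combinatorially with the number of items; a level-wise search that prunes infrequent itemsets would be the usual replacement for `mine_rules` on real data.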

The algorithm generates all associations up to size $k$. It eliminates unreliable and particular rules by computing metrics on $D_{Test}$, producing a subset of rules $R_I$ that remains valid on test set $I$. The process of creating train/test samples and of discovering and validating rules is repeated $t$ times, with $t$ being a user-specified parameter. This process generates $t$ independent training sets and $t$ independent test sets. These $t$ sets produce different sets of rules that have some rules in common, but also different rules. At the end, the algorithm computes a rule set that is the intersection of the $t$ rule sets, further eliminating rules that may be particular to one run or that are not valid in general. The metrics of each rule are computed as averages of the test metrics over the $t$ test sets. Let $D_I$ be the $I$-th test set, and let $X \Rightarrow Y$ be a valid rule appearing in all $t$ sets. Then

$$S(X \Rightarrow Y) = \frac{1}{t} \sum_{I=1}^{t} s(X \Rightarrow Y, D_I) \qquad (1)$$

$$C(X \Rightarrow Y) = \frac{1}{t} \sum_{I=1}^{t} c(X \Rightarrow Y, D_I) \qquad (2)$$

$$L(X \Rightarrow Y) = \frac{1}{t} \sum_{I=1}^{t} l(X \Rightarrow Y, D_I) \qquad (3)$$

6. Performance Analysis

The algorithm was applied to two kinds of datasets: medical data and census data. A summary of the results follows. The training sample fraction was set at $\tau = 50\%$. Every time the algorithm is run, new samples are created. The first parameter is the maximum association size; we used $K \in \{2, 3, 4\}$. In the second set of experiments, to get simple rules, $K = 4$. A lower $K$ produces fewer and simpler rules. A higher $K$ significantly increases the number of rules, and they become more complex. Association rule mining used the following thresholds for the metrics. The minimum support was fixed at ψ = 1 % 3. The validated rules were produced using high confidence: based on the experiments and the domain expert's opinion, the minimum confidence was set to $\alpha = 70\%$. The lift threshold was $\lambda = 1$.

We concentrate on studying the impact on the training set $D_{Train}$, since it is the one used to build the predictive association rules and it is the most time-demanding. $D_{Train}$ is built as a random sample from $D$ every time, so each run generates a different set of rules. The association rules are tested for generality and validity by partitioning the input dataset into a training set and a test set. A valid rule must meet the minimum metrics on both sets.

We would like to understand the extent to which the number of rules is reduced by varying $K$ or $\psi$. The experiments show the importance of filtering rules on the test set when varying $K$. Table I summarizes the results. The reduction in the number of associations is small, about 10%-15%. The reduction becomes much more important for the number of rules. For $K = 2$, the impact is small in most cases, which indicates that most rules can be generalized. For $K = 3$, the reduction is more than 50%, providing evidence that many rules are particular to the training sample. At $K = 4$, the number of rules in the test sets is about 30% of the total, a reduction of about 70%, providing evidence that most rules may be particular to the training set. The trend indicates there will be a combinatorial explosion of rules that are valid only on the training set. Time grows fast as the rule size $K$ increases.

The difference in the relative number of patterns for associations and rules can be explained by the fact that associations are filtered in $D_{Train}$ and $D_{Test}$ based only on support, whereas rules require support, confidence, and lift to be greater than or equal to the respective thresholds in both $D_{Train}$ and $D_{Test}$. Therefore validation is stricter for rules than for associations.

Table I contains the summary of results. At high support levels, the reduction in the number of rules is about 40%; at low support levels, the number of rules goes down to less than 35%. This indicates that many rules are eliminated because they do not meet the minimum metrics in the test sets. The last column in Table II contains total elapsed times in seconds. Time growth is not as fast as when varying $K$, because test set validation significantly reduces the number of patterns.

Table I. Number of associations and rules in $D_{Train}$ and $D_{Test}$.
Columns: K; ψ; No. of associations (Train, Test); No. of rules (Train, Test); Time.

Table II. Number of rules in $D_{Train}$ and $D_{Test}$ when varying the minimum support ψ.
Columns: K; ψ; Train; Test; Time.

7. Conclusion

This paper focused on two main research issues. The first is the large number of rules obtained by the standard association rule algorithm. The second is the validation of rules on an independent set, which is required to eliminate unreliable rules or rules that cannot be generalized. To validate rules, we used the train and test approach, which uses two disjoint samples from a data set to search for and validate rules. The algorithm performs several train and test cycles to achieve basic cross-validation and to reduce the number of rules with poor generalization potential. Experiments on a real data set studied the impact of constraints and of the elimination of unreliable rules through validation on the test set. The reduction in output size provided by validation is significant.
In the future, more rigorous techniques can be applied, such as train/validate/test, ten-fold cross-validation, stratified cross-validation, leave-one-out cross-validation, and bagging (bootstrap aggregation). These methods may prove more efficient than the present process.

References

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2005 reprint.
2. T. M. Mitchell, Machine Learning. New York: McGraw-Hill.
3. T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, 1st ed. New York: Springer-Verlag, 2001.
4. D. Michie, D. J. Spiegelhalter, and C. C. Taylor (eds.), Machine Learning, Neural and Statistical Classification. February 17.
5. Carlos Ordonez, "Association rule discovery with train and test approach for heart disease prediction," IEEE Transactions on Information Technology in Biomedicine, vol. 10, no. 2, April.
6. R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. ACM SIGMOD Conf., 1993.
7. R. Agrawal, T. Imielinski, and A. N. Swami, "Mining association rules between sets of items in large databases," in Peter Buneman and Sushil Jajodia (eds.), Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA. ACM Press.
8. C. Borgelt, "Efficient Implementations of Apriori and Eclat," Proc. 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations (FIMI 2003, Melbourne, FL). CEUR Workshop Proceedings 90, Aachen, Germany.

Authors

Dr. E. Ramaraj is presently working as Director, Computer Centre, at Alagappa University, Karaikudi. He has 20 years of teaching experience and 5 years of research experience. He has presented research papers at more than 20 national and international conferences and published more than 30 papers in national and international journals. His research areas include data mining and network security.

N. Venkatesan is working as Assistant Professor and Head of the Information Technology Department at Bharathiyar College of Engineering and Technology, Karaikal, Pondicherry. He holds a Bachelor's degree in Mathematics, a Master's degree in Computer Applications from Bharathidasan University, an M.Tech. from Allahabad Agricultural University, and an M.Phil. in Computer Science, and is currently pursuing a Ph.D. He is a member of ISTE. He has published 9 papers in national and international conferences and a book on data mining and warehousing. He guides projects for Engineering and MCA students and supervises M.Phil. scholars.

K. Rameshkumar is presently a full-time Ph.D. scholar at the Dept. of CSE, Alagappa University, Karaikudi. He has more than 3 years of industrial experience. He has participated in and presented research papers at various national and international conferences. His research interests are data mining, e-security, and e-learning.
