Using Association Rules for Better Treatment of Missing Values


SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG
Department of Computer Science (Machine Intelligence Group)
National University of Computer and Emerging Sciences
A.K. Brohi Road, H-11/4, Islamabad, Pakistan
shariq.bashir@nu.edu.pk, rauf.baig@nu.edu.pk

Abstract: - The quality of training data for knowledge discovery in databases (KDD) and data mining depends on many factors, but the handling of missing values is considered a crucial factor in overall data quality. Real-world datasets contain missing values due to human and operational error, hardware malfunction, and many other factors. The quality of the extracted knowledge, and of the learning and decision problems built on it, depends directly on the quality of the training data. Given the importance of handling missing values in KDD and data mining tasks, in this paper we propose a novel Hybrid Missing values Imputation Technique (HMiT) that combines association rule mining with a k-nearest neighbor approach. To check the effectiveness of HMiT, we also present detailed experimental results on real-world datasets. Our results suggest that HMiT is not only better in terms of accuracy, but also takes less processing time than the current best missing values imputation technique based on the k-nearest neighbor approach, which shows the effectiveness of our imputation technique.

Key-Words: - Quality of training data, missing values imputation, association rules mining, k-nearest neighbor, data mining

1 Introduction

The goal of knowledge discovery in databases (KDD) and data mining algorithms is to form generalizations from a set of training observations, and to construct learning models such that the classification accuracy on previously unobserved observations is maximized. For all kinds of learning algorithms, the maximum accuracy is usually determined by two important factors: (a) the quality of the training data, and (b) the inductive bias of the learning algorithm. The quality of training data depends on many factors [1], but the handling of missing values is considered a crucial factor in overall data quality. For many real-world applications of KDD and data mining, even when there are huge amounts of data, the subset of cases with complete data may be relatively small. Training as well as testing samples can have missing values. Missing data may arise for different reasons, such as refusal to respond, faulty data collection instruments, data entry problems, and data transmission problems.

Missing data is a problem that continues to plague data analysis methods. Even as analysis methods gain sophistication, we continue to encounter missing values in fields, especially in databases with a large number of fields. The absence of information is rarely beneficial: all things being equal, more data is almost always better. We should therefore think carefully about how we handle the thorny issue of missing data. Due to the frequent occurrence of missing values in training observations, imputation or prediction of the missing data has always remained at the center of attention of the KDD and data mining research community [13]. Imputation denotes the procedure of replacing missing values by exploiting the relationships present in the observations.
One main advantage of imputing missing values during the preprocessing step is that the missing data treatment becomes independent of the learning algorithm. In [6, 7], imputing missing values using a prediction model is proposed. To impute the missing values of an attribute X, a prediction model is first constructed by treating X as the class label and the other attributes as inputs to the model. Once the prediction model is constructed, it is used to predict the missing values of X. The main advantages of this approach are that it is very useful when strong attribute relationships exist in the training data, and that it is very fast and efficient compared to the k-nearest neighbor approach: the imputation processing time depends only on the construction of the prediction model, after which each missing value is imputed in constant time. On the other hand, this approach has two main drawbacks. First, if no relationship exists between the attributes with missing data and the other attributes in the dataset, the prediction model will not be suitable for estimating the missing values. Second, the predicted values are usually better behaved than the true values, since they are derived entirely from the other attributes, which can bias the data.

In [6] and [7], another important imputation technique, based on the k-nearest neighbor, is used to impute missing values for both discrete and continuous attributes. It uses majority voting for categorical attributes and the mean value for continuous attributes. The main advantage of this technique is that it does not require a predictive model for the missing values imputation of an attribute. Its major drawbacks are the choice of an exact distance function, the need to consider all attributes when attempting to retrieve similar examples, and the need to search through the whole dataset to find instances of the same type, which together require a large processing time for missing values imputation in the preprocessing step.
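As a minimal sketch of the prediction-model approach (illustrative only, not the implementation of [6, 7]; the decision tree classifier and the integer encoding of attributes are our own assumptions), one could train a model on the complete rows and predict the missing entries:

```python
# Minimal sketch of prediction-model imputation (illustrative assumptions:
# attributes are integer-encoded categories, missing cells are np.nan).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def impute_with_model(data, target_col):
    """Impute missing values of one column from the remaining columns."""
    data = data.astype(float).copy()
    others = [c for c in range(data.shape[1]) if c != target_col]
    complete = ~np.isnan(data).any(axis=1)                # fully observed rows
    to_fill = (np.isnan(data[:, target_col])
               & ~np.isnan(data[:, others]).any(axis=1))  # only target missing
    model = DecisionTreeClassifier().fit(
        data[complete][:, others], data[complete][:, target_col])
    data[to_fill, target_col] = model.predict(data[to_fill][:, others])
    return data
```

Once the model is built, each missing value is filled by a single predict call, which is what makes this approach fast compared to searching for neighbors in the whole dataset.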
To overcome the limitations of missing values imputation using a prediction model, and to decrease the processing time of the missing values treatment in the preprocessing step, we propose a new hybrid missing values imputation technique (HMiT) that uses association rule mining [2, 3, 9] in combination with a k-nearest neighbor approach [6, 7]. Association rule mining is one of the major techniques of data mining, and it is perhaps the most common form of local-pattern discovery in unsupervised learning systems. Different forms of association rules have already been applied successfully to classification [9, 12], sequential patterns [4], and fault-tolerant patterns [14]. In [9, 12], extensive experimental results on real-world datasets show that classification using association rule mining can be more accurate than neural networks, Bayesian classification, and decision trees. In HMiT, missing values are imputed from association rules by comparing the known attribute values of the observation with the antecedent parts of the rules. When no rule exists or fires against the missing value of an observation (i.e., no attribute relationship exists in the training data), the missing value is imputed using the k-nearest neighbor approach. Our experimental results suggest that this imputation technique not only increases the accuracy of missing values imputation, but also substantially decreases the processing time of the preprocessing step.

2 Hybrid Missing Values Imputation Technique (HMiT)

Let I = {i_1, ..., i_n} be the set of n distinct items. Let TDB denote the training dataset, in which each record t has a unique identifier and contains a set of items such that t ⊆ I. An association rule is an expression A → B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. Here, A is called the antecedent of the rule and B its consequent. For example, a rule such as {Outlook = sunny, Humidity = high} → {Windy = false} could fill in a missing Windy value for an observation whose known attributes include Outlook = sunny and Humidity = high.

The technique of HMiT is divided into two phases: (a) first, missing values are imputed on the basis of association rules, by comparing the antecedent parts of the rules with the known attribute values of the observation containing the missing value; (b) when no association rule exists or fires against a missing value (i.e., no relation exists between the attributes of the other observations and the attributes of the observation with the missing value), the missing value is imputed using the k-nearest neighbor imputation technique of [6, 7]. The main reason for using the k-nearest neighbor approach in the hybrid combination is that it is considered more robust against noise, and against the case in which the relationships between the observations of the dataset are very weak [7].

Figure 1 shows the framework of our hybrid imputation approach. At the start of the imputation process, a set of strong association rules (with good support and confidence) is created from the given training dataset using support and confidence thresholds. Once the association rules are created, HMiT uses them for missing values imputation. For each observation X with missing values, rules are fired by comparing the known attribute values of X with the antecedent parts of the association rules one by one. If the antecedent of a rule R is a subset of the known attribute values of X, then R is added to the fired set F. Once all association rules have been checked against the missing value of X, the set F is considered for imputation: if F is non-empty, the median (for numeric attributes) or the mode (for discrete attributes) of the consequent parts of the rules in F is used to impute the missing value. If the set F is empty, the missing value of X is imputed using the k-nearest neighbor approach.
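To make the rule-generation phase concrete, the following toy sketch mines single-consequent rules by brute force (our own illustration: items are assumed to be (attribute, value) pairs; a real implementation would use an efficient miner such as Apriori [3] or Ramp [5]):

```python
# Brute-force mining of association rules A -> b with a single-item consequent.
# Toy sketch for small data only; transactions are sets of (attribute, value) pairs.
from itertools import combinations

def mine_rules(transactions, min_sup, min_conf, max_len=3):
    n = len(transactions)
    support = lambda itemset: sum(itemset <= t for t in transactions) / n
    items = sorted({i for t in transactions for i in t})
    rules = []
    for k in range(2, max_len + 1):
        for candidate in combinations(items, k):
            sup = support(set(candidate))
            if sup < min_sup:
                continue
            for consequent in candidate:              # single-item consequent b
                antecedent = set(candidate) - {consequent}
                conf = sup / support(antecedent)      # conf = sup(A ∪ b) / sup(A)
                if conf >= min_conf:
                    rules.append((antecedent, consequent, sup, conf))
    return rules
```

A rule (antecedent, consequent, sup, conf) is "strong" in the sense used above when both sup and conf clear the chosen thresholds.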

Scan the dataset → create association rules at the given support and confidence thresholds → for each observation of the training data with a missing value: impute the missing value using the association rules; if no relationship exists, impute the value using the k-nearest neighbor approach.

Fig. 1. The framework of our HMiT missing values imputation technique.

HybridImputation (Support min_sup, Confidence min_conf)
1.  Generate frequent itemsets (FI) at the given min_sup.
2.  Generate association rules (AR) from FI at the given min_conf.
3.  for each training observation X with at least one missing attribute value
4.      for each rule R in the association rule set AR
5.          compare the antecedent part of R with the known attribute values of X
6.          if antecedent(R) ⊆ {known attribute values of X}
7.              add R to the fired set F
8.      if F ≠ ∅
9.          if the missing value of X is discrete, use the mode of the consequents in F to impute it.
10.         if the missing value of X is continuous, use the median of the consequents in F to impute it.
11.     else
12.         impute the missing value using the k-nearest neighbor approach.

Fig. 2. Pseudo code of the HMiT missing values imputation technique.
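For concreteness, the following is a minimal Python rendering of this pseudo code for a single observation (a sketch under our own data-representation assumptions, not the original Visual C++ implementation; mine_rules is the toy generator sketched above, and the k-nearest neighbor fallback is simplified to an overlap distance over complete cases):

```python
# Sketch of HMiT (Fig. 2) for one observation with one missing attribute.
# Observations are dicts {attribute: value}; rules come from mine_rules above.
from statistics import median, mode

def hmit_impute(obs, missing_attr, rules, train, discrete=True, k=10):
    known = {(a, v) for a, v in obs.items() if a != missing_attr}
    # Phase (a): fire every rule whose antecedent is a subset of the known
    # values and whose (single-item) consequent concerns the missing attribute.
    fired = [value for antecedent, (attr, value), sup, conf in rules
             if attr == missing_attr and antecedent <= known]
    if fired:                                    # the fired set F is non-empty
        return mode(fired) if discrete else median(fired)
    # Phase (b): no rule fired -- fall back to k-nearest neighbor imputation
    # (simplified: the distance counts mismatching known attributes).
    def distance(t):
        return sum(t.get(a) != v for a, v in obs.items() if a != missing_attr)
    neighbors = sorted((t for t in train if t.get(missing_attr) is not None),
                       key=distance)[:k]
    values = [t[missing_attr] for t in neighbors]
    return mode(values) if discrete else sum(values) / len(values)
```

The default k = 10 mirrors the neighborhood size used in our experiments (Section 3.1).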

The pseudo code of HMiT is given in Figure 2. Lines 1 to 2 create the association rules at the given support and confidence thresholds. Lines 4 to 7 compare the antecedent parts of all association rules with the known attribute values of the missing value observation X; if the antecedent of a rule R is a subset of the known attribute values of X, then R is added to the set F. If the set F is non-empty, the missing values are imputed from the consequent parts of the fired association rules in lines 9 and 10; otherwise they are imputed using the k-nearest neighbor approach in line 12.

3 Experimental Results

To evaluate the performance of HMiT we perform our experiments on benchmark datasets available at [15]. A brief description of each dataset is given in Table 1. To validate the effectiveness of HMiT, we add random missing values to each of our experimental datasets. For performance reasons, we use the Ramp [5] algorithm for frequent itemset mining and association rule generation. All the code of HMiT is written in Visual C++ 6.0, and the experiments are performed on a 1.8 GHz machine with 256 MB of main memory, running Windows XP.

Table 1. Details of our experimental datasets

Dataset          Instances   Attributes   Classes
Car Evaluation   1728        6            4
CRX              690         16           2
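The random-missing-values protocol can be sketched as follows (our own illustration of the setup described above; the cell-level masking and the accuracy measure are assumptions consistent with the text):

```python
# Inject random missing values into a dataset copy, keeping the ground truth
# so that imputation accuracy can be measured afterwards (sketch).
import random

def inject_missing(data, percent, seed=0):
    """data: list of dict observations; percent: share of cells to blank out."""
    rng = random.Random(seed)
    data = [dict(row) for row in data]                    # work on a copy
    cells = [(i, a) for i, row in enumerate(data) for a in row]
    hidden = {}
    for i, a in rng.sample(cells, int(len(cells) * percent / 100)):
        hidden[(i, a)] = data[i][a]                       # remember true value
        data[i][a] = None                                 # mark as missing
    return data, hidden

def imputation_accuracy(imputed, hidden):
    hits = sum(imputed[i][a] == true for (i, a), true in hidden.items())
    return hits / len(hidden)
```

Accuracy here is simply the fraction of hidden cells whose imputed value equals the original one.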

3.1 Effect of Percentage of Missing Values on Missing Values Imputation

Fig. 3. Effect of random missing values on missing values imputation accuracy (imputation accuracy vs. missing values %, for both experimental datasets).

The effect of the percentage of missing values on imputation accuracy is shown in Figure 3 for HMiT and the k-nearest neighbor approach. The results show that as the percentage of missing values increases, the accuracy of predicting the missing values decreases. We obtained the Figure 3 results by fixing the support and confidence thresholds at 40 and 60% respectively, and the nearest neighbor size at 10; the reason for using a nearest neighbor size of 10 is described in [1]. From the results it is clear that HMiT produces better results than the k-nearest neighbor approach at all levels.

3.2 Effect of Confidence Threshold on Missing Values Imputation

Fig. 4. Effect of confidence threshold on missing values imputation accuracy (imputation accuracy vs. confidence %, for both experimental datasets).

The effect of the confidence threshold on missing values imputation accuracy is shown in Figure 4. We obtained the Figure 4 results by fixing the support threshold at 40 and inserting random missing values at 20%, while the confidence threshold was varied from 20% to 100%. For clarity, we exclude the k-nearest neighbor accuracy from the Figure 4 results. The results show that as the confidence threshold increases, the accuracy of correctly predicting missing values also increases: at higher confidence levels only strong rules are generated, and they predict the missing values more accurately, whereas at lower confidence levels more weak or exceptional rules are generated, which results in much lower accuracy. From our experiments we suggest that a confidence threshold between 60% and 70% generates good results.

3.3 Effect of Support Threshold on Missing Values Imputation

Fig. 5. Effect of support threshold on missing values imputation accuracy (imputation accuracy vs. support threshold, for both experimental datasets).

The effect of the support threshold on missing values imputation accuracy is shown in Figure 5. We obtained the Figure 5 results by fixing the confidence threshold at 60% and inserting random missing values at 20%. Again, for clarity, we exclude the k-nearest neighbor accuracy from the Figure 5 results. The results show that as the support threshold decreases, more missing values are predicted by association rules, because more association rules are generated; conversely, as the support threshold increases, fewer rules are generated and fewer missing values are predicted by association rules.

3.4 Performance Analysis of HMiT and the k-nearest neighbor approach

Fig. 6. Performance analysis of HMiT and the k-nearest neighbor approach at different missing values levels (processing time in seconds vs. missing values %, for both experimental datasets).

The results in Figure 6 show that HMiT performs better in terms of processing time than the k-nearest neighbor approach.

4 Conclusion

Missing value imputation is a complex problem in KDD and data mining tasks. In this paper we presented HMiT, a novel approach for missing values imputation based on association rule mining in combination with a k-nearest neighbor approach. To analyze the effectiveness of HMiT, we performed detailed experiments on benchmark datasets. Our results suggest that missing values imputation using our technique has good potential in terms of accuracy, and that it is also a good technique in terms of processing time.

References:
[1] E. Acuna and C. Rodriguez, The treatment of missing values and its effect in the classifier accuracy, Clustering and Data Mining Applications, Springer-Verlag Berlin-Heidelberg, pp. 639-648, 2004.
[2] R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, In Proc. of ACM SIGMOD Int'l Conf. on Management of Data, pp. 207-216, May 1993.
[3] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, In Proc. of Int'l Conf. on Very Large Data Bases, pp. 487-499, Sept. 1994.
[4] R. Agrawal and R. Srikant, Mining Sequential Patterns, In Proc. of Int'l Conf. on Data Engineering, pp. 3-14, Mar. 1995.
[5] S. Bashir and A. R. Baig, Ramp: High Performance Frequent Itemset Mining with Efficient Bit-vector Projection Technique, In Proc. of 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006), Singapore, 2006.
[6] G. Batista and M. Monard, An Experimental Comparison of K-Nearest Neighbour and Mean or Mode Imputation Methods with the Internal Strategies Used by C4.5 and CN2 to Treat Missing Data, Tech. Report, ICMC-USP, Feb. 2003.
[7] G. Batista and M. Monard, K-Nearest Neighbour as Imputation Method: Experimental Results, Tech. Report 186, ICMC-USP, 2002.
[8] I. Duntsch and G. Gediga, Maximum consistency of incomplete data via non-invasive imputation, Artificial Intelligence Review, vol. 19, pp. 93-107, 2003.
[9] H. Hu and J. Li, Using Association Rules to Make Rule-based Classifiers Robust, In Proc. of 16th Australasian Database Conference (ADC), 2005.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
[11] W.Z. Liu, A.P. White, S.G. Thompson and M.A. Bramer, Techniques for Dealing with Missing Values in Classification, Lecture Notes in Computer Science, vol. 1280, 1997.
[12] B. Liu, W. Hsu and Y. Ma, Integrating Classification and Association Rule Mining, In Proc. of KDD'98, New York, NY, Aug. 1998.
[13] M. Magnani, Techniques for Dealing with Missing Data in Knowledge Discovery Tasks, University of Bologna, 40127 Bologna, Italy.
[14] J. Pei, A.K.H. Tung and J. Han, Fault-Tolerant Frequent Pattern Mining: Problems and Challenges, In Proc. of ACM-SIGMOD Int'l Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'01), 2001.
[15] UCI Repository of Machine Learning Databases, www.ics.uci.edu.