Improved Fuzzy-Optimally Weighted Nearest Neighbor Strategy to Classify Imbalanced Data


Harshita Patel 1*, Ghanshyam Singh Thakur 1
1 Department of Mathematics and Computer Applications, Maulana Azad National Institute of Technology, Bhopal, India
* Corresponding author's e-mail: hpatel.sati@gmail.com

Abstract: Learning from imbalanced data is one of the burning issues of the era. Traditional classification methods show degraded performance on imbalanced data sets because of the skewed distribution of data into classes. Among the various suggested solutions, instance-based weighted approaches have secured a place in such cases. In this paper we propose a new fuzzy weighted nearest neighbor method that optimally handles the imbalance in data. The use of optimal weights improves the performance of the fuzzy nearest neighbor algorithm for the default balanced distribution of data; for the classification of imbalanced data, the concept of adaptive K is merged with it, applying a large K (number of nearest neighbors) to the large class and a small K to the small class. We deploy this combination to classify imbalanced data with better accuracy under different evaluation measures. Experimental results affirm that the proposed method performs better than traditional fuzzy nearest neighbor classification on these types of data sets.

Keywords: Fuzzy classification, Imbalanced datasets, K-nearest neighbors, Optimal solution.

1. Introduction

Data mining is a very popular and continuously growing research field with various application areas. The ever-increasing data of the modern scientific era makes data mining all the more significant for reaching new milestones. Core data mining techniques and their applications have provided efficient solutions for decision making to the scientific community and continue to facilitate it with their analytical power [1, 2]. New developments in technology uncover new challenges for data mining researchers. Learning from imbalanced data is one of the most important real-world classification issues related to the distribution of data. Unequal distribution of data into classes severely affects the performance of traditional classifiers, because these classifiers are designed by default for a balanced distribution of data; the result is misclassification, and the classification accuracy is biased towards the larger class even when the smaller class is of more interest [3]. In the context of imbalanced data, the class with a large quantity of examples is called the majority class and the one with few examples is known as the minority class. The issue of imbalance is discussed in detail by He et al. [3], who provide a good literature survey along with various aspects of and possibilities for learning from imbalanced data. The classification of imbalanced data is counted among the top ten challenging issues of data mining [4] and is also considered a new and frequently explored trend in data mining [5]. Real-world data applications usually suffer from imbalance due to the natural distribution of their data, so the data is obviously not in the form that traditional classification algorithms expect. Medical diagnosis [6, 7], oil-spill detection [8], credit card fraud detection [9], culture modeling [10], network intrusion, text categorization, helicopter gearbox fault monitoring [3], etc. are some examples of imbalanced real-world data that need to be treated in a special manner, so that classification accuracy is computed correctly and the results do not become hazardous.

Worldwide research is ongoing to treat the sensitive issue of data imbalance, and various solutions have been suggested, such as data balancing by re-sampling methods, alteration of traditional classification algorithms, cost-sensitive analysis and ensemble techniques [3]. Alteration of traditional classification algorithms is one good solution when dealing with such data: the classic algorithms are altered to the point where the accuracy of the minority class is found along with that of the majority class, and the overall accuracy improves with both class accuracies. This classifies the data in its natural distribution, without the information replication or loss that sometimes occurs in re-sampling techniques. Algorithm modification is also readily applicable where the associated costs are not known. Almost all traditional classification algorithms have been re-proposed in modified versions for imbalanced datasets, in both crisp and fuzzy forms, and improvements continue; instance-based learning is one of them. In this paper we discuss our proposed nearest neighbor approach with fuzzy weights.

In instance-based learning, such as nearest neighbor approaches, no classifier is prepared in advance. The nearest neighbor algorithm is properly discussed in [11, 12]. Fuzzy logic makes the algorithm more efficient, and weights put a stronger hold on it. For imbalanced data, optimal weights combined with the fuzzy concept provide the opportunity to achieve better classification for both classes. The default nearest neighbor rule considers equal weights for all data instances and so may lead to misclassification; the weighted concept deals with this. The performance of nearest neighbor classification depends on the chosen weighting strategy, i.e. how we choose the weights. In optimally weighted fuzzy nearest neighbor this becomes more reliable and least biased, because the weighting is based on kriging, a best linear estimator [29]. However, with the skewed distribution of imbalanced datasets, the optimally weighted fuzzy nearest neighbor does not provide the expected results, just like other traditional classification approaches. The adaptive concept of K, i.e. large K for large classes and small K for small classes, proposed in [28], deals very well with imbalanced text data. Our proposed method combines these two beneficial approaches to refine the classification of imbalanced data. The experiments section shows, through comparison with another method, that the proposed methodology performs well on different evaluation measures.

The paper is organized as follows. Section 2 reviews related work. Section 3 provides fundamental basics. Section 4 is dedicated to our proposed algorithm, followed by the experiments and result discussion in Section 5. The conclusion and future scope are given in Section 6, and the references are arranged at the end.

2. Related Work

Traditional classification algorithms are designed for a default balanced distribution of data into classes. So when these algorithms face skewed or unevenly distributed data, i.e. a large quantity of data in one class while the others have just a few data elements, accuracy is biased towards the majority class. Among the solutions to this issue, re-sampling is widely used, though information loss occurs in under-sampling and data redundancy increases in over-sampling. Similarly, cost-sensitive approaches are obviously applicable only when costs are given.
Modification of the algorithm is a popular route in the classification of imbalanced data because these methods do not alter the original distribution of the data, and such approaches perform well in various ways. Nearest neighbor algorithms are the simplest way to find the class of an unknown instance, and with appropriate modifications they serve imbalanced data as well. Prati et al. [13] evaluated the performance of classifiers under different degrees of imbalance with their proposed evaluation model. They proposed a confidence-interval method to inspect classifier performance statistics and concluded that a high degree of imbalance results in high misclassification, and vice versa. This section reviews the work done on the imbalance issue with modified nearest neighbor approaches; various modified nearest neighbor algorithms have been proposed in crisp and fuzzy manners, with and without weights, and some are discussed here. CCNND, a single-class algorithm that minimizes the classification cost, was proposed by Kriminger et al. [14]; they applied the local geometric structure of the data for this purpose, and the algorithm is applicable to multiclass data too. Tomasev et al. [15] argued, with reference to the hubness effect in nearest neighbor methods, that minority class instances are responsible for misclassification in high-dimensional data, unlike in low- and medium-dimensional data, where the majority classes are mostly the reason for misclassification. Ryu et al. [16] proposed HISNN, an instance-based hybrid selection using nearest neighbors for cross-project defect prediction, where class imbalance exists in the source and target projects. The method works in such a way that local learning is done with the nearest neighbor rule while naïve Bayes is used for global learning; as a result, it is very efficient in achieving high performance in software defect prediction.

Weighted approaches give good results for this purpose. A class-based weighted nearest neighbor approach was proposed by Dubey et al. [17], in which the distribution of the nearest neighbors of the test instance becomes the basis for the weight calculation. Patel et al. [18] proposed a hybrid neighbor-weighted approach by merging the adaptive concept with the neighbor-weighted strategy, i.e. large weights for small classes and small weights for large classes, with a different K per class. A convex optimization technique with a strong mathematical basis was proposed by Ando [19] to find weights that improve a non-linear performance measure on the training data. Liu et al. [20] proposed class confidence weights for imbalanced learning; the weight prototypes were based on posterior probabilities obtained from attribute probabilities, computed with mixture modelling and Bayesian networks.

In the fuzzy scenario, a few algorithms have been proposed for imbalanced data with the nearest neighbor concept. Ramentol et al. [21] proposed a fuzzy rough ordered weighted average nearest neighbor method for binary classification, with six weight vectors blended with indiscernibility relations. Fernandez et al. [22] analyzed fuzzy rule-based classification systems for imbalanced data sets; for better classification results, adaptive parametric conjunction operators were applied for different imbalance ratios. Han et al. [23] proposed a nearest neighbor approach based on fuzzy-rough properties to minimize the bias that occurs due to the majority class. Liu et al. [24] proposed a coupled fuzzy K-nearest neighbor approach for unevenly distributed categorical data instances, where strong bonds exist among classes, attributes and other examples. Patel et al. [25] proposed an improved weighted algorithm in which the assignment of large weights to small classes and small weights to larger classes became more efficient when merged with fuzzy logic.

3. Basic Milestones

Before moving to the proposed method, it is necessary to be acquainted with the fundamental terminology and its usage. The proposed algorithm is a systematic enhancement of the fuzzy K nearest neighbor concept with optimal weights in an adaptive-K manner, so we need to know about K nearest neighbors, fuzzy K nearest neighbors, the adaptive K strategy and optimal weights. The nearest neighbor approach is a classification method that comes under lazy learning [1], in which no classifier model is prepared in advance, unlike in eager learning algorithms. The whole training data is kept for the classification of test instances, which is done by measuring the distance of a test instance to all training objects to find a certain number of nearest neighbors, say K. By nearest neighbors we mean the training instances that have the minimum distances to the test instance to be classified. After the K nearest neighbors of a test instance are found, the class with the maximum number of nearest neighbors among them is assigned to the instance. The distance measure and the value of K may vary as per the researcher's requirements; more discussion is given in [12, 26].
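To make the lazy-learning idea concrete, the following is a minimal sketch of the crisp K nearest neighbor rule just described, written in Python with illustrative names (knn_classify, X_train, etc.); the paper itself does not prescribe any particular implementation.

```python
# A minimal sketch of the crisp K-nearest-neighbor rule: no model is built
# in advance; the test instance is compared with every training instance and
# the majority class among the K closest wins. Names are illustrative only.
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, K=5):
    # Euclidean distance from the test instance to all training objects
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the K training instances with the smallest distances
    nearest = np.argsort(distances)[:K]
    # Class with the maximum number of nearest neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```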
The fuzzy K nearest neighbor algorithm is the fuzzy counterpart of the crisp version and deals with soft boundaries in the data distribution. Instead of assigning a test instance to one particular class outright, it finds the memberships of instances in classes, i.e. how much an instance belongs to each class, which helps in improving accuracy. The fuzzy K nearest neighbor algorithm was proposed by Keller et al. [27]. The idea of adaptive K was given by Baoli et al. [28]: K should be larger for a larger class and smaller for a smaller class for better categorization of imbalanced text data, since a common K for all classes is not suitable when one class holds a large quantity of data and another has just a few instances. The optimally weighted fuzzy K nearest neighbor algorithm was proposed by Pham [29] and is based on kriging; optimal weights are calculated for the traditionally found nearest neighbors of the test instance. It was an excellent weighted improvement of the fuzzy K nearest neighbor algorithm. Combining optimally weighted fuzzy K nearest neighbors with the adaptive K strategy yields better results on imbalanced data.
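The two ingredients just described, the adaptive per-class K of Baoli et al. [28] and Keller-style fuzzy memberships [27], can be sketched as follows. The rounding of K_Ca and the additive constant lam are assumptions made for illustration, guided by the term descriptions in Section 4, and the helper names are hypothetical rather than taken from those papers.

```python
# Sketch of the two building blocks the proposed method combines. The rounding
# rule for K_Ca and the constant lam are assumptions, not exact reproductions
# of [27, 28].
import numpy as np

def adaptive_k(K, class_sizes, lam=1):
    """Large K for large classes, small K for small classes (adaptive K)."""
    largest = max(class_sizes.values())
    return {c: min(int(np.ceil(K * n / largest)) + lam, K)
            for c, n in class_sizes.items()}

def fuzzy_memberships(own_class, neighbor_labels, K):
    """Keller-style memberships of a training instance in each class,
    computed from the labels of its K nearest neighbors."""
    mu = {}
    for c in set(neighbor_labels) | {own_class}:
        m_c = sum(1 for lbl in neighbor_labels if lbl == c)
        mu[c] = (0.51 + 0.49 * m_c / K) if c == own_class else 0.49 * m_c / K
    return mu

# Example: with K = 10, a class of 40 instances and a class of 200 instances,
# adaptive_k(10, {"minority": 40, "majority": 200}) -> {"minority": 3, "majority": 10}
```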

4. Proposed Approach

We propose a weighted strategy to classify imbalanced data with fuzzy optimal weights for adaptive K nearest neighbors. It combines the adaptive approach for imbalanced data, which says that K should differ between classes (large K for large classes and small K for small classes), with the optimally weighted fuzzy nearest neighbor, an improved nearest neighbor approach that becomes specific to imbalanced data once merged with the adaptive strategy.

Proposed Algorithm

Step 1. Find an adaptive K_Ca for each class using
    K_Ca = min( ceil( K * I(C_a) / max{ I(C_i) : i = 1, 2 } ) + λ, K ).

Step 2. Find the memberships of the training data in each class. Let y be a training instance with y ∈ C_a; then
    μ_Cn(y) = 0.51 + 0.49 * ( m_Cn / K_Ca )   if n = a,
    μ_Cn(y) = 0.49 * ( m_Cn / K_Ca )          otherwise,
so that the memberships of y over all classes sum to 1.

Step 3. For a test instance t, find a set of nearest neighbors X from the training dataset for a given K, where X = (x1, x2, ..., xp) and K = p (some integer).

Step 4. Get the covariance matrix C_t between the nearest neighbors of t.

Step 5. Get the covariance matrix C_tx between t and its nearest neighbors.

Step 6. Calculate the weight matrix using W = C_t^{-1} * C_tx.

Step 7. If any weight is negative, normalize the weights to positive values using
    w_new_b = ( w_b - w_min ) / Σ_{b=1..K} ( w_b - w_min ),   where w_min = min_b w_b.

Step 8. Find the membership of the test instance t in each class using
    μ_Ca(t) = Σ_{b=1..K_Ca} w_b * μ_Ca(x_b) / Σ_{b=1..K_Ca} w_b.

Step 9. Assign a class label to the test instance t by
    C_a(t) = C_a if μ_Ca(t) ≥ 0.51, and by random assignment otherwise.

Description of terms used in the proposed algorithm:
K = an integer input, the number of nearest neighbors to be found.
K_Ca = a different K for each class, calculated with the equation in Step 1; the purpose is to find a large number of nearest neighbors for large classes and vice versa.
I(C_a) = the number of instances in class C_a, where a = 1, 2 (binary classification).
λ = a constant integer value to prevent very small results.
m_Cn = the number of nearest neighbors of the training instance y that belong to class C_n.
μ_Cn(y) = the membership of y in class C_n.
X = the set of nearest neighbors of the test instance.
C_t = the covariance matrix between the nearest neighbors of t.
C_tx = the covariance matrix between t and its nearest neighbors.
W = the weight matrix.
w_new = the modified weights that avoid negative weight values.
μ_Ca(t) = the membership of the test instance in class C_a, where a = 1 or 2.
C_a(t) = the class assigned to the test instance.
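A compact sketch of the classification of one test instance (Steps 3 to 9) is given below. The Gaussian covariance function and the exact form of the weight normalization are assumptions where the steps above leave the details open; the dictionary k_per_class and the list memberships are assumed to hold the per-class K of Step 1 and the Step 2 memberships of every training instance (they could be produced with the illustrative helpers sketched at the end of Section 3), and all names are hypothetical.

```python
# Illustrative sketch of the proposed Weighted-FAKNN prediction for one test
# instance t. The Gaussian covariance model and the shift-and-renormalize
# rule are assumptions; `memberships[i]` maps each class to the Step 2
# membership of the i-th training instance.
import numpy as np

def cov(a, b, sigma=1.0):
    # Assumed covariance function: Gaussian kernel of Euclidean distance
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2.0 * sigma ** 2))

def weighted_faknn_predict(X_train, memberships, t, K, k_per_class, rng=np.random):
    # Step 3: the K nearest neighbors of t in the training data
    order = np.argsort(np.linalg.norm(X_train - t, axis=1))[:K]
    X = X_train[order]
    # Steps 4-5: covariance among neighbors (C_t) and between t and neighbors (C_tx)
    C_t = np.array([[cov(xi, xj) for xj in X] for xi in X])
    C_tx = np.array([cov(t, xi) for xi in X])
    # Step 6: kriging-style optimal weights W = C_t^{-1} C_tx
    W = np.linalg.solve(C_t, C_tx)
    # Step 7: shift negative weights to positive, then renormalize to sum to 1
    if W.min() < 0:
        W = W - W.min()
    W = W / W.sum()
    # Step 8: weighted membership of t in each class over its first K_Ca neighbors
    scores = {}
    for c, K_c in k_per_class.items():
        w = W[:K_c]
        mu = np.array([memberships[i].get(c, 0.0) for i in order[:K_c]])
        scores[c] = float((w * mu).sum() / w.sum())
    # Step 9: assign the class whose membership reaches 0.51, otherwise assign at random
    best = max(scores, key=scores.get)
    if scores[best] >= 0.51:
        return best
    return rng.choice(list(scores.keys()))
```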

5. Experiments and Results

The proposed algorithm was evaluated on five datasets taken from the standard UCI [30] and KEEL [31] repositories. These are benchmark datasets that have already been explored extensively for the same line of research on imbalanced data by many researchers in the field. The data sets used for the experiments, with their sources, sizes and class labels, are briefly described in Table 1.

Table 1. Dataset Description
Dataset       Source   #Instances   Class (1/0)
Ionosphere    UCI      351          Bad / Good Radar Returns
Glass0        KEEL     214          Positive / Negative
Vertebral     UCI      310          AB / NO
Ecoli1        KEEL     336          Positive / Negative
Spectfheart   UCI      267          Abnormal / Normal

As we know, accuracy alone is not an adequate measure for imbalanced data classification because it is biased towards the majority class. F-Measure and G-Mean are popular measures for evaluating such cases of imbalanced data and are therefore used here. Both can be understood on the basis of the confusion matrix given in Table 2.

Table 2. Confusion Matrix for Binary Classification
                   Calculated Positive     Calculated Negative
Actual Positive    True Positive (TP)      False Negative (FN)
Actual Negative    False Positive (FP)     True Negative (TN)

True positives (TP) are actual positive instances correctly classified as positive, and false negatives (FN) are actual positive instances incorrectly classified as negative. True negatives (TN) are actual negative instances correctly predicted as negative, and false positives (FP) are actual negative instances incorrectly classified as positive. On the basis of these counts, the evaluation measures F-Measure and G-Mean work as follows:

F-Measure = 2 * Precision * Recall / ( Precision + Recall )    (1)
G-Mean = sqrt( ( TP / ( TP + FN ) ) * ( TN / ( TN + FP ) ) )    (2)

where

Precision = TP / ( TP + FP )    (3)
Recall = TP / ( TP + FN )    (4)
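A small illustrative sketch of how these four measures are computed from the confusion-matrix counts (the code is not from the paper; function names are ours):

```python
# Evaluation measures (1)-(4) computed from the confusion-matrix counts.
import math

def precision(tp, fp):
    return tp / (tp + fp)                      # Eq. (3)

def recall(tp, fn):
    return tp / (tp + fn)                      # Eq. (4)

def f_measure(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)                 # Eq. (1)

def g_mean(tp, fp, tn, fn):
    # Geometric mean of the true-positive and true-negative rates, Eq. (2)
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
```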

Experiments with these two evaluation measures were performed on all five datasets listed above. The proposed algorithm was also compared with the fuzzy neighbor weighted nearest neighbor algorithm [25], a good fuzzy-weighted approach to imbalanced classification in the nearest neighbor scenario; the comparison demonstrates the better performance of our proposed approach. Five K values were taken for these evaluations (5, 10, 15, 20 and 25), and Tables 3 and 4 report the average results over all K. The fuzzy neighbor weighted nearest neighbor algorithm (Fuzzy-NWKNN) and our proposed weighted fuzzy adaptive K nearest neighbor (Weighted-FAKNN) are compared in Table 3 and Table 4 on F-Measure and G-Mean, and the performance of the proposed Weighted-FAKNN was found to be better than that of Fuzzy-NWKNN. Graphical representations of the performance of both algorithms are given as bar charts in Figure 1 and Figure 2, which clearly illustrate the better performance of Weighted-FAKNN.

Table 3. Results for F-Measure of Fuzzy-NWKNN and Weighted-FAKNN on the Ionosphere, Glass0, Vertebral, Ecoli1 and Spectfheart datasets.

Table 4. Results for G-Mean of Fuzzy-NWKNN and Weighted-FAKNN on the Ionosphere, Glass0, Vertebral, Ecoli1 and Spectfheart datasets.

Figure 1. F-Measure for Fuzzy-NWKNN and Weighted-FAKNN.

Figure 2. G-Mean for Fuzzy-NWKNN and Weighted-FAKNN.

6. Conclusion and Future Scope

In this paper we proposed a fuzzy weighted adaptive nearest neighbor approach for better classification of imbalanced data. Using the adaptive strategy, we choose a different K for each class according to its size, which makes the classification of imbalanced data with very differently sized classes more comfortable; a uniform K is mostly not good for such cases. This concept is merged with the optimally weighted fuzzy K nearest neighbor algorithm and yields better outcomes for imbalanced data. Experiments were done on data sets of different imbalance ratios for the well-known evaluation measures F-Measure and G-Mean. In this paper we considered all features as necessary and binary classification as our main intention; in future the proposed algorithm can be extended to multiclass classification with possible feature selection.

References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann.
[2] H. Patel and D. S. Rajput, Data Mining Applications in Present Scenario: A Review, International Journal of Soft Computing, Vol. 6, No. 4.
[3] H. He and E. A. Garcia, Learning from Imbalanced Data, IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 9.
[4] Q. Yang and X. Wu, 10 Challenging Problems in Data Mining Research, International Journal of Information Technology and Decision Making, Vol. 5, No. 4.
[5] Editorial, Special Issue on New Trends in Data Mining (NTDM), Knowledge-Based Systems, Elsevier, pp. 1-2.
[6] R. Pavón, R. Laza, M. Reboiro-Jato and F. Fdez-Riverola, Assessing the Impact of Class-Imbalanced Data for Classifying Relevant/Irrelevant Medline Documents, Advances in Intelligent and Soft Computing, Vol. 93.
[7] R. B. Rao, S. Krishnan and R. S. Niculescu, Data Mining for Improved Cardiac Care, ACM SIGKDD Explorations Newsletter, Vol. 8, No. 1, pp. 3-10.
[8] M. Kubat, R. C. Holte and S. Matwin, Machine Learning for the Detection of Oil Spills in Satellite Images, Machine Learning, Vol. 30, No. 2.
[9] P. Chan and S. Stolfo, Toward Scalable Learning with Non-Uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection, In: Proc. of Knowledge Discovery and Data Mining, 1998.

[10] X. C. Li, W. J. Mao, D. Zeng, P. Su and F. Y. Wang, Performance Evaluation of Machine Learning Methods in Cultural Modeling, Journal of Computer Science and Technology, Vol. 24, No. 6.
[11] G. Loizou and S. J. Maybank, The Nearest Neighbor and the Bayes Error Rates, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, No. 2.
[12] T. M. Cover and P. E. Hart, Nearest Neighbor Pattern Classification, IEEE Transactions on Information Theory, Vol. 13, No. 1.
[13] R. C. Prati, G. E. A. P. A. Batista and D. F. Silva, Class Imbalance Revisited: A New Experimental Setup to Assess the Performance of Treatment Methods, Knowledge and Information Systems, Vol. 45, No. 1.
[14] E. Kriminger and C. Lakshminarayan, Nearest Neighbor Distributions for Imbalanced Classification, In: Proc. of WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, 2012.
[15] N. Tomašev and D. Mladenić, Class Imbalance and the Curse of Minority Hubs, Knowledge-Based Systems, Vol. 53.
[16] D. Ryu, J. Jang and J. Baik, A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction, Journal of Computer Science and Technology, Vol. 30, No. 5.
[17] H. Dubey and V. Pudi, Class Based Weighted k Nearest Neighbor over Imbalanced Dataset, PAKDD 2013, Part II, LNAI 7819.
[18] H. Patel and G. S. Thakur, A Hybrid Weighted Nearest Neighbor Approach to Mine Imbalanced Data, In: Proc. of the 12th International Conference on Data Mining (DMIN'16).
[19] S. Ando, Classifying Imbalanced Data in Distance-Based Feature Space, Knowledge and Information Systems, Vol. 46, No. 3.
[20] W. Liu and S. Chawla, Class Confidence Weighted kNN Algorithms for Imbalanced Data Sets, PAKDD 2011, Part II, LNAI 6635.
[21] E. Ramentol, S. Vluymans, N. Verbiest, Y. Caballero, R. Bello, C. Cornelis and F. Herrera, IFROWANN: Imbalanced Fuzzy-Rough Ordered Weighted Average Nearest Neighbor Classification, IEEE Transactions on Fuzzy Systems.
[22] A. Fernandez, M. J. Jesus and F. Herrera, On the Influence of an Adaptive Inference System in Fuzzy Rule Based Classification Systems for Imbalanced Data-Sets, Expert Systems with Applications, Vol. 36, No. 6.
[23] H. Han and B. Mao, Fuzzy-Rough k-Nearest Neighbor Algorithm for Imbalanced Data Sets Learning, In: Proc. of FSKD 2010, Seventh International Conference on Fuzzy Systems and Knowledge Discovery, IEEE Circuits and Systems Society, China.
[24] C. Liu, L. Cao and P. S. Yu, Coupled Fuzzy K-Nearest Neighbors Classification of Imbalanced Non-IID Categorical Data, In: Proc. of IJCNN, International Joint Conference on Neural Networks, IEEE, Beijing, 2014.
[25] H. Patel and G. S. Thakur, Classification of Imbalanced Data Using a Modified Fuzzy-Neighbor Weighted Approach, International Journal of Intelligent Engineering and Systems, Vol. 10, No. 1.
[26] E. Fix and J. L. Hodges, Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties, International Statistical Review, Vol. 57.
[27] J. M. Keller, M. R. Gray and J. A. Givens Jr., A Fuzzy k-Nearest Neighbor Algorithm, IEEE Transactions on Systems, Man and Cybernetics, Vol. 4.
[28] L. Baoli, L. Qin and Y. Shiwen, An Adaptive k-Nearest Neighbor Text Categorization Strategy, ACM Transactions on Asian Language Information Processing, Vol. 3, No. 4.
[29] T. D. Pham, An Optimally Weighted Fuzzy k-NN Algorithm, In: Proc. of International Conference on Pattern Recognition and Image Analysis, U.K., 2005.
[30] A. Asuncion and D. J. Newman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine.
[31] KEEL: Knowledge Extraction based on Evolutionary Learning.
