International Journal of Computer Engineering and Applications, Volume XI, Issue IX, September 17, www.ijcea.com ISSN 2321-3469


A COMPARATIVE STUDY OF CLASSIFICATION VIA CLUSTERING WITH K-MEANS AND J48 ALGORITHMS

Dhyan Chandra Yadav
dc @gmail.com
Department of Computer Application

ABSTRACT: This paper proposes a classification via clustering approach to predict consumer claims on bank loans on the basis of bureau data. The proposed classification via clustering approach can obtain similar accuracy to traditional classification algorithms. Experiments were carried out using real data from bureau reports in which each case is labelled as disputed (YES) or not disputed (NO). In this paper we mainly analyze and compare the meta-classifier Classification via Clustering with the J48 classifier and K-Means.

Keywords: Classification via clustering, prediction, report, Weka, J48 algorithms, K-Means, Bureau

[1] INTRODUCTION
One of the biggest problems with credit cards is that it's easy to forget to make a payment. This can be especially true if you're going through a major change in life. Your credit won't be damaged severely if you realize that you missed that payment before your next due date. It's possible you could even get your credit card company to waive the late fee, if you haven't been habitually late. Also, you can prevent any damage to your credit score by making up that payment before it gets 30 days past due. If you have a calendar on your computer or smartphone, you should put your due date on it as a recurring event to help make sure you don't miss any payments in the future. When you write a letter, it ensures your rights will be protected. Credit reporting companies must investigate your dispute, forward all documents to the furnisher, and report the results back to you unless they determine your claim is frivolous. If the consumer reporting company or furnisher determines that your dispute is frivolous, it can choose not to investigate the dispute so long as it sends you a notice within five days saying that it has made such a determination. If the furnisher corrects your information after your dispute, it must notify all of the credit reporting companies it sent the inaccurate information to, so they can update their reports with the correct information [1].

J. Han and M. Kamber describe classification techniques in data mining. Classification can be used to predict categorical class labels: it builds a model from a training set and its class labels, and that model can then be used to classify newly available data. The term could cover any context in which some decision or forecast is made on the basis of presently available information. A classification procedure is a recognized method for repeatedly making such decisions in new situations. Here we assume that the problem concerns the construction of a procedure that will be applied to a continuing sequence of cases, in which each new case must be assigned to one of a set of predefined classes on the basis of observed features of the data. Creation of a classification procedure from a set of data for which the exact classes are known in advance is termed pattern recognition or supervised learning. Contexts in which a classification task is fundamental include, for example, assigning individuals to credit status on the basis of financial and other personal information, and the initial diagnosis of a patient's disease in order to select immediate treatment while awaiting perfect test results.
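The supervised setting described above (learn a procedure from cases whose classes are known, then assign new cases to a class) can be made concrete with a very small sketch. The example below trains a one-level decision tree (a "decision stump") that picks the single attribute with the highest information gain. It is only an illustration under stated assumptions: the attribute names (`income`, `late_payments`), the class labels and the six training cases are hypothetical, and the code is not the procedure used in this paper.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def train_stump(rows, labels):
    """Choose the attribute with the highest information gain.

    Returns (attribute, {attribute value: majority class}, default class).
    """
    base = entropy(labels)
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for attr in rows[0]:
        groups = defaultdict(list)
        for row, label in zip(rows, labels):
            groups[row[attr]].append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        gain = base - remainder
        if best is None or gain > best[0]:
            rule = {value: Counter(g).most_common(1)[0][0] for value, g in groups.items()}
            best = (gain, attr, rule)
    _, attr, rule = best
    return attr, rule, default

def classify(model, row):
    """Assign a new case to a class; unseen attribute values get the default class."""
    attr, rule, default = model
    return rule.get(row[attr], default)

# Hypothetical training cases with known classes (credit status).
rows = [
    {"income": "high", "late_payments": "no"},
    {"income": "low", "late_payments": "no"},
    {"income": "high", "late_payments": "yes"},
    {"income": "low", "late_payments": "yes"},
    {"income": "high", "late_payments": "no"},
    {"income": "low", "late_payments": "yes"},
]
labels = ["good", "good", "bad", "bad", "good", "bad"]

model = train_stump(rows, labels)
new_case = {"income": "low", "late_payments": "no"}
print(model[0], classify(model, new_case))
```

A full decision tree learner applies the same attribute selection recursively to each branch; the stump stops after one split to keep the example short.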
Some of the most critical problems arising in science, industry and commerce can be framed as classification or decision problems.

J. Han and M. Kamber also describe clustering. It can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters [2]. A simple graphical example is shown in Fig. 1.

Fig.1. Visualized Instances by Classification via Clustering algorithm


J. R. Quinlan and H. Ian describe J48, an open source Java implementation of the C4.5 decision tree algorithm. C4.5 is an algorithm developed by Ross Quinlan for generating decision trees; it is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. The authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date" [3].

Fig.2. Visualized Instances by J48 Classifier algorithm

J. Han and M. Kamber describe K-means as a partitional clustering method that is widely used in industry. The K-means algorithm is the most commonly used partitional clustering algorithm because it is easy to implement and is the most efficient one in terms of execution time [4].
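To illustrate why K-means is considered easy to implement, the following is a minimal sketch of the standard iterative procedure: assign every point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until nothing changes. The six two-dimensional points and the choice of initial centroids are hypothetical, and the function names are introduced only for this example; Weka's own K-means implementation differs in details such as initialization and attribute handling.

```python
from math import dist

def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]

def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-means with caller-supplied initial centroids (deterministic)."""
    centroids = [tuple(c) for c in initial_centroids]
    iterations = 0
    for _ in range(max_iter):
        iterations += 1
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Update step: mean of the members; an empty cluster keeps its centroid.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else old)
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    labels = assign(points, centroids)
    # Within-cluster sum of squared errors for the final clustering.
    sse = sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, sse, iterations

# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids, sse, iterations = kmeans(points, [points[0], points[3]])
print(labels, centroids, round(sse, 4), iterations)
```

The returned sum of squared errors and iteration count correspond to the kind of quantities a clustering tool reports when summarizing a K-means run.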

Fig.3. Visualized Instances by K-Means Clustering algorithm

[2] RELATED WORKS:
H, K observed that credit card fraud occurs on a very large scale, so the related attributes cannot easily be detected and predicted directly; a data mining classifier tool based on neural network algorithms can help prevent fraudsters from misusing credit cards. This system predicts the probability of fraud on an account by comparing the current transactions with the previous activities of each holder [5].

D C Yadav and S Pal discussed that the classifier algorithms J48, ID3 and Naïve Bayes provide very accurate results in software error detection. Correctly classified instances are reported both as counts and as percentages, while the kappa statistic, mean absolute error and root mean square error are reported as numeric values only. For ID3 and J48 the time taken to build the model was 0.2 seconds, and the test mode was 10-fold cross-validation. Weka compared all required parameters on the given instances with each classifier's accuracy and prediction rate: J48 classified 100% of instances correctly without error, Naïve Bayes also classified 100% correctly but with some error, and ID3 classified 95% correctly, so J48 was the most accurate of the three algorithms [6].

D C Yadav and R Kumar discussed that association algorithms provide very accurate results in finding frequent patterns and relationships between data objects, and in computing the percentage confidence and support of data objects, with the help of the Apriori, Predictive Apriori and Filtered Associator algorithms. These algorithms can therefore be used in other domains to bring out interesting relationships among the data [7].

D C Yadav and R Kumar discussed three major clustering algorithms (K-Means, hierarchical clustering and density-based clustering) and compared their performance using a clustering tool. They found that K-Means performs better than the hierarchical clustering and Make Density Based algorithms, although all of the algorithms show some ambiguity on some (noisy) data when clustered [8].

R Sukanya and K Prabha discussed the use of back-propagation neural networks and support vector machines for rainfall prediction. An ANN improves the efficiency of rainfall prediction by analyzing historical and current facts to make accurate predictions about the future [9].

R S, S M, N E, S P and V Kirand discussed analyzing the huge volume of warranty data to segregate fraudulent warranty claims using pattern recognition and clustering. A survey of the automotive industry shows that up to 10% of warranty costs are related to warranty claims fraud, costing manufacturers several billions of dollars. The existing methods to detect warranty fraud are very complex and expensive because they deal with inaccurate and vague data, causing manufacturers to bear excessive costs [10].

D C Yadav noted that in the statistical analysis of binary classification, the F1 score is a measure of a test's accuracy that considers both the precision and the recall of the test. In this analysis the author computed the F1 score for several data mining classifier algorithms and chose the ID3 tree as the best classifier for the selected datasets, because it had the highest F1 score and took the least time to build a model [11].

D C Yadav noted that the Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifiers. It takes into account true and false positives and negatives and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications. The author computed it for several data mining classifier algorithms and found the ID3 tree to be the best classifier for the selected datasets, because it had the highest MCC value and the minimum time (0.00 seconds) to build a model [12].

D C Yadav noted that the informedness of a prediction method, as captured by a contingency matrix, is defined as the probability that the prediction method will make a correct decision as opposed to guessing, and is calculated using the bookmaker algorithm. The measure was generated for the LAD Tree, ID3 and J48 data mining algorithms, and ID3 was found to be the best classifier for the selected datasets [13].

D C Yadav noted that FDR-controlling procedures provide less stringent control of Type I errors compared to class-wise errors. In this analysis the ID3 tree was chosen as the best data mining classifier for the selected datasets because it had the minimum time to build a model [14].

D C Yadav carried out an analysis on the basis of dependent variables for overall performance: classifiers predict categorical class labels based on a training set and the values of the class label attribute, and the resulting model is used to classify new data. The author compared AD Tree, LAD Tree, J48 and Naïve Bayes on correctly and incorrectly classified instances together with the kappa statistic, and chose the LAD Tree as the best classifier for the selected datasets, because it had the highest percentage of correctly classified instances and the minimum number of unclassified instances. The LAD Tree also had the highest value of the accuracy metric [15].

T, R and Liu discussed a framework for fraud detection based on security systems and case-based reasoning. First, a set of normal and fraud cases is made from labeled data. Then, the primary detectors are made with random or genetic algorithms. Finally, negative selection and clonal selection operations are applied to the primary detectors in order to obtain a set of detectors with different algorithms that can detect a variety of frauds [16].

M, S, B and Saira discussed that many of the fraud detection systems presented so far have used data mining and neural network approaches, whereas no fraud detection system combining anomaly detection, misuse detection and a decision making system had been used for fraud detection in credit cards. They therefore proposed a system that uses a Hidden Markov Model to detect fraudulent transactions [17].

A John analyzed a hybrid feature selection and anomaly detection algorithm for detecting fraud in credit cards. The authors noted that fraud detection on the internet must be done online and immediately. Since the use of a credit card by its holder follows a fixed pattern, this pattern can be extracted from the usual legal activity of the card holder over 1 or 2 years. The pattern is then compared with the card holder's current usage, and if the two are not similar, the activity is considered illegal. Neural networks were used to learn pattern detection in the model in this study [18].

A P, M K and A N discuss that data mining, as one of the most efficient tools of data analysis, has attracted the attention of many people. Its techniques and algorithms are used in various fields such as customer relationship management, fraud management and detection, medical sciences and sport. Due to the large volume of data in banks, data mining has had many applications in financial and monetary affairs so far.
Credit risk management, fraud detection, money laundering, customer relationship management and banking service quality management are some examples of data mining applications in banks [19].

In this work, we propose to use a meta-classifier that uses clustering for classification, based on the assumption that each cluster corresponds to a class. First, the usage and interaction data of consumer claims have to be collected and preprocessed. Then, an optional attribute selection step can be applied in order to select only a group of attributes/variables, or all available attributes can be used. Second, we analyze and compare the meta-classifier Classification via Clustering with the J48 classifier and K-Means.

[3] METHODOLOGY:
Our research approach is to use Classification via Clustering, J48 and K-Means on the consumer claim data set. The research methodology is divided into 5 steps to achieve the desired results:
Step 1: Prepare the data and specify the source of the data.
Step 2: Select the specific data and transform it into a different format with Weka.
Step 3: Implement the data mining algorithms and check all the relevant disputes.
Step 4: Decide on the presence of a dispute in the source data. If a dispute is present then proceed further; otherwise stop. We classify the relevant disputes using Classification via Clustering, J48 and K-Means.
Step 5: Display and evaluate the results.

Table.1. Representation of Computational Variables of Consumer Claim

Property                       Description
Source                         Consumer Financial Protection Bureau (CFPB), a U.S. Government Agency; Consumer Complaint Database
Complaint Type                 Attributes for Financial Problem (Bank, Lender & Company etc.)
Sample Size                    500 total: 100 consumer dispute and 400 non-dispute

Dependent Variables
Dispute (YES)                  Consumer complaint disputed
Dispute (NO)                   Consumer complaint not disputed

Field name                     Description
Date received                  The date the CFPB received the complaint.
Tags                           Data that supports easier searching and sorting of complaints submitted by or on behalf of consumers.
Date sent to company           The date the CFPB sent the complaint to the company.
Company response to consumer   This is how the company responded. For example, "Closed with explanation".
Timely response?               Whether the company gave a timely response. For example, "Yes" or "No".

We now study consumer complaints about different types of bank loans and the related disputes.

1. Data Preparation: One of the biggest problems with credit cards is that it's easy to forget to make a payment. This can be especially true if you're going through a major change in life. Your credit won't be damaged severely if you realize that you missed that payment before your next due date. It's possible you could even get your credit card company to waive the late fee, if you haven't been habitually late. Also, you can prevent any damage to your credit score by making up that payment before it gets 30 days past due. If you have a calendar on your computer or smartphone, you should put your due date on it as a recurring event to help make sure you don't miss any payments in the future. When you write a letter, it ensures your rights will be protected.

The Consumer Complaint Database is a collection of complaints on a range of consumer financial products and services, sent to companies for response. The CFPB does not verify all the facts alleged in these complaints, but it takes steps to confirm a commercial relationship between the consumer and the company.

2. Data Selection and Transformation: The database generally updates daily and contains certain information for each complaint, including the source of the complaint, the date of submission, and the company the complaint was sent to for response. The database also includes information about the actions taken by the company in response to the complaint, such as whether the company's response was timely and how the company responded. Companies also have the option to select a public response. Company-level information should be considered in the context of company size. Data from these complaints helps the CFPB understand the financial marketplace and protect consumers.

Fig.4. Representation of Instances by Classification via Clustering algorithm
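The selection and transformation step described above can be sketched as follows: read complaint records from CSV, keep a few fields, and write them in ARFF, the text format Weka reads natively. This is a minimal illustration under stated assumptions, not the exact preprocessing used in this study. The three sample rows are invented, the class column name `Consumer disputed?` is assumed (it does not appear in Table 1), and the helper name `csv_to_arff` is hypothetical.

```python
import csv
import io

# Hypothetical sample in the shape of a complaint export; rows are invented.
SAMPLE_CSV = """Date received,Company response to consumer,Timely response?,Consumer disputed?
2016-01-05,Closed with explanation,Yes,No
2016-01-07,Closed with monetary relief,Yes,No
2016-02-11,Closed with explanation,No,Yes
"""

# Fields kept for mining; the last one is treated as the class attribute.
SELECTED = ["Company response to consumer", "Timely response?", "Consumer disputed?"]

def csv_to_arff(csv_text, columns, relation="consumer_claims"):
    """Convert the selected CSV columns to ARFF text with nominal attributes.

    Names and values are wrapped in single quotes because they contain spaces;
    values that themselves contain a single quote are not handled here.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = [f"@relation {relation}", ""]
    for col in columns:
        values = sorted({row[col] for row in rows})
        quoted = ",".join(f"'{v}'" for v in values)
        lines.append(f"@attribute '{col}' {{{quoted}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(f"'{row[col]}'" for col in columns))
    return "\n".join(lines) + "\n"

print(csv_to_arff(SAMPLE_CSV, SELECTED))
```

Declaring every field as nominal keeps the sketch short; date or numeric fields would need their own ARFF attribute types.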

Fig.5. Representation of Instances by J48 algorithm

Fig.6. Representation of Instances by K-Means algorithm

After data selection we transform the data with the Weka data mining tool. For each of these three algorithms Weka reports specific details: accuracy by class, the time taken to build the model, and a stratified cross-validation summary. The confusion matrix describes the correctly and incorrectly classified instances with respect to each class.
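As an illustration of how summary figures follow from a two-class confusion matrix, the sketch below computes accuracy, precision, recall, the F1 score, the kappa statistic and the Matthews correlation coefficient from the four cell counts. The counts used in the example are hypothetical and are not results from this study; the function name `binary_metrics` is introduced only for this example.

```python
from math import sqrt

def binary_metrics(tp, fn, fp, tn):
    """Summary measures from a 2x2 confusion matrix.

    tp/fn: actual positives predicted positive/negative.
    fp/tn: actual negatives predicted positive/negative.
    Degenerate matrices (for example, no positive predictions) would divide
    by zero and are not handled in this sketch.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa, "mcc": mcc}

# Hypothetical counts for a YES/NO problem with 100 instances.
for name, value in binary_metrics(tp=40, fn=10, fp=20, tn=30).items():
    print(f"{name}: {value:.4f}")
```

Kappa and MCC both correct for agreement that would occur by chance, which matters when one class is much larger than the other, as with 100 dispute and 400 non-dispute cases.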

3. Implementation: In this step, we choose a classification algorithm, J48, for the purpose of accuracy on the dataset, and investigate the classifier's accuracy further. Weka is the data mining tool used. It is a simple tool for classifying data of various types and provides a graphical user interface to the user. For clustering we used the PROMISE data repository, which provides past project data for analysis. With the help of figures we show the working of the various algorithms used in Weka. Weka is a suitable tool for data mining applications. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning. We use the Weka data mining tool for this purpose; it provides a better interface to the user than the other data mining tools.

4. Result and Discussion: All our experiments were performed using Weka and the previously described consumer claim dataset. To test the accuracy of the obtained classification models we used 10-fold cross-validation. All classifiers in Weka work in the same way under cross-validation: the model is built using just the instances in the training folds. The classification via clustering approach is based on the "clusters to classes" evaluation routine in the cluster evaluation code, which finds a minimum-error mapping of clusters to classes.

Table.2. Representation Compute Instances for Classifier Algorithms
Columns: Algorithms, Accuracy, Kappa, RMSE, RAE, RRSE, Time, No. of Iterations, Sum of Square Error
Rows: Classification via Clustering, J48, K-Means

In the first experiment, we executed the clustering algorithms provided by Weka for classification via clustering, using 500 instances of the consumer claim data with all available attributes.
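The idea of a minimum-error mapping of clusters to classes can be illustrated with a short sketch: given the cluster assigned to each training instance and its true class, try every one-to-one assignment of clusters to classes and keep the one with the fewest misclassified instances. This is a simplified illustration that assumes the number of clusters equals the number of classes; Weka's own routine may differ in details, and the eight instances below are hypothetical.

```python
from itertools import permutations

def map_clusters_to_classes(cluster_ids, class_labels):
    """Return (mapping, errors) for the one-to-one cluster-to-class
    assignment that misclassifies the fewest instances."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    if len(clusters) != len(classes):
        raise ValueError("this sketch assumes as many clusters as classes")
    best_mapping, best_errors = None, None
    for perm in permutations(classes):
        mapping = dict(zip(clusters, perm))
        errors = sum(mapping[c] != y for c, y in zip(cluster_ids, class_labels))
        if best_errors is None or errors < best_errors:
            best_mapping, best_errors = mapping, errors
    return best_mapping, best_errors

# Hypothetical cluster assignments and true dispute labels for 8 instances.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
class_labels = ["NO", "NO", "YES", "YES", "YES", "NO", "YES", "NO"]

mapping, errors = map_clusters_to_classes(cluster_ids, class_labels)
accuracy = 1 - errors / len(class_labels)
print(mapping, errors, accuracy)

# A new instance is classified by looking up the class of its cluster.
predicted = mapping[1]
```

Once the mapping is fixed on the training data, classifying a new instance only requires finding its nearest cluster, which is why the resulting model is simple to interpret.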
Second, we analyzed and compared the meta-classifier Classification via Clustering with the J48 classifier and K-Means. From Table 2 it is clear that, in the case of classification, J48 gives a more accurate result (79.2% accuracy) and less error compared to classification via clustering.

In the case of clustering, K-Means gives less error compared to classification via clustering. Overall, we find that for classification and clustering, J48 and K-Means are better than the classification via clustering algorithm.

5. CONCLUSION
Yes, dispute participation in the bank loan claim bureau report was a good predictor of the dispute. Another advantage of classification models based on mapping clusters to classes is that they are very simple and interpretable to instructors. In the dispute YES case, instructors only have to analyze the cluster centroids to know which consumers are active in bank claims. We find that for classification and clustering, J48 and K-Means are better than the classification via clustering algorithm.

REFERENCES
[1]
[2] J. Han and M. Kamber, Data Mining Concepts and Techniques, Elsevier,
[3] Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
[4] J Han and M Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Second Edition, (2006).
[5] H, K. (Ed.), Detecting Payment Card Fraud with Neural Networks, Singapore: World Scientific (2000).
[6] D C Yadav and S Pal, An Integration of Clustering and Classification Technique in Software Error Detection, African Journal of Computing & ICT, Vol. 8, No. 2, June 2015, ISSN
[7] D C Yadav and R Kumar, Generate Identical Rules by the Different Algorithms for Detection of Risk In Software Project Development, Shodh Prerak A Multidisciplinary Quarterly International Refereed Journal, Volume V, Issue 2, April 2015. ISSN X.
[8] D C Yadav and R Kumar, A Comparative Study of Software Bug by Clustering Algorithms, Annals of Multi-Disciplinary Research A Quarterly International Refereed Research Journal, Volume V, Issue 5, Dec. 2015. ISSN
[9] R Sukanya and K Prabha, Comparative Analysis for Prediction of Rainfall using Data Mining Techniques with Artificial Neural Network, Volume-5, Issue-6, Page no , Jun 30,
[10] R S, S M, N E, S P and S V Kirand, Modelling an Optimized Warranty Analysis methodology for fleet industry using data mining clustering methodologies with Fraud detection mechanism using pattern recognition on hybrid analytic approach, Fourth International Conference on Recent Trends in Computer Science & Engineering, Chennai, Tamil Nadu, India, Procedia Computer Science, Volume 87, 2016, Pages
[11] D C Yadav, Analysis of Bug Accuracy through F1-Score by Data Mining Classifier Algorithms, Annals of Multi-Disciplinary Research A Quarterly International Refereed Journal, Volume VII, Issue 2, June 2017. ISSN
[12] D C Yadav, Analysis of Bug through Mathews Correlation Coefficient with Confusion Matrix, Sodha Pravaha A Multidisciplinary Refereed Research Journal, Vol. VII, Issue 3, July 2017, ISSN
[13] D C Yadav, To Create Correct Decision by Informedness of Bug Prediction, Shodh Prerak A Multidisciplinary Quarterly International Refereed Research Journal, Volume VII, Issue 2, April 2017. ISSN X.
[14] D C Yadav, Controlling the Procedures through False Discovery Rate Defect in Bug Analysis, Sodha Pravaha A Multidisciplinary Refereed Research Journal, Vol. VII, Issue 2, April 2017, ISSN
[15] D C Yadav, Measurement of Gap between Software Bug Classes by Data Mining Classifier Algorithms, Vaichariki A Multidisciplinary Refereed International Research Journal, Vol. VII, Issue 2, June 2017, ISSN
[16] Tue, Ren, Liu, Artificial Immune System for Fraud Detection, IEEE, vol. 2, pp ,
[17] M, S, B and Saira, A Defense Mechanism For Credit Card Fraud Detection, International Journal on Cryptography and Information Security (IJCIS), 2012, pp
[18] A John, Data Mining Application for Cyber Credit-Card Fraud Detection System, Springer-Verlag Berlin Heidelberg, 2013, pp
[19] A P, M K and A N, Fraud detection in E-banking by using the hybrid feature selection and evolutionary algorithms, IJCSNS International Journal of Computer Science and Network Security, VOL. 17 No. 8, August
