International Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18, ISSN



Author e-mail: srikanthbethu@gmail.com

PERFORMANCE ANALYSIS OF CLASSIFICATION ALGORITHMS IN DATA MINING

Srikanth Bethu
Assistant Professor, Department of Computer Science and Engineering, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, India

ABSTRACT: Classification is a major technique in data mining (machine learning) and is widely used in various fields. It is used to predict group membership for data instances. This paper presents the basic classification techniques, including decision tree induction, Bayesian networks, and the k-nearest neighbor classifier. The goal of the paper is to provide a comprehensive review of different classification techniques in data mining.

Keywords: Bayesian networks; decision tree induction; k-nearest neighbor classifier; k-means classification

[1] INTRODUCTION

Data mining is the process of inferring knowledge from huge volumes of data. It has three major components: clustering or classification, association rules, and sequence analysis. Classification/clustering analyzes a set of data and generates a set of grouping rules that can be used to classify future data. Data mining is the computational process of identifying patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems to extract previously unknown, interesting patterns. In this work, student academic performance is analyzed with three algorithms: the data is first preprocessed, each algorithm is then used for classification, its accuracy is calculated, and the algorithm is selected based on that accuracy.

Problem Defining and Experimental Design

Three base algorithms were chosen for this study from different approaches: naive Bayes, CART (decision tree), and k-nearest neighbor, along with three variants of the same base algorithms. The design is a multiple-group pretest-posttest: the base algorithms are executed on the data for the pretest, the algorithms are then manipulated by adding boosting, and the boosted algorithms are run to observe the posttest performance.

The data set, student academic performance, was collected from Kaggle. It has around 60,000 rows and contains many attributes about each student. The task is to predict a student's future academic performance from the student's previous data. The accuracy of each of the three algorithms is calculated and compared to determine which algorithm is best suited to this dataset. This study aims to compare the performance of a range of classification techniques on student academic performance data.

Comparison: Comparing classification algorithms makes it simple to identify which algorithm is best for a given dataset, and it provides an efficient way of selecting a suitable algorithm.

Domain Introduction

This paper focuses on a survey of the classification techniques most commonly used in data mining. The comparative study of the algorithms (k-NN classifier, Bayesian network, and decision tree) shows the strength and accuracy of each classification algorithm in terms of performance efficiency and time complexity.
A comparative study brings out the advantages and disadvantages of one method over another.

Advantages of Comparison of Algorithms

Comparison of algorithms can:
1. Increase your independence and give you greater control of algorithms.
2. Make it easier to select the best algorithm.
3. Save you time and effort.
4. Improve your personal safety.
5. Reduce the time needed to select the algorithms.
6. Increase efficiency.
7. Reduce confusion in the selection of algorithms.

[2] LITERATURE SURVEY

a) Naive Bayesian algorithm

A Naive Bayes classifier assumes that, given the class variable, the presence (or absence) of a particular feature (attribute) of a class is unrelated to the presence (or absence) of any other feature. The technique is based on Bayes' theorem and is used when the dimensionality of the inputs is high. Bayes' theorem is stated as follows. Let X be a data sample whose class label is not known, and let H be the hypothesis that X belongs to a specified class C. Bayes' theorem calculates the posterior probability P(C|X) from P(C), P(X), and P(X|C):

$$P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}$$

where P(C|X) is the posterior probability of the target class, P(C) is the prior probability of the class, P(X|C) is the likelihood, i.e. the probability of the predictor given the class, and P(X) is the prior probability of the predictor.

The Naive Bayes classifier works as follows:
1) Let D be the training dataset with associated class labels. Each tuple is represented by an n-dimensional vector X = (x1, x2, x3, ..., xn).
2) Suppose there are m classes C1, C2, C3, ..., Cm. To classify an unknown tuple X, the classifier predicts that X belongs to the class with the highest posterior probability conditioned on X. That is, the Naive Bayesian classifier assigns an unknown tuple X to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. These posterior probabilities are computed using Bayes' theorem.

Advantages:
i. It requires short computational time for training.
ii. It improves the classification performance by removing the irrelevant features.
iii. It has good performance.

Disadvantages:
a. The Naive Bayes classifier requires a very large number of records to obtain good results.
b. It is less accurate than other classifiers on some datasets.

b) CART Algorithm

CART classification is performed in two phases: tree building and tree pruning.
1) Tree building is performed top-down. During this phase, the data is recursively partitioned until all the data items in a partition belong to the same class label. It is very computationally intensive because the training dataset is traversed repeatedly.
2) Tree pruning is done in a bottom-up manner.
It is used to improve the prediction and classification accuracy of the algorithm by reducing over-fitting of the tree. Over-fitting in a decision tree results in misclassification errors.

Advantages:
a. Decision trees are very simple and fast.
b. They produce accurate results.
c. The representation is easy to understand, i.e. comprehensible.
d. They support incremental learning.
e. They take less memory.
f. They can also deal with noisy data.
g. They use different measures, such as entropy, Gini index, and information gain, to find the best split attribute.

Disadvantages:
i. Training time is long.
ii. Decision trees can have a significantly more complex representation for some concepts because of the replication problem.

c) K-Nearest Neighbour

Euclidean distance or Hamming distance is used, depending on the data type of the attributes. A single value of K is given, which sets the number of nearest neighbours that determine the class label of an unknown sample. If K = 1, the method is called nearest neighbour classification. The K-NN classifier works as follows:
i. Initialize the value of K.
ii. Calculate the distance between the input sample and the training samples.
iii. Sort the distances.
iv. Take the top K nearest neighbours.
v. Apply a simple majority vote.
vi. Predict the class label held by the most neighbours for the input sample.

In the following example there are three classes X, Y and Z, as shown in Figure 1, and the class label for data sample P must be found. Here K = 5, and the Euclidean distance is calculated for each sample pair. Four of the nearest neighbours fall in class X, while a single tuple belongs to class Z, so by simple majority P is assigned to class X.

Advantages:
i. Easy to understand and implement.
ii. Training is very fast.
iii. It is robust to noisy training data.
iv. It performs well on applications in which a sample can have many class labels.

Disadvantages:
a. Lazy learners incur expensive computational costs when the number of potential neighbours with which to compare a given unlabeled sample is large.
b. It is sensitive to the local structure of the data.
c. Memory limitation.
d. As it is a supervised lazy learner, it runs slowly.

[3] DESIGN AND IMPLEMENTATION

A. System Analysis

The existing system consists of the following steps:
1. State the problem and collect the data.
2. Data processing.
3. Apply the algorithm.
4. Evaluate the algorithm.

With this approach, it takes considerable time and effort to determine which algorithm is better.

The proposed system can be designed with the following steps:
1. State the problem and collect the data.
2. Data processing.
3. Apply the algorithm.
4. Evaluate the algorithm.
5. Find the accuracy.
6. Select the algorithm with the highest accuracy.

Fig. 3.1. System Architecture (blocks: data input, processing, pre-processed data, results, output, classification)

Fig. 3.1 shows the data accessibility and its processing.
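Steps 3 to 6 of the proposed system (apply each algorithm, evaluate it, find its accuracy, select the algorithm with the highest accuracy) can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it is written in Python rather than R, the toy data is invented, the K-NN function follows the K-NN steps listed in the literature survey, and `majority_predict` is a hypothetical baseline standing in for a second algorithm. It is not the paper's implementation.

```python
import math
from collections import Counter

def knn_predict(train, sample, k=3):
    """Classify `sample` by majority vote among its k nearest training points."""
    # Compute Euclidean distances to every training point and sort by them.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], sample))
    # Keep the k nearest neighbours and take a simple majority of their labels.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def majority_predict(train, sample):
    """Hypothetical baseline: always predict the most frequent training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(predict, train, test):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(predict(train, features) == label for features, label in test)
    return correct / len(test)

# Invented toy data: (features, label). Not the student performance dataset.
train = [((1.0, 1.0), "pass"), ((1.2, 0.8), "pass"), ((0.9, 1.1), "pass"),
         ((3.0, 3.2), "fail"), ((3.1, 2.9), "fail")]
test = [((1.1, 0.9), "pass"), ((2.9, 3.0), "fail"),
        ((3.2, 3.1), "fail"), ((1.0, 1.2), "pass")]

# Apply every candidate algorithm, find its accuracy, and select the best one.
candidates = {"knn": knn_predict, "majority": majority_predict}
scores = {name: accuracy(fn, train, test) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "selected:", best)
```

Keeping the candidates in a dictionary means a further algorithm can be added to the comparison without changing the evaluation or selection code.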

Fig. 3.2. Proposed System Analysis: workflow diagram of data processing and classification

Fig. 3.2 shows the workflow of data processing and classification.

B. Technologies Used

R Language: R and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages.

R-Shiny: Shiny is an R package that makes it easy to build interactive web applications using only R. More information about Shiny can be found here. Shiny makes it easy for R users to turn analyses into interactive web applications that anyone can use. Users can choose input parameters through user-friendly controls such as sliders, drop-down menus, and text fields, and any number of outputs such as plots, tables, and summaries can be incorporated.

Shiny has been around for a couple of years. We've talked about it before, but there has been some improvement to the product over the months, so I wanted to take another look. I'm not a prolific R programmer, nor am I an expert web application developer. So this look at Shiny is from someone who understands these things and can do a little but is not an expert.

Every Shiny app has the same structure. At a minimum there are two R scripts saved together in a directory: ui.R, which implements the user interface, and server.R, which implements the working part of the application. You create a Shiny application by making a new directory and saving the ui.R and server.R files inside it. You can run a Shiny app by giving the name of its directory to the R function runApp(). There can be other files, such as help documentation or CSS files that change the look of the application, but only the interface and server scripts are required.

[4] RESULTS AND DISCUSSION

Module 1:
a) The first module consists of the dataset tab.
b) The dataset can be browsed for using the browse option.
c) The selected dataset is displayed on the screen.

Fig. 4.1. Data set chosen for classification

Module 2: Module 2 consists of model building. The algorithms are selected and their accuracy is calculated. After analysing the accuracy, we suggest the best model for the dataset.

Fig. 4.2. Algorithms chosen for classification

Fig. 4.3. Classification by Naïve Bayesian

Table 4.1: Result set of CART, K-Nearest Neighbor and Naive Bayesian

Metric            CART    K-Nearest Neighbor    Naive Bayesian
Accuracy
Upper Accuracy
Kappa
Lower Accuracy
Sensitivity

Table 4.1 compares the algorithms by their metric values and their classification accuracy.

Fig. 4.4. Classification by K-Nearest Neighbor

Fig. 4.1 shows the selection of the dataset from the system for classification. The dataset is raw student data. Fig. 4.2 shows the selection of the classification algorithms used to classify the chosen dataset. The accuracy is calculated based on the inherent properties of each algorithm. Fig. 4.3 shows the execution of the Naïve Bayesian algorithm on the given dataset, together with its accuracy value.
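Three of the metrics named in Table 4.1 (accuracy, kappa, and sensitivity) can be derived from a confusion matrix. The sketch below is a minimal illustration, assuming a binary problem and assuming that kappa means Cohen's kappa; the counts are invented and are not results from this study. The Upper Accuracy and Lower Accuracy rows are not covered here.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, and Cohen's kappa from binary confusion counts.

    tp/fn: positive samples predicted positive/negative.
    fp/tn: negative samples predicted positive/negative.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Sensitivity (recall): share of actual positives that were detected.
    sensitivity = tp / (tp + fn)
    # Agreement expected by chance, from the row and column marginals.
    chance_pos = ((tp + fn) / total) * ((tp + fp) / total)
    chance_neg = ((fp + tn) / total) * ((fn + tn) / total)
    expected = chance_pos + chance_neg
    # Kappa rescales accuracy so that 0 means chance-level agreement.
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "sensitivity": sensitivity, "kappa": kappa}

# Invented counts, used only to exercise the function.
metrics = binary_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```

Kappa is useful next to accuracy because it discounts the agreement that a classifier would reach by chance on an imbalanced dataset.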

Fig. 4.4 shows the classification by K-Nearest Neighbor on the given dataset and reports the calculated accuracy value.

[5] CONCLUSION AND FUTURE SCOPE

Classification algorithms come in many different forms: some are intended as a speedier way to execute the same algorithms, while others may offer more consistent performance or higher overall accuracy for the specific problem at hand. Here we have taken student performance data, compared the performance of the three algorithms on it, found their accuracy, and suggested the best one. For future work, more classification algorithms can be incorporated and more datasets should be used, ideally real datasets from industry, to see the actual impact on the performance of the algorithms considered. Moreover, for the Multilayer Perceptron algorithm, the speed of learning with respect to the number of attributes and the number of instances can be taken into consideration.
