ANALYSIS COMPUTER SCIENCE
Discovery Science, Volume 9, Number 20, April 3, 2014
ISSN 2278-5485 EISSN 2278-5477

Comparative Study of Classification Algorithms Using Data Mining

Akhila GS, Madhu GD, Madhu MH, Pooja MH
Dept. of Computer Science, K.L.E.I.T, Hubli, India

Correspondence to: Pooja MH, Dept. of Computer Science, K.L.E.I.T, Hubli, India

Publication History
Received: 04 February 2014
Accepted: 28 March 2014
Published: 3 April 2014

Citation
Akhila GS, Madhu GD, Madhu MH, Pooja MH. Comparative Study of Classification Algorithms Using Data Mining. Discovery Science, 2014, 9(20), 17-21

ABSTRACT
Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Classification is a data mining process that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Some of the algorithms used in data mining are the Apriori, Decision Tree and Naive Bayes algorithms. In this paper, we apply these algorithms to web log data and carry out a comparative analysis to find out which algorithm is best suited for such analysis.

Keywords: Data mining, Classification algorithms
Abbreviations: URL, Uniform Resource Locator

1. INTRODUCTION
Web mining is the application of data mining techniques to extract knowledge from Web data (Jaideep Srivastava, Prasanna Desikan, Vipin Kumar, 2012). Web mining makes it possible to find relevant results on the web and is used to extract meaningful information from the patterns discovered in data stored on servers. Web usage mining is a type of web mining which mines information including Web documents, hyperlinks between documents, usage logs of web sites, etc. (S.K. Malik, 2011).
Figure 1 Screen shot of pre-processed log file
Figure 2 Architecture of proposed system
Figure 2.1 Screen shot of final table in apriori algorithm

Classification algorithms are the most commonly used data mining models and are widely used to extract valuable knowledge from huge amounts of data (Neslihan Dogan, Zuhal Tanrikulu, 2013). There are various classification algorithms, and each of them provides different benefits depending on the type of data set on which it is used. In this paper, the data set that we consider for classification is log data. Before applying any classification algorithm, it is necessary to process the log file into a readable form by separating its data into their respective fields, so that we can extract only those fields which are required for the classification. We processed the log file, extracted the middle word from each URL and arranged these words by session, where every ten minutes makes a session. Figure 1 shows the input for the classification algorithms.

The classification algorithms used here are the Apriori algorithm (a classic algorithm used in data mining for learning association rules), the Decision Tree algorithm (this is based on conditional probabilities and it generates rules) and the Naive Bayes algorithm (it uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data). We have considered eleven classes, namely business, entertainment, education, gaming, information technology, mailing, news, research, search engine, security and social networking.
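The pre-processing described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this study: log entries are assumed to be (timestamp, URL) pairs, the "middle word" is assumed to be the second-to-last label of the host name, and sessions are assumed to be fixed ten-minute windows counted from the first entry. The names middle_word and sessionize are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from urllib.parse import urlparse

SESSION_LENGTH = timedelta(minutes=10)  # "every ten minutes makes a session"


def middle_word(url):
    """Return the host label before the top-level domain, e.g. 'facebook'."""
    # urlparse only recognises the host when a scheme or a leading '//' is present.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the "middle word" is the second-to-last label of the host.
    # Hosts such as www.bbc.co.uk would need extra handling.
    return labels[-2] if len(labels) >= 2 else host


def sessionize(entries):
    """Group (timestamp, url) pairs into ten-minute sessions of middle words."""
    if not entries:
        return []
    entries = sorted(entries)
    start = entries[0][0]
    sessions = defaultdict(list)
    for timestamp, url in entries:
        # Integer index of the ten-minute window this entry falls into.
        index = (timestamp - start) // SESSION_LENGTH
        sessions[index].append(middle_word(url))
    return [sessions[i] for i in sorted(sessions)]
```

Each returned session is a list of middle words, which matches the kind of input that Figure 1 describes for the classification algorithms.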
Figure 2.2 Screen shot of table for decision tree algorithm
Figure 2.3 Screen shot of table for Naive Bayes algorithm

2. ARCHITECTURE
Figure 2 shows the architecture of the proposed system. The phases of the proposed system are as follows:
1. Raw data is the log file obtained from the server log.
2. Pre-processing cleans the data by removing irrelevant and redundant log entries before mining.
3. Extraction takes the middle word from each URL in the given dataset. For example, if the URL is www.facebook.com, the middle word to extract is facebook.
4. Pattern analysis is the process of analyzing the trends of the users.
5. Classification groups the users based on the predefined classes. The phases of classification are training and testing.
6. Comparison and analysis compares the algorithms in order to choose the best among them based on precision.
We have used three algorithms for analysis, namely the Apriori, Decision Tree and Naive Bayes algorithms.

2.1. Apriori Algorithm
The first and most influential algorithm for efficient association rule discovery is Apriori (Markus Hegland, 2005). The main purpose of using this algorithm in our project is to find frequently visited sites.
Step 1 - The first step in this algorithm is to find frequent item sets from the data. This is done by forming combinations of items.
Step 2 - The second step is to count the number of times each combination appears in the data set and to remove those item sets whose count is below the minimum count.
Step 3 - In the third step, these two steps are repeated until the count reaches the maximum value.
We compared each root word of every session with the remaining root words in the data set to find the frequent item sets. In each iteration, we set a certain value as the threshold, and all the item sets whose counts are below the threshold are eliminated. As a result, the combinations of URLs and their counts change from one iteration to the next. Figure 2.1 shows a table obtained after classification. Here, every row contains the frequently visited sites of a session.

Figure 3.1 Graph showing number of URLs classified into each class by the classification algorithms
Figure 3.2 Graph showing precision and recall values for the Apriori, Decision Tree and Naive Bayes algorithms respectively

2.2. Decision Trees Algorithm
A decision tree is a classifier expressed as a recursive partition of the instance space (Lior Rokach, "Decision Trees", Data Mining and Knowledge Discovery Handbook). The decision tree has a tree-like structure made up of nodes. The node that has no incoming edges is called the root; every other node has exactly one incoming edge. A test node is a node with outgoing edges; it is also called an internal node. Nodes which do not have any outgoing edges are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. In the simplest and most frequent case, each test considers a single attribute, such that the instance space is partitioned according to the attribute's value. Each leaf is assigned to one class representing the most appropriate target value.
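The node structure described above can be sketched as follows. This is a minimal illustration and not the tree built in this study: the Node class, the classify function, the attribute names middle_word and subdomain, and the example tree are all hypothetical. Each internal node tests a single attribute, and classification follows one outgoing edge per test until a leaf is reached.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    attribute: Optional[str] = None  # attribute tested here (internal nodes only)
    children: Dict[str, "Node"] = field(default_factory=dict)  # one edge per attribute value
    label: Optional[str] = None  # class assigned here (leaves only)

    def is_leaf(self):
        # A leaf is a node without outgoing edges.
        return not self.children


def classify(node, instance, default="unknown"):
    """Walk from the given root down to a leaf and return the leaf's class."""
    while not node.is_leaf():
        value = instance.get(node.attribute)
        if value not in node.children:
            return default  # no edge matches this attribute value
        node = node.children[value]
    return node.label


# Hypothetical example tree: the root tests the middle word of the URL and
# one internal node tests a made-up "subdomain" attribute.
example_tree = Node(
    attribute="middle_word",
    children={
        "facebook": Node(label="social networking"),
        "google": Node(
            attribute="subdomain",
            children={
                "mail": Node(label="mailing"),
                "www": Node(label="search engine"),
            },
        ),
    },
)
```

For instance, classify(example_tree, {"middle_word": "google", "subdomain": "mail"}) follows two edges and ends at the leaf labelled mailing.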
Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path. The decision tree algorithm has the following steps:
Step 1 - The given input value is tested at the parent node to decide to which class the input belongs.
Step 2 - According to the outcome of the test, the input is assigned to a leaf node, which means that a class (such as business, education or social networking) is assigned to the input value.
These two steps are applied recursively. Figure 2.2 shows the business, education and entertainment tables created after applying the classification algorithm.

2.3. Naive Bayes Algorithm
This classification method is named after Thomas Bayes, who proposed Bayes' Theorem. Bayesian classification is a supervised learning method as well as a statistical method for classification. It can solve diagnostic and predictive problems (Medhekar, Mayur P. Bote, Shruti D. Deshmukh, 2013). The algorithm is implemented by the following steps:
Step 1 - In every session, we find the probability of each URL as the ratio of the number of times the URL appears in the session to the total number of URLs in the session.
Step 2 - The category of each URL is found and the number of times each class appears is counted.
Figure 2.3 shows the screen shot of the output after applying the Naive Bayes algorithm.
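The two counting steps of Section 2.3 can be sketched as follows. This is an illustration of those steps as stated, not a complete Naive Bayes classifier and not the code used in this study. The CATEGORY lookup table and the function names url_probabilities and class_counts are hypothetical, and a session is assumed to be a list of middle words.

```python
from collections import Counter

# Hypothetical lookup from middle word to class; the real mapping is not
# given in the paper.
CATEGORY = {
    "facebook": "social networking",
    "gmail": "mailing",
    "google": "search engine",
}


def url_probabilities(session):
    """Step 1: occurrences of each URL divided by the number of URLs in the session."""
    counts = Counter(session)
    total = len(session)
    return {url: n / total for url, n in counts.items()}


def class_counts(session, categories=CATEGORY):
    """Step 2: look up the class of each URL and count how often each class appears."""
    return Counter(categories[url] for url in session if url in categories)
```

For the session ["facebook", "google", "facebook", "gmail"], Step 1 gives facebook a probability of 2/4 = 0.5 and Step 2 counts the social networking class twice. URLs missing from the lookup table are skipped in Step 2, which is a simplification made for this sketch.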
3. EXPERIMENTAL RESULTS
For the final table obtained from each algorithm, we counted the total number of URLs present in every class. We plotted a graph for each algorithm, taking the categories on the horizontal axis and the number of URLs in each class on the vertical axis. The classification algorithms are compared using this graph.

3.1. Graph
For each algorithm, the graph shows the number of URLs in every class. As we can see from the graph, the Naive Bayes algorithm has the maximum number of URLs in each class (for example, 533 URLs in the business class, 304 in education and 328 in entertainment), while the Apriori algorithm has the minimum number of URLs in every class (for example, 60 URLs in the business class, 9 in education and 17 in entertainment). The decision tree algorithm has fewer URLs in every class than the Naive Bayes algorithm (for example, 374 URLs in the business class, 247 in education and 239 in entertainment). Hence, by looking at this graph (Figure 3.1), we can conclude that the Naive Bayes algorithm is best suited for our dataset. We have also calculated precision and recall values to check the correctness of the result. The following figure shows the graph plotted for the precision values of these algorithms.

3.2. Precision Graph
From the precision values, we see that Naive Bayes identifies the highest number of URLs in each class, as it has the highest precision. Hence, this is the best suited algorithm for the given data set (Figure 3.2).

4. CONCLUSION
This project was an attempt to compare various classification algorithms and to analyse which one is best for a given data set. Here we have considered three algorithms, namely the Apriori, Decision Tree and Naive Bayes algorithms. The same can be tried on a different dataset to check whether the same results occur.

SUMMARY
1. The web is very large and volatile. At any given point of time, various users around the world use the web. This information is stored in log files.
Managing the huge database becomes a problem and searching becomes slow. The solution is to apply a classification algorithm.
2. Various classification algorithms are used in data mining. Each of them works efficiently for a particular type of data set. Hence, to know which algorithm is suitable for our data set, we need to apply selected algorithms to the data set. Depending upon their efficiency and correctness, we can decide which classification algorithm is best suited.
3. To do this, the raw data needs to be preprocessed. Then the middle word is extracted from each URL. To these words we apply the classification algorithms, and then analysis is carried out to determine which algorithm gives the best results based on precision.
4. After applying the classification algorithms, we plot a graph which shows the number of URLs identified for each class by each of the algorithms.

FUTURE ISSUES
The way to expand this project is to implement more algorithms, such as particle swarm optimization, grid based algorithms etc., and to check whether the efficiency remains the same or not.

DISCLOSURE STATEMENT
There is no special financial support for this research work from any funding agency.

REFERENCES
1. Raj Kumar, Dr. Rajesh Verma, "Classification Algorithms for Data Mining: A Survey", International Journal of Innovations in Engineering and Technology, 2nd August 2012.
2. Lior Rokach, "Decision Trees", Data Mining and Knowledge Discovery Handbook.
3. Harry Zhang and Jiang Su, "Naive Bayesian Classifiers for Ranking", Machine Learning: ECML 2004, September 2004.
4. Jaideep Srivastava, Prasanna Desikan, Vipin Kumar, "Web Mining - Accomplishments and Future Directions", 19 September 2012.
5. Cristóbal Romero, Sebastián Ventura, Pedro G. Espejo and César Hervás, "Data Mining Algorithms to Classify Students", 2008.
6. Bruno Fernandes Chimieski, Rubem Dutra Ribeiro Fagundes, "Association and Classification Data Mining Algorithms Comparison over Medical Datasets", Journal of Health Informatics, 2013.
7. Medhekar, Mayur P. Bote, Shruti D. Deshmukh, International Journal of Enhanced Research in Science Technology & Engineering, Volume 2, Issue 3, March 2013.
8. Neslihan Dogan, Zuhal Tanrikulu, "A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness", Information Technology and Management, Volume 14, Issue 2, June 2013.
9. S.K. Malik, "Information Extraction Using Web Usage Mining, Web Scrapping and Semantic Annotation", Institute of Electrical and Electronics Engineers, 7 Jan 2011.
10. Markus Hegland, "The Apriori Algorithm - a Tutorial", 30 March 2005.