ANALYSIS COMPUTER SCIENCE Discovery Science, Volume 9, Number 20, April 3, Comparative Study of Classification Algorithms Using Data Mining


ANALYSIS COMPUTER SCIENCE
Discovery Science, Volume 9, Number 20, April 3, 2014
ISSN 2278-5485, EISSN 2278-5477

Comparative Study of Classification Algorithms Using Data Mining

Akhila GS, Madhu GD, Madhu MH, Pooja MH
Dept. of Computer Science, K.L.E.I.T, Hubli, India

Correspondence to: Pooja MH, Dept. of Computer Science, K.L.E.I.T, Hubli, India

Publication History
Received: 04 February 2014
Accepted: 28 March 2014
Published: 3 April 2014

Citation
Akhila GS, Madhu GD, Madhu MH, Pooja MH. Comparative Study of Classification Algorithms Using Data Mining. Discovery Science, 2014, 9(20), 17-21

ABSTRACT
Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Classification is a data mining process that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Some classification algorithms used in data mining include Apriori, Decision Trees and the Naïve Bayes algorithm. In this paper, we apply these classification algorithms to log data and carry out a comparative analysis to find out which algorithm is best suited for such analysis.

Keywords: Data mining, Classification algorithms
Abbreviations: URL - Uniform Resource Locator

1. INTRODUCTION
Web mining is the application of data mining techniques to extract knowledge from Web data (Jaideep Srivastava, Prasanna Desikan, Vipin Kumar, 2012). Web mining makes it possible to find relevant results on the web and is used to extract meaningful information from the patterns discovered in the data kept on servers. Web usage mining is a type of web mining which mines information including Web documents, hyperlinks between documents, usage logs of web sites, etc. (S.K. Malik, 2011).
Classification algorithms are the most commonly used data mining models and are widely used to extract valuable knowledge from huge amounts of data (Neslihan Dogan, Zuhal Tanrikulu, 2013). There are various classification algorithms, and each of them provides different benefits depending on the type of data set on which it is used. In this paper, the data set that we considered for classification is log data. Before applying any classification algorithm, it is necessary to process the log file into a readable form by separating the data in the log file into their respective fields, so that we can extract only those fields which are required for the classification. We processed the log file, extracted the middle word from the URLs and arranged them according to sessions, where every ten minutes makes a session. Figure 1 shows the input for the classification algorithms.

Figure 1 Screen shot of pre-processed log file
Figure 2 Architecture of proposed system
Figure 2.1 Screen shot of final table in Apriori algorithm

The classification algorithms used here are:
- Apriori algorithm: a classic algorithm used in data mining for learning association rules.
- Decision Tree algorithm: based on conditional probabilities; it generates rules.
- Naive Bayes algorithm: uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data.

We have considered eleven classes, namely business, entertainment, education, gaming, information technology, mailing, news, research, search engine, security and social networking.
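The pre-processing described above (taking the middle word of each URL and grouping entries into ten-minute sessions) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `middle_word` and `group_into_sessions`, the rule "take the label just before the top-level domain", the choice of fixed windows counted from the first entry, and the sample log entries are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit

SESSION_LENGTH = timedelta(minutes=10)  # "every ten minutes makes a session"


def middle_word(url):
    """Return the host label just before the top-level domain (assumed rule)."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat a bare host as the network location
    host = urlsplit(url).hostname or ""
    labels = [label for label in host.split(".") if label]
    if not labels:
        return ""
    return labels[-2] if len(labels) >= 2 else labels[0]


def group_into_sessions(entries):
    """Group (timestamp, url) pairs into fixed ten-minute windows.

    Windows are counted from the earliest timestamp; empty windows are skipped.
    """
    entries = sorted(entries)
    if not entries:
        return []
    start = entries[0][0]
    sessions = {}
    for timestamp, url in entries:
        index = (timestamp - start) // SESSION_LENGTH  # integer window number
        sessions.setdefault(index, []).append(middle_word(url))
    return [sessions[i] for i in sorted(sessions)]


# Made-up log entries, for illustration only.
log = [
    (datetime(2014, 1, 1, 9, 0), "www.facebook.com"),
    (datetime(2014, 1, 1, 9, 4), "http://www.google.com/search?q=x"),
    (datetime(2014, 1, 1, 9, 12), "www.bbc.com"),
]
sessions = group_into_sessions(log)
print(sessions)
```

Under these assumptions the first two entries fall into one session and the third into the next. Hosts with multi-part suffixes (for example a `.co.uk` address) would need a different extraction rule than the one assumed here.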

Figure 2.2 Screen shot of table for decision tree algorithm
Figure 2.3 Screen shot of table for Naive Bayes algorithm

2. ARCHITECTURE
Figure 2 shows the architecture of the proposed system. The phases of the proposed system are as follows:
1. Raw data is the log file obtained from the server log.
2. Pre-processing is the process of cleaning the log and removing the irrelevant and redundant log entries before mining.
3. Extraction takes the middle word from each URL in the given dataset. For example, if the URL is www.facebook.com, the middle word to extract is facebook.
4. Pattern analysis is the process of analyzing the trends of the users.
5. Classification is grouping the users based on the predefined classes. The phases of classification are training and testing.
6. Comparison and analysis compares the algorithms in order to choose the best among them based on precision.
We have used three algorithms for analysis, namely the Apriori, Decision Trees and Naive Bayes algorithms.

2.1. Apriori Algorithm
The first and most influential algorithm for efficient association rule discovery is Apriori (Markus Hegland, 2005). The main purpose of using this algorithm in our project is to find frequently visited sites.
Step 1 - The first step in this algorithm is to find frequent item sets from the data. This is done by taking combinations of items.
Step 2 - The second step is to count the number of times each combination has appeared in the data set and to remove those item sets whose count is below the minimum count.
Step 3 - In the third step, these two steps are repeated until the count reaches the maximum value.
We compared each root word of every session with the remaining root words in the data set to find the frequent item sets. In each iteration, we set a certain value as the threshold, and all the item sets whose values are below the threshold are eliminated. The combinations of URLs and their counts also change in each iteration. Figure 2.1 shows a table obtained after classification. Here, every row contains the frequently visited sites of a session.

Figure 3.1 Graph showing number of URLs classified into each class by the classification algorithms
Figure 3.2 Graph showing precision and recall values for Apriori, decision trees and Naive Bayes algorithm respectively

2.2. Decision Trees Algorithm
A decision tree is a classifier expressed as a recursive partition of the instance space (Lior Rokach, "Decision Trees", Data Mining and Knowledge Discovery Handbook). The decision tree has a tree-like structure. It consists of nodes: a node that has no incoming edges is called the root, and every other node has exactly one incoming edge. A test node is a node with outgoing edges; it is also called an internal node. Nodes which do not have any outgoing edges are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. In the simplest and most frequent case, each test considers a single attribute, such that the instance space is partitioned according to the attribute's value. Each leaf is assigned to one class representing the most appropriate target value.
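The structure just described (a root, internal test nodes that each examine a single attribute, and leaves that carry a class) can be illustrated with a small hand-built tree. This is only a sketch under stated assumptions: the `Node` class, the `classify` helper, the attribute names `keyword` and `service`, and the tree itself are hypothetical and are not taken from the paper, which does not say how its tree was induced.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    attribute: Optional[str] = None                    # attribute tested at an internal node
    children: Dict[str, "Node"] = field(default_factory=dict)  # one outgoing edge per value
    label: Optional[str] = None                        # class assigned at a leaf

    def is_leaf(self):
        return not self.children                       # leaves have no outgoing edges


def classify(root, instance, default="unknown"):
    """Walk from the root to a leaf, following the outcome of each test."""
    node = root
    while not node.is_leaf():
        value = instance.get(node.attribute)
        if value not in node.children:
            return default                             # no edge matches this value
        node = node.children[value]
    return node.label


# Hypothetical tree: the root tests the URL keyword, one branch tests a second attribute.
tree = Node(
    attribute="keyword",
    children={
        "facebook": Node(label="social networking"),
        "google": Node(
            attribute="service",
            children={
                "mail": Node(label="mailing"),
                "search": Node(label="search engine"),
            },
        ),
    },
)

print(classify(tree, {"keyword": "google", "service": "mail"}))
```

Each internal node here partitions instances by the value of one attribute, which corresponds to the "simplest and most frequent case" mentioned in the text.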
Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcome of the tests along the path. The decision tree algorithm has the following steps:
Step 1 - The given input value is tested at the parent node to decide which class the input belongs to.
Step 2 - According to the outcome of the test, the input is assigned to a leaf node, which means that a class (such as business, education or social networking) is assigned to the input value.
These two steps are applied recursively. Figure 2.2 shows the business, education and entertainment tables created after applying the classification algorithm.

2.3. Naive Bayes Algorithm
This classification method is named after Thomas Bayes, who proposed Bayes' Theorem. Bayesian classification is a supervised learning method as well as a statistical method for classification. It can solve diagnostic and predictive problems (Medhekar, Mayur P. Bote, Shruti D. Deshmukh, 2013). The algorithm is implemented by the following steps:
Step 1 - In every session, we find the probability of each URL as the ratio of the number of times the URL has appeared in the session to the total number of URLs in the session.
Step 2 - The category of each URL is found, and the number of times each class appears is counted.
Figure 2.3 shows the screen shot of the output after applying the Naive Bayes algorithm.
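The two steps listed for Section 2.3 can be sketched as below. This mirrors only what the steps describe (a per-session relative frequency for each URL, followed by a count per class) and is not a full Naive Bayes classifier with priors and likelihoods. The function names, the `category_of` lookup table and the sample session are hypothetical.

```python
from collections import Counter


def url_probabilities(session):
    """Step 1: relative frequency of each URL keyword within one session."""
    counts = Counter(session)
    total = len(session)
    return {word: count / total for word, count in counts.items()}


def class_counts(session, category_of):
    """Step 2: look up the class of each URL keyword and count the classes."""
    return Counter(category_of.get(word, "unknown") for word in session)


# Made-up session and keyword-to-class table, for illustration only.
session = ["facebook", "google", "facebook", "gmail"]
category_of = {
    "facebook": "social networking",
    "google": "search engine",
    "gmail": "mailing",
}

probabilities = url_probabilities(session)
counts = class_counts(session, category_of)
print(probabilities)
print(counts)
```

In this example "facebook" appears in two of the four entries, so its probability in the session is 0.5, and the class "social networking" is counted twice.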

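The comparison in Section 3 is based on precision and recall. As a reminder of how these two measures are computed for a single class, here is a minimal sketch. The paper does not describe how its reference labels were obtained, so the `predicted` and `actual` lists below are invented purely for illustration, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(predicted, actual, target):
    """Precision and recall of one class, given parallel lists of labels."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == target and a == target)
    false_pos = sum(1 for p, a in pairs if p == target and a != target)
    false_neg = sum(1 for p, a in pairs if p != target and a == target)
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented labels for four URLs.
predicted = ["business", "business", "news", "business"]
actual = ["business", "news", "news", "business"]

business = precision_recall(predicted, actual, "business")
news = precision_recall(predicted, actual, "news")
print(business, news)
```

Precision is the fraction of URLs assigned to a class that really belong to it, while recall is the fraction of URLs belonging to the class that were assigned to it; the two can differ considerably for the same classifier.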
3. EXPERIMENTAL RESULTS
For the final table obtained in each algorithm, we counted the total number of URLs present in every class. We plotted a graph for each algorithm by taking the categories on the horizontal axis and the number of URLs in each class on the vertical axis. The classification algorithms are compared by using the graph shown below.

3.1. Graph
For each algorithm, the graph (Figure 3.1) shows the number of URLs in every class. As we can see from the graph, the Naive Bayes algorithm has the maximum number of URLs in each class (for example, 533 URLs in the business class, 304 in education and 328 in entertainment); the Apriori algorithm has the minimum number of URLs in every class (for example, 60 URLs in the business class, 9 in education and 17 in entertainment). The decision tree algorithm has fewer URLs in every class than the Naive Bayes algorithm (for example, 374 URLs in the business class, 247 in education and 239 in entertainment). Hence, by looking at this graph, we can conclude that the Naive Bayes algorithm is best suited for our dataset. We have calculated precision and recall values to check the correctness of the result. The following figure shows the graph plotted for the precision values of these algorithms.

3.2. Precision Graph
From the precision values, we see that Naive Bayes identifies the highest number of URLs in each class, as it has the highest precision. Hence, this is the best suited algorithm for the given data set (Figure 3.2).

4. CONCLUSION
This project was an attempt to compare various classification algorithms in order to find the best one for a given data set. Here we have considered three algorithms, namely the Apriori, decision tree and Naive Bayes algorithms. The same can be tried on a different dataset to check whether the same results occur.

SUMMARY
1. The web is very large and volatile. At any given point of time, various users around the world use the web. This information is stored in log files. Managing the huge database becomes a problem and searching becomes slow. The solution is to apply a classification algorithm.
2. Various classification algorithms are used in data mining. Each of them works efficiently for a particular type of data set. Hence, to know which algorithm is suitable for our data set, we need to apply selected algorithms to the data set. Depending upon their efficiency and correctness, we can decide which classification algorithm is best suited.
3. To do this, the raw data needs to be preprocessed. Then the middle word is extracted from each URL. We apply the classification algorithms to the result, and then an analysis is carried out to determine which algorithm gives the best results based on precision.
4. After applying the classification algorithms, we plot a graph which shows the number of URLs identified for each class by each of the algorithms.

FUTURE ISSUES
The way to expand this project is to implement more algorithms, such as particle swarm optimization, grid based algorithms, etc., and check whether the efficiency remains the same or not.

DISCLOSURE STATEMENT
There is no special financial support for this research work from the funding agency.

REFERENCES
1. Raj Kumar, Dr. Rajesh Verma, "Classification Algorithms for Data Mining: A Survey", International Journal of Innovations in Engineering and Technology, 2nd August 2012.
2. Lior Rokach, "Decision Trees", Data Mining and Knowledge Discovery Handbook.
3. Harry Zhang and Jiang Su, "Naive Bayesian Classifiers for Ranking", Machine Learning: ECML 2004, September 2004.
4. Jaideep Srivastava, Prasanna Desikan, Vipin Kumar, "Web Mining - Accomplishments and Future Directions", 19 September 2012.
5. Cristóbal Romero, Sebastián Ventura, Pedro G. Espejo and César Hervás, "Data Mining Algorithms to Classify Students", 2008.
6. Bruno Fernandes Chimieski, Rubem Dutra Ribeiro Fagundes, "Association and Classification Data Mining Algorithms Comparison over Medical Datasets", Journal of Health Informatics, 2013.
7. Medhekar, Mayur P. Bote, Shruti D. Deshmukh, International Journal of Enhanced Research in Science Technology & Engineering, Volume 2, Issue 3, March 2013.
8. Neslihan Dogan, Zuhal Tanrikulu, "A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness", Information Technology and Management, Volume 14, Issue 2, June 2013.
9. S.K. Malik, "Information Extraction Using Web Usage Mining, Web Scrapping and Semantic Annotation", Institute of Electrical and Electronics Engineers, 7 Jan 2011.
10. Markus Hegland, "The Apriori Algorithm - a Tutorial", 30 March 2005.