International Journal of Computer Applications (0975-8887)

Pattern Classification based on Web Usage Mining using Neural Network Technique

Er. Romil V Patel, PIET, VADODARA
Dheeraj Kumar Singh, PIET, VADODARA

ABSTRACT
The traffic on the World Wide Web is increasing rapidly, and a huge amount of data is generated by users' numerous interactions with web sites. Web usage mining is the application of data mining techniques to discover useful and interesting patterns from web usage data. We apply a new approach to classifying user patterns, using modified naive Bayesian classification with a supervised learning technique, for real-time and more complex data. This can be used in marketing for advertising purposes, and by e-commerce sites and government agencies to make pages more dynamic based on user pattern classification. We propose classification of real-time data on the basis of time and accuracy.

Keywords
Analysis of web usage data, classification based on supervised neural network technique with naive Bayesian classification algorithm

1. INTRODUCTION
Web mining is the application of data mining techniques to discover patterns from the Web. It can be broadly defined as the discovery and analysis of useful information from the World Wide Web [1]. Web usage mining has various application areas, such as web prefetching, link prediction, site reorganization, and web personalization. The most important phases of web usage mining are the reconstruction of user sessions using heuristic techniques and the classification of useful patterns from these sessions using pattern classification techniques [7]. Web usage mining, also known as web log mining, aims to evaluate interesting and frequent user browsing behavior from web browsing data stored in web server logs, browser logs, and proxy server logs [2].
Web usage mining [3] is the process of applying data mining techniques to the discovery of usage patterns from web data, targeted towards practical applications such as personalized web search and surfing and web recommendation systems. Data mining efforts associated with the Web, called web mining, can be broadly divided into three classes: web content mining, web structure mining, and web usage mining. The primary goal of web usage mining here is classification based on web log data. Many previous studies have proposed classification with methods such as k-means, naive Bayesian, and neural networks. We propose a new classification algorithm with a supervised neural network technique for real and complex web log data. In particular, we classify web log data by certain criteria in order to improve the web site in future by making it more dynamic.

In web usage mining, pattern discovery is difficult because only bits of information like IP addresses and site clicks are available. But analysis of this usage data yields the information organizations need to provide an effective presence to their customers. The most effective way to retrieve useful information from a database is application dependent. Usage mining is also valuable to e-businesses whose business is based solely on the traffic provided through search engines. This type of web mining helps to gather important information from customers visiting the site, which enables an in-depth log analysis of a company's productivity flow. E-businesses depend on this information to direct the company to the most effective web server for promotion of their product or service.

2. PATTERN CLASSIFICATION ON WEB USAGE DATA
After user sessions have been identified, the various techniques of web usage pattern discovery are applied in order to detect interesting and useful patterns. Several kinds of access pattern mining can be performed, depending on the needs of the analyst.
Some pattern discovery techniques are path analysis, clustering, association rules, and sequential patterns, which are used for user habit identification [9]. Here we use classification analysis, in which data items are classified according to predefined categories. User habits are identified through web usage mining using a neural network technique. In our work, the web log data is divided into particular time sessions, and the most visited web pages are identified for future dynamic changes; HTTP errors are also identified per page and per day [1]. We classify URLs according to our criteria using the naive Bayesian classification algorithm with a supervised neural network technique, to improve classification accuracy and to reduce the time needed to count the visits matching a particular criterion within a session [11].
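As a rough illustration of the session-based counting described above, the Python sketch below groups cleaned log records into fixed time slots, finds the most visited page per slot, and counts HTTP errors per day. The sample records, the 3-hour slot width, and all function names are assumptions made for this example; they are not taken from the paper's implementation.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical, already-cleaned log records: (timestamp, URL, HTTP status).
RECORDS = [
    ("2013-03-01 09:05:00", "/index.html", 200),
    ("2013-03-01 09:40:00", "/papers.html", 200),
    ("2013-03-01 10:10:00", "/index.html", 200),
    ("2013-03-01 10:15:00", "/old.html", 404),
    ("2013-03-01 14:20:00", "/papers.html", 200),
    ("2013-03-02 09:30:00", "/index.html", 200),
    ("2013-03-02 09:45:00", "/admin", 403),
]

SLOT_HOURS = 3  # assumed slot width, giving 8 time sessions per day


def session_slot(timestamp):
    """Map a timestamp to a 1-based time-session number within its day."""
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    return hour // SLOT_HOURS + 1


def hits_per_session(records):
    """Count successful hits per URL inside each time session."""
    hits = defaultdict(Counter)
    for timestamp, url, status in records:
        if status == 200:
            hits[session_slot(timestamp)][url] += 1
    return hits


def errors_per_day(records):
    """Count HTTP error responses (status >= 400) per (day, status) pair."""
    errors = Counter()
    for timestamp, _url, status in records:
        if status >= 400:
            errors[(timestamp[:10], status)] += 1
    return errors


hits = hits_per_session(RECORDS)
# Most visited (URL, hit count) pair for every time session that has hits.
most_visited = {slot: counts.most_common(1)[0] for slot, counts in hits.items()}
errors = errors_per_day(RECORDS)
```

Keeping one `Counter` per session makes the "most visited page" query a single `most_common(1)` call; a different slot width only requires changing `SLOT_HOURS`.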
3. PROPOSED METHODOLOGY
The proposed algorithm classifies web log data according to our predefined criteria. In this model, the preprocessing stages are applied first, and our new classification approach is applied afterwards.

Fig 1: Model for web log data classification (stages: Web Log Data -> Data Cleaning -> User Session Identification -> Naive Bayesian Classification with Supervised Learning Technique -> Result -> Testing & Validating Performance -> System Performance Measure)

3.1 Web Log Data
We use real web log data for better classification and to improve the web site by making it dynamic. We analyse data from the site www.ijprs.com covering the last 12 days. The raw log data is 5.5 MB, and the size is reduced after data cleaning.

3.2 Data Cleaning
Items that are not relevant to usage analysis must be removed from the log files. When a user requests a particular page from the web server, various log entries are recorded. If the page contains images, videos, scripts, flash animations, etc., the resource requests for them are also added to the log file [4]. The objective of web usage mining is to find users' behavior, so the entries for these resource requests carry no useful information and must be removed from the log file. Irrelevant items can be eliminated by checking the suffix of the URL, which indicates the file format [5]. For example, log entries with the URL suffixes jpg, gif, css, js, mov, avi, swf, etc. can be removed. Web servers can be configured to write different fields into the log file in different formats. The most common fields used by web servers are: IP Address, Login Name, User Name, Timestamp, Request, Status, Bytes, Referrer, and User Agent.

1) Declare filename, method, IP address, file extension, hostname, username, timestamp, offset, protocol, bytes, and status code.
2) Open a database connection.
3) Create a PreparedStatement object to read each record in the log table.
4) For each record read from the log table:
   i. Read the status code as extracted from the database.
   ii. Read the method as extracted from the database.
   iii. If (status code = 200 or method = GET) {
      1. Read IP address, hostname, username, timestamp, offset, protocol, bytes, and path.
      2. Extract the file extension from the path.
      3. If file extension != {*.gif, *.jpg, *.css, *.swf, *.avi, *.mov}
      4.    Insert the data entry into the summarized log table.
      5. Else
      6.    Remove the data entry.
   }
   7. Close connection.
   8. End
Output: Summarized log table

3.3 User Session Identification
After data cleaning, unique users must be identified. One simple method is to use login information, if users log in before using the web site or system. Another approach is to use cookies, which identify the visitors of a web site by storing a unique ID [8]. However, these two methods are not general enough, because they depend on the application domain and the quality of the source data. A more general method identifies users as follows: a new IP address indicates a new user, and the same IP address with a different user agent also indicates a new user. Two user agents are considered different if they represent different web browsers or operating systems in terms of type and version [9]. The list of log entries is sorted by the combination of the user's IP address or host name and the user agent. The result is a list in which all entries generated by the same user are clustered together and stored as separate log entry lists.

3.4 Proposed Naive Bayesian Classification Algorithm
1. Let T be a training set of samples with k attributes $X_1, X_2, \ldots, X_k$, each sample given by an n-dimensional vector $Q = \{y_1, y_2, \ldots, y_n\}$.
2. Let P denote the probability.
3. Given a sample Q, the classifier performs the prediction to determine the attribute having the highest posterior probability, such that $P(X_i \mid Q) > P(X_j \mid Q)$, where $i, j = 1, 2, \ldots, k$ and $j \neq i$.
4. The maximum posterior hypothesis is calculated using
   $P(X_i \mid Q) = \dfrac{P(Q \mid X_i)\, P(X_i)}{P(Q)}$
5. Maximize $P(Q \mid X_i)\, P(X_i)$ if both $P(Q \mid X_i)$ and $P(X_i)$ are known, or $P(Q \mid X_i)$ if only $P(Q \mid X_i)$ is known.
6. If the web log data set contains many attributes, the computation time becomes large; it can be reduced using the following equation:
   $P(Q \mid X_i) \approx \prod_{t=1}^{n} P(y_t \mid X_i)$
7. Repeat steps 4 to 6 until all criteria are matched.
8. Compare the processed results to find the URL with the highest hits for a particular time slot.
9. Create a graph of the session-based results.
10. Find the accuracy and time for the session data input sets.

3.4.1 Supervised Learning Technique
inputs: examples, a set of examples, each with input $x = x_1, x_2, \ldots, x_n$ and output $y$
        network, a perceptron with weights $W_j$, $j = 0, \ldots, n$, and activation function $g$
repeat
   for each e in examples do
      $in \leftarrow \sum_{j=0}^{n} W_j \, x_j[e]$
      $Err \leftarrow y[e] - g(in)$
      $W_j \leftarrow W_j + Err \times g'(in) \times x_j[e]$
   end
until all examples correctly predicted or stopping criterion is reached
return network

4. PROBLEM STATEMENT
The k-means algorithm uses predefined samples for classification, and its result depends on the particular value of k. It takes more time to process the data sets and can handle only numerical data. Its main disadvantage is the time taken to classify web log data; another is that it cannot classify large data sets. Some samples are classified on a basis that is already predefined in the data sets. It also does not give sufficiently accurate results on web log data.

5. EXPERIMENT RESULT
The above algorithm is implemented in the Java programming language using Eclipse and NetBeans with SQL.
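To make the naive Bayesian steps of Section 3.4 concrete, the following Python sketch trains on a handful of labelled samples and picks the label $X_i$ that maximizes $P(Q \mid X_i)\, P(X_i)$, using the product of per-feature probabilities from step 6. It is only an illustration and not the Java implementation used in the experiments: the feature names, the labels, the function names, and the use of Laplace smoothing and log probabilities are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training samples: each Q is a tuple of categorical features
# (file type, time of day) and each label X_i is a predefined criterion.
TRAINING = [
    (("html", "morning"), "article"),
    (("html", "evening"), "article"),
    (("pdf", "morning"), "download"),
    (("pdf", "evening"), "download"),
    (("html", "morning"), "article"),
]


def train(samples):
    """Collect the counts needed to estimate P(X_i) and P(y_t | X_i)."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = defaultdict(Counter)  # (label, position) -> value counts
    values = defaultdict(set)              # position -> distinct values seen
    for features, label in samples:
        for pos, value in enumerate(features):
            feature_counts[(label, pos)][value] += 1
            values[pos].add(value)
    return class_counts, feature_counts, values


def classify(model, q):
    """Return the label X_i that maximizes P(Q | X_i) * P(X_i)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Log space avoids underflow: log P(X_i) + sum_t log P(y_t | X_i).
        score = math.log(count / total)
        for pos, value in enumerate(q):
            # Laplace (add-one) smoothing: an assumption, not in the paper.
            numerator = feature_counts[(label, pos)][value] + 1
            denominator = count + len(values[pos])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify(model, ("pdf", "morning")))
```

The denominator $P(Q)$ from step 4 is omitted because it is the same for every label and does not change which one is largest.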
Fig 2: Number of counts for particular criteria in a session

Fig 3: Time spent in a particular session to count data

Fig 4: Accuracy of data classification

Here we also classify errors on a particular day by session.
Fig 5: Error classification (hits per date for the HTTP errors 404 Not Found, 403 Forbidden, and 503 Service Unavailable)

Having obtained these results, we build a session-based classification for our site in order to make it dynamic, finding how many hits occur in a particular session and which web pages users visit. We compare the earlier k-means results with our new approach.

Table 1: Time taken by the different algorithms (time in seconds)
Session   K-means   Naive Bayesian with supervised learning
1.1 18.1
2         22.15     16.23
3         28.3      17.21
4         4.25      15.12
5         42.27     14.
6         32.52     15.23
7         25.12     19.38
8         28.3      15.29

Fig 6: Time taken in different sessions (time in seconds against session)

Table 2: Accuracy of data classification
Session   No. of test cases   K-means classification   Naive Bayesian classification
1 255.39% 48.57%
2         17                  28.47%                   4.28%
3         128                 33.6%                    39.25%
4         85                  35.3%                    41.91%

Fig 7: Accuracy of data classification (accuracy against number of sessions, for K-means and naive Bayesian)

6. CONCLUSION & FUTURE WORK
In this paper we have presented an overview of a technique for naive Bayesian classification with supervised learning. The main objective is to classify user habits more accurately by dividing the data into sessions after data cleaning, so that web sites and web pages can be made more dynamic in future for business improvement, marketing, and security at government agencies. Here we classify URLs based on predefined criteria. In this study we report classification results based on the time and accuracy of data classification. As the popularity of the web continues to increase, there is a growing need to develop tools and techniques that will help improve its overall usefulness. In future, to improve web sites or make web pages dynamic, larger data sets should be used to obtain more accurate classification.
7. REFERENCES
[1] Jaideep Srivastava, Prasanna Desikan, Vipin Kumar. Web Mining - Concepts, Applications & Research Directions.
[2] D. Vasumathi, Dr. A. Govardhan, K. Venkateswara Rao. 5-9. Performance Improvements and Efficient Approach for Mining Periodic Sequential Access Patterns. International Journal of Computer Science and Security (IJCSS), Volume (3): Issue (5).
[3] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan. 2000. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Vol. 1, No. 2.
[4] Pang-Ning Tan and Vipin Kumar. Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery, 2002, 6(1), pp. 9-35.
[5] Lalani, A.S. Data Mining of Web Access Logs. School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, Melbourne, Victoria, Australia, 3.
[6] R. Cooley, B. Mobasher and J. Srivastava (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th
[7] IEEE International Conference on Tools with AI (ICTAI '97), November.
[8] J. Srivastava, R. Cooley, M. Deshpande and P-N. Tan (2000). Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Vol. 1, Issue 2.
[9] Sonali Muddalwar, Shashank Kawar. Applying Artificial Neural Network in Web Usage Mining. International Journal of Computer Science and Management Research, Vol. 1, Issue 4, November 12.
[10] Ms. Vinita Shrivastava, Mr. Neetesh Gupta. Performance Improvement of Web Usage Mining by Using Learning Based K-Mean Clustering. International Journal of Computer Science and its Applications.
[11] S. Taherizadeh and N. Moghadam. Integrating Web Content Mining into Web Usage Mining for Finding Patterns and Predicting User's Behaviors. An International Journal of Information Science and Management, January / June- 9, Vol. 7.
[12] Prakash S Raghavendra, Shreya Roy Chowdhury, Srilekha Vedula Kameswari. Web Usage Mining using Statistical Classifiers and Fuzzy Artificial Neural Networks. International Journal Multimedia and Image Processing (IJMIP), Volume 1, Issue 1, March 11.

IJCA TM : www.ijcaonline.org