Web Log Data Cleaning For Enhancing Mining Process


V. CHITRAA*, Dr. ANTONY SELVADOSS THANAMANI**
*(Assistant Professor, CMS College of Science and Commerce)
**(Reader in Computer Science, NGM College (AUTONOMOUS), India)

ABSTRACT

The World Wide Web is developing rapidly in its volume of traffic and in the size and complexity of web sites. Web servers accumulate data about users' interactions in log files whenever requests for resources are received. Because of this heavy usage, log files grow quickly and become very large. The complexity of tasks such as web site design and web server design has increased along with this growth. Web usage mining is the application of mining techniques to logs. Log data is usually noisy and ambiguous, so preprocessing is an important step for an efficient mining process. In preprocessing, data cleaning includes the removal of records of graphics, videos and format information, the removal of records with a failed HTTP status code, and robots cleaning. This paper enhances cleaning to remove irrelevant records from the log file and measures the effect of cleaning on the path completion stage. The experimental results show the performance of the proposed methodology, which gives comparatively good results.

Keywords- Data Cleaning, Preprocessing, Path Completion, Transactions, Log Mining

1. Introduction

The World Wide Web develops rapidly day by day. According to the November 2012 Web Server Survey by Netcraft, there are 625,329,303 active sites. Researchers are therefore paying more and more attention to the efficiency of the services offered to users over the internet. Web usage mining is an active technique in this field of research. It is also called web log mining, in which data mining techniques are applied to the web access log. A web access log is a time-series record of users' requests, each of which is recorded when a user sends a request to the web server.
Because server settings differ, there are many types of web logs, but log files typically share the same basic information, such as client IP address, request time, requested URL, HTTP status code and referrer. Web usage mining extracts regularities of user access behavior as patterns, which are defined by combinations, orders or structures of the pages accessed over the internet. Web mining [1] is the application of data mining, artificial intelligence, chart technology and so on to web data; it traces users' visiting behaviors and extracts their interests using patterns. Because of its direct application in e-commerce, Web analytics, e-learning, information retrieval and so on, web mining [2] has become one of the important areas in computer and information science. Web usage mining [3] applies mining methods to log data to extract the behavior of users, which is used in applications such as personalized services, adaptive web sites, customer profiling, prefetching and creating attractive web sites.

To obtain Web log data suitable for data mining, a series of operations must be performed on the original Web log files, such as log consolidation and data cleaning, user and transaction identification, data integration and so on. Web servers accumulate data about users' interactions in log files whenever requests for resources are received. Each row of Web log data represents a URL that the user visits. Attributes of the data include visit time, host, URL and other miscellaneous information about users' actions. The visited URLs in Web log data are only records of users' web-watching behaviours, stored in formats such as the Common Log Format and the Extended Common Log Format as written by Apache and IIS. To obtain a user's interest categories, we need to know the categories of the web pages that the user visits.
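To make the log fields listed above concrete, the following is a minimal sketch of parsing one line in the Common Log Format extended with referrer and user agent. The function name `parse_log_line`, the regular expression and the sample line (IP address, date, URLs) are illustrative assumptions and do not come from the log used in this paper.

```python
import re
from datetime import datetime

# Common Log Format, optionally followed by quoted referrer and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<version>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 12/Jan/2012:10:15:32 +0530
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


# Hypothetical sample line, for illustration only.
sample = ('192.0.2.10 - - [12/Jan/2012:10:15:32 +0530] '
          '"GET /courses/index.html HTTP/1.1" 200 5120 '
          '"http://www.example.edu/" "Mozilla/5.0 (Windows NT 6.1)"')
record = parse_log_line(sample)
```

Each parsed record then carries the attributes that the later preprocessing steps rely on: IP address, time, method, URI, status, bytes, referrer and user agent.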
Web usage mining has three stages: log data pretreatment, mining into patterns, and analysis of the extracted results. Preprocessing [4,5] is an important step because of the complex nature of the Web architecture, and it takes 80% of the mining process. The raw data is pretreated to obtain reliable sessions for efficient mining. Preprocessing includes the domain-dependent tasks of data cleaning, user identification, session identification, path completion and construction of transactions. Data cleaning is the task of removing irrelevant records that are not necessary for mining. User identification is the process of associating page references, even those with the same IP address, with different users. Session identification breaks a user's page references into user sessions. Path completion [6] fills in missing page references in a session. Classification of transactions is used to learn the users' interests and navigational behavior [7].

The second step in web usage mining is knowledge extraction, in which data mining algorithms such as association rule mining, clustering and classification are applied to the preprocessed data. The third step is pattern analysis, in which tools are provided to facilitate the transformation of information into knowledge. A knowledge query mechanism such as SQL is the most common method of pattern analysis.

This paper focuses on data preprocessing, and on a data cleaning technique that removes irrelevant log entries in order to increase the efficiency of path completion. In this study a referrer-based method is proposed to construct reliable transactions efficiently during data preprocessing.

Volume 01 No.11, Issue: 03, Page 49

2. Related Work

The data examined in web usage mining is log data, which differs from other datasets used in data mining, and several problems must be addressed in preparation for mining. The main problem is to obtain a reliable dataset. The data should therefore be pretreated, and users' access behavior should be constructed as transactions [8]. These transactions must be reliable. The first stage of preprocessing is data cleaning, which includes the removal of records of graphics, videos and format information, the removal of records with a failed HTTP status code, and the removal of records entered during robot navigation.

The Common Log Format and the Extended Log Format record only the visitors' browsing activities, not details such as whether requests come from the same user or from different users. This means that different visitors sharing the same host cannot be distinguished. If proxy servers are used, the problem becomes more severe. Users are identified easily by using cookies or an authentication mechanism, but users are not attracted to such sites because of privacy concerns [9]. There are two heuristics for attributing requests to different visitors.

1. If two records have different IP addresses, they are distinguished as two different users; if both IP addresses are the same, the user agent field is checked.
2. If the browser and operating system information in the user agent field differs between two records, they are identified as different users.

After users are identified, the next step is the identification of sessions. A session is a sequence of activities made by one user during one visit to the site. There are three heuristics available to identify sessions from users. Two are based on time and one is based on the navigation of users through the web pages.

Time Oriented Heuristics: The simplest methods are time oriented; one is based on total session time [8] and the other on single page stay time. The set of pages visited by a specific user at a specific time is called page viewing time. The total session time limit varies from 25.5 minutes [8] to 24 hours [10], while the default is 30 minutes according to R. Cooley [9]. The second method depends on page stay time, which is calculated as the difference between two timestamps. If it exceeds 10 minutes, the second entry is assumed to start a new session.

Navigation Oriented Heuristics: This method uses the web topology in graph format. It considers webpage connectivity; however, a hyperlink between consecutive page requests is not required.

Because of proxy servers, and because the client uses cached versions of pages when pressing Back, the identified sessions have many missing pages. A path completion [13] step is therefore carried out to identify the missing pages. Referrer-based methods are used to append the missing pages.

After session construction, transactions [14] are identified. A transaction is defined as a set of homogeneous pages that have been visited in a user session. There are three approaches to identify different types of transactions.
Transaction Identification by Reference Length: The reference length approach assumes that the amount of time a user spends on a page correlates with whether the page should be classified as an auxiliary page or a content page for that user.

Transaction Identification by Maximal Forward Reference: This approach is based on the forward references in the path of pages accessed by a user. A forward reference is defined as a page not already in the set of pages for the current transaction, and a backward reference is defined as a page that is already contained in the set of pages for the current transaction. A new transaction is started when the next forward reference is made. In this approach the last page in a maximal forward reference is considered a content page, and the pages leading to it are treated as auxiliary pages.

Transaction Identification by Time Window: The time window approach partitions a user session into time intervals no larger than a specified parameter. If $W$ is the time window, then $(\mathrm{Date}_m.\mathrm{time} - \mathrm{Date}_1.\mathrm{time}) \le W$, where $m$ is the last page in a session.

3. Methodology

The data for web usage mining can be gathered from different sources, such as the server side, the client side and proxy servers.

Server-side data are the web logs collected when a client requests a web page. Web server logs are plain text and are independent of the server platform. Most web servers follow the Common Log Format, and some follow the Extended Log Format, which adds the referrer and user agent.

Client-side data is collected by remote agents, such as Java applets, that gather user browsing information. Java applets may generate some additional overhead, especially when they are loaded for the first time.

Cookies are unique IDs generated by the web server for individual client browsers; they automatically track the site's visitors [15]. However, users who want privacy and security can disable the browser option for accepting cookies.

Explicit user input data is collected through registration forms and provides important personal and demographic information and preferences. However, this data is not reliable, since it may be incorrect or users may avoid such sites.

Proxy-level collection is data collected from an intermediate server that is used to reduce the loading time of a Web page and the network traffic load. Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers.

Web log data preprocessing is a complex process and takes 80% of the total mining process. Among all the sources, log data is considered reliable and is used for predicting useful patterns. Since log data is noisy, preprocessing cleans the log by removing irrelevant records and finally transforms the raw data into sessions. There are four steps in the preprocessing of log data.

3.1. Data Cleaning

Data cleaning is the removal of outliers or irrelevant data [16]. Analyzing the huge number of records in server logs is a cumbersome activity, so initial cleaning is necessary.
Data cleaning is usually site-specific and involves tasks such as removing extraneous references to embedded objects that may not be important for the purpose of analysis, including references to style files, graphics or sound files. The cleaning process may also involve the removal of at least some of the data fields.

The status code returned by the server is a three-digit number. There are four classes of status code: Success (200 series), Redirect (300 series), Failure (400 series) and Server Error (500 series). The most common failure codes are 401 (failed authentication), 403 (forbidden request to a restricted subdirectory) and the dreaded 404 (file not found). Such entries are useless for the analysis process and are therefore cleaned from the log files.

Data cleaning also covers the handling of null values, noisy data, inconsistent data and similar problems. Inconsistencies in the data reduce the credibility of the data mining results. Data cleaning removes the noise or irrelevant data and also processes missing data fields. Records generated by automated programs such as web robots, spiders and crawlers are also to be removed from the log files. The removal process in the experiment therefore includes the following.

The records of graphics, videos and the format information

Records with a filename extension of GIF, JPEG, CSS and so on, which can be found in the URI field of every record, can be removed. These files are not the web pages the user is actually interested in; they are just documents embedded in the web page. They therefore need not be included when identifying the web pages of interest to the user. This cleaning step avoids unnecessary evaluation and also speeds up the identification of patterns of user interest.

The records with the failed HTTP status code

The HTTP status code is considered in the next cleaning step.
By examining the status field of every record in the web access log, the records with status codes over 299 or under 200 are removed. This cleaning step further reduces the evaluation time for determining the patterns of user interest.

Robots Cleaning

A robot, also called a spider or bot [16], is a software tool that periodically scans a web site to extract its content and automatically follows all the hyperlinks from a web page. Search engines, such as Google, periodically use Web Robots (WRs) to gather all the pages from a web site in order to update their search indexes. The number of requests from one Web Robot may be equal to the number of the web site's URIs. If the web site does not attract many visitors, the number of requests coming from all the WRs that have visited the site might exceed the number of human-generated requests. Eliminating WR-generated log entries not only simplifies the mining task that follows, but also removes uninteresting sessions from the log file. Usually, a WR uses a breadth-first (or depth-first) search strategy and follows all the links from a web page. A WR will therefore generate a huge number of requests on a web site. Moreover, the requests of a WR are outside the scope of the analysis, as the analyst is interested in discovering knowledge about users' behavior. A few techniques are available to detect robot navigation.

Using the robots.txt file: The Robot Exclusion Standard allows Web administrators to specify the pages that robots should not visit, and robots are expected to examine robots.txt on any website. This file is not linked from any of the web pages, so users are unaware of it. The file contains the list of pages that robots are disallowed to visit. An IP address that requests this file is assumed to be a robot.

Using User Agent: Guidelines are provided to robot designers. One of the guidelines asks robots to declare themselves in the User Agent field. But many robot designers hide their identities by using the same user agent field as an ordinary browser.

Using IP Address: Many web sites provide a list of IP addresses of known Web Robots. But keeping such a database up to date is very difficult, since the growth in new robots is tremendous.

Using Method attribute: A request with the HEAD method incurs less overhead, since the response contains only the message header. The guidelines therefore ask robot designers to use the HEAD method.

Using Browsing time: This technique is based on the fact that crawlers retrieve pages in an automatic and exhaustive manner, so they are distinguished by a very high browsing speed. Therefore, for each IP address the browsing speed is calculated, and all requests with this value below a threshold are regarded as made by robots and are removed. The value of the threshold is set by analyzing the browsing behavior in the log files under consideration. Of all the methods, this technique is an efficient one for detecting robots. The reference length is calculated within a session and the threshold is fixed at 2 seconds.

This removal helps in the accurate detection of patterns of user interest by providing only the relevant web log records. If this cleaning is performed before identification begins, only the patterns of real interest to the user are produced in the final identification phase.

3.2. User Identification

The log file after cleaning is considered as the Web Usage Log Set WULS = {UIP, Date, Method, URI, Version, Status, Bytes, ReferrerURL, BrowserOS}. The next important and complex step is unique user identification.
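Before turning to user identification, the three cleaning rules described above (file extensions, status codes and robots) can be summarised in a short sketch. It is an illustration under stated assumptions, not the implementation used in the experiments: records are assumed to be dictionaries with `ip`, `time`, `uri` and `status` keys, the extension list beyond GIF, JPEG and CSS is assumed, and the browsing speed of an IP address is approximated by the mean gap between its consecutive requests, compared against the 2-second threshold.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from urllib.parse import urlparse

# GIF, JPEG and CSS come from the text; the remaining extensions are assumed.
IRRELEVANT_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".css", ".png", ".js", ".ico")


def is_embedded_resource(uri):
    """True if the URI path ends with the extension of an embedded object."""
    return urlparse(uri).path.lower().endswith(IRRELEVANT_EXTENSIONS)


def is_successful(status):
    """Keep only the 200 series; codes under 200 or over 299 are dropped."""
    return 200 <= status <= 299


def robot_ips(records, min_gap_seconds=2.0):
    """IPs that request robots.txt or whose mean request gap is under the threshold."""
    times_by_ip = defaultdict(list)
    flagged = set()
    for rec in records:
        times_by_ip[rec["ip"]].append(rec["time"])
        if urlparse(rec["uri"]).path == "/robots.txt":
            flagged.add(rec["ip"])
    for ip, times in times_by_ip.items():
        times.sort()
        if len(times) > 1:
            mean_gap = (times[-1] - times[0]).total_seconds() / (len(times) - 1)
            if mean_gap < min_gap_seconds:
                flagged.add(ip)
    return flagged


def clean_log(records, min_gap_seconds=2.0):
    """Apply extension and status filtering first, then remove robot records."""
    kept = [r for r in records
            if not is_embedded_resource(r["uri"]) and is_successful(r["status"])]
    robots = robot_ips(kept, min_gap_seconds)
    return [r for r in kept if r["ip"] not in robots]


def _rec(ip, seconds, uri, status=200):
    """Build a hypothetical record; the start date is arbitrary."""
    start = datetime(2012, 1, 10, 9, 0, 0)
    return {"ip": ip, "time": start + timedelta(seconds=seconds),
            "uri": uri, "status": status}


sample_log = [
    _rec("10.0.0.1", 0, "/index.html"),
    _rec("10.0.0.1", 1, "/logo.gif"),
    _rec("10.0.0.2", 0, "/robots.txt"),
    _rec("10.0.0.2", 1, "/a.html"),
    _rec("10.0.0.2", 2, "/b.html"),
    _rec("10.0.0.3", 0, "/a.html"),
    _rec("10.0.0.3", 1, "/b.html"),
    _rec("10.0.0.3", 2, "/c.html"),
    _rec("10.0.0.1", 40, "/about.html"),
    _rec("10.0.0.1", 50, "/missing.html", status=404),
]
cleaned = clean_log(sample_log)
```

In this hypothetical log only the two page views of `10.0.0.1` survive: the image and the 404 request are dropped by the first two rules, and the other two IP addresses are treated as robots.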
Identifying users is complex because local caches and proxy servers are used to speed up browsing. Cookies are used to overcome this, but users may disable cookies. Another solution is to collect registration data from users, but users are reluctant to give their information because of privacy concerns. The majority of records therefore do not contain any information in the user and authentication fields. The fields that are useful for finding unique users and sessions are the IP address, the user agent and the referrer URL. Users and sessions are identified from these fields as follows. If two records have the same IP address, the browser information is checked. If the user agent value is the same for both records, they are identified as coming from the same user.

3.3. Session Identification

The goal of session identification is to divide the page accesses of each user into individual sessions. These sessions are used as data vectors in classification, prediction, clustering and other tasks. If the URL in the referrer URL field of the current record has not been accessed previously, or if the referrer URL field is empty, the record is considered to start a new user session. Reconstructing accurate user sessions from server access logs is a challenging task, and a time-oriented heuristic with a time limit of 30 minutes is followed. From WULS, the set of user sessions is extracted using the referrer-based method and the time-oriented heuristic. The User Session Set is given as

$USS = \{USID, (URI_1, ReferrerURI_1, Date_1), \ldots, (URI_k, ReferrerURI_k, Date_k)\}$

where $1 \le k \le n$ and $n$ denotes the number of records in WULS. Every record in WULS must belong to a session, and every record in WULS can belong to one user session only. After the records are grouped into sessions, the path completion step follows.

3.4. Computing the Reference Length

Reference length is the time taken by the user to view a particular page. It plays an important role in the following procedures. Generally it is calculated as the difference between the access time of a record and that of the next record.
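The rules of the last three subsections can be tied together in a small sketch: users are keyed by IP address and user agent, a new session starts on an empty or unseen referrer or when the 30-minute limit is exceeded, and the reference length is taken as the plain difference between consecutive access times. The record layout, the function names, and the reading of the 30-minute limit as total session time are assumptions made for illustration; referrers are assumed to be normalised to the same form as URIs.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_LIMIT = timedelta(minutes=30)  # time limit stated in the text


def identify_users(records):
    """Group records by (IP address, user agent), in time order."""
    users = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["time"]):
        users[(rec["ip"], rec["agent"])].append(rec)
    return dict(users)


def split_sessions(user_records, limit=SESSION_LIMIT):
    """Split one user's time-ordered records into sessions."""
    sessions = []
    for rec in user_records:
        current = sessions[-1] if sessions else None
        referrer = rec["referrer"]
        new_session = (
            current is None
            or referrer in ("", "-", None)                    # empty referrer
            or referrer not in {r["uri"] for r in current}    # referrer not seen in session
            or rec["time"] - current[0]["time"] > limit       # assumed: total session time
        )
        if new_session:
            sessions.append([rec])
        else:
            current.append(rec)
    return sessions


def naive_reference_lengths(session):
    """Seconds between each record and the next; the last page has no successor."""
    return [(b["time"] - a["time"]).total_seconds()
            for a, b in zip(session, session[1:])]


def _rec(ip, agent, seconds, uri, referrer=""):
    """Build a hypothetical record; the start date is arbitrary."""
    start = datetime(2012, 1, 10, 9, 0, 0)
    return {"ip": ip, "agent": agent, "time": start + timedelta(seconds=seconds),
            "uri": uri, "referrer": referrer}


log = [
    _rec("10.0.0.1", "Firefox", 0, "/index.html"),
    _rec("10.0.0.1", "Chrome", 10, "/index.html"),
    _rec("10.0.0.1", "Firefox", 60, "/a.html", "/index.html"),
    _rec("10.0.0.1", "Firefox", 180, "/b.html", "/a.html"),
    _rec("10.0.0.1", "Firefox", 3600, "/c.html", "/b.html"),
]
users = identify_users(log)
firefox_sessions = split_sessions(users[("10.0.0.1", "Firefox")])
first_lengths = naive_reference_lengths(firefox_sessions[0])
```

Here the single IP address yields two users because the user agents differ, and the Firefox user's last request falls into a second session because it arrives an hour after the session began.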
This simple difference is not accurate, however, because it includes the time needed to transfer the data over the internet, the time needed to launch audio or video files on the web page, and so on. The user's real browsing time is very difficult to analyze. The data transfer rate and the size of the page are therefore also taken into account, and the reference length is calculated as

$RL_{time} = RLT - \dfrac{bytes\_sent}{c}$

where $RLT$ is the difference in access time between a record and the next one, $bytes\_sent$ is taken from the log entry of the record, and $c$ is the data transfer rate.

3.5. Path Completion
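As a rough illustration of the referrer-based rule used in this step (the referrer of a record is inserted when it differs from the URL of the previous record, and consecutive identical pages are merged), a minimal sketch follows. The function name `complete_path` and the `(uri, referrer)` tuple layout are illustrative assumptions rather than the authors' implementation, and the adjustment of reference lengths for the inserted pages is left out.

```python
def complete_path(session):
    """Return the page path of a session with missing referrer pages inserted.

    session: list of (uri, referrer) tuples in access order; an empty
    referrer means that no referrer was logged.
    """
    path = []
    for uri, referrer in session:
        # The referrer differs from the previous page: the user most likely
        # returned to it through the cache or the Back button, so insert it.
        if path and referrer and referrer != path[-1]:
            path.append(referrer)
        # Path combination: two identical consecutive pages are merged.
        if not path or uri != path[-1]:
            path.append(uri)
    return path


# Hypothetical session: the user views A, B, C and then requests D from A.
completed = complete_path([("A", ""), ("B", "A"), ("C", "B"), ("D", "A")])
```

Under this rule only the referrer page itself is inserted, so the completed path for the hypothetical session is A, B, C, A, D.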

The path completion step is carried out to identify pages that are missing because of caching and use of the Back button. The Path Set is the incomplete sequence of accessed pages in a user session. It is extracted from every user session set.

Path Combination and Completion: The Path Set (PS) is the access path of every USID identified from USS. It is defined as

$PS = \{USID, (URI_1, Date_1, RLength_1), \ldots, (URI_k, Date_k, RLength_k)\}$

where $RLength$ is computed for every record in the data cleaning stage. After the path for each USID is identified, path combination is done if two consecutive pages are the same. Within a user session, if the URL in the referrer URL field of a record is not equal to the URL of the previous record, the referrer URL of the current record is inserted into the session, and in this way the path is completed. The next step is to determine the reference length of the pages appended during path completion and to modify the reference length of the adjacent ones. Since the inserted pages are normally considered auxiliary pages, their length is set to the average reference length of auxiliary pages. The reference length of the adjacent pages is also adjusted.

4. Experimental Results

The experiments on the proposed technique use the log obtained from a reputed college web site over about 10 days in January. The log consists of 2000 records. The data cleaning process is then carried out. Initially, after removing records of graphics and video formats such as GIF and JPEG, 1520 records are obtained. Then, after checking the status code, a total of 450 records remains. Finally, 390 records remain after applying the robot cleaning process. In the proposed method, the records generated by robots and agents are also cleaned by applying the access time limit of 2 seconds. Three sample sets of records are considered in the experiments. Figure 1 shows the results after the cleaning stage for the 3 samples. In sample 1, a total of 2000 records is obtained initially.
After removing gif requests and records with irrelevant status codes, 860 records remain. Finally, 450 records are obtained after robots cleaning.

Fig 1: Different Data Cleaning Techniques applied in samples (samples 1 to 3; series: Initial Log, Data Cleaning without robots removed, Robots Clean)

In sample 2, the initial log has 950 records, 480 records remain after gif and status removal, and 320 records are obtained after the robots cleaning process. In sample 3, the initial log has 600 records, 370 records remain after gif and status removal, and 250 records are obtained after the robots cleaning process. Because the irrelevant records are discarded, the patterns of user interest can be determined more accurately and in less time.

For sample 1, the time required for path completion using the initial log is 119 seconds, whereas it is 77 seconds after cleaning gif requests and irrelevant status codes, and only 52 seconds after cleaning robot navigation. For sample 2, only 30 seconds is required when robots cleaning is included, and more time is required when it is not. For sample 3, 106 seconds and 81 seconds are required using the original log and the log after gif and status removal, whereas only 56 seconds is required using the log after robots cleaning.

After data cleaning, 14 users are identified according to IP addresses, browsers and operating systems. Furthermore, by using the referrer-based and the time-oriented heuristic methods, 60 user sessions are distinguished in this experiment. The path completion technique is then applied to determine the path accessed by the user. The path completed for a user by using the original log is given in table 1.

Table 1: Path Completed using Original Log (columns: IP Address, User Session, Path Completed)

Table 2: Path Completed for a User by using log after Cleaning but without Robots Cleaning (columns: IP Address, User Session, Path Completed)

Table 3: Path Completed for a User by using log after Robots Cleaning (columns: IP Address, User Session, Path Completed)

Table 2 shows the path completed for a user by using the log after cleaning but without robots cleaning. It can be observed from table 2 that the irrelevant pages found in table 1 are eliminated. Finally, table 3 provides the path completed for a user by using the log after robots cleaning. From table 3 it can be observed that only the web pages most relevant to the user's interest are obtained, whereas in table 1 and table 2 some irrelevant web pages are considered when predicting the patterns of user interest.

5. Conclusion

Log files are the best source for learning about user behavior. The results of mining can be used to improve website design and increase user satisfaction, which helps in various applications. The quality of a website can be evaluated by analyzing user accesses to the website through web usage mining. A data preprocessing system for web usage mining has been analyzed and implemented for log data, to reduce the time taken by the mining process and to obtain accurate results. It goes through the steps of data cleaning, user identification, session identification, path completion and transaction identification. The data cleaning phase includes the removal of records of graphics, videos and format information, the removal of records with a failed HTTP status code, and finally robots cleaning. Unlike other implementations, records are cleaned effectively by removing robot entries. This preprocessing step gives a reliable input for data mining tasks. The data cleaning phase implemented in this paper helps to retain only the relevant log records that reflect the user's interest, and it enhances the mining process in the next stage.

References

[1]. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, ACM SIGKDD, 2000.
[2]. Bamshad Mobasher, Data Mining for Web Personalization, LNCS, Springer-Verlag Berlin Heidelberg.
[3]. Pierrakos, D., Web usage mining as a tool for personalization: a survey, User Modeling and User-Adapted Interaction, 13(4).
[4]. Peter I. Hofgesang, Methodology for Preprocessing and Evaluating the Time Spent on Web Pages, Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2006.
[5]. Robert Cooley, Bamshad Mobasher and Jaideep Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, Journal of Knowledge and Information Systems, 1999.
[6]. Chungsheng Zhang and Liyan Zhuang, New Path Filling Method on Data Preprocessing in Web Mining, Computer and Information Science Journal, August.
[7]. Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi and Vishal Shah, Knowledge discovery from users Web page navigation, Workshop on Research Issues in Data Engineering, Birmingham, England, 1997.
[8]. Suresh R.M. and Padmajavalli R., An Overview of Data Preprocessing in Data and Web usage Mining, IEEE.
[9]. Robert Cooley, Bamshad Mobasher and Jaideep Srivastava, Web mining: information and Pattern Discovery on the World Wide Web, International Conference on Tools with Artificial Intelligence, Newport Beach, IEEE, 1997.
[10]. Istvan K. Nagy and Csaba Gaspar-Papanek, User Behaviour Analysis Based on Time Spent on Web Pages, Web Mining Applications in E-commerce and E-Services, Studies in Computational Intelligence, 2009, Volume 172/2009, Springer.
[11]. Catledge, L. and Pitkow, J., Characterising Browsing Behaviours in the World Wide Web, Computer Networks and ISDN Systems, 1995.
[12]. Spiliopoulou, M., Mobasher, B. and Berendt, B., A framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis, INFORMS Journal on Computing, Spring 2003.
[13]. Yan Li, Boqin Feng and Qinjiao Mao, Research on Path Completion Technique in Web Usage Mining, International Symposium on Computer Science and Computational Technology, IEEE, 2008.
[14]. Yan Li and Boqin Feng, The Construction of Transactions for Web Usage Mining, International Conference on Computational Intelligence and Natural Computing, IEEE, 2009.

[15]. Tanasa, D. and Trousse, B., Advanced Data Preprocessing for Intersites Web Usage Mining, IEEE Intelligent Systems, 19(2), pp. 59-65, 2004.
[16]. Tan, P. N. and Kumar, V., Discovery of Web Robot Sessions Based on their Navigational Patterns, Data Mining and Knowledge Discovery, 6(1), 2002.

AUTHORS PROFILE

Mrs. V. Chitraa is a doctoral student in Manonmaniam Sundaranar University, Tirunelveli, Tamilnadu. She is working as an Associate Professor in CMS College of Science and Commerce, Coimbatore. Her research interests lie in Database Concepts, Web Mining and Clustering. She has published papers in reputed international journals and presented papers at conferences. She is an IEEE student member.

Dr. Antony Selvadoss Davamani is working as a Reader in the Department of Computer Science in NGM College, with teaching experience of about 23 years. His research interests include Knowledge Management, Web Mining, Networks, Mobile Computing and Telecommunication. He has guided more than 41 M.Phil. scholars, is guiding many Ph.D. scholars, and has presented more than 30 papers. He has attended more than 17 workshops and seminars and has published many books.
