The influence of caching on web usage mining


J. Huysmans 1, B. Baesens 1,2 & J. Vanthienen 1
1 Department of Applied Economic Sciences, K.U.Leuven, Belgium
2 School of Management, University of Southampton, UK

Abstract

Most web servers collect large amounts of data during their daily operation. Information such as which pages are requested and who is responsible for these requests is stored in log files. The analysis of these log files may yield worthwhile information on how to adapt the site to improve the user experience. However, the data in the log files is usually not stored in a format suited to analysis. Many operations are needed to transform the logs into a format that is convenient for the chosen type of analysis. After an overview of these operations, we discuss how caching of pages can skew the results of studies. We show how caching can be detected and how one can deal with it. Afterwards, the techniques are applied to the data of a European online wine shop.

Keywords: web usage mining, data pre-processing, data cleaning, caching, robot detection.

1 Introduction

More and more organizations depend on the web for the sale and marketing of their products, for informing customers, for contacting suppliers, and so on. In consequence, to measure the effectiveness of an advertising campaign or to make forecasts on various business variables, they can no longer rely only on traditional sources to acquire the required information. Some of the newer data sources that contain worthwhile information are the log files that are generated by web servers. Every request made to the server is stored in these log files. An analysis of these requests can provide information on how to adapt the site. This will improve the user experience and, in consequence, the profitability of the site is likely to increase.

Data Mining V

In the next section we focus on web usage mining, which is the process of analysing log files. We first specify the data that is available in the log files. Afterwards, some of the more important steps of the mining process are covered in greater detail. In the third section, we apply these operations to the data of a European online wine shop. During the analysis of the logs of this company some strange observations were made. Possible explanations for these observations are given and tested for correctness. It will be shown that caching is responsible for the strange phenomenon. Afterwards, some solutions for the problem are discussed. The final section concludes the paper with some suggestions for future research.

2 Web usage mining

Web usage mining is defined as the application of data mining techniques to discover usage patterns from web data (Srivastava et al. [1]). It usually consists of three main phases: preprocessing, pattern discovery and pattern analysis. In this paper the focus is on the first phase, preprocessing. During the preprocessing phase, all the operations needed to transform the data into a form suited to the chosen type of analysis are performed. The different steps of this phase are shown in Figure 1.

Figure 1: Different steps of the preprocessing phase (Cooley et al. [2]).

The input for web usage mining is a raw log file in which every request made to the server is stored. A typical line in the logs looks as follows:

[27/Jun/2002:00:01: ] GET /shop/detail.html HTTP/1.1 200 38890 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SKY11a)

The first part of this line specifies the IP-address of the client that made the request to the server. This IP-address can be used to identify the different visitors of the site. The second component gives information on when the request was made. More precisely, it indicates the time at which the server completed the request.

The third part of the line, GET /shop/detail.html HTTP/1.1, is called the request line and consists of three parts. The GET part specifies the method used to request the page. If the surfer requests a normal web page, the method used will always be GET. Other possibilities are POST, to send the values of a form to the server, and HEAD. The second part of the request line indicates which file was requested. The final part of the request line specifies the protocol used to request the file. This protocol is normally HTTP/1.0 or HTTP/1.1.

The two numbers following the request line, 200 and 38890, are respectively a status code and the size of the returned file. The status code 200 indicates that the request was successfully completed. Other status codes indicate various types of errors, of which error 404 (Page not found) is probably the best known. The next part of the line designates the referrer. This is the page that refers to the requested page. Thus, from this line it can be concluded that on June 27th, 2002 there was a link from the referring page to the requested page. Finally, the last component of the line specifies the agent. A browser fills in this field to identify itself. As a visitor normally uses only one browser during a session on the web, we can use this field in combination with the IP-address to identify the different visitors.

2.1 Data cleaning

Raw log files contain many lines that are irrelevant for web usage mining. These lines must be deleted before applying the mining techniques. The guiding principle is that every action of the visitor, usually a mouse click, should correspond to exactly one line in the log files. When a visitor requests a page, this request will be logged, but it is not the only request that will appear in the log files. The HTML code of the page indicates which pictures the browser should show. When the browser analyses the HTML code, it sends requests for these pictures. So if there are four pictures on the requested page, five lines will appear in the log files: one for the HTML page and one for every picture. These requests for pictures must be deleted from the log files because the user did not explicitly ask for them.

Cleaning the log files of pictures and photos is quite easy. It suffices to examine the extensions of the requested files and to delete the requests whose extensions correspond to pictures, such as .jpg and .gif. For a similar reason, requests for directories and stylesheets are deleted. There are many other operations to perform during the data cleaning phase. We will discuss these in the section on the practical analysis.

2.2 User identification

After the cleaning phase, a new problem arises: how can we determine which requests are made by the same visitor? Many methods are available, but most require additional information that is normally not stored in the log files. First, some of the problems that arise are discussed. Then, we focus on a number of solutions.
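Before turning to these problems, the extension-based cleaning rule of Section 2.1 can be made concrete. The following is a minimal sketch, not the procedure actually used in the study: the function name `is_page_request` and the exact extension list are assumptions (the text names only .jpg and .gif, plus directories and stylesheets), and a real site would need its own list.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: illustrative extension list. The paper only names .jpg and .gif
# and states that directories and stylesheets are removed as well.
IRRELEVANT_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".css"}


def is_page_request(requested_file: str) -> bool:
    """Return True if the request should be kept for usage mining."""
    path = urlsplit(requested_file).path  # drop any query string
    if path == "" or path.endswith("/"):
        return False  # request for a directory
    extension = posixpath.splitext(path)[1].lower()
    return extension not in IRRELEVANT_EXTENSIONS


# Hypothetical requested files, as they would appear in the request line.
requested = [
    "/shop/detail.html",
    "/img/logo.gif",
    "/shop/",
    "/style/main.css",
    "/img/bottle.JPG?x=1",
]
kept = [r for r in requested if is_page_request(r)]
print(kept)
```

Matching on the lower-cased extension of the path, rather than on the raw string, keeps requests such as `/img/bottle.JPG?x=1` from slipping through because of letter case or a query string.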
Every line in the logs stores the IP-address of the client that was responsible for that request to the server. In an ideal world, every surfer would have his own address and we could use this IP-address to identify the different users. In practice, there is not always a one-to-one correspondence between visitor and IP-address. Sometimes multiple people share the same IP-address, or one visitor sends requests from different IP-addresses. Cooley et al. [2] propose the following heuristic. Based on the assumption that requests with different values in the agent field indicate different users, they assume that all requests with the same IP-address and agent information originate from the same visitor. It is quite unusual for one visitor to send requests with multiple browsers, so this assumption is plausible.

This heuristic does not always give the correct result. For example, if multiple users share the same IP-address and use the same browser, the heuristic will indicate only one user. Also, IP-addresses can change over time. A user who visits the site today can have a different address on a subsequent visit, so we cannot use this heuristic to track repeated visits. The advantage of this heuristic is that no additional information is needed. It suffices to have the log files at your disposal.

Other methods require additional information and sometimes need the active collaboration of the visitor. We briefly review some of these methods. For more information one should consult Cooley [3].

A first method to identify users is to ask them explicitly to log in. Web-based e-mail clients, such as Yahoo and Hotmail, use this technique. However, this method can only be used for a certain category of sites. An e-shop clearly cannot use this method, because it would frighten off the customers who only came for information. This is analogous to the real world: in a bank it is normal that one must show some kind of identification, whereas in a clothing shop this would be a strange experience.

Cookies are another frequently used method to identify visitors. A cookie is a small file which is placed by the server on the client machine during the first request. On subsequent requests the server can read and modify the contents of the cookie. In these cookies a unique identifier is stored which can be used to recognize the visitor, even over repeated visits.
The disadvantage of this method is that not everybody accepts cookies, although tests have shown that those who refuse them are a small minority (CIM [4]).

2.3 Session identification

Most users visit a site several times. In the preceding step we determined which requests were caused by a certain user. The goal of this step, session identification, is to divide the requests of one user into several visits or sessions. Because we usually have access to the log files of only one site, it is not trivial to find out when a visitor has left this site. The most widely used method to divide requests into sessions is based on a time-out. If there is a sufficiently large amount of time between two subsequent requests of the same user, a new session is started. In many commercial software packages, 30 minutes of inactivity suffices to start a new session. This 30-minute time-out is based on research by Catledge and Pitkow [5], who found the optimal time-out to be 25.5 minutes, which led to the standard of 30 minutes.

In Figure 1 two additional steps are shown: path completion and transaction identification. These steps are not discussed in this paper. For a detailed discussion one should consult Huysmans et al. [6] or Cooley [3].
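The IP-address and agent heuristic of Section 2.2 and the time-out rule of Section 2.3 can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the `Request` record, the function `identify_sessions`, and the sample addresses and agent strings are all hypothetical, and times are given in seconds from an arbitrary origin.

```python
from collections import defaultdict
from typing import NamedTuple


class Request(NamedTuple):  # hypothetical record for one cleaned log line
    ip: str
    agent: str
    time: float  # seconds since an arbitrary origin
    page: str


TIMEOUT_SECONDS = 30 * 60  # the 30-minute time-out discussed in the text


def identify_sessions(requests, timeout=TIMEOUT_SECONDS):
    """Group requests per (IP-address, agent) user and split them on the time-out."""
    by_user = defaultdict(list)
    for request in requests:
        by_user[(request.ip, request.agent)].append(request)

    sessions = []
    for user in sorted(by_user):
        current = []
        for request in sorted(by_user[user], key=lambda r: r.time):
            # A gap of more than `timeout` seconds closes the running session.
            if current and request.time - current[-1].time > timeout:
                sessions.append((user, current))
                current = []
            current.append(request)
        sessions.append((user, current))
    return sessions


# Hypothetical requests: one address, two different agent strings.
log = [
    Request("10.0.0.1", "MSIE", 0, "/a.html"),
    Request("10.0.0.1", "MSIE", 600, "/b.html"),
    Request("10.0.0.1", "Opera", 700, "/a.html"),
    Request("10.0.0.1", "MSIE", 2401, "/c.html"),
]
for user, session in identify_sessions(log):
    print(user, [r.page for r in session])
```

In this example the two agent strings yield two users, and the gap of 1801 seconds between the second and third MSIE request splits that user's requests into two sessions, giving three sessions in total.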

3 Practical study

For the practical study, we used one year of log data from a European online wine shop. The logs contained requests. From these logs all the requests for images, directories and stylesheets were deleted. This resulted in the removal of 87% of the requests. Afterwards, we used the heuristic proposed by Cooley et al. [2] to identify the different users. All requests that originate from the same IP-address and have the same information in the agent field are considered to be requests from the same user. Finally, the requests from the same user were divided into different sessions whenever more than thirty minutes elapsed between two successive requests.

Figure 2 shows a histogram of how often a certain time interval occurred between two successive requests from the same user. In Cooley et al. [7], it is mentioned that the shape of this histogram is usually close to an exponential distribution. This shape can also be seen in the histogram of Figure 2, but what really draws the attention are the recurring peaks. These peaks occur at regular intervals, namely every sixty seconds. In the literature, we did not find an explanation for this phenomenon. In the rest of this section some possible explanations for the phenomenon are discussed.

Figure 2: Histogram of seconds between successive requests.

A first possible explanation is the use of refresh meta tags on the examined site. Such a tag indicates that the browser should request a new page after a fixed number of seconds. This tag is frequently used when a page has moved. For a few seconds a message is shown that indicates that the page has moved and that the bookmarks should be updated. Afterwards, the visitor is automatically transferred to the new address. The appearance of this tag on the site could cause one or a few peaks but probably not all of them.
On the examined site, we did not find a single occurrence of the refresh meta tag. Therefore other explanations were investigated.
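The interval histogram discussed above, and a simple check for peaks at multiples of sixty seconds, can be sketched as follows. This is only an illustration: the peak criterion (a count more than `factor` times the mean of its two neighbouring one-second bins) and all names and sample data are assumptions, since the paper identifies the peaks by visual inspection of the histogram.

```python
from collections import Counter


def interval_histogram(times_by_user):
    """Count the gaps, in whole seconds, between successive requests of each user."""
    histogram = Counter()
    for times in times_by_user.values():
        ordered = sorted(times)
        for earlier, later in zip(ordered, ordered[1:]):
            histogram[int(later - earlier)] += 1
    return histogram


def peaks_at_multiples(histogram, period=60, factor=3.0):
    """Return the multiples of `period` that stand out from their neighbours.

    Assumption: a bin is called a peak when its count exceeds `factor` times
    the mean count of the two adjacent one-second bins (at least 1).
    """
    peaks = []
    for gap in sorted(histogram):
        if gap == 0 or gap % period != 0:
            continue
        neighbours = (histogram.get(gap - 1, 0) + histogram.get(gap + 1, 0)) / 2
        if histogram[gap] > factor * max(neighbours, 1):
            peaks.append(gap)
    return peaks


# Hypothetical request times (seconds) per identified user.
times_by_user = {
    "u1": [0, 60, 120, 180, 240],
    "u2": [0, 7, 66, 70],
    "u3": [5, 65, 66],
}
histogram = interval_histogram(times_by_user)
print(peaks_at_multiples(histogram))
```

Comparing each multiple of sixty with its immediate neighbours, rather than with a global average, keeps the overall exponential decay of the histogram from masking or faking a peak.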

Typical characteristics of the browser software might cause some of the peaks, in particular the larger peak at 1800 seconds (not shown in the histogram). Some browsers, such as Opera, have built-in timers to automatically refresh the requested pages. In Opera this timer is set to 30 minutes by default, so it might cause the larger peak at 1800 seconds. It is very difficult (probably even impossible) to detect the requests that are the result of this auto-refreshing.

Another possible cause of these peaks is robots. A robot, also called a spider or web crawler, is a program that automatically traverses the web. It uses the links on already visited pages to determine which page it should visit next. Many robots are used by search engines. They traverse the web and place visited pages in a database. When a query is performed, the search engine consults this database to construct the result. Because robots are computer programs, they can request many pages in a few seconds. A robot that traverses a site very fast, called rapid-fire, might cause the server from which the pages are requested to react very slowly to requests from human users. To prevent this from happening, most robot designers make sure that their robot leaves a certain amount of time between two requests to the same server. These time intervals might be the reason for the peaks in Figure 2.

Figure 3: Probability that a request is generated by a robot, per time interval.

To test this hypothesis, we need to remove the requests from robots and see whether the peaks still occur. There are many different ways to recognize whether a request is made by a robot or by a human user (Tan [8]). One possible way is to look at the first page that a user requests. For robots this will usually be the page robots.txt, which contains information about which pages should not be visited by robots.
Robots that follow this convention will always request this page before requesting any other page on the server. The agent field in the log files will also be different when a request is made by a robot. For human visitors this field contains information on their browser, such as IE and Mozilla for Internet Explorer and Netscape. For robots this field usually contains information about themselves, such as the name of the robot and the site of its creators. If we have a list of robot names, we can simply compare the agent information with this list to identify the robots. Kohavi and Parekh [9] mention some other heuristics to identify robots. One of them is to use a zero-width link that can only be seen by robots.
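The robots.txt heuristic described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: requests are represented as hypothetical (IP-address, agent, page) tuples, the function names are invented for illustration, and the sample agent strings and addresses are not taken from the examined logs.

```python
def find_robots(requests):
    """Return the (IP-address, agent) pairs that requested robots.txt."""
    return {
        (ip, agent)
        for ip, agent, page in requests
        if page.lower().endswith("/robots.txt")
    }


def remove_robot_requests(requests, robots):
    """Drop every request whose (IP-address, agent) pair was flagged as a robot."""
    return [r for r in requests if (r[0], r[1]) not in robots]


# Hypothetical cleaned log: (ip, agent, requested page).
requests = [
    ("192.0.2.10", "ExampleBot/1.0", "/robots.txt"),
    ("192.0.2.10", "ExampleBot/1.0", "/shop/detail.html"),
    ("192.0.2.20", "Mozilla/4.0 (compatible; MSIE 6.0)", "/shop/detail.html"),
    ("192.0.2.10", "Mozilla/4.0 (compatible; MSIE 6.0)", "/index.html"),
]
robots = find_robots(requests)
human_requests = remove_robot_requests(requests, robots)
print(robots)
print(human_requests)
```

Flagging the pair of IP-address and agent, rather than the address alone, keeps a human visitor who happens to share an address with a robot in the data. Robots that never request robots.txt are missed by this rule, which is one reason a list of known robot names is a useful complement.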

For our analysis, we used the first method. If a request is made for the page robots.txt, the corresponding agent field and IP-address are stored in a separate file. This file is then used to identify requests from robots. We detected over 220 different robots by this method. They were responsible for requests (8.4% of all requests for pages). The removal of robot requests reduces some of the peaks, but most of the peaks remain as large as before.

Figure 3 shows a second histogram, which indicates the probability that a request is caused by a robot for a given time interval. For example, from the graph we can see that whenever there are 60 seconds between two successive requests, there is a probability of approximately 6% that the requests were generated by a robot. Contrary to our expectations, there are no peaks at the multiples of 60 seconds but valleys. Valleys indicate that a smaller portion of the requests is created by robots. Either not enough robots were identified or something else causes the peaks in Figure 2. As most of the peaks were not influenced by the robot removal, we assumed that something else caused the pattern.

Therefore, the final thing we investigated was caching. Caching is a mechanism by which frequently requested pages are duplicated on a nearby server to improve the speed of surfing. Regularly, usually after a fixed time interval, the nearby server checks whether the duplicates are still identical to the original pages. If changes were made to the original pages, the duplicates are updated. We found that caching was indeed responsible for the pattern seen in Figure 2. The wine site we examined had one large commercial partner. On the partner site a number of articles were duplicated directly from the wine site. Approximately every 30 minutes the partner site checked with the wine site whether its copy was still the most recent version of the article.

Figure 4: Multiple IP-addresses perform requests.

If the partner site had used only one IP-address to perform these checks, it would have been quite easy to detect this type of caching, as the regular checking would have resulted in one very long session. Unfortunately, the partner site used a whole pool of IP-addresses to perform these checks (Figure 4). Therefore it seems as if many different visitors create the pattern. An example is given in Table 1. The time interval between two successive requests of the same page is fixed at 30 minutes; the IP-address that makes a request is chosen randomly from the pool of available IP-addresses. If we use the IP-address to identify the different users and a time-out of thirty minutes to divide the requests into sessions, the following results are obtained. User 1 (x.y.z.001) creates two sessions: A-B and B-C. The time between the two successive requests is 10 and 5 minutes respectively. User 2 (x.y.z.002) and user 3 (x.y.z.003) are each responsible for one session, namely C-B-A, with intervals of 25 and 20 minutes, and A-C, with an interval of 15 minutes.

Table 1: Example.

IP-address    Time (minutes)    Page requested
x.y.z                           A
x.y.z                           B
x.y.z                           C
x.y.z                           A
x.y.z                           B
x.y.z                           C
x.y.z                           A
x.y.z                           B
x.y.z                           C

Figure 5: Histogram of seconds between successive requests after removal of cached page checks.

It is clear that the discovered sessions are of no use for the analysis of the behaviour of the visitors. Therefore the corresponding requests should be removed from the log files, but as the example shows it is quite difficult to identify them. The sessions are variable in size, they are created by different IP-addresses, and the time interval between successive requests is variable (although mostly close to a multiple of 60 seconds, which is the reason why the pattern appears in Figure 2).
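The example of Table 1 can be replayed in code. The time values below are not taken from the table: they are one assignment that is consistent with the sessions and intervals stated in the text, with the first request for page A placed at minute 0 as an arbitrary origin. The last octet of each address follows the user numbering in the text, and the function names are hypothetical.

```python
from collections import defaultdict

TIMEOUT_MINUTES = 30

# Assumed times (minutes): each page is requested every 30 minutes, and the
# requesting address varies. Consistent with the text, not copied from Table 1.
requests = [
    ("x.y.z.001", 0, "A"),
    ("x.y.z.001", 10, "B"),
    ("x.y.z.002", 15, "C"),
    ("x.y.z.003", 30, "A"),
    ("x.y.z.002", 40, "B"),
    ("x.y.z.003", 45, "C"),
    ("x.y.z.002", 60, "A"),
    ("x.y.z.001", 70, "B"),
    ("x.y.z.001", 75, "C"),
]


def sessions_per_ip(requests, timeout=TIMEOUT_MINUTES):
    """Identify users by IP-address only and split their requests on the time-out."""
    by_ip = defaultdict(list)
    for ip, minute, page in sorted(requests, key=lambda r: r[1]):
        by_ip[ip].append((minute, page))
    result = {}
    for ip, visits in by_ip.items():
        sessions = [[visits[0]]]
        for previous, current in zip(visits, visits[1:]):
            if current[0] - previous[0] > timeout:
                sessions.append([])  # gap longer than the time-out: new session
            sessions[-1].append(current)
        result[ip] = sessions
    return result


def describe(session):
    """Return the page path of a session and the gaps between its requests."""
    path = "-".join(page for _, page in session)
    gaps = [b[0] - a[0] for a, b in zip(session, session[1:])]
    return path, gaps


summary = {ip: [describe(s) for s in sessions]
           for ip, sessions in sessions_per_ip(requests).items()}
for ip in sorted(summary):
    print(ip, summary[ip])
```

Although every page is fetched on a strict 30-minute schedule, none of the intervals inside the resulting sessions equals 30 minutes, which illustrates why the cache checks are hard to recognize once they are spread over a pool of addresses.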

We used a very simple procedure to detect which IP-addresses were causing the pattern. First, we sorted all sessions by the corresponding IP-address. Afterwards, we repeatedly divided the sessions into two parts and created histograms as in Figure 2. As long as only one of the two created histograms still shows the peaks, the division is repeated on that part. After a few iterations, when the peaks start to show up in both created histograms, it becomes clear which IP-addresses are responsible for the pattern. This procedure only works when all the IP-addresses from the pool start with the same numbers, so that they are adjacent after sorting. If not, the procedure will fail.

In our practical study, we found that 10 different IP-addresses contributed to the generation of the pattern. Figure 5 shows the corresponding histogram after the removal of the requests from these addresses. (The removed requests make up 14.5% of the requests that remain after removal of the robots.) The shape of this histogram is very close to an exponential distribution. One peak still remains at 360 seconds, and an increase can also be seen after 1800 seconds. Both peaks are probably caused by the combination of robots and characteristics of browser software.

4 Conclusion

In this paper, some of the problems that arise when performing web usage mining were treated, and the most common solutions to these problems were discussed. In the third section of this paper, we focused on the influence of caching. It was shown that caching was responsible for numerous requests that would have considerably skewed the results of any study. Although it is difficult to identify the requests caused by caching, we have proposed a method to detect and deal with the problem. In the future, experiments on other log files should be performed to examine whether the spiky pattern appears in other log files and to investigate whether the proposed method is generally applicable.

References

[1] Srivastava J., Cooley R., Deshpande M. & Tan P-N., Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Volume 1, Issue 2.
[2] Cooley R., Mobasher B. & Srivastava J., Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, Volume 1.
[3] Cooley R., Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. Ph.D. Thesis, University of Minnesota.
[4] CIM.
[5] Catledge L. & Pitkow J., Characterizing Browsing Strategies in the World-Wide Web. Journal of Computer Networks and ISDN Systems, Volume 27, nr. 6.
[6] Huysmans J., Baesens B. & Vanthienen J., Web Usage Mining: a Practical Study. Submitted to the 12th Conference on Knowledge Acquisition and Management (KAM 2004), 2004.
[7] Cooley R., Mobasher B. & Srivastava J., Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns. In Proceedings of the 1997 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX-97).
[8] Tan P.N. & Kumar V., Discovery of Web Robot Sessions Based on their Navigational Patterns. Data Mining and Knowledge Discovery.
[9] Kohavi R. & Parekh R., Ten supplementary analyses to improve e-commerce web site. In Proceedings of WEBKDD 2003, 2003.
