Enhancing Web Caching Using Web Usage Mining Techniques



Samia Saidi, Yahya Slimani
Department of Computer Science, Faculty of Sciences of Tunis, University of Sciences of Tunis

Abstract. Performance and other service quality attributes are crucial to user satisfaction with web services. Web Mining provides the key to understanding web traffic behavior, which in turn explains the increasing interest in this domain and the large number of its possible applications. In this paper, we apply Web Usage Mining techniques to propose an intelligent caching solution with the goal of improving the quality of service of web sites. We found that augmenting caching with a prefetching engine that predicts the page components users will request in the near future can enhance web site performance. This is done by analyzing the navigation history of a web site, as recorded in its log files, and by determining the set of components likely to be requested in the future using frequent closed itemsets.

Keywords: Web caching, Web Usage Mining, Web log files, Web page components, Frequent closed itemsets.

1 Introduction

Web Usage Mining (WUM) techniques have been applied in many fields, such as web personalization and e-commerce [15]. In this work, we focus on applying WUM to enhance web caching performance. The main objective of the intelligent web caching method we propose is to select, or prefetch, the web page components to cache based on user profiles. Previous works in this field [3, 13, 16] focused mainly on analyzing raw log files to select which web page to replace in the cache. Our objective is to propose a framework that handles new web requirements, chiefly the prefetching of page components rather than whole pages.
Recall that, to reduce the overhead of generating dynamic data on web sites, it is useful to generate the data corresponding to a dynamic object once, store the object in a cache, and subsequently serve requests for the object from the cache instead of invoking the server again. In fact, caching a whole page can be of limited utility, especially in the case of personalized

pages, where each client would need a different version of the same page. Moreover, different web page components can have different update frequencies, so when an entire page is cached, a single update forces the recomputation of the whole page even if only some parts of it have changed. We therefore choose to cache components of pages rather than entire pages, since this allows the candidates for caching to be specified precisely and can thus improve web performance. Hence, choosing a suitable page fragmentation technique is crucial. On the other hand, given the set of web page components, selecting which components to cache is extremely important; this is the problem of prefetching, or of selecting the components to cache.

In fact, a user's browsing sequence follows the hyperlinks between Web objects. That is, if object A has a hyperlink to object B, the probability that B will be accessed, given that A has already been accessed, increases significantly. Hence, if we prefetch the objects which are very likely to be referenced by the client's subsequent requests, part of the network latency can be hidden within the time between the client's consecutive requests [5]. In this paper, we propose a method for selecting web page components based on user preferences, obtained by applying WUM techniques.

Given the high traffic on the Internet, caching documents is an important technique for reducing latency, bandwidth consumption, and server load. Combining a Web cache with a suitable prefetching policy can enhance performance. Nevertheless, good performance requires a prefetching policy targeted to the characteristics of the Web applications and of the client. Our prefetching method makes it possible to prefetch objects that could be used in the near future, using the navigation history of the users.

The rest of this paper is organized as follows.
Section 2 reviews related work on caching and fragmentation of web pages. Section 3 presents the problem of selecting web page components. Section 4 describes our proposed approach based on WUM techniques. Section 5 presents some experimental results. Finally, Section 6 concludes the paper and highlights some future work.

2 Related Works

To enhance the QoS of web servers, it is more efficient to concentrate on caching components of pages rather than whole pages. Some authors, like Labrinidis et al. [9-12], deal with a special form of caching (materialization) of webviews, which are database query results (i.e. database views) with HTML formatting commands or XML semantic tags. These views are organized so as to answer various types of web queries rapidly. The authors in [12] deal with materialized, non-materialized and virtual webviews. They proposed a solution based on collecting multiple statistics and on estimating two metrics, namely Quality of Service (QoS) and Quality of Data (QoD). The proposed system invests in materialization when the global QoD of the solution exceeds a fixed threshold. On the other hand, it invests in dematerialization when the global QoD of the solution is lower than the fixed threshold. This solution is very

hard to implement because of the large number of statistics required, in addition to the number of estimations based on the Recursive Prediction Error Method [8] that it must compute. Moreover, the authors do not consider any size constraint on the suggested cache. In fact, the proposed asynchronous cache stores all the given webviews under one materialization policy. Besides, this solution does not address replacement at all, which incurs significant space overhead. For these reasons, we found first that selection based on a more rigorous prediction method can improve the results of materialization. We then focus on a method based on user preferences, and thus on WUM techniques. Furthermore, the works of Labrinidis et al. store only webviews and do not consider other interesting components. Hence, we found that applying an appropriate fragmentation technique can be beneficial, since it considers all types of page components and not only webviews.

Datta et al. [6] proposed caching page fragments, where page scripts are composed of multiple code blocks. A code block can be reused if it is tagged within the script. When the script is executed, the tags tell the server to check the cache before executing the code block. If the requested fragment is found in the cache, the server bypasses the code of the block. Otherwise, it executes the block and then copies the result into the cache. The authors use a replacement technique called Least Likely to be Used (LLU), which is based on prediction: a page component to be cached replaces the component that is least likely to be used. We therefore found that the fragmentation of web pages is necessary to obtain the candidate set of page components to be cached.

The problem of fragmentation is mainly treated by Challenger et al. [2, ?], who propose a fragment-based method for the design of web sites.
Relations between pages and page fragments are predefined in a fixed graph and are generally specified by the user. The authors have deployed systems using two approaches for creating and modifying an Object Dependency Graph (ODG). Fragmentation methods are used for different purposes. For example, Bouras et al. [1] use fragmentation to eliminate redundant data transfer over the web. The fragmentation algorithm is viewed there as an HTML filter. This filter fragments web pages using tags: it strips all images and treats <table> tags as fragment delimiters, because they are the most popular structuring tags.

Another fragmentation method is proposed by Ramaswamy et al. [14]. Its goal is to detect interesting fragments in dynamic web pages, that is, fragments which exhibit potential benefits and are thus interesting cache units. They define candidate fragments as fragments that are shared by more than a threshold number of fragments and that have different personalization and lifetime characteristics. To that end, they first propose a hierarchy of the dynamic web pages and a particular data structure that helps in the detection of fragments. In fact, they convert web pages to their corresponding Augmented Fragment (AF) deduced from the Document Object Model (DOM) tree, and prune the fragment tree by eliminating the text formatting nodes. The result of the first step

is a specialized DOM tree that contains only the content of structural tags (like <TABLE>, <TR>, <P>). The second step annotates the fragments of the fragment tree obtained in the first step. Second, they propose an efficient algorithm to detect maximal fragments that are shared among multiple documents. Third, they develop a practical algorithm that effectively detects fragments based on their lifetime and personalization characteristics.

Given such fragments, selecting a suitable set of page components is critical to enhancing caching, so the selection must be based on a good prefetching technique. We therefore develop an approach that analyzes the navigation history of web users, using WUM techniques, to predict the set of page components to select for caching.

3 Problem Definition

Selecting the web page components to cache is an important problem for web site managers, because caching a good set of page components enables reuse and thus improves the performance of web sites, especially dynamic ones. As a solution, we propose a new approach that uses the access patterns of users to select a suitable set of web page components to be cached. In fact, these access patterns are important and useful knowledge that can help find the best page components to materialize. We use the term web page component for a part of a web page, which can be dynamic, i.e. generated from database queries (defined as webviews by Labrinidis et al. [12]). The selection of web page components is complementary to caching. It is thus helpful to follow a selection strategy that picks the web page components most often referenced together. We define the problem of selecting page components as follows: using knowledge about users' access patterns mined from web log files, select the web page components to be cached so that they have a high probability of being referenced in the near future.
4 Description of the Proposed Approach

4.1 Architecture

Starting from the typical three-tier architecture of modern web servers, we suggest adding a new component to this architecture (see Figure 1). This component, which we call the prefetching engine, communicates to the web cache module a good set of web page components to be cached. Its inputs are the web log files and the source code of a web site. Its role is to analyze the web log files and structure them into navigation paths. In parallel, it splits the web pages to generate a structured set of pages. These two results are combined into a structured table containing the navigation paths of users together with the components of the visited web pages. This table is then processed to generate the set of web page components to cache, by mining frequent closed itemsets. The generated set represents a suitable set of web page components, namely those most used by users. We mainly focus on the role of the prefetching engine, which selects and stores a set of candidate page components for caching after applying a replacement policy.

Fig. 1. Number of Frequent Closed Itemsets and Items Generated.

4.2 Preparation Step

WUM techniques provide knowledge about user behavior on a web site. This knowledge is expressed through the relation patterns hidden in the log files. Before extracting this knowledge, it is necessary to transform the raw entries of a log file into structured, usable data. Our approach has three phases: (i) the pretreatment of web log files, (ii) the fragmentation of web pages, and (iii) the integration of web page components into structured transactions, each composed of the web page components used by one web user.

Pretreatment of Web Log Files

Recall that a log file is the primary source of data in web usage mining. It is a plain text file in which user queries are recorded in chronological order, representing the fine-grained navigational behavior of visitors. Each hit against the server, corresponding to an HTTP request, generates a single entry in the server access logs. The log format may vary, but it contains fields identifying the time and date of the request, the IP address of the user, the resource requested, the status of the request, the HTTP method used, the user agent (browser and operating system type and version), the referring web resource and, if available, client-side cookies that uniquely identify a repeat visitor. An example of a server access log entry is depicted in Table 1. Some fields in the log entries have been changed to respect privacy. The log entry in Table 1 shows the date and time of the access by a user (IP address omitted) who spent 546 s on the page.
The URI stem is /version fr/phpwebgallery 1.3.4/picture.php, the URI query is cat=most_visited&image_id=7&expand=23,15,18,19 and the port is 80. The user agent is Mozilla/5.0+(compatible;+Googlebot/2.1;++http://; the referrer is not given in this entry. Finally, the entry ends with three status fields: the query status, the substatus and the win32 status.

Table 1. A web server log entry
:00:42 GET /version fr/phpwebgallery 1.3.4/picture.php cat=most_visited&image_id=7&expand=23,15,18, Mozilla/5.0+(compatible;+Googlebot/2.1;++http://

The aims of the preprocessing step in a WUM process are: (i) to convert the raw log file into a set of transactions (one transaction being the list of pages visited by one user); (ii) to eliminate uninteresting or noisy requests (e.g. implicit requests or requests made by Web robots). Some steps were already proposed by authors like Cooley [4] and Tanasa [15]. However, we added some new steps and modified some existing ones to propose a complete pretreatment methodology more suitable for dynamic web caching. The resulting steps are data fusion, data cleaning and data structuration. Tanasa [15] also describes a data summarization step, in which he first transfers the structured file containing visits or episodes (if identified) to a relational database. Afterwards, he applies data generalization at the request level (for URLs) and computes aggregated data for episodes, visits and user sessions to completely fill in the database. For this step, Zaiane et al. [17] proposed loading the structured data extracted from log files into a data cube in order to perform data mining as well as traditional On-Line Analytical Processing (OLAP).

For the first step, data fusion, we join the set of log files into one log file by applying a specific algorithm that inserts the records into the resulting log file in chronological order. For privacy reasons, we remove the host names or IP addresses and replace them with identifiers that keep the information about the domain extension.
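The fusion and anonymization step can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes each input log is already sorted by time and that every line has the hypothetical layout `date time host ...`; the function names `fuse_logs` and `anonymize` and the identifier scheme `u<N>.<extension>` are assumptions introduced here.

```python
import heapq
from datetime import datetime
from itertools import count


def parse_time(line):
    # Assumed layout: "YYYY-MM-DD HH:MM:SS host rest..."
    date, time = line.split()[:2]
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")


def fuse_logs(*logs):
    """Merge several individually time-ordered logs into one time-ordered log."""
    return list(heapq.merge(*logs, key=parse_time))


def anonymize(lines):
    """Replace each host (third field) by an opaque id that keeps the domain extension."""
    ids, next_id, out = {}, count(1), []
    for line in lines:
        fields = line.split()
        host = fields[2]
        if host not in ids:
            last = host.rsplit(".", 1)[-1]
            # Keep the extension only for host names, not for numeric IP addresses.
            ext = "" if last.isdigit() else "." + last
            ids[host] = f"u{next(next_id)}{ext}"
        fields[2] = ids[host]
        out.append(" ".join(fields))
    return out
```

The same host always maps to the same identifier, so later steps can still group requests by user after the original addresses have been dropped.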
Then, in the data cleaning step, we remove all unnecessary requests, mainly implicit requests for objects embedded in the web pages and requests generated by non-human clients of the web site, such as web robots. For this removal, it is necessary to distinguish between implicit and explicit requests for images, since explicit requests represent the real actions of the users. Tanasa [15] stresses, however, that the decision to keep or remove images from web log files depends mainly on the purpose of WUM. In fact, for a web cache application, for example, it is more important to predict requests for these files than requests for other files such as text files, because of the size of images. We decide to remove images because we fragment the web pages and recover the images at that stage. Regarding the removal of Web Robots (WRs): a WR periodically scans a web site, following all the hyperlinks from a Web page. Thus, a WR generates a huge number of requests on a web site, since the number of requests from one WR may be equal to the number of the web site's URIs. To identify the

requests generated by a WR, we use a simple heuristic based on the list of user agents known to be robots. However, the databases containing these lists are not exhaustive, and each day new WRs appear or are renamed. Once the WRs are identified, the requests that they generate can be removed.

The third preprocessing step is data structuration, which groups the unstructured requests of a log file by user, user session, page view, visit, and episode. Thus, at the end of this step, the log file is a set of transactions. A transaction is a user session, a visit or an episode. In our case, we do not need to identify episodes, since we apply WUM techniques for caching and we exclude the ontology factor used for the identification of episodes. For the identification of users, the log file provides only the computer address (name or IP) and the user agent. For web sites requiring user registration, the log file also contains the user login (as the third field of a log entry). In this case, we use this information to identify the user. When the user login is not available, we consider (if necessary) each IP address as a user, although we know that an IP address can be used by several users. For session identification, the difficulties were well described in [4, 15]. If the user login is available, we combine the user login field with the pair (Host, User Agent) to separate the user sessions. We choose this solution because a registered user might use different computers or browsers when exploring the web site, and including the user agent allows us to better distinguish between users behind a common host. For page view identification, the requests are grouped by page view using the following algorithm. When the request for the page view $P_i$ is in the log file, we remove the log entries corresponding to the resources embedded in page $P_i$ and keep only the request for $P_i$.
When the request for $P_i$ is absent (due to the browser or proxy cache), but some entries for its resources are present and have $P_i$ in the referrer field, we replace the entries corresponding to the resources with a request for $P_i$, and we set the time of this request to $t_i = \min \mathrm{time}(l_i)$, where $l_i$ is the log entry corresponding to the resource $r_i$. Cooley [4] handles this case with a path completion algorithm. Then, several heuristics can be used to split the user session into visits [4]. We follow the heuristic stating that a new visit begins each time the gap between two page views exceeds a time threshold. Thus, at this level we get a set of $n$ page views $P = \{P_1, P_2, \ldots, P_n\}$ and a set of $m$ user visits $V = \{v_1, v_2, \ldots, v_m\}$, where each $v_i \in V$ is a subset of the pages $P$. Conceptually, we can view each transaction as a sequence of ordered pairs: $v = \langle (p_1^v, \omega(p_1^v)), (p_2^v, \omega(p_2^v)), \ldots, (p_n^v, \omega(p_n^v)) \rangle$, where $\omega(p_i^v)$ is the weight associated with page $p_i^v$ in the transaction $v$. We choose to represent these weights in a binary manner, to denote the presence or absence of a page in a transaction. This result can be represented by a binary relation $R_1$ defined over the pair $(V, P)$, where $V$ is the set of visits and $P$ is the set of web pages: $(v, p) \in R_1$ if and only if the visit $v \in V$ contains the page $p \in P$.
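The cleaning and structuration steps above can be sketched as follows, under stated assumptions: requests are already parsed into dictionaries with the hypothetical keys `time` (seconds), `host`, `agent` and `page`; the robot marker list and the 30-minute gap are illustrative values, since the paper fixes neither; and users are approximated by the pair (host, user agent).

```python
from collections import defaultdict

ROBOT_MARKERS = ("googlebot", "crawler", "spider")  # hypothetical, non-exhaustive list
VISIT_GAP = 30 * 60  # seconds; assumed threshold, not a value given in the paper


def drop_robots(requests):
    """Remove requests whose user agent matches a known robot marker."""
    return [r for r in requests
            if not any(m in r["agent"].lower() for m in ROBOT_MARKERS)]


def split_visits(requests, gap=VISIT_GAP):
    """Group requests by (host, agent) and start a new visit when the gap is exceeded."""
    by_user = defaultdict(list)
    for r in sorted(requests, key=lambda r: r["time"]):
        by_user[(r["host"], r["agent"])].append(r)
    visits = []
    for reqs in by_user.values():
        current = [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur["time"] - prev["time"] > gap:
                visits.append(current)
                current = []
            current.append(cur)
        visits.append(current)
    return visits


def relation_r1(visits):
    """Binary relation R1 as a set of (visit index, page) pairs."""
    return {(i, r["page"]) for i, v in enumerate(visits) for r in v}
```

Representing $R_1$ as a set of pairs mirrors the binary weights chosen above: a pair is either present or absent, and repeated requests for the same page within a visit collapse into one pair.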

Fragmentizing Web Pages

In parallel with this pretreatment process, we apply a fragmentation algorithm to select interesting candidate page components for caching, since manual markup of web page components in dynamic web pages is both labor-intensive and error-prone. Furthermore, the manual approach to detecting web page components becomes unmanageable and unrealistic when caching components that come from multiple content providers. It is therefore crucial to detect interesting fragments in dynamic web pages automatically. The method proposed in [14] is a good fragmentation method adapted to caching since, as said before, it focuses on candidates for caching and proposes an alternative solution for selecting interesting page components for caching. In our case, a simpler fragmentation method is adopted, since our selection method is based on WUM techniques. We use an algorithm that parses the HTML code of pages to identify all the components of a page (static or dynamic) that are candidates for the pre-selection of components to materialize. The result is then expressed as a second relation $R_2$ between web pages and web page components, defined over the pair $(P, C)$, where $P$ is the set of web pages and $C$ is the set of web page components. We say that $(p, c) \in R_2$ if and only if the web page $p \in P$ contains the component $c \in C$.

Integration of Web Page Components into Visits

Given the two binary relations $R_1$ and $R_2$, we define a new relation $R$ that is the composition of $R_1$ and $R_2$: with $R_1 \subseteq V \times P$ and $R_2 \subseteq P \times C$, for all $(x, y) \in V \times C$, $x R y \iff \exists z \in P : (x R_1 z) \wedge (z R_2 y)$. The generated relation $R$ thus links the visits of the pretreated log files to the page components detected during the parsing of web pages. The algorithm for selecting the suitable set to be cached mines this set by generating frequent closed itemsets under a parameterizable support.
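The composition of the two relations can be illustrated with a short sketch. It assumes both relations are held in memory as sets of pairs; the helper names `compose` and `transactions` and the component names in the usage note are hypothetical.

```python
from collections import defaultdict


def compose(r1, r2):
    """Return R = {(v, c) | there is a page p with (v, p) in R1 and (p, c) in R2}."""
    components_of = defaultdict(set)
    for page, comp in r2:
        components_of[page].add(comp)
    return {(v, c) for v, p in r1 for c in components_of.get(p, ())}


def transactions(r):
    """Group R by visit: one transaction (a set of components) per visit."""
    t = defaultdict(set)
    for v, c in r:
        t[v].add(c)
    return dict(t)
```

For instance, if visit 0 contains `index.php` and `picture.php`, and those pages contain the components `{header, menu}` and `{header, image}` respectively, the transaction of visit 0 is `{header, menu, image}`. These transactions are the input of the itemset mining step.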
4.3 Processing Step

In this formulation, the visits of users of a web site are the objects, and the web page components are the items. Our objective is to generate a set of interesting web page components which are accessed frequently by a set of users. We believe that this set will be a significant subset of the initially defined set of web page components. Generating this subset using frequent itemsets raises two issues: first, the huge number of frequent itemsets generated, and second, how to derive the set of the best components to cache from these frequent itemsets. To generate items from itemsets, we can apply one of two set operations, intersection or union. We do not apply the intersection of itemsets, because when we applied this operation to some simple visits expressed as web page components, we obtained the empty set. We therefore explore the union, because this operation is less complex and because it generates a subset of items (web page components) that are accessed with a fixed frequency over all users. To reduce the number of frequent itemsets, we considered three possibilities: the generation of frequent itemsets,

the generation of closed frequent itemsets and, finally, the generation of maximal itemsets. According to the definition proposed by Gouda et al. [7], a frequent itemset is closed if it has no superset with the same frequency. A frequent itemset is called maximal if it is not a subset of any other frequent itemset. Generating the set of frequent itemsets means generating all possible subsets of items that occur in more than a fixed number of visits. Thus, if we have many long frequent patterns, the number of generated itemsets can be very high, mainly because a frequent pattern of length $l$ implies the presence of $2^l - 2$ additional frequent patterns. Exploring the set of closed itemsets can therefore reduce the initial set of frequent itemsets. However, it is crucial to verify that there is no loss of information, especially after taking the union of the items of the closed itemsets. Recall that the set of closed frequent itemsets is composed of supersets of itemsets under different supports.

We can now formulate the problem of generating closed itemsets as follows. We consider the set of frequent itemsets generated by some frequent itemset mining algorithm. The set of closed frequent itemsets is a subset of it, because we remove all itemsets which are subsets of other itemsets with the same support. Then, we focus only on the set of items generated by the union of the itemsets of this set. We found that the union loses no information, since the union of the subsets of a superset gives the items of this superset. Furthermore, if we consider the set of maximal itemsets, all inclusions between closed itemsets are eliminated. Thus, only supersets are kept from the generated closed frequent itemsets, and the union of these supersets gives the same result as the union of the items of the closed itemsets.
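The argument above, that the union of items is the same whether it is taken over frequent, closed or maximal itemsets, can be checked on small data with a brute-force sketch. This is an illustration only, exponential in the number of items, and not a substitute for a dedicated miner; all function names are assumptions introduced here.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute force: map each frequent itemset to its support count.

    min_support is a fraction of the number of transactions.
    """
    items = sorted(set().union(*transactions))
    threshold = min_support * len(transactions)
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            n = sum(1 for t in transactions if s <= t)
            if n >= threshold:
                freq[s] = n
    return freq


def closed(freq):
    """Frequent itemsets with no proper superset of the same support."""
    return {s for s, n in freq.items()
            if not any(s < t and freq[t] == n for t in freq)}


def maximal(freq):
    """Frequent itemsets that are not a subset of any other frequent itemset."""
    return {s for s in freq if not any(s < t for t in freq)}


def union(itemsets):
    """Set of all items appearing in at least one itemset."""
    return set().union(*itemsets)
```

On the three visits `{a, b, c}`, `{a, b}` and `{a, c}` with a support of 0.5, the sketch finds five frequent itemsets, three closed ones and two maximal ones, and all three unions give the same items, which is consistent with the reasoning in the text.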
Hence, one can choose interchangeably between the three proposed solutions and may directly choose the generation of maximal itemsets. However, after carrying out some experiments, we found that the existing implementations are time consuming. For this reason, we rely on the generation of closed frequent itemsets. We select one implementation to generate the set of closed frequent itemsets with a support equal to 1/3. We then apply to this set an algorithm that computes the union of the items of the itemsets. This union is the result produced by the prefetching engine.

5 Experimental results

To illustrate our proposal, we consider an experimental web site with 30 web pages. Applying the fragmentation algorithm after parsing the web pages, we get 70 page components. After parsing the web log file and applying the associated pretreatment, we generate variable numbers of visits, depending on how users navigate the web site. We then apply the CHARM algorithm [18] to generate the set of closed frequent itemsets. After that, we obtain the set of items (web page components) generated by the union of the frequent closed itemsets. If we analyze the results obtained under different supports, we find that although the number of closed itemsets varies with the support, the number of items generated by the union of the frequent closed itemsets remains roughly constant: it is generally about one third of the total number of web components. The results obtained by CHARM and by the implementation of the union are illustrated in Figure 4.

Fig. 2. Number of Frequent Closed Itemsets and Items Generated.

Fig. 3. Number of Frequent Closed Itemsets and Items Generated.

6 Conclusion and future works

In this paper, we proposed a new approach for selecting a set of page components to be cached. This approach is based on WUM techniques. We first integrate page components into the visits of users of a web site in the preprocessing step. Then, we use the generation of frequent closed itemsets to filter the set of the most requested components, which become candidates for caching. Implementation efforts are under way to test the proposed materialization prototype on real web sites and to compare its performance with existing methods.

Moreover, research efforts are under way to take into account the propagation of updates to the cached web page components and the placement algorithm to adopt for the proposed solution.

Fig. 4. Number of Frequent Closed Itemsets and Items Generated.

Fig. 5. Number of Frequent Closed Itemsets and Items Generated.

References

[2] J. Challenger, P. Dantzig, A. Iyengar, and K. Witting. A fragment-based approach for efficiently creating dynamic web content. ACM Trans. Internet Techn., 5(2).
[3] C-Y. Chang and M-S. Chen. A new cache replacement algorithm for the integration of web caching and prefetching. In Proceedings of CIKM, Virginia, USA.
[4] R. Cooley. Web usage mining: Discovery and application of interesting patterns from web data. PhD thesis, University of Minnesota, USA.
[5] M. Crovella and P. Barford. The network effects for prefetching. Proc. IEEE INFOCOM 1998, 1998.
[6] A. Datta, K. Dutta, H. Thomas, and D. VanderMeer. A comparative study of alternative middle tier caching solutions to support dynamic web content acceleration. In Proceedings of the 27th VLDB Conference, Roma, Italy, pages 11-14.
[7] K. Gouda and M-J. Zaki. Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery.
[8] V. Jacobson. Congestion avoidance and control. In Proceedings of ACM SIGCOMM, Stanford, CA, USA.
[9] A. Labrinidis and N. Roussopoulos. On the materialization of web views. In Proc. of the ACM SIGMOD Conference, Philadelphia, Pennsylvania, USA.
[10] A. Labrinidis and N. Roussopoulos. Web views materialization. In Proc. of the ACM SIGMOD Conference, Dallas, Texas, United States, pages 79-84.
[11] A. Labrinidis and N. Roussopoulos. Online view selection for the web. In Proc. of the ACM SIGMOD Conference, Madison, Wisconsin, pages 56-68.
[12] A. Labrinidis and N. Roussopoulos. Exploring the trade-off between performance and data freshness in database-driven web servers. The VLDB Journal, 13(3), September.
[13] A. Nanopoulos, D. Katsaros, and Y. Manolopoulos. Exploiting web log mining for web cache enhancement. WEBKDD, San Francisco, August, pages 68-87.
[14] L. Ramaswamy, A. Iyengar, L. Liu, and F. Douglis. Automatic detection of fragments in dynamically generated web pages. Proceedings of the 13th International Conference on World Wide Web (WWW2004), New York, USA.
[15] D. Tanasa. Web Usage Mining: Contributions to Intersites Logs Preprocessing and Sequential Pattern Extraction with Low Support. PhD thesis, University of Nice Sophia Antipolis, France.
[16] Y-H. Wu and A-L-P. Chen. Prediction of web page accesses by proxy server log. World Wide Web, 5(1):67-88.
[17] O-R. Zaiane, M. Xin, and J. Han. Discovering web access patterns and trends by applying OLAP and data mining technologies on web logs. Proceedings, IEEE International Forum on Research and Technology Advances in Digital Libraries, pages 19-29.
[18] M-J. Zaki and C-J. Hsiao. Charm: An efficient algorithm for closed itemset mining. In 2nd SIAM Intl. Conf. on Data Mining, Arlington, VA, USA, 2002.
