Mining Web Logs with PLSA Based Prediction Model to Improve Web Caching Performance

Similar documents
Cluster Analysis of Electrical Behavior

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

A Binarization Algorithm specialized on Document Images and Photos

The Research of Support Vector Machine in Agricultural Data Classification

Video Proxy System for a Large-scale VOD System (DINA)

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Concurrent Apriori Data Mining Algorithms

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

A User Selection Method in Advertising System

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

Query Clustering Using a Hybrid Query Similarity Measure

Related-Mode Attacks on CTR Encryption Mode

Performance Evaluation of Information Retrieval Systems

Support Vector Machines

Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

An Entropy-Based Approach to Integrated Information Needs Assessment

CS 534: Computer Vision Model Fitting

Learning to Classify Documents with Only a Small Positive Training Set

Learning-Based Top-N Selection Query Evaluation over Relational Databases

An Optimal Algorithm for Prufer Codes *

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

Three supervised learning methods on pen digits character recognition dataset

Outline. Type of Machine Learning. Examples of Application. Unsupervised Learning

The Codesign Challenge

Lecture 5: Multilayer Perceptrons

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

Fast Computation of Shortest Path for Visiting Segments in the Plane

UB at GeoCLEF Department of Geography Abstract

Mathematics 256 a course in differential equations for engineering students

A New Approach For the Ranking of Fuzzy Sets With Different Heights

Unsupervised Learning

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research

Research on Categorization of Animation Effect Based on Data Mining

Maximum Variance Combined with Adaptive Genetic Algorithm for Infrared Image Segmentation

The Shortest Path of Touring Lines given in the Plane

Fuzzy C-Means Initialized by Fixed Threshold Clustering for Improving Image Retrieval

Edge Detection in Noisy Images Using the Support Vector Machines

Simulation Based Analysis of FAST TCP using OMNET++

Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance

FAHP and Modified GRA Based Network Selection in Heterogeneous Wireless Networks

Wishing you all a Total Quality New Year!

Journal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article

A Clustering Algorithm Solution to the Collaborative Filtering

Solving two-person zero-sum game by Matlab

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

Improving Web Image Search using Meta Re-rankers

Classifying Acoustic Transient Signals Using Artificial Intelligence

Improved Resource Allocation Algorithms for Practical Image Encoding in a Ubiquitous Computing Environment

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

Semantic Image Retrieval Using Region Based Inverted File

A Similarity Measure Method for Symbolization Time Series

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

X- Chart Using ANOM Approach

Pruning Training Corpus to Speedup Text Classification 1

Deep Classification in Large-scale Text Hierarchies

An Image Fusion Approach Based on Segmentation Region

Detection of an Object by using Principal Component Analysis

Utilizing Content to Enhance a Usage-Based Method for Web Recommendation based on Q-Learning

Efficient Text Classification by Weighted Proximal SVM *

Backpropagation: In Search of Performance Parameters

A New Feature of Uniformity of Image Texture Directions Coinciding with the Human Eyes Perception 1

LinkSelector: A Web Mining Approach to. Hyperlink Selection for Web Portals

Extraction of Human Activities as Action Sequences using plsa and PrefixSpan

Classifier Selection Based on Data Complexity Measures *

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

A Model Based on Multi-agent for Dynamic Bandwidth Allocation in Networks Guang LU, Jian-Wen QI

A Post Randomization Framework for Privacy-Preserving Bayesian. Network Parameter Learning

User Authentication Based On Behavioral Mouse Dynamics Biometrics

FRES-CAR: An Adaptive Cache Replacement Policy

Load-Balanced Anycast Routing

Simulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010

Adaptive Transfer Learning

High-Boost Mesh Filtering for 3-D Shape Enhancement

Smoothing Spline ANOVA for variable screening

Virtual Machine Migration based on Trust Measurement of Computer Node

A Clustering Algorithm for Chinese Adjectives and Nouns 1

Reliable Negative Extracting Based on knn for Learning from Positive and Unlabeled Examples

Effective Page Recommendation Algorithms Based on. Distributed Learning Automata and Weighted Association. Rules

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

User Tweets based Genre Prediction and Movie Recommendation using LSI and SVD

A NEW APPROACH FOR SUBWAY TUNNEL DEFORMATION MONITORING: HIGH-RESOLUTION TERRESTRIAL LASER SCANNING

Problem Set 3 Solutions

TESTING AND IMPROVING LOCAL ADAPTIVE IMPORTANCE SAMPLING IN LJF LOCAL-JT IN MULTIPLY SECTIONED BAYESIAN NETWORKS

Anytime Predictive Navigation of an Autonomous Robot

Transcription:

JOURAL OF COMPUTERS, VOL. 8, O. 5, MAY 2013 1351 Mnng Web Logs wth PLSA Based Predcton Model to Improve Web Cachng Performance Chub Huang Department of Automaton, USTC Key laboratory of network communcaton system and control Hefe, Chna Emal: hcb@mal.ustc.edu.cn Jnln Wang, Haojang Deng, Jun Chen Insttute of Acoustcs, Chnese Academy of Scences atonal etwork ew Meda Engneerng Research Center Bejng, Chna Emal: {wangjl, denghj, chenj}@dsp.ac.cn Abstract Web cachng s a well-known strategy for mprovng the performance of web systems. The key to better web cachng performance s an effcent replacng polcy that keeps n the cache popular documents and replaces rarely used ones. When coupled wth web log mnng, the replacng polcy can more accurately decde whch documents should be cached. In ths paper, we present a PLSA based predcton model to predct the user access patterns and nterest to extend the well-known GRAM-GDSF cachng polcy. Extensve experments are conducted on the publcly avalable web logs datasets. The result shows that our approach gets better web-access performance. Index Terms PLSA, web cachng, web cachng algorthm, predcton model, web log mnng I. ITRODUCTIO As the Internet and user scale are growng at a very rapd rate, the performance requrements of web systems become ncreasngly hgh. Web cachng s one of the most successful solutons for mprovng the performance of web systems. The core dea of web cachng s to mantan the popular web documents that lkely to be revsted n near future n a cache, such that the performance of web system can be sgnfcantly mproved snce most of later user requests can be drectly repled from the cache. Lyng n the heart of web cachng algorthms s the cache replacement polcy. To mprove the performance of web cachng, researchers have proposed a number of cache replacement polces [1, 21], many of whch have been covered n the comprehensve surveys by [16]. These tradtonal algorthms take nto account several factors and assgn a key value or prorty for each web document stored n the cache. However, t s dffcult to have an omnpotent polcy that performs well n all envronments or for all tme because each polcy has dfferent desgn ratonal to optmze dfferent resources. Moreover, combnaton of the factors that can nfluence the replacement process to get wse replacement decson s not an easy task because one factor n a partcular stuaton or envronment s more mportant than other envronments [14, 15]. Due to these constrants, there s a need for an effectve approach to ntellgently manage the web cache whch satsfes the objectves of Web cachng requrement. Ths s motvaton n adoptng ntellgent technques n the Web cachng algorthms. Another motvaton to ntellgent Web cachng algorthms s the avalablty of web access logs fles, whch can be exploted as tranng data. In a few prevous studes, the ntellgent approaches have been appled n web cachng algorthms or other web servce [2, 11, 12, 13, 17, 18, 19, 20]. These studes typcally buld predcton model by tranng the web logs. By makng use of the predcton model, the cachng algorthms become more effcent and adaptve to the web cachng envronment compared to the tradtonal web cachng algorthms. However, these studes ddn t take nto account the user access patterns and nterest when buldng the predcton model. Snce the users are the source of all the web access actons, t s necessary to buld a predcton model whch can well modelng the user access patterns and nterest. In ths paper, we use the web access logs to tran a probablstc latent semantc analyss (PLSA) based predcton model for user access patterns and nterest to extend GRAM-GDSF [6]. Based on the model, we can obtan user s topcs of nterest and mne rules for future access predcton. We then proposed the PLSA-based GRAM-GDSF cachng algorthm, whch ncorporate our PLSA-based predcton model nto GRAM-GDSF to mprove ts web cachng performance. We also mprove the PLSA model to get the more accurately topcs. The experment shows that our PLSAbased predcton model ndeed mprove the system performance over GRAM-GDSF. The organzaton of the paper s as follows. In the next secton, we revew the related work n web cachng. In Secton 3, we gve a bref ntroduce of the PLSA model. do:10.4304/jcp.8.5.1351-1356

1352 JOURAL OF COMPUTERS, VOL. 8, O. 5, MAY 2013 Data process s descrbed n Secton 4. Then n Secton 5, we ntroduce the formal PLSA-based predcton model and show how t ntegrates wth the cachng algorthms. Experment results are presented n Secton 6. Fnally, we conclude our paper n Secton 7. II. RELATED WORK Web cachng plays a key role n mprovng web systems performance. The heart of web cachng s socalled replacement polcy, whch measure the popularty of all the prevously vsted documents, keeps n the cache those popular documents and replaces rarely used ones. The basc dea of the most well-known cachng algorthms s to assgn each document a key value computed by factors such as sze, frequency and cost. Usng ths key value, we could rank these documents accordng to correspondng key value. When a replacement s to be carred, the lower-ranked documents wll be evcted from the cache. Among these key valuebased cachng algorthms, GDFS [1] s the most successful one. It assgns a key value to each document n the cache as K /, where L s an nflaton factor to avod cache polluton, C s the cost to fetch h, F s the past occurrence frequency of h and S s the sze of h. Avalablty of web access logs fles that can be used as tranng data promotes the emergence of ntellgent web cachng algorthms [2, 3, 4, 11, 12]. In [11], the neuro-fuzzy system has been employed to predct web objects that can be re-accessed later. [12] proposes a logstc regresson model to predct the future request. However, these algorthms ddn't make use of the web logs to mne the useful nformaton that can reveal the user behavor pattern. Then a lot of study has been conducted to use web log mnng to mprove the performance of web cachng. Web log mnng extract useful knowledge from large-scale web logs for future research and applcaton. In [5], Ptknow and Proll studed the pattern extracton technques to predct the web use s access path. An n-gram model to predct future requests was proposed n [7]. [8] has proposed sequental data mnng for web transacton data, but they ddn t apply the algorthm n cachng. In [6], Qang Yang dscusses an ntegrated model by combnng assocatonbased predcton and the well-known GDSF cachng algorthm (GRAM-GDSF) n a unfed framework. They frst tran the web logs to buld a set of assocaton rules, and then apply these rules to gve predcton of future vsts for each sesson. However, the predcton ddn t take nto account the nterest of current user. In contrast, n ths paper, we propose a PLSA-based predcton model, by whch we can make predctons based on the current actve user s nterest, to mprove the assocaton predcton algorthm n [6]. III. THE PLSA MODEL The PLSA model [9] was orgnally developed for topc dscovery n a text corpus, where each document s represented by ts word frequency. The model assumes that, under the texts we observed there s another latent level: the topc level. A document has a certan probablty dstrbuton on a varety of topcs, and smlarly, topcs also have dfferent dstrbuton on a set of words. Therefore, PLSA ntroduces a latent topc varable,, between the document,, and the word,,. Then the PLSA model s gven by the followng generatve scheme: (1) Select a document wth probablty p. (2) Pck a latent topc wth probablty p. (3) Generate a word wth probablty p. As a result, we generally get an observaton par,, whle the latent topc varable s hdden. Ths generatve model can be expressed by the followng probablstc model: (, ) ( ) ( j j j ) P w d P d P w d (1) K ( ) ( ) ( j j k k ) P w d P w z P z d (2) k 1 Expectaton maxmzaton (EM) algorthm [10] s appled to learn the unobservable probablty dstrbuton P z d and P w z from the complete dataset. The log-lkelhood of the complete dataset s: M 1 (, ) log (, j j ) L n d w P d w (3) ( j ) ( j k ) ( k ) M K n d, w log P w z P z d 1 k 1 Where n, s the number of occurrences of word n document. In order to maxmze the loglkelhood functon L, we should frst ntal the PLSA probablty model parameters P and P wth random number, then perform teratve calculatons by alternatng mplementaton of the E-step and M-step. When the change of L s less than a threshold value, we stop the teratve calculatons. In E-step, the pror probablty of z s calculated: Pz ( d, w ) k j K P ( w z ) P ( z d j k k ) P ( w z ) P ( z d j l l ) (4). (5) In M-step, the followng formulas are used to reestmate the model parameters: 1 ( z j k ) M P w 1 nd (, w ) P( z d, w ) j k j nd (, w ) P( z d, w ) j k j (6)

JOURAL OF COMPUTERS, VOL. 8, O. 5, MAY 2013 1353 nd (, w ) P( z d, w ) j k j d k M Pz ( ) M nd (, w ) IV. DATA PROCESS In ths secton, we present our approach to establsh a PLSA-based predcton model on a large-scale web log. Our goal s to model user access patterns to get the user s nterest, based on whch we can predct future requests for each current actve user. A. Extractng Embedded Objects HTML documents consst of varety of embedded web objects, such as mages, audo and vdeo fles. Exactly, they contan the lnkage structure of these web objects. References to embedded objects are usually preceded by ther HTML contaner. Therefore, from the web logs we can see, they always appear as a burst of requests from the same user shortly after an HTML access. If an object s observed that ts request always come mmedately after access to certan HTML documents, t can be labeled as ts contaners. Generally web users are only nterested n HTML page, however, they know nothng about the lnkage structure nformaton. Therefore, we deal wth HTML documents and embedded objects dfferently. Whle buldng PLSAbased model to get the nterest of user, we do not take embedded objects nto consderatons. Instead, we just assocate them to ther correspondng HTML contaners. After extracted from web logs, these embedded objects are stored n HTML-OBJECT Hash Table, by whch we can get the embedded objects of a HTML documents quckly. B. Buldng User-html Matrx After flterng the embedded objects, only HTML document reman n a request sequence. When we buld the user-html matrx, the sesson lst has to be generated frst. Here, the generaton of sesson lst conssts of three major steps. Frst, the records of the web logs are sorted accordng to the access tme. Then, web set a reasonable sesson tme threshold. We assume that the duraton of any sesson won t exceed the sesson tme threshold. Fnally, accordng to the threshold, we sequentally buld the sessons for each user. Then we can get the user-html matrx (see Fg.1) by traversng the sesson lst. Each row n the matrx represents a user and n, specfes the number of tmes the html fle accessed by user. (, )... (, 1 1 1 )... (, m 1 M ) n u h n u h n u h............... n ( u, h1 )... n ( u, h )... n ( u, h n n m n M )............... n ( u, h1 )... n ( u, h )... n ( u, h m M ) Fgure 1. The user html matrx. j (7) V. PLSA-BASED PREDICTIO MODEL A. Overvew Although the PLSA model was orgnally developed for topc dscovery n text corpus, t has been appled n many scentfc felds, such as multmeda, data mnng and mage recognton. Here we apply the PLSA model to mprove web cachng algorthm. However, n the applcaton process, we found that there s a problem wth the PLSA model: t doesn t consder the user smlarty when calculate the model parameters. In order to more accurately obtan the latent topcs, we have made some mprovements on the teratve calculatons to address the smlarty between the users. We ntroduce a new smlarty layer n PLSA model such that the topc dstrbutons of the gven user can be updated by that of users smlar to hm. After buldng the mproved PLSA model, we can made predcton for each current actve user based on the users nterest whch s derved from the model. Further, we make use of the predcton model to mprove the cachng algorthm descrbed n [6]. As shows n Fg.2 we depct an overvew of the mproved PLSA model n our applcaton scenaro. Gven the user,,, the HTML fle,, and the latent topc,, we adopt the same generatve scheme as that of PLSA. In addton, we ntroduce a smlarty layer between user and HTML fle. 1. Select a user wth probablty P 2. Pck a latent topc wth probablty P 3. Access a HTML fle wth probablty P 4. P can be updated by that of the smlar users: P 5. s smlarty between and, t also can be consdered as the condtonal probablty The user smlartes are parameterzed by the user smlarty matrx S whch descrbed n the followng secton. As we have ncorporated the user smlartes nto the PLSA model, the smlar users can have smlar topc dstrbutons and we wll get more accurate latent topcs than the orgnal PLSA model. B. Users Smlartes Wth the user-html matrx ntroduced as Fg.1 shows, we compute the user smlarty matrx S by cosne smlarty. For each par of users n the matrx, we frst compute ther cosne smlarty as follows: uur uur u u l Sm h uur uur (8) u u uur where u s the -th user and represented by the -th row n the user-html matrx. Then we can get a smlarty matrx S where. Further, we should normalze the matrx S such that ts row adds up to 1. l

1354 JOURAL OF COMPUTERS, VOL. 8, O. 5, MAY 2013 S l S l S h h 1 Therefore, each element of S can be consdered as the condtonal probablty and the topc dstrbuton of a gven user can be updated by the topc dstrbutons of the users that are smlar to the gven user. ( ) ( ) ( k k l l ) P z u P z u P u u (10) C. Parameter Estmatng Accordng to the maxmum lkelhood prncple, we estmate the parameters P and P by maxmzng the log-lkelhood functon: M K (, ) log ( ) ( j j k k ) L n u h P h z P z u 1 k 1 (11) Then we used EM algorthm to estmate the parameters. In order to ncorporate the user smlarty, we renew the probablty P by equaton (9) at each end run of M-step, thus resultng n a varaton of EM algorthm through the followng E-step and M-step: The E-step s gven by (, ) P z u h k j K P ( h z ) P ( z u j k k ) P ( h z ) P ( z u j l l ) and the M-step s gven by ( ) P h z 1 j k M ( ) P z u 1 k M (, ) (, j k j ) n u h P z u h (, ) (, j k j ) n u h P z u h (, ) (, j k j ) n u h P z u h (, h j ) n u ( ) ( ) ( k k l l ) (9) (12) (13) (14) P z u P z u P u u (15) Iteratvely perform E-step and M-step untl the probablty values are stable. D. Predcton Algorthm The process of buldng the PLSA-based model s called tranng. Once the tranng s fnshed, we can make use of the model to gve predctons of future vsts. Specfcally, for user, we wll assume that he s nterested n the topc, f p s the bggest one among the sets,. Lkewse, for topc, we only keep largest value among sets. Then we assume that the html pages correspondng to the values are lkely to be accessed under the topc. In the prevous secton, we ntroduced the ntellgent web cachng algorthm [6] (GRAM-GDSF) whch ams to mprove the GDSF cachng algorthm. GRAM-GDSF s one of the best ntellgent cachng replacement algorthms. Our PLSA-based predctve cachng algorthm (P-GRAM) s an extenson and enhancement of GRAM-GDSF by ncorporatng a factor of predctve nterest frequency. When we use the mproved PLSA model, the correspondng PLSA-based predctve cachng algorthm s called IP-GRAM. ormally, there smultaneously exst a number of actve users who are accessng a web server. Based on the pre-traned mproved PLSA model, our predcton model can predct each actve user s nterested topc and the HTML fle correspondng to the topc. Dfferent users wll gve dfferent predcton to future HTML fles. Snce our predcton of a HTML fle comes wth a probablty P, we can combne all the current actve users nterest predctons to calculate the future nterest-based occurrence frequency of a HTML fle. Let denote a HTML fle on the server, be an actve user on a web server,, be the probablty predcted by an actve user, who are nterested n topc, for HTML. If, 0, t ndcates that HTML fle s not predcted by. Let be the future nterest based frequency of requests to HTML. If we assume all the users on a web server are ndependent to each other, we can obtan the followng equaton: I P (16) j, j To llustrate (16), we map two users n Fgure 2. Each user yelds a set of predctons to HTML fles accordng to hs nterest. Snce users are assumed ndependent to each other, we use (16) to compute the nterest-based frequency of HTML fle. For example, The HTML fle s predcted by three actve users wth a probablty of 0.7, 0.6 and 0.5, respectvely. From (16), 1.3. Ths means that based on users nterest, the HTML fle wll be accessed 1.3 tmes n the near future. h 1 :0.30 h 2 :0.21 h :0.50 1 h :0.42 2 h :0.60 1 h :0.20 1 I 0.30 + 0.50 + 0.60 1 1.40 I 0.21 + 0.42 + 0.20 2 0.83 Fgure 2. Predcton of nterest based frequency.

JOURAL OF COMPUTERS, VOL. 8, O. 5, MAY 2013 1355 Once the future nterest-based frequency I of a HTML h can be predcted, we extend GRAM-GDSF [6] to ncorporate the I : K h L + w h + F h + I h * C h / S h ( ) ( ) ( ) ( ) ( ) ( ) ( ) (17) We add w, F and I together n (17), whch mples that the key value of a HTML h s determned not only by ts past occurrence and sesson-based future frequency, but also affected by ts nterest. The more lkely the actve users are nterested n t, the greater the key value wll be. The ratonale behnd our extenson s that we take nto account the users nterest obtaned by tranng the web logs and adjust the replacement polcy. VI. EXPERIMETS A. Smulaton Model We have conducted a seres of expermental comparsons wth the web log that we are able to get. In the experments, the EPA data set contans a full day of HTTP requests to the EPA web server whch located at Research Trangle Park, C. Before the experments, we removed the records of uncacheable URLs from the web logs. A URL s consdered uncacheable when t contans dynamcally generated content such as CGI scrpts. We also fltered out the records wth unsuccessful HTTP response code. In our experments, we use two objectve performance parameters to evaluate the performance of our extended cachng algorthm. The ht rate s the percentage of all requests that can be repled drectly by searchng the cache for a copy of the requested document. The byte ht rate represents the percentage of bytes that are transferred drectly from the cache rather than from the orgn server. Let be the total number of request and δ 1 f the request s ht n the cache, whle δ 0 ohterwse. Mathematcally, the two parameters can be calculated as follows: HR BHR 1 δ 1 b δ 1 where s the sze of the -th request. b (18) (19) B. Expermental Results In the experments, the algorthms under comparson are GRAM-GDSF, P-GRAM and IP-GRAM. Wth respect to any one of the three algorthms, from Fg.3 and Fg.4, we can see that both the ht rate and byte ht rate are growng n a log-lke fashon as a functon of the cache sze. Ths suggests that ht rate or byte ht rate does not ncrease as much as the cache sze does, especally when cache sze s large enough. In other words, as the cache sze becomes nfnte, the performance dfference between these three algorthms wll dsappear. So we should compare the performance of cachng algorthms under lmted cache sze. We observed the cache ht rates and cache byte ht rates under dfferent cache szes. As shows n Fg.3 and Fg.4, IP-GRAM and P-GRAM both perform better than GRAM-GDSF. Ths result reveals that the nterestbased frequency, predcted by the PLSA-based predcton model, s workable. Due to the valdty of nterest-based frequency, IP-GRAM and P-GRAM both buld a more accurately predcton model than -GRAM, so that the accordng cache replacement polces get a better performance. Besdes, we found that IP-GRAM s a lttle better than P-GRAM, snce t gets a more accurately latent topc when buldng the predcton model. In addton, we can see that when cache szes are large few replacement decsons are needed and cache polluton s not a factor so the polces have smlar performance. When cache szes are very small, addng a sngle large document can result n the removal of a large number of smaller documents reducng the effects of cache polluton. At ths case, a good cachng algorthm could sgnfcantly mprove the cachng performance. Ht Rate VS Cache Sze 90 Ht Rate(%) Byte Ht Rate(%) 80 70 60 50 IP-GRAM P-GRAM GRAM-GDSF 40 0 0.5 1 Cache sze(%) Fgure 3. Ht rate comparson on EPA data 75 65 55 Byte Ht Rate VS Cache Sze IP-GRAM P-GRAM GRAM-GDSF 45 0 0.5 1 Cache sze(%) Fgure 4. Byte ht rate comparson on EPA data In order to prove the valdty of the mproved PLSA model, we compare the convergence rate between IP- GRAM and P-GRAM. P-GRAM adopts the PLSA model and IP-GRAM uses the mproved PLSA model nstead. As shows n Fg.5, the log-lkelhood of IP- GRAM convergence faster than P-GRAM. Snce IP-

1356 JOURAL OF COMPUTERS, VOL. 8, O. 5, MAY 2013 GRAM ntroduce smlarty layer n PLSA model, topc dstrbutons of the gven user can be updated by that of users smlar to hm durng the teratve process of tranng. By ths way, IP-GRAM could estmate more accurate probablty model of the topc dstrbutons n an teraton, whch wll accelerate the convergence rate of teratons. Log-lkelhood -7.5 x 104-8 -8.5-9 -9.5-10 -10.5-11 0 50 100 150 200 umber of teratons Fgure 5. Log lkelhood convergence rate VII. COCLUSIO In ths paper, we appled the PLSA-based predcton model buld from the web logs to mprove the GRAM- GDSF cachng algorthm. By takng nto account the nterest-based frequency n cachng algorthm, t s possble to dramatcally mprove both the ht rate and byte ht rate. In the future, we would lke to study on how to reduce the buldng tme of PLSA model, so that we can dynamcally update the model parameters on lne. In addton, n future we can consder ncorporatng the clusterng approach to process web logs, so that a more accurate user nterest model could be obtaned by buldng PLSA model. ACKOWLEDGMET Ths work and related experment envronment s supported by the atonal Hgh Technology Research and Development Program of Chna under Grant o. 2011AA01A102, the atonal Key Technology R&D Program under Grant o. 2012BAH18B04 and the Strategc Prorty Research Program of the Chnese Academy of Scences under Grant o. XDA06010302. We are sncerely grateful to ther support. REFERECES IP-GRAM P-GRAM [1] L. Cherkasova, mprovng www proxes performance wth greedy-dual-sze-frequency cachng polcy, In HP Techncal Report, Palo Alto, ovember 1998. [2] Q. Lu, Web Latency Reducton wth prefetchng, PhD thess, Unversty of Western Ontaro, London, 2009. [3] H.T. Chen, Pre-fetchng and Re-fetchng n Web cachng system: Algorthms and Smulaton, Master Thess, TRET UIVERSITY, Peterborough, Ontaro, Canada, 2008. [4] W. Al, S.M. Shamsuddn, Intellgent Clent-sde Web Cachng Scheme Based on Least recently Used Algorthm and euro-fuzzy System, The sxth Internatonal Symposum on eural etworks, Lecture otes n computer Scence, Sprnger-Verlag Berln Hedelberg, pp. 70-79, 2009. [5] Ptkow J, Proll P, Mnng longest repeatng subsequences to predct www surfng, In Proceedngs of the 1999 USEIX Annual Techncal Conference, 1999. [6] Qang Yang, Hanng Henry Zhang, Tany LI, Mnng web logs for predcton models n WWW cachng and prefetchng, In Proceedngs of Internatonal Conference on Knowledge Dscovery and Data Mnng KDD 01, San Francsco, Calforna, USA, 2001. [7] R. Agrawal and R. Srkant, Mnng Sequental Patterns, In Proceedngs of Internatonal Conference on Data Engneerng, Tape, Tawan, 1995. [8] Z. Su, Q. Yang, Y. Lu, H. Zhang, Whatnext: A predcton system for web requests usng n-gram sequence models, In Proceedngs of the Frst Internatonal Conference on Web Informaton System and Engneerng Conference, pp. 200-207, Hong Kong, June 2000. [9] T. Hofmann, Probablstc latent semantc ndexng, In Proceedngs of the 22 nd annual nternatonal ACM SIGIR conference on Research and development n nformaton retreval, pp. 50-57, ACM, 1999. [10] S. Deerwester, Maxmum lkelhhod from n complete data va the EM algorthm, Journal of the Royal Statstcal Socety B 39, pp. 1-38 [11] L. Janhu, X. Tanshu, Y. Chao, Research on WEB Cache Predcton Recommend Mechansm Based on Usage Pattern, Frst Internatonal Workshop on Knowledge Dscovery and Data Mnng(WKDD), pp.473-476, 2008. [12] T.M. Kroeger, D.D.E. Long, J.C. Mogul, Explorng the bounds of web latency reducton from cachng and prefetchng. Proceedngs of the USEDC Symposum on Internet Technology and Systems, pp. 13-22, 1997. [13] Tao. Tan, Hongjun. Chen, A personalzaton recommendaton method based on Deep web data query, Journal of Computers, v 7, n 7, p 1599-1606, 2012. [14] H.T. Chen, Pre-fetchng and Re-fetchng n Web cachng systems: Algorthms and Smulaton, Master Thess, TRET UIVERSITY, Peterborough, Ontaro, Canada, 2008. [15] A.K.Y. Wong, Web Cache Replacement Polces: A Pragmatc Approach, IEEE etwork magazne, 20(1), pp. 28-34, 2006. [16] J Wang, A survey of web cachng schemes for the nternet, ACM SIGCOMM Computer Communcaton Revew, 1999. [17] Jngl. Zhou, Xuejun. e, Lehua. Qn, Janfeng, Zhu, Web clusterng based on tag set smlarty, Journal of Computers, v 6, n 1, p 59-66, 2011. [18] Yanjuan. L, Maozu. Guo, Web page classfcaton usng relatonal learnng algorthm and unlabeled data, Journal of Computers, v 6, n 3, p 474-479, 2011. [19] We. Huang, Ly. Zhang, Jdong. Zhang, Mngzhu Zhu, Semantc focused crawlng for retrevng E-commerce nformaton, Journal of Software, v 4, n 5, p 436-443, July 2009. [20] Xangfeng. Luo, Ka. Yan, Xue. Chen, Automatc dscovery of semantc relatons based on assocaton rule, Journal of Software, v 3, n 8, p 11-18, ovember 2008. [21] P. Cao, S. Iran, Cost-Aware WWW Proxy Cachng Algorthms, USEIX Symposum on Internet Technologes and Systems, Monterey, CA, 1997.