A Webpage Similarity Measure for Web Sessions Clustering Using Sequence Alignment


Mozhgan Azimpour-Kivi
School of Engineering and Science, Sharif University of Technology, International Campus, Kish Island, Iran

Reza Azmi
Department of Computer Engineering, Alzahra University, Tehran, Iran

Abstract: Web session clustering is a web usage mining task that aims to group web sessions with similar trends and usage patterns into clusters. This process is crucial for effective website management, web personalization, and developing web recommender systems. Accurate clustering of web sessions depends heavily on the similarity measure defined to compare web sessions. In this paper, we propose a similarity measure for comparing web sessions. The sequential order of web navigations in sessions is considered using the sequence alignment method. Furthermore, we propose to consider the usage similarity of web sessions based on the time a user spends on a webpage and the frequency of visits to each page within the session. The proposed method is validated by clustering a collection of web sessions using an agglomerative clustering technique and comparing the results with available methods. The experimental results show the effectiveness of the proposed method in capturing the properties of web session data.

Keywords: interestingness of webpage; webpage similarity measure; sequence alignment; web sessions clustering

I. INTRODUCTION

The World Wide Web (WWW) is considered the largest distributed collection of information. With the rapid growth of this information resource, it has become difficult for users to acquire their desired information even within a particular website. Hence, a need for techniques that can ease this problem has been highlighted. Knowing users and building profiles of them can help websites present relevant information to particular visitors. To address this issue, web usage mining has recently attracted much attention [1]. Web usage mining is an application of data mining which tries to extract useful patterns from data obtained from the interaction of users with the web.
Deploying web mining techniques is crucial for any application that aims to ease the use of the web, such as creating adaptive websites, web personalization, and web recommender systems. Web usage mining from web access log files has three steps [1]: (1) data pre-processing; (2) pattern discovery by applying techniques such as clustering, classification, association discovery, and sequential pattern discovery to the data; and (3) pattern analysis, which aims at eliminating irrelevant patterns from the patterns discovered in the previous step.

In our study, we are interested in web session clustering, which is the problem of grouping web sessions with similar usage patterns. Any clustering process tends to maximize the intra-group similarity and minimize the inter-group similarity of the clustered objects [2]. One of the most challenging issues in clustering web sessions is how to measure the similarity of web sessions. A more precise similarity measure is more helpful for investigating the nature of the data. The most popular measures used for web session clustering are the Euclidean distance, the cosine similarity measure, and the Jaccard coefficient. We should keep in mind that a web session contains a sequence of URLs accessed by a user. Therefore, a good similarity measure should not ignore the sequential nature of web navigations in sessions. On the other hand, not all of the URLs visited in a session are equally important to the user.

In this paper, we borrow the idea of sequence alignment [3] from bioinformatics in order to find the best match between session sequences. Sequence alignment is one of the fundamental operations in bioinformatics, used to capture the relationships between DNA sequences. Just as each session consists of a sequence of web pages, each DNA sequence consists of a sequence of nucleotides. Consequently, techniques used in DNA sequence alignment can be applied to measure the similarity of web sessions. As in DNA sequence alignment, the problem of computing the similarity between web sessions can be solved using dynamic programming techniques [3].
In addition to the sequential nature of web sessions, we consider the time a user spends on a webpage and the frequency of visits to a particular webpage in a web session in order to estimate the importance of that page to the user. Using the Silhouette coefficient [2] of the obtained clusters as the evaluation measure, we compare our method with available methods. The results show that our method is more effective in capturing the similarity of web sessions for the web session clustering task.

The remainder of this paper is organized as follows. In Section 2, a review of some available methods for clustering web sessions is presented. In Section 3, we introduce a new method for estimating the similarity of sessions using sequence alignment. The experimental results for evaluating the proposed method are presented in Section 4. Finally, Section 5 concludes our study and presents future work.

© 2011 IEEE

II. RELATED WORK

Different similarity measures have been proposed for capturing web session similarities. Also, various clustering algorithms have been introduced in the literature for grouping web users with common practices. Most of these works represent a session using a vector defined over the space of web pages within a particular website: each vector dimension corresponds to a specific URL within the website. Depending on the values assigned to these dimensions, different user behavior analyses can be performed. The most common method is to associate binary values with a dimension, i.e. the value 1 for pages which the user has visited in the session and 0 for the others. Nasraoui et al. [4] used this representation for web sessions and deployed the normalized cosine of the angle between vectors as their similarity for web session clustering. Other methods use feature weights based on the time a user spends on a particular webpage (perhaps normalized by the size of the webpage) or the frequency of occurrence of a URL within the user session instead of binary weights [5], [6]. Yan et al. [5] applied the Euclidean distance in a web user clustering task and then suggested links to web users according to their corresponding cluster.

None of the mentioned methods captures the sequential nature of web navigations within sessions. In a newer attempt, Banerjee and Ghosh [7] used the relative time spent on the longest common subsequence between sessions, found through dynamic programming, as the similarity between sessions. The authors then built an abstract similarity graph for the set of sessions and applied graph partitioning methods in order to cut the abstract graph into clusters. Wang and Zaiane [8] introduced a new method to measure similarities between web sessions based on sequence alignment in computational biology. In this method, they first defined a similarity between web pages using the hierarchical structure of the URLs in the website. Then, they utilized dynamic programming to find the best match between session sequences. In the method presented in [9], the accurate viewing time of the accessed pages is considered in addition to the URLs of the pages for defining the similarity of web pages.
Similar to [8], a dynamic programming process is then applied to find the best match for sessions. In [10], an algorithm for Web Session Clustering Based on Increase of Similarities (WSCBIS) is presented using the method proposed in [9]. This algorithm decreases the time and space complexity of clustering compared to k-means and Robust Clustering using Links (ROCK) [11]. Hay et al. [12] clustered web users using different similarity measures: the Sequence Alignment Method (SAM) and an association measure (a Euclidean distance-based measure). In SAM, the sequential order of requests is taken into consideration rather than their position. The results showed that, compared to the association measure, SAM retrieves sequences that not only contain similar pages but also visit them in a similar order.

The method proposed in this paper shares its basic idea with the method presented in [9]. Compared to that work, we consider not only the time a user spends on a webpage but also the frequency of visits to a particular webpage in order to estimate the interestingness of that page to the user in a session. Then, we define the similarity of web pages based on the conjunction of the similarity of the web pages' URLs and the similarity of their interestingness to users. Finally, we employ sequence alignment in order to find the best match between sessions and estimate their similarity.

A. Web Page Similarity Based on URLs

Wang and Zaiane [8] proposed a method to measure the similarity of different web pages. This method does not consider the content of web pages but only the paths leading to a webpage in the hierarchical structure of the URLs of the website. The details of this method are discussed in [8]. Briefly, in order to measure the similarity of web pages, we first represent each level of their URLs by a token. As a result, the token string of the full path of a URL is the concatenation of the representative tokens for each level. The token for each level is assigned based on the hierarchical structure of the website. The marking of the tree structure of a nominal website is illustrated in Fig. 1. In order to compute the similarity of two web pages, we first determine the length of the longer of their two token strings.
Then, we give a weight to each level of tokens from last to first. The last level of the longest token string is given a weight equal to 1, the second to last is given weight 2, and so on. Finally, the similarity between the token strings (Token Sim) is defined as the sum of the weights of the matching tokens divided by the sum of all weights. The obtained similarity for a pair of web pages ranges from 0, for pages without any identical token in an identical position, to 1, for pages that are exactly the same. An example of this process for the nominal URLs example-website/a/c.html and example-website/a/b.html is presented in Fig. 2. For this example, the similarity between the token strings is $(3 + 2) / (3 + 2 + 1) = 5/6 \approx 0.83$.

Figure 1. Marking the URL tree of a nominal website.

Figure 2. An example of token string comparison.
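The weighting scheme above can be sketched in Python as follows. This is a minimal sketch rather than the implementation of [8]: it assumes that the path segments of a URL can stand in for the per-level tokens of Fig. 1, and it compares tokens position by position. The names `url_tokens` and `token_sim` are hypothetical.

```python
def url_tokens(url):
    """Split a URL path into one token per hierarchy level (assumed tokenization)."""
    return [part for part in url.strip("/").split("/") if part]


def token_sim(url_a, url_b):
    """Weighted share of matching tokens, with deeper levels weighted less."""
    tokens_a, tokens_b = url_tokens(url_a), url_tokens(url_b)
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0  # assumption: two empty paths count as identical

    # The first level gets weight `longest`, the last level gets weight 1.
    weights = range(longest, 0, -1)
    matched = sum(w for w, a, b in zip(weights, tokens_a, tokens_b) if a == b)
    return matched / sum(weights)


print(token_sim("example-website/a/c.html", "example-website/a/b.html"))
```

For the two nominal URLs of Fig. 2, the first two levels match with weights 3 and 2, so the function returns 5/6. A variant that stops comparing at the first mismatching level would give the same result for this pair.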

B. Web Page Similarity Based on Importance to User

As mentioned, we combine the URL-based similarity of web pages with the similarity between the interests of their users in visiting those pages. In a particular session, the normalized frequency of visits to the $i$-th webpage ($P_i$) and the normalized time spent on this page can be represented using (1) and (2), respectively. In (1), $\mathrm{Frequency}(P_i)$ is simply the number of times webpage $P_i$ has been visited in the session. In (2), $\mathrm{TimeSpentOn}(P_i)$ is the difference between the time of the request for page $P_i$ and the time of the request for the next webpage in the session, taken from the access log file. Note that we cannot compute this value for the last webpage requested in a session. Here, we define the time spent on the last webpage of a session as the average time spent on the other web pages of the session. Considering that the length of a webpage (in bytes) can have an impact on the time needed to visit that page, we normalize the time spent on pages by dividing this value by the length of the corresponding webpage, as shown in (2).

$$\mathrm{Freq}(P_i) = \frac{\mathrm{Frequency}(P_i)}{\sum_{P_j \in \text{all pages in session}} \mathrm{Frequency}(P_j)} \tag{1}$$

$$\mathrm{SpentTime}(P_i) = \frac{\mathrm{TimeSpentOn}(P_i) / \mathrm{Length}(P_i)}{\sum_{P_j \in \text{all pages in session}} \mathrm{TimeSpentOn}(P_j) / \mathrm{Length}(P_j)} \tag{2}$$

In (1) and (2), the frequency of visits to a webpage and the time spent on that page are normalized by their denominators, which are the sums of these values over all requests in the session. These measures should be combined to describe the interestingness of a webpage to a user. In mathematics, the harmonic mean is one of several kinds of average. In our case, we use the harmonic mean of Freq and SpentTime for page $P_i$ as the measure of the interestingness of this page to a user in one session. This value is given by (3).

$$\mathrm{Interest}(P_i) = \frac{2 \times \mathrm{Freq}(P_i) \times \mathrm{SpentTime}(P_i)}{\mathrm{Freq}(P_i) + \mathrm{SpentTime}(P_i)} \tag{3}$$

Finally, we define a measure to estimate the similarity of the interestingness to visitors of the $i$-th page ($P_i$) and the $j$-th page ($P_j$) in a session. This value can be estimated using (4). Note that $P_i$ and $P_j$ can also belong to different sessions.
$$\mathrm{InterestSim}(P_i, P_j) = \frac{\min\{\mathrm{Interest}(P_i), \mathrm{Interest}(P_j)\}}{\max\{\mathrm{Interest}(P_i), \mathrm{Interest}(P_j)\}} \tag{4}$$

Considering the similarity between a pair of pages based on their URLs (Token Sim) and their interestingness to the user (Interest Sim), we can define the similarity of web pages as shown in (5). In this equation, the parameter $\alpha$ is a scale factor which should be assigned a value between 0 and 1.

$$\mathrm{Similarity}(P_i, P_j) = \alpha \times \mathrm{TokenSim}(P_i, P_j) + (1 - \alpha) \times \mathrm{InterestSim}(P_i, P_j) \tag{5}$$

C. Similarity of Web Sessions

As mentioned earlier, we consider each session as a sequence of URLs requested by the user. To estimate the similarity of web sessions, we apply the sequence alignment method in order to find the best match between sequences. In aligning sequences, not only characters that match identically are considered, but also spaces or gaps (or conversely, insertions) in the other sequence and mismatches, both of which can correspond to mutations. In sequence alignment, we want to find an optimal alignment that, loosely speaking, maximizes the number of matches and minimizes the number of spaces and mismatches.

To apply the sequence alignment method, we need to define a scoring function which helps find the optimal matching between session sequences. The similarity of web pages discussed in the previous section plays the role of a page matching goodness function. The scoring function deployed in this paper is as follows. For each identical matching, i.e. a pair of pages with similarity 1, the score is 20; for each mismatching, i.e. a pair of pages with similarity 0, or for matching a page with a gap, the score is -10; for a pair of pages with similarity in (0, 1), the score for their matching is between -10 and 20. Hence, the scoring function for the similarity between pages $P_i$ and $P_j$ can be calculated using (6), where the parameter $\beta$ is the similarity between pages $P_i$ and $P_j$ calculated in the previous section.

$$\mathrm{Score}(P_i, P_j) = -10 + 30 \times \beta, \quad 0 \le \beta \le 1 \tag{6}$$

As mentioned earlier, the similarity between web sessions is estimated using the sequence alignment method in order to find the best match between sequences. The final similarity between sequences is obtained from their optimal matching and the lengths of the sequences.
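The page-level quantities of (1) to (6) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: a session is taken to be a list of `(url, seconds_spent, length_bytes)` records with positive lengths, the time of a page visited several times is summed over its requests, the score is assumed to interpolate linearly between the stated endpoints -10 and 20, and all function names, as well as the name `alpha` for the scale factor in (5), are hypothetical.

```python
from collections import Counter


def time_spent(request_times):
    """Seconds on each request; the last request gets the average of the others."""
    if not request_times:
        return []
    gaps = [later - earlier for earlier, later in zip(request_times, request_times[1:])]
    last = sum(gaps) / len(gaps) if gaps else 0.0
    return gaps + [last]


def page_interest(session):
    """Return {url: Interest} for one session of (url, seconds, length_bytes) records."""
    visits = Counter(url for url, _, _ in session)
    total_visits = sum(visits.values())

    # Time per byte, summed over every request of the same page (assumption).
    time_per_byte = Counter()
    for url, seconds, length in session:
        time_per_byte[url] += seconds / length
    total_time = sum(time_per_byte.values())

    interest = {}
    for url in visits:
        freq = visits[url] / total_visits                                 # Eq. (1)
        spent = time_per_byte[url] / total_time if total_time else 0.0   # Eq. (2)
        interest[url] = 2 * freq * spent / (freq + spent)                 # Eq. (3)
    return interest


def interest_sim(interest_a, interest_b):
    """Eq. (4): ratio of the smaller interest value to the larger one."""
    high = max(interest_a, interest_b)
    # Assumption: two pages with zero interest are treated as equally interesting.
    return min(interest_a, interest_b) / high if high > 0 else 1.0


def page_similarity(token_similarity, interest_similarity, alpha=0.7):
    """Eq. (5): weighted sum; 0.7 is the value reported in the experiment."""
    return alpha * token_similarity + (1 - alpha) * interest_similarity


def score(similarity):
    """Eq. (6), assumed linear: similarity 0 maps to -10 and similarity 1 to 20."""
    return -10 + 30 * similarity
```

For example, `time_spent([0, 30, 40])` yields `[30, 10, 20.0]`: the last request has no successor, so it receives the average of the two measured gaps.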
An optimal matching is an alignment with the highest possible score. As mentioned earlier, the problem of finding the optimal matching of sequences can be solved using dynamic programming, i.e. the similarity of session sequences can be computed by considering the contribution of the similarity of the pages at the head of each sequence and the maximum similarity of the remaining subsequences. This process can be described using a matrix in which one sequence (session) is placed along the top and the other sequence (session) is placed along the left side of the matrix. An example of such a matrix is illustrated in Fig. 3. In this figure, each webpage in a session is shown using the tokens corresponding to each level of the webpage URL in the structure of the website. Also, the time the user has spent on each webpage is shown in parentheses for each webpage.

Figure 3. An example of a session matching matrix.

In order to calculate the optimal matching using the sequence alignment matrix, a gap is added to the start of each sequence, which indicates the starting point of the matching. The goal is to find an optimal path from the top left corner to the bottom right corner of the matrix. In each step, we can only make a right, down, or diagonal move. A right move corresponds to inserting a gap into the sequence on the left and matching the sequence on the top with a gap, while a down move corresponds to inserting a gap into the sequence on the top and matching the sequence on the left with a gap. In each step, the score for each of the three moves is calculated and the maximum of them is added to the current score, which was obtained from the previous moves. In other words, the direction which provides the maximum score is chosen in each step. The optimal path is then recovered by tracing back from the bottom right corner to the starting point. In the given example, the optimal path found through this traceback is shown by arrows in Fig. 3. The score in the lower right corner is the optimal sequence alignment score.

In our scoring system, the optimal score cannot be larger than the length of the shorter session multiplied by 20. Also, it cannot be smaller than the length of the longer session multiplied by -10. Therefore, the final similarity measure can be calculated by normalizing the optimal score with respect to these maxima and minima. In the case of our example, the similarity of the web sessions is

$$\frac{69.17 - (-10 \times 6)}{(20 \times 5) - (-10 \times 6)} = \frac{129.17}{160} \approx 0.81$$

Using this definition, the similarity value for web sessions will always be between 0 and 1.

III. EXPERIMENTAL EVALUATION

Many attempts have been made to evaluate clustering goodness and to find rules to quantify the quality of a clustering result. Cluster validation for large datasets of categorical data such as web session data is a very hard task. Previous works that proposed using the sequence alignment method to cluster web session data validated their experimental results manually rather than quantitatively [8], [9].
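For reference, the alignment and normalization described under "Similarity of Web Sessions", which is the measure evaluated in this section, can be sketched in Python as follows. This is a minimal sketch of a standard global alignment by dynamic programming, assuming a caller-supplied `pair_score` function that returns a value between -10 and 20 for two pages; the function and constant names are hypothetical, and returning 1.0 for two empty sessions is an assumption.

```python
GAP_SCORE = -10    # score for aligning a page with a gap
MATCH_SCORE = 20   # score for two pages with similarity 1


def alignment_score(session_a, session_b, pair_score):
    """Optimal global alignment score of two page sequences."""
    rows, cols = len(session_a) + 1, len(session_b) + 1
    # table[i][j] = best score for the first i pages of A and the first j pages of B.
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + GAP_SCORE
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + GAP_SCORE
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                # diagonal move: align the two pages with each other
                table[i - 1][j - 1] + pair_score(session_a[i - 1], session_b[j - 1]),
                table[i - 1][j] + GAP_SCORE,   # down move: page of A against a gap
                table[i][j - 1] + GAP_SCORE,   # right move: page of B against a gap
            )
    return table[-1][-1]


def normalized_similarity(best, len_a, len_b):
    """Rescale an optimal score to [0, 1] using the bounds stated in the text."""
    upper = MATCH_SCORE * min(len_a, len_b)
    lower = GAP_SCORE * max(len_a, len_b)
    return (best - lower) / (upper - lower)


def session_similarity(session_a, session_b, pair_score):
    if not session_a and not session_b:
        return 1.0  # assumption: two empty sessions count as identical
    best = alignment_score(session_a, session_b, pair_score)
    return normalized_similarity(best, len(session_a), len(session_b))


print(round(normalized_similarity(69.17, 5, 6), 3))  # the worked example above
```

Filling the table takes time proportional to the product of the two session lengths, and this cost is paid once for every pair of sessions when a full pairwise matrix is built.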
Due to the large number of web pages in a typical website, usually a large number of web sessions are needed to fully represent the possible usage patterns over that website. However, as the sequence alignment method is a time-consuming process, dealing with a large dataset increases the processing time for constructing clusters. The empirical evaluation reported in this paper concerns the question of whether the proposed method, which considers both the similarity of web pages based on their URLs and their usage similarity in order to define a scoring function for the sequence alignment method, can properly reflect the nature of session data.

The web usage data used for this experiment were collected from the Music Machine website by Perkowitz and Etzioni [13]. The original data used in our experiment contain requests from the log files of the web server for five random days. After preprocessing the original data using the methods described in [14], 2664 web sessions were extracted.

We compared our method to the one presented in [9], which we refer to as the extended Sequence Alignment (SA) method. As described earlier, in the extended SA method, only the accurate viewing time of accessed pages is taken into consideration for defining the usage similarity between web sessions. This method is claimed to perform better in revealing the nature of the data than the normal SA method presented in [8]. Also, it has been shown that the SA method reflects the sequential nature of web navigations in web session clustering more effectively than other similarity measures such as the Jaccard coefficient [8].

To compare our method with extended SA, distance matrices holding pairwise sequence alignment distance measures between web sessions are obtained using our proposed method and extended SA. Since the similarity measures obtained from both methods always range from 0 to 1, the distance between sessions can be calculated by subtracting the similarity of the web sessions from 1. Finally, the obtained distance matrices can be used for clustering web sessions.
For our experiment, we applied agglomerative clustering using the average linkage method [1] to construct a linkage tree. The average linkage method is not very susceptible to noise and outliers in the input data. After constructing the linkage tree, we cut the tree in order to generate the desired number of clusters from the session data. In this experiment, the parameter $\alpha$ used in (5) is set to 0.7.

To evaluate the effectiveness of the similarity measure in this paper, the average Silhouette Coefficient (SC) [2] is calculated for the web sessions in their clusters. The Silhouette coefficient considers both cohesion and separation of data points for evaluating clusters. The Silhouette coefficient value for the $i$-th web session ($s_i$) can be calculated using (7).

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{7}$$

In (7), $a_i$ is the average distance of the $i$-th session from the other sessions in its own cluster. The parameter $b_i$ is the minimum of the average distances from the $i$-th session to the sessions in the other clusters. The value of the Silhouette coefficient can vary from -1 to 1; the closer the value is to 1, the better the clustering result. The average Silhouette coefficient values for different numbers of clusters (ranging from 2 to 25) for the distance matrices are shown in Fig. 4.
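The evaluation pipeline above (distance matrix, average linkage clustering, average Silhouette coefficient) can be sketched in plain Python as follows. This is an illustrative sketch, not the experimental code: it uses a naive agglomerative loop that stops when the requested number of clusters remains, which is assumed here to match cutting the linkage tree at that number of clusters, and it assigns a Silhouette value of 0 to sessions in singleton clusters, a common convention. The function names are hypothetical, and `average_silhouette` expects at least two clusters.

```python
def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage.

    `dist` is a symmetric matrix (list of lists) of pairwise distances,
    for example 1 - similarity. Returns a list of clusters (lists of indices).
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average of all pairwise distances between the two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > max(n_clusters, 1):
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def average_silhouette(dist, clusters):
    """Mean of Eq. (7) over all sessions; requires at least two clusters."""
    scores = []
    for cluster in clusters:
        for i in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(dist[i][j] for j in cluster if j != i) / (len(cluster) - 1)
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for other in clusters
                if other is not cluster
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy distance matrix: sessions 0 and 1 are close, as are sessions 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
clusters = average_linkage(dist, 2)
print(clusters, average_silhouette(dist, clusters))
```

On the toy matrix the two tight pairs are recovered, and every session has $a_i = 0.1$ and $b_i = 0.9$, so the average Silhouette coefficient is 8/9. The pairwise search makes this loop far too slow for thousands of sessions; it is meant only to make the definitions concrete.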

Figure 4. Comparison of the average SC values for different numbers of clusters, generated using the pairwise distance matrices of the proposed method and the extended SA method. (Horizontal axis: number of clusters; vertical axis: average Silhouette coefficient; series: proposed method, extended SA method.)

As we can see in Fig. 4, the average SC values for clusters generated by the distance measure of the proposed method are higher than the SC values of clusters generated by the distance measure of the extended SA method. This result suggests that our approach can properly reflect the nature of session data, since we have considered not only the sequential nature of web sessions but also the similarity of usage patterns when comparing web sessions.

Fig. 4 also shows that the average SC value decreases as the number of clusters increases. The reason is that, by increasing the number of clusters, we cut the linkage tree at lower values of distance between clustering objects. The result is smaller clusters which may have higher inter-cluster similarities.

IV. CONCLUSIONS AND FUTURE WORK

As described in this paper, web session clustering is an important task for grouping web sessions with similar trends. This is an essential process for effective website management, web personalization, and web recommender systems. Accurate clustering of web sessions depends heavily on the similarity (or dissimilarity) measure defined to compare web sessions. In this paper, we proposed a new similarity measure for web session clustering. We considered the time a user spends on a webpage and the frequency of visits to a particular webpage within a session in order to estimate the interestingness of that page to the user. Then, we defined the similarity of web pages within sessions based on the conjunction of the URL similarity of the web pages and the similarity of their interestingness to users. Finally, we employed the sequence alignment method in order to find the best match between session sequences and estimate their similarity. We compared our method with a case in which only the time spent on a webpage was considered for defining usage pattern similarity.
The evaluation was performed by measuring the average Silhouette coefficient of sessions clustered using agglomerative clustering with the average linkage method. The experimental results support the effectiveness of our method. However, the time complexity of SA methods is still high. As described, we estimated the similarity of web pages based on the similarity of the hierarchical structure of their URLs while ignoring the content of the web pages. To obtain a better estimate of the similarity of web pages, we can use other methods proposed for web content mining, such as information retrieval or semantic web approaches. Furthermore, for a more general evaluation, we can use a larger collection of web session data and apply different clustering algorithms to these data.

REFERENCES

[1] T. Hussain, S. Asghar, and S. Fong, "A hierarchical cluster based preprocessing methodology for Web Usage Mining," in 6th International Conference on Advanced Information Management and Service (IMS), 2010.
[2] P. J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20.
[3] K. Charter, J. Schaeffer, and D. Szafron, "Sequence alignment using FastLSA," in International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS), 2000.
[4] O. Nasraoui, C. C. Uribe, C. R. Coronel, and F. Gonzalez, "TECNO-STREAMS: tracking evolving clusters in noisy data streams with a scalable immune system learning model," in Third IEEE International Conference on Data Mining, ICDM, 2003.
[5] T. W. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal, "From user access patterns to dynamic hypertext linking," Computer Networks and ISDN Systems, vol. 28, no. 7-11.
[6] R. Forsati, M. R. Meybodi, and A. Rahbar, "An efficient algorithm for web recommendation systems," in IEEE/ACS International Conference on Computer Systems and Applications, Los Alamitos, CA, USA, 2009, vol. 0.
[7] A. Banerjee and J. Ghosh, "Clickstream clustering using weighted longest common subsequences," in Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001.
[8] Weinan Wang and O. R. Zaiane, "Clustering Web sessions by sequence alignment," in Proceedings 13th International Workshop on Database and Expert Systems Applications, 2002.
[9] Chaofeng Li and Yansheng Lu, "Similarity Measurement of Web Sessions by Sequence Alignment," in IFIP International Conference on Network and Parallel Computing Workshops, NPC Workshops, 2007.
[10] C. Li, "Research on Web Session Clustering," Journal of Software, vol. 4, no. 5.
[11] S. Guha, R. Rastogi, and K. Shim, "Rock: A robust clustering algorithm for categorical attributes," Information Systems, vol. 25, no. 5.
[12] B. Hay, G. Wets, and K. Vanhoof, "Segmentation of visiting patterns on web sites using a sequence alignment method," Journal of Retailing and Consumer Services, vol. 10, no. 3.
[13] M. Perkowitz and O. Etzioni, "Adaptive sites: Automatically learning from user access patterns," in Proceedings of the 6th International World Wide Web Conference, Santa Clara, California.
[14] V. Sathiyamoorthi and V. M. Bhaskaran, "Data Preparation Techniques for Web Usage Mining in World Wide Web-An Approach," International Journal of Recent Trends in Engineering, vol. 2, no. 4.
