Mining Web Logs for Personalized Site Maps
Fergus Toolan, Nicholas Kushmerick
Smart Media Institute, Computer Science Department, University College Dublin
{fergus.toolan,

Abstract. Navigating through a large Web site can be a frustrating exercise. Many sites employ Site Maps to help visitors understand the overall structure of the site. However, by their very nature, unpersonalized Site Maps show most visitors large amounts of irrelevant content. We propose techniques based on Web usage mining to deliver Personalized Site Maps that are specialized to the interests of each individual visitor. The key challenge is to resolve the tension between simplicity (showing just relevant content) and comprehensibility (showing sufficient context so that visitors can understand how the content is related to the overall structure of the site). We develop two baseline algorithms (one that relies on shortest paths, and one that mines the server log for popular paths), and compare them to a novel approach that mines the server log for popular path fragments that can be dynamically assembled to reconstruct popular paths. Our experiments with two large Web sites confirm that the mined path fragments provide much better coverage of visitors' sessions than the baseline approach of mining entire paths.

1. Introduction

Finding relevant information in a large Web site can be tedious and frustrating. Site Maps are commonly used by Web developers to help visitors understand and navigate complex sites. For example, Figure 1(a) shows a portion of Apple.com's Site Map. By their very nature, Site Maps present nearly all of a Web site's content. Of course, most visitors are interested in just a small subset of this content [3]. Figure 1(b) illustrates how this Site Map could be personalized for a particular visitor who is interested in just a few aspects of Apple.com. Our research goal is to develop techniques that enable Web sites to automatically deliver Personalized Site Maps.
Achieving this goal involves solving two sub-problems. The first challenge is to determine which content items (i.e., Web pages) each visitor is actually interested in. The second challenge is to display these relevant pages in a way that helps visitors understand how they are related. Web designers invest substantial effort in crafting Site Maps in order to help visitors understand the overall structure of the Web site, and Personalized Site Maps must not throw the baby out with the bathwater by ignoring this structure. For example, a visitor interested in 17 inch Studio Displays should not simply be pointed to the most relevant page; she should also be shown how that page is related to the site's Products section.

We adopt a simple solution to the first problem: we assume that the visitor expresses her interests with an explicit query such as "inexpensive studio displays" that is processed with standard information retrieval techniques. We defer to future work more sophisticated approaches based on collaborative filtering and other forms of user models.

In this paper, we focus on the second sub-problem: how to organize a set of relevant Web pages to reflect the site's structure. We note that there is a trade-off between two competing considerations. On the one hand, Personalized Site Maps should be as simple as possible. This suggests a trivial approach in which a Personalized Site Map displays just the shortest paths from each of the relevant pages to the site's home page. On the other hand, short paths are not necessarily intuitively meaningful to visitors [4]. For example, Web sites often contain numerous navigational cross-links, so the shortest path between two pages may well involve completely unrelated parts of the site. We adopt the assumption that the most comprehensible path between two pages will be the one that has been most popular with previous site visitors [7]. We therefore add to our Personalized Site Maps paths that have been frequently traversed by past site visitors, which are often not the shortest.

Submitted to First International Workshop on Mining for Enhanced Web Search; draft of 01/08/

The technical challenge of our work concerns how to compute the most popular path between a given pair of pages. The naïve approach would be to extract the N most popular paths from the server's access log. However, given the inherent diversity of visitors' interests, N must be extremely large in order to obtain sufficient coverage over actual site visitors. To address this sparseness, we propose a novel algorithm for mining fragments of paths, rather than entire paths, from the server logs, and then assembling the fragments. For example, suppose that A>B>C>D>E>F>G and A>B>C>D>H>I>J are two paths that occur frequently in past visitors' sessions, where the notation x>y indicates a traversal by a particular visitor from page x to page y. Using the naïve approach we need to store the two paths in their entirety. However, we could store only A>B>C>D, D>E>F>G and D>H>I>J and then recreate the full paths. This path fragment method allows us to compress the previous sessions much more than storing entire paths.

We make the following contributions. First, we formalize the problem of constructing Personalized Site Maps (Section 2). Second, we describe our algorithm for solving this problem, which mines popular path fragments from server logs (Section 3). Finally, using data from two Web sites, we empirically demonstrate that shortest paths are often quite unpopular (thus providing evidence that shortest paths are not intuitively meaningful), and that our mined path fragments provide better coverage than simply storing entire paths (Section 4).

Figure 1: (a) The Apple.com Site Map, and (b) a fictitious personalized version that displays only the few pages that are relevant to a particular visitor.

2. Problem formalization

We formalize the problem of constructing a Personalized Site Map as follows. We take as input a Web site's graph $G = (V, E)$ and its distinguished home page (root) $r \in V$. Each node in $V$ corresponds to a Web page, and a directed edge $(u, v) \in E$ represents a hyperlink between the corresponding documents. We also assume that a set of relevant nodes $R = \{r_1, r_2, \ldots, r_n\}$ has been identified during the initial relevance-assessment step.

A Personalized Site Map is a subgraph $G' = (V', E')$ of the original graph, such that $G'$ contains the root and the relevant nodes (i.e., $\{r, r_1, r_2, \ldots, r_n\} \subseteq V' \subseteq V$), as well as sufficient additional nodes and edges from $G$ so that $G'$ contains a path from the root $r$ to each relevant node $r_i$.

So far, this personalization task is highly under-constrained, as there may be many such subgraphs $G' = (V', E')$. To decide between alternative subgraphs, we exploit the actual visitor usage data from the site's server log. The intuition is that we want to select the alternative $G'$ whose edges $E'$ are the most popular among previous site visitors.

Our task thus reduces to the following: given a Web site server log, we want to mine sufficient data from the log in order to be able to reliably reconstruct the most popular path from any node $u \in V$ to any other node $v \in V$. Naturally, without access to the entire log some data will necessarily be lost and this reconstruction process cannot be perfect. We will therefore be interested in empirically comparing the coverage of alternative algorithms.

3. Algorithms

We begin with a brief discussion of server log pre-processing. We then describe three alternative algorithms for constructing Personalized Site Maps. The first baseline algorithm, SP, ignores the server log and simply assumes that shorter paths are more popular than longer paths. The second algorithm, PP, extracts the N most popular paths from the server log, and tries to reconstruct the most popular path between two pages using these N paths. As mentioned above, PP is ineffective because path traversal logs for large graphs are necessarily very sparse and thus N must be very large to ensure adequate coverage. Our third algorithm, MP, mines path fragments from server logs, and then dynamically assembles them into a path from a given node u to another node v. Since MP discards strictly more information than PP, it can make mistakes, but our experiments in Section 4 demonstrate its effectiveness in practice.

3.1. Server log pre-processing. Web server logs contain a large amount of noise which must be discarded, and they often lack data that must be inferred [1]. Noise corresponds to requests for images, applets, etc., which are logged yet are irrelevant for our purposes because they are embedded in page views. Data may be missing due to caching by, for example, the browser or Internet service provider. This arises most commonly when the visitor uses the browser's back button. For example, if a user traverses the path u>v, then hits the back button, and then traverses u>w, this will appear in the log as u>v>w. We use a simple path completion algorithm to automatically insert entries that must be missing due to the known structure of the site graph. The final problem with server logs is that requests are stored in the order that the server receives them.
Specifically, if multiple people are browsing the site concurrently, their requests are intermingled in the log file. We use simple session extraction heuristics to segment the entire log into a sequence of sessions. First, we partition requests by IP address. Second, we use an inter-access delay threshold D to split a sequence of accesses from a given IP into one or more sessions. Based on our preliminary experiments described in Section 4.2, we set D = 15 minutes.

3.2. SP ("shortest path") algorithm. The simplest technique in any route planning system is to take the shortest path between two points [4]. Essentially, in order to estimate the most popular path from node u to node v, the SP algorithm ignores past visitors entirely and simply assumes that short paths are more popular than long paths. To evaluate the SP algorithm, we measure its coverage. The coverage of the SP algorithm is the fraction of extracted sessions in which users went from page u to page v via the shortest path. In Section 4 we empirically demonstrate that the coverage of SP is in fact quite low.

3.3. PP ("popular paths") algorithm. The PP algorithm simply records the N most frequent sessions extracted from the server log during the pre-processing step. It is an example of sequential pattern discovery from Web logs as seen in [5] and [2]. The coverage of PP is the fraction of extracted sessions in which the user navigated from page u to page v via the most popular path from u to v. In Section 4 we demonstrate that N must be quite large in order to obtain sufficient coverage over the entire Web graph.

3.4. MP ("mined paths") algorithm. The MP algorithm expands each server log session into the set of all its subpaths of length between K_min and K_max. The N most popular such fragments are then used to reconstruct a path from a page u to a page v. To do so, MP considers all possible ways to assemble the mined fragments, subject to the constraint that adjacent fragments must overlap on A pages.
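As a rough illustration of the fragment mining and assembly just described, the following Python sketch enumerates subpaths, keeps the N most frequent, and chains fragments that overlap on A pages. It is not the authors' implementation: the function names (`subpaths`, `mine_fragments`, `join`, `assemble`), the `max_pages` cut-off, and the breadth-first search that returns the first path found (rather than the most popular one) are all assumptions made for the sake of a small example.

```python
from collections import Counter


def subpaths(session, k_min, k_max):
    """Yield every contiguous subpath of `session` with k_min..k_max pages."""
    for k in range(k_min, min(k_max, len(session)) + 1):
        for i in range(len(session) - k + 1):
            yield tuple(session[i:i + k])


def mine_fragments(sessions, n, k_min, k_max):
    """Return the n most frequent subpaths (fragments) over all sessions."""
    counts = Counter()
    for session in sessions:
        counts.update(subpaths(session, k_min, k_max))
    return [fragment for fragment, _ in counts.most_common(n)]


def join(left, right, a):
    """Assemble two fragments if the last `a` pages of `left` equal the
    first `a` pages of `right`; return None otherwise."""
    if len(left) >= a and len(right) >= a and left[-a:] == right[:a]:
        return left + right[a:]
    return None


def assemble(fragments, u, v, a, max_pages=30):
    """Breadth-first search over chains of fragments for a path from u to v.

    Assumption: returns the first path found (fewest fragments), which is
    not necessarily the most popular one. `max_pages` is a hypothetical
    safety bound on path length.
    """
    frontier = [f for f in fragments if f[0] == u]
    seen = set(frontier)
    while frontier:
        next_frontier = []
        for path in frontier:
            if v in path[1:]:
                return path[:path.index(v, 1) + 1]
            for fragment in fragments:
                joined = join(path, fragment, a)
                if (joined is not None
                        and len(path) < len(joined) <= max_pages
                        and joined not in seen):
                    seen.add(joined)
                    next_frontier.append(joined)
        frontier = next_frontier
    return None
```

With the parameter values reported in this paper, one would call `mine_fragments(sessions, n, 4, 15)` and `assemble(fragments, u, v, 2)`; ranking alternative assemblies by fragment frequency is left out of this sketch.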
For instance, if A=2 then the two fragments u>v>w and v>w>x can be assembled to create a path u>v>w>x. This overlap constraint corresponds to an assumption that Web navigation can be modelled as a Markov process of order A [9]. In our experiments we use A=2, K_min=4, and K_max=15. We leave to future work a systematic exploration of optimal values for these parameters. The coverage of MP is the fraction of extracted sessions that can be recovered from the mined fragments. In Section 4 we demonstrate that, for a given value of N, MP has better coverage than PP.

4. Experiments

We now describe an experimental evaluation of the techniques discussed in the previous section. We begin with a discussion of the two datasets we used for our experiments. We then describe the results of experiments designed to answer the following questions:

1. How sensitive are our results to the inter-access delay threshold D used to segment the raw server log into sessions? (Section 4.2)
2. How frequently is the shortest path between two pages the most popular path? (Section 4.3)
3. How does the coverage of PP compare to that of MP, as a function of the amount N of mined data? (Section 4.4)

4.1. Datasets. We evaluated our techniques on two Web sites: the server for the Computer Science Department of University College Dublin, and Music Machines (machines.hyperreal.org; see footnote 1). Figure 2 summarises these datasets.

| | UCD CS | Music Machines |
| Time Period | Apr 2000 to Dec 2001 | Feb 1997 to Apr 1999 |
| Total Requests | 4,327,397 | 14,722,468 |
| After Pre-processing | 1,258,643 | 2,996,322 |
| Number of Distinct IPs | 55, | ,092 |
| Number of Sessions | 236, | ,801 |
| Mean Session Length | | |

Figure 2: Summary of the experimental data. The total number of requests includes images, applets, etc. The number after pre-processing is the number of requests for actual page views. The number of distinct IPs is the number of IP addresses from which the server received requests in the time period.
Note that the number of IP addresses is not equal to the number of actual visitors due to noise introduced by proxy servers, and it is not equal to the number of sessions because each visitor may initiate several sessions in the log file time period.

Footnote 1: The Music Machines server logs were archived by Mike Perkowitz and are available at

4.2. Threshold Experiments. The first experiments relate to the inter-access delay threshold D used to segment the raw server log into sessions. Specifically, we want to ensure that the results from our subsequent experiments are not overly sensitive to the setting of this free parameter.
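The session segmentation whose threshold D is examined here can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `extract_sessions`, the `(ip, timestamp_seconds, page)` tuple layout, and the choice to start a new session when the gap strictly exceeds D are assumptions, since the paper does not specify these details.

```python
from collections import defaultdict


def extract_sessions(requests, delay_threshold_s=15 * 60):
    """Split page requests into sessions.

    `requests` is an iterable of (ip, timestamp_seconds, page) tuples in
    any order (hypothetical layout). Requests are first partitioned by IP
    address, then each IP's requests are split wherever two consecutive
    accesses are more than `delay_threshold_s` seconds apart.
    """
    by_ip = defaultdict(list)
    for ip, timestamp, page in requests:
        by_ip[ip].append((timestamp, page))

    sessions = []
    for hits in by_ip.values():
        hits.sort(key=lambda hit: hit[0])  # order each IP's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > delay_threshold_s:
                sessions.append(current)  # gap too long: close the session
                current = []
            current.append(page)
        sessions.append(current)
    return sessions
```

Re-running such a function with different values of `delay_threshold_s` (for example 5 to 45 minutes) and counting the resulting sessions is the kind of sensitivity check this section reports.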
Figure 3: Number of sessions extracted from the UCD log, as a function of the inter-access delay threshold D. (Axes: number of sessions vs. inter-access delay threshold D in minutes.)

Figure 3 shows the number of distinct sessions extracted from the log files of the UCD Web site as the session threshold increases from five to 45 minutes. While the number of sessions grows rapidly as D decreases, the variation is much smaller at the intuitively reasonable larger thresholds.

Our second experiment compared the overlap between the popular paths mined by the PP algorithm, using thresholds of 10, 15 and 20 minutes. As shown in Figure 4, there is a substantial overlap between the various sets of mined paths.

Figure 4: Overlap between paths mined from the UCD log by the PP algorithm, for three pairs of inter-access delay thresholds. (Axes: overlap (%) vs. compared inter-access delay thresholds in minutes.)

Based on this data, we conclude that our technique is relatively stable across values of the inter-access delay threshold D. We set the threshold D=15 minutes for the remainder of the experiments.

4.3. Comparison of shortest and popular paths. The next experiment seeks to confirm that shortest paths are not necessarily the most popular. Figure 5 shows the fraction of popular paths that are in fact the shortest path, as a function of the number N of paths mined by the PP algorithm, for both Web sites. For example, of the N=40 most popular paths mined by PP, 60% are in fact the shortest path. We can see that for the most popular paths (i.e., for small values of N), shortness is indeed a good proxy for popularity. However, as N increases and less popular paths are included, the overlap between PP and SP decreases substantially. We conclude that, as predicted, popular paths are frequently not the shortest.

Figure 5: Overlap between popular and shortest paths for the UCD and Music Machines (MM) sites. (Axes: overlap vs. number of mined paths N.)

4.4. Coverage of mined and popular paths. So far, our experiments have been concerned with demonstrating that SP and PP do indeed generate different paths. In this section we investigate the benefits of using mined path fragments as opposed to just using the popular paths. For various values of N, we measure the fraction of the extracted sessions that can be reconstructed in their entirety using MP. With N=5000 we can reconstruct 27% of sessions from the UCD log file in their entirety, and 14% of the Music Machines sessions.

We can generalize this experiment by measuring the fraction of each individual session that each algorithm can reconstruct. That is, we know that 27% of the UCD sessions can be fully (100%) reconstructed, but presumably many other sessions can be, say, 75% reconstructed. We therefore measured the average fraction of a given session that can be reconstructed by each of the two algorithms. For both Web sites, our experimental results in Figures 6 and 7 demonstrate that the mined path fragments can be used to reconstruct a greater proportion of individual traversals than relying solely on popular paths. Specifically, we measure the coverage of MP and PP for various values of N up to 1000, and find that for each value of N, MP has higher coverage than PP, and the coverage gap grows rapidly as N increases.

Figure 6: Coverage of the MP and PP algorithms for the UCD site. (Axes: coverage, 0% to 30%, vs. number of mined paths or fragments N; series: Popular, Mined, Difference.)
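The per-session measure discussed in this section can be approximated with a short Python sketch. The paper does not spell out how the reconstructed fraction of a session is computed, so the definition below is an assumption: a position in a session counts as covered if it lies inside an occurrence of some mined pattern (a popular path or a fragment), and the overlap constraint A is ignored for simplicity. The function names are hypothetical.

```python
def covered_fraction(session, patterns):
    """Fraction of positions in `session` that fall inside an occurrence
    of any pattern (assumed definition; overlap constraint A is ignored)."""
    session = tuple(session)
    if not session:
        return 0.0
    covered = [False] * len(session)
    for pattern in patterns:
        pattern = tuple(pattern)
        k = len(pattern)
        for i in range(len(session) - k + 1):
            if session[i:i + k] == pattern:
                for j in range(i, i + k):
                    covered[j] = True
    return sum(covered) / len(session)


def mean_coverage(sessions, patterns):
    """Average covered fraction over all sessions."""
    fractions = [covered_fraction(s, patterns) for s in sessions]
    return sum(fractions) / len(fractions) if fractions else 0.0


def full_coverage_rate(sessions, patterns):
    """Fraction of sessions whose every position is covered."""
    if not sessions:
        return 0.0
    full = sum(1 for s in sessions if covered_fraction(s, patterns) == 1.0)
    return full / len(sessions)
```

Under this assumed definition, `full_coverage_rate` plays the role of the "reconstructed in their entirety" figure and `mean_coverage` the role of the average per-session fraction; passing whole popular paths as `patterns` gives a PP-style number, and passing mined fragments gives an MP-style number.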
Figure 7: Coverage of the MP and PP algorithms for the Music Machines site. (Axes: coverage, 0% to 3%, vs. number of mined paths or fragments N; series: Popular, Mined, Difference.)

5. Related Work

Previous research in Web usage and server log mining addresses two major issues: the pre-processing of the raw data, and the discovery of patterns or rules in the data. Our work relates to both of these areas.

Pre-processing is discussed in detail in [1]. The aim of pre-processing Web server log files is to obtain the set of sessions (visits) recorded in the log files. It can be divided into three distinct phases: data cleansing, user/session identification, and path completion. Our system implements all of these components of Web usage mining.

The second phase of Web usage mining is pattern discovery [1,2,5,6]. Pattern discovery involves the extraction of meaningful information, such as association rules, classification rules, or sequential patterns. The PP and MP algorithms can be seen as the pattern discovery phase for the Personalized Site Map task.

The construction of improved site maps is discussed in [3]. Li et al. discuss the need for topic-focused site maps that home in on the user's interests and display the corresponding section of the map. They also discuss the granularity of the site map, that is, the level of detail the map should show. Their method extracts logical domains from the Web site, where each logical domain is associated with a certain topic. Unlike our system, they use semantic knowledge from the pages' contents.

6. Future Work

Our Personalized Site Map algorithms have been fully implemented. Our current focus involves measuring the effectiveness of our approach. Our experiments have demonstrated that our technique works well, in the sense that we are able to build site maps containing popular (as opposed to merely short) paths.
We believe that users will find popular paths more intuitive than paths that are merely short, but we have not yet established this empirically. We intend to conduct user trials of the system to obtain users' judgements of the quality of the Personalized Site Maps. For example, one important topic is the generality/specificity of the pages on the paths: do the pages earlier in the path contain more general information than later pages?

We are also exploring other applications for our path-mining algorithm. At its core, we have developed an approach to predicting which pages are likely to be viewed next, given a prefix of a visitor's trajectory. Therefore, a second potential application concerns using this predictive ability for pre-fetching and caching [8].
Our preliminary investigation of pre-fetching shows promising results. For the UCD site, we allow the PP algorithm to recommend a likely next page after each session prefix. The entire dataset contains over 236,000 sessions, leading to 832,307 recommendations from PP. Of these recommendations, 43,349 (5.2%) were correct (i.e., the user did indeed visit the recommended page next). We intend to extend this experiment to caching of multiple pages, and to compare our approach to existing page-prediction algorithms.

Another possible direction would be to introduce a collaborative element to the system. We could rate each popular path for a user based on whether or not it appears in his sessions. Standard collaborative filtering techniques could then be used to recommend a particular popular path with greater confidence than our current PP algorithm.

7. Conclusions

We have introduced the problem of automatically constructing Personalized Site Maps. The key challenge is to display to the visitor a subgraph that both contains relevant content items and organizes them in a coherent and meaningful manner. Our approach is based on the assumption that the best way to indicate the relationship between a given pair of pages is to show the path between them that has been most popular with past visitors. Based on this assumption, we propose a naïve algorithm (PP) for mining popular paths from raw server logs, and a more sophisticated algorithm (MP) for mining path fragments. The key idea of MP is to mine a collection of path fragments that can be dynamically assembled in order to reconstruct many popular paths. Our experiments with two large Web sites confirm that MP can reconstruct a larger fraction of visitors' sessions than PP.

Acknowledgments. We thank Barry Smyth for helpful discussions. This research was funded by grant N from the US Office of Naval Research, and grant 01/F.1/C015 from Science Foundation Ireland.
References

[1] Rob Cooley. Web Usage Mining. PhD Thesis, Department of Computer Science, University of Minnesota, 2001.
[2] W. Gaul and L. Schmidt-Thieme. Mining web navigation path fragments. In Proceedings of the Workshop on Web Mining for E-Commerce: Challenges and Opportunities, Boston, MA, August 2000.
[3] W-S. Li, N. F. Ayan, O. Kolak, and Q. Vu. Constructing Multi-Granular and Topic Focused Web Sites. In Proceedings of WWW
[4] L. McGinty and B. Smyth. Case Based Route Planning. In Proceedings of the 11th Conference on Artificial Intelligence and Cognitive Science, Galway, Ireland,
[5] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalisations and Performance Improvements. In Proceedings of the 5th International Conference Extending Database Systems
[6] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Vol. 1, Issue 2,
[7] A. Wexelblat and P. Maes. Footprints: History-Rich Tools for Information Foraging. In Proceedings of CHI 99 Conference on Human Factors in Computing,
[8] Q. Yang, H. H. Zhang, and T. Li. Mining Web Logs for Prediction in WWW Caching and Pre-fetching. In Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), San Francisco
[9] A. Ypma and T. Heskes. Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov Models. In Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining, WEBKDD 02, July, Edmonton, Canada.
More informationInferring User Search for Feedback Sessions
Inferring User Search for Feedback Sessions Sharayu Kakade 1, Prof. Ranjana Barde 2 PG Student, Department of Computer Science, MIT Academy of Engineering, Pune, MH, India 1 Assistant Professor, Department
More informationA New Web Usage Mining Approach for Website Recommendations Using Concept Hierarchy and Website Graph
A New Web Usage Mining Approach for Website Recommendations Using Concept Hierarchy and Website Graph T. Vijaya Kumar, H. S. Guruprasad, Bharath Kumar K. M., Irfan Baig, and Kiran Babu S. Abstract To have
More informationReview Paper Approach to Recover CSGM Method with Higher Accuracy and Less Memory Consumption using Web Log Mining
ISCA Journal of Engineering Sciences ISCA J. Engineering Sci. Review Paper Approach to Recover CSGM Method with Higher Accuracy and Less Memory Consumption using Web Log Mining Abstract Shrivastva Neeraj
More informationCollaborative Filtering using a Spreading Activation Approach
Collaborative Filtering using a Spreading Activation Approach Josephine Griffith *, Colm O Riordan *, Humphrey Sorensen ** * Department of Information Technology, NUI, Galway ** Computer Science Department,
More informationWeb Mining. Data Mining and Text Mining (UIC Politecnico di Milano) Daniele Loiacono
Web Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management
More informationAC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery
: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Hong Cheng Philip S. Yu Jiawei Han University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center {hcheng3, hanj}@cs.uiuc.edu,