Enhanced Web Log Based Recommendation by Personalized Retrieval


Xueping Peng
FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY
UNIVERSITY OF TECHNOLOGY, SYDNEY

A thesis submitted for the degree of Doctor of Philosophy
February 2015

CERTIFICATE OF AUTHORSHIP/ORIGINALITY

I certify that the work in this thesis has not previously been submitted for a degree nor has it been submitted as part of requirements for a degree except as fully acknowledged within the text. I also certify that the thesis has been written by me. Any help that I have received in my research work and the preparation of the thesis itself has been acknowledged. In addition, I certify that all information sources and literature used are indicated in the thesis.

Signature of Student

Acknowledgements

First, I would like to thank my supervisors, Prof Chengqi Zhang, Prof Zhendong Niu and Dr Ling Chen, who introduced me to the wonderful world of research. They not only gave me invaluable academic advice, but also helped my transition into a different culture. Prof Chengqi Zhang has been a great mentor and collaborator, being both energetic and full of ideas. Prof Zhendong Niu, who works in the School of Computer Science, Beijing Institute of Technology, guided me in information retrieval and web log mining. Dr Ling Chen helped me by asking insightful questions and giving me thoughtful comments on the thesis. I enjoyed working with them, and benefited enormously from my interactions with them.

I spent nearly two and a half years at the University of Technology, Sydney. I thank the collaborators, faculty staff, fellow students and friends in the QCIS centre, who made my graduate life a very memorable experience. In particular, I thank you if you are reading this thesis.

I thank my family. My parents richly endowed me with curiosity about the natural world. Last but not least, my deep gratitude is extended to my dear wife Suling Niu, who brings me so much love and happiness. It is no exaggeration to say that she helped to make my thesis writing an enjoyable endeavor.

Contents

List of Figures
List of Tables

1 Introduction
  Background
  Research Questions
  Research Objectives
  Significance and Main Contributions
  Research Methodology
  Thesis Structure
  Publications Related to This Thesis

2 Literature Review
  Web Log Mining
    Technologies
      Data Collection
      Data Preprocessing
      Mining Algorithms
      Pattern Analysis
    Research and Applications
    Challenges
  Personalized Retrieval
    Query Expansion
    Result Processing
    Challenges
  Recommender System
    Collaborative Filtering
    Content-Based Approach
    Hybrid Approach
    Challenges
  Summary

3 Query Suggestion Model Based on The Query Semantics and Click-through Data
  Introduction
  Related Works
    Query Suggestion
    Bipartite Graph
  The Proposed Method
    Query Semantics for Document-based Method
    Query-URL Bipartite Graph for Log-based Method
      Construction of Query-URL Matrices
      Matrix Factorisation and Query Relevance Computation
    Integrate Multiple Suggestion Models
  Experiments
    Data Set
    Evaluation Metrics
    Comparison of Query Suggestion Results
    Evaluation of Suggestion Results
  Summary

4 Collaborative Filtering Retrieval Model Based on Local and Global Features
  Introduction
  Related Works
    Personalized Web Search
    Filtering Algorithms
    Click-through Data
  Proposed Approach
    User Profile
      Sequence Score
      Web Page Rating
      Preference Score
    User-Based Collaborative Filtering
    Personalized Search Model
  Experiments
    Data Sets
    Evaluation Metrics
    Ranking Methods Compared
    Users Evaluation
    Impact of Parameters
    Personalized Search Performance
  Summary

5 Web Search Recommendation Based on the Retrieval Sequence and the Browsing Features
  Introduction
  Related Works
    User Information Collection
    Web Search Recommendation
  The Proposed Method
    Definition of Terms
    User Modeling
    Recommendation of Resource
      Resource Retrieval
      Resource Filtering
  Experiments and Analysis
    Data Set
    The Standard of Evaluation
    Results of Experiment
  Summary

6 Recommendation Based on User Interests Association Findings
  Introduction
  Related Works
    Association Rule Mining
    Maximal Frequent Itemsets
  The Proposed Method
    Basic Concepts and Assumptions
      Basic Concepts
      Basic Assumptions
    Resources Collection and Description
      Resources Collection
      Resources Description
    User Profile Association Algorithm
    User Modelling
    Recommendation of Resources
  Experiments and Analysis
    Experiment Data Set
    Evaluation Metrics
    Analysis of Experiment Results
  Summary

7 Conclusions and Future Research
  Conclusions
  Future Research

References

List of Figures

1.1 Relationship between chapters
2.1 High level Web log mining process
An example of click-through bipartite
MAP comparisons
Sequence similarity between two models
Impact of the parameter α
Impact of the parameter β
Comparison between two models
The comparison of precision on four classes
Average precision between two models
Minimum support and number of the transactions
The comparison of precision on four classes
Comparison between two models

List of Tables

3.1 Samples of search engine click-through data
Examples of QCQS query suggestion results
Accuracy comparisons
User profile
Information format of click-through data
Precision comparisons
Web access log
Format of user profile

Abstract

With the rapid development of the Internet and the WWW, it is increasingly important for people to access quality web information. Enabling users to find information quickly and accurately has therefore become an urgent issue. As one of the basic ways to solve this problem, personalized information services focus on fulfilling the personalized information requirements of different users based on their actual demands, preference characteristics, behaviour patterns, etc. This thesis focuses on enhancing web log based recommendation by personalized retrieval, and its main works and innovations are as follows.

For personalized retrieval, the thesis proposes two models to improve user experience and optimize search performance. The first is a query suggestion model based on query semantics and click-through data. This model calculates the subject relevance between queries, and then combines the semantic information with the relevance obtained from the query-click matrix model, which can effectively eliminate the ambiguity and input errors of reminder queries. The second is a collaborative filtering retrieval model based on local and global features. By integrating the local and global characteristics of the accessed information, this model overcomes the limitations of a single feature and broadens the applicability of the retrieval model.

For recommendation by personalized retrieval, we propose two recommendation models based on the web log. The first is based on the user's atomic retrieval transaction sequence and browsing characteristics. This model decomposes search transactions and calculates the user's degree of interest in each search term, which allows users to query information more clearly. Further, it incorporates user feedback into the evaluation value of the search results, which overcomes the shortcomings of the model based on content filtering. The second model is based on user interests association findings, which can be used to find the relationship between resources accessed by users, extract the associations of user interests, and address the problem of user interests isolation.

Chapter 1

Introduction

1.1 Background

With the constantly increasing size of resources available on the Internet, the Internet has effectively become the world's largest and most extensive resource repository. However, most Web structures are large and complicated, and users often miss the goal of their inquiry or receive ambiguous results when they try to navigate through them [Eirinaki and Vazirgiannis, 2003]. We list several common issues that arise when users try to obtain information from the Internet.

(1) Difficult to find the desired information

Web information is distributed, dynamic, multi-structured, and stored on various sites around the world, yet no one is responsible for the validity and orderliness of this information. Finding the desired information quickly and accurately in this huge resource repository therefore becomes increasingly important for Web users. Thanks to the continuous efforts of many research institutes and commercial companies, Web users can obtain information using classified portal web sites and search engines, which helps to address the issue to a certain extent. However, the outcome is not always satisfactory because of the low precision and recall of the results returned by portal sites and search engines.

(2) Difficult to obtain the deep knowledge and patterns behind Web information

Web data contains a huge amount of useful and often profound knowledge and patterns, which can be difficult to discover and exploit directly. In e-business, what relevance is there between items bought at the same time? What differences exist among different users' shopping behaviours? How can users' purchasing or browsing preferences be harnessed to recommend or promote products? In terms of information retrieval, how many times do query strings appear and how long are they? What pattern exists in the page-turns of query results? What rules are there in the URL clicks on query results? Exploiting the discovered knowledge and patterns is extremely useful for optimising recommendation and information retrieval algorithms.

(3) Lack of individual information services

Because the information needs of users are varied, specific and limited, while the information resources on the Web are infinitely dispersed, there is a contradiction between the specific information needs of users and the infinite information resources on networks. Accordingly, it is necessary to provide personalized services to meet the needs of specific users. In addition, traditional search engines cannot meet these needs. On one hand, traditional search engines return the same results when different users input the same query string, which disregards users' preferences. On the other hand, because web information is dynamic, users have to issue the same query repeatedly if they want to obtain the latest information from the Internet. Consequently, providing personalized recommendation and information retrieval for users with different backgrounds and preferences is a hot research topic.

1.2 Research Questions

This thesis mainly focuses on three kinds of research questions, which are related to query suggestion for information retrieval, personalized retrieval, and information recommendation.

Q1: How to provide query suggestions based on the query semantics and click-through data?

To address the query suggestion issue, we consider designing approaches from two perspectives: (1) how to borrow the word concepts of the Knowledge Network, which belongs to the document-based methods; (2) how to use query logs effectively, which belongs to the log-based methods. Based on these two perspectives, the research problems are described as follows:

Document-based method. It is difficult to find relevant documents because a query string is a short and sparse text. We mainly address the issue using word frequency and domain knowledge (the word concepts of the Knowledge Network).

Log-based method. Given query logs, we learn to exploit them to improve the performance of query suggestion. Based on Query-URL bipartite graph methods, we obtain a query similarity matrix that contributes to the performance of query suggestion.

Hybrid method. After addressing the problems above, we need to consider how to integrate these approaches to pursue further improvement.

Q2: How to optimise the information retrieval algorithm based on collaborative filtering of local and global features?

Traditional search engines cannot meet users' personalized information needs because they disregard users' preferences. The different preferences and behaviours of users reflect their different information demands. Returning different query results when different users input the same query string is a challenge for traditional search engines. By analyzing and studying users' logs, we propose a collaborative filtering retrieval model based on local and global features, which considers the local and global characteristics of the accessed information and treats the two types of characteristics with different methods. This model overcomes the limitations of a single feature and broadens the applicability of the retrieval model.

Q3: How to recommend information based on search behaviours and browsing features?

The information needs of users are ultimately reflected by the specific pages they browse and the behaviours associated with browsing them. There are many kinds of web resources, such as page contents, page linkages, web logs, and so on, which can be analyzed to discover users' preferences and build users' profiles. So which types of information can we collect to reflect users' interests, and how do we find users' preferences and then recommend potentially interesting resources to them? To answer these questions, we propose two recommendation models based on search behaviours, browsing features and associated user interests.

1.3 Research Objectives

Our research aims to develop innovative solutions to improve the performance of web log based recommendation by personalized retrieval. Several major research objectives (RO), which aim to answer the relevant research questions, are discussed below:

RO1: To propose a query suggestion model based on the query semantics and click-through data (aims to answer Q1);

In order to tackle the lack of effective semantic processing, this thesis proposes a query suggestion model based on query semantics and click-through data. The model combines the click-stream data matrix model and query semantic information. Using word frequency information and the word concepts of the Knowledge Network (HowNet), the model calculates the subject relevance between queries, and then combines the semantic information with the relevance obtained from the query-click matrix model.

RO2: To propose a collaborative filtering retrieval model based on local and global features (aims to answer Q2);

In order to improve the performance of personalized information retrieval, we propose a collaborative filtering retrieval model based on local and global features. It considers the local and global characteristics of the accessed information, and treats the two types of characteristics with different methods. Local features are processed by taking the user's click-stream logs stored in the history of accessed information, analyzing the sequences of resources accessed within retrieval sessions, and building the user's interest function. Global features are processed by using all users' evaluations of information resources to build the global user function of resource interests. By integrating the two types of characteristic information, the model overcomes the limitations of a single feature and broadens the applicability of the retrieval model.

RO3: To propose a recommendation model based on the retrieval transaction sequence and the browsing features (aims to answer Q3);

A comprehensive recommendation model is developed, which is based on the user's atomic retrieval transaction sequences and browsing characteristics (save, print, bookmarks, and browsing time). This model decomposes search transactions and calculates the user's degree of interest in each search term, which allows users to query information more clearly. Further, this model incorporates user feedback into the evaluation value of the search results, which overcomes the shortcomings of the model based on content filtering.

RO4: To propose a recommendation model based on user interests association findings (aims to answer Q3);

By studying and analyzing the user browsing information, we propose a personalized recommendation model based on user interests association findings, which can be used to find the relationship between resources accessed by users, extract the associations of user interests, and address the problem of user interests isolation. Tests on real data show that it obtains good recommendation accuracy.

1.4 Significance and Main Contributions

The significance and main contributions of the proposed work are as follows:

The proposed query suggestion model based on the query semantics and click-through data is a new extension for search engine query suggestions. The model utilizes the bipartite graph to learn the low-rank query feature space and build a query similarity matrix model. Meanwhile, it combines query literal similarity with query semantic information and calculates subject relevance among queries using word frequency information and the word concepts of the Knowledge Network (HowNet). Finally, the model integrates the two kinds of relevance to pursue high-performance query suggestion.

The proposed collaborative filtering retrieval model based on local and global features represents a new extension for information retrieval algorithms. This model considers the local and global characteristics of user accessed information, and treats the two types of characteristics with different methods. The local features use the user's click-stream logs stored in the history of accessed information to analyze the sequences of resources accessed within retrieval sessions and to build the local user's interest function. The global features use all users' evaluations of information resources to build the global user function of resource interests. The model overcomes the limitations of a single feature, and enlarges the application domain of the retrieval model.

The proposed web search recommendation model based on users' atomic retrieval transaction sequences and browsing features addresses two challenging questions in information recommendation: which types of information reflect the user's interests, and how do we find the user's preferences and recommend potential resources to them? We have introduced users' atomic retrieval transaction sequences and exploited the browsing characteristics (save, print, bookmarks, browsing time) to build the user's profile. This model decomposes search transactions and calculates the user's degree of interest in each search term, which allows users to query information more clearly. It also incorporates user feedback into the evaluation values of the search results, which overcomes the shortcomings of the model based on content filtering.

The proposed recommendation model based on user interests association findings is an extension of the former recommendation model. By studying and analyzing the user browsing information, the model can be used to find the relationship between the resources accessed by users, extract the associations of user interests, and address the problem of user interests isolation.

1.5 Research Methodology

Personalized information services focus on fulfilling the personalized information demands of different users based on their actual demands, preference characteristics, behaviour patterns, etc. Personalized services can effectively cater to users' personal interests; they are widely accepted and are becoming more and more popular. This thesis considers three aspects of personalized information services: query suggestion for information retrieval, personalized retrieval, and information recommendation.

For query suggestion for information retrieval, we propose a novel and efficient query suggestion model that integrates the query semantics and click-through data, thereby overcoming the disadvantages of each of the two kinds of methods. First, we propose a method which combines query literal similarity with query semantic information, and calculates the subject relevance among queries using word frequency information and the word concepts of the Knowledge Network (HowNet). Secondly, we propose another method which utilises the bipartite graph to learn the low-rank query feature space, and then builds a query similarity matrix model based on the features. Based on these, we design a ranking algorithm to propagate the similarities of the users' query log information, and then recommend semantically relevant queries to users.

The model is composed of three parts: query semantics, the Query-URL bipartite graph, and the integration of multiple suggestion models. In query semantics for the document-based method, we combine the query literal similarity with query semantic information, and then calculate the subject relevance among queries using word frequency information and the word concepts of the Knowledge Network (HowNet). In the Query-URL bipartite graph for the log-based method, we utilize the Query-URL bipartite graph to learn the low-rank query feature space, and build a query similarity matrix model based on the features. In the integration of multiple suggestion models, we integrate the two models to pursue high-performance query suggestion. Empirical experiments on the click-through data of a commercial search engine demonstrate the effectiveness and efficiency of this model.

For personalized retrieval, we propose the local and global feature-driven collaborative filtering retrieval model, which aims to utilize click-through data and Web page ratings to improve Web searching. By analyzing the click-through data, we attempt to discover the latent factors among these multi-type objects. Page rating is one important characteristic, which can be calculated from the explicit relevance ratings of users who have browsed the Web page. By analyzing associations among the multi-type objects in click-through data and computing Web page ratings, we construct a personalized search model, and then re-rank search results using the model.

The model is composed of three parts: the user profile, user-based collaborative filtering, and a personalized search model. In the user profile of the information retrieval model, we calculate the user's preference score by integrating the user's search sequence score and the web page rating. This model simultaneously takes into account the local features (search sequence) and the global features (web page). In the user-based collaborative filtering part, we predict a test user's interest in a test item based on the rating information from similar user profiles. Each user profile is sorted by its dissimilarity to the test user's profile. Ratings by similar users contribute to predicting the test item rating. A set of similar users can be identified by employing a threshold or by selecting the top-n users. In the personalized search model, by selecting top-scoring documents and the documents of interest to users (including those accessed by users and those predicted by the system), we propose a personalized Web search retrieval model in which different users who enter the same query keywords receive different result lists. To verify the personalized information retrieval model, we evaluate it on real-world datasets. The experimental results show that the collaborative filtering retrieval model based on local and global features enhances the accuracy of information retrieval.

For information recommendation, we propose a recommendation model based on the user's atomic retrieval transaction sequence and browsing features, and a recommendation model based on user interests association findings. The basic idea of the two models is to extract users' preferences, build users' profiles, and recommend potentially interesting information to users based on their profiles.

The models are composed of three parts: preference collection, the user's profile, and recommendation. In preference collection, which extracts data from the web log, we use the users' browsing behaviours and the content of the pages they browse to develop a novel tool that generates the users' preference data. In particular, we enrich the data by integrating several types of collected data. In the user's profile part, which represents the user's preferences, we exploit content-based and rule-based methods to generate the user's profile. Moreover, we adopt different user models for different users' preferences. In the recommendation part, we design a recommendation algorithm which iteratively employs the user's profile and the resources in the repository to find potentially interesting resources for the user. We evaluate the accuracy of the recommendation models on real-world datasets. Our experimental results demonstrate that the proposed methods are effective in recommendation, and consistently outperform existing and baseline methods.

1.6 Thesis Structure

The rest of the thesis is organized as follows:

Chapter 2 is the literature review, in which we review related works on web log mining, personalized retrieval, and recommendation systems. We then discuss the technical challenges in these areas and examine how the developed techniques meet these challenges.

Chapter 3 describes in detail the query suggestion model based on the query semantics and click-through data. Firstly, it canvasses the related work on query suggestion. Secondly, in the methodology part, it proposes a query suggestion model based on the query semantics and click-through data, which combines the click-stream data matrix model and query semantic information. Lastly, it outlines experiments that show how the technique can effectively eliminate the ambiguity and input errors of reminder queries.

Chapter 4 describes in detail the local and global feature-driven collaborative filtering retrieval model. Firstly, it canvasses the related works on personalized searching and click-through data. Secondly, it proposes a local and global feature-driven collaborative filtering retrieval model, which considers the local and global characteristics of the accessed information and treats the two types of characteristics with different methods. Lastly, it provides experimental results and analysis.

Chapter 5 describes in detail recommendation based on the user's atomic retrieval transaction sequences and browsing features. Firstly, it discusses the related work on this kind of method and outlines our motivation for it. Secondly, in the methodology part, we discuss the process of generating the user's atomic retrieval transaction sequence, and then introduce a content filtering algorithm to recommend resources to users. Lastly, we analyze the experimental results, which show that the model has better recommendation effectiveness and that recommendation efficiency is significantly increased.

Chapter 6 describes in detail the model of personalized recommendation based on user interests association findings. Firstly, it canvasses the related work on association rule mining and content-based filtering. Secondly, in the methodology part, we give the term definitions used in this model, describe the resources and users' preferences, and build a recommendation model. Finally, we set up experiments that measure recommendation accuracy on real data.

Chapter 7 presents conclusions and recommendations for future research.

The relationships between all chapters are shown in Figure 1.1.

Figure 1.1: Relationship between chapters

1.7 Publications Related to This Thesis

A list of the papers associated with my PhD research that have been submitted, accepted and published appears below:

1. Xueping Peng, Zhendong Niu, Sheng Huang. Query Suggestion Based on the Query Semantics and Clickthrough Data. Advanced Science Letters. Volume 9, Number 1, pp (6), April
2. Xueping Peng, Zhendong Niu, Sheng Huang, Yumin Zhao. Personalized Web Search Using Clickthrough Data and Web Page Rating. Journal of Computers, Vol 7, No 10 (2012), pp , Oct
3. Xueping Peng, Zhendong Niu, Sheng Huang. A Study on Personalized Recommendation Model Based on Search Behaviors and Resource Properties. ICIECS2010: International Conference on Information Engineering and Computer Science, pp Wuhan, China. Dec ,
4. Xueping Peng, Yujuan Cao, Zhendong Niu. Mining Web Access Log for the Personalization Recommendation. International Conference on MultiMedia and Information Technology, pp Three Gorges, China, Dec ,
5. Xueping Peng, Zhendong Niu. The Research of the Personalization Recommendation Model Based on the Behavior of User's Retrieval and Browse. International Conferences on Web Intelligence and Intelligent Agent Technology, pp Silicon Valley, California. Nov
6. Yumin Zhao, Zhendong Niu, Xueping Peng. Research on Data Mining Technologies for Complicated Attributes Relationship in Digital Library Collections. Appl. Math. Inf. Sci. 8, No. 3, pp ,
7. Sheng Huang, Xueping Peng, Zhendong Niu, Kunshan Wang. News topic detection based on hierarchical clustering and named entity. International Conference on Natural Language Processing and Knowledge Engineering, pp Tokushima, Japan. Nov ,
8. Yujuan Cao, Xueping Peng, Kun Zhao, Zhendong Niu, Guixian Xu, Weiqiang Wang. Query expansion based on query log and small world characteristic. WISE 2009: 10th International Conference on Web Information Systems Engineering, Poland Poznan Lecture Notes in Computer Science v5802 LNCS, pp ,

Chapter 2

Literature Review

In this chapter, we review the related works on web log mining, personalized retrieval, and the categorization and technology of recommendation systems, and then discuss the technical challenges in these areas.

2.1 Web Log Mining

Web log mining is the application of data mining techniques to the data generated by the interactions of users with web servers. This kind of data, stored in server logs, represents a valuable source of information [Mele, 2013]. Analyzing this data reveals the users' basic behavior and the associations between users. These provide direct support for researching user behavior patterns, evaluating the performance of websites, etc. The results of mining web logs can also be used to personalize the presentation of web content; improve user navigation; improve the design of web or e-commerce sites; optimize the document-retrieval task; improve query suggestion; and improve customer satisfaction [Abedin and Sohrabi, 2009; Xie, 2014].
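To make the notion of "data stored in server logs" concrete, the following is a minimal sketch of turning one raw log line into a structured record. It assumes the server writes the NCSA Common Log Format; the thesis does not state which log format its data sets use, and the sample line, the pattern name `CLF_PATTERN` and the function `parse_log_line` are hypothetical names introduced only for illustration.

```python
import re
from datetime import datetime

# Assumed layout (NCSA Common Log Format):
#   host ident authuser [date] "method url protocol" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields that later mining steps need as values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample line, not taken from the thesis data.
sample = ('192.0.2.10 - - [10/Oct/2014:13:55:36 +1000] '
          '"GET /papers/index.html HTTP/1.1" 200 2326')
record = parse_log_line(sample)
print(record["host"], record["url"], record["status"])
```

Each parsed record carries the requesting host, the time and the requested URL, which are the fields that user identification and session identification typically rely on.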

34 2.1.1 Technologies Web log mining also known as web usage mining is the application of data mining techniques on large web log repositories to discover useful knowledge about users behavioral patterns and website usage statistics that can be used for various website design tasks. The main source of data for web usage mining consists of textual logs collected by numerous web servers all around the world. There are four stages in web usage mining [Chitraa et al., 2010]. Data Collection: the users log data is collected from various sources like serverside, client side, proxy servers and so on. Preprocessing: performs a series in processing the web log file covering data cleaning, user identification, session dentification, path completion and transaction identification. Mining Algorithms: this is the various data mining techniques to process data like statistical analysis, association, clustering, pattern matching and so on. Pattern Analysis: once patterns are discovered from web logs, uninteresting rules are filtered out. Analysis is done using knowledge query mechanisms such as SQL or data cubes to perform OLAP operations. All the four stages are depicted through the following figure 2.1 [Singh and Singh, 2010] Data Collection Data collection is the very first initialization step of web usage mining. The data authenticity and integrality directly affects the smooth functioning and final recommendation of characteristic service s quality. Therefore it must use scientific, reasonable and advanced technology to gather various data. At present, in relation to web usage mining technology, the main data has originated from three sources: server data, client data and middle data (agent server data 16

and packet sniffing) [Bari and Chawan, 2013; Singh et al., 2013]. A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behavior of site visitors [Domenech and Lorenzo, 2007]. The data recorded in server logs describes the access of a Web site by multiple users. However, the site usage data recorded by server logs may not be entirely reliable because of the various levels of caching within the Web environment: cached page views are not recorded in a server log. A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can reduce both the loading time of a Web page experienced by users and the network load.

Figure 2.1: High-level Web log mining process

Data Preprocessing

The information available in the web log is heterogeneous and unstructured. Therefore, the preprocessing phase is a prerequisite for discovering patterns. The goal of preprocessing is to transform the raw clickstream data into a set of user profiles. Data preprocessing mainly

includes data cleaning, user identification, session identification and path completion.

Data Cleaning: Most data used for mining [Srivastava et al., 2000] is collected from Web servers, clients, proxy servers, or server databases, all of which produce noisy data. Because Web mining is sensitive to noise, data cleaning methods are necessary. Data cleaning is the process of removing irrelevant (noisy) items, such as graphics, videos and format information, which are identified by filename suffixes such as GIF, JPEG and CSS. Better data quality leads to better analysis.

User and Session Identification: The task of user and session identification is to find the different user sessions in the original web access log. User identification determines who accessed the website and which pages were accessed. The goal of session identification is to divide the page accesses of each user into individual sessions. A session is the series of web pages a user browses in a single visit. This step is complicated by proxy servers; for example, different users may have the same IP address in the log [Singh et al., 2013].

Path Completion: Another critical step in data preprocessing is path completion. A number of factors lead to incomplete paths: the local cache, the proxy cache, the POST technique and the browser's back button can all result in important accesses not being recorded in the access log file, so that the number of Uniform Resource Locators (URLs) recorded in the log may be smaller than the number actually visited. Local caching and proxy servers also make path completion difficult, because users can access pages in the local cache or the proxy server's cache without leaving any record in the server's access log. As a result, the user access paths are incompletely preserved in the web access log. To discover the user's travel pattern, the missing pages in the user access path should be appended. The purpose of path completion is to accomplish this task.
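The data cleaning and session identification steps described above can be illustrated with a minimal Python sketch. It is not the procedure of any cited work: the record fields (`ip`, `agent`, `time`, `url`), the suffix list, the use of the (IP address, user agent) pair as a user key and the 30-minute idle timeout are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed values for illustration only: suffixes treated as noise and the
# idle gap that closes a session.
NOISE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Drop requests for embedded resources such as images and style sheets."""
    return [r for r in records if not r["url"].lower().endswith(NOISE_SUFFIXES)]


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Group requests by (ip, agent) and split each group on long idle gaps."""
    by_user = defaultdict(list)
    for r in records:
        # Heuristic user key: users behind one proxy share an IP address,
        # so the user agent is added to tell some of them apart.
        by_user[(r["ip"], r["agent"])].append(r)

    sessions = []
    for user, reqs in by_user.items():
        reqs.sort(key=lambda r: r["time"])
        current = [reqs[0]]
        for prev, cur in zip(reqs, reqs[1:]):
            if cur["time"] - prev["time"] > timeout:
                sessions.append((user, [r["url"] for r in current]))
                current = []
            current.append(cur)
        sessions.append((user, [r["url"] for r in current]))
    return sessions


def make_record(ip, agent, hhmm, url):
    """Build one hypothetical, already parsed log record."""
    time = datetime.strptime("2015-02-01 " + hhmm, "%Y-%m-%d %H:%M")
    return {"ip": ip, "agent": agent, "time": time, "url": url}


log = [
    make_record("10.0.0.1", "Firefox", "09:00", "/index.html"),
    make_record("10.0.0.1", "Firefox", "09:00", "/logo.gif"),
    make_record("10.0.0.1", "Firefox", "09:05", "/products.html"),
    make_record("10.0.0.1", "Firefox", "10:30", "/index.html"),
    make_record("10.0.0.1", "Chrome", "09:01", "/about.html"),
]
sessions = sessionize(clean(log))
for user, pages in sessions:
    print(user, pages)
```

In this toy log the image request is discarded, the two browsers behind the same IP address are treated as different users, and the 85-minute gap splits the first user's requests into two sessions. Path completion is not covered by the sketch, since it additionally requires the site's link structure or referrer information.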

The better the results of data preprocessing, the higher the quality of the mined patterns and the shorter the running time of the algorithm. This is especially important for web log files, because their structure differs from that of data in a database or data warehouse: they are unstructured and incomplete owing to various contributing factors. It is therefore especially necessary to preprocess web log files in web usage mining. Through data preprocessing, the web log can be transformed into another data structure that can be mined more effectively.

Mining Algorithms

Web log mining algorithms apply statistical methods to analyze and mine the preprocessed data. At present, the machine learning methods typically used are clustering, classification, relation discovery and order model discovery. Each method has its own strengths and shortcomings, but classification and clustering are currently the most effective.

Pattern Analysis

The challenges of pattern analysis are to filter out uninteresting information and to visualize and interpret interesting patterns for the user. First, the less significant rules or models are deleted from the repository of interesting models. Second, OLAP technology is used to carry out comprehensive mining and analysis, and to make the discovered data or knowledge visible. Finally, the characteristic service is provided to the e-commerce website.

2.1.2 Research and Applications

Web log mining deals with understanding user behavior in interacting with the web or with a website. One of the aims is to obtain information that may assist web site reorganization

or assist site adaptation to better suit the user. A web log mining model mines server logs with the aim of extracting useful user access information, so that sites can improve themselves in line with user requirements, serve users better and provide greater economic benefits [Singh and Singh, 2010].

Many researchers have developed Web Usage Mining (WUM) algorithms that utilize Web log records to discover useful knowledge for supporting business applications and decision making. The quality of WUM in knowledge discovery, however, depends on the data as well as the algorithm. Tao et al. [2008] explored a new data source called intentional browsing data (IBD) for potentially improving the effectiveness of WUM applications. IBD is a category of online browsing actions, such as copy, scroll, or save as, that is not recorded in web log files. Their research aims to build a basic understanding of IBD that will lead to its easy adoption in WUM research and practice.

Recently, a number of WUM algorithms [Bhushan and Nath, 2012; Hollink et al., 2013; Hosseini and Abolhassani, 2007; Hung et al., 2013; Mele, 2013; Sumathi et al., 2010] have been proposed to analyze and predict user behavior patterns. The prediction of users' future movements and intentions is based on their clickstream data. Romero et al. [2013] developed a specific Moodle mining tool and applied it to e-learning systems in order to predict the marks that university students will obtain in their final exams. Jalali et al. [2009] developed a model for online prediction through web usage mining systems and proposed an approach for classifying user navigation patterns to predict users' future intentions. The approach uses the longest common subsequence algorithm to classify current user activities and predict the user's next movement.
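To make the role of the longest common subsequence (LCS) concrete, the following Python sketch classifies a partial session by its LCS overlap with stored navigation patterns. It is only an illustration of the general idea, not the model of Jalali et al. [2009]: the pattern labels, the page names and the normalization by the longer sequence length are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two page sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal pages, otherwise carry the best so far.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def classify(session, patterns):
    """Return the label of the stored pattern that best matches the session."""
    def score(label):
        pattern = patterns[label]
        # Assumed normalization: LCS length relative to the longer sequence.
        return lcs_length(session, pattern) / max(len(session), len(pattern))
    return max(patterns, key=score)


# Hypothetical navigation patterns, e.g. obtained from clustered past sessions.
patterns = {
    "shopper": ["/home", "/catalog", "/item", "/cart", "/checkout"],
    "reader": ["/home", "/blog", "/post", "/comments"],
}
current_session = ["/home", "/catalog", "/cart"]
label = classify(current_session, patterns)
print(label)
```

Once the current session has been assigned to a pattern, the pages of that pattern that the user has not yet visited are natural candidates for the next movement.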

2.1.3 Challenges

Although many techniques and applications have been proposed to support web log mining, there are still many issues that need to be tackled in order to provide high quality web services. These issues are as follows:

Discovering high quality knowledge: The quality of the discovered knowledge directly influences the quality of the web services provided. In order to discover high quality knowledge, new data mining methods and techniques are required.
Applying the discovered knowledge for advanced web applications: Once access patterns have been discovered, they should be further analyzed and applied to advanced web applications, such as personalized retrieval and recommendation.
Discovering semantic information: Since web logs lack semantic information about the web pages visited by users, it is difficult to understand the preferences and intentions of users. With the development of the Semantic Web (such as HowNet), semantics in web content can be used for improving the results of web log mining.

2.2 Personalized Retrieval

Many information systems have attempted to solve the information overload problem that information seekers currently face. However, despite being very efficient, traditional Information Retrieval (IR) techniques often follow the one-size-fits-all paradigm by delivering the same information in the same form and order to every user who issues the same query. Since different user information needs and queries arise in varying contexts with different intentions, research has started to focus on retrieving potentially relevant documents [Dumais, 2009]. This development has given rise to the notion of personalized retrieval, which attempts

to modify and evolve established IR techniques in order to produce more personally relevant results. Such systems tend to represent users with simplified profiles, which are often based on historic interests or user location properties (e.g. geographical location, language prevalence in a region). Initial evidence has emerged that some of these personalized IR (PIR) techniques have been applied within popular web search engines; however, little detail has been published so far. Although such statistical approaches enable the efficient calculation of personalized ranked lists, other considerations such as user context or preferences are often neglected [Steichen et al., 2012].

Personalized retrieval is based on the standard Information Retrieval model, which traditionally focuses on the retrieval of documents that are relevant to a unitary query. While personalized retrieval extends this model by taking into account historical interactions, the paradigm is still concerned with finding the most relevant documents for a single user query. This fundamental underpinning makes such systems particularly suitable for the general information access paradigm of searching by query, where it is assumed that users can express their information need in a relatively precise query [Steichen et al., 2012].

The current strategies of personalized search fall into two categories [Pitkow et al., 2002]. Cai et al. [2014] described two approaches to personalizing Web search results: query expansion and re-ranking of search results. In query expansion, user interests are conflated with a given query, and the expanded query is used for searching the Web. For re-ranking of search results, the search engine results are re-ranked by computing the similarity between the document contents and the terms in the user interest preference [Kumar et al., 2014].

2.2.1 Query Expansion

Query expansion [Chirita et al., 2007] refers to modifying the original query either by expanding it with other terms or by assigning different weights to the terms in the query [Cai et al., 2014].

Query expansion involves adding new words and phrases to the existing search terms to generate an expanded query. The expansion can be computed by finding relationships between query terms and document terms in the form of probabilistic correlations or association rules. It can also be approached by analyzing the implicit actions that a user performs during the search [Agosti et al., 2012].

In [Cui et al., 2002, 2003], by exploiting correlations between terms in documents and user queries mined from user logs, the query expansion method achieved significant improvements in retrieval effectiveness compared to other query expansion techniques. The central idea of the method is that if a set of documents is often selected for the same queries, then the terms in these documents are strongly related to the terms of the queries. Thus probabilistic correlations between query terms and document terms can be established based on the query logs, and these correlations can be used for selecting high-quality expansion terms from documents for new queries.

Shi and Yang [2007] used an improved association rule mining model to mine related queries from query transactions in query logs. Their algorithm first segments the user sessions identified in query logs into query transactions, and then mines association rules of related queries. The mining model utilizes not only the co-occurrences between distinct queries but also the distance similarity between them.

White et al. [2007] reported the results of a comparison of pseudo-relevance feedback and query log-based refinement. The study showed that the source, the amount of feedback and the query type affect the similarity between query extension and pseudo-relevance feedback. Conceivably, both techniques can be deployed in parallel, and refinements can be offered based on query classification.
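The log-based correlation idea attributed to Cui et al. above can be sketched in Python as follows. This is a simplified illustration rather than their exact formulation: the estimate P(document term | query term) = Σ_D P(document term | D) · P(D | query term), the use of relative term frequency for P(document term | D), the summation of correlations over query terms and all of the data are assumptions.

```python
from collections import Counter, defaultdict


def term_correlations(click_log, documents):
    """Estimate P(doc_term | query_term) from (query terms, clicked doc) pairs."""
    clicks = defaultdict(Counter)  # query term -> clicked document -> count
    for query_terms, doc_id in click_log:
        for q in set(query_terms):
            clicks[q][doc_id] += 1

    corr = defaultdict(lambda: defaultdict(float))
    for q, doc_counts in clicks.items():
        total = sum(doc_counts.values())
        for doc_id, n in doc_counts.items():
            terms = documents[doc_id]
            for term, tf in Counter(terms).items():
                # P(term | D) as relative frequency, P(D | q) as click share.
                corr[q][term] += (tf / len(terms)) * (n / total)
    return corr


def expand(query_terms, corr, k=2):
    """Append the k document terms most correlated with the query terms."""
    scores = Counter()
    for q in query_terms:
        for term, p in corr.get(q, {}).items():
            if term not in query_terms:
                scores[term] += p  # assumed combination: sum over query terms
    return list(query_terms) + [t for t, _ in scores.most_common(k)]


# Hypothetical documents (as term lists) and click-through log.
documents = {
    "d1": ["java", "virtual", "machine", "jvm"],
    "d2": ["java", "island", "indonesia", "travel"],
    "d3": ["jvm", "garbage", "collector", "jvm"],
}
click_log = [
    (["java", "runtime"], "d1"),
    (["java", "runtime"], "d3"),
    (["java"], "d1"),
    (["java", "holiday"], "d2"),
]
corr = term_correlations(click_log, documents)
expanded = expand(["runtime"], corr, k=1)
print(expanded)
```

Because users who searched for "runtime" clicked on documents in which "jvm" is frequent, "jvm" receives the highest correlation and is chosen as the expansion term for that query.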

2.2.2 Result Processing

Result processing adapts the search results to a particular user's preferences. Most re-ranking strategies attempt to construct a user profile from the user's historical behavior and use the profile to filter out resources that do not match his/her interests [Cai et al., 2014]. Pretschner and Gauch [1999] structured user profiles with an ontology consisting of 4400 nodes. Chirita et al. [2005] modeled both user profiles and resources as topic vectors from an ODP hierarchy; thus the match between user interest and content can be measured by their vector distance. Besides learning user profiles based on their own browsing histories, Sugiyama et al. [2004] also explored social information to refine search results with the help of like-minded neighbors. Dou et al. [2007] compared various personalization approaches (e.g. click-based, profile-based, long-term-based and short-term-based) and proposed an evaluation framework for the strategies.

2.2.3 Challenges

Although the methods described in the above-mentioned work are able to handle personalized searches with user and item profiles, there are some limitations. Most (if not all) of the current methods for personalized searching construct user profiles and resource profiles based on the Vector Space Model (VSM) or the BM25 ranking model [Kumar et al., 2014; Sun et al., 2005a; Wang and Zhai, 2007]. The weight of each item in a user profile is the degree to which the user is interested in the item. Similarly, the weight of each item in a resource profile is the degree to which the resource is relevant to the item. However, relying solely on TF or BM25 values to measure the weight of items does not sufficiently indicate how much a user is interested in an item.
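The profile-based re-ranking discussed in this section can be sketched in Python as follows. This is a generic Vector Space Model illustration under stated assumptions, not the method of any cited work: the profile and the documents are raw term-frequency vectors, similarity is the cosine measure, and all terms and document identifiers are hypothetical.

```python
import math
from collections import Counter


def tf_vector(terms):
    """Raw term-frequency vector; the weight of a term is its count."""
    return Counter(terms)


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def rerank(results, profile):
    """Order result ids by descending similarity to the user profile.

    Ties keep the original search engine order because sorted() is stable.
    """
    return sorted(results, key=lambda doc_id: -cosine(profile, results[doc_id]))


# Hypothetical user profile built from the terms of previously visited pages.
profile = tf_vector(["python", "pandas", "python", "numpy"])

# Hypothetical search engine results for the query "python", in engine order.
results = {
    "zoo": tf_vector(["python", "snake", "reptile"]),
    "docs": tf_vector(["python", "pandas", "tutorial"]),
    "news": tf_vector(["stock", "market"]),
}
ranking = rerank(results, profile)
print(ranking)
```

The sketch also shows the limitation noted above: the weight of "python" in the profile only reflects how often the term occurred in the user's history, which is a weak proxy for how much the user is actually interested in it.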
