0. WEB PERSONALIZATION AND WEB USAGE MINING


Dwight Dawson
5 years ago

Chapter 3

WEB PERSONALIZATION AND WEB USAGE MINING

Web personalization is a promising approach to relieve people of the burden of information overload on the internet and to provide them with relevant information according to their needs. The goal of web personalization using web usage mining is to identify interesting patterns in web usage data and to recommend objects to the user, such as products, text, links and so forth.

This chapter first describes the overall process of web personalization in section 3.1. Next, web mining and its categories are described in section 3.2. Web usage mining and its phases are described in section 3.3. Sections 3.4 and 3.5 present web data and web logs respectively. Major application areas for web usage mining are given in section 3.6. Section 3.7 describes various personalization techniques. The requirements of personalized search are presented in section 3.8. Finally, a summary of the chapter is given in section 3.9.

3.1 Web Personalization

The WWW is a huge repository of information which is growing exponentially. More and more people visit various web sites and search engines to find relevant information. Providing this huge amount of information is not the problem; the problem is that, day by day, more and more people with different needs and requirements search through the WWW, get lost in complex web structures and hence miss their inquiry goals. Web personalization can be the solution to this problem [Sci].

Eirinaki et al. [EV03] defined web personalization as any action that adapts information or services provided by web sites to the needs of a user or set of users, taking advantage of the knowledge gained from the user's navigational behavior and individual interests. The aim of web personalization is to present results to users based on their interests and needs. Personalization on the web covers a broad area, including check-box customization, recommender systems and adaptive websites.

The various cases of personalization include e-commerce applications, information portals and search engines, where results are filtered according to the profile of a user [Sci].

The overall process of web personalization [EV05] is shown in Figure 3.1.

Figure 3.1: Web Personalization Process

In general, the overall process of web personalization consists of the following tasks [EV03]:

a) Data collection: In this task data is collected, which can be usage data, content data, user profile data and structure data.

b) Data preprocessing: The data is then preprocessed, which is a necessary task for finding interesting usage patterns. The web usage data is stored in the form of web logs in web servers, proxy servers or client browsers. In the proposed architecture, proxy server web usage logs are used for the experimentation.

c) User profiling: This is the process of gathering user-specific information, either implicitly or explicitly. The user profile can include the user's personal information, interests and navigational behavior while surfing the net. Generally there are two basic types of user profiles: the interest-based user profile and the behavior-based user profile [WSYL09]. A user profile can also be either static or dynamic [EV03]. A static user profile never or rarely changes, e.g. the user's personal information such as name and sex. The data in a dynamic user profile changes frequently. The proposed architecture uses a dynamic, behavior-based user profile.

d) Extracting useful knowledge and interesting usage patterns (i.e. pattern discovery): The collected data is then analyzed to extract useful knowledge and interesting usage patterns. There are various approaches to analyzing the data, including content-based filtering, collaborative filtering, rule-based filtering and web usage mining. The proposed architecture uses the web usage mining method for web search personalization.

e) Pattern analysis: In the pattern analysis step, uninteresting rules or patterns are filtered out.

f) Personalization: This includes the actions to be carried out, as recommended by such personalization systems.

3.2 Web Mining

There are a number of reasons for the emergence of web mining [SCDT00]. The WWW is huge and growing exponentially. It contains a vast amount of information which is growing and being updated rapidly. Various companies, institutes, government agencies and service centers update their information regularly. Web pages do not have any standard structure and carry complex styles. Web pages are also organized in a more complex fashion than traditional text documents. The WWW provides its services to a wide variety of web users, who may have different interests, needs and backgrounds. When a user searches for information on the internet, he/she is actually interested in only a small portion of that information. The challenges listed above encourage the search for means to use web resources effectively, which in turn leads to web mining.

Most researchers use the term web mining for all methods that apply data mining to web data [PPPS03]. Web mining can be defined as the application of data mining techniques to extract knowledge from web data [Sri].
Mainly, there are three categories of web mining task: web usage mining, web structure mining and web content mining.

3.2.1 Web Mining Categories

As shown in Figure 3.2, web mining is broadly divided into three categories according to the kinds of data to be mined [SCDT00, BFRRT03, BFPKM14]: web content mining, web structure mining and web usage mining.

a) Web content mining: This is the task of extracting useful information from the content of web documents. The contents of web documents can be text, video, images and graphs. Very often these contents are in unstructured or semi-structured format, and hence extracting useful information or knowledge becomes difficult and complicated. Multimedia data mining and text mining are useful for mining the contents of web pages [WSYL09, SCDT00, Zho06, SKVP13, WYZ06, Min].

Figure 3.2: Web Mining Overview (web mining branches into web content mining over text and multimedia documents, web structure mining over hyperlink structure, and web usage mining over web log records)

b) Web structure mining: This depends on the structure of web documents. It includes the XML or HTML links and tags used in web pages. Normally various pages are linked together via HTML hyperlinks, so by studying these hyperlink connections, useful information such as the importance of a particular web page can be found. If a web page is linked to many other web pages, it can be considered an important page and can be placed in a higher rank category [WSYL09, SCDT00, Zho06, SKVP13, WYZ06, Min]. Social network analysis is a well-known line of research in the area of web structure mining.

c) Web usage mining: This is the application of data mining techniques to discover interesting and frequent access patterns from web log data [CMS97].

Table 3.1 gives an overview of web mining categories [KB00, WYZ06].

3.3 Web Usage Mining

The term Web Usage Mining (WUM) was introduced by Cooley in 1997. Web usage mining (also known as web log mining) is the application of data mining techniques to discover interesting and frequent access patterns from web log data [CMS97, Rat5]. The extracted usage patterns and knowledge can be used in a variety of applications such as system improvement, website modification and restructuring, caching and pre-fetching to improve user navigation, and personalized web search.

Table 3.1: Overview of Web Mining Categories

| | Web content mining | Web structure mining | Web usage mining |
|---|---|---|---|
| Method | Statistical; Machine learning; TFIDF and variants | Proprietary algorithms | Association rules; Machine learning; Statistical; Sequential pattern mining |
| Main data | Text documents; Hypertext documents | Links structure | Server logs; Proxy server logs; Browser logs |
| View of data | Unstructured; Semi structured | Links structure | Interactivity |
| Application categories | Clustering; Categorization; Finding extraction rules | Clustering; Categorization | Site construction; System improvements; Modification of website; Business intelligence; Personalization |

As shown in Figure 3.3, web usage mining consists of three phases, namely preprocessing (after data collection), pattern discovery, and pattern analysis [MNLM06, WYZ06, Grc04, BS07].

Figure 3.3: Web Usage Mining Process

3.3.1 Preprocessing

The preprocessing of web logs is usually complex and time consuming. It consists of four tasks: (i) data cleaning, (ii) identification and reconstruction of users' sessions, (iii) retrieval of information about page content and structure and (iv) data formatting.

Figure 3.4: Preprocessing of Web Usage Data

The data cleaning step consists of removing all entries in the web logs that are irrelevant and not useful in the mining process. For example, requests for graphical page content (e.g. jpg and gif images) and the requests of robots and web spiders are considered irrelevant and useless [SK09, CMS99, Cha, CA11]. Entries produced by robots and web spiders are found by referring to the user agent, or by checking for requests to the text file robots.txt. A heuristic-based approach can be used in cases where robots send false information, such as a false user agent, in the HTTP request. In such an approach, users' sessions and robots' sessions are separated.

Sessionization is the process of segmenting user activity into sessions. Episode identification can be performed as a final step in preprocessing the click stream data in order to focus on the relevant subsets of page views in each user session. An episode is a subset or subsequence of a session comprised of semantically or functionally related page views [LK07].
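The data cleaning and sessionization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the proposed architecture: the `Hit` record, the suffix and robot-marker lists, identifying a user by IP address alone, and the 30-minute inactivity threshold (a time-oriented heuristic) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Hit:
    """Hypothetical, simplified log record; real logs carry more fields."""
    ip: str
    time: datetime
    url: str
    agent: str

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png")  # assumed list of graphical content
ROBOT_MARKERS = ("bot", "crawler", "spider")        # assumed user-agent markers

def is_relevant(hit):
    """Return False for image requests, robots.txt requests and robot user agents."""
    path = hit.url.split("?", 1)[0].lower()
    if path.endswith(IMAGE_SUFFIXES) or path.endswith("/robots.txt"):
        return False
    agent = hit.agent.lower()
    return not any(marker in agent for marker in ROBOT_MARKERS)

def clean(hits):
    """Data cleaning: keep only the entries considered useful for mining."""
    return [h for h in hits if is_relevant(h)]

def sessionize(hits, timeout=timedelta(minutes=30)):
    """Time-oriented heuristic: a gap longer than `timeout` between two
    requests from the same IP address starts a new session."""
    sessions = []
    last_seen = {}  # ip -> (time of last request, session that request belongs to)
    for hit in sorted(hits, key=lambda h: h.time):
        previous = last_seen.get(hit.ip)
        if previous is None or hit.time - previous[0] > timeout:
            current = []
            sessions.append(current)
        else:
            current = previous[1]
        current.append(hit)
        last_seen[hit.ip] = (hit.time, current)
    return sessions
```

Calling `sessionize(clean(hits))` on a list of `Hit` records yields one list of page requests per session. A robot that sends a false user agent would pass this filter, which is why the text suggests additional heuristics for that case.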

The session identification step involves identifying the sessions of different users. These sessions are identified using incomplete information from web logs. The use of proxy servers creates a caching problem, which affects session identification. Sessions can therefore be reconstructed using navigation-oriented heuristics, time-oriented heuristics or cookies. In some situations cookies do not solve the problem. In such situations, the URL is rewritten to include the session-id in the original URL, so the web logs contain the modified URL instead of the original URL [LK07, HNJ08]. The major drawback of this solution is that a software agent has to be inserted at the server side to perform these tasks.

Web browser caching affects the creation of a consistent path. Often a web user presses the back button and visits the previous page, but web logs cannot contain such information. A heuristic approach can be used to reconstruct a consistent navigational path [SB09].

In many web usage mining applications, visited URLs are used as the main source of information for mining. In addition to the URLs, web pages can be classified according to their content type, and this classification is then used during the mining task. If a sufficient classification is not possible, web structure mining can be used to build it.

The last step of preprocessing is to format the data properly and then provide the formatted data for mining. Data can be formatted in various ways, such as using a relational database to store data extracted from web logs, using a signature tree to index the logs or using a WAP-tree to store access sequences. Even a cube-like structure can be used to store session information [GS05, CMS99]. In the proposed architecture, the data cleaning, user identification and session identification steps are carried out.

3.3.2 Pattern Discovery

Pattern discovery aims to detect interesting patterns in the preprocessed web usage data, i.e. mining the data [Sci].
It includes methods and algorithms developed in several fields such as statistics, data mining, machine learning and pattern recognition. Generally, there are many data mining techniques for web personalization, based on classification, clustering, sequential pattern mining, association rule discovery and statistical approaches. Among them, sequential pattern mining is an extensively used data analysis technique in web usage mining. In the proposed architecture, the pattern discovery phase is accomplished using the proposed sequential pattern mining algorithm.

Various data mining techniques are used in web usage mining. For example, association rule mining is used in many web usage mining applications. The aim of association rule mining is to identify associations or correlations between different data items or different sets of data items. For example, consider an association rule of the form C → D, where C and D are sets of data items within some transaction. The rule C → D says that transactions which contain the items in C are also likely to contain the items in D. A typical example of association rule mining is market basket analysis, in which customer buying habits are analyzed by finding associations between the different items purchased by customers. The detection of these associations helps retailers learn which items are purchased together by customers.

For web usage mining, association rule mining is used to find correlations between web pages accessed together in a session. For example, the rule C.html, D.html → E.html means that if a user has visited the web pages C.html and D.html, then most probably the same user has also visited E.html in the same session. Association rule mining can be used for a web personalization system or a web recommender system. A mixed technique of association rules and fuzzy logic is used to extract fuzzy association rules from web logs [GS05, SCDT00, CMS99].

Clustering partitions a set of objects into groups (clusters), such that objects within the same group bear a closer similarity to each other than to objects in different groups. A cluster is a collection of data objects that are similar to each other within the same cluster but dissimilar to the objects in other clusters [KKBD07]. The choice of clustering algorithm depends on the application and the type of data available.

In web usage mining applications, clustering techniques [XP01, NFKJ99, MCS99] are used to identify two types of clusters: page clusters and user clusters. A user cluster is a group of users who have similar browsing habits, while a page cluster is a group of pages that are conceptually related from the user's point of view. In some web usage mining applications, clustering has been used to group together similar sessions.

Some types of statistical information need to be generated for use by the system administrator to improve system performance, modify the website for customer satisfaction and improve the security of the system. Some of the statistical information discovered from web logs is shown in Table 3.2 [Zho06, SK09].

Table 3.2: Statistical Information Discovered from Web Logs

| Statistics | Information |
|---|---|
| Referrer statistics | Top search engines; Top referring websites |
| Diagnostic statistics | Page not found errors; Server errors; Proxy authentication required; Bad request |
| Client statistics | Visitor's web browser; Cookies |
| Website activity statistics | Number of hits; Average view time; Traffic of a day; Duration of maximum and minimum traffic |

Classification is a process that learns to assign data items to one of several predefined classes, i.e. it predicts the class of a data item from a number of predefined classes. It is also known as supervised learning. In web usage mining, classification is generally used to construct different user profiles, where users belong to a particular class or category [Zho06, Sci, RVZB09]. Tan et al. [TK02] built a web robot classification model using classification techniques. The aim of this model is to use the minimum number of requests to distinguish robot sessions from non-robot sessions. The C4.5 algorithm is used to build the classification model, which gives very high accuracy with a minimum number of requests and also discovers many robots, including previously unidentified ones.

Sequential pattern mining discovers interesting and frequent patterns in web data. It is defined by Agrawal and Srikant as follows [AS95]: given a sequence database where each sequence is a list of transactions ordered by transaction time and each transaction consists of a set of items, find all sequential patterns with a user-specified minimum support, where the support is the number of data sequences that contain the pattern. In web usage mining, sequential patterns are used to find frequent sequential browsing patterns in a user's sessions.
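The notion of support in the definition above can be illustrated with a short sketch. It assumes a simplification that is common for web sessions but is not stated in the text: each transaction holds exactly one page, so a session is a plain list of page names and a pattern is contained in a session when its pages occur there in the same order, not necessarily contiguously. The function names are hypothetical, and the candidate patterns are supplied by hand; no candidate generation of any particular algorithm is implemented.

```python
def contains(session, pattern):
    """True if `pattern` occurs in `session` as an order-preserving
    (not necessarily contiguous) subsequence."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so later pages of the
    # pattern must be found after earlier ones.
    return all(page in remaining for page in pattern)

def support(sessions, pattern):
    """Number of sessions (data sequences) that contain the pattern."""
    return sum(contains(s, pattern) for s in sessions)

def frequent_patterns(sessions, candidates, min_support):
    """Keep the candidate patterns whose support reaches the minimum support."""
    return [p for p in candidates if support(sessions, p) >= min_support]
```

With the sessions `[["A", "B", "C"], ["A", "C"], ["B", "A", "C"], ["C", "A"]]`, the pattern `["A", "C"]` is contained in the first three sessions but not in the last one, where C precedes A.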

Generally, sequential pattern mining algorithms are differentiated by: the number of database scans required, the process used to generate and store the candidate set of k-itemsets, the number of candidate sets generated and the process used to count the support value. Run time and memory utilization are two important measures for evaluating the performance of these mining algorithms.

There exist a number of sequential pattern mining algorithms based on different techniques. The two techniques primarily used by most of them are Apriori-based and pattern-growth-based (also called FP-growth). Some algorithms are early-pruning-based. AprioriTid, AprioriAll, AprioriSome and GSP (Generalized Sequence Pattern) are Apriori-based algorithms, while WAP-mine, FreeSpan and PrefixSpan belong to the pattern-growth technique. Most Apriori-based algorithms encounter problems such as multiple scans of the database, generation of an explosive number of candidate sequences and difficulties in mining long sequential patterns. FP-growth-based algorithms such as PrefixSpan involve the construction of projected databases in various steps. All these processes are costly in terms of memory and run time.

The basic mining algorithm based on the WAP-tree is the WAP-mine algorithm, which needs only two scans of the sequence database. The algorithm builds a tree at the start and then a number of intermediate trees for frequent subsequences, which results in higher memory utilization. However, WAP-mine outperforms the GSP algorithm [ME10, SKS98].

[Mor01] provides a comparison of three different sequential pattern algorithms applied to web usage mining. The comparison includes (i) PSP+, an evolution of GSP, (ii) FreeSpan and (iii) PrefixSpan, based on data projection.

The frequent sequential patterns are mined through a breadth-first search over the hypertext probabilistic grammar. Markov models are useful for addressing these problems. Higher order Markov models display higher predictive accuracy. However, higher order Markov models are extremely complicated due to their large number of states, which increases their space and runtime requirements, whereas lower order models do not capture the entire behavior of a user in a session. So, predicting a next request that is not in the sequence is difficult [HWV01, LK07, DK04, CMS99].

The CSB-mine algorithm is an efficient sequential access mining algorithm which, unlike Apriori-based algorithms, does not generate any candidate sequences. There is also no need to build a WAP-tree to store web access sequences.

Table 3.3 shows some of the main pattern discovery techniques for web usage mining published in research papers, along with their application areas [Zho06, SZA10].

Table 3.3: Pattern Discovery Techniques for Web Usage Mining

| Papers | Technique | Application |
|---|---|---|
| Lin et al. [LAR01] | Association rule mining | Collaborative recommender system |
| Mobasher et al. [MDLN01] | Association rule mining | Web personalization |
| Wong et al. [WSP01] | Association rule mining | Web personalization |
| Pei et al. [PHMZ00] | Sequential pattern mining with WAP-tree and WAP-mine | General |
| Ezeife et al. [EL05] | Sequential pattern mining with PLWAP algorithm | General |
| Maged et al. [SRR04] | Sequential pattern mining with FS-tree | Web page prediction and prefetching |
| Mobasher et al. [MDLN02] | Sequential pattern mining | Web personalization and recommender system |
| Nasraoui et al. [NFJK99] | Clustering | Web personalization |
| Tan et al. [TK02] | Classification | Preprocessing |
| Cho et al. [CKK02] | Classification with decision tree | Personalized web recommender system |
| Zhou et al. [ZHF06] | Sequential pattern mining, CSB algorithm | Web personalization and recommender system |
| Borges et al. [BL99] | Sequential pattern mining | Website modification |
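To make the association rule technique listed in Table 3.3 more concrete, the following sketch evaluates a page rule of the form C.html, D.html → E.html over a set of sessions. The measures used are the standard support and confidence of association rule mining, which this chapter does not define explicitly, so treat them as an assumption of the example; the function name and the sample sessions are hypothetical.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of sessions containing all pages of the rule
    confidence = fraction of sessions containing the antecedent that
                 also contain the consequent
    Page order inside a session is ignored, as in association rule mining.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = sum(antecedent <= set(s) for s in sessions)
    with_both = sum((antecedent | consequent) <= set(s) for s in sessions)
    support = with_both / len(sessions)
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence

# Hypothetical sessions, invented for illustration only.
SESSIONS = [
    ["C.html", "D.html", "E.html"],
    ["C.html", "D.html"],
    ["C.html", "E.html"],
    ["C.html", "D.html", "E.html", "F.html"],
]
```

For `rule_metrics(SESSIONS, ["C.html", "D.html"], ["E.html"])`, three sessions contain both C.html and D.html and two of those also contain E.html. A recommender would typically keep only rules whose support and confidence exceed chosen thresholds.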

3.3.3 Pattern Analysis

Pattern analysis is the final step in the web usage mining process. Its aim is to convert discovered rules or patterns into knowledge. Here, knowledge means a conceptual idea that describes the information in an understandable way [Sci]. It is highly dependent on the person performing the analysis, and the exact method of analysis depends on the application for which web mining is done. For example, it can be done using a knowledge query mechanism such as SQL. Another method is to perform OLAP (OnLine Analytical Processing) operations on the usage data. Visualization techniques, such as graphing patterns or assigning colors to different values, can often highlight the overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type or content type, or pages that match a certain hyperlink structure. In web personalization using web usage mining, the pattern analysis step is also referred to as the recommendation phase, in which various URLs are recommended [Sci].

Web personalization using web usage mining is superior to other traditional approaches (e.g. content-based filtering and collaborative filtering) in terms of both scalability and reliance on objective input data [Sci], and it can achieve more accurate personalization [Zho06]. In most research efforts, web usage mining is used for web personalization on a particular website whose structure and content are known in advance. In this research, the focus is on web search personalization using web usage mining, where mining is applied to proxy server logs such that each user obtains personalized recommendations. The recommendations are improved when the same user issues the same or a similar query.

3.4 Web Data

There are different kinds of web data used in web personalization, as follows:

a) Content data: The content data of a site can be a combination of textual information and images. The data resources used to generate this data include HTML/XML pages, graphics, images, sound and video data. It also includes the metadata embedded in a web page, such as HTTP variables or semantic tags [Sci]. The domain ontology for the website is also considered content data; it includes conceptual hierarchies over page contents and structural hierarchies represented by the underlying file and directory structure in which the site content is stored.

b) Structure data: This is the designer's view of the various contents organized within a website. It includes data such as HTML tags, XML tags or the hyperlinks used to connect web pages. The hyperlink structure of a website is represented by a site map, which is used to capture the structure data of the site.

c) Usage data: This includes data from various logs such as proxy server logs, web server logs and client browser logs. These log data represent the navigational behavior of each user [Sci]. The usage data include the user's website visit information such as client IP address, date and time of request, requested URL, status code, bytes transferred, referrer etc. [JR14]. The web usage data also includes data from cookies, user queries, mouse clicks, registration data, user profiles, bookmark data, user sessions etc. [KB00]. Proxy (squid) web logs in combined log format are used for the experiments.

d) User profile data: This contains user information such as name, address, age, education, state and country, which can be collected explicitly from a user through an online registration form. It can also contain the user's interests and navigational behavior, which can be collected without the user's explicit feedback by applying web mining techniques to web logs. The proposed architecture uses a behavior-based user profile with some modification, which stores the user's navigational behavior without the user's explicit feedback.

Web usage data are stored in web servers, proxy servers or client browsers in some predefined format in the form of web logs:

a) Server-side logs: A web server stores the web usage data of the multiple users visiting the website. These web usage data are stored in web log files. There are various formats for storing web logs, such as the common log format and the extended format. Web server logs suffer from one main problem: the visit information of cached pages is not stored.
Users' sessions have to be identified when web logs from web servers are exploited for mining. A session is the group of all web page requests (selected URLs) made by a user over a certain period of time. This identifies the user's correct navigation path. There are various approaches to session identification, including cookie-based approaches, navigation-oriented heuristics and time-oriented heuristics [CMS99, FL05].

b) Proxy-side logs: If a web proxy server is used, the user's request goes to a web server through the proxy server. The proxy server therefore stores the web usage data of multiple users making requests for web resources on multiple web servers. The web logs stored on a proxy server are useful for personalizing web search using the web usage mining approach. We have used proxy server logs to accomplish web search personalization using web usage mining.

c) Client-side logs: Collecting usage data at the client side requires installing agent software on the client's machine. This agent traces and stores the user's browsing activity, which covers a single user's web browsing behavior over multiple websites. The drawback of this approach is the need to make the agent software compatible with the many existing operating systems and web browsers.

3.5 Web Logs

When a web user interacts with the web and submits a request, his/her navigational information, called the web access log (also called web logs in some literature), is stored in a web log file. Users' requests for web resources are stored in a log file sequentially, i.e. in the time order in which the requests are made. The raw information stored in a web log file is converted into a set of transactions (one transaction is the list of web pages visited by a user) in the preprocessing step of web usage mining. The remaining irrelevant requests (including web robot requests) are removed. The three different sources of web log files are web servers, proxy servers and client browsers [SK09].

There are various types of log formats. The Common Log Format (CLF) used by Apache includes host name, username, timestamp, requested URL, HTTP reply code and bytes sent in reply. The following is a fragment from the NASA server logs, in CLF:

[01/Jul/1995:00:00: ] "GET /history/apollo/ HTTP/1.0"

The Microsoft IIS log format includes remote host, date, time, HTTP request, status code, transfer volume (B), referrer field and user agent [W2].

The extended log format in general consists of a list of identifiers such as date, time, IP, bytes transferred, cached (whether a cache hit occurred or not), status code, comment returned with the status code, method used to retrieve data, URI requested, uri-stem (the stem portion of the URI alone, omitting the query) and uri-query (the query portion of the URI alone) [DCI00, W2]. Additional information such as referrer (the web page the client was visiting before requesting that page), user agent, or keyword (the keywords used when visiting that page after a search engine query) can also be stored [W2, Eir06].
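The Common Log Format fields listed above can be extracted with a regular expression, as in the following sketch. The sample line is entirely made up for illustration (it uses a documentation-range IP address and is not taken from the NASA or proxy logs), and the pattern covers only the basic CLF fields; the referrer and user agent of the combined format are not handled here.

```python
import re
from datetime import datetime

# host, identity, username, [timestamp], "request", status code, bytes sent
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means no bytes were sent in the reply.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    parts = record["request"].split()
    record["method"] = parts[0] if len(parts) >= 2 else None
    record["url"] = parts[1] if len(parts) >= 2 else None
    return record

# Hypothetical sample entry, invented for this example.
SAMPLE = '192.0.2.1 - - [01/Jan/2020:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024'
```

`parse_clf(SAMPLE)` returns a dictionary whose `host`, `time`, `url`, `status` and `size` entries correspond to the host name, timestamp, requested URL, HTTP reply code and bytes sent described in the text. Month-name parsing with `%b` depends on the process locale, so an English or C locale is assumed.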

Proxy server logs have been used to carry out the experiments for personalized recommendations. The following is a sample entry from the proxy server in squid combined web log format:

[05/Sep/2014:17:21: ] "GET HTTP/1.1" sa=t&rct=j&q=macro%20in%20excel&source=web&cd=1&cad=rja&uact=8&sqi=2&ve d=0cccqfjaa&url=http%3a%2f%2fwww.excel-easy.com%2 Fvba.html&ei=saQJVImgPMqOuASLyYLICg&usg=AFQjCNFEZeyEk7sF_jOZdYU82 6TIN d5g&bvm=bv ,d.c2e" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;

The entry carries the following information:

Remote IP address: The IP address of the client's machine (host address).
Username: Denoted by - -. It is relevant only when accessing password-protected content.
Timestamp: Date and time of the client's request.
Access request: The request made by the client. Here, it is a GET request for the file using the HTTP/1.1 protocol.
Status code: The resulting status code, e.g. 200 denotes success.
Bytes transferred: Number of bytes (e.g. 3808) transferred to the client.
Referrer: The URL of the previous page that linked the user to the current page.
User agent: The web browser and platform used by the user.

Table 3.4 shows the status codes of the Hypertext Transfer Protocol, which cover error conditions as well as successful transmission of data [SK09, W3].

Table 3.4: Status Codes of Hypertext Transfer Protocol (HTTP)

| Status code series | Meaning of series | Status code | Status code meaning |
|---|---|---|---|
| 1xx | Informational | 100 | Continue |
| | | 101 | Switching protocols |
| 2xx | Success | 200 | OK. The client request has succeeded |
| | | 201 | Created |
| | | 202 | Accepted |
| | | 204 | No content |
| | | 205 | Reset content |
| | | 206 | Partial content |
| 3xx | Redirection | 301 | Moved permanently |
| | | 302 | Object moved |
| | | 304 | Not modified |
| | | 305 | Use proxy |
| 4xx | Client error | 400 | Bad request |
| | | 401 | Access denied |
| | | 403 | Forbidden |
| | | 404 | Not found |
| | | 405 | Method not allowed |
| | | 407 | Proxy authentication required |
| | | 412 | Precondition failed |
| | | 413 | Request entity too large |
| | | 414 | Request-URI too long |
| | | 415 | Unsupported media type |
| | | 416 | Requested range not satisfiable |
| | | 417 | Expectation failed |
| | | 423 | Locked error |
| 5xx | Server error | 500 | Internal server error |
| | | 501 | Not implemented |
| | | 502 | Web server received an invalid response while acting as a gateway or proxy |
| | | 503 | Service unavailable |
| | | 504 | Gateway timeout |
| | | 505 | HTTP version not supported |
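The series column of Table 3.4 follows directly from the first digit of the status code, which the following sketch uses to tally diagnostic statistics of the kind listed in Table 3.2. The function names and the sample codes in the usage note are hypothetical.

```python
from collections import Counter

# Series names as given in Table 3.4, keyed by the first digit of the code.
STATUS_SERIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}

def status_series(code):
    """Map a numeric HTTP status code to the name of its series."""
    return STATUS_SERIES.get(code // 100, "Unknown")

def diagnostic_counts(codes):
    """Count how many logged requests fall into each status code series."""
    return Counter(status_series(code) for code in codes)
```

For example, `diagnostic_counts([200, 200, 404, 503, 304])` counts two successful requests and one request each in the client error, server error and redirection series.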

3.6 Major Application Areas for Web Usage Mining

There are many practical application areas where web usage mining has been applied, including a) personalization, b) system improvement, c) site modification, d) business intelligence and e) usage characterization [FL05, Zho06, SCDT00, WOP12]:

a) Personalization: Web personalization is the process of tailoring web site contents to the needs of a user. For personalization, interesting access patterns can be mined from web usage data. In many applications of web personalization, dynamic recommendations of items are made based on the user's browsing behavior and his/her profile, for example cross-sales in e-commerce. WebWatcher [JFM97] is a tool for web site personalization based on web usage data. The tool guides a user to the next hyperlink and improves its advice from past experience. Lieberman [Lie95] developed a software agent known as Letizia, which helps the user while browsing the web. It also uses web usage data to give a personalized experience to a user, applying some simple heuristics to predict the items the user will be interested in next.

b) System improvement: Frequent access patterns and web traffic behavior can be found using web usage data, and can then be used to decide policies for document prefetching, web caching, data distribution and load balancing. Hence system performance can be improved. Web usage mining can also be used in fraud detection by discovering frequent unexpected access patterns. Yang et al. [YZL01] presented a method that uses web access patterns (generated by web usage mining) in web caching and document prefetching policies to improve system performance.

c) Site modification: In many commercial applications, website attractiveness is a crucial feature from the business perspective. So the web site structure, i.e. the organization of the web pages, needs to be improved.
Web usage mining extracts knowledge from users' behavior and helps the web site designer to modify the web site. Perkowitz et al. [PE98] presented an approach for adaptive web sites that automatically improves the organization of the web structure by mining usage logs from the web server. The authors presented a cluster mining algorithm known as PageGather for this purpose.

d) Business intelligence: E-commerce web sites are used by a large number of users for on-line purchasing. Web usage mining can be used to derive marketing intelligence from users' navigation behavior in order to boost product sales. Buchner et al. [BAMH98] described the discovery of marketing intelligence from web data. The authors used users' navigational behavior in their proposed MIMIC (Mining the Internet for Marketing IntelligenCe) architecture. The marketing intelligence can be used for marketing activities such as cross-sales, customer attraction and customer retention.

e) Usage characterization: Web data such as the kind of data accessed, access patterns, number of bytes transferred, IP addresses (an IP address can be used, for example, to find the user's location, i.e. the country and city) and popular web services can be analyzed to see how the web is growing.

3.7 Personalization Techniques

There are various personalization techniques. This section describes the main ones: a) machine learning based approaches, b) language modeling based approaches, c) recommender systems, d) hyperlink-based personalized web search, e) personalized web sites, f) ontology based personalized search and g) the web usage mining approach [Sci, SV11, RS10, MGSG07, WDS, KVJ13, Zho06, Roh07, KL05, SHY04].

a) Machine learning based approaches: Machine learning based approaches can be used to personalize web search. They rely on algorithms that perform the desired operations once trained on a sufficient amount of data. In most machine learning based approaches, the task is simplified to a binary classification problem with two classes, relevant and non-relevant. This type of classification has some drawbacks. First, with explicit relevance feedback, a user provides information only about relevant documents, so the great majority of documents end up labeled non-relevant. A learner therefore generally achieves maximum predictive classification accuracy by always responding non-relevant, regardless of where the relevant documents are ranked. Second, with implicit relevance feedback (e.g. clickthrough data), a user typically clicks only documents among the top few results that he/she considers relevant.
Here clickthrough data contains only partial information, so binary classification would be an oversimplification. The Support Vector Machine (SVM) approach can be used to optimize the retrieval quality of search engines using clickthrough data (query logs and logs of the links clicked by the user). The SVM approach also performs well with a large number of features and a large number of queries [Joa02]. A probabilistic model based approach can be used for web search personalization, where machine learning algorithms are used to train the ranking functions [STB12]. Many machine learning systems have been developed over the past years [CC]. The major areas include neural networks, genetic algorithms, analytic learning, case-based learning and rule induction.

b) Language modeling based approaches: Statistical language modeling, or simply language modeling, is a probabilistic framework for describing the information retrieval process. It refers to the problem of estimating the likelihood that a query and a document could have been generated by the same language model, given the language model of the document and with or without a language model of the query. Statistical language modeling can also be used for session identification. In that case session identification does not rely on time-out boundaries but instead measures the change in information across the sequence of requests [HPAS04]. Language modeling based algorithms are used for personalized web search with implicit user modeling [STZ05]. The approach uses implicit feedback, which includes previous queries, summaries of previously clicked documents and the current query. The session time limit boundary can be eliminated in web search personalization by using the user's long-term search history and methods based on statistical language modeling [TSZ06].

c) Recommender systems: In the web personalization process, web resources (e.g. web pages) are generally recommended as an added function. There are various techniques for web personalization and recommendation, such as collaborative filtering, content-based filtering and rule-based filtering [PPPS03, WDS].

i) Collaborative filtering-based recommendation system: Collaborative filtering is another technique for improving web search results. The term collaborative filtering was coined by Goldberg et al. [SHY04, GNOT92, WCC01]. Collaborative filtering means that different users collaborate with each other by recording their reactions (such as like or dislike) to documents. This forms a community of users who have similar interests.
The current user is matched against the community database to find users with similar interests. The items that these neighbors like are then recommended to the current user on the assumption that he/she will like them too (i.e. the approach assumes that users with similar tastes on some items may also have similar preferences on other items). A major problem with this technique is that the user has to provide personal information about his/her likes and dislikes, and web users hesitate to provide such information. A collaborative method is used for re-ranking search results in [CGG00]. The search process and the ranking of documents are carried out within the context of a community of users or of a particular user. Two types of profiles are built, a user profile and community profiles. These profiles are built by studying the documents selected by the users and by the community to which they belong. The proposed recommender system is enhanced by re-ranking the search results and by term weights using an adapted cosine function.

GroupLens is a system designed for Usenet news using the collaborative filtering technique [KMMHGR97]. A number of users collaborate to select particular articles from a huge collection of news. However, the performance of such a pure collaborative method degrades for large web sites with a massive number of pages. Some personalization approaches therefore use a hybrid approach, e.g. a combination of collaborative filtering and content-based filtering [JFM97]. Yoda is a personalized recommendation system that uses a hybrid approach, combining content-based querying with collaborative filtering for more accurate recommendations [SKCM01]. It presents on-line, real-time recommendations while the model is trained off-line.

ii) Content-based recommender system: In a content-based recommender system, the contents of the web pages accessed by a user are analyzed. Web pages similar to the user's past likes are then recommended [Zho06, Wan13, Xu08]. The user's liking for an item is captured from the attributes of the item, such as price, tags and metadata. The user profile contains the previous purchase history of the user. It may also contain information about items that he/she only viewed in the past. The attributes of the items in the user profile are learned to model the interests (i.e. likings) of the user. The model can be built with different machine learning algorithms such as Bayesian networks, rule-based models and clustering [SHY04]. A content-based recommender system is useful for predicting an individual user's likes from that individual's past purchase/view history rather than from other users' preferences [Lie95]. The major difficulty with this approach lies in analyzing the contents of web pages and finding similarities among them [PPPS03, WCC01].
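The content-based idea described above can be illustrated with a small sketch: the user profile is built from the attributes of previously viewed or purchased items, and candidate items are ranked by their similarity to that profile. The function names, the attribute tags and the choice of cosine similarity over raw attribute counts are assumptions made for illustration; the systems cited above may use quite different models.

```python
import math
from collections import Counter


def build_profile(history):
    """Sum the attribute counts of all items the user viewed or purchased."""
    profile = Counter()
    for attributes in history:
        profile.update(attributes)
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse attribute-count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(profile, candidates, top_n=2):
    """Rank candidate items by similarity to the profile; drop items with no overlap."""
    scored = [(cosine(profile, Counter(attrs)), name)
              for name, attrs in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_n] if score > 0]


# Hypothetical data: attribute tags of items the user interacted with.
history = [["camera", "lens", "budget"], ["camera", "tripod"]]
candidates = {
    "zoom lens": ["lens", "camera"],
    "novel": ["book", "fiction"],
    "camera bag": ["camera", "bag"],
}
print(recommend(build_profile(history), candidates))
```

Only the individual user's own history enters the computation, which is what distinguishes this scheme from collaborative filtering.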
iii) Rule-based recommendation system: In a rule-based recommendation system, rules are generated from the answers submitted by users at the time of registration. These rules are then used to make recommendations to a user whose profile matches the rule conditions. The main difficulties with this approach are, first, constructing proper rules and, second, that user profiles are generally constructed explicitly, which requires user involvement [Zho06]. Sequential access patterns can be used for web recommendation instead of the traditional recommendation techniques above [ZHF06].

d) Hyperlink-based personalized web search: The hyperlink structure of the web is also useful for personalized web search [SHY04]. For example, some search engines use the PageRank algorithm to compute a ranking for every web page that identifies its relative importance [Pag99, SHY04]. The ranking depends on the link graph of the web. Every web page has some back-links and forward-links, and a web page has a high rank if the sum of the ranks of its back-links is high. The personalized PageRank algorithm, a modification of PageRank, is used for personalized search [SHY04]. For more accurate personalized search results, a set of PageRank vectors can be computed instead of the single vector of the original PageRank algorithm [Hav02]. For a given query, topic-sensitive PageRank scores are computed. In the model proposed in [Hav02], a set of PageRank vectors is computed off-line by creating, for each topic, a set of importance scores for the web pages belonging to that topic. In the experiments, the number of topic-sensitive PageRank vectors is limited to 16. Personalized views can be constructed from partial vectors at query time rather than computing and storing all personalized views in advance [JW03]. The authors also proposed algorithms to compute partial vectors and an algorithm to construct a personalized view from partial vectors. The proposed approach scales, and hence the limit of 16 vectors is removed.

A framework has been developed to extract information from the hyperlink structure of the web [Kle99]. The main goal of the framework is to refine the search topic and to discover authoritative information sources for such topics. The proposed algorithm, called Hypertext Induced Topic Selection (HITS), first computes eigenvectors of the document link matrix and then ranks the authority of documents by the magnitude of their eigenvector components. The main problem with the HITS algorithm is that users cannot express their own view of which resources are authoritative. This problem is addressed in [CCM02], which presents a technique to learn the user's internal model of authority.
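To make the PageRank-based personalization discussed above more concrete, the following is a minimal sketch of personalized PageRank computed by power iteration. It is a generic textbook formulation, not the specific algorithm of [Hav02] or [JW03]; the function name, the damping factor of 0.85, the three-page toy graph and the requirement that every page appears as a key of the link dictionary are all assumptions made for illustration.

```python
def personalized_pagerank(links, preference, damping=0.85, iterations=50):
    """Power iteration for PageRank with a user-specific teleport distribution.

    links: dict mapping each page to the list of pages it links to
           (assumed: every page appears as a key).
    preference: dict mapping pages to non-negative preference weights.
    """
    pages = list(links)
    total = sum(preference.get(page, 0.0) for page in pages)
    teleport = {page: preference.get(page, 0.0) / total for page in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        # Random jumps go to pages according to the user's preferences.
        new_rank = {page: (1 - damping) * teleport[page] for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its rank equally among its forward-links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without forward-links returns its rank via the teleport vector.
                for other in pages:
                    new_rank[other] += damping * rank[page] * teleport[other]
        rank = new_rank
    return rank


# Hypothetical three-page web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
uniform = personalized_pagerank(links, {"A": 1, "B": 1, "C": 1})
biased = personalized_pagerank(links, {"B": 1})
print(uniform)
print(biased)
```

With a uniform preference vector this reduces to ordinary PageRank; concentrating the preference on page B raises the score of B relative to the uniform case, which is the effect that topic-sensitive and personalized variants exploit.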
The technique proposed in [CCM02] realigns the eigenvectors of the document link matrix using user feedback and then re-computes the measure of authority to match the user's internal model.

e) Personalized web sites: Personalized web sites can be constructed using the structure, link topology and contents of web pages [SHY04]. Link personalization and content personalization are the two main schemes. In link personalization, the more relevant links are selected by the user, and the navigation history is then updated by weakening or strengthening the relationships between web pages. Link personalization is used in e-commerce applications, where products are recommended based on the purchasing history of a customer or of a community of customers, using their ratings and views. In content personalization, web pages present different information to different users, i.e. the information in a web page is personalized. In these techniques, users have to provide some personal information and ratings of products (e.g. 1 = bad, 2 = good and 3 = excellent). The success of these systems therefore depends on users' feedback, and users themselves have to update their profiles when their preferences change.

f) Ontology based personalized search: Ontology based personalized search is proposed in [PG99]. The approach studies different ways to model the user's interests (the model is also called a profile). The user profile stores the content of a page, the length of the document and the time spent on the web page. When certain pages are visited again and again, this is taken as the user's interest in that particular subject. The contents of pages are characterized using a hierarchy of concepts, i.e. an ontology. For the experiments, the Magellan hierarchy of 4400 nodes is used. Each node of the hierarchy represents a set of documents that represents the particular content of a page.

Web search personalization with ontological user profiles is presented in [SMB07]. The ontology is defined as an explicit specification of concepts and the relationships that can exist between them. In the proposed approach, user profiles are built on the ontology by assigning interest scores to existing concepts in a domain ontology. The interest score of a concept is updated (depending on the user's continuing behavior) using the Spreading Activation algorithm. The ontological user-profile approach successfully addresses the cold-start problem, because a domain ontology exists from the beginning. When a user issues a query, the initial user behavior is matched against the existing concepts in the domain ontology and the relationships between these concepts. A method for ontology based personalized search that uses a weighted concept hierarchy to construct the user profile is presented in [GCP03]. These automatically created ontology-based user profiles reflect user interests reasonably well and are used to personalize search results.
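The way interest scores can propagate through a concept hierarchy is sketched below with a generic spreading activation routine. This is not the exact formulation used in [SMB07]; the function name, the decay factor, the activation threshold, the edge weights and the small concept graph are all hypothetical and chosen only to show the mechanism.

```python
from collections import deque


def spread_activation(graph, scores, matched, decay=0.5, initial=1.0, threshold=0.1):
    """Return updated interest scores after activating the matched concepts.

    graph: dict mapping a concept to a list of (neighbor, weight) pairs.
    scores: current interest score per concept (left unmodified).
    matched: concepts matched by the user's current behavior.
    """
    updated = dict(scores)
    activation = {concept: initial for concept in matched}
    for concept in matched:
        updated[concept] = updated.get(concept, 0.0) + initial
    visited = set(matched)
    queue = deque(matched)
    while queue:
        concept = queue.popleft()
        for neighbor, weight in graph.get(concept, []):
            if neighbor in visited:
                continue
            # Activation weakens with each hop and with weaker relationships.
            passed = activation[concept] * weight * decay
            if passed < threshold:
                continue
            visited.add(neighbor)
            activation[neighbor] = passed
            updated[neighbor] = updated.get(neighbor, 0.0) + passed
            queue.append(neighbor)
    return updated


# Hypothetical fragment of a domain ontology.
ontology = {
    "Football": [("Sports", 1.0), ("World Cup", 0.8)],
    "Sports": [("Football", 1.0), ("Tennis", 1.0)],
}
profile = spread_activation(ontology, {}, ["Football"])
print(profile)
```

In this sketch a query matching the concept Football also raises the scores of related concepts such as Sports and, more weakly, Tennis, so the profile contains useful scores even for concepts the user has not visited directly.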
g) Web usage mining approach: Web usage mining is the process of discovering interesting patterns from web usage data [CMS97, PPPS03]. It has been used in a number of web personalization applications. As explained in section 3.3, the web usage mining process for web personalization consists of preprocessing, pattern discovery and recommendation phases. Various data mining techniques, such as clustering, association rule mining and sequential pattern mining, can be used to discover interesting patterns. The discovered patterns are then used for recommendation. For example, a recommender system has been developed using association rule discovery techniques [FBH00]. The system mines the navigation history of a user to make recommendations. Mobasher et al. [MDLN01] proposed a technique for web personalization using web usage mining based on association rule discovery. The technique is found to be promising with respect to effectiveness of personalization and scalability when compared with collaborative filtering techniques such as the kNN (k-nearest-neighbor) approach. A framework has been designed for web personalization using sequential and non-sequential patterns discovered from usage data [MDLN02]. The experimental results indicate that contiguous sequential patterns are more useful in some applications, such as web prefetching. In [MCS99], web usage mining is applied to web personalization using an effective technique based on usage-based clustering and association rule discovery to capture common user profiles. The authors used real web usage data for the performance evaluation, and the technique is found to be promising. The efficient sequential access pattern mining algorithm known as CSB-mine is used effectively in a web recommender system for recommending personalized services [ZHF06].

Web personalization using web usage mining is a promising approach compared with other traditional approaches (e.g. content-based filtering and collaborative filtering), and it can achieve more accurate personalization [Zho06]. In most research, web usage mining is used for web personalization on a particular web site, while the focus of this research is on web search personalization using web usage mining with a sequential pattern mining algorithm.

3.8 Requirements of Personalized Search

Personalization techniques form a heterogeneous group and cannot be compared directly with one another, since they generally have different aims. The following is a list of requirements for an ideal personalized search system [KL05].

a) User data collection method: There are two types of data collection method, explicit and implicit. An ideal system uses an implicit collection method. The proposed architecture collects data implicitly by exploiting web usage data from proxy server logs.
b) Profile storage: This concerns where the user profile is stored, i.e. on the user's machine or on a server. In the proposed architecture the user profile is created and stored in memory dynamically, i.e. at run time. This reduces the overall I/O time.

c) Adaptivity: This concerns the system's automatic adaptation to changes in the user's preferences over time. The proposed architecture updates the user profile every time he/she issues a query and hence generates improved recommendations.

d) Profile construction: This concerns how the user profile is constructed. The proposed architecture constructs the user profile from preprocessed web logs whenever a user issues a query.

e) Profile data: This concerns the data stored in the profile, i.e. exactly which data it holds. The proposed architecture stores the IP address and user agent for user identification, the query, and the user's browsing patterns, including long-term and short-term browsing history.

f) Personalization method: There are various types of personalization methods. The proposed architecture uses the web usage mining approach.

g) Algorithm used: This concerns the algorithm used for mining. The proposed architecture uses an efficient sequential access pattern mining algorithm based on the CSB-mine algorithm.

h) Interface: This concerns the presentation of results to the user. The interface could be a mobile interface, a browser-based interface or a customized client application. The proposed architecture uses a browser-based interface.

3.9 Summary

As the information on the WWW grows exponentially, finding relevant information according to the user's interests and needs is a challenging issue. Most search engines return the same kind of results every time, based on keywords and without considering the client's needs. The user is presented with a number of URLs in which to locate what he/she needs, so more time and effort are required to obtain the required information. Web search personalization is the solution to this problem. There are various techniques for web search personalization, as described in section 3.7. Most machine learning based approaches suffer from the binary classification problem in both methods of learning the user's interests, explicit feedback and implicit feedback. The collaborative filtering-based recommender system mainly suffers from the problem that the user has to provide some information explicitly.
The major problem with the content-based recommender system lies in analyzing the contents of web pages and finding similarities between them. The rule-based recommender system has its main difficulty in the construction of exact rules, and the user profile is generally constructed explicitly with the user's involvement. Hyperlink-based personalized systems do not make clear whether search results satisfy the user's needs, i.e. whether different search results are presented to different types of users. In link personalization of web site, user's involvement is
