Automatic Recommender System over CMS stored content

Erik van Egmond (6087485)

Bachelor thesis
Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. Evangelos Kanoulas
Institute for Language and Logic
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 25, 2015

Abstract

Visitors browse sites on the World Wide Web looking for content that can help them achieve their task: finding particular information, pages about their interests, or products to buy. Webmasters want visitors to find that content as quickly and easily as possible, which can be achieved by recommending content. When this is done manually, webmasters need to spend a lot of time on the recommendations. This thesis explores several options for automating the process by examining content recommenders offline and in situ. The recommender systems used in this project are a content-based, a visitor-based, and a personalized recommender. In online evaluation the content-based recommender proved to perform better than the manually selected recommendations. In offline evaluation the visitor-based recommender was a great improvement over the content-based recommender, and the personalized recommender a small improvement. It can thus be concluded that the usage of automated recommender systems can prove useful for webmasters.

Contents

1 Introduction
2 Relevant work
3 Method & Approach
   3.1 Establishing the golden standard
   3.2 Content-based recommendations
   3.3 Visitor-based recommendations
   3.4 Personalized recommendations
4 Results & Evaluation
   4.1 Evaluation methods
      4.1.1 Comparing to the golden standard
      4.1.2 Click Through Rate
      4.1.3 Hypothetical Click Through Rate
      4.1.4 Team-Draft Interleaving
      4.1.5 Personalized recommendations
   4.2 Evaluation of the results
      4.2.1 Performance of the golden standard
      4.2.2 Performance of the content-based recommender
      4.2.3 Performance of the visitor-based recommender
      4.2.4 Performance of the personalized recommender
      4.2.5 Team-Draft Interleaving: Content-Based vs. Golden Standard
5 Conclusion & Discussion
References

1 Introduction

The World Wide Web is an information system mainly composed of interlinked hypertext documents (called web pages), which are primarily text documents formatted and annotated with the Hypertext Markup Language (HTML). In contrast to hand-written and hand-maintained static HTML pages, most websites currently use a Content Management System (CMS), which stores and manages the content of a website and generates the HTML page when the content is requested. This enables webmasters to maintain a website without experience in building one.

By using a CMS, content is deployed to the web and becomes accessible to visitors who come to a site in order to find specific content matching their interests or information needs. The webmaster can assist visitors by recommending other pages depending on the page a visitor is currently viewing.

Manual recommendations have many shortcomings; in particular, they are inefficient to build. To be able to provide a recommendation, content information and user context are required for all available pages and all visitors of a site. Smaller sites, such as a blog with fewer than a hundred documents, do not encounter this problem. However, many sites have a huge number of pages; in the case of a site with thousands of documents managed by multiple people, it is no longer possible to take all that content into account. Furthermore, hand-picked documents become outdated: as time passes, new content is added to the site that may be more relevant than older documents. If the webmaster wants to account for this problem, all recommendations have to be updated periodically. Finally, manual recommendations are tailored towards the specific content the webmaster has in mind and are not personalized to the interests of the visitors, so every visitor receives the same recommendations. As each visitor has different goals on the site, the webmaster cannot ensure that each visitor gets the best recommendation possible.

On the World Wide Web, recommender systems are an omnipresent phenomenon, encountered on a wide variety of websites. On e-commerce sites such as Amazon and eBay, recommender systems provide suggestions for related products, with a clear goal: to sell more items. In addition, it is simple to measure an appreciated suggestion: the product is bought. The entertainment industry has a large influence on recommender systems; sites like Netflix, Spotify and YouTube rely on recommender systems to keep visitors consuming their content. Netflix in particular has had a big impact with its Netflix Prize between 2007 and 2009, which encouraged research on recommender systems on behalf of its movie recommendation system. When searching the web using a search engine like Google, Bing or Yahoo, a partially entered query receives several suggestions for what the entire query could be; these are provided by a recommender system.

In this thesis, solutions to the aforementioned shortcomings are explored by automating the recommendation task. To achieve this goal, the following question is examined: how can high-quality content recommendations be generated based on available content and user interests within the context of a single site? This question requires the generated recommendations to be of high quality; to evaluate this, several evaluation methods are explored.
This thesis proposes solutions to this problem by comparing a number of recommender systems and their performance offline and in situ (in the real-world operational setting). In this research project four methods of recommending pages are proposed, implemented, evaluated, and analysed: (a) manual recommendations composed by a person who manages the site; (b) content-based recommendations generated from document similarity, calculated using the TF-IDF scoring function, between the page a visitor is currently looking at and all other available pages; (c) visitor-based recommendations produced on the basis of the behavior and trails of all the visitors coming to the site; and finally (d) personalized recommendations using user profiles to find pages that are of interest to the specific visitor.

Several methods of evaluation are implemented and used to assess the quality of the recommender systems, both offline and online. In the offline evaluation the recommenders can be compared to the manual recommendations (the gold standard set) and to each other. Online, recommendations are evaluated on the basis of clicks in a within-subject manner. In order to eliminate as much variability as possible from the comparison, the state-of-the-art Team-Draft Interleaving method is used.

In Section 2 this thesis is positioned in the field by examining several related studies. Subsequently, several recommender systems are discussed in Section 3. The evaluation of these systems is explained in Section 4.1 and explored in Section 4.2. Finally, the thesis is concluded in Section 5.

2 Relevant work

A considerable amount of previous work has examined how related content can be found for a certain user. These studies cover different approaches; commonly a user profile is combined with a content-based recommender, but they differ more in the models that are used.

The article by Lops, De Gemmis, and Semeraro (2011) provides a good overview of the topic of content-based recommenders. First the content is analyzed, then user profiles are created, and lastly a filter uses the user profile to recommend content. The paper covers two state-of-the-art content-based recommenders, a keyword-based Vector Space Model and semantic analysis using ontologies. For user profiling, probabilistic methods, Naive Bayes, and semantic analysis are discussed. These methods provide a good starting point for this research project.

Khribi, Jemni, and Nasraoui (2008) describe how a recommender can be leveraged in an e-learning environment. Initially it creates user and content models offline based on previously obtained data and the relations between pieces of content. Once the system goes online it can extract the profile of a single user and recommend new content. This approach has a cold-start problem, as data is already required before recommendations can be made. By building a topic model of the documents, recommending can start solely from the content. Such a model can be made by utilizing latent Dirichlet allocation (Wilson, Chaudhury, & Lall, 2014).

Temporal changes in interest are an important aspect of recommending content (Blanco-Fernández, López-Nores, Pazos-Arias, & García-Duque, 2011). If a user has recently consumed content on a particular topic, the user may now want something different. Wang and McCallum (2006) discuss an approach that captures the change of topics over time using a non-Markov continuous-time model of topical trends.

This could be applied to the documents visited by a user to see how interests change over time. Common shifts in interest could then be used to recommend content to other users.

The previous works discuss different approaches to the models used to recommend content. Once the models are built and implemented, they must be assessed on their quality. For this purpose Team-Draft Interleaving was introduced by Radlinski, Kurup, and Joachims (2008); it provides an unbiased method to evaluate the performance of two ranked lists in situ.

3 Method & Approach

The recommender systems that will be discussed use a collection of pages, C. This collection contains all pages that are available in the CMS. A recommender system is usually a function Recommender(d, n) = {r1, r2, ..., rn}, where d is a document in C, n (optional) is the number of desired recommended pages, and the ri are pages from C \ {d}; the recommender should not recommend the input page. In the following sections several different recommender systems are discussed: (a) manual recommendations; (b) content-based recommendations; (c) visitor-based recommendations; and finally (d) personalized recommendations, which take the visitor's history instead of the current page as input.

3.1 Establishing the golden standard

The gold standard for Recommender(d, n) is a manually selected set of pages {r1, r2, ..., rn}. This gold standard will be used in the evaluation process. For a page d, an expert in the content selects a number of pages from C \ {d}. This results in a data set where, for most documents d in C, a number of hand-picked documents exists that are considered a guarantee for a good recommendation. To retrieve the manual recommendations only the current page is needed as input, ManualRecommender(d) = {r1, r2, ..., rn}; the actual number of manual recommendations is decided by the webmaster, so it cannot be set through this function. Note that the manual recommendations have the shortcomings discussed above, which means they are not necessarily the best recommendations possible.

3.2 Content-based recommendations

When recommending pages, the ideal would be to read the mind of the visitor, which is unfortunately impossible. The most direct piece of information available is the content the visitor is currently looking at. The assumption is that a visitor might also be interested in similar content. The content-based recommender uses the content available in the CMS to provide recommendations. This system recommends new pages based on their similarity to the document that is currently being viewed, defined by the function ContentRecommender(d, n) = {r1, r2, ..., rn}. To retrieve similar documents, TF-IDF is used. The terms used for TF come from all text-based properties of a document, such as the title and body text. The pages with the highest similarity to the current page are recommended to the visitor. The similarity between two documents is calculated using the cosine similarity, i.e., the angle between the two document vectors. Two documents with similar relative word frequencies yield a small angle, even if the lengths of the documents vary greatly.
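To make this concrete, below is a minimal sketch of such a content-based recommender using scikit-learn; the page fields and the helper name content_recommender are illustrative assumptions, not the implementation used in this project.

```python
# Minimal sketch of a content-based recommender: TF-IDF over the text
# properties of each page, cosine similarity to rank the candidates.
# Page structure and names are assumptions for this example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = [
    {"title": "Welcome", "body": "About our company and its mission."},
    {"title": "Recommender systems", "body": "TF-IDF and cosine similarity."},
    {"title": "Site search", "body": "Indexing, TF-IDF ranking and queries."},
]

# All text-based properties of a document (title, body text) form its terms.
texts = [p["title"] + " " + p["body"] for p in pages]
tfidf = TfidfVectorizer().fit_transform(texts)

def content_recommender(d, n=3):
    """Return the indices of the n pages most similar to page d, excluding d."""
    sims = cosine_similarity(tfidf[d], tfidf).ravel()
    ranked = sims.argsort()[::-1]
    return [i for i in ranked if i != d][:n]

print(content_recommender(1, n=2))  # the page sharing TF-IDF terms ranks first
```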

3.3 Visitor-based recommendations

Using the content of a single page provides only a small insight into what the user might want to read next. To improve the predictions for the current visitor, the behavior of all visitors can be analyzed: from the collective behavior, predictions can be made about which content is related to the current document. Therefore, in the visitor-based recommendations the behavior of all visitors is considered.

For each page d ∈ C a collection is created of all pages that are visited in combination with page d. Two pages are visited in combination with each other when they occur in the same session. A session is defined as a set of at least two pages visited by one unique visitor with no more than 30 minutes between consecutive requests. Each visitor has a list of pages and timestamps, [(p1, t1), (p2, t2), ..., (pn, tn)], and each session is a list of pages where the time between consecutive requests is less than 30 minutes; sessions themselves can last longer than 30 minutes, as long as pages keep being requested by the visitor. To exclude bots (visitors that are not human) from the sessions, the length of a session is limited to 50 requests.

Each page d ∈ C thus has a collection of other pages with their co-occurrence counts, from which recommendations can be retrieved. A straightforward way would be to recommend the page with the highest count, i.e., the page that most visitors visited in combination with the current page. The problem is that very general pages have high counts; almost all visitors have visited the home page as well. However, recommending the home page is not useful even if the visitor has not visited it in the current session. To prevent this problem, TF-IDF is used again: in this approach a "document" is a page together with its co-occurrence collection, and a TF-IDF score is calculated for each page in the collection. The recommender can then retrieve the pages with the highest similarity score for the current page; VisitorRecommender(d, n) = {r1, r2, ..., rn}.
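As an illustration of the session rules above (a gap of 30 minutes or more between requests ends a session; sessions of fewer than 2 or more than 50 requests are discarded), here is a small sketch; the log format is an assumption for the example.

```python
# Split one visitor's request log into sessions, per the rules above.
# The (page, timestamp) log format is an assumption for this sketch.
from datetime import timedelta

GAP = timedelta(minutes=30)

def sessions(log):
    """log: list of (page, datetime) pairs, sorted by time."""
    out, current = [], []
    for page, t in log:
        if current and t - current[-1][1] >= GAP:
            out.append(current)
            current = []
        current.append((page, t))
    if current:
        out.append(current)
    # Keep 2..50 requests: singletons carry no co-occurrence information,
    # and longer runs suggest bot traffic.
    return [[p for p, _ in s] for s in out if 2 <= len(s) <= 50]
```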

3.4 Personalized recommendations

Analyzing the collective behavior does not capture individual preferences: a recommendation that is good for one visitor might be uninteresting for another. To improve the individual recommendations, the individual behavior must be analyzed. Unlike the previous three recommendation methods, the personalized recommender does not take the current page into account. Instead, the known visited pages of the current visitor are considered: for each visitor the visited pages are clustered using k-means. The centroids of the clusters represent the typical distribution of words for documents in that cluster, and this distribution is used to generate pseudo-documents. For these pseudo-documents, similar documents from the CMS can be retrieved using the previously discussed content-based recommender.

Pseudo-documents are generated from the centroid distribution rather than by combining all content; the combination of many documents would become a very large document that some systems truncate to save memory. Normally this is not a problem, as a document is usually about one subject, and the effect of truncation on the distribution is acceptable. But if, after the truncation, many visited documents are missing, the recommender will not work as intended. The number of clusters is configurable; one cluster results in one pseudo-document for all the visited content, while more clusters create a simple form of a topic model. Without any modification to this approach, two visitors with the same behavior receive the same recommendations.
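A minimal sketch of the pseudo-document idea follows, assuming bag-of-words vectors for the visited pages: k-means centroids then serve as word distributions from which a pseudo-document is formed. The vocabulary setup and names are assumptions for the example.

```python
# Cluster a visitor's pages with k-means; each centroid is a typical word
# distribution, whose top terms serve as a pseudo-document.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

visited = [
    "football results league scores",
    "league transfer news football",
    "healthy recipes dinner ideas",
]
vec = CountVectorizer()
X = vec.fit_transform(visited).toarray().astype(float)

k = 2  # the number of clusters is configurable, as noted above
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
vocab = vec.get_feature_names_out()

for centroid in km.cluster_centers_:
    top_terms = vocab[np.argsort(centroid)[::-1][:3]]
    pseudo_doc = " ".join(top_terms)
    # pseudo_doc would now be handed to the content-based recommender
    print("pseudo-document:", pseudo_doc)
```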

4 Results & Evaluation

4.1 Evaluation methods

Several methods of evaluating the data are used to analyze the effects of the constructed recommender systems. All recommender systems that use the current page as input are compared to the golden standard. Furthermore, the content-based recommender is evaluated using Team-Draft Interleaving by comparing, again, to the golden standard. Besides the comparison to the manual recommendations, the Click Through Rate is analyzed to determine whether the recommendations proved useful; these methods provide real-world data on the performance of the recommender systems. Finally, the recommenders are evaluated by a variant of the Click Through Rate. Only pages that occur in a session are considered in the evaluation; it is presumed that if a page is not in a session, the visitor had no intention of visiting it.

4.1.1 Comparing to the golden standard

Both the content-based and the visitor-based recommender systems are evaluated by comparing to the gold standard. For each document d ∈ C a relatively high number of recommendations (10) is retrieved. Normally ten recommendations would be too many for a user due to a cluttering effect on the site (the average in use is just over 3). However, this larger number enables the calculation of recall and precision to a greater depth and can provide better insight into the performance of the different systems. Recall is defined by Equation 1 and precision by Equation 2. The precision and recall can be plotted in one plot, with the recall on the x-axis and the precision on the y-axis, as seen in Figure 1. Generally the precision starts high at low recall and slowly drops while the recall increases. This plot is generated by calculating the precision and recall for one to ten recommendations separately and plotting the results.

Recall = |ManualRecommender(d) ∩ Recommender(d)| / |ManualRecommender(d)|    (1)

Precision = |ManualRecommender(d) ∩ Recommender(d)| / |Recommender(d)|    (2)

For all requests with manual recommendations, new recommendations are generated based on the recommendation model. This is done per request, rather than per page, to make sure that pages that are visited more often weigh more heavily: if a page that is only visited once had the same weight as a page that is visited 100 times, the results would not be realistic compared to a real-life setting.
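As a worked illustration of Equations 1 and 2, the following computes recall and precision at a cutoff k against the manual set; the function and variable names are assumptions for the example.

```python
# Recall and precision of generated recommendations against the manual
# (gold standard) set, per Equations 1 and 2 above.
def recall_precision(manual, recommended, k):
    top_k = set(recommended[:k])
    hits = len(manual & top_k)          # |ManualRecommender ∩ Recommender|
    return hits / len(manual), hits / len(top_k)

manual = {"p2", "p7", "p9"}
recommended = ["p7", "p1", "p9", "p2", "p5", "p3", "p8", "p4", "p6", "p0"]
# Evaluating at depths 1..10 traces out the Precision-Recall curve.
print(recall_precision(manual, recommended, k=3))  # ≈ (0.67, 0.67)
```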

4.1.2 Click Through Rate

As the measure determined by comparison to the gold standard might not give the best scores to the best recommenders (the manual recommendations themselves might not be optimal), other methods of evaluation must be explored. The first of these methods is the Click Through Rate (CTR), which is the percentage of impressions that resulted in at least one click. One impression is one time a set of recommendations is provided to the visitor. Clicks are determined by the visit of one of the recommended pages after the current page within the same session; so when p1 is recommended and p1 is visited after the current page, a click is registered.

4.1.3 Hypothetical Click Through Rate

As not all recommenders can be evaluated in a live environment with actual visitors, an alternative to the Click Through Rate is proposed. Using a method similar to determining the CTR, a hypothetical CTR (hCTR) can be calculated: instead of actually providing the recommendations to the visitor, recommendations are generated for each page in the request logs that also has manual recommendations, and these are evaluated on possible clicks. It is assumed that when a visitor visited a page that would have been suggested, the recommendation would have been a good one. Hence, this method provides a way to evaluate the generated recommendations without the need for a deployment.

4.1.4 Team-Draft Interleaving

The previous methods assume that one recommender provides several pages and that the visitor will pick one of them. This provides results that can explain which system recommends the best pages; however, it is not possible to directly compare recommenders. Even when one recommender performs well, it is unknown whether another recommender can outperform it. To be able to determine which system has the preference of the user, both systems should be presented at the same time; a state-of-the-art approach for this is Team-Draft Interleaving.

Team-Draft Interleaving is a method that merges two ranked lists, each generated by a recommender, based on random coin flips. Each time a page is clicked, the system that originally recommended this page receives a point. Over time, the recommender with the most points has received the most clicks and is perceived as the better recommender of the two.

Suppose there are two generated lists of recommendations, A = {p1, p2, p3, p4} and B = {p4, p2, p5, p6}, as seen in Table 1. A simple method of interleaving would be to take a page from A, take a page from B, take a page from A, and so on, skipping pages that were already added. This would always result in the following list: L = {p1, p4, p2, p5}. The problem with this list is that page Ai is always one rank higher in the merged list than page Bi, which can result in a bias towards recommender A. To overcome this problem a stochastic variation is introduced. Each round a coin is flipped; if it is heads, list A is merged before B, if it is tails the reverse. An example sequence of coin tosses would be first tails and then heads, which results in the merged list L = {p4, p1, p2, p5}: the first item added is p4 from B, followed by p1 from A; as the second coin is heads, p2 is added from A; now an item from list B must be added, but the next item, p2, was added earlier, so the next item from the list is chosen: p5.

Rank   List A   List B   Simple Interleave   TDI (TH)   TDI (HT)   TDI (TT)   TDI (HH)
1      p1       p4       p1                  p4         p1         p4         p1
2      p2       p2       p4                  p1         p4         p1         p4
3      p3       p5       p2                  p2         p2         p2         p2
4      p4       p6       p5                  p5         p3         p3         p5

Table 1: Example of interleaving with 2 teams

Using this approach, page Ai is before Bi in 50% of the merges, which cancels the bias towards either one of the lists. With the bias removed from the merged list, the recommenders have to be scored to be able to evaluate the results. The scoring of the results from Team-Draft Interleaving can be done by simply counting each hit. A click on a link can be counted in several ways: (a) each item is assigned to the team in which it has the highest occurrence; (b) each item is assigned to every team in which it occurs; (c) each item is assigned to every team in which it occurs, but counted half if it occurs in both. After evaluation the recommender that is able to recommend the most pages should have the highest score, hence option (b) is chosen. Option (c) is not chosen as it devalues pages that are recommended by both; however, that is not relevant in this case.
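The interleaving itself can be sketched in a few lines; this follows the round-based coin-flip scheme described above, with the data structures as assumptions for the example.

```python
# Team-Draft Interleaving sketch: each round a coin decides which list goes
# first; each list then contributes its highest-ranked page not yet merged.
import random

def team_draft_interleave(a, b, coin=lambda: random.random() < 0.5, k=None):
    merged, team_of = [], {}
    remaining = set(a) | set(b)
    while remaining and (k is None or len(merged) < k):
        first, second = (a, b) if coin() else (b, a)  # True (heads): A first
        for team in (first, second):
            page = next((p for p in team if p not in team_of), None)
            if page is not None:
                team_of[page] = "A" if team is a else "B"
                merged.append(page)
                remaining.discard(page)
    return (merged[:k] if k else merged), team_of

# The coin sequence tails, heads reproduces the TDI (TH) column of Table 1.
flips = iter([False, True])
merged, team_of = team_draft_interleave(
    ["p1", "p2", "p3", "p4"], ["p4", "p2", "p5", "p6"],
    coin=lambda: next(flips), k=4)
print(merged)  # ['p4', 'p1', 'p2', 'p5']
```

A click would then be credited according to scoring option (b) above: to every team whose original list contains the clicked page.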

4.1.5 Personalized recommendations

Evaluating the personalized recommender is a challenge, as within the current scope of the project there is no option to test the recommender on actual visitors. Therefore the recommender is evaluated on data that has been gathered before. The evaluation verifies whether the recommender can recommend documents that the visitor would have visited on their own: if a visitor visits such a page, the recommender scores a hit, otherwise a miss. To verify using this method, all sessions of all visitors are gathered. The training set consists of the first 70% of the sessions of each user; the test set is the remainder of the sessions. For each visitor, using the first 70% of the sessions, a list of recommended pages is generated. Subsequently, the recommendations are verified using the remaining 30%: for each recommended page a hit is counted if that page was visited in the remaining 30% of the sessions, and a miss is recorded if it was not. In the evaluation of the personalized recommender only visitors with more than 10 requests are considered, so that there is enough data to create the topic models for the user.
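A small sketch of this hit/miss evaluation follows: recommendations built from the first 70% of a visitor's sessions are checked against the pages actually visited in the remaining 30%. The data layout and the injected recommend function are assumptions.

```python
# Hit/miss evaluation of the personalized recommender, per the 70/30
# split described above. Data layout and names are assumptions.
def evaluate_visitor(sessions, recommend):
    """sessions: chronological list of sessions (lists of page ids);
    recommend: callable building recommendations from training sessions."""
    split = int(len(sessions) * 0.7)
    train, test = sessions[:split], sessions[split:]
    recs = recommend(train)
    seen = {page for session in test for page in session}
    hits = sum(1 for page in recs if page in seen)  # recommended and visited
    return hits / len(recs) if recs else 0.0
```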

4.2 Evaluation of the results

The recommenders examined in this thesis are evaluated following the methods discussed in Section 4.1. For all automated recommenders, three recommendations per page are used to calculate the recall of the manual recommendations, the hCTRs and the CTRs, as this is approximately the average number of recommendations provided by editors. See Table 2 for a quick overview of the results; the remainder of this section goes into more depth. As far as applicable, each recommender is evaluated on the comparison against the golden standard (Section 4.1.1), the hypothetical Click Through Rate (Section 4.1.3) and the Click Through Rate (Section 4.1.2). The personalized recommender has a different approach to calculating the CTR, as described in Section 4.1.5. Finally, Team-Draft Interleaving is used to compare the content-based recommender directly with the manual recommendations, see Section 4.1.4.

Recommender                  Recall of manual rec.   Hypothetical CTR   Online CTR
Manual                       100%                    -                  14.33%
Content-based                11.2%                   5.1%               21%
Visitor-based                30.3%                   44.7%              -
Personalized (1 cluster)     -                       9.6%               -
Personalized (2 clusters)    -                       13.9%              -
Personalized (3 clusters)    -                       11.7%              -
Personalized (4 clusters)    -                       11.7%              -
Personalized (5 clusters)    -                       12.3%              -

Table 2: Summary of the results

4.2.1 Performance of the golden standard

Initially the performance of the golden standard must be established; this performance can later be used to compare the automated recommendations against. During a period of one month there were 5340 impressions with a total of 17565 pages recommended, an average of 3.3 recommendations per page with a recommendation block. This average is used as the default number of recommendations for the other methods. During this period 765 impressions resulted in at least one click, which is 14.33% of the impressions. In total 966 recommendations were clicked, a click through rate of 5.5%.

4.2.2 Performance of the content-based recommender

The content-based recommender is evaluated both online and offline, which provides a way to compare online evaluation with offline evaluation.

Comparison against the golden standard: First the content-based recommendations are compared to the manual recommendations. There are 5226 pages with recommendations, with a total of 14604 recommended pages, of which 1631 are among the manually selected pages as well; this is a recall of 11.2%. Using the precision and recall, a Precision-Recall curve can be plotted, see Figure 1 (the plot shows the curve for the visitor-based recommender as well). The curve for the content-based recommender starts with an increasing slope; this is because the first pages recommended are usually not among the manually recommended pages, thus having a low precision.

Hypothetical Click Through Rate: For each request with manual recommendations, the content-based recommender provides the automated recommendations. There were 4868 such requests, 247 of which were later visited, an hCTR of 5.1%.

Online Click Through Rate: Over a period of 6 days the content-based recommender made 1736 impressions, of which 369 resulted in at least one click; a click through rate of 21%.

Online, the content-based recommender reaches a high CTR of 21%, an improvement over the manual recommendations. The recall of the manual recommendations of 11.2% suggests that the manual recommendations might not be the best way to suggest content, as only a small portion of these recommendations was also recommended by the content-based recommender.

4.2.3 Performance of the visitor-based recommender

As the visitor-based recommender is trained on the request data and tested on the same request data, the data set is split into a training set of 70% and a test set of 30%. The data used are sessions, as pages are recommended based on co-occurrence in the same session. The total size of the test set is 2980 sessions. Using the training set, a recommendation model is created following the method described in Section 3.3. Once the model is generated, for each session in the test set the pages that should have recommendations are located, and recommendations for them are generated using the model.

Comparison against the golden standard: There are 5227 requests that had manual recommendations, with a total of 13489 recommended pages. For these requests new recommendations are generated based on the recommendation model. Over all impressions, 4092 of the generated recommendations were the same as the manually selected pages, a recall of 30.3%. Using the precision and recall, a Precision-Recall curve can be plotted, see Figure 1 (the plot shows the curve for the content-based recommender as well).

Hypothetical Click Through Rate: For 1405 impressions recommendations were generated, of which 628 received at least one click; an hCTR of 44.7%.

Online Click Through Rate: This recommender was not deployed to a live environment.

As the visitor-based recommender was not deployed, only results from offline evaluation are available; there, a very high hypothetical click through rate of 44.7% was found. The recall of the golden standard was an improvement over the content-based recommender as well. It should be noted that although the hCTR is high, the CTR when deployed will probably be significantly lower; even if the recommendations are perfect, not all visitors will follow them. In Figure 1 the visitor-based curve lies above the content-based curve, indicating a better performance. This is in accordance with the result when only 3 pages are recommended.

Figure 1: Precision-Recall plots for the content- and visitor-based recommenders and their ability to mimic the golden standard

4.2.4 Performance of the personalized recommender

For the personalized evaluation 366 visitors, who have more than 10 requests, are analyzed. For each of these visitors, five topic models are created, consisting of one to five clusters, which are used to recommend pages as described in Section 3.4. Subsequently the recommended pages are evaluated by determining which are actually visited in the test set.

Comparison against the golden standard: The personalized recommender provides recommendations based on previous behaviour of the visitor; the current page is not used to provide recommendations. Hence, a comparison to the golden standard does not make sense.

Hypothetical Click Through Rate: Five different settings were evaluated for the personalized recommendations. The average hCTR is 11.9%; more details can be found in Table 3. That the topic model with one cluster scores lower than the others, while those are quite close to each other, might be an indication that there are usually only 2 topics in the previously visited documents.

K   Impressions   Impressions followed   CTR
1   366           35                     9.6%
2   366           51                     14.0%
3   366           43                     11.7%
4   366           43                     11.7%
5   366           45                     12.3%

Table 3: Results for the personalized recommender

Online Click Through Rate: The personalized recommender was not deployed. Although the only measure used was the hCTR, the results seem to position the performance of this recommender slightly above the content-based recommender.

4.2.5 Team-Draft Interleaving: Content-Based vs. Golden Standard

Lastly the content-based recommender is compared to the golden standard using Team-Draft Interleaving. In a time period of 14 days there were 1723 impressions of the recommendations, in which 6104 pages were recommended by the content-based recommender and 4692 pages came from the golden standard, with a total of 8764 pages, as pages can belong to both teams.

The large difference between the number of pages from the content-based recommender and the gold standard is due to the fact that not all pages with recommendations have pages from the golden standard. After the trial the content-based recommender had won 61% of the impressions and the manual recommendations 39%. Of the recommended pages, 286 were followed from the content-based recommender and 247 from the golden standard, 354 in total; this results in click through rates of 4.7% for the content-based recommender and 5.3% for the golden standard. Most impressions were won by the content-based recommender, which overall had more points. This suggests that the content-based recommendations are better than the manually selected recommendations, which is in line with the results from the separate trials.

5 Conclusion & Discussion

As seen in the results of the manually selected recommendations, the impact of such systems can be very positive, which underlines the importance of good recommendations and the benefit of automating the process. Using several methods this automation has been explored and evaluated.

First the content-based recommender was tested and evaluated, and it yielded promising results. Both in the separate trial and in the Team-Draft Interleaving trial the content-based recommender performed better than the manually selected recommendations. However, in the offline evaluation the recommender performs much worse. This might result from the fact that it almost only registers clicks for content that is manually recommended, of which the recall is only 11.2%.

The visitor-based recommender performs very well in the offline evaluation, with a recall of the manually selected recommendations of 30.3% and an hCTR of 44.7%. A reason for the extreme performance of the visitor-based recommender could be that visitors might not click all good recommendations; it is impossible to reach very high click through rates, so it is to be expected that the actual rates would be lower if the visitor-based recommender were run in situ. Regardless, the offline performance was an improvement over the content-based recommender, which indicates that the performance will be an improvement online as well.

Finally the personalized recommender was tested and evaluated; in this case only the hCTR was calculated, which was higher than that of the content-based recommender, although not as high as that of the visitor-based recommender.

Following the results of the examined recommender systems, it can be concluded that high quality content recommendations can indeed be generated using an automated recommender.

Both the usage of the available content and of user interest show promising results.

A possible improvement for more personalized recommendations would be to shift more towards the visitor-based recommendation by clustering similar users and using their collective behaviour to recommend content, instead of the collective behavior of all visitors or of only one visitor. To improve the conclusions that can be drawn from offline evaluation, the other recommenders, visitor-based and personalized, should be operated online; this would provide insight into how the hCTRs translate to CTRs. The timespan should also be increased from a few days to a few weeks; the number of visitors tends to fluctuate over time, and to capture good measures the trials should last longer. The results of this project are likely to be replicated in other settings: although not all sites rely heavily on written content (such as e-commerce or other entertainment sites), the visitor-based recommender can be used in any setting that generates enough web traffic.

References

Blanco-Fernández, Y., López-Nores, M., Pazos-Arias, J. J., & García-Duque, J. (2011). An improvement for semantics-based recommender systems grounded on attaching temporal information to ontologies and user profiles. Engineering Applications of Artificial Intelligence, 24(8), 1385-1397.

Khribi, M. K., Jemni, M., & Nasraoui, O. (2008). Automatic recommendations for e-learning personalization based on web usage mining techniques and information retrieval. In Advanced Learning Technologies, 2008. ICALT '08. Eighth IEEE International Conference on (pp. 241-245).

Lops, P., De Gemmis, M., & Semeraro, G. (2011). Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook (pp. 73-105). Springer.

Radlinski, F., Kurup, M., & Joachims, T. (2008). How does clickthrough data reflect retrieval quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management (pp. 43-52).

Wang, X., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 424-433).

Wilson, J., Chaudhury, S., & Lall, B. (2014). Improving collaborative filtering based recommenders using topic modelling. In Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Volume 01 (pp. 340-346).