USING CLICKTHROUGH DATA TO OPTIMIZE SEARCH RESULT RANKING

An evaluation of clickthrough data in terms of relevancy and efficiency

Bachelor Degree Project in Informatics, 30 ECTS
Spring Term 2017
Anton Paulsson
Supervisor: András Márki
Examiner: Henrik Gustavsson

Abstract

Search engines are in constant need of improvement, as the rapid growth of information affects their ability to return documents with high relevance. Relevant results get lost between result pages, and search algorithms are exploited to gain higher rankings for documents. This study attempts to mitigate those two issues, and to increase the relevancy of search results, by using clickthrough data as an additional layer for weighting the search results. Results from the evaluation indicate that clickthrough data can indeed be used to obtain more relevant search results.

Keywords: Search Result Reorganization, Clickthrough Data, Information Retrieval, Apache Solr

Table of Contents

1 Introduction
2 Background
  2.1 Search Engines
    2.1.1 Apache Lucene
    2.1.2 Apache Solr
    2.1.3 Sphinx
  2.2 Search Algorithm
  2.3 Clickthrough Data
  2.4 Stack Overflow Data Dump
3 Problem
  3.1 Aim
  3.2 Hypothesis
4 Method
  4.1 Systematic Literature Review
  4.2 Approach
    4.2.1 Relevancy Testing
    4.2.2 Performance Testing
    4.2.3 Research Ethics
5 Implementation
  5.1 Systematic Literature Review
  5.2 Progression
    5.2.1 Solr Implementation
    5.2.2 MySQL Database
    5.2.3 Search Platform
    5.2.4 Clickthrough Data Query Parser Plugin
    5.2.5 Clickthrough Data Algorithm
  5.3 Pilot Study
6 Evaluation
  6.1 The Study
    6.1.1 Efficiency Evaluation
    6.1.2 Relevancy Evaluation
  6.2 Analysis
  6.3 Conclusion
7 Concluding Remarks
  7.1 Summary
  7.2 Discussion
  7.3 Research Ethics
  7.4 Future Work
References

1 Introduction

The need for better and more precise search engines is constantly increasing. Users of the World Wide Web (WWW) need to be able to access information efficiently, but as that need increases, so does the amount of information. According to Chua (2012), the average increase of information available on the Internet is as large as 23% per year, which requires search engines to get better at handling the information that is being indexed. To achieve this, search engines have to adapt to the major information increases by using different search algorithms. The most commonly used algorithms to rank documents are PageRank and Hyperlink-Induced Topic Search, but both of them are outdated and easy to exploit (Pawar & Natani, 2014).

Joachims (2002) suggests that clickthrough data could help search engines rank search results, by looking at which documents are visited the most and which documents users tend to choose. The opinions and reasoning behind the choices users make cannot be collected, but the choices themselves can. This information could be used to increase the relevancy of the search results that are displayed to the user by adding a new layer of document scoring.

One of the major issues with the large amount of information that search engines have to handle is that relevant documents get lost between different pages of results. Osiński (2003) explains how most documents are never visited because users do not navigate further into the result pages. In theory, this could be improved by the kind of system that Joachims (2002) suggests.

In this study an experiment is performed to evaluate whether clickthrough data can be used to increase the relevancy of search results without impacting the efficiency of the search engine to the point where it becomes inefficient. An algorithm based on clickthrough data is developed and implemented in Apache Solr (Apache Solr, 2017). The search engine's index is then populated with a dataset provided by Stack Overflow (Stack Overflow, 2014). To be able to record the clickthrough data, a small search platform is developed in PHP, JavaScript and HTML; the clickthrough data is stored in a MySQL database to allow the algorithm to retrieve it. To evaluate the efficiency of the algorithm, a version of Apache Solr with the clickthrough data algorithm is compared to an original version of Apache Solr. In addition, a quantitative survey is used to address relevancy, as relevancy is not something that can be measured without human judgement.

A pilot study was performed where 20 human testers were asked to perform seven different queries on the search platform. The objective was to see if there were any major time differences between an original version of Apache Solr and a version of Apache Solr with the implemented clickthrough data algorithm. The pilot study showed that on average the clickthrough data algorithm added an additional 5 milliseconds to the full request time.

2 Background

Information is no longer something that is stored in file rooms and cabinets; it is now available all over the world through the World Wide Web. Fukumura, Nakano, Harumoto, Shimojo & Nishio (2003) describe how information is being moved into the digital world, leaving paper behind. Databases are a typical way of storing information; they allow information to be sorted and easily found. The majority of the population does, however, not know how to access information directly from a database, so some sort of graphical interface is needed. This is one of the reasons the demand for high quality search engines is so large. Search engines enable anyone to access the information on the Internet regardless of their computer knowledge. Fakhraee & Fotouhi (2011) describe how the importance of search engines grows as the information available in databases grows, and that the current amount of data is already massive.

2.1 Search Engines

Search engines are tools that require no technical skill and allow users to access information available on the World Wide Web, locating it through keywords and sometimes full-text search. There are several different types of search engines; two of the most common are general search engines and internal search engines.

General search engines are the kind of search engines used every day by the public. Google is the most used general search engine according to NetMarketShare (2017): it is used by 85% of all desktop users and 97% of all mobile and tablet users. Google is a website where users can search through all of the information that Google has indexed and receive a link directly to the source of the information.

Internal search engines allow users to find information that is stored locally on a website. Bian, Li, Yue, Lei, Zhao & Xiao (2015) explain that an internal search engine is used to locate information that general search engines are not able to find, or rank so low that it is very hard to find. For example, a university usually has its own website where documents and information are released to staff and students. An internal search engine is one of the quickest ways of finding such documents and information, while using a general search engine to find the same data could take minutes or hours of browsing.

A search engine needs to handle three different problems with three different parts, according to Singh, Hsu, Sun, Chitaure & Yan (2005). The first part to be created is an index where the information will be stored. The second part is an algorithm that can retrieve the information. The last part is an interface where the information is presented after it has been retrieved.

2.1.1 Apache Lucene

Apache Lucene is an open source text search engine created and open sourced by Doug Cutting. Lucene consists mainly of its index and the possibility to query it for documents.

It has the functionality to index documents as well as an API to allow remote querying of the documents. It is based on Java but allows for usage from several different programming languages, such as Python or .NET (Balipa & Balasubramani, 2015). Lucene offers several different features that allow for the creation of a complete search engine, such as an inverted index to efficiently retrieve documents, a large set of text analysis components, a query syntax that allows for a large number of query types, and a scoring algorithm for weighting of the documents (Smiley & Pugh, 2011).

2.1.2 Apache Solr

Apache Solr is an open source enterprise search server created by the Apache Software Foundation. The server is written in Java and built on top of Apache Lucene. Apache Solr also contains a web administration interface out of the box (see Figure 1). Nagi (2015) describes it as a web application that can be deployed in any servlet container and lists the functionality that Solr adds on top of Lucene:

- XML and JSON APIs
- Hit highlighting
- Faceted search and filtering
- Geospatial search
- Caching
- Near real-time searching of newly indexed documents
- Web administration interface

The additional functions make Solr more versatile than Lucene, which can lead to more research that includes Solr.

Figure 1 - Apache Solr Web Administration Tool

2.1.3 Sphinx

Sphinx is an open source search server created by Sphinx Technologies Inc. It is a full-text search server created with performance and relevancy in mind (Sphinx, 2017). A web administration interface is available through Sphinx tools (see Figure 2).

Figure 2 - Sphinx Tools Admin Panel

2.2 Search Algorithm

The user searches for information by asking the search engine a question, either with simple keywords or with full-text questions. The search engine then analyses the question and tries to find information that is considered related to the search question. Depending on how the search algorithm is built, it will select different answers to the question. The search engine then replies with the most relevant information. Pawar & Natani (2014) explain that one of the key elements of a good search engine is its ranking algorithm, as the relevancy of the search results depends on it. According to Pawar & Natani (2014), the two most common algorithms used to rank web pages are (1) the PageRank algorithm and (2) HITS (Hyperlink-Induced Topic Search).

2.3 Clickthrough Data

When a user completes a search, he or she is presented with a list of alternatives to choose from to continue reading. In most cases a user does not automatically choose the first available option after completing a search; depending on how relevant the title of each option appears, users choose different options. Clickthrough data can be used to change the ranking of the available options depending on which options users tend to choose (Joachims, 2002).

Clickthrough data can record which options a user chooses, as well as the order in which the options were chosen. This can be used to add a new layer to the search engine's weighting, so that the results can be re-ordered depending on the data that the clickthrough system provides.

This means that the search engine would be able to improve the more it is used; in theory, the more users that are using the system, the faster it will improve. Joachims (2002) explains how he implemented a similar system. It allowed the search results to be re-ordered so that the most relevant and interesting documents would slowly move upwards in the ranking, and more relevant information could more easily be found by the users of the search engine.

2.4 Stack Overflow Data Dump

Stack Overflow is an internet forum in the Stack Exchange network containing thousands of topics with technical questions and answers. It is used by more than 6.7 million users and has more than 40 million visitors every month (Stack Overflow, 2017). Every quarter, Stack Exchange publishes a dump of all of the user-contributed content on Stack Exchange. The dump is licensed under Creative Commons BY-SA 3.0, which allows anyone to use it as long as credit is given and posts are linked back to the original posts or users. Each site in the Stack Exchange network is also downloadable individually (Stack Exchange, 2017).

3 Problem

The amount of information handled by search engines increases every year, at the same time as the users' demand for information availability also increases. This requires search engines to become more efficient at handling the information and, at the same time, better at sorting the information that users are requesting. Chua (2012) estimates the growth of the information available on the Internet at an average of 23% per year.

The presentation of data in a search engine is one of the major issues that needs to be addressed, as Osiński (2003) explains that a lot of the information that users want access to disappears within the massive amount of information that exists. It especially becomes a problem as the information is usually displayed as an ordered list split over multiple pages. This creates the problem that users need to formulate a very specific search question to be able to find relevant information in a search result.

The algorithms used to rank documents in search engines are very limited. Pawar & Natani (2014) describe how search engines tend to use only two different implementations of a ranking system. They explain that the most common ways of ranking documents are too restricted to provide a genuinely good way of judging which documents are relevant to a specific search term. They also explain that the PageRank algorithm is easy to exploit, as it is based on how many terms can be associated with the search query. This allows a document creator to abuse the algorithm by entering terms that automatically put a document at a higher rating, even though it may not contain as much relevant information as other documents. In the same way as PageRank can be exploited, so can Hyperlink-Induced Topic Search. It evaluates relevancy depending on how many times the document has been referred to, for example through hyperlinks. As social media posts and similar content are part of the indexing systems as much as any other documents, it is simple to create a large number of hyperlinks to a document through social media. This allows documents to gain a higher score through excessively created social media posts that refer to the document.

According to Joachims (2002), a larger problem for the clickthrough data algorithm was that the system he created was not able to detect the difference between a regular user performing a search query and a user spamming the system. This created a problem with so-called false relevancy, as hyperlinks that were selected a large number of times would slowly get a better ranking in the search engine. This was the effect of basing the ranking only on which results were chosen the most.

3.1 Aim

The goal of this research is to evaluate whether clickthrough data can efficiently be used to increase the relevance of search results. As the user experience depends on the time it takes for a search result to be displayed, the algorithm has to be efficient; otherwise the majority of users will not continue to use it. By using an existing open source search engine and combining it with an algorithm that can rescore documents depending on clickthrough data, tests can be made against an installation of an original version of the same search engine. This allows the algorithm to be evaluated both by measuring time differences and by performing user tests.

Research question: Will a clickthrough data algorithm allow search engines to efficiently display more relevant search results?

To be able to derive an answer to this question, four objectives need to be addressed:

1. Gain knowledge of the domain by examining existing literature.
2. Design an experiment and implement an algorithm to evaluate clickthrough data.
3. Perform the experiment and collect data.
4. Analyse the results and present a conclusion.

3.2 Hypothesis

The hypothesis of this study is that, based on existing theory, a clickthrough data algorithm can be created that displays search results with better precision with respect to the search term than the original algorithm in the search engine. It also argues that the algorithm will add a minor delay to retrieving the results, but that the delay is low enough for the search engine to remain efficient.

4 Method

Evaluation in software engineering relies on three major empirical techniques according to Wohlin et al. (2012): surveys, case studies and experiments. A survey collects data from or about humans in order to draw conclusions from that data. A case study reviews data by looking at an existing method or tool at a corporation or an organization. An experiment tests the system in a controlled environment and manipulates one factor or variable of the studied setting.

The chosen method for this study is an experiment with a quantitative approach. It is the most suitable method, as the study depends on being able to test performance differences through manipulation of the algorithm in a precise and systematic way, which is exactly how Wohlin et al. (2012) define when an experiment is suitable:

"Experiments are launched when we want control over the situation and want to manipulate behaviour directly, precisely and systematically." (Wohlin et al., 2012, p. 16)

In addition to the experiment, data will be collected through a quantitative survey to determine the relevance of the search results, in an approach similar to the framework suggested by Clarke et al. (2008). This is required to be able to answer the research question, as it depends on evaluating the relevance of the search algorithm at the same time as evaluating its efficiency.

As an alternative to performing an experiment, a case study could be executed to answer the research question. A case study would let the algorithm be used in its intended real-world application, which could provide a more accurate answer to the research question. A large problem with performing a case study is, however, the issue of finding an appropriate organization to implement the system at, or one that is already using a similar implementation of relevancy weighting, which could also create legal complications. Another aspect to take into account is that an experiment allows the testing to be performed in a more controlled environment compared to a case study, according to Wohlin et al. (2012).

4.1 Systematic Literature Review

To gain background knowledge of the field, a systematic literature review (SLR) will be performed. The SLR will be based on a set of papers gathered from IEEE Xplore and the ACM portal using a set of search queries (see Appendix A). To be able to filter the papers by their relevance to the field, a set of inclusion and exclusion criteria was defined (see Appendix A). Jalali & Wohlin (2012) showed that an SLR, executed well, is an excellent way of gaining a deeper understanding and improved knowledge of the field. It also allows researchers to avoid stumbling into the same problems that earlier research has already experienced.

4.2 Approach

In this experiment a clickthrough data algorithm will be implemented together with a search engine and tested. The installation of the search engine will index the posts from the dataset that Stack Overflow provides, a test dataset with more than 30 million entries. The testing of the algorithm is split into two parts: (1) measurement of relevancy differences between a basic installation of the search engine and an installation combined with the Clickthrough Data Algorithm (CDA), and (2) measurement of performance differences between a basic installation of the search engine and an installation combined with the CDA.

4.2.1 Relevancy Testing

Joachims (2002) proposes a framework to test the relevancy difference with and without a CDA. It lets users see two different search result lists and choose which list they found most relevant to what they were looking for. Joachims (2002) uses a way of judging relevancy that is very similar to the framework that Clarke et al. (2008) present: the human assessor gives a binary answer for each search result, marking it as either (a) relevant or (b) irrelevant to the search question asked.

The most effective way of measuring relevancy in keyword search systems is conducting a survey, according to Joachims (2002) and Clarke et al. (2008). As relevancy is not something that can be objectively defined, it is important to get a broad audience to contribute to the survey. Humans tend to have different definitions of what is relevant, and it can depend on many different factors; for example, it can depend on how many times the search words are included in the document found, or on how the document is interpreted, which is why it is so important to have a broad selection of people judge the relevancy.

4.2.2 Performance Testing

Wohlin et al. (2012) explain how measurements are used to make judgments based on facts instead of intuition. With measurements you are able to compare and analyze the data, find out where the bottlenecks are, and determine whether resources are being used efficiently. To measure the efficiency of the clickthrough data algorithm, the time differences with and without the CDA will be measured. Two installations of Apache Solr will be run on identical systems, using identical indexes of the Stack Overflow dataset; the only difference is that one will be running with the CDA and one will not. Both systems will be measured during the testing to answer the following questions:

1. How large is the time difference in retrieving the search result for the user with and without the CDA activated?
2. How long does it take for the CDA to run?
3. How does the CDA impact the search engine's efficiency?

4.2.3 Research Ethics

The dataset gathered from Stack Overflow is completely sanitized from any personal information to keep the information anonymous, according to Stack Exchange (2014). An important part is to make sure that the survey data does not contain any personal information either. To know that the entries are unique, some sort of unique identifier needs to be generated, which could be based on IP addresses or something similar. It is also very important to decide what can be considered personal information, as per definition personal information is any kind of data that could be traced to a specific person. Another aspect to take into account is to be sure that the persons included in the study know about and allow usage of the data that they create. If these terms are not accepted, the data will be neither collected nor used. Participants are allowed to quit the survey whenever they want, and any data that such a participant has created will not be used.

To address reliability and repeatability in software testing, all of the implementations, configurations, results, software, hardware specifications and any other resources that are relevant to the experiment will be presented as appendixes, to allow anyone to repeat the experiment and verify the study.

5 Implementation

5.1 Systematic Literature Review

There are many ways of implementing a system that allows this kind of evaluation of a search engine and different algorithms, as there are several programming languages that can be used to accomplish very similar results. To create a test environment suitable for this study, an availability approach was chosen: all of the software, programming languages and platforms used are open source, in order to keep the study as repeatable as possible.

Selecting a search server was not easy; the different systems all have pros and cons. The choice was narrowed down to two options, Apache Solr or Sphinx Search, and the final choice was Apache Solr. The first reason Solr was chosen over Sphinx was licensing: the Apache 2.0 license does not apply any requirements that could limit the usage of the system, whereas Sphinx is licensed under GPLv2, which does apply conditions depending on its usage. Another reason for choosing Solr over Sphinx is the possibility to store very large indexes, as the Stack Overflow dataset consists of more than 30 million documents. According to Khabsa, Carman, Choudhury & Lee Giles (2012), Solr can hold an index as large as 3 billion documents, while there is no documented limit for how large an index Sphinx Search can hold. Inspiration for how to configure Solr was taken from the book Apache Solr 3 Enterprise Search Server by Smiley & Pugh (2011), which describes how to set up Solr and lists a large number of problems that may occur and how to handle them.

To build the web platform and online survey environment, JavaScript is used for the frontend and PHP/MySQL for the backend. The combination of PHP and MySQL allows for a secure connection between the two, as PHP has integrated support for a secure connection to MySQL. All three technologies are open source, which allows for simple and easy usage. The Stack Overflow dataset is partially used to build the document base in the search engine: the information used is the posts created by the users of Stack Overflow, which amount to more than 30 million documents.

5.2 Progression

This section explains the different design choices made and the implementation of scripts, servers and other related parts of the system. The parts are listed in the order in which they were created or installed, along with any obstacles that occurred. The hardware specifications of the server that all of the server-side code and software ran on can be found in Appendix S.

5.2.1 Solr Implementation

Apache Solr was installed with the newest version available at the start of the research, on the 1st of March 2017. Configuration of the Solr core was performed as per Appendix B, Appendix C and Appendix D, which allowed usage of the built-in Data Import Handler.

A problem that occurred during the indexing of the dataset was a lack of heap memory for the Java Virtual Machine. Smiley & Pugh (2011) mention that when scaling up Solr for larger indexes, the default heap memory for the JVM is often simply not enough and may need some fine tuning to minimize the risk of an OutOfMemoryException. The solution was to allow the JVM to use more memory during the import by adding the -m parameter when launching Solr (for example bin/solr start -m 6g). This allowed Solr to use a total of 6 gigabytes of heap memory instead of the default 512 megabytes and solved the errors that occurred during the import of the dataset.

5.2.2 MySQL Database

The MySQL database is where all of the clicks performed by users are stored. It stores the search query, which document was clicked and the number of times the document has been clicked. The database table configuration stayed the same until the late parts of the experiment. The only changes made (see Figures 3 and 4) were an increase of the full-term column size from 40 to 80 characters, an increase of the term column size from 20 to 30 characters, and a change of the default character set to UTF-8 to match the character set of the search platform. The reason for changing the sizes was to be able to fit larger words and search terms in the columns. The table configuration is also listed in Appendix G.

Figure 3 - Initial database table configuration of the Clicks table
Figure 4 - Final database table configuration of the Clicks table
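The clicks table is exposed to the rest of the system through the small PHP scripts listed in the appendixes. Purely as an illustration of the data the table holds, a direct read of the recorded clicks for one search term from Java might look like the following sketch; the connection details and the table and column names are assumptions, while the actual configuration is listed in Appendix G.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class ClickStore {
    /** Returns a map from document id to recorded click count for one search term. */
    public static Map<String, Integer> clicksForTerm(String term) throws Exception {
        Map<String, Integer> clicks = new HashMap<>();
        // Connection details and the clicks(term, document_id, clicks) columns are assumptions.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/searchplatform", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT document_id, clicks FROM clicks WHERE term = ?")) {
            stmt.setString(1, term);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    clicks.put(rs.getString("document_id"), rs.getInt("clicks"));
                }
            }
        }
        return clicks;
    }
}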

5.2.3 Search Platform

The main application of the system is the search platform (see Figure 5). It is the component that ties all parts of the system together. It is developed in PHP and JavaScript, where PHP is used for querying Solr for documents, displaying the documents to the user and exchanging information with the MySQL database; the connection between MySQL and PHP can be seen in Appendix F. JavaScript handles the Ajax requests that store the clickthrough data in the database.

Figure 5 - Search Platform

At the start of the project the idea was to use the PHP extension for Solr (Php.Net, 2016), an object-oriented library for communication between PHP and Solr. However, the library is quite outdated, as it is several versions behind the development of Solr and lacks support for some of the features offered in the Solr API. Instead, a direct connection to the Solr API was used, as presented in Appendix J.

The collection of clicks is performed through JavaScript (see Appendix F) that listens for a click event and then checks whether the user already has a cookie for the same hyperlink and search query combination. If so, no clickthrough data is recorded; if not, an Ajax request sends the data to the MySQL database. The JavaScript code is presented in Appendix K and the PHP file that handles the Ajax request in Appendix E.

5.2.4 Clickthrough Data Query Parser Plugin

Whenever the parameter rq={!cda cdaweight=2 cdadocs=500} is used while requesting documents from the Solr API, the Clickthrough Data Query Parser Plugin is called. It searches the documents that match the query and, if they match, they are re-scored by the Clickthrough Data Algorithm.

The parser also allows a set of optional parameters to be added: the weight of the new score and the number of documents to be re-scored. The weighting is performed by multiplying the click score with the chosen weight. If no weight or number of documents is specified, the algorithm defaults to a weight of 2.0 and re-scores the top 200 documents in the result list. When all of the documents have been checked and possibly re-scored, the list of documents is sorted by the score of each document and sent back to where the request was called from. The full source code is presented in Appendix I.
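The search platform issues its queries to the Solr API from PHP (see Appendix J). As a minimal illustration of the rq parameter described above, an equivalent request could be made from Java roughly as follows; the core name, host and port are assumptions, while the rq value itself is the one used in this section.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CdaSearchRequest {
    public static void main(String[] args) throws Exception {
        // The rq re-ranking parameter from Section 5.2.4; core name "posts" and host/port are assumptions.
        String q  = URLEncoder.encode("windows install", StandardCharsets.UTF_8);
        String rq = URLEncoder.encode("{!cda cdaweight=2 cdadocs=500}", StandardCharsets.UTF_8);
        URI uri = URI.create("http://localhost:8983/solr/posts/select?q=" + q + "&rq=" + rq);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // JSON result list, re-scored by the CDA
    }
}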

5.2.5 Clickthrough Data Algorithm

When the Clickthrough Data Query Parser finds a document that matches the query, the Clickthrough Data Algorithm rescoring is called. The algorithm checks whether the document has any recorded clicks; if it does not, no added score is returned. If it does, the algorithm checks whether the clicks match the query and, if they do, additional scoring is added depending on the rate of the clicks (as seen in Figure 6). The full source code is presented in Appendix H.

Figure 6 - Clickthrough Data Algorithm score method
Figure 7 - First running version of the Clickthrough Data Algorithm score method
Figure 8 - Clickthrough Data Algorithm Constructor

A large issue in the early versions of the clickthrough data algorithm was the added delay in each query. As shown in Figure 7, the reason behind this issue was that a request was made to the click storage table in the database each time the score method was called. At that stage, each query took longer than 10 seconds to retrieve, which simply was not efficient enough. The API request was therefore moved to the constructor of the Clickthrough Data Algorithm class (see Figure 8), and the API was redesigned to respond with the clicks for the query term instead of the clicks for the ID of a single document. By changing the structure of the Clickthrough Data Algorithm class, a single request returns the clicks of all documents for the query instead of one request per document. This made a large impact on the time it takes to perform a query: instead of every query taking more than 10 seconds, it completed in less than 1 second each time.
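The actual plugin and algorithm source code is found in Appendices H and I. To make the logic described above easier to follow, the following is a simplified, illustrative Java sketch (not the thesis code) of the two ideas in this section: the clicks for the query term are fetched once in the constructor, and the score method adds extra score to a document in proportion to its share of the recorded clicks, multiplied by the chosen weight.

import java.util.Map;

/** Simplified sketch of the rescoring logic; not the actual Solr plugin code. */
public class ClickthroughScorer {
    private final Map<String, Integer> clicksForQuery; // document id -> click count for this query
    private final int totalClicks;
    private final float weight;                        // cdaweight, defaults to 2.0

    public ClickthroughScorer(Map<String, Integer> clicksForQuery, float weight) {
        // Fetching all clicks for the query term once, up front, mirrors the fix described above:
        // one lookup per query instead of one lookup per document.
        this.clicksForQuery = clicksForQuery;
        this.totalClicks = clicksForQuery.values().stream().mapToInt(Integer::intValue).sum();
        this.weight = weight;
    }

    /** Returns the document's new score; unchanged when no clicks are recorded for it. */
    public float score(String documentId, float originalScore) {
        Integer clicks = clicksForQuery.get(documentId);
        if (clicks == null || totalClicks == 0) {
            return originalScore;                      // no recorded clicks: score unchanged
        }
        float clickRate = (float) clicks / totalClicks; // share of this query's recorded clicks
        return originalScore + clickRate * weight;      // click score multiplied by the weight
    }
}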

5.3 Pilot Study

A pilot study was performed on the clickthrough data algorithm, comparing the speed at which the documents are presented with and without the algorithm. A set of 20 testers used the search functionality to perform seven different queries. The objective was simply to find out whether there was a time difference between using the added algorithm or not, and how big the gap was, if any.

Chart 1 - Pilot study average request time in seconds

The measurements of the request speeds showed very promising results: there was in fact only a minor difference between using the algorithm and not using it, as shown in Chart 1. On average the Clickthrough Data Algorithm added an additional 0.05 seconds of delay to collect the information compared to a request without the algorithm. However, that result was obtained with a very small set of clicks compared to what the table would contain if the general public had populated it, which means that the algorithm has to be tested against a larger set of clicks before a fair assessment of its performance can be made.

A minor issue found during the testing was that the system was missing a mechanism for finding out whether any document had actually been rescored by the clickthrough data algorithm. This could have added a minor time difference, as the system would try to sort the array of documents even if there was no actual difference between before and after the clickthrough data algorithm had completed its rescoring.

6 Evaluation

6.1 The Study

The experiment was conducted in two parts. The first part was to evaluate the efficiency of the CDA by comparing a version of Solr with the CDA to a version of Solr without the CDA. The second part was to evaluate whether the CDA would increase or decrease the relevancy of the search results, by using an online survey.

6.1.1 Efficiency Evaluation

Search Term           Index Size                     Number of Results
Windows Install       Small (5,000 documents)        25
Java Get Time         Small (5,000 documents)        10
JavaScript Cookies    Small (5,000 documents)        5
Windows Install       Medium (50,000 documents)      134
Java Get Time         Medium (50,000 documents)      36
JavaScript Cookies    Medium (50,000 documents)      10
Windows Install       Large (1,000,000 documents)    5157
Java Get Time         Large (1,000,000 documents)    1268
JavaScript Cookies    Large (1,000,000 documents)    350

Table 1 - Number of results for each search term depending on index size

To determine whether the clickthrough data algorithm was affected by differences in the number of search results, a collection of three queries was chosen (see Table 1). The reason these three search terms were used is that they all differ in the number of search results across the three index sizes. The search term Windows Install had the largest number of results, JavaScript Cookies had the fewest, and Java Get Time was somewhere in between.

Search Term           Index Size                     Search Algorithm
Windows Install       Small (5,000 documents)        Clickthrough Data Algorithm
Windows Install       Medium (50,000 documents)      Clickthrough Data Algorithm
Windows Install       Large (1,000,000 documents)    Clickthrough Data Algorithm
Windows Install       Small (5,000 documents)        Original Solr
Windows Install       Medium (50,000 documents)      Original Solr
Windows Install       Large (1,000,000 documents)    Original Solr
Java Get Time         Small (5,000 documents)        Clickthrough Data Algorithm
Java Get Time         Medium (50,000 documents)      Clickthrough Data Algorithm
Java Get Time         Large (1,000,000 documents)    Clickthrough Data Algorithm
Java Get Time         Small (5,000 documents)        Original Solr
Java Get Time         Medium (50,000 documents)      Original Solr
Java Get Time         Large (1,000,000 documents)    Original Solr
JavaScript Cookies    Small (5,000 documents)        Clickthrough Data Algorithm
JavaScript Cookies    Medium (50,000 documents)      Clickthrough Data Algorithm
JavaScript Cookies    Large (1,000,000 documents)    Clickthrough Data Algorithm
JavaScript Cookies    Small (5,000 documents)        Original Solr
JavaScript Cookies    Medium (50,000 documents)      Original Solr
JavaScript Cookies    Large (1,000,000 documents)    Original Solr

Table 2 - Parameter array for efficiency tests

A set of 18 parameter combinations was measured to determine the efficiency of the clickthrough data algorithm at three different index sizes. The complete set of parameter combinations can be seen in Table 2. Each parameter combination was run for 200 iterations, which resulted in a total of 1800 results with the CDA activated (see Appendix M) and a total of 1800 results without the CDA activated (see Appendix N). To see how the efficiency of Apache Solr is affected, there are three major aspects to evaluate:

1. How the algorithm affects the request speed depending on how many results there are for the search term.
2. How the algorithm affects the request speed depending on the index size of the search engine.
3. How the algorithm affects the request speed in general.

The request speed is the time from when a request is sent from the search platform to Solr until the search platform has retrieved the search results and presented them to the user.
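To illustrate how such a measurement series can be scripted, the following Java sketch times one of the parameter combinations for 200 iterations against each of the two Solr installations; the endpoint URLs, core names and the example query are assumptions, not the exact setup used in the study.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestSpeedTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // One endpoint per Solr installation; URLs, core names and the query are assumptions.
        String[] endpoints = {
            "http://localhost:8983/solr/posts/select?q=windows+install&rq=%7B!cda%7D", // with the CDA
            "http://localhost:8984/solr/posts/select?q=windows+install"                // original Solr
        };
        int iterations = 200; // 200 iterations per parameter combination, as in the study
        for (String url : endpoints) {
            long totalNanos = 0;
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                client.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                            HttpResponse.BodyHandlers.ofString());
                totalNanos += System.nanoTime() - start;
            }
            System.out.printf("%s: %.2f ms average request time%n",
                    url, totalNanos / (double) iterations / 1_000_000.0);
        }
    }
}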

Chart 2 - Comparison chart of request speeds for all of the parameter combinations with standard deviation

To address the first aspect, Chart 2 shows the average measured request time for each of the search terms listed in Table 1. Each of the search terms was run for 200 iterations on each index size. As no other testing was performed on a search term with such a low number of search results, the reason behind the low request time for the search term JavaScript Cookies is uncertain. An educated guess is that it is due to the low number of results for that search term in the small index, a total of 5 search results.

Chart 3 - Comparison of the request speed depending on the index size with and without the Clickthrough Data Algorithm

By combining the results for the search terms on each index size, a fairer assessment of the differences in request speed between the index sizes can be made. Chart 3 displays the average request speed for each index size with standard deviation. The measurements used in Chart 3 are a combination of the three different search terms run on the same index size, for a total of 600 iterations.

Chart 4 - Full report of 1800 executions of the Clickthrough Data Algorithm

Chart 4 displays all of the measurement iterations that were run on the version of Apache Solr with the clickthrough data algorithm. It shows the execution speed of the CDA, which is the time from when the query parser plugin (see Chapter 5, Section 5.2.4) receives the request to start rescoring the documents of the query until it has finished processing all documents and, possibly, re-sorted them.

Chart 5 - Comparison of the full page load time of the search platform, with and without the Clickthrough Data Algorithm

The full page load time of the search platform is the time it takes from when a user has connected to the web server through a web browser until the page has been fully loaded. In the case shown in Chart 5 the page load also includes a query to Solr and the retrieval of the search results; the "Other" part is all of the material that is loaded that has no connection to the search engine itself. The measurement in the chart consists of the average value for each respective part of the page load, iterated 1800 times for each version of Solr.

6.1.2 Relevancy Evaluation

A quantitative survey was created to investigate whether users preferred the ranking that the clickthrough data algorithm created or not. A total of 100 individuals with some prior knowledge of programming were asked to partake in the survey. The survey was closed after 45 respondents had finished it. The reason for asking for prior programming knowledge was that most of the possible search results are related to programming, as Stack Overflow is an online programming community (Stack Overflow, 2017). The computer and programming experience of each participant is shown in Chart 6.

Chart 6 - Years of professional computer/programming experience for all survey participants. Each bar represents a participant of the survey.

After each participant had filled in their experience, they were presented with the search results for a randomly selected search term, displaying the results for both the original version of Solr and the version of Solr with the clickthrough data algorithm next to each other. The participants were then instructed to select the result list that they felt provided the best answer to the search term (see Figure 9). Each participant performed this task twice with different randomly selected search terms, until every search term had been evaluated 15 times. Chart 7 displays all of the selections made by the participants of the survey.

Figure 9 - Survey User Interface

Chart 7 - Survey result list selections

6.2 Analysis

The difference in request speed is the major area of interest, as that is what can be affected the most by the clickthrough data algorithm. The request speed is the time from the moment a query is sent to the API of Apache Solr from the search platform until the data has been retrieved and displayed to the user. After performing a total of 3600 iterations of the measurements, 1800 iterations with and 1800 iterations without the clickthrough data algorithm, the average difference between the request speeds dropped by more than half compared to the pilot study. Chart 8 displays the average request speed for both the version of Solr with and the version without the clickthrough data algorithm. As shown in Chart 8, the time difference is small enough that a difference between the two versions can barely be noticed. On average, the version of Solr running with the clickthrough data algorithm was 2.8 milliseconds faster than the original version of Solr, even after a much larger set of clicks had been recorded compared to the pilot study.

Chart 8 - Comparison of request speed with and without the clickthrough data algorithm with standard deviation

When comparing the average request speeds between different index sizes, regardless of which search term was used, it is clear how little the index size matters (see Chart 3). Comparing the standard deviation for all of the tests, all except the original version of Solr with a large index intersect at approximately 0.34 seconds. This indicates that, on average, the index size does not affect the general efficiency of Solr. No correlation could be found between the clickthrough data algorithm execution time and the index size; see Appendix N for the full report of measurements.

By comparing the full page load (see Chart 5) of the original version of Solr and the version of Solr with the clickthrough data algorithm, it is clearly visible that the only time difference between them is a few milliseconds. The average time it takes for the clickthrough data algorithm to be fully executed is 0.00304 seconds (see Chart 4 for all 1800 iterations), and no other components of Solr should be affected by the clickthrough data algorithm, as it is implemented as one of the last components to be executed in Solr. This is clearly presented in Chart 5, where the time difference between the two versions is barely separable.

Chart 9 - Average survey result list selection

The survey data shows that in most cases the clickthrough data algorithm increases the relevance of the search queries performed (see Chart 9). The only search term for which the original version of Solr was chosen over the version with the clickthrough data algorithm is the term php shorthand (see Chart 7). However, by examining the full survey results (see Appendix O), a correlation between participants with low experience and participants receiving the search term php shorthand is seen in multiple cases, which could have affected the results for that specific search term and thereby also the average list selection.

6.3 Conclusion

By examining Charts 3 and 4 it is clear that the index size of Solr does not affect the speed of retrieving results in any large manner. This indicates that the clickthrough data algorithm can be used at any index size without causing any major loss of time. The only exception in the measurements performed in this study was that a search term with a lower number of hits would, on average, have a lower request speed on the smaller index.

The measurements shown in Charts 3, 4 and 8 indicate that the clickthrough data algorithm does not add enough delay during the retrieval of search results to make the system inefficient. On average, the supposedly added delay was low enough (less than 4 milliseconds) for the version of Solr with the clickthrough data algorithm to have a lower average full page load time than the original version of Solr; it was actually 0.2 milliseconds faster on average. The standard deviations of both versions of Solr in Chart 8 intersect, which shows that on average there is no difference between the two versions and indicates that the algorithm is not affecting the efficiency of the search engine.

When comparing the search results of the version of Solr with the clickthrough data algorithm and the original version of Solr, the survey data shows that the algorithm increases the relevancy of the search results, as shown in Chart 9. Out of the 90 queries that participants performed, even with several participants who had little or no professional experience of programming, the participant preferred the version of Solr with the added algorithm in 67.78% of the search queries. This indicates that the algorithm returns a document scoring that is, according to the survey data, somewhat more accurate.

The hypothesis was approximately correct, as it was possible to create an algorithm based on clickthrough data using the existing theory and literature. By evaluating the algorithm it can be concluded that the algorithm increased the relevance of the search results at the cost of a minor delay. Most online testing tools still consider the page load time of the search platform to be in the top percentiles of the sites registered; for example, GTmetrix (2017) places the search platform among the top 2% fastest of the 182 million pages it has analysed.

7 Concluding Remarks

7.1 Summary

The goal of the study was to evaluate whether clickthrough data could efficiently be used to increase the relevance of search results. This was investigated by performing an experiment where an algorithm was developed and implemented in the search engine Apache Solr. The hypothesis of the study was:

"The hypothesis of this study is that, based on existing theory, a clickthrough data algorithm can be created that displays search results with better precision with respect to the search term than the original algorithm in the search engine. It also argues that the algorithm will add a minor delay to retrieving the results, but that the delay is low enough for the search engine to remain efficient." (Chapter 3, Section 3.2)

This study concludes that clickthrough data can be used to efficiently increase the relevance of search results by implementing a clickthrough data algorithm in Apache Solr. The efficiency evaluation indicates that the algorithm is on average fully executed in 3.04 milliseconds, which affects the full page load by an average of 0.38%. The survey shows that the algorithm provides an increased level of relevance in the search results, as in 67.78% of the completed queries the participants chose the version of Apache Solr with the algorithm over the original version of Apache Solr. Both evaluations support the hypothesis and indicate that clickthrough data can be used to efficiently increase the relevancy of search results.

7.2 Discussion

When examining the results of the experiment, an important factor is the search terms that were used and their respective clicks. What could have affected the search results are the clicks that were recorded during the study and how they could have influenced the choices made by the participants in the survey. The clicks used to calculate scores for the documents in the survey are the clicks recorded through use of the search platform during the study as well as the clicks recorded during the pilot study, rather than clicks manipulated manually (see Appendix R for the full list of clicks).

The order of the results was drastically changed, and the score changes added by the clickthrough data algorithm were rather large. This was most likely an effect of the weight chosen in the query to Solr: as the number of recorded clicks was relatively low, a higher weight was chosen to be able to see any differences in the ordering of the search results. As Joachims (2002) describes, the idea is that the system reorders the search results that are presented, for better or worse, depending on which documents the users tend to choose. A higher weight can affect the order of the search results as much negatively as positively, since every document that has been chosen by other users gains scoring from the recorded clicks regardless of whether the document is relevant or irrelevant, as long as users have clicked it from the result list.

As all of the measurements were executed in the web browser Google Chrome, no conclusions can be drawn regarding how other web browsers could have affected the results.


More information

An Application for Monitoring Solr

An Application for Monitoring Solr An Application for Monitoring Solr Yamin Alam Gauhati University Institute of Science and Technology, Guwahati Assam, India Nabamita Deb Gauhati University Institute of Science and Technology, Guwahati

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

Adobe Marketing Cloud Data Workbench Controlled Experiments

Adobe Marketing Cloud Data Workbench Controlled Experiments Adobe Marketing Cloud Data Workbench Controlled Experiments Contents Data Workbench Controlled Experiments...3 How Does Site Identify Visitors?...3 How Do Controlled Experiments Work?...3 What Should I

More information

Google Tag Manager. Google Tag Manager Custom Module for Magento

Google Tag Manager. Google Tag Manager Custom Module for Magento Google Tag Manager Custom Module for Magento TABLE OF CONTENTS Table of Contents Table Of Contents...2 1. INTRODUCTION...3 2. Overview...3 3. Requirements...3 4. Features...4 4.1 Features accessible from

More information

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture David Cutting University of East Anglia Purplepixie Systems David.Cutting@uea.ac.uk dcutting@purplepixie.org Abstract. When a web client

More information

Creating an Intranet using Lotus Web Content Management. Part 2 Project Planning

Creating an Intranet using Lotus Web Content Management. Part 2 Project Planning Creating an Intranet using Lotus Web Content Management Introduction Part 2 Project Planning Many projects have failed due to poor project planning. The following article gives an overview of the typical

More information

Middle East Technical University. Department of Computer Engineering

Middle East Technical University. Department of Computer Engineering Middle East Technical University Department of Computer Engineering TurkHITs Software Requirements Specifications v1.1 Group fourbytes Safa Öz - 1679463 Mert Bahadır - 1745785 Özge Çevik - 1679414 Sema

More information

Istat s Pilot Use Case 1

Istat s Pilot Use Case 1 Istat s Pilot Use Case 1 Pilot identification 1 IT 1 Reference Use case X 1) URL Inventory of enterprises 2) E-commerce from enterprises websites 3) Job advertisements on enterprises websites 4) Social

More information

MAXIMIZING ROI FROM AKAMAI ION USING BLUE TRIANGLE TECHNOLOGIES FOR NEW AND EXISTING ECOMMERCE CUSTOMERS CONSIDERING ION CONTENTS EXECUTIVE SUMMARY... THE CUSTOMER SITUATION... HOW BLUE TRIANGLE IS UTILIZED

More information

3. WWW and HTTP. Fig.3.1 Architecture of WWW

3. WWW and HTTP. Fig.3.1 Architecture of WWW 3. WWW and HTTP The World Wide Web (WWW) is a repository of information linked together from points all over the world. The WWW has a unique combination of flexibility, portability, and user-friendly features

More information

CHAPTER 5 TESTING AND IMPLEMENTATION

CHAPTER 5 TESTING AND IMPLEMENTATION CHAPTER 5 TESTING AND IMPLEMENTATION 5.1. Introduction This chapter will basically discuss the result of the user acceptance testing of the prototype. The comments and suggestions retrieved from the respondents

More information

WORDPRESS 101 A PRIMER JOHN WIEGAND

WORDPRESS 101 A PRIMER JOHN WIEGAND WORDPRESS 101 A PRIMER JOHN WIEGAND CONTENTS Starters... 2 Users... 2 Settings... 3 Media... 6 Pages... 7 Posts... 7 Comments... 7 Design... 8 Themes... 8 Menus... 9 Posts... 11 Plugins... 11 To find a

More information

5. Application Layer. Introduction

5. Application Layer. Introduction Book Preview This is a sample chapter of Professional PHP - Building maintainable and secure applications. The book starts with a few theory chapters and after that it is structured as a tutorial. The

More information

RavenDB & document stores

RavenDB & document stores université libre de bruxelles INFO-H415 - Advanced Databases RavenDB & document stores Authors: Yasin Arslan Jacky Trinh Professor: Esteban Zimányi Contents 1 Introduction 3 1.1 Présentation...................................

More information

Exploring the Nuxeo REST API

Exploring the Nuxeo REST API Exploring the Nuxeo REST API Enabling Rapid Content Application Craftsmanship Copyright 2018 Nuxeo. All rights reserved. Copyright 2017 Nuxeo. All rights reserved. Chapter 1 The Nuxeo REST API What do

More information

Characterizing Home Pages 1

Characterizing Home Pages 1 Characterizing Home Pages 1 Xubin He and Qing Yang Dept. of Electrical and Computer Engineering University of Rhode Island Kingston, RI 881, USA Abstract Home pages are very important for any successful

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

magento_1:blog_pro

magento_1:blog_pro magento_1:blog_pro https://amasty.com/docs/doku.php?id=magento_1:blog_pro For more details see the Blog Pro extension page. Blog Pro Create responsive blog posts with a handy WYSIWYG editor, easily customize

More information

Usability Test Report: Requesting Library Material 1

Usability Test Report: Requesting Library Material 1 Usability Test Report: Requesting Library Material 1 Summary Emily Daly and Kate Collins conducted usability testing on the processes of requesting library material. The test was conducted at the temporary

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Usability Test Report: Homepage / Search Interface 1

Usability Test Report: Homepage / Search Interface 1 Usability Test Report: Homepage / Search Interface 1 Summary Emily Daly, Bendte Fagge, and Steph Matthiesen conducted usability testing of the homepage and search interface in the newly redesigned Duke

More information

If you re a Facebook marketer, you re likely always looking for ways to

If you re a Facebook marketer, you re likely always looking for ways to Chapter 1: Custom Apps for Fan Page Timelines In This Chapter Using apps for Facebook marketing Extending the Facebook experience Discovering iframes, Application Pages, and Canvas Pages Finding out what

More information

FACETs. Technical Report 05/19/2010

FACETs. Technical Report 05/19/2010 F3 FACETs Technical Report 05/19/2010 PROJECT OVERVIEW... 4 BASIC REQUIREMENTS... 4 CONSTRAINTS... 5 DEVELOPMENT PROCESS... 5 PLANNED/ACTUAL SCHEDULE... 6 SYSTEM DESIGN... 6 PRODUCT AND PROCESS METRICS...

More information

Retrieval Evaluation

Retrieval Evaluation Retrieval Evaluation - Reference Collections Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, Chapter

More information

MySQL for Developers. Duration: 5 Days

MySQL for Developers. Duration: 5 Days Oracle University Contact Us: 0800 891 6502 MySQL for Developers Duration: 5 Days What you will learn This MySQL for Developers training teaches developers how to develop console and web applications using

More information

Chapter 17: INTERNATIONAL DATA PRODUCTS

Chapter 17: INTERNATIONAL DATA PRODUCTS Chapter 17: INTERNATIONAL DATA PRODUCTS After the data processing and data analysis, a series of data products were delivered to the OECD. These included public use data files and codebooks, compendia

More information

D 9.1 Project website

D 9.1 Project website Doc: FEN--RP-017 Page: Page 1 of 21 H2020 - EEB - 2017-766464 D 9.1 Project website Name Signature and date Prepared by Martina Bakešová (FENIX) 17.1.2018 Checked by Approved by Ir. C.L.G. (Christophe)

More information

In this third unit about jobs in the Information Technology field we will speak about software development

In this third unit about jobs in the Information Technology field we will speak about software development In this third unit about jobs in the Information Technology field we will speak about software development 1 The IT professionals involved in the development of software applications can be generically

More information

Viewpoint Review & Analytics

Viewpoint Review & Analytics The Viewpoint all-in-one e-discovery platform enables law firms, corporations and service providers to manage every phase of the e-discovery lifecycle with the power of a single product. The Viewpoint

More information

Teachers Manual for Creating a Website with WordPress

Teachers Manual for Creating a Website with WordPress Teachers Manual for Creating a Website with WordPress ISBN 978 90 5905 422 6 2 1. Introduction This course manual assumes a lesson structure consisting of nine points. These points have been divided into

More information

CHAPTER 5 SYSTEM IMPLEMENTATION AND TESTING. This chapter describes the implementation and evaluation process conducted on the e-

CHAPTER 5 SYSTEM IMPLEMENTATION AND TESTING. This chapter describes the implementation and evaluation process conducted on the e- CHAPTER 5 SYSTEM IMPLEMENTATION AND TESTING 5.1 Introduction This chapter describes the implementation and evaluation process conducted on the e- BSC system. In terms of implementation, the development

More information

Website minute read. Understand the business implications, tactics, costs, and creation process of an effective website.

Website minute read. Understand the business implications, tactics, costs, and creation process of an effective website. Website 101 Understand the business implications, tactics, costs, and creation process of an effective website. 8 minute read Mediant Web Development What to Expect 1. Why a Good Website is Crucial 2.

More information

A Simple Course Management Website

A Simple Course Management Website A Simple Course Management Website A Senior Project Presented to The Faculty of the Computer Engineering Department California Polytechnic State University, San Luis Obispo In Partial Fulfillment Of the

More information

MySQL for Developers. Duration: 5 Days

MySQL for Developers. Duration: 5 Days Oracle University Contact Us: Local: 0845 777 7 711 Intl: +44 845 777 7 711 MySQL for Developers Duration: 5 Days What you will learn This MySQL for Developers training teaches developers how to develop

More information

BEAWebLogic. Portal. Overview

BEAWebLogic. Portal. Overview BEAWebLogic Portal Overview Version 10.2 Revised: February 2008 Contents About the BEA WebLogic Portal Documentation Introduction to WebLogic Portal Portal Concepts.........................................................2-2

More information

Learning PHP, MySQL, JavaScript, And CSS: A Step-by-Step Guide To Creating Dynamic Websites PDF

Learning PHP, MySQL, JavaScript, And CSS: A Step-by-Step Guide To Creating Dynamic Websites PDF Learning PHP, MySQL, JavaScript, And CSS: A Step-by-Step Guide To Creating Dynamic Websites PDF Learn how to build interactive, data-driven websitesâ even if you donâ t have any previous programming experience.

More information

Web Systems Staff Intranet Card Sorting. Project Cover Sheet. Library Staff Intranet. UM Library Web Systems

Web Systems Staff Intranet Card Sorting. Project Cover Sheet. Library Staff Intranet. UM Library Web Systems Project Cover Sheet Library Staff Intranet Project Committee & Members Report Info Objectives Methodology Card Sorting The Library Staff Intranet is a gateway to various library staff administrative information

More information

Inbound Website. How to Build an. Track 1 SEO and SOCIAL

Inbound Website. How to Build an. Track 1 SEO and SOCIAL How to Build an Inbound Website Track 1 SEO and SOCIAL In this three part ebook series, you will learn the step by step process of making a strategic inbound website. In part 1 we tackle the inner workings

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

CHAPTER 1 COPYRIGHTED MATERIAL. Finding Your Way in the Inventor Interface

CHAPTER 1 COPYRIGHTED MATERIAL. Finding Your Way in the Inventor Interface CHAPTER 1 Finding Your Way in the Inventor Interface COPYRIGHTED MATERIAL Understanding Inventor s interface behavior Opening existing files Creating new files Modifying the look and feel of Inventor Managing

More information

Transaction Cordinator: Design and Planning

Transaction Cordinator: Design and Planning Transaction Cordinator: Design and Planning Joshua Lee, Damon McCormick, Kim Ly, Chris Orimoto, John Wang, and Daniel LeCheminant October 4, 2004 Contents 1 Overview 2 2 Document Revision History 2 3 System

More information

Improving Drupal search experience with Apache Solr and Elasticsearch

Improving Drupal search experience with Apache Solr and Elasticsearch Improving Drupal search experience with Apache Solr and Elasticsearch Milos Pumpalovic Web Front-end Developer Gene Mohr Web Back-end Developer About Us Milos Pumpalovic Front End Developer Drupal theming

More information

D8.1 Project website

D8.1 Project website D8.1 Project website WP8 Lead Partner: FENIX Dissemination Level: PU Deliverable due date: M3 Actual submission date: M3 Deliverable Version: V1 Project Acronym Project Title EnDurCrete New Environmental

More information

Polyratings Website Update

Polyratings Website Update Polyratings Website Update Senior Project Spring 2016 Cody Sears Connor Krier Anil Thattayathu Outline Overview 2 Project Beginnings 2 Key Maintenance Issues 2 Project Decision 2 Research 4 Customer Survey

More information

CREATE FORUMS THE CREATE FORUM PAGE

CREATE FORUMS THE CREATE FORUM PAGE CREATE FORUMS A discussion board forum is an area where participants discuss a topic or a group of related topics. Within each forum, users can create multiple threads. A thread includes the initial post

More information

Usability evaluation in practice: the OHIM Case Study

Usability evaluation in practice: the OHIM Case Study Usability evaluation in practice: the OHIM Case David García Dorvau, Nikos Sourmelakis coppersony@hotmail.com, nikos.sourmelakis@gmail.com External consultants at the Office for Harmonization in the Internal

More information

Build Meeting Room Management Website Using BaaS Framework : Usergrid

Build Meeting Room Management Website Using BaaS Framework : Usergrid Build Meeting Room Management Website Using BaaS Framework : Usergrid Alvin Junianto Lan 13514105 Informatics, School of Electrical Engineering and Informatics Bandung Institute of Technology Bandung,

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Overview On Methods Of Searching The Web

Overview On Methods Of Searching The Web Overview On Methods Of Searching The Web Introduction World Wide Web (WWW) is the ultimate source of information. It has taken over the books, newspaper, and any other paper based material. It has become

More information

Byte Academy. Python Fullstack

Byte Academy. Python Fullstack Byte Academy Python Fullstack 06/30/2017 Introduction Byte Academy pioneered industry-focused programs beginning with the launch of our FinTech course, the first of its type. Our educational programs bridge

More information

JBoss ESB 4.0 GA RC1. Message Transformation Guide JBESB-MTG-12/1/06 JBESB-PG-12/1/06

JBoss ESB 4.0 GA RC1. Message Transformation Guide JBESB-MTG-12/1/06 JBESB-PG-12/1/06 JBoss ESB 4.0 GA RC1 Message Transformation Guide JBESB-MTG-12/1/06 JBESB-PG-12/1/06 i JBESB-PG-12/1/06 ii Legal Notices The information contained in this documentation is subject to change without notice.

More information

ESET Remote Administrator 6. Version 6.0 Product Details

ESET Remote Administrator 6. Version 6.0 Product Details ESET Remote Administrator 6 Version 6.0 Product Details ESET Remote Administrator 6.0 is a successor to ESET Remote Administrator V5.x, however represents a major step forward, completely new generation

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Student Usability Project Recommendations Define Information Architecture for Library Technology

Student Usability Project Recommendations Define Information Architecture for Library Technology Student Usability Project Recommendations Define Information Architecture for Library Technology Erika Rogers, Director, Honors Program, California Polytechnic State University, San Luis Obispo, CA. erogers@calpoly.edu

More information

MovieRec - CS 410 Project Report

MovieRec - CS 410 Project Report MovieRec - CS 410 Project Report Team : Pattanee Chutipongpattanakul - chutipo2 Swapnil Shah - sshah219 Abstract MovieRec is a unique movie search engine that allows users to search for any type of the

More information

Blog Pro for Magento 2 User Guide

Blog Pro for Magento 2 User Guide Blog Pro for Magento 2 User Guide Table of Contents 1. Blog Pro Configuration 1.1. Accessing the Extension Main Setting 1.2. Blog Index Page 1.3. Post List 1.4. Post Author 1.5. Post View (Related Posts,

More information

Implementation Architecture

Implementation Architecture Implementation Architecture Software Architecture VO/KU (707023/707024) Roman Kern ISDS, TU Graz 2017-11-15 Roman Kern (ISDS, TU Graz) Implementation Architecture 2017-11-15 1 / 54 Outline 1 Definition

More information

WebBiblio Subject Gateway System:

WebBiblio Subject Gateway System: WebBiblio Subject Gateway System: An Open Source Solution for Internet Resources Management 1. Introduction Jack Eapen C. 1 With the advent of the Internet, the rate of information explosion increased

More information

COLUMN. Worlds apart: the difference between intranets and websites. The purpose of your website is very different to that of your intranet MARCH 2003

COLUMN. Worlds apart: the difference between intranets and websites. The purpose of your website is very different to that of your intranet MARCH 2003 KM COLUMN MARCH 2003 Worlds apart: the difference between intranets and websites Beyond a common use of HTML, intranets and corporate websites (internet sites) are very different animals. The needs they

More information

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved.

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. User Guide Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. Central Search User Guide Table of Contents Welcome to Central Search... 3 Starting Your Search... 4 Basic Search & Advanced

More information

Data analysis using Microsoft Excel

Data analysis using Microsoft Excel Introduction to Statistics Statistics may be defined as the science of collection, organization presentation analysis and interpretation of numerical data from the logical analysis. 1.Collection of Data

More information

FoodBack. FoodBack facilitates communication between chefs and patrons. Problem Solution and Overview: Yes&: Andrei T, Adrian L, Aaron Z, Dyllan A

FoodBack. FoodBack facilitates communication between chefs and patrons. Problem Solution and Overview: Yes&: Andrei T, Adrian L, Aaron Z, Dyllan A FoodBack Yes&: Andrei T, Adrian L, Aaron Z, Dyllan A FoodBack facilitates communication between chefs and patrons. Problem Solution and Overview: In cafeterias across America, hungry patrons are left unsatisfied

More information

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies Giulio Barcaroli 1 (barcarol@istat.it), Monica Scannapieco 1 (scannapi@istat.it), Donato Summa

More information

Wiki A Systems Programming Productivity Tool

Wiki A Systems Programming Productivity Tool Wiki A Systems Programming Productivity Tool John W. Noel MIS Graduate Student Tech Student / System Programmer I SAS Institute Inc. Cary, NC Regina Robbins Systems Programmer I SAS Institute Inc. Cary,

More information

Qualtrics Survey Software

Qualtrics Survey Software Qualtrics Survey Software GETTING STARTED WITH QUALTRICS Qualtrics Survey Software 0 Contents Qualtrics Survey Software... 2 Welcome to Qualtrics!... 2 Getting Started... 2 Creating a New Survey... 5 Homepage

More information

The Topic Specific Search Engine

The Topic Specific Search Engine The Topic Specific Search Engine Benjamin Stopford 1 st Jan 2006 Version 0.1 Overview This paper presents a model for creating an accurate topic specific search engine through a focussed (vertical)

More information

Pro Events. Functional Specification. Name: Jonathan Finlay. Student Number: C Course: Bachelor of Science (Honours) Software Development

Pro Events. Functional Specification. Name: Jonathan Finlay. Student Number: C Course: Bachelor of Science (Honours) Software Development Pro Events Functional Specification Name: Jonathan Finlay Student Number: C00193379 Course: Bachelor of Science (Honours) Software Development Tutor: Hisain Elshaafi Date: 13-11-17 Contents Introduction...

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

The main website for Henrico County, henrico.us, received a complete visual and structural

The main website for Henrico County, henrico.us, received a complete visual and structural Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update

More information

Knowing something about how to create this optimization to harness the best benefits will definitely be advantageous.

Knowing something about how to create this optimization to harness the best benefits will definitely be advantageous. Blog Post Optimizer Contents Intro... 3 Page Rank Basics... 3 Using Articles And Blog Posts... 4 Using Backlinks... 4 Using Directories... 5 Using Social Media And Site Maps... 6 The Downfall Of Not Using

More information

News English.com Ready-to-use ESL / EFL Lessons

News English.com Ready-to-use ESL / EFL Lessons www.breaking News English.com Ready-to-use ESL / EFL Lessons 1,000 IDEAS & ACTIVITIES FOR LANGUAGE TEACHERS The Breaking News English.com Resource Book http://www.breakingnewsenglish.com/book.html Top

More information

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword.

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword. SEO can split into two categories as On-page SEO and Off-page SEO. On-Page SEO refers to all the things that we can do ON our website to rank higher, such as page titles, meta description, keyword, content,

More information