USING CLICKTHROUGH DATA TO OPTIMIZE SEARCH RESULT RANKING

An evaluation of clickthrough data in terms of relevancy and efficiency

Bachelor Degree Project in Informatics, 30 ECTS
Spring Term 2017
Anton Paulsson
Supervisor: András Márki
Examiner: Henrik Gustavsson

Abstract

Search engines are in constant need of improvement, as the rapid growth of information affects their ability to return documents with high relevance. Relevant results get lost between result pages, and search algorithms are exploited to gain higher rankings for documents. This study attempts to mitigate those two issues, and to increase the relevancy of search results, by using clickthrough data as an additional layer for weighting the search results. Results from the evaluation indicate that clickthrough data can indeed be used to obtain more relevant search results.

Keywords: Search Result Reorganization, Clickthrough Data, Information Retrieval, Apache Solr

Table of Contents

1 Introduction
2 Background
  2.1 Search Engines
    2.1.1 Apache Lucene
    2.1.2 Apache Solr
    2.1.3 Sphinx
  2.2 Search Algorithm
  2.3 Clickthrough Data
  2.4 Stack Overflow Data Dump
3 Problem
  3.1 Aim
  3.2 Hypothesis
4 Method
  4.1 Systematic Literature Review
  4.2 Approach
    4.2.1 Relevancy Testing
    4.2.2 Performance Testing
    4.2.3 Research Ethics
5 Implementation
  5.1 Systematic Literature Review
  5.2 Progression
    5.2.1 Solr Implementation
    5.2.2 MySQL Database
    5.2.3 Search Platform
    5.2.4 Clickthrough Data Query Parser Plugin
    5.2.5 Clickthrough Data Algorithm
  5.3 Pilot Study
6 Evaluation
  6.1 The Study
    6.1.1 Efficiency Evaluation
    6.1.2 Relevancy Evaluation
  6.2 Analysis
  6.3 Conclusion
7 Concluding Remarks
  7.1 Summary
  7.2 Discussion
  7.3 Research Ethics
  7.4 Future Work
References

1 Introduction

The need for better and more precise search engines is constantly increasing. Users of the World Wide Web (WWW) need to be able to access information efficiently, but as that need increases, so does the amount of information. According to Chua (2012), the average increase of information available on the Internet is as large as 23% per year, which requires search engines to get better at handling the information that is being indexed. To achieve this, search engines have to adapt to the major information increases by using different search algorithms. The most commonly used algorithms to rank documents are PageRank and Hyperlink-Induced Topic Search, but both of them are outdated and easy to exploit (Pawar & Natani, 2014).

Joachims (2002) suggests that clickthrough data could help search engines rank search results, by looking at which documents are visited the most and which documents users tend to choose. The opinions and reasoning behind the choices users make cannot be collected, but the choices themselves can. This information could be used to increase the relevancy of the search results that are displayed to the user by adding a new layer of document scoring.

One of the major issues with the large amount of information that search engines have to handle is that relevant documents get lost between different pages of results. Osiński (2003) explains how most documents are never visited because users do not navigate further into the result pages. In theory, this could be improved by the kind of system that Joachims (2002) suggests.

In this study an experiment is performed to evaluate whether clickthrough data can be used to increase the relevancy of search results without impacting the efficiency of the search engine to the point where it becomes inefficient. An algorithm based on clickthrough data is developed and implemented in Apache Solr (Apache Solr, 2017). The search engine's index is then populated with a dataset provided by Stack Overflow (Stack Overflow, 2014). To be able to record the clickthrough data, a small search platform is developed in PHP, JavaScript and HTML; the clickthrough data is stored in a MySQL database to allow the algorithm to retrieve it. To evaluate the efficiency of the algorithm, a version of Apache Solr with the clickthrough data algorithm is compared to an original version of Apache Solr. In addition, a quantitative survey is used to address relevancy, as relevancy is not something that can be measured without human judgement.

A pilot study was performed where 20 human testers were asked to perform seven different queries on the search platform. The objective was to see if there were any major time differences between an original version of Apache Solr and a version of Apache Solr with the implemented clickthrough data algorithm. The pilot study showed that on average the clickthrough data algorithm added an additional 5 milliseconds to the full request time.

2 Background

Information is no longer something that is stored in file rooms and cabinets; it is now available all over the world through the World Wide Web. Fukumura, Nakano, Harumoto, Shimojo & Nishio (2003) describe how information is being moved into the digital world, leaving paper behind. Databases are a typical way of storing information; they allow information to be sorted and easily found. The majority of the population does, however, not know how to access information directly from a database, so some sort of graphical interface is needed. This is one of the reasons the demand for high quality search engines is so large. Search engines enable anyone to access the information on the Internet regardless of their computer knowledge. Fakhraee & Fotouhi (2011) describe how the importance of search engines grows as the information available in databases grows, and that the current amount of data is already massive.

2.1 Search Engines

Search engines are tools that require no technical skill and allow users to access information available on the World Wide Web, locating it through keywords and sometimes full-text search. There are several different types of search engines; two of the most common are general search engines and internal search engines.

General search engines are the kind of search engines used every day by the public. Google is the most used general search engine according to NetMarketShare (2017): it is used by 85% of all desktop users and 97% of all mobile and tablet users. Google is a website where users can search through all of the information that Google has indexed and receive a link directly to the source of the information.

Internal search engines allow users to find information that is stored locally on a website. Bian, Li, Yue, Lei, Zhao & Xiao (2015) explain that an internal search engine is used to locate information that general search engines are not able to find, or rank so low that it is very hard to find. For example, a university usually has its own website where documents and information are released to staff and students. An internal search engine is one of the quickest ways of finding such documents and information, while using a general search engine to find the same data could take minutes or hours of browsing.

A search engine needs to handle three different problems with three different parts, according to Singh, Hsu, Sun, Chitaure & Yan (2005). The first part to be created is an index where the information will be stored. The second part is an algorithm that can retrieve the information. The last part is an interface where the information is presented after it has been retrieved.

2.1.1 Apache Lucene

Apache Lucene is an open source text search engine created and open sourced by Doug Cutting. Lucene consists mainly of its index and the possibility to query it for documents.

It has the functionality to index documents as well as an API to allow remote querying of the documents. It is based on Java but allows for usage from several different programming languages, such as Python or .NET (Balipa & Balasubramani, 2015). Lucene offers several different features that allow for the creation of a complete search engine, such as an inverted index to efficiently retrieve documents, a large set of text analysis components, a query syntax that allows for a large number of query types, and a scoring algorithm for weighting of the documents (Smiley & Pugh, 2011).

2.1.2 Apache Solr

Apache Solr is an open source enterprise search server created by the Apache Software Foundation. The server is written in Java and built on top of Apache Lucene. Apache Solr also contains a web administration interface out of the box (see Figure 1). Nagi (2015) describes it as a web application that can be deployed in any servlet container and lists the functionality that Solr adds on top of Lucene:

- XML and JSON APIs
- Hit highlighting
- Faceted search and filtering
- Geospatial search
- Caching
- Near real-time searching of newly indexed documents
- Web administration interface

The additional functions make Solr more versatile than Lucene, which can lead to more research that includes Solr.

Figure 1 - Apache Solr Web Administration Tool

2.1.3 Sphinx

Sphinx is an open source search server created by Sphinx Technologies Inc. It is a full-text search server created with performance and relevancy in mind (Sphinx, 2017). A web administration interface is available through Sphinx tools (see Figure 2).

Figure 2 - Sphinx Tools Admin Panel

2.2 Search Algorithm

The user searches for information by asking the search engine a question, either with simple keywords or with full-text questions. The search engine then analyses the question and tries to find information that is considered related to the search question. Depending on how the search algorithm is built, it will select different answers to the question. The search engine then replies with the most relevant information. Pawar & Natani (2014) explain that one of the key elements of a good search engine is its ranking algorithm, as the relevancy of the search results depends on it. According to Pawar & Natani (2014), the two most common algorithms used to rank web pages are (1) the PageRank algorithm and (2) HITS (Hyperlink-Induced Topic Search).

2.3 Clickthrough Data

When a user completes a search, he or she is presented with a list of alternatives to choose from to continue reading. In most cases a user does not automatically choose the first available option after completing a search; depending on how relevant the title of each option appears, users choose different options. Clickthrough data can be used to change the ranking of the available options depending on which options users tend to choose (Joachims, 2002).

Clickthrough data can record which options a user chooses, as well as the order in which the options were chosen. This can be used to add a new layer to the search engine's weighting, so that the results can be re-ordered depending on the data that the clickthrough system provides.

This means that the search engine would be able to improve the more it is used; in theory, the more users that are using the system, the faster it will improve. Joachims (2002) explains how he implemented a similar system. It allowed the search results to be re-ordered so that the most relevant and interesting documents would slowly move upwards in the ranking, and more relevant information could more easily be found by the users of the search engine.

2.4 Stack Overflow Data Dump

Stack Overflow is an internet forum in the Stack Exchange network containing thousands of topics with technical questions and answers. It is used by more than 6.7 million users and has more than 40 million visitors every month (Stack Overflow, 2017). Every quarter, Stack Exchange publishes a dump of all of the user-contributed content on Stack Exchange. The dump is licensed under Creative Commons BY-SA 3.0, which allows anyone to use it as long as credit is given and posts are linked back to the original posts or users. Each site in the Stack Exchange network is also downloadable individually (Stack Exchange, 2017).

3 Problem

The amount of information handled by search engines increases every year, at the same time as the users' demand for information availability also increases. This requires search engines to become more efficient at handling the information and, at the same time, better at sorting the information that users are requesting. Chua (2012) estimates the growth of the information available on the Internet at an average of 23% per year.

The presentation of data in a search engine is one of the major issues that needs to be addressed, as Osiński (2003) explains that a lot of the information that users want access to disappears within the massive amount of information that exists. It especially becomes a problem as the information is usually displayed as an ordered list split over multiple pages. This creates the problem that users need to formulate a very specific search question to be able to find relevant information in a search result.

The algorithms used to rank documents in search engines are very limited. Pawar & Natani (2014) describe how search engines tend to use only two different implementations of a ranking system. They explain that the most common ways of ranking documents are too restricted to provide a genuinely good way of judging which documents are relevant to a specific search term. They also explain that the PageRank algorithm is easy to exploit, as it is based on how many terms can be associated with the search query. This allows a document creator to abuse the algorithm by entering terms that automatically put a document at a higher rating, even though it may not contain as much relevant information as other documents. In the same way as PageRank can be exploited, so can Hyperlink-Induced Topic Search. It evaluates relevancy depending on how many times the document has been referred to, for example through hyperlinks. As social media posts and similar content are part of the indexing systems as much as any other documents, it is simple to create a large number of hyperlinks to a document through social media. This allows documents to gain a higher score through excessively created social media posts that refer to the document.

According to Joachims (2002), a larger problem for the clickthrough data algorithm was that the system he created was not able to detect the difference between a regular user performing a search query and a user spamming the system. This created a problem with so-called false relevancy, as hyperlinks that were selected a large number of times would slowly get a better ranking in the search engine. This was the effect of basing the ranking only on which results were chosen the most.

3.1 Aim

The goal of this research is to evaluate whether clickthrough data can efficiently be used to increase the relevance of search results. As the user experience depends on the time it takes for a search result to be displayed, the algorithm has to be efficient; otherwise the majority of users will not continue to use it. By using an existing open source search engine and combining it with an algorithm that can rescore documents depending on clickthrough data, tests can be made against an installation of an original version of the same search engine. This allows the algorithm to be evaluated both by measuring time differences and by performing user tests.

Research question: Will a clickthrough data algorithm allow search engines to efficiently display more relevant search results?

To be able to derive an answer to this question, four objectives need to be addressed:

1. Gain knowledge of the domain by examining existing literature.
2. Design an experiment and implement an algorithm to evaluate clickthrough data.
3. Perform the experiment and collect data.
4. Analyse the results and present a conclusion.

3.2 Hypothesis

The hypothesis of this study is that, based on existing theory, a clickthrough data algorithm can be created that displays search results with better precision with respect to the search term than the original algorithm in the search engine. It also argues that the algorithm will add a minor delay to retrieving the results, but that the delay is low enough for the search engine to remain efficient.

4 Method

Evaluation in software engineering relies on three major empirical techniques according to Wohlin et al. (2012): surveys, case studies and experiments. A survey collects data from or about humans in order to draw conclusions from that data. A case study reviews data by looking at an existing method or tool at a corporation or an organization. An experiment tests the system in a controlled environment and manipulates one factor or variable of the studied setting.

The chosen method for this study is an experiment with a quantitative approach. It is the most suitable method, as the study depends on being able to test performance differences through manipulation of the algorithm in a precise and systematic way, which is exactly how Wohlin et al. (2012) define when an experiment is suitable:

"Experiments are launched when we want control over the situation and want to manipulate behaviour directly, precisely and systematically." (Wohlin et al., 2012, p. 16)

In addition to the experiment, data will be collected through a quantitative survey to determine the relevance of the search results, in an approach similar to the framework suggested by Clarke et al. (2008). This is required to be able to answer the research question, as it depends on evaluating the relevance of the search algorithm at the same time as evaluating its efficiency.

As an alternative to performing an experiment, a case study could be executed to answer the research question. A case study would let the algorithm be used in its intended real-world application, which could provide a more accurate answer to the research question. A large problem with performing a case study is, however, the issue of finding an appropriate organization to implement the system at, or one that is already using a similar implementation of relevancy weighting, which could also create legal complications. Another aspect to take into account is that an experiment allows the testing to be performed in a more controlled environment compared to a case study, according to Wohlin et al. (2012).

4.1 Systematic Literature Review

To gain background knowledge of the field, a systematic literature review (SLR) will be performed. The SLR will be based on a set of papers gathered from IEEE Xplore and the ACM portal using a set of search queries (see Appendix A). To be able to filter the papers by their relevance to the field, a set of inclusion and exclusion criteria was defined (see Appendix A). Jalali & Wohlin (2012) showed that an SLR, executed well, is an excellent way of gaining a deeper understanding and improved knowledge of the field. It also allows researchers to avoid stumbling into the same problems that earlier research has already experienced.

4.2 Approach

In this experiment a clickthrough data algorithm will be implemented together with a search engine and tested. The installation of the search engine will index the posts from the dataset that Stack Overflow provides, a test dataset with more than 30 million entries. The testing of the algorithm is split into two parts: (1) measurement of relevancy differences between a basic installation of the search engine and an installation combined with the Clickthrough Data Algorithm (CDA), and (2) measurement of performance differences between a basic installation of the search engine and an installation combined with the CDA.

4.2.1 Relevancy Testing

Joachims (2002) proposes a framework to test the relevancy difference with and without a CDA. It lets users see two different search result lists and choose which list they found most relevant to what they were looking for. Joachims (2002) uses a way of judging relevancy that is very similar to the framework that Clarke et al. (2008) present: the human assessor gives a binary answer for each search result, marking it as either (a) relevant or (b) irrelevant to the search question asked.

The most effective way of measuring relevancy in keyword search systems is conducting a survey, according to Joachims (2002) and Clarke et al. (2008). As relevancy is not something that can be objectively defined, it is important to get a broad audience to contribute to the survey. Humans tend to have different definitions of what is relevant, and it can depend on many different factors; for example, it can depend on how many times the search words are included in the document found, or on how the document is interpreted, which is why it is so important to have a broad selection of people judge the relevancy.

4.2.2 Performance Testing

Wohlin et al. (2012) explain how measurements are used to make judgments based on facts instead of intuition. With measurements you are able to compare and analyze the data, find out where the bottlenecks are, and determine whether resources are being used efficiently. To measure the efficiency of the clickthrough data algorithm, the time differences with and without the CDA will be measured. Two installations of Apache Solr will be run on identical systems, using identical indexes of the Stack Overflow dataset; the only difference is that one will be running with the CDA and one will not. Both systems will be measured during the testing to answer the following questions:

1. How large is the time difference in retrieving the search result for the user with and without the CDA activated?
2. How long does it take for the CDA to run?
3. How does the CDA impact the search engine's efficiency?

4.2.3 Research Ethics

The dataset gathered from Stack Overflow is completely sanitized from any personal information to keep the information anonymous, according to Stack Exchange (2014). An important part is to make sure that the survey data does not contain any personal information either. To know that the entries are unique, some sort of unique identifier needs to be generated, which could be based on IP addresses or something similar. It is also very important to decide what can be considered personal information, as per definition personal information is any kind of data that could be traced to a specific person. Another aspect to take into account is to be sure that the persons included in the study know about and allow usage of the data that they create. If these terms are not accepted, the data will be neither collected nor used. Participants are allowed to quit the survey whenever they want, and any data that such a participant has created will not be used.

To address reliability and repeatability in software testing, all of the implementations, configurations, results, software, hardware specifications and any other resources that are relevant to the experiment will be presented as appendixes, to allow anyone to repeat the experiment and verify the study.

5 Implementation

5.1 Systematic Literature Review

There are many ways of implementing a system that allows this kind of evaluation of a search engine and different algorithms, as there are several programming languages that can be used to accomplish very similar results. To create a test environment suitable for this study, an availability approach was chosen: all of the software, programming languages and platforms used are open source, in order to keep the study as repeatable as possible.

Selecting a search server was not easy; the different systems all have pros and cons. The choice was narrowed down to two options, Apache Solr or Sphinx Search, and the final choice was Apache Solr. The first reason Solr was chosen over Sphinx was licensing: the Apache 2.0 license does not apply any requirements that could limit the usage of the system, whereas Sphinx is licensed under GPLv2, which does apply conditions depending on its usage. Another reason for choosing Solr over Sphinx is the possibility to store very large indexes, as the Stack Overflow dataset consists of more than 30 million documents. According to Khabsa, Carman, Choudhury & Lee Giles (2012), Solr can hold an index as large as 3 billion documents, while there is no documented limit for how large an index Sphinx Search can hold. Inspiration for how to configure Solr was taken from the book Apache Solr 3 Enterprise Search Server by Smiley & Pugh (2011), which describes how to set up Solr and lists a large number of problems that may occur and how to handle them.

To build the web platform and online survey environment, JavaScript is used for the frontend and PHP/MySQL for the backend. The combination of PHP and MySQL allows for a secure connection between the two, as PHP has integrated support for a secure connection to MySQL. All three technologies are open source, which allows for simple and easy usage. The Stack Overflow dataset is partially used to build the document base in the search engine: the information used is the posts created by the users of Stack Overflow, which amount to more than 30 million documents.

5.2 Progression

This section explains the different design choices made and the implementation of scripts, servers and other related parts of the system. The parts are listed in the order in which they were created or installed, along with any obstacles that occurred. The hardware specifications of the server that all of the server-side code and software ran on can be found in Appendix S.

5.2.1 Solr Implementation

Apache Solr was installed with the newest version available at the start of the research, on the 1st of March 2017. Configuration of the Solr core was performed as per Appendix B, Appendix C and Appendix D, which allowed usage of the built-in Data Import Handler.

A problem that occurred during the indexing of the dataset was a lack of heap memory for the Java Virtual Machine. Smiley & Pugh (2011) mention that when scaling up Solr for larger indexes, the default heap memory for the JVM is often simply not enough and may need some fine tuning to minimize the risk of an OutOfMemoryException. The solution was to allow the JVM to use more memory during the import by adding the -m parameter when launching Solr (for example bin/solr start -m 6g). This allowed Solr to use a total of 6 gigabytes of heap memory instead of the default 512 megabytes and solved the errors that occurred during the import of the dataset.

5.2.2 MySQL Database

The MySQL database is where all of the clicks performed by users are stored. It stores the search query, which document was clicked and the number of times the document has been clicked. The database table configuration stayed the same until the late parts of the experiment. The only changes made (see Figures 3 and 4) were an increase of the full-term column size from 40 to 80 characters, an increase of the term column size from 20 to 30 characters, and a change of the default character set to UTF-8 to match the character set of the search platform. The reason for changing the sizes was to be able to fit larger words and search terms in the columns. The table configuration is also listed in Appendix G.

Figure 3 - Initial database table configuration of the Clicks table
Figure 4 - Final database table configuration of the Clicks table
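The clicks table is exposed to the rest of the system through the small PHP scripts listed in the appendixes. Purely as an illustration of the data the table holds, a direct read of the recorded clicks for one search term from Java might look like the following sketch; the connection details and the table and column names are assumptions, while the actual configuration is listed in Appendix G.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class ClickStore {
    /** Returns a map from document id to recorded click count for one search term. */
    public static Map<String, Integer> clicksForTerm(String term) throws Exception {
        Map<String, Integer> clicks = new HashMap<>();
        // Connection details and the clicks(term, document_id, clicks) columns are assumptions.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/searchplatform", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT document_id, clicks FROM clicks WHERE term = ?")) {
            stmt.setString(1, term);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    clicks.put(rs.getString("document_id"), rs.getInt("clicks"));
                }
            }
        }
        return clicks;
    }
}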

5.2.3 Search Platform

The main application of the system is the search platform (see Figure 5). It is the component that ties all parts of the system together. It is developed in PHP and JavaScript, where PHP is used for querying Solr for documents, displaying the documents to the user and exchanging information with the MySQL database; the connection between MySQL and PHP can be seen in Appendix F. JavaScript handles the Ajax requests that store the clickthrough data in the database.

Figure 5 - Search Platform

At the start of the project the idea was to use the PHP extension for Solr (Php.Net, 2016), an object-oriented library for communication between PHP and Solr. However, the library is quite outdated, as it is several versions behind the development of Solr and lacks support for some of the features offered in the Solr API. Instead, a direct connection to the Solr API was used, as presented in Appendix J.

The collection of clicks is performed through JavaScript (see Appendix F) that listens for a click event and then checks whether the user already has a cookie for the same hyperlink and search query combination. If so, no clickthrough data is recorded; if not, an Ajax request sends the data to the MySQL database. The JavaScript code is presented in Appendix K and the PHP file that handles the Ajax request in Appendix E.

5.2.4 Clickthrough Data Query Parser Plugin

Whenever the parameter rq={!cda cdaweight=2 cdadocs=500} is used while requesting documents from the Solr API, the Clickthrough Data Query Parser Plugin is called. It searches the documents that match the query and, if they match, they are re-scored by the Clickthrough Data Algorithm.

The parser also allows a set of optional parameters to be added: the weight of the new score and the number of documents to be re-scored. The weighting is performed by multiplying the click score with the chosen weight. If no weight or number of documents is specified, the algorithm defaults to a weight of 2.0 and re-scores the top 200 documents in the result list. When all of the documents have been checked and possibly re-scored, the list of documents is sorted by the score of each document and sent back to where the request was called from. The full source code is presented in Appendix I.
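The search platform issues its queries to the Solr API from PHP (see Appendix J). As a minimal illustration of the rq parameter described above, an equivalent request could be made from Java roughly as follows; the core name, host and port are assumptions, while the rq value itself is the one used in this section.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CdaSearchRequest {
    public static void main(String[] args) throws Exception {
        // The rq re-ranking parameter from Section 5.2.4; core name "posts" and host/port are assumptions.
        String q  = URLEncoder.encode("windows install", StandardCharsets.UTF_8);
        String rq = URLEncoder.encode("{!cda cdaweight=2 cdadocs=500}", StandardCharsets.UTF_8);
        URI uri = URI.create("http://localhost:8983/solr/posts/select?q=" + q + "&rq=" + rq);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());   // JSON result list, re-scored by the CDA
    }
}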

5.2.5 Clickthrough Data Algorithm

When the Clickthrough Data Query Parser finds a document that matches the query, the Clickthrough Data Algorithm rescoring is called. The algorithm checks whether the document has any recorded clicks; if it does not, no added score is returned. If it does, the algorithm checks whether the clicks match the query and, if they do, additional scoring is added depending on the rate of the clicks (as seen in Figure 6). The full source code is presented in Appendix H.

Figure 6 - Clickthrough Data Algorithm score method
Figure 7 - First running version of the Clickthrough Data Algorithm score method
Figure 8 - Clickthrough Data Algorithm Constructor

A large issue in the early versions of the clickthrough data algorithm was the added delay in each query. As shown in Figure 7, the reason behind this issue was that a request was made to the click storage table in the database each time the score method was called. At that stage, each query took longer than 10 seconds to retrieve, which simply was not efficient enough. The API request was therefore moved to the constructor of the Clickthrough Data Algorithm class (see Figure 8), and the API was redesigned to respond with the clicks for the query term instead of the clicks for the ID of a single document. By changing the structure of the Clickthrough Data Algorithm class, a single request returns the clicks of all documents for the query instead of one request per document. This made a large impact on the time it takes to perform a query: instead of every query taking more than 10 seconds, it completed in less than 1 second each time.
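The actual plugin and algorithm source code is found in Appendices H and I. To make the logic described above easier to follow, the following is a simplified, illustrative Java sketch (not the thesis code) of the two ideas in this section: the clicks for the query term are fetched once in the constructor, and the score method adds extra score to a document in proportion to its share of the recorded clicks, multiplied by the chosen weight.

import java.util.Map;

/** Simplified sketch of the rescoring logic; not the actual Solr plugin code. */
public class ClickthroughScorer {
    private final Map<String, Integer> clicksForQuery; // document id -> click count for this query
    private final int totalClicks;
    private final float weight;                        // cdaweight, defaults to 2.0

    public ClickthroughScorer(Map<String, Integer> clicksForQuery, float weight) {
        // Fetching all clicks for the query term once, up front, mirrors the fix described above:
        // one lookup per query instead of one lookup per document.
        this.clicksForQuery = clicksForQuery;
        this.totalClicks = clicksForQuery.values().stream().mapToInt(Integer::intValue).sum();
        this.weight = weight;
    }

    /** Returns the document's new score; unchanged when no clicks are recorded for it. */
    public float score(String documentId, float originalScore) {
        Integer clicks = clicksForQuery.get(documentId);
        if (clicks == null || totalClicks == 0) {
            return originalScore;                      // no recorded clicks: score unchanged
        }
        float clickRate = (float) clicks / totalClicks; // share of this query's recorded clicks
        return originalScore + clickRate * weight;      // click score multiplied by the weight
    }
}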

5.3 Pilot Study

A pilot study was performed on the clickthrough data algorithm, comparing the speed at which the documents are presented with and without the algorithm. A set of 20 testers used the search functionality to perform seven different queries. The objective was simply to find out whether there was a time difference between using the added algorithm or not, and how big the gap was, if any.

Chart 1 - Pilot study average request time in seconds

The measurements of the request speeds showed very promising results: there was in fact only a minor difference between using the algorithm and not using it, as shown in Chart 1. On average the Clickthrough Data Algorithm added an additional 0.05 seconds of delay to collect the information compared to a request without the algorithm. However, that result was obtained with a very small set of clicks compared to what the table would contain if the general public had populated it, which means that the algorithm has to be tested against a larger set of clicks before a fair assessment of its performance can be made.

A minor issue found during the testing was that the system was missing a mechanism for finding out whether any document had actually been rescored by the clickthrough data algorithm. This could have added a minor time difference, as the system would try to sort the array of documents even if there was no actual difference between before and after the clickthrough data algorithm had completed its rescoring.

6 Evaluation

6.1 The Study

The experiment was conducted in two parts. The first part was to evaluate the efficiency of the CDA by comparing a version of Solr with the CDA to a version of Solr without the CDA. The second part was to evaluate whether the CDA would increase or decrease the relevancy of the search results, by using an online survey.

6.1.1 Efficiency Evaluation

Search Term           Index Size                     Number of Results
Windows Install       Small (5,000 documents)        25
Java Get Time         Small (5,000 documents)        10
JavaScript Cookies    Small (5,000 documents)        5
Windows Install       Medium (50,000 documents)      134
Java Get Time         Medium (50,000 documents)      36
JavaScript Cookies    Medium (50,000 documents)      10
Windows Install       Large (1,000,000 documents)    5157
Java Get Time         Large (1,000,000 documents)    1268
JavaScript Cookies    Large (1,000,000 documents)    350

Table 1 - Number of results for each search term depending on index size

To determine whether the clickthrough data algorithm was affected by differences in the number of search results, a collection of three queries was chosen (see Table 1). The reason these three search terms were used is that they all differ in the number of search results across the three index sizes. The search term Windows Install had the largest number of results, JavaScript Cookies had the fewest, and Java Get Time was somewhere in between.

Search Term           Index Size                     Search Algorithm
Windows Install       Small (5,000 documents)        Clickthrough Data Algorithm
Windows Install       Medium (50,000 documents)      Clickthrough Data Algorithm
Windows Install       Large (1,000,000 documents)    Clickthrough Data Algorithm
Windows Install       Small (5,000 documents)        Original Solr
Windows Install       Medium (50,000 documents)      Original Solr
Windows Install       Large (1,000,000 documents)    Original Solr
Java Get Time         Small (5,000 documents)        Clickthrough Data Algorithm
Java Get Time         Medium (50,000 documents)      Clickthrough Data Algorithm
Java Get Time         Large (1,000,000 documents)    Clickthrough Data Algorithm
Java Get Time         Small (5,000 documents)        Original Solr
Java Get Time         Medium (50,000 documents)      Original Solr
Java Get Time         Large (1,000,000 documents)    Original Solr
JavaScript Cookies    Small (5,000 documents)        Clickthrough Data Algorithm
JavaScript Cookies    Medium (50,000 documents)      Clickthrough Data Algorithm
JavaScript Cookies    Large (1,000,000 documents)    Clickthrough Data Algorithm
JavaScript Cookies    Small (5,000 documents)        Original Solr
JavaScript Cookies    Medium (50,000 documents)      Original Solr
JavaScript Cookies    Large (1,000,000 documents)    Original Solr

Table 2 - Parameter array for efficiency tests

A set of 18 parameter combinations was measured to determine the efficiency of the clickthrough data algorithm at three different index sizes. The complete set of parameter combinations can be seen in Table 2. Each parameter combination was run for 200 iterations, which resulted in a total of 1800 results with the CDA activated (see Appendix M) and a total of 1800 results without the CDA activated (see Appendix N). To see how the efficiency of Apache Solr is affected, there are three major aspects to evaluate:

1. How the algorithm affects the request speed depending on how many results there are for the search term.
2. How the algorithm affects the request speed depending on the index size of the search engine.
3. How the algorithm affects the request speed in general.

The request speed is the time from when a request is sent from the search platform to Solr until the search platform has retrieved the search results and presented them to the user.
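To illustrate how such a measurement series can be scripted, the following Java sketch times one of the parameter combinations for 200 iterations against each of the two Solr installations; the endpoint URLs, core names and the example query are assumptions, not the exact setup used in the study.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestSpeedTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // One endpoint per Solr installation; URLs, core names and the query are assumptions.
        String[] endpoints = {
            "http://localhost:8983/solr/posts/select?q=windows+install&rq=%7B!cda%7D", // with the CDA
            "http://localhost:8984/solr/posts/select?q=windows+install"                // original Solr
        };
        int iterations = 200; // 200 iterations per parameter combination, as in the study
        for (String url : endpoints) {
            long totalNanos = 0;
            for (int i = 0; i < iterations; i++) {
                long start = System.nanoTime();
                client.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                            HttpResponse.BodyHandlers.ofString());
                totalNanos += System.nanoTime() - start;
            }
            System.out.printf("%s: %.2f ms average request time%n",
                    url, totalNanos / (double) iterations / 1_000_000.0);
        }
    }
}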

Chart 2 - Comparison chart of request speeds for all of the parameter combinations with standard deviation

To address the first aspect, Chart 2 shows the average measured request time for each of the search terms listed in Table 1. Each of the search terms was run for 200 iterations on each index size. As no other testing was performed on a search term with such a low number of search results, the reason behind the low request time for the search term JavaScript Cookies is uncertain. An educated guess is that it is due to the low number of results for that search term in the small index, a total of 5 search results.

Chart 3 - Comparison of the request speed depending on the index size with and without the Clickthrough Data Algorithm

By combining the results for the search terms on each index size, a fairer assessment of the differences in request speed between the index sizes can be made. Chart 3 displays the average request speed for each index size with standard deviation. The measurements used in Chart 3 are a combination of the three different search terms run on the same index size, for a total of 600 iterations.

Chart 4 - Full report of 1800 executions of the Clickthrough Data Algorithm

Chart 4 displays all of the measurement iterations that were run on the version of Apache Solr with the clickthrough data algorithm. It shows the execution speed of the CDA, which is the time from when the query parser plugin (see Chapter 5, Section 5.2.4) receives the request to start rescoring the documents of the query until it has finished processing all documents and, possibly, re-sorted them.

Chart 5 - Comparison of the full page load time of the search platform, with and without the Clickthrough Data Algorithm

The full page load time of the search platform is the time it takes from when a user has connected to the web server through a web browser until the page has been fully loaded. In the case shown in Chart 5 the page load also includes a query to Solr and the retrieval of the search results; the "Other" part is all of the material that is loaded that has no connection to the search engine itself. The measurement in the chart consists of the average value for each respective part of the page load, iterated 1800 times for each version of Solr.

6.1.2 Relevancy Evaluation

A quantitative survey was created to investigate whether users preferred the ranking that the clickthrough data algorithm created or not. A total of 100 individuals with some prior knowledge of programming were asked to partake in the survey. The survey was closed after 45 respondents had finished it. The reason for asking for prior programming knowledge was that most of the possible search results are related to programming, as Stack Overflow is an online programming community (Stack Overflow, 2017). The computer and programming experience of each participant is shown in Chart 6.

Chart 6 - Years of professional computer/programming experience for all survey participants. Each bar represents a participant of the survey.

After each participant had filled in their experience, they were presented with the search results for a randomly selected search term, displaying the results for both the original version of Solr and the version of Solr with the clickthrough data algorithm next to each other. The participants were then instructed to select the result list that they felt provided the best answer to the search term (see Figure 9). Each participant performed this task twice with different randomly selected search terms, until every search term had been evaluated 15 times. Chart 7 displays all of the selections made by the participants of the survey.

Figure 9 - Survey User Interface

Chart 7 - Survey result list selections

6.2 Analysis

The difference in request speed is the major area of interest, as that is what can be affected the most by the clickthrough data algorithm. The request speed is the time from the moment a query is sent to the API of Apache Solr from the search platform until the data has been retrieved and displayed to the user. After performing a total of 3600 iterations of the measurements, 1800 iterations with and 1800 iterations without the clickthrough data algorithm, the average difference between the request speeds dropped by more than half compared to the pilot study. Chart 8 displays the average request speed for both the version of Solr with and the version without the clickthrough data algorithm. As shown in Chart 8, the time difference is small enough that a difference between the two versions can barely be noticed. On average, the version of Solr running with the clickthrough data algorithm was 2.8 milliseconds faster than the original version of Solr, even after a much larger set of clicks had been recorded compared to the pilot study.

Chart 8 - Comparison of request speed with and without the clickthrough data algorithm with standard deviation

When comparing the average request speeds between different index sizes, regardless of which search term was used, it is clear how little the index size matters (see Chart 3). Comparing the standard deviation for all of the tests, all except the original version of Solr with a large index intersect at approximately 0.34 seconds. This indicates that, on average, the index size does not affect the general efficiency of Solr. No correlation could be found between the clickthrough data algorithm execution time and the index size; see Appendix N for the full report of measurements.

By comparing the full page load (see Chart 5) of the original version of Solr and the version of Solr with the clickthrough data algorithm, it is clearly visible that the only time difference between them is a few milliseconds. The average time it takes for the clickthrough data algorithm to be fully executed is 0.00304 seconds (see Chart 4 for all 1800 iterations), and no other components of Solr should be affected by the clickthrough data algorithm, as it is implemented as one of the last components to be executed in Solr. This is clearly presented in Chart 5, where the time difference between the two versions is barely separable.

Chart 9 - Average survey result list selection

The survey data shows that in most cases the clickthrough data algorithm increases the relevance of the search queries performed (see Chart 9). The only search term for which the original version of Solr was chosen over the version with the clickthrough data algorithm is the term php shorthand (see Chart 7). However, by examining the full survey results (see Appendix O), a correlation between participants with low experience and participants receiving the search term php shorthand is seen in multiple cases, which could have affected the results for that specific search term and thereby also the average list selection.

6.3 Conclusion

By examining Charts 3 and 4 it is clear that the index size of Solr does not affect the speed of retrieving results in any large manner. This indicates that the clickthrough data algorithm can be used at any index size without causing any major loss of time. The only exception in the measurements performed in this study was that a search term with a lower number of hits would, on average, have a lower request speed on the smaller index.

The measurements shown in Charts 3, 4 and 8 indicate that the clickthrough data algorithm does not add enough delay during the retrieval of search results to make the system inefficient. On average, the supposedly added delay was low enough (less than 4 milliseconds) for the version of Solr with the clickthrough data algorithm to have a lower average full page load time than the original version of Solr; it was actually 0.2 milliseconds faster on average. The standard deviations of both versions of Solr in Chart 8 intersect, which shows that on average there is no difference between the two versions and indicates that the algorithm is not affecting the efficiency of the search engine.

When comparing the search results of the version of Solr with the clickthrough data algorithm and the original version of Solr, the survey data shows that the algorithm increases the relevancy of the search results, as shown in Chart 9. Out of the 90 queries that participants performed, even with several participants who had little or no professional experience of programming, the participant preferred the version of Solr with the added algorithm in 67.78% of the search queries. This indicates that the algorithm returns a document scoring that is, according to the survey data, somewhat more accurate.

The hypothesis was approximately correct, as it was possible to create an algorithm based on clickthrough data using the existing theory and literature. By evaluating the algorithm it can be concluded that the algorithm increased the relevance of the search results at the cost of a minor delay. Most online testing tools still consider the page load time of the search platform to be in the top percentiles of the sites registered; for example, GTmetrix (2017) places the search platform among the top 2% fastest of the 182 million pages it has analysed.

7 Concluding Remarks

7.1 Summary

The goal of the study was to evaluate whether clickthrough data could efficiently be used to increase the relevance of search results. This was investigated by performing an experiment where an algorithm was developed and implemented in the search engine Apache Solr. The hypothesis of the study was:

"The hypothesis of this study is that, based on existing theory, a clickthrough data algorithm can be created that displays search results with better precision with respect to the search term than the original algorithm in the search engine. It also argues that the algorithm will add a minor delay to retrieving the results, but that the delay is low enough for the search engine to remain efficient." (Chapter 3, Section 3.2)

This study concludes that clickthrough data can be used to efficiently increase the relevance of search results by implementing a clickthrough data algorithm in Apache Solr. The efficiency evaluation indicates that the algorithm is on average fully executed in 3.04 milliseconds, which affects the full page load by an average of 0.38%. The survey shows that the algorithm provides an increased level of relevance in the search results, as in 67.78% of the completed queries the participants chose the version of Apache Solr with the algorithm over the original version of Apache Solr. Both evaluations support the hypothesis and indicate that clickthrough data can be used to efficiently increase the relevancy of search results.

7.2 Discussion

When examining the results of the experiment, an important factor is the search terms that were used and their respective clicks. What could have affected the search results are the clicks that were recorded during the study and how they could have influenced the choices made by the participants in the survey. The clicks used to calculate scores for the documents in the survey are the clicks recorded through use of the search platform during the study as well as the clicks recorded during the pilot study, rather than clicks manipulated manually (see Appendix R for the full list of clicks).

The order of the results was drastically changed, and the score changes added by the clickthrough data algorithm were rather large. This was most likely an effect of the weight chosen in the query to Solr: as the number of recorded clicks was relatively low, a higher weight was chosen to be able to see any differences in the ordering of the search results. As Joachims (2002) describes, the idea is that the system reorders the search results that are presented, for better or worse, depending on which documents the users tend to choose. A higher weight can affect the order of the search results as much negatively as positively, since every document that has been chosen by other users gains scoring from the recorded clicks regardless of whether the document is relevant or irrelevant, as long as users have clicked it from the result list.

As all of the measurements were executed in the web browser Google Chrome, no conclusions can be drawn regarding how other web browsers could have affected the results.


More information

An Application for Monitoring Solr

An Application for Monitoring Solr An Application for Monitoring Solr Yamin Alam Gauhati University Institute of Science and Technology, Guwahati Assam, India Nabamita Deb Gauhati University Institute of Science and Technology, Guwahati

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

Adobe Marketing Cloud Data Workbench Controlled Experiments

Adobe Marketing Cloud Data Workbench Controlled Experiments Adobe Marketing Cloud Data Workbench Controlled Experiments Contents Data Workbench Controlled Experiments...3 How Does Site Identify Visitors?...3 How Do Controlled Experiments Work?...3 What Should I

More information

Google Tag Manager. Google Tag Manager Custom Module for Magento

Google Tag Manager. Google Tag Manager Custom Module for Magento Google Tag Manager Custom Module for Magento TABLE OF CONTENTS Table of Contents Table Of Contents...2 1. INTRODUCTION...3 2. Overview...3 3. Requirements...3 4. Features...4 4.1 Features accessible from

More information

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture

Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture Evaluation of Long-Held HTTP Polling for PHP/MySQL Architecture David Cutting University of East Anglia Purplepixie Systems David.Cutting@uea.ac.uk dcutting@purplepixie.org Abstract. When a web client

More information

Creating an Intranet using Lotus Web Content Management. Part 2 Project Planning

Creating an Intranet using Lotus Web Content Management. Part 2 Project Planning Creating an Intranet using Lotus Web Content Management Introduction Part 2 Project Planning Many projects have failed due to poor project planning. The following article gives an overview of the typical

More information

Middle East Technical University. Department of Computer Engineering

Middle East Technical University. Department of Computer Engineering Middle East Technical University Department of Computer Engineering TurkHITs Software Requirements Specifications v1.1 Group fourbytes Safa Öz - 1679463 Mert Bahadır - 1745785 Özge Çevik - 1679414 Sema

More information

Istat s Pilot Use Case 1

Istat s Pilot Use Case 1 Istat s Pilot Use Case 1 Pilot identification 1 IT 1 Reference Use case X 1) URL Inventory of enterprises 2) E-commerce from enterprises websites 3) Job advertisements on enterprises websites 4) Social

More information

MAXIMIZING ROI FROM AKAMAI ION USING BLUE TRIANGLE TECHNOLOGIES FOR NEW AND EXISTING ECOMMERCE CUSTOMERS CONSIDERING ION CONTENTS EXECUTIVE SUMMARY... THE CUSTOMER SITUATION... HOW BLUE TRIANGLE IS UTILIZED

More information

3. WWW and HTTP. Fig.3.1 Architecture of WWW

3. WWW and HTTP. Fig.3.1 Architecture of WWW 3. WWW and HTTP The World Wide Web (WWW) is a repository of information linked together from points all over the world. The WWW has a unique combination of flexibility, portability, and user-friendly features

More information

CHAPTER 5 TESTING AND IMPLEMENTATION

CHAPTER 5 TESTING AND IMPLEMENTATION CHAPTER 5 TESTING AND IMPLEMENTATION 5.1. Introduction This chapter will basically discuss the result of the user acceptance testing of the prototype. The comments and suggestions retrieved from the respondents

More information

WORDPRESS 101 A PRIMER JOHN WIEGAND

WORDPRESS 101 A PRIMER JOHN WIEGAND WORDPRESS 101 A PRIMER JOHN WIEGAND CONTENTS Starters... 2 Users... 2 Settings... 3 Media... 6 Pages... 7 Posts... 7 Comments... 7 Design... 8 Themes... 8 Menus... 9 Posts... 11 Plugins... 11 To find a

More information

5. Application Layer. Introduction

5. Application Layer. Introduction Book Preview This is a sample chapter of Professional PHP - Building maintainable and secure applications. The book starts with a few theory chapters and after that it is structured as a tutorial. The

More information

RavenDB & document stores

RavenDB & document stores université libre de bruxelles INFO-H415 - Advanced Databases RavenDB & document stores Authors: Yasin Arslan Jacky Trinh Professor: Esteban Zimányi Contents 1 Introduction 3 1.1 Présentation...................................

More information

Exploring the Nuxeo REST API

Exploring the Nuxeo REST API Exploring the Nuxeo REST API Enabling Rapid Content Application Craftsmanship Copyright 2018 Nuxeo. All rights reserved. Copyright 2017 Nuxeo. All rights reserved. Chapter 1 The Nuxeo REST API What do

More information

Characterizing Home Pages 1

Characterizing Home Pages 1 Characterizing Home Pages 1 Xubin He and Qing Yang Dept. of Electrical and Computer Engineering University of Rhode Island Kingston, RI 881, USA Abstract Home pages are very important for any successful

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

magento_1:blog_pro

magento_1:blog_pro magento_1:blog_pro https://amasty.com/docs/doku.php?id=magento_1:blog_pro For more details see the Blog Pro extension page. Blog Pro Create responsive blog posts with a handy WYSIWYG editor, easily customize

More information

Usability Test Report: Requesting Library Material 1

Usability Test Report: Requesting Library Material 1 Usability Test Report: Requesting Library Material 1 Summary Emily Daly and Kate Collins conducted usability testing on the processes of requesting library material. The test was conducted at the temporary

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Usability Test Report: Homepage / Search Interface 1

Usability Test Report: Homepage / Search Interface 1 Usability Test Report: Homepage / Search Interface 1 Summary Emily Daly, Bendte Fagge, and Steph Matthiesen conducted usability testing of the homepage and search interface in the newly redesigned Duke

More information

If you re a Facebook marketer, you re likely always looking for ways to

If you re a Facebook marketer, you re likely always looking for ways to Chapter 1: Custom Apps for Fan Page Timelines In This Chapter Using apps for Facebook marketing Extending the Facebook experience Discovering iframes, Application Pages, and Canvas Pages Finding out what

More information

FACETs. Technical Report 05/19/2010

FACETs. Technical Report 05/19/2010 F3 FACETs Technical Report 05/19/2010 PROJECT OVERVIEW... 4 BASIC REQUIREMENTS... 4 CONSTRAINTS... 5 DEVELOPMENT PROCESS... 5 PLANNED/ACTUAL SCHEDULE... 6 SYSTEM DESIGN... 6 PRODUCT AND PROCESS METRICS...

More information

Retrieval Evaluation

Retrieval Evaluation Retrieval Evaluation - Reference Collections Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, Chapter

More information

MySQL for Developers. Duration: 5 Days

MySQL for Developers. Duration: 5 Days Oracle University Contact Us: 0800 891 6502 MySQL for Developers Duration: 5 Days What you will learn This MySQL for Developers training teaches developers how to develop console and web applications using

More information

Chapter 17: INTERNATIONAL DATA PRODUCTS

Chapter 17: INTERNATIONAL DATA PRODUCTS Chapter 17: INTERNATIONAL DATA PRODUCTS After the data processing and data analysis, a series of data products were delivered to the OECD. These included public use data files and codebooks, compendia

More information

D 9.1 Project website

D 9.1 Project website Doc: FEN--RP-017 Page: Page 1 of 21 H2020 - EEB - 2017-766464 D 9.1 Project website Name Signature and date Prepared by Martina Bakešová (FENIX) 17.1.2018 Checked by Approved by Ir. C.L.G. (Christophe)

More information

In this third unit about jobs in the Information Technology field we will speak about software development

In this third unit about jobs in the Information Technology field we will speak about software development In this third unit about jobs in the Information Technology field we will speak about software development 1 The IT professionals involved in the development of software applications can be generically

More information

Viewpoint Review & Analytics

Viewpoint Review & Analytics The Viewpoint all-in-one e-discovery platform enables law firms, corporations and service providers to manage every phase of the e-discovery lifecycle with the power of a single product. The Viewpoint

More information

Teachers Manual for Creating a Website with WordPress

Teachers Manual for Creating a Website with WordPress Teachers Manual for Creating a Website with WordPress ISBN 978 90 5905 422 6 2 1. Introduction This course manual assumes a lesson structure consisting of nine points. These points have been divided into

More information

CHAPTER 5 SYSTEM IMPLEMENTATION AND TESTING. This chapter describes the implementation and evaluation process conducted on the e-

CHAPTER 5 SYSTEM IMPLEMENTATION AND TESTING. This chapter describes the implementation and evaluation process conducted on the e- CHAPTER 5 SYSTEM IMPLEMENTATION AND TESTING 5.1 Introduction This chapter describes the implementation and evaluation process conducted on the e- BSC system. In terms of implementation, the development

More information

Website minute read. Understand the business implications, tactics, costs, and creation process of an effective website.

Website minute read. Understand the business implications, tactics, costs, and creation process of an effective website. Website 101 Understand the business implications, tactics, costs, and creation process of an effective website. 8 minute read Mediant Web Development What to Expect 1. Why a Good Website is Crucial 2.

More information

A Simple Course Management Website

A Simple Course Management Website A Simple Course Management Website A Senior Project Presented to The Faculty of the Computer Engineering Department California Polytechnic State University, San Luis Obispo In Partial Fulfillment Of the

More information

MySQL for Developers. Duration: 5 Days

MySQL for Developers. Duration: 5 Days Oracle University Contact Us: Local: 0845 777 7 711 Intl: +44 845 777 7 711 MySQL for Developers Duration: 5 Days What you will learn This MySQL for Developers training teaches developers how to develop

More information

BEAWebLogic. Portal. Overview

BEAWebLogic. Portal. Overview BEAWebLogic Portal Overview Version 10.2 Revised: February 2008 Contents About the BEA WebLogic Portal Documentation Introduction to WebLogic Portal Portal Concepts.........................................................2-2

More information

Learning PHP, MySQL, JavaScript, And CSS: A Step-by-Step Guide To Creating Dynamic Websites PDF

Learning PHP, MySQL, JavaScript, And CSS: A Step-by-Step Guide To Creating Dynamic Websites PDF Learning PHP, MySQL, JavaScript, And CSS: A Step-by-Step Guide To Creating Dynamic Websites PDF Learn how to build interactive, data-driven websitesâ even if you donâ t have any previous programming experience.

More information

Web Systems Staff Intranet Card Sorting. Project Cover Sheet. Library Staff Intranet. UM Library Web Systems

Web Systems Staff Intranet Card Sorting. Project Cover Sheet. Library Staff Intranet. UM Library Web Systems Project Cover Sheet Library Staff Intranet Project Committee & Members Report Info Objectives Methodology Card Sorting The Library Staff Intranet is a gateway to various library staff administrative information

More information

Inbound Website. How to Build an. Track 1 SEO and SOCIAL

Inbound Website. How to Build an. Track 1 SEO and SOCIAL How to Build an Inbound Website Track 1 SEO and SOCIAL In this three part ebook series, you will learn the step by step process of making a strategic inbound website. In part 1 we tackle the inner workings

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

CHAPTER 1 COPYRIGHTED MATERIAL. Finding Your Way in the Inventor Interface

CHAPTER 1 COPYRIGHTED MATERIAL. Finding Your Way in the Inventor Interface CHAPTER 1 Finding Your Way in the Inventor Interface COPYRIGHTED MATERIAL Understanding Inventor s interface behavior Opening existing files Creating new files Modifying the look and feel of Inventor Managing

More information

Transaction Cordinator: Design and Planning

Transaction Cordinator: Design and Planning Transaction Cordinator: Design and Planning Joshua Lee, Damon McCormick, Kim Ly, Chris Orimoto, John Wang, and Daniel LeCheminant October 4, 2004 Contents 1 Overview 2 2 Document Revision History 2 3 System

More information

Improving Drupal search experience with Apache Solr and Elasticsearch

Improving Drupal search experience with Apache Solr and Elasticsearch Improving Drupal search experience with Apache Solr and Elasticsearch Milos Pumpalovic Web Front-end Developer Gene Mohr Web Back-end Developer About Us Milos Pumpalovic Front End Developer Drupal theming

More information

D8.1 Project website

D8.1 Project website D8.1 Project website WP8 Lead Partner: FENIX Dissemination Level: PU Deliverable due date: M3 Actual submission date: M3 Deliverable Version: V1 Project Acronym Project Title EnDurCrete New Environmental

More information

Polyratings Website Update

Polyratings Website Update Polyratings Website Update Senior Project Spring 2016 Cody Sears Connor Krier Anil Thattayathu Outline Overview 2 Project Beginnings 2 Key Maintenance Issues 2 Project Decision 2 Research 4 Customer Survey

More information

CREATE FORUMS THE CREATE FORUM PAGE

CREATE FORUMS THE CREATE FORUM PAGE CREATE FORUMS A discussion board forum is an area where participants discuss a topic or a group of related topics. Within each forum, users can create multiple threads. A thread includes the initial post

More information

Usability evaluation in practice: the OHIM Case Study

Usability evaluation in practice: the OHIM Case Study Usability evaluation in practice: the OHIM Case David García Dorvau, Nikos Sourmelakis coppersony@hotmail.com, nikos.sourmelakis@gmail.com External consultants at the Office for Harmonization in the Internal

More information

Build Meeting Room Management Website Using BaaS Framework : Usergrid

Build Meeting Room Management Website Using BaaS Framework : Usergrid Build Meeting Room Management Website Using BaaS Framework : Usergrid Alvin Junianto Lan 13514105 Informatics, School of Electrical Engineering and Informatics Bandung Institute of Technology Bandung,

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Overview On Methods Of Searching The Web

Overview On Methods Of Searching The Web Overview On Methods Of Searching The Web Introduction World Wide Web (WWW) is the ultimate source of information. It has taken over the books, newspaper, and any other paper based material. It has become

More information

Byte Academy. Python Fullstack

Byte Academy. Python Fullstack Byte Academy Python Fullstack 06/30/2017 Introduction Byte Academy pioneered industry-focused programs beginning with the launch of our FinTech course, the first of its type. Our educational programs bridge

More information

JBoss ESB 4.0 GA RC1. Message Transformation Guide JBESB-MTG-12/1/06 JBESB-PG-12/1/06

JBoss ESB 4.0 GA RC1. Message Transformation Guide JBESB-MTG-12/1/06 JBESB-PG-12/1/06 JBoss ESB 4.0 GA RC1 Message Transformation Guide JBESB-MTG-12/1/06 JBESB-PG-12/1/06 i JBESB-PG-12/1/06 ii Legal Notices The information contained in this documentation is subject to change without notice.

More information

ESET Remote Administrator 6. Version 6.0 Product Details

ESET Remote Administrator 6. Version 6.0 Product Details ESET Remote Administrator 6 Version 6.0 Product Details ESET Remote Administrator 6.0 is a successor to ESET Remote Administrator V5.x, however represents a major step forward, completely new generation

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Student Usability Project Recommendations Define Information Architecture for Library Technology

Student Usability Project Recommendations Define Information Architecture for Library Technology Student Usability Project Recommendations Define Information Architecture for Library Technology Erika Rogers, Director, Honors Program, California Polytechnic State University, San Luis Obispo, CA. erogers@calpoly.edu

More information

MovieRec - CS 410 Project Report

MovieRec - CS 410 Project Report MovieRec - CS 410 Project Report Team : Pattanee Chutipongpattanakul - chutipo2 Swapnil Shah - sshah219 Abstract MovieRec is a unique movie search engine that allows users to search for any type of the

More information

Blog Pro for Magento 2 User Guide

Blog Pro for Magento 2 User Guide Blog Pro for Magento 2 User Guide Table of Contents 1. Blog Pro Configuration 1.1. Accessing the Extension Main Setting 1.2. Blog Index Page 1.3. Post List 1.4. Post Author 1.5. Post View (Related Posts,

More information

Implementation Architecture

Implementation Architecture Implementation Architecture Software Architecture VO/KU (707023/707024) Roman Kern ISDS, TU Graz 2017-11-15 Roman Kern (ISDS, TU Graz) Implementation Architecture 2017-11-15 1 / 54 Outline 1 Definition

More information

WebBiblio Subject Gateway System:

WebBiblio Subject Gateway System: WebBiblio Subject Gateway System: An Open Source Solution for Internet Resources Management 1. Introduction Jack Eapen C. 1 With the advent of the Internet, the rate of information explosion increased

More information

COLUMN. Worlds apart: the difference between intranets and websites. The purpose of your website is very different to that of your intranet MARCH 2003

COLUMN. Worlds apart: the difference between intranets and websites. The purpose of your website is very different to that of your intranet MARCH 2003 KM COLUMN MARCH 2003 Worlds apart: the difference between intranets and websites Beyond a common use of HTML, intranets and corporate websites (internet sites) are very different animals. The needs they

More information

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved.

User Guide. Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. User Guide Version 1.5 Copyright 2006 by Serials Solutions, All Rights Reserved. Central Search User Guide Table of Contents Welcome to Central Search... 3 Starting Your Search... 4 Basic Search & Advanced

More information

Data analysis using Microsoft Excel

Data analysis using Microsoft Excel Introduction to Statistics Statistics may be defined as the science of collection, organization presentation analysis and interpretation of numerical data from the logical analysis. 1.Collection of Data

More information

FoodBack. FoodBack facilitates communication between chefs and patrons. Problem Solution and Overview: Yes&: Andrei T, Adrian L, Aaron Z, Dyllan A

FoodBack. FoodBack facilitates communication between chefs and patrons. Problem Solution and Overview: Yes&: Andrei T, Adrian L, Aaron Z, Dyllan A FoodBack Yes&: Andrei T, Adrian L, Aaron Z, Dyllan A FoodBack facilitates communication between chefs and patrons. Problem Solution and Overview: In cafeterias across America, hungry patrons are left unsatisfied

More information

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies

Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies Using Internet as a Data Source for Official Statistics: a Comparative Analysis of Web Scraping Technologies Giulio Barcaroli 1 (barcarol@istat.it), Monica Scannapieco 1 (scannapi@istat.it), Donato Summa

More information

Wiki A Systems Programming Productivity Tool

Wiki A Systems Programming Productivity Tool Wiki A Systems Programming Productivity Tool John W. Noel MIS Graduate Student Tech Student / System Programmer I SAS Institute Inc. Cary, NC Regina Robbins Systems Programmer I SAS Institute Inc. Cary,

More information

Qualtrics Survey Software

Qualtrics Survey Software Qualtrics Survey Software GETTING STARTED WITH QUALTRICS Qualtrics Survey Software 0 Contents Qualtrics Survey Software... 2 Welcome to Qualtrics!... 2 Getting Started... 2 Creating a New Survey... 5 Homepage

More information

The Topic Specific Search Engine

The Topic Specific Search Engine The Topic Specific Search Engine Benjamin Stopford 1 st Jan 2006 Version 0.1 Overview This paper presents a model for creating an accurate topic specific search engine through a focussed (vertical)

More information

Pro Events. Functional Specification. Name: Jonathan Finlay. Student Number: C Course: Bachelor of Science (Honours) Software Development

Pro Events. Functional Specification. Name: Jonathan Finlay. Student Number: C Course: Bachelor of Science (Honours) Software Development Pro Events Functional Specification Name: Jonathan Finlay Student Number: C00193379 Course: Bachelor of Science (Honours) Software Development Tutor: Hisain Elshaafi Date: 13-11-17 Contents Introduction...

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

The main website for Henrico County, henrico.us, received a complete visual and structural

The main website for Henrico County, henrico.us, received a complete visual and structural Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update

More information

Knowing something about how to create this optimization to harness the best benefits will definitely be advantageous.

Knowing something about how to create this optimization to harness the best benefits will definitely be advantageous. Blog Post Optimizer Contents Intro... 3 Page Rank Basics... 3 Using Articles And Blog Posts... 4 Using Backlinks... 4 Using Directories... 5 Using Social Media And Site Maps... 6 The Downfall Of Not Using

More information

News English.com Ready-to-use ESL / EFL Lessons

News English.com Ready-to-use ESL / EFL Lessons www.breaking News English.com Ready-to-use ESL / EFL Lessons 1,000 IDEAS & ACTIVITIES FOR LANGUAGE TEACHERS The Breaking News English.com Resource Book http://www.breakingnewsenglish.com/book.html Top

More information

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword.

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword. SEO can split into two categories as On-page SEO and Off-page SEO. On-Page SEO refers to all the things that we can do ON our website to rank higher, such as page titles, meta description, keyword, content,

More information