FOCUSED SURFER MODELS: Ranking visual search results
FOCUSED SURFER MODELS: Ranking visual search results
Jasper Hafkenscheid
Faculty of Mathematics and Natural Sciences, University of Groningen
August 2010
Jasper Hafkenscheid: Focused surfer models, August 2010
Supervisors: Marco Aiello, Nicolai Petkov, Mathijs Homminga
Location: Groningen
Time frame: August 2010
ABSTRACT

PageRank is a graph-based ranking algorithm that ranks the nodes in a graph based on their connections. It has been designed to determine the intrinsic value of a page, based on the link structure of the web. We research the effects of adaptations of the PageRank algorithm in the context of a gallery search engine. The search engine uses text-based search to select galleries: web pages that contain a number of small images which link to enlarged versions of them. We evaluate the performance by looking at the ordering of the search results.

The generic PageRank algorithm does not work well for this kind of application; the best galleries are not found on the pages with the highest PageRank. PageRank assumes that all links convey trust to other pages, but many links disrupt that concept (e.g. links to download or update your browser or required plug-ins). At the heart of the algorithm is the random surfer model, which is based on a virtual user that navigates the web by following a random link on a page, repeating this infinitely. The PageRank algorithm can be changed by altering the probability with which the random surfer chooses which link to follow. Our design choice is to replace the random surfer model with a focused surfer model, which increases the probability of following a link based on the similarity between the linking page and the target.

We analyze the performance with the gallery search engine and compare the results with generic PageRank and uniform ranking. The search engine uses a dataset that consists of 15 million galleries; to acquire this dataset we have crawled 1.5 billion pages. We find that alterations to the system affect the outcome, and that the overall performance is increased. The PageRank algorithm is intended for information retrieval; it works very well for general-purpose ranking, but the random surfer model does not work well enough for all applications.
CONTENTS

1 Introduction
2 Background: Information Retrieval; Image Retrieval; Ranking; Web graph; PageRank (Concept, Customization)
3 Related work: Web-based Image Retrieval; Intentional surfer model; Topic-sensitive PageRank; Hyperlink-Induced Topic Search; On-Line Page Importance Computation
4 Concept: Importance of links (Random surfer, ODP-biased surfer, Levenshtein-distance surfer); Granularity; Self-links; Spam
5 Realization: System architecture; Procedure (Build graph, Iterate PageRank, Build a rank-lookup index, Rerank gallery index); Measuring performance; Data set; Tools (Apache Hadoop, Lucene, Java, Nutch, LingPipe)
6 Evaluation and results: Sample web graph; Real web graph
7 Conclusion: Surfer models; Dataset; Information retrieval; Measuring
8 Future work: Level of detail; Tooling; Link prioritization; ODP classifier
A ODP classifier
B Sample web graph
C Lucene scoring
D Large dataset results
Bibliography
LIST OF FIGURES

Figure 1 Example link-structure
Figure 2 A linked set of hubs and authorities
Figure 3 Architecture overview
Figure 4 MapReduce
Figure 5 Example web graph
Figure 6 Sample scoring page

LIST OF TABLES

Table 1 Ranking of sample web graph
Table 2 Static gallery ordering using generic PageRank
Table 3 Static gallery ordering using Levenshtein-based PageRank
Table 4 Static gallery ordering using ODP-based PageRank
Table 5 ODP classifier precision and recall
Table 6 ODP classifier confusion matrix
Table 7 Probability matrix based on ODP topic matches
Table 8 User feedback on ranking

ACRONYMS

ODP  Open Directory Project
DFS  Distributed File System
OPIC On-Line Page Importance Computation
HITS Hyperlink-Induced Topic Search
TLD  Top Level Domain
TREC Text REtrieval Conference
1 INTRODUCTION

The Internet has grown enormously since the invention of HTML. HTML is the language for web pages, which can contain links to other HTML pages; these links are called hyperlinks. Users can follow the links to navigate the web, but as the Internet grew, it became harder to find information accurately. This was the motivation to develop search engines that index the web and allow searching based on keywords. Ordering the search results became more and more important as the web grew.

PageRank is an algorithm that uses the hyperlinks on pages to determine the value of a web page. It is the basis of the Google search engine. When the Internet is seen as a directed graph with vertices (pages) and edges (hyperlinks), it is possible to determine the value of a page with respect to the number of references to it. PageRank works by distributing rank via the outlinks of every page. The rank of a page is determined by the amount of rank it receives from the pages that refer to it. This recursive algorithm calculates the rank for every node in the graph. Nodes with a higher rank represent more valuable web pages, which are more likely to contain valuable information. These pages should be given a higher position in the search results.

We apply PageRank to an image gallery dataset, which consists of image galleries found on the web. We find that the performance is sub-optimal, because the quality of a photo gallery cannot be properly determined by the popularity of the web page. However, PageRank can still be used to determine which galleries are more valuable than others. To improve the PageRank computation, we replace the surfer model with a model that is smarter and more accurately resembles real users. The concept is that links between pages on the same topic are more important than links to pages that concern a different topic. Pages that have a lot in common are rated as more likely candidates to follow than pages that cover different topics.
We have developed an ODP-biased surfer, which uses a classifier trained with ODP data to determine the similarity of two pages. ODP is a project that collects and categorizes links to web sites; the data is freely available. The other surfer model that we test is based on the Levenshtein distance, a measurement of the similarity of the originating and target URLs. The concept behind this model is that pages that discuss the same topic are likely to mention it in their URL. For comparison, we also calculate PageRank with the original random surfer model. The fourth and last sorting method gives each page a rank of 1, which we call uniform ranking.

To gather results we process a dataset that consists of approximately 15 million galleries and 1.5 billion other pages. The calculated ranks are then combined with the text-match score to produce search results for 22 queries. Because there are no good validation methods for this dataset, we ask people what they think of the resulting ordering
of the galleries. To demonstrate the desired effect, we have also applied the algorithm to a sample web graph. We see that the new surfer models outperform the random surfer model. The ODP surfer model does not work better than the Levenshtein model, which can be explained by the poor performance of the classifier. The research still shows that the random surfer model is sub-optimal, and also that the uniform ranking is better in some areas, which was unexpected.

The remainder of this thesis is structured into seven chapters. Chapter 2 explains some of the basics of information retrieval and PageRank. Chapter 3 explains the position of this thesis with respect to the state of the art. In Chapter 4 we explain the concept and introduce the three surfer models. Then we describe how we computed the rank values and explain some of the concepts (Chapter 5). In Chapters 6 and 7 we discuss and explain the results. The final chapter contains some future work in the field of alternative surfer models for PageRank.
2 BACKGROUND

There has been a lot of research on information retrieval and various ranking methods. Some of the most effective groundwork regarding query-independent ranking is PageRank. In this chapter we introduce these topics.

2.1 information retrieval

Information retrieval concerns the storage of and access to information. The representation and organization of the information items should provide the user with easy access to the information in which he is interested [4]. The typical scenario is that a user has an information need; it is the task of the information retrieval system to provide the information. The information need is often translated into a query. The query normally consists of some keywords, but may include other parameters, like a time period. The query is then fed into a search engine, which returns the requested data to the user. To provide the user with easy access, the results are ordered based on their relevance with respect to the information need. There is also a part of the information need that is not specified in the query; this could be a desired language or a preference for recent information. These desires also have to be accounted for in the ordering of the search results. One of the differences between information retrieval and data retrieval is that an information retrieval system deals with natural language, whereas a data retrieval system often uses a database and deals with a well-defined structure and semantics.

2.2 image retrieval

Image retrieval is a type of information retrieval which involves images. The search can either be text-based or based on image data. Most of the time the text data consists of meta-data that is added to the image (e.g. a caption or keywords). Content-based image retrieval uses data that is extracted from the image with computer-vision techniques.
This allows for searching based on a different image or by defining image features like color or composition [19]. In this thesis we do not use computer-vision techniques; we perform image retrieval using the text that is found with the image.

2.3 ranking

Ranking is determining the order of elements; in our case, the elements are search result entries. The ordering of the results is based on the information need, which can be broken down into two components: a query and a generic component. The query is represented as a number of keywords, while the generic component describes features that are more difficult to grasp, such as popularity or authoritativeness. We want to see the results that have the best match with
the information need first, so they should be ranked higher than other results. To rank the results, each element is given a score: a number that represents the quality of that element with respect to the information need. It is difficult to determine the score of a search result, because only some of the parameters of the information need can be calculated; other features are more difficult to measure. The score is the sum of two components: a query-dependent and a query-independent score. The query-independent score is not influenced by the query, but is based on the implicit component of the information need; it reflects the intrinsic quality of a page. Because these scores are calculated independently of the query, they can be calculated ahead of time. This is called off-line computation, and it allows for more extensive analysis, because on-line methods have to be very fast. The query-dependent score is usually based on heuristics that consider the number and locations of matches of the various query words on the page itself, in the URL or in any anchor text referring to the page: the number of matches, the order of the matches, the number of other words and the uniqueness of the terms. All these factors have to be combined into the final ranking. When the elements are used to perform a search based on a query, the query-independent score is used together with a query-dependent score to rank all results matching the query [9].

2.4 web graph

To determine a score for some of the implicit features of the information need, one can use the link structure of the web. We expect that valuable pages have more links pointing towards them. A graph is a network of nodes (vertices) and links between nodes (edges). In a directed graph the edges have a direction: they have an origin and a destination vertex. All the edges that originate from a vertex are the out-edges or outlinks of that vertex.
The edges that point to a vertex are the in-edges or inlinks. The pages and links of the web can be modeled as a directed graph, the vertices being pages and the edges being links. This graph is one of the data structures that is used to determine the score of a page. There are several ranking methods that are based on the web graph. The most successful one is PageRank, which we will be using in this thesis. A number of other methods are discussed in Chapter 3.

2.5 pagerank

PageRank is a graph algorithm that can rank vertices in a graph [18, 5]. It was first used as a citation ranking mechanism, which ranked publications based on the number and quality of their citations. Larry Page and Sergey Brin developed it and later used it to build the Google search engine. If the graph is based on the link structure of the web (Figure 1), it can be used to rank web pages independently of a query.
Concept

The idea behind the PageRank algorithm is that a hyperlink to a web page is a vote of support. Pages that receive a lot of votes are given a better rank than other pages. PageRank does not only count the number of votes, but also takes the rank of the voting page into account: a page that receives a lot of votes from highly ranked pages receives a high rank itself. The PageRank of a page is defined recursively (Formula 2.1).

The concept can also be explained by the random surfer model [14]. This model is based on a random surfer, who follows a random outlink on the page he is visiting. Every time he visits a page, its score is increased, and this is repeated infinitely. If the surfer visits a page more often, its rank increases. The surfer sometimes gets bored and jumps to a random page (which models bookmarks and dead-ends); this probability is modeled by the dampening factor d. Pages that do not contain outlinks transfer their rank equally over all pages of the web. Pages that have not yet been downloaded also fall into this category. The algorithm can thus provide some sort of scoring for pages that have not yet been downloaded; these scores can then be used to selectively download the web, instead of using a breadth-first search.

PR(p_i) = (1 - d) + d * Σ_{p_j ∈ M(p_i)} P(p_j, p_i) * PR(p_j) / L(p_j)    (2.1)

M(p_i) is the set of pages linking to p_i
L(p_j) is the number of outlinks of p_j
d is the dampening factor, usually 0.85
P(p_j, p_i) is the probability that the link between the pages is traversed; in the original formula this is always 1

Customization

There are several ways to alter the behavior of PageRank. The granularity of the web graph can allow pages to merge, which makes the graph more dense; this also results in multiple pages getting the same score. One can also handle the dangling-pages problem differently. Page and Brin suggested adding the random jump.
This is a basic solution for the problem, but it does not mimic the behavior of real users. The probability that a user makes a random jump is 1 - d, which usually is 15%. The random jump also remedies the problem of rank-sinks by always allowing for a way out. The best way to influence the outcome of the algorithm is to change the personalization vector P. The vector represents the probability that the user follows a specific outlink. Page and Brin included it to allow personalization of the search results. Personalization might not be the right word, because it would be far too costly to compute the PageRank values for every user.

Rank-sink: a section of a graph that accumulates rank due to the lack of outlinks.
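Formula 2.1 can be iterated until the ranks stabilize. The following is a minimal sketch in Python, not the thesis implementation: the example graph, its link probabilities, and the fixed iteration count are illustrative assumptions.

```python
# Sketch of the PageRank iteration of Formula 2.1 with per-link traversal
# probabilities P; probabilities are normalized per source page.
def pagerank(outlinks, d=0.85, iterations=25):
    # outlinks: {page: {target: raw traversal score}}
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}  # the (1 - d) base rank
        for p, targets in outlinks.items():
            total = sum(targets.values())
            for target, score in targets.items():
                # each page spreads its rank over its outlinks, weighted
                # by the normalized traversal probability
                new_rank[target] += d * (score / total) * rank[p]
        rank = new_rank
    return rank

# Tiny made-up graph: A links to B and C, B to C, C back to A.
graph = {"A": {"B": 1.0, "C": 1.0}, "B": {"C": 1.0}, "C": {"A": 1.0}}
ranks = pagerank(graph)
```

With uniform link scores this reduces to generic PageRank; plugging in the surfer models of Chapter 4 only changes the per-link scores.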
[Figure 1: Example link structure with PageRank values after the 1st, 2nd, 3rd and 25th iterations. Pages are vertices and directed edges represent links.]
3 RELATED WORK

This chapter notes some other research in the field of search result ranking, and describes its relation to this thesis. We discuss Hyperlink-Induced Topic Search (HITS) and On-Line Page Importance Computation (OPIC) and elaborate on the various surfer models.

3.1 web-based image retrieval

Internet search engines have been available for quite some time, but most of them are designed for generic information retrieval. There is still a lot of research to be done in the field of image retrieval [12]. In this thesis, we focus on the ranking aspect of image retrieval. The generic selection and ranking of the search results are handled by traditional text search engine methods.

3.2 intentional surfer model

An improvement to the random surfer model can be made by using real-world data, which has to be gathered by analyzing the surfing behavior of real users. Google released their toolbar in December 2000 [2], which allows people to use the search service and translate pages easily. Another feature of the toolbar is that it sends information regarding surfing behavior to Google. If enough people submit their data, then Google can accurately replace the random surfer model with an intentional surfer model [10]. The browser Google Chrome, Google Analytics and the Google advertising services may also provide valuable information that enables Google to improve their ranking. For our research we do not have access to this data, and therefore use other methods to develop a new surfer model. The advantage is that our models can operate on pages that no user has ever visited.

3.3 topic-sensitive pagerank

Topic-sensitive PageRank is an adaptation of the PageRank algorithm that has been researched by Haveliwala [8]. It uses a classifier which is trained using data from the ODP [17]. ODP is a project that gathers and categorizes links to web pages; each link has a description and a title.
The dataset is maintained by volunteers and is freely available. The classifier can determine into which of the 16 main categories a page belongs. Haveliwala applied it to all pages in the dataset and performed the PageRank calculation for each of the top-level categories. At search time the topic of the query is determined, and the corresponding PageRank data is used. This resulted in a significant improvement in the ordering of the results.
The key difference between our approach and the topic-sensitive PageRank method is that we only calculate one score for each link. The researchers rely on the classifier to analyze the query, and use the corresponding PageRank data. Experience shows that it can be difficult to accurately analyze very short texts: there are numerous ambiguous words in any language, and the demographic of this search service is not bound to one language.

3.4 hyperlink-induced topic search

HITS is a graph-based algorithm that computes two scores for all the results of a query [13]. The hub score estimates the value of a page's links to other pages; the authority score estimates the value of the content of the page. The results can be divided into two categories: hubs, which lead to authorities, and authorities. HITS can also be used to identify results that are not relevant for the query (by looking at the links in the graph). One of the differences with PageRank is that it is not computed off-line, but is a query-time process: it operates on the results of a search. It is not commonly used by search engines, but a similar algorithm has been in use by ask.com.

[Figure 2: A linked set of hubs and authorities.]

3.5 on-line page importance computation

OPIC is, as the name suggests, an on-line method to rank pages [3]. The advantage is that it is able to do the calculations without storing a separate link structure. The score of a page is updated when the page is downloaded, so it is not a search-time algorithm. In the long run the scoring should be the same as with conventional PageRank. The Nutch platform (discussed in Chapter 5) includes an implementation of OPIC; it is used to determine which URLs should be downloaded first. At search time the scoring can be used to order the search results. The disadvantage is that pages need to be crawled multiple times to reach a meaningful score.
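The mutual reinforcement between hubs and authorities in HITS can be sketched in a few lines. This is an illustrative sketch, not the algorithm as deployed by any engine; the example edge list and iteration count are made up.

```python
# Minimal HITS sketch: hub and authority scores computed by mutual
# reinforcement over a small made-up link graph among query results.
def hits(links, iterations=50):
    # links: list of (source, target) edges
    pages = {p for edge in links for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a page's authority is the sum of the hub scores linking to it
        auth = {p: sum(hub[s] for s, t in links if t == p) for p in pages}
        # a page's hub score is the sum of the authorities it links to
        hub = {p: sum(auth[t] for s, t in links if s == p) for p in pages}
        # normalize so the values do not grow without bound
        na = sum(auth.values()) or 1.0
        nh = sum(hub.values()) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
```

Unlike PageRank, this runs at query time over the result set only, which is why its cost shows up per query rather than during indexing.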
4 CONCEPT

The goal of this research is to measure how different adaptations of the PageRank algorithm influence the quality of the ranking. The alterations that we explore concern using alternative surfer models: the probability with which a certain outlink is chosen is made to depend on the similarity between the two pages.

4.1 importance of links

The probability with which an outlink is traversed depends on its score with respect to the other outlinks on that web page. We adapt the PageRank algorithm by determining the relevance of each of the outlinks on a page; the random surfer model is thus changed into a smart surfer model. For every surfer model a probability function is given. The formula returns a score with which the link to p_j on page p_i will be followed. The scores are then divided by the sum of the scores of all outlinks, resulting in normalized probabilities.

Random surfer

In the random surfer model the scores are all equal to one; each link is regarded as equally important. This method is used as a reference and is later referred to as generic PageRank.

P(p_i, p_j) = 1

ODP-biased surfer

We have constructed a basic URL classifier by training it with URLs from the ODP [17]. The ODP consists of approximately 5 million URLs, all of which have been categorized into a rich data structure: they are grouped by topic on multiple levels and are also divided over a number of languages. There are 16 main categories, 14 of which we used for our research. The URLs that are found in the Regional and World categories have been re-mapped to their respective top-level categories; this ensures that the dataset does not contain English pages only. Both the URL of the originating page and the URL of the destination page are processed by the classifier. The relevance of the link is based on how much the topics overlap: pages that have the same topic have a strong connection, others a very weak one.
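The normalization of raw outlink scores into traversal probabilities described above can be sketched as follows; the score values and the uniform fallback for all-zero scores are illustrative assumptions.

```python
# Turn raw per-outlink scores into normalized traversal probabilities,
# as described in Section 4.1. The score values are made-up examples.
def normalize(scores):
    total = sum(scores.values())
    if total == 0:
        # no information: fall back to the random surfer (uniform)
        return {link: 1.0 / len(scores) for link in scores}
    return {link: s / total for link, s in scores.items()}

outlink_scores = {"page-b": 0.8, "page-c": 0.1, "page-d": 0.1}
probs = normalize(outlink_scores)
```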
Table 5 shows the results of the ODP classifier in a 10-fold evaluation; we have used every tenth ODP record to test the classifier. The table shows that the accuracy is very high (93%), but the precision (50%) and recall (50%) are much lower. Accuracy is not an appropriate measurement in this case, because of the non-uniform distribution of the classes: Business, Arts and Society together account for 50% of the dataset and have a low accuracy.

Accuracy: (TP + TN) / total
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
TP = true positives, TN = true negatives, FP = false positives, FN = false negatives
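The definitions above translate directly into code. The counts below are made-up examples chosen to show how a skewed class distribution yields high accuracy alongside low precision and recall, as observed for the ODP classifier.

```python
# Computing accuracy, precision and recall from binary counts,
# matching the definitions above. The counts are illustrative.
def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# The dominant negative class inflates accuracy while precision
# and recall stay at 50%.
acc, prec, rec = metrics(tp=50, tn=880, fp=50, fn=50)
```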
The classifier performs poorly, as the precision and recall are low. This can be explained by the ambiguity of the categories: a website about a computer game might be in the Computers, Games or Kids and Teens category. Another problem is the limited set of features used to train the classifier. We have chosen to use only the URL, which is a very limited feature. Retrieving more information about a certain outlink is complex, because the page may or may not have been crawled; even if the page is crawled, it is still difficult to gather the information on that page, because we cannot store all the web pages that we have crawled. The goal of this thesis is not to build an accurate classifier but to study the effects of intelligent-surfer models on PageRank; therefore we have not invested too much time and resources in optimizing this classifier. The sample web graph that is shown in Figure 5 shows that the classifier works. We are confident that the performance of this classifier can be improved by using more features or a less ambiguous class hierarchy.

P(p_i, p_j) = Σ_{c ∈ Cats} O(p_i, c) * O(p_j, c)

O(p_i, c) = probability that page p_i belongs to category c
Cats = the collection of 14 categories

Levenshtein-distance surfer

The other surfer model is based on the Levenshtein distance [15]: the number of characters that have to be added, changed or removed to transform one text string into another. This is a very simple method to determine how much two pages have in common. There is a chance that pages that discuss a certain topic mention that fact in their URL; if this is also the case for some of the outlinks, then this can be used to calculate the probability that someone will follow the link. This is a very simple measurement that does not require any information about the destination page other than the URL.
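The Levenshtein distance itself is the classic dynamic-programming edit distance; a minimal sketch (not the thesis implementation):

```python
# Classic dynamic-programming Levenshtein distance: the minimum number
# of insertions, deletions and substitutions to turn a into b.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j - 1] + (ca != cb),  # substitute (cost 1) or copy (cost 0)
                prev[j] + 1,               # delete from a
                cur[j - 1] + 1,            # insert into a
            ))
        prev = cur
    return prev[-1]
```

For example, turning "kitten" into "sitting" takes three edits.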
We expect that URLs on the same domain, or on a very similar host, are not much more relevant than links to other pages. We do not want to spread the rank to other pages on the same domain, but rather use it to assign a meaningful rank to other pages. We have therefore chosen to fix the importance of links to the same domain at 0.1, and links to a very similar host at 0.2. The similar-host rule reduces the effect of links to domains that are similar but have a different Top Level Domain (TLD) (e.g. google.nl, google.com).

             0.1          if Domain(p_i) = Domain(p_j)
P(p_i, p_j) = 0.2          if PL(Host(p_i), Host(p_j)) < 0.5
             PL(p_i, p_j) otherwise

Domain(p_i) returns the highest level of the domain, e.g. Domain(hadoop.apache.org) = apache.org
Host(p_i) returns the full hostname, e.g. Host(hadoop.apache.org) = hadoop.apache.org
PL(i, j) = LD(i, j) / ((length(i) + length(j)) / 2)

LD(i, j) = min( LD(i-1, j-1) + [0 if equal, 1 otherwise]   // subst/copy
                LD(i-1, j) + 1                             // insert
                LD(i, j-1) + 1 )                           // delete

4.2 granularity

We have to choose a level of granularity for the web graph. We could decide to use domains for the nodes in the graph; this would mean that links to any page in a domain are merged into one single node, and all galleries hosted on that domain get the same rank. This is not a perfect solution, because of the large number of galleries that are hosted on photo-sharing sites like Flickr or Deviantart: these sites have a huge number of galleries, but only a small percentage is of very good quality. The opposite strategy is to use the entire URL, including the query string (e.g. the page number or sorting method). This might cause problems because links to the same gallery are not combined. We have therefore chosen to use a simplified version of the URL: the query part is removed, and the remainder is used to identify galleries. This could still cause problems if multiple galleries share the same URL but use a different query parameter (e.g. albumid=2), but most sites do not use such a system, or use it for other purposes like sorting or page selection. Website developers often aim to make pretty URLs, which improve their performance in search engines and allow users to guess what a page is about from the URL alone. These pretty URLs will not be merged, but other pages will be grouped into a single node in the graph.

4.3 self-links

Another point of attention is which links are to be included in the graph. The concept is that links conduct importance to the destination page, but links that point to the same page are not relevant. This occurs a lot on forums that have links to the top of the page after each message.
A well-known method to retain rank within a particular website is to create many links on every page that point to all the pages of that website. This is also common in forums: each posting has a link to the profile of the poster, and popular threads have a number of pages that all link to each other.

4.4 spam

The last decision concerns the problem of spam. The web is contaminated by pages that are constructed to influence the ranking of search engines. The policy that we apply is that pages with a very large number of outlinks are not included in the graph. This should not greatly alter the overall computation, because the rank of such a page would be divided by the number of outlinks; each outlink would only transfer a tiny amount of rank.
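The link-scoring rules of the Levenshtein-distance surfer (Section 4.1) can be sketched as follows. The `domain` helper (taking the last two labels of a hostname) and the example hostnames are illustrative assumptions, not the thesis code; real TLD extraction needs a suffix list.

```python
# Sketch of the Levenshtein-based link score: fixed scores for same-domain
# links and very similar hosts, normalized edit distance otherwise.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),
                           prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def pl(a, b):
    # edit distance normalized by the average string length
    return levenshtein(a, b) / ((len(a) + len(b)) / 2)

def domain(host):
    # naive highest-level domain, e.g. hadoop.apache.org -> apache.org
    return ".".join(host.split(".")[-2:])

def link_score(src_host, dst_host, src_url, dst_url):
    if domain(src_host) == domain(dst_host):
        return 0.1
    if pl(src_host, dst_host) < 0.5:
        return 0.2
    return pl(src_url, dst_url)

# Same registered domain: fixed low score of 0.1.
score = link_score("hadoop.apache.org", "lucene.apache.org",
                   "hadoop.apache.org/docs", "lucene.apache.org/core")
```

A pair like google.nl and google.com falls under the similar-host rule and gets the fixed 0.2 score.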
5 REALIZATION

In order to perform the necessary computations, we have built a system that is capable of calculating the PageRank of a large number of pages in a few hours. We explain the architecture of the system and describe the tools that we used to build it.

5.1 system architecture

Figure 3 shows the architecture of the system. Some of the tools are provided by Nutch, and some have been written for the purpose of this thesis.

[Figure 3: Architecture of the system, showing processes and data flow: (1) crawl the Internet into the Nutch databases, (2) build the (unranked) gallery index, (3) build the graph, (4) iterate PageRank, (5) build the rank-lookup index, (6) rerank the gallery index into the ranked Lucene index.]

Steps 1 and 2 are provided by Nutch, which provides for the crawling and stores the data in its database. In step 3 we build the graph and calculate the probabilities that define the surfer model. Step 4 performs the PageRank computations, which are stored in a rank-lookup index in step 5. Step 6 updates the Nutch index with boost values from the rank-lookup index.

Procedure

In this section we discuss the procedure that is required to statically order the galleries in a web graph using PageRank. We assume that a
dataset is available, which has been produced by running the Nutch crawler for a considerable amount of time.

Build graph

In this step we parse the Nutch dataset and extract the web graph. Each link is represented by its origin, its target and its traversal probability; the probability is based on the formulas mentioned in the previous chapter. The outlinks are then grouped by source page. We have written a MapReduce processing job to perform this task.

Iterate PageRank

Now we iterate the PageRank algorithm until it reaches a stable state. In each iteration the rank of each page is calculated using the ranks of the inlinks to that page. The amount of rank that is transferred is relative to the probability of the outlinks. This is also wrapped in a MapReduce job. Depending on the density of the graph, this can take a lot of iterations.

Build a rank-lookup index

We construct a rank index containing the ranks of all galleries. This step simplifies the process of updating the rank in the gallery index. This is also a MapReduce job; at the end, all the sub-indexes that are produced by the reducers are merged into one index.

Rerank gallery index

The gallery index has been built from the same Nutch dataset with a Nutch tool. We traverse the gallery index and set the boost value for every document according to the rank that is stored in the rank index. This is the only part of the process that is not performed by a MapReduce job, because the input for this step is an existing index; indexes are great for searching, but building and reading must be done in one thread. The Lucene library uses a complex formula to score search results based on the search terms, boost values for different fields, and the number of occurrences of the terms (Appendix C). We calculate the boost value of a document by taking the logarithm of its PageRank. This ensures that the boost values do not overpower the dynamic ranking.
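The boost computation can be sketched as follows. Only the idea of damping the rank with a logarithm comes from the text; the base and the +1 offset are illustrative assumptions, not the thesis's exact formula.

```python
import math

# Sketch: derive a document boost from its PageRank so that pages with
# very high rank do not overpower the query-dependent Lucene score.
# The +1 offset (keeping the boost non-negative) is an assumption.
def boost(pagerank):
    return math.log(pagerank + 1)

# A page with 100x the rank gets only a modestly larger boost.
low, high = boost(1.0), boost(100.0)
```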
It is important that documents are primarily ranked by their dynamic ranking; the goal is only to slightly influence the ordering by setting the boost values.

Measuring performance

At this point we load the new gallery index into the search engine and use the application to inspect the new ranking. We have asked a number of people to rate the computed ranking. For each of a number of queries they are asked to count the ugly or bad galleries, and to score the ranking of the first five results on a five-step scale from very bad to excellent.
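The iteration and boost steps above can be sketched as follows. This is a minimal single-machine sketch in Python (the actual implementation consists of Hadoop MapReduce jobs); the 0.15/0.85 damping split is assumed here to match the sample-graph behaviour described in chapter 6, and the function names are illustrative:

```python
import math

def iterate_pagerank(links, iterations=20):
    """links maps source -> {target: traversal probability}, summing to 1 per source."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        incoming = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target, prob in targets.items():
                incoming[target] += rank[source] * prob
        # base rank 0.15 plus damped incoming rank; a page without
        # inlinks therefore keeps rank 0.15, as in the sample graph
        rank = {p: 0.15 + 0.85 * incoming[p] for p in pages}
    return rank

def boost(rank_value):
    # the logarithm keeps the static boost from overpowering the dynamic ranking
    return math.log(rank_value)
```

With the link probabilities precomputed, one call per iteration over all sources corresponds to one MapReduce pass over the graph.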
The search terms that are used in this evaluation are selected by looking at the words that occur most frequently in search queries. We have also added some well-known people, some football clubs and some vague subjects. This should be a good sample set to determine the quality of the ranking.

5.2 data set

To be able to compare the results with other ranking methods, a generic dataset is required. The Text REtrieval Conference (TREC) provides such datasets, but we have not been able to find one that could be used for this thesis, because all of them focus on text retrieval and not on the images that are found on the pages [16].

We use a previously-built dataset that was constructed by an enhanced Nutch crawler (section 5.3.3). The crawler can detect galleries, which are pages with a group of thumbnails that link to higher-resolution versions of those thumbnails. The data is stored on a Distributed file system (DFS), including all the outlinks of each page. These files are processed by a MapReduce task that creates vertices for each link.

The set that we use for testing consists of pages that were crawled in a single year and contains about 15 million galleries. On average one percent of the crawled pages is a gallery. The starting points of the crawl consisted of a number of well-known web pages (targeted at galleries), so we can assume that the graph does not have too many components. We also use a sample web graph (Appendix B) that clearly illustrates the intended behavior.

5.3 tools

Whilst working on the application, we have been introduced to some excellent open-source tools. These tools have been vital for the success of this application.

Apache Hadoop

Apache Hadoop is an open-source MapReduce engine. MapReduce [6] is a programming model that allows data processing jobs to be split into small chunks. These chunks are processed on a server cluster, which enables huge amounts of data to be processed in very little time.
The system is able to detect failures and can resubmit a chunk when needed. The Hadoop framework also contains a DFS, which takes care of replication and fail-over. When the system assigns jobs to nodes, the location of the data is taken into account: jobs are preferably run on the node that has the data on its local disk.

MapReduce processes the data in two phases. The map phase reads the source data and produces <key,value> pairs. The output of the map tasks is sorted by key and then processed by the reducers. The sorting assures that all <key,value> pairs with the same key are bundled and processed by one reducer. A full explanation, including source code, is available at http://hadoop.apache.org/core/docs/current/mapred_tutorial.html.
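The two phases can be illustrated with the canonical word-count job, simulated here in a single Python process (the real framework distributes the map and reduce tasks over a cluster; the function names are illustrative, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # map: read source data and emit <key,value> pairs
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # sort by key so that all pairs with the same key are bundled,
    # then process each bundle in one reducer call
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

counts = dict(reduce_phase(map_phase(["penguin gallery", "penguin zoo"])))
# counts == {"gallery": 1, "penguin": 2, "zoo": 1}
```

The graph-building job of section 5.1 follows the same pattern, with links as the emitted pairs and the source page as the grouping key.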
Figure 4: Architecture of the MapReduce framework. (source: Apache)

Lucene Java

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java [7]. It allows us to perform searches in a large indexed set of data. The information that is stored in a Lucene index is pre-processed to allow for very fast searching. An index contains documents, and documents contain fields. Fields have a name, a value and a number of indexing settings:

Indexed    Fields with this setting are indexed, resulting in an inverse lookup table.

Tokenized    The value of this field is split into tokens by a tokenizer, which can also prepare the text before indexing (e.g. stemming and stop-word removal).

Stored    This setting stores the original text in the index, which is useful if the original value has to be available at search time.

Documents also have a boost, which is a query-independent (static) value that is used when ranking the results of a query. During this project we have used Lucene indexes to store the image galleries and to build a lookup table for the ranks.

Nutch

Nutch is also an open-source software package developed by Apache. It combines Hadoop and Lucene and adds an extensible crawler. Together they can be used to crawl, index and search the internet. At almost every step of the process the authors have integrated extension points. Plug-ins determine which pages are crawled, how the information is processed and which search parameters are handled. Nutch takes care of all the bookkeeping that is involved in crawling the web. It stores downloaded pages and all the outlinks.
There is a main crawldb that stores all the links that the crawler has found. This is used to determine which pages are to be crawled next. Once a selection has been made from those URLs, they are formed into a segment. The segment is then downloaded and the results are fed back into the crawldb and into the search index.
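The Lucene concepts above (tokenized fields, the inverse lookup table, and a static document boost) can be illustrated with a toy index. This is a Python sketch, not Lucene's actual Java API, and the document identifiers are hypothetical:

```python
from collections import defaultdict

def tokenize(text):
    # a trivial tokenizer; Lucene's can also apply stemming and stop-word removal
    return text.lower().split()

class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> doc ids: the inverse lookup table
        self.boost = {}                   # doc id -> static, query-independent boost

    def add(self, doc_id, text, boost=1.0):
        for token in tokenize(text):
            self.postings[token].add(doc_id)
        self.boost[doc_id] = boost

    def search(self, query):
        # AND semantics: documents matching all terms, ordered by static boost
        terms = tokenize(query)
        hits = set.intersection(*(self.postings[t] for t in terms)) if terms else set()
        return sorted(hits, key=lambda d: self.boost[d], reverse=True)

idx = ToyIndex()
idx.add("artis", "penguin gallery artis zoo", boost=2.0)
idx.add("happyfeet", "penguin movie gallery", boost=1.0)
# "penguin gallery" matches both documents; the higher-boosted one comes first
```

This is exactly how the rank-lookup step influences the ordering: the query-independent boost reorders documents that match the same terms.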
LingPipe

This package is one of the few products we have used that is not developed by Apache; it is developed by Alias-i. LingPipe is a suite of Java libraries for the linguistic analysis of human language [1]. We have used it to construct a classifier that categorizes a web page into one of 14 selected ODP categories, looking only at the URL. Multiple types of classifiers are available, but they all work by training on examples. The suite also provides an evaluator that can be used for k-fold evaluation. The evaluation results of the ODP classifier are shown in Tables 5 and 6. Classifiers can be saved to disk, which is useful if multiple instances are needed.
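To illustrate how a trained classifier can assign a category from the URL alone, here is a naive Bayes sketch over URL tokens. This is a generic sketch, not LingPipe's API; the training URLs and category assignments are hypothetical examples:

```python
import math
import re
from collections import Counter, defaultdict

def url_tokens(url):
    # split a URL into lower-case word tokens
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

class UrlNaiveBayes:
    def __init__(self):
        self.term_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter()              # category -> training examples
        self.vocab = set()

    def train(self, url, category):
        self.doc_counts[category] += 1
        for t in url_tokens(url):
            self.term_counts[category][t] += 1
            self.vocab.add(t)

    def classify(self, url):
        total = sum(self.doc_counts.values())
        def log_prob(cat):
            counts = self.term_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[cat] / total)
            for t in url_tokens(url):
                score += math.log((counts[t] + 1) / denom)  # Laplace smoothing
            return score
        return max(self.doc_counts, key=log_prob)

clf = UrlNaiveBayes()
clf.train("dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm", "Recreation")
clf.train("artis.nl/paginas/dierentuin/vogels/pinguins.html", "Recreation")
clf.train("warnerbros.com/happyfeet", "Arts")
clf.train("adobe.com/go/getflashplayer", "Computers")
```

With only a handful of examples the decisions are driven by a few shared tokens, which mirrors why a URL-only classifier is easy to build but limited in accuracy.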
E V A L U A T I O N A N D R E S U L T S 6

Here we give and discuss the results of our experiments. We have calculated the ranks for each of the three surfer-model variations. The measurements that we can perform on the dataset are limited, because we have no reliable way to measure the performance. To be able to illustrate the desired effect, we constructed a sample web graph (Figure 5).

6.1 sample web graph

To explain the effects of the surfer models, we have constructed a sample web graph. It is designed to expose the problem of the generic PageRank algorithm, but such structures are very common on the web. The web graph consists of seven pages and contains three galleries, which are all about penguins. The Artis and Dierenpark Emmen galleries are about the animals in the respective zoos. They link to each other and are referred to by Wikipedia and the Arctic Council. The third gallery is about the computer-animated movie Happy Feet, which is also about penguins. The Happy Feet gallery has an inlink from the Adobe website, which is always very popular because of all the inlinks from pages that require the Flash plug-in.

Figure 5: The example web graph that is used to demonstrate the desired effect of the new surfer models. The graph consists of three galleries and four non-galleries.

The structure of the graph leads to a high rank for the Adobe plug-in page, which has been one of the highest scoring pages in our experiments. Table 1 shows the results of a calculation with 20 iterations on the sample web graph.
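The Levenshtein surfer model weights each outlink by the textual similarity between the source and target URLs. The exact formula is defined in the previous chapter; the sketch below assumes a simple normalization in which a link's probability decreases with the edit distance between the two URLs:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def link_probabilities(source, targets):
    # closer URLs get a larger share; this normalization is an assumed choice
    weights = {t: 1.0 / (1 + levenshtein(source, t)) for t in targets}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}
```

Under this weighting, a link from a zoo gallery to another page on the same site receives a larger traversal probability than a link to an unrelated domain such as the Adobe plug-in page.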
Table 1: Ranking of the sample web graph under the uniform, generic, Levenshtein and ODP surfer models. (The rows are en.wikipedia.org/wiki/penguin, dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm, artis.nl/paginas/t/dierentuin/vogels/pinguins.html, adobe.com/go/getflashplayer, adobe.com/cfusion/showcase, warnerbros.com/happyfeet and arctic-council.org.)

The generic PageRank values are as expected. The rank of the Wikipedia page is 0.15, because there are no inlinks that convey rank to this page. The getflashplayer page receives the highest rank, which is expected and is also the case on the real web. The Happy Feet page receives a large amount of rank from the Adobe showcase web page, and is by far the most popular gallery. The resulting gallery ranking is shown in Table 2.

1. warnerbros.com/happyfeet
2. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
3. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm (0.43)

Table 2: Static gallery ordering with random surfer model.

The PageRank based on the Levenshtein distance does not affect the ordering in this example, but it does change the rank that is assigned to the nodes (Table 3). The amount of rank that is conveyed to the Happy Feet page is increased: the other outlink of the showcase page is regarded as irrelevant, since links that are added for navigational purposes within a website do not convey value to their target. The outlinks from the Wikipedia page are rated as expected: the probabilities of the Artis and Dierenpark Emmen galleries are equal, and the link to the Arctic Council is rated as less important.

1. warnerbros.com/happyfeet
2. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
3. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm (0.35)

Table 3: Static gallery ordering with Levenshtein surfer model.

The ODP-based PageRank algorithm clearly has a positive effect on the ranking.
Links that take the user to a different topic are rated as less important, which can be seen in the corresponding probability matrix (Appendix B). As a result, the rank of the Happy Feet gallery is greatly
reduced. The galleries on the zoo websites are now the most highly ranked.

1. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
2. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm
3. warnerbros.com/happyfeet (0.17)

Table 4: Static gallery ordering with ODP surfer model.

6.2 real web graph

The example shows that the concept works, but the true test is to apply the algorithms to a real-world dataset. The queries have been selected to cover a number of different cases: queries that match a famous person and a few queries that are very general, such as car or movie. The related queries are special: they generally use many more keywords in a boolean OR-query, which should return galleries that concern the same topic. All other queries are boolean AND-queries. The results can be seen in Table 8 (Appendix D).

When we look at the overall number of bad results, we see that the uniform scoring method performs best: it receives the highest quality score and the fewest bad results. The generic PageRank delivered the worst results; it performed better in only one query. The Levenshtein and ODP models performed roughly in between.

However, if we group the queries into themes, the results show some differences. When searching for people, the Levenshtein scoring method performs slightly better, but in the soccer- and travel-related queries the uniform scoring method is still the best. For the remaining broad queries it is up to the user to determine what they appreciate and what they mark as a bad result; the Levenshtein distance scoring method performs slightly better in these cases. The ODP method is slightly better than the generic PageRank method, but worse than the Levenshtein and uniform scoring methods. This is probably the result of an inaccurate classifier, but it still shows that the random surfer model performs sub-optimally.
C O N C L U S I O N 7

We have seen the results of the experiment; in this chapter we draw some conclusions. We discuss the tested surfer models, the importance of a good dataset and the differences between image search and classical information retrieval. We conclude the chapter with some notes on the measurability of rankings on large datasets.

7.1 surfer-models

Previous tests with the OPIC algorithm have shown that generic PageRank has a detrimental effect on the quality of the search results when applied to this dataset. These results have been confirmed by the results of the random surfer model. We expected that the ODP surfer model would yield the best results, but it did not perform extremely well, although it was significantly better than the generic PageRank. This can probably be improved by using a better classifier. The Levenshtein-distance surfer was a long shot, and although it does not have a scientific background, it performed surprisingly well. This illustrates that even simple methods can improve the quality of the PageRank algorithm.

The research on Topic-sensitive PageRank [8] shows that the surfer model can be tuned to a specific topic; its authors had to calculate the PageRank 16 times, once for each category. Our research shows that this is not always required: if the links are weighted according to their relevance to the destination page, the algorithm performs better than the generic approach.

7.2 dataset

We have found that a well-managed dataset is required to construct the web graph. The Nutch crawler stores all the results, but it is up to the administrator to keep the data files. Even if all the files are available, it is still important to assure that the dataset forms a web graph of one component, as many small components are not useful in the PageRank algorithm.
As the dataset is quite large, it is important to reduce the number of nodes in the web graph, because otherwise the computations would become too complex. However, leaving links out can create many disconnected components.

7.3 information retrieval

We have also found that the adaptations to the algorithm have a positive effect on the quality of the search results, but the uniform sorting method still performs better. This is explained by the fact that PageRank is good for information retrieval, but finding the best gallery for a given keyword requires more than that.
Another aspect that is not well covered by the PageRank algorithm is time. For a number of queries that were used in the experiment, the age of a gallery is important: we would, for example, prefer images of FC Barcelona that were taken during their last game. But there are only a small number of links to such galleries, because they are very fresh.

7.4 measuring

It was difficult to get a good measurement of the quality of the ranking. There are plenty of methods to compare two orderings, like the τ distance measure [11]. It is very hard, however, to produce a reference ranking by which the quality can be measured, because the dataset is very large and diverse. It gets even more complicated when static and dynamic ranking are combined: it is hard to compare the results of searches with and without static boosting, and it is difficult to balance the dynamic and static influences. With a lot more fine-tuning of the parameters it might be possible to improve the quality of the search results.
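The τ distance mentioned above counts the pairs of items that two rankings order differently; a minimal sketch:

```python
from itertools import combinations

def kendall_tau_distance(ranking_a, ranking_b):
    """Number of item pairs ordered differently by two rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    disagreements = 0
    for x, y in combinations(ranking_a, 2):
        # a pair disagrees when one ranking puts x before y and the other after
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
            disagreements += 1
    return disagreements
```

Identical rankings yield 0; a fully reversed ranking of n items yields n(n-1)/2. The hard part, as noted, is not computing the distance but obtaining a trustworthy reference ranking to compare against.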
F U T U R E W O R K 8

In this chapter we discuss topics that require further research.

8.1 level of detail

The level of detail of the web graph could be improved by using a more intelligent system that is able to merge vertices that represent the same web page. One could, for example, remove any reference to sorting and page numbering. It is also possible to create custom filters for domains that do not perform well with the default approach, for example a filter that combines multiple Flickr URLs that lead to the same gallery.

8.2 tooling

The tools that have been used and written for this experiment are rather rudimentary. They are very scalable, but offer little compatibility with the Nutch framework. With the current tools it is not possible to expand the dataset without recomputing all the ranks; it would be much more efficient to compute the ranks based on the previous results. Using these systems in a production environment would mean that roughly 20% of the resources would go into building, maintaining and updating the graph and ranks.

8.3 link prioritization

The behavior of the smart surfer is determined by the link traversal probability distribution. This distribution can be improved; we could, for example, create a blacklist for pages that do not deserve rank. Another option is to combine multiple probability functions (ODP, Levenshtein, other topic comparisons). It might be necessary to obtain more data to make a better estimate, for example the position of a link on a web page: links at the bottom of a long page are less likely to be clicked by a real user.

8.4 odp classifier

The ODP classifier could also be improved. We could increase the number of categories, or find more suitable (less ambiguous) categories. Furthermore, the URL of a page might not be enough to determine its topic; the contents of the page contain many more clues, which would make the system more robust.
The structure of the application would in that case have to be changed, because the textual content of the destination of an outlink is not readily available.
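The option of combining multiple probability functions, raised in section 8.3, could be as simple as a weighted mixture of per-source link distributions. This is a sketch under the assumption that each input distribution already sums to one; the weights would need tuning:

```python
def combine_distributions(distributions, weights):
    """Weighted mixture of link-probability distributions for one source page."""
    total = sum(weights)
    targets = set()
    for d in distributions:
        targets.update(d)
    # missing targets contribute probability 0 from that distribution
    return {t: sum(w * d.get(t, 0.0) for d, w in zip(distributions, weights)) / total
            for t in targets}
```

Because the mixture of normalized distributions is itself normalized, the result can be plugged directly into the PageRank iteration as a link traversal distribution.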
O D P C L A S S I F I E R A

Category          Precision   Recall
Adult             81%         49%
Arts              59%         56%
Business          36%         83%
Computers         65%         37%
Games             67%         44%
Health            65%         35%
Home              64%         36%
Kids and Teens    32%         25%
News              44%         18%
Recreation        50%         28%
Science           63%         51%
Shopping          44%         11%
Society           59%         56%
Sports            75%         39%
Averages          50%         50%

Table 5: ODP classifier precision and recall.
Table 6: ODP classifier confusion matrix, with the true class on the rows and the predicted class on the columns, over the fourteen categories Adult, Arts, Business, Computers, Games, Health, Home, Kids and Teens, News, Recreation, Science, Shopping, Society and Sports.
S A M P L E W E B G R A P H B
Table 7: Probability matrix based on ODP topic matches, between the pages en.wikipedia.org/wiki/penguin, dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm, artis.nl/paginas/t/dierentuin/vogels/pinguins.html, adobe.com/go/getflashplayer, adobe.com/cfusion/showcase and warnerbros.com/happyfeet. Probabilities that are marked bold are part of the sample web graph.
L U C E N E S C O R I N G C

The score of query q for document d correlates to the cosine distance or dot product between the document and query vectors in a Vector Space Model (VSM) of Information Retrieval. A document whose vector is closer to the query vector in that model is scored higher. The score is computed as follows:

    score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )

where

1. tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. The default computation for tf(t in d) in DefaultSimilarity is:

    tf(t in d) = √frequency

2. idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq(t) (the number of documents in which the term t appears). This means rarer terms deliver a higher contribution to the total score. The default computation for idf(t) in DefaultSimilarity is:

    idf(t) = 1 + log( numDocs / (docFreq + 1) )

3. coord(q, d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed by the Similarity in effect at search time.

4. queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor); it merely attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity is:

    queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / √sumOfSquaredWeights
The sum of squared weights (of the query terms) is computed by the query Weight object. For example, a boolean query computes this value as:

    sumOfSquaredWeights = q.getBoost()² · Σ_{t in q} ( idf(t) · t.getBoost() )²

5. t.getBoost() is a search-time boost of term t in the query q, as specified in the query text (see query syntax) or as set by application calls to setBoost(). Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multiple terms are represented in a query as multiple TermQuery objects, and so the boost of a term in the query is accessible by calling getBoost() on the sub-query.

6. norm(t, d) encapsulates a few (indexing-time) boost and length factors:

    Document boost: set by calling doc.setBoost() before adding the document to the index.

    Field boost: set by calling field.setBoost() before adding the field to a document.

    lengthNorm(field): computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing time.

When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

    norm(t, d) = doc.getBoost() · lengthNorm(field) · Π_{field f in d named t} f.getBoost()

However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that at search time it is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
L A R G E D A T A S E T R E S U L T S D

Figure 6: This figure shows one of the pages that people have scored.
More informationEvaluating the Usefulness of Sentiment Information for Focused Crawlers
Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,
More informationPageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. III (Jan.-Feb. 2017), PP 01-07 www.iosrjournals.org PageRank Algorithm Albi Dode 1, Silvester
More informationUNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.
UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index
More informationINTRODUCTION. Chapter GENERAL
Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which
More informationIn the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to
More informationWeb Search Engines: Solutions to Final Exam, Part I December 13, 2004
Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to
More informationWeb Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search
Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search
More informationRanking in a Domain Specific Search Engine
Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal
More informationRecent Researches on Web Page Ranking
Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through
More informationSearching the Web What is this Page Known for? Luis De Alba
Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse
More informationAn Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages
An Enhanced Page Ranking Algorithm Based on eights and Third level Ranking of the ebpages Prahlad Kumar Sharma* 1, Sanjay Tiwari #2 M.Tech Scholar, Department of C.S.E, A.I.E.T Jaipur Raj.(India) Asst.
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationCLOUD COMPUTING PROJECT. By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma
CLOUD COMPUTING PROJECT By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma Instructor: Prof. Reddy Raja Mentor: Ms M.Padmini To Implement PageRank Algorithm using Map-Reduce for Wikipedia and
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationNYU CSCI-GA Fall 2016
1 / 45 Information Retrieval: Personalization Fernando Diaz Microsoft Research NYC November 7, 2016 2 / 45 Outline Introduction to Personalization Topic-Specific PageRank News Personalization Deciding
More informationThe PageRank Citation Ranking
October 17, 2012 Main Idea - Page Rank web page is important if it points to by other important web pages. *Note the recursive definition IR - course web page, Brian home page, Emily home page, Steven
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationTopology-Based Spam Avoidance in Large-Scale Web Crawls
Topology-Based Spam Avoidance in Large-Scale Web Crawls Clint Sparkman Joint work with Hsin-Tsang Lee and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M
More informationCrawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server
Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview
More informationSEO Factors Influencing National Search Results
SEO Factors Influencing National Search Results 1. Domain Age Domain Factors 2. Keyword Appears in Top Level Domain: Doesn t give the boost that it used to, but having your keyword in the domain still
More informationCS47300 Web Information Search and Management
CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page
More informationInternational Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining
Scientific Journal of Impact Factor (SJIF): 4.14 International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 A Review
More informationPagerank Scoring. Imagine a browser doing a random walk on web pages:
Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably
More informationEinführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme
Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationSEO ISSUES FOUND ON YOUR SITE (MARCH 29, 2016)
www.advantageserviceco.com SEO ISSUES FOUND ON YOUR SITE (MARCH 29, 2016) This report shows the SEO issues that, when solved, will improve your site rankings and increase traffic to your website. 16 errors
More informationCS6200 Information Retreival. The WebGraph. July 13, 2015
CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects
More informationResearch and implementation of search engine based on Lucene Wan Pu, Wang Lisha
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,
More informationUniversity of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015
University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing
More informationExperimental study of Web Page Ranking Algorithms
IOSR IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. II (Mar-pr. 2014), PP 100-106 Experimental study of Web Page Ranking lgorithms Rachna
More informationCS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece
More informationSearch Engine Optimization (SEO) using HTML Meta-Tags
2018 IJSRST Volume 4 Issue 9 Print ISSN : 2395-6011 Online ISSN : 2395-602X Themed Section: Science and Technology Search Engine Optimization (SEO) using HTML Meta-Tags Dr. Birajkumar V. Patel, Dr. Raina
More informationRanking Techniques in Search Engines
Ranking Techniques in Search Engines Rajat Chaudhari M.Tech Scholar Manav Rachna International University, Faridabad Charu Pujara Assistant professor, Dept. of Computer Science Manav Rachna International
More informationWeb search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)
' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search
More informationChapter 3: Google Penguin, Panda, & Hummingbird
Chapter 3: Google Penguin, Panda, & Hummingbird Search engine algorithms are based on a simple premise: searchers want an answer to their queries. For any search, there are hundreds or thousands of sites
More informationSEARCHMETRICS WHITEPAPER RANKING FACTORS Targeted Analysis for more Success on Google and in your Online Market
2018 SEARCHMETRICS WHITEPAPER RANKING FACTORS 2018 Targeted for more Success on Google and in your Online Market Table of Contents Introduction: Why ranking factors for niches?... 3 Methodology: Which
More informationLecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science
Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationPersonalizing PageRank Based on Domain Profiles
Personalizing PageRank Based on Domain Profiles Mehmet S. Aktas, Mehmet A. Nacar, and Filippo Menczer Computer Science Department Indiana University Bloomington, IN 47405 USA {maktas,mnacar,fil}@indiana.edu
More informationProximity Prestige using Incremental Iteration in Page Rank Algorithm
Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration
More informationWhy it Really Matters to RESNET Members
Welcome to SEO 101 Why it Really Matters to RESNET Members Presented by Fourth Dimension at the 2013 RESNET Conference 1. 2. 3. Why you need SEO How search engines work How people use search engines
More informationThe Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation
The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? In our experience, we find we can get over-excited when talking to clients or family or friends and sometimes we forget that not everyone
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second
More informationA Modified Algorithm to Handle Dangling Pages using Hypothetical Node
A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal
More informationAuthoritative K-Means for Clustering of Web Search Results
Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian
More informationModule 1: Internet Basics for Web Development (II)
INTERNET & WEB APPLICATION DEVELOPMENT SWE 444 Fall Semester 2008-2009 (081) Module 1: Internet Basics for Web Development (II) Dr. El-Sayed El-Alfy Computer Science Department King Fahd University of
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationFILTERING OF URLS USING WEBCRAWLER
FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,
More informationSearch Engines. Charles Severance
Search Engines Charles Severance Google Architecture Web Crawling Index Building Searching http://infolab.stanford.edu/~backrub/google.html Google Search Google I/O '08 Keynote by Marissa Mayer Usablity
More informationHow to organize the Web?
How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper
More informationLink Analysis in Web Mining
Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained
More informationInformation Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group
Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationAutomatically Constructing a Directory of Molecular Biology Databases
Automatically Constructing a Directory of Molecular Biology Databases Luciano Barbosa Sumit Tandon Juliana Freire School of Computing University of Utah {lbarbosa, sumitt, juliana}@cs.utah.edu Online Databases
More informationDisambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity
Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University
More information