FOCUSED SURFER MODELS: Ranking visual search results
FOCUSED SURFER MODELS: Ranking visual search results
Jasper Hafkenscheid
Faculty of Mathematics and Natural Sciences, University of Groningen
August 2010
Jasper Hafkenscheid: Focused surfer models, August 2010
Supervisors: Marco Aiello, Nicolai Petkov, Mathijs Homminga
Location: Groningen
Time frame: August 2010
ABSTRACT

PageRank is a graph-based ranking algorithm that ranks the nodes in a graph based on their connections. It has been designed to determine the intrinsic value of a page, based on the link structure of the web. We research the effects of adaptations of the PageRank algorithm in the context of a gallery search engine. The search engine uses text-based search to select galleries: web pages that contain a number of small images which link to enlarged versions of them. We evaluate the performance by looking at the ordering of the search results.

The generic PageRank algorithm does not work well for this kind of application; the best galleries are not found on the pages with the highest PageRank. PageRank assumes that all links convey trust to other pages, but many links disrupt that concept (e.g. links to download or update your browser or required plug-ins). At the heart of the algorithm is the random surfer model, which is based on a virtual user that navigates the web by following a random link on a page, repeating this infinitely. The PageRank algorithm can be changed by altering the probability with which the random surfer chooses which link to follow. Our design choice is to replace the random surfer model with a focused surfer model, which increases the probability of following a link based on the similarity between the linking page and the target.

We analyze the performance with the gallery search engine and compare the results with generic PageRank and uniform ranking. The search engine uses a dataset that consists of 15 million galleries; to acquire this dataset we have crawled 1.5 billion pages. We find that alterations to the system affect the outcome, and that the overall performance is increased. The PageRank algorithm is intended for information retrieval; it works very well for general-purpose ranking, but the random surfer model does not work well enough for all applications.
CONTENTS

1 Introduction
2 Background: Information Retrieval; Image Retrieval; Ranking; Web graph; PageRank (Concept, Customization)
3 Related work: Web-based Image Retrieval; Intentional surfer model; Topic-sensitive PageRank; Hyperlink-Induced Topic Search; On-Line Page Importance Computation
4 Concept: Importance of links (Random surfer, ODP-biased surfer, Levenshtein-distance surfer); Granularity; Self-links; Spam
5 Realization: System architecture; Procedure (Build graph, Iterate PageRank, Build a rank-lookup index, Rerank gallery index); Measuring performance; Data set; Tools (Apache Hadoop, Lucene, Java, Nutch, LingPipe)
6 Evaluation and results: Sample web graph; Real web graph
7 Conclusion: Surfer models; Dataset; Information retrieval; Measuring
8 Future work: Level of detail; Tooling; Link prioritization; ODP classifier
A ODP classifier
B Sample web graph
C Lucene scoring
D Large dataset results
Bibliography
LIST OF FIGURES

Figure 1 Example link-structure
Figure 2 A linked set of hubs and authorities
Figure 3 Architecture overview
Figure 4 MapReduce
Figure 5 Example web graph
Figure 6 Sample scoring page

LIST OF TABLES

Table 1 Ranking of sample web graph
Table 2 Static gallery ordering using generic PageRank
Table 3 Static gallery ordering using Levenshtein-based PageRank
Table 4 Static gallery ordering using ODP-based PageRank
Table 5 ODP classifier precision and recall
Table 6 ODP classifier confusion matrix
Table 7 Probability matrix based on ODP topic matches
Table 8 User feedback on ranking

ACRONYMS

ODP  Open Directory Project
DFS  Distributed File System
OPIC On-Line Page Importance Computation
HITS Hyperlink-Induced Topic Search
TLD  Top Level Domain
TREC Text REtrieval Conference
1 INTRODUCTION

The Internet has grown enormously since the invention of HTML. HTML is the language for web pages, which can contain links to other HTML pages; these links are called hyperlinks. Users can follow the links to navigate the web, but as the Internet grew, it became harder to find information accurately. This was the motivation to develop search engines that index the web and allow searching based on keywords. Ordering the search results became more and more important as the web grew.

PageRank is an algorithm that uses the hyperlinks on pages to determine the value of a web page. It is the basis of the Google search engine. When the Internet is seen as a directed graph with vertices (pages) and edges (hyperlinks), it is possible to determine the value of a page with respect to the number of references to it. PageRank works by distributing rank via the outlinks of every page. The rank of a page is determined by the amount of rank it receives from the pages that refer to it. This recursive algorithm calculates the rank for every node in the graph. Nodes with a higher rank represent more valuable web pages, which are more likely to contain valuable information. These pages should be given a higher position in the search results.

We apply PageRank to an image gallery dataset, which consists of image galleries found on the web. We find that the performance is sub-optimal, because the quality of a photo gallery cannot be properly determined by the popularity of the web page. However, PageRank can still be used to determine which galleries are more valuable than others. To improve the PageRank computation, we replace the surfer model with a model that is smarter and more accurately resembles real users. The concept is that links between pages on the same topic are more important than links to pages that concern a different topic. Pages that have a lot in common are rated as more likely candidates to follow than pages that cover different topics.
We have developed an ODP-biased surfer, which uses a classifier trained with ODP data to determine the similarity of two pages. ODP is a project that collects and categorizes links to web sites; the data is freely available. The other surfer model that we test is based on the Levenshtein distance, a measurement of the similarity of the originating and target URLs. The concept behind this model is that pages that discuss the same topic are likely to mention it in their URL. For comparison, we also calculate PageRank with the original random surfer model. The fourth and last sorting method gives each page a rank of 1, which we call uniform ranking.

To gather results we process a dataset that consists of approximately 15 million galleries and 1.5 billion other pages. The calculated ranks are then combined with the text-match score to produce search results for 22 queries. Because there are no good validation methods for this dataset, we ask people what they think of the resulting ordering
of the galleries. To demonstrate the desired effect, we have also applied the algorithm to a sample web graph. We see that the new surfer models outperform the random surfer model. The ODP surfer model does not work better than the Levenshtein model, which can be explained by the poor performance of the classifier. The research still shows that the random surfer model is sub-optimal, and also that the uniform ranking is better in some areas, which was unexpected.

The remainder of this thesis is structured into seven chapters. Chapter 2 explains some of the basics of information retrieval and PageRank. Chapter 3 explains the position of this thesis with respect to the state of the art. In Chapter 4 we explain the concept and introduce the three surfer models. Then we describe how we computed the rank values and explain some of the concepts (Chapter 5). In Chapters 6 and 7 we discuss and explain the results. The final chapter contains some future work in the field of alternative surfer models for PageRank.
2 BACKGROUND

There has been a lot of research on information retrieval and various ranking methods. Some of the most effective groundwork regarding query-independent ranking is PageRank. In this chapter we introduce these topics.

2.1 information retrieval

Information retrieval concerns the storage of and access to information. The representation and organization of the information items should provide the user with easy access to the information in which he is interested [4]. The typical scenario is that a user has an information need; it is the task of the information retrieval system to provide the information. The information need is often translated into a query. The query normally consists of some keywords, but may include other parameters, like a time period. The query is then fed into a search engine, which returns the requested data to the user. To provide the user with easy access, the results are ordered based on their relevance with respect to the information need. There is also a part of the information need that is not specified in the query; this could be a desired language or a preference for recent information. These desires also have to be accounted for in the ordering of the search results. One of the differences between information retrieval and data retrieval is that an information retrieval system deals with natural language, whereas a data retrieval system often uses a database and deals with a well-defined structure and semantics.

2.2 image retrieval

Image retrieval is a type of information retrieval which involves images. The search can either be text-based or based on image data. Most of the time the text data consists of meta-data that is added to the image (e.g. a caption or keywords). Content-based image retrieval uses data that is extracted from the image with computer-vision techniques.
This allows for searching based on a different image or by defining image features like color or composition [19]. In this thesis we do not use computer-vision techniques; we perform image retrieval using the text that is found with the image.

2.3 ranking

Ranking is determining the order of elements; in our case, the elements are search result entries. The ordering of the results is based on the information need, which can be broken down into two components: a query and a generic component. The query is represented as a number of keywords, while the generic component describes features that are more difficult to grasp, such as popularity or authoritativeness. We want to see the results that have the best match with
the information need first, so they should be ranked higher than other results. To rank the results, each element is given a score: a number that represents the quality of that element with respect to the information need. It is difficult to determine the score of a search result, because only some of the parameters of the information need can be calculated; other features are more difficult to measure. The score is the sum of two components: a query-dependent and a query-independent score. The query-independent score is not influenced by the query, but is based on the implicit component of the information need; it reflects the intrinsic quality of a page. Because these scores are calculated independently of the query, they can be calculated ahead of time. This is called off-line computation, and it allows for more extensive analysis, because on-line methods have to be very fast. The query-dependent score is usually based on heuristics that consider the number and locations of matches of the various query words on the page itself, in the URL or in any anchor text referring to the page: the number of matches, the order of the matches, the number of other words and the uniqueness of the terms. All these factors have to be combined into the final ranking. When the elements are used to perform a search based on a query, the query-independent score is used together with a query-dependent score to rank all results matching the query [9].

2.4 web graph

To determine a score for some of the implicit features of the information need, one can use the link structure of the web. We expect that valuable pages have more links pointing towards them. A graph is a network of nodes (vertices) and links between nodes (edges). In a directed graph the edges have a direction: they have an origin and a destination vertex. All the edges that originate from a vertex are the out-edges or outlinks of that vertex.
The edges that point to a vertex are the in-edges or inlinks. The pages and links of the web can be modeled as a directed graph, the vertices being pages and the edges being links. This graph is one of the data structures that is used to determine the score of a page. There are several ranking methods that are based on the web graph. The most successful one is PageRank, which we will be using in this thesis. A number of other methods are discussed in Chapter 3.

2.5 pagerank

PageRank is a graph algorithm that can rank vertices in a graph [18, 5]. It was first used as a citation ranking mechanism, which ranked publications based on the number and quality of their citations. Larry Page and Sergey Brin developed it and later used it to build the Google search engine. If the graph is based on the link structure of the web (Figure 1), it can be used to rank web pages independently of a query.
Concept

The idea behind the PageRank algorithm is that a hyperlink to a web page is a vote of support. Pages that receive a lot of votes are given a better rank than other pages. PageRank does not only count the number of votes, but also takes the rank of the voting page into account: a page that receives a lot of votes from highly ranked pages receives a high rank itself. The PageRank of a page is defined recursively (Formula 2.1).

The concept can also be explained by the random surfer model [14]. This model is based on a random surfer, who follows a random outlink on the page he is visiting. Every time he visits a page, its score is increased, and this is repeated infinitely. If the surfer visits a page more often, its rank increases. The surfer sometimes gets bored and jumps to a random page (which models bookmarks and dead-ends); this probability is modeled by the dampening factor d. Pages that do not contain outlinks transfer their rank equally over all pages of the web. Pages that have not yet been downloaded also fall into this category. The algorithm can thus provide some sort of scoring for pages that have not yet been downloaded; these scores can then be used to selectively download the web, instead of using a breadth-first search.

PR(p_i) = (1 - d) + d * Σ_{p_j ∈ M(p_i)} P(p_j, p_i) * PR(p_j) / L(p_j)    (2.1)

M(p_i) is the set of pages linking to p_i
L(p_j) is the number of outlinks of p_j
d is the dampening factor, usually 0.85
P(p_j, p_i) is the probability that the link between the pages is traversed; in the original formula this is always 1

Customization

There are several ways to alter the behavior of PageRank. The granularity of the web graph can allow pages to merge, which makes the graph more dense; this also results in multiple pages getting the same score. One can also handle the dangling-pages problem differently. Page and Brin suggested adding the random jump.
This is a basic solution for the problem, but it does not mimic the behavior of real users. The probability that a user makes a random jump is 1 - d, which usually is 15%. The random jump also remedies the problem of rank-sinks by always allowing for a way out. The best way to influence the outcome of the algorithm is to change the personalization vector P. The vector represents the probability that the user follows a specific outlink. Page and Brin included it to allow personalization of the search results. Personalization might not be the right word, because it would be far too costly to compute the PageRank values for every user.

Rank-sink: a section of a graph that accumulates rank due to the lack of outlinks.
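Formula 2.1 can be iterated until the ranks stabilize. The following is a minimal sketch in Python, not the thesis implementation: the example graph, its link probabilities, and the fixed iteration count are illustrative assumptions.

```python
# Sketch of the PageRank iteration of Formula 2.1 with per-link traversal
# probabilities P; probabilities are normalized per source page.
def pagerank(outlinks, d=0.85, iterations=25):
    # outlinks: {page: {target: raw traversal score}}
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}  # the (1 - d) base rank
        for p, targets in outlinks.items():
            total = sum(targets.values())
            for target, score in targets.items():
                # each page spreads its rank over its outlinks, weighted
                # by the normalized traversal probability
                new_rank[target] += d * (score / total) * rank[p]
        rank = new_rank
    return rank

# Tiny made-up graph: A links to B and C, B to C, C back to A.
graph = {"A": {"B": 1.0, "C": 1.0}, "B": {"C": 1.0}, "C": {"A": 1.0}}
ranks = pagerank(graph)
```

With uniform link scores this reduces to generic PageRank; plugging in the surfer models of Chapter 4 only changes the per-link scores.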
[Figure 1: Example link structure with PageRank values after the 1st, 2nd, 3rd and 25th iterations. Pages are vertices and directed edges represent links.]
3 RELATED WORK

This chapter notes some other research in the field of search result ranking, and describes its relation to this thesis. We discuss Hyperlink-Induced Topic Search (HITS) and On-Line Page Importance Computation (OPIC) and elaborate on the various surfer models.

3.1 web-based image retrieval

Internet search engines have been available for quite some time, but most of them are designed for generic information retrieval. There is still a lot of research to be done in the field of image retrieval [12]. In this thesis, we focus on the ranking aspect of image retrieval. The generic selection and ranking of the search results are handled by traditional text search engine methods.

3.2 intentional surfer model

An improvement to the random surfer model can be made by using real-world data, which has to be gathered by analyzing the surfing behavior of real users. Google released their toolbar in December 2000 [2], which allows people to use the search service and translate pages easily. Another feature of the toolbar is that it sends information regarding surfing behavior to Google. If enough people submit their data, then Google can accurately replace the random surfer model with an intentional surfer model [10]. The browser Google Chrome, Google Analytics and the Google advertising services may also provide valuable information that enables Google to improve their ranking. For our research we do not have access to this data, and therefore use other methods to develop a new surfer model. The advantage is that our models can operate on pages that no user has ever visited.

3.3 topic-sensitive pagerank

Topic-sensitive PageRank is an adaptation of the PageRank algorithm that has been researched by Haveliwala [8]. It uses a classifier which is trained using data from the ODP [17]. ODP is a project that gathers and categorizes links to web pages; each link has a description and a title.
The dataset is maintained by volunteers and is freely available. The classifier can determine into which of the 16 main categories a page belongs. Haveliwala applied it to all pages in the dataset and performed the PageRank calculation for each of the top-level categories. At search time the topic of the query is determined, and the corresponding PageRank data is used. This resulted in a significant improvement in the ordering of the results.
The key difference between our approach and the topic-sensitive PageRank method is that we only calculate one score for each link. The researchers rely on the classifier to analyze the query, and use the corresponding PageRank data. Experience shows that it can be difficult to accurately analyze very short texts: there are numerous ambiguous words in any language, and the demographic of this search service is not bound to one language.

3.4 hyperlink-induced topic search

HITS is a graph-based algorithm that computes two scores for all the results of a query [13]. The hub score estimates the value of a page's links to other pages; the authority score estimates the value of the content of the page. The results can be divided into two categories: hubs, which lead to authorities, and authorities. HITS can also be used to identify results that are not relevant for the query (by looking at the links in the graph). One of the differences with PageRank is that it is not computed off-line, but is a query-time process: it operates on the results of a search. It is not commonly used by search engines, but a similar algorithm has been in use by ask.com.

[Figure 2: A linked set of hubs and authorities.]

3.5 on-line page importance computation

OPIC is, as the name suggests, an on-line method to rank pages [3]. The advantage is that it is able to do the calculations without storing a separate link structure. The score of a page is updated when the page is downloaded, so it is not a search-time algorithm. In the long run the scoring should be the same as with conventional PageRank. The Nutch platform (discussed in Chapter 5) includes an implementation of OPIC; it is used to determine which URLs should be downloaded first. At search time the scoring can be used to order the search results. The disadvantage is that pages need to be crawled multiple times to reach a meaningful score.
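The mutual reinforcement between hubs and authorities in HITS can be sketched in a few lines. This is an illustrative sketch, not the algorithm as deployed by any engine; the example edge list and iteration count are made up.

```python
# Minimal HITS sketch: hub and authority scores computed by mutual
# reinforcement over a small made-up link graph among query results.
def hits(links, iterations=50):
    # links: list of (source, target) edges
    pages = {p for edge in links for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a page's authority is the sum of the hub scores linking to it
        auth = {p: sum(hub[s] for s, t in links if t == p) for p in pages}
        # a page's hub score is the sum of the authorities it links to
        hub = {p: sum(auth[t] for s, t in links if s == p) for p in pages}
        # normalize so the values do not grow without bound
        na = sum(auth.values()) or 1.0
        nh = sum(hub.values()) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
```

Unlike PageRank, this runs at query time over the result set only, which is why its cost shows up per query rather than during indexing.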
4 CONCEPT

The goal of this research is to measure how different adaptations of the PageRank algorithm influence the quality of the ranking. The alterations that we explore concern using alternative surfer models: the probability with which a certain outlink is chosen is made to depend on the similarity between the two pages.

4.1 importance of links

The probability with which an outlink is traversed depends on its score with respect to the other outlinks on that web page. We adapt the PageRank algorithm by determining the relevance of each of the outlinks on a page; the random surfer model is thus changed into a smart surfer model. For every surfer model a probability function is given. The formula returns a score with which the link to p_j on page p_i will be followed. The scores are then divided by the sum of the scores of all outlinks, resulting in normalized probabilities.

Random surfer

In the random surfer model the scores are all equal to one; each link is regarded as equally important. This method is used as a reference and is later referred to as generic PageRank.

P(p_i, p_j) = 1

ODP-biased surfer

We have constructed a basic URL classifier by training it with URLs from the ODP [17]. The ODP consists of approximately 5 million URLs, all of which have been categorized into a rich data structure: they are grouped by topic on multiple levels and are also divided over a number of languages. There are 16 main categories, 14 of which we used for our research. The URLs that are found in the Regional and World categories have been re-mapped to their respective top-level categories; this ensures that the dataset does not contain English pages only. Both the URL of the originating page and the URL of the destination page are processed by the classifier. The relevance of the link is based on how much the topics overlap: pages that have the same topic have a strong connection, others a very weak one.
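The normalization of raw outlink scores into traversal probabilities described above can be sketched as follows; the score values and the uniform fallback for all-zero scores are illustrative assumptions.

```python
# Turn raw per-outlink scores into normalized traversal probabilities,
# as described in Section 4.1. The score values are made-up examples.
def normalize(scores):
    total = sum(scores.values())
    if total == 0:
        # no information: fall back to the random surfer (uniform)
        return {link: 1.0 / len(scores) for link in scores}
    return {link: s / total for link, s in scores.items()}

outlink_scores = {"page-b": 0.8, "page-c": 0.1, "page-d": 0.1}
probs = normalize(outlink_scores)
```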
Table 5 shows the results of the ODP classifier in a 10-fold evaluation; we have used every tenth ODP record to test the classifier. The table shows that the accuracy is very high (93%), but the precision (50%) and recall (50%) are much lower. Accuracy is not an appropriate measurement in this case, because of the non-uniform distribution of the classes: Business, Arts and Society together account for 50% of the dataset and have a low accuracy.

Accuracy: (TP + TN) / total
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
TP = true positives, TN = true negatives, FP = false positives, FN = false negatives
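The definitions above translate directly into code. The counts below are made-up examples chosen to show how a skewed class distribution yields high accuracy alongside low precision and recall, as observed for the ODP classifier.

```python
# Computing accuracy, precision and recall from binary counts,
# matching the definitions above. The counts are illustrative.
def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# The dominant negative class inflates accuracy while precision
# and recall stay at 50%.
acc, prec, rec = metrics(tp=50, tn=880, fp=50, fn=50)
```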
The classifier performs poorly, as the precision and recall are low. This can be explained by the ambiguity of the categories: a website about a computer game might be in the Computers, Games or Kids and Teens category. Another problem is the limited set of features used to train the classifier. We have chosen to use only the URL, which is a very limited feature. Retrieving more information about a certain outlink is complex, because the page may or may not have been crawled; even if the page is crawled, it is still difficult to gather the information on that page, because we cannot store all the web pages that we have crawled. The goal of this thesis is not to build an accurate classifier but to study the effects of intelligent-surfer models on PageRank; therefore we have not invested too much time and resources in optimizing this classifier. The sample web graph that is shown in Figure 5 shows that the classifier works. We are confident that the performance of this classifier can be improved by using more features or a less ambiguous class hierarchy.

P(p_i, p_j) = Σ_{c ∈ Cats} O(p_i, c) * O(p_j, c)

O(p_i, c) = probability that page p_i belongs to category c
Cats = the collection of 14 categories

Levenshtein-distance surfer

The other surfer model is based on the Levenshtein distance [15]: the number of characters that have to be added, changed or removed to transform one text string into another. This is a very simple method to determine how much two pages have in common. There is a chance that pages that discuss a certain topic mention that fact in their URL; if this is also the case for some of the outlinks, then this can be used to calculate the probability that someone will follow the link. This is a very simple measurement that does not require any information about the destination page other than the URL.
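The Levenshtein distance itself is the classic dynamic-programming edit distance; a minimal sketch (not the thesis implementation):

```python
# Classic dynamic-programming Levenshtein distance: the minimum number
# of insertions, deletions and substitutions to turn a into b.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j - 1] + (ca != cb),  # substitute (cost 1) or copy (cost 0)
                prev[j] + 1,               # delete from a
                cur[j - 1] + 1,            # insert into a
            ))
        prev = cur
    return prev[-1]
```

For example, turning "kitten" into "sitting" takes three edits.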
We expect that URLs on the same domain, or on a very similar host, are not much more relevant than links to other pages. We do not want to spread the rank to other pages on the same domain, but rather use it to assign a meaningful rank to other pages. We have therefore chosen to fix the importance of links to the same domain at 0.1, and links to a very similar host at 0.2. The similar-host rule reduces the effect of links to domains that are similar but have a different Top Level Domain (TLD) (e.g. google.nl, google.com).

             0.1          if Domain(p_i) = Domain(p_j)
P(p_i, p_j) = 0.2          if PL(Host(p_i), Host(p_j)) < 0.5
             PL(p_i, p_j) otherwise

Domain(p_i) returns the highest level of the domain, e.g. Domain(hadoop.apache.org) = apache.org
Host(p_i) returns the full hostname, e.g. Host(hadoop.apache.org) = hadoop.apache.org
PL(i, j) = LD(i, j) / ((length(i) + length(j)) / 2)

LD(i, j) = min( LD(i-1, j-1) + [0 if equal, 1 otherwise]   // subst/copy
                LD(i-1, j) + 1                             // insert
                LD(i, j-1) + 1 )                           // delete

4.2 granularity

We have to choose a level of granularity for the web graph. We could decide to use domains for the nodes in the graph; this would mean that links to any page in a domain are merged into one single node, and all galleries hosted on that domain get the same rank. This is not a perfect solution, because of the large number of galleries that are hosted on photo-sharing sites like Flickr or Deviantart: these sites have a huge number of galleries, but only a small percentage is of very good quality. The opposite strategy is to use the entire URL, including the query string (e.g. the page number or sorting method). This might cause problems because links to the same gallery are not combined. We have therefore chosen to use a simplified version of the URL: the query part is removed, and the remainder is used to identify galleries. This could still cause problems if multiple galleries share the same URL but use a different query parameter (e.g. albumid=2), but most sites do not use such a system, or use it for other purposes like sorting or page selection. Website developers often aim to make pretty URLs, which improve their performance in search engines and allow users to guess what a page is about from the URL alone. These pretty URLs will not be merged, but other pages will be grouped into a single node in the graph.

4.3 self-links

Another point of attention is which links are to be included in the graph. The concept is that links conduct importance to the destination page, but links that point to the same page are not relevant. This occurs a lot on forums that have links to the top of the page after each message.
A well-known method to retain rank within a particular website is to create many links on every page that point to all the pages of that website. This is also common in forums: each posting has a link to the profile of the poster, and popular threads have a number of pages that all link to each other.

4.4 spam

The last decision concerns the problem of spam. The web is contaminated by pages that are constructed to influence the ranking of search engines. The policy that we apply is that pages with a very large number of outlinks are not included in the graph. This should not greatly alter the overall computation, because the rank of such a page would be divided by the number of outlinks; each outlink would only transfer a tiny amount of rank.
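The link-scoring rules of the Levenshtein-distance surfer (Section 4.1) can be sketched as follows. The `domain` helper (taking the last two labels of a hostname) and the example hostnames are illustrative assumptions, not the thesis code; real TLD extraction needs a suffix list.

```python
# Sketch of the Levenshtein-based link score: fixed scores for same-domain
# links and very similar hosts, normalized edit distance otherwise.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),
                           prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def pl(a, b):
    # edit distance normalized by the average string length
    return levenshtein(a, b) / ((len(a) + len(b)) / 2)

def domain(host):
    # naive highest-level domain, e.g. hadoop.apache.org -> apache.org
    return ".".join(host.split(".")[-2:])

def link_score(src_host, dst_host, src_url, dst_url):
    if domain(src_host) == domain(dst_host):
        return 0.1
    if pl(src_host, dst_host) < 0.5:
        return 0.2
    return pl(src_url, dst_url)

# Same registered domain: fixed low score of 0.1.
score = link_score("hadoop.apache.org", "lucene.apache.org",
                   "hadoop.apache.org/docs", "lucene.apache.org/core")
```

A pair like google.nl and google.com falls under the similar-host rule and gets the fixed 0.2 score.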
5 REALIZATION

In order to perform the necessary computations, we have built a system that is capable of calculating the PageRank of a large number of pages in a few hours. We explain the architecture of the system and describe the tools that we used to build it.

5.1 system architecture

Figure 3 shows the architecture of the system. Some of the tools are provided by Nutch, and some have been written for the purpose of this thesis.

[Figure 3: Architecture of the system, showing processes and data flow: (1) crawl the Internet into the Nutch databases, (2) build the (unranked) gallery index, (3) build the graph, (4) iterate PageRank, (5) build the rank-lookup index, (6) rerank the gallery index into the ranked Lucene index.]

Steps 1 and 2 are provided by Nutch, which provides for the crawling and stores the data in its database. In step 3 we build the graph and calculate the probabilities that define the surfer model. Step 4 performs the PageRank computations, which are stored in a rank-lookup index in step 5. Step 6 updates the Nutch index with boost values from the rank-lookup index.

Procedure

In this section we discuss the procedure that is required to statically order the galleries in a web graph using PageRank. We assume that a
dataset is available, which has been produced by running the Nutch crawler for a considerable amount of time.

Build graph

In this step we parse the Nutch dataset and extract the web graph. Each link is represented by its origin, its target and its traversal probability; the probability is based on the formulas mentioned in the previous chapter. The outlinks are then grouped by source page. We have written a MapReduce processing job to perform this task.

Iterate PageRank

Now we iterate the PageRank algorithm until it reaches a stable state. In each iteration the rank of each page is calculated using the ranks of the inlinks to that page. The amount of rank that is transferred is relative to the probability of the outlinks. This is also wrapped in a MapReduce job. Depending on the density of the graph, this can take a lot of iterations.

Build a rank-lookup index

We construct a rank index containing the ranks of all galleries. This step simplifies the process of updating the rank in the gallery index. This is also a MapReduce job; at the end, all the sub-indexes that are produced by the reducers are merged into one index.

Rerank gallery index

The gallery index has been built from the same Nutch dataset with a Nutch tool. We traverse the gallery index and set the boost value for every document according to the rank that is stored in the rank index. This is the only part of the process that is not performed by a MapReduce job, because the input for this step is an existing index; indexes are great for searching, but building and reading must be done in one thread. The Lucene library uses a complex formula to score search results based on the search terms, boost values for different fields, and the number of occurrences of the terms (Appendix C). We calculate the boost value of a document by taking the logarithm of its PageRank. This ensures that the boost values do not overpower the dynamic ranking.
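The boost computation can be sketched as follows. Only the idea of damping the rank with a logarithm comes from the text; the base and the +1 offset are illustrative assumptions, not the thesis's exact formula.

```python
import math

# Sketch: derive a document boost from its PageRank so that pages with
# very high rank do not overpower the query-dependent Lucene score.
# The +1 offset (keeping the boost non-negative) is an assumption.
def boost(pagerank):
    return math.log(pagerank + 1)

# A page with 100x the rank gets only a modestly larger boost.
low, high = boost(1.0), boost(100.0)
```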
It is important that documents are primarily ranked by their dynamic ranking; the goal is only to slightly influence the ordering by setting the boost values.

Measuring performance

At this point we load the new gallery index into the search engine and use the application to inspect the new ranking. We have asked a number of people to rate the computed ranking. For each of a number of queries they are asked to count the ugly or bad galleries, and to score the ranking of the first five results on a five-step scale from very bad to excellent.
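The iteration and boost steps above can be sketched as follows. This is a minimal single-machine sketch in Python (the actual implementation consists of Hadoop MapReduce jobs); the 0.15/0.85 damping split is assumed here to match the sample-graph behaviour described in chapter 6, and the function names are illustrative:

```python
import math

def iterate_pagerank(links, iterations=20):
    """links maps source -> {target: traversal probability}, summing to 1 per source."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        incoming = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target, prob in targets.items():
                incoming[target] += rank[source] * prob
        # base rank 0.15 plus damped incoming rank; a page without
        # inlinks therefore keeps rank 0.15, as in the sample graph
        rank = {p: 0.15 + 0.85 * incoming[p] for p in pages}
    return rank

def boost(rank_value):
    # the logarithm keeps the static boost from overpowering the dynamic ranking
    return math.log(rank_value)
```

With the link probabilities precomputed, one call per iteration over all sources corresponds to one MapReduce pass over the graph.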
The search terms that are used in this evaluation are selected by looking at the words that occur most frequently in search queries. We have also added some well-known people, some football clubs and some vague subjects. This should be a good sample set to determine the quality of the ranking.

5.2 data set

To be able to compare the results with other ranking methods, a generic dataset is required. The Text REtrieval Conference (TREC) provides such datasets, but we have not been able to find one that could be used for this thesis, because all of them focus on text retrieval and not on the images that are found on the pages [16].

We use a previously-built dataset that was constructed by an enhanced Nutch crawler (section 5.3.3). The crawler can detect galleries, which are pages with a group of thumbnails that link to higher-resolution versions of those thumbnails. The data is stored on a Distributed file system (DFS), including all the outlinks of each page. These files are processed by a MapReduce task that creates vertices for each link.

The set that we use for testing consists of pages that were crawled in a single year and contains about 15 million galleries. On average one percent of the crawled pages is a gallery. The starting points of the crawl consisted of a number of well-known web pages (targeted at galleries), so we can assume that the graph does not have too many components. We also use a sample web graph (Appendix B) that clearly illustrates the intended behavior.

5.3 tools

Whilst working on the application, we have been introduced to some excellent open-source tools. These tools have been vital for the success of this application.

Apache Hadoop

Apache Hadoop is an open-source MapReduce engine. MapReduce [6] is a programming model that allows data processing jobs to be split into small chunks. These chunks are processed on a server cluster, which enables huge amounts of data to be processed in very little time.
The system is able to detect failures and can resubmit a chunk when needed. The Hadoop framework also contains a DFS, which takes care of replication and fail-over. When the system assigns jobs to nodes, the location of the data is taken into account: jobs are preferably run on the node that has the data on its local disk.

MapReduce processes the data in two phases. The map phase reads the source data and produces <key,value> pairs. The output of the map tasks is sorted by key and then processed by the reducers. The sorting assures that all <key,value> pairs with the same key are bundled and processed by one reducer. A full explanation, including source code, is available at http://hadoop.apache.org/core/docs/current/mapred_tutorial.html.
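The two phases can be illustrated with the canonical word-count job, simulated here in a single Python process (the real framework distributes the map and reduce tasks over a cluster; the function names are illustrative, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # map: read source data and emit <key,value> pairs
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # sort by key so that all pairs with the same key are bundled,
    # then process each bundle in one reducer call
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

counts = dict(reduce_phase(map_phase(["penguin gallery", "penguin zoo"])))
# counts == {"gallery": 1, "penguin": 2, "zoo": 1}
```

The graph-building job of section 5.1 follows the same pattern, with links as the emitted pairs and the source page as the grouping key.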
Figure 4: Architecture of the MapReduce framework. (source: Apache)

Lucene Java

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java [7]. It allows us to perform searches in a large indexed set of data. The information that is stored in a Lucene index is pre-processed to allow for very fast searching. An index contains documents, and documents contain fields. Fields have a name, a value and a number of indexing settings:

Indexed    Fields with this setting are indexed, resulting in an inverse lookup table.

Tokenized    The value of this field is split into tokens by a tokenizer, which can also prepare the text before indexing (e.g. stemming and stop-word removal).

Stored    This setting stores the original text in the index, which is useful if the original value has to be available at search time.

Documents also have a boost, which is a query-independent (static) value that is used when ranking the results of a query. During this project we have used Lucene indexes to store the image galleries and to build a lookup table for the ranks.

Nutch

Nutch is also an open-source software package developed by Apache. It combines Hadoop and Lucene and adds an extensible crawler. Together they can be used to crawl, index and search the internet. At almost every step of the process the authors have integrated extension points. Plug-ins determine which pages are crawled, how the information is processed and which search parameters are handled. Nutch takes care of all the bookkeeping that is involved in crawling the web. It stores downloaded pages and all the outlinks.
There is a main crawldb that stores all the links that the crawler has found. This is used to determine which pages are to be crawled next. Once a selection has been made from those URLs, they are formed into a segment. The segment is then downloaded and the results are fed back into the crawldb and into the search index.
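The Lucene concepts above (tokenized fields, the inverse lookup table, and a static document boost) can be illustrated with a toy index. This is a Python sketch, not Lucene's actual Java API, and the document identifiers are hypothetical:

```python
from collections import defaultdict

def tokenize(text):
    # a trivial tokenizer; Lucene's can also apply stemming and stop-word removal
    return text.lower().split()

class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> doc ids: the inverse lookup table
        self.boost = {}                   # doc id -> static, query-independent boost

    def add(self, doc_id, text, boost=1.0):
        for token in tokenize(text):
            self.postings[token].add(doc_id)
        self.boost[doc_id] = boost

    def search(self, query):
        # AND semantics: documents matching all terms, ordered by static boost
        terms = tokenize(query)
        hits = set.intersection(*(self.postings[t] for t in terms)) if terms else set()
        return sorted(hits, key=lambda d: self.boost[d], reverse=True)

idx = ToyIndex()
idx.add("artis", "penguin gallery artis zoo", boost=2.0)
idx.add("happyfeet", "penguin movie gallery", boost=1.0)
# "penguin gallery" matches both documents; the higher-boosted one comes first
```

This is exactly how the rank-lookup step influences the ordering: the query-independent boost reorders documents that match the same terms.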
LingPipe

This package is one of the few products we have used that is not developed by Apache; it is developed by Alias-i. LingPipe is a suite of Java libraries for the linguistic analysis of human language [1]. We have used it to construct a classifier that categorizes a web page into one of 14 selected ODP categories, looking only at the URL. Multiple types of classifiers are available, but they all work by training on examples. The suite also provides an evaluator that can be used for k-fold evaluation. The evaluation results of the ODP classifier are shown in Tables 5 and 6. Classifiers can be saved to disk, which is useful if multiple instances are needed.
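To illustrate how a trained classifier can assign a category from the URL alone, here is a naive Bayes sketch over URL tokens. This is a generic sketch, not LingPipe's API; the training URLs and category assignments are hypothetical examples:

```python
import math
import re
from collections import Counter, defaultdict

def url_tokens(url):
    # split a URL into lower-case word tokens
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

class UrlNaiveBayes:
    def __init__(self):
        self.term_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter()              # category -> training examples
        self.vocab = set()

    def train(self, url, category):
        self.doc_counts[category] += 1
        for t in url_tokens(url):
            self.term_counts[category][t] += 1
            self.vocab.add(t)

    def classify(self, url):
        total = sum(self.doc_counts.values())
        def log_prob(cat):
            counts = self.term_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[cat] / total)
            for t in url_tokens(url):
                score += math.log((counts[t] + 1) / denom)  # Laplace smoothing
            return score
        return max(self.doc_counts, key=log_prob)

clf = UrlNaiveBayes()
clf.train("dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm", "Recreation")
clf.train("artis.nl/paginas/dierentuin/vogels/pinguins.html", "Recreation")
clf.train("warnerbros.com/happyfeet", "Arts")
clf.train("adobe.com/go/getflashplayer", "Computers")
```

With only a handful of examples the decisions are driven by a few shared tokens, which mirrors why a URL-only classifier is easy to build but limited in accuracy.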
E V A L U A T I O N A N D R E S U L T S 6

Here we give and discuss the results of our experiments. We have calculated the ranks for each of the three surfer-model variations. The measurements that we can perform on the dataset are limited, because we have no reliable way to measure the performance. To be able to illustrate the desired effect, we constructed a sample web graph (Figure 5).

6.1 sample web graph

To explain the effects of the surfer models, we have constructed a sample web graph. It is designed to expose the problem of the generic PageRank algorithm, but such structures are very common on the web. The web graph consists of seven pages and contains three galleries, which are all about penguins. The Artis and Dierenpark Emmen galleries are about the animals in the respective zoos. They link to each other and are referred to by Wikipedia and the Arctic Council. The third gallery is about the computer-animated movie Happy Feet, which is also about penguins. The Happy Feet gallery has an inlink from the Adobe website, which is always very popular because of all the inlinks from pages that require the Flash plug-in.

Figure 5: The example web graph that is used to demonstrate the desired effect of the new surfer models. The graph consists of three galleries and four non-galleries.

The structure of the graph leads to a high rank for the Adobe plug-in page, which has been one of the highest scoring pages in our experiments. Table 1 shows the results of a calculation with 20 iterations on the sample web graph.
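The Levenshtein surfer model weights each outlink by the textual similarity between the source and target URLs. The exact formula is defined in the previous chapter; the sketch below assumes a simple normalization in which a link's probability decreases with the edit distance between the two URLs:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def link_probabilities(source, targets):
    # closer URLs get a larger share; this normalization is an assumed choice
    weights = {t: 1.0 / (1 + levenshtein(source, t)) for t in targets}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}
```

Under this weighting, a link from a zoo gallery to another page on the same site receives a larger traversal probability than a link to an unrelated domain such as the Adobe plug-in page.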
Table 1: Ranking of the sample web graph under the uniform, generic, Levenshtein and ODP surfer models. (The rows are en.wikipedia.org/wiki/penguin, dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm, artis.nl/paginas/t/dierentuin/vogels/pinguins.html, adobe.com/go/getflashplayer, adobe.com/cfusion/showcase, warnerbros.com/happyfeet and arctic-council.org.)

The generic PageRank values are as expected. The rank of the Wikipedia page is 0.15, because there are no inlinks that convey rank to this page. The getflashplayer page receives the highest rank, which is expected and is also the case on the real web. The Happy Feet page receives a large amount of rank from the Adobe showcase web page, and is by far the most popular gallery. The resulting gallery ranking is shown in Table 2.

1. warnerbros.com/happyfeet
2. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
3. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm (0.43)

Table 2: Static gallery ordering with random surfer model.

The PageRank based on the Levenshtein distance does not affect the ordering in this example, but it does change the rank that is assigned to the nodes (Table 3). The amount of rank that is conveyed to the Happy Feet page is increased: the other outlink of the showcase page is regarded as irrelevant, since links that are added for navigational purposes within a website do not convey value to their target. The outlinks from the Wikipedia page are rated as expected: the probabilities of the Artis and Dierenpark Emmen galleries are equal, and the link to the Arctic Council is rated as less important.

1. warnerbros.com/happyfeet
2. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
3. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm (0.35)

Table 3: Static gallery ordering with Levenshtein surfer model.

The ODP-based PageRank algorithm clearly has a positive effect on the ranking.
Links that take the user to a different topic are rated as less important, which can be seen in the corresponding probability matrix (Appendix B). As a result, the rank of the Happy Feet gallery is greatly
reduced. The galleries on the zoo websites are now the most highly ranked.

1. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
2. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm
3. warnerbros.com/happyfeet (0.17)

Table 4: Static gallery ordering with ODP surfer model.

6.2 real web graph

The example shows that the concept works, but the true test is to apply the algorithms to a real-world dataset. The queries have been selected to cover a number of different cases: queries that match a famous person and a few queries that are very general, such as car or movie. The related queries are special: they generally use many more keywords in a boolean OR-query, which should return galleries that concern the same topic. All other queries are boolean AND-queries. The results can be seen in Table 8 (Appendix D).

When we look at the overall number of bad results, we see that the uniform scoring method performs best: it receives the highest quality score and the fewest bad results. The generic PageRank delivered the worst results; it performed better in only one query. The Levenshtein and ODP models performed roughly in between.

However, if we group the queries into themes, the results show some differences. When searching for people, the Levenshtein scoring method performs slightly better, but in the soccer- and travel-related queries the uniform scoring method is still the best. For the remaining broad queries it is up to the user to determine what they appreciate and what they mark as a bad result; the Levenshtein distance scoring method performs slightly better in these cases. The ODP method is slightly better than the generic PageRank method, but worse than the Levenshtein and uniform scoring methods. This is probably the result of an inaccurate classifier, but it still shows that the random surfer model performs sub-optimally.
C O N C L U S I O N 7

We have seen the results of the experiment; in this chapter we draw some conclusions. We discuss the tested surfer models, the importance of a good dataset and the differences between image search and classical information retrieval. We conclude the chapter with some notes on the measurability of rankings on large datasets.

7.1 surfer-models

Previous tests with the OPIC algorithm have shown that generic PageRank has a detrimental effect on the quality of the search results when applied to this dataset. These results have been confirmed by the results of the random surfer model. We expected that the ODP surfer model would yield the best results, but it did not perform extremely well, although it was significantly better than the generic PageRank. This can probably be improved by using a better classifier. The Levenshtein-distance surfer was a long shot, and although it does not have a scientific background, it performed surprisingly well. This illustrates that even simple methods can improve the quality of the PageRank algorithm.

The research on Topic-sensitive PageRank [8] shows that the surfer model can be tuned to a specific topic; its authors had to calculate the PageRank 16 times, once for each category. Our research shows that this is not always required: if the links are weighted according to their relevance to the destination page, the algorithm performs better than the generic approach.

7.2 dataset

We have found that a well-managed dataset is required to construct the web graph. The Nutch crawler stores all the results, but it is up to the administrator to keep the data files. Even if all the files are available, it is still important to assure that the dataset forms a web graph of one component, as many small components are not useful in the PageRank algorithm.
As the dataset is quite large, it is important to reduce the number of nodes in the web graph, because otherwise the computations would become too complex. However, leaving links out can create many disconnected components.

7.3 information retrieval

We have also found that the adaptations to the algorithm have a positive effect on the quality of the search results, but the uniform sorting method still performs better. This is explained by the fact that PageRank is good for information retrieval, but finding the best gallery for a given keyword requires more than that.
Another aspect that is not well covered by the PageRank algorithm is time. For a number of queries that were used in the experiment, the age of a gallery is important: we would, for example, prefer images of FC Barcelona that were taken during their last game. But there are only a small number of links to such galleries, because they are very fresh.

7.4 measuring

It was difficult to get a good measurement of the quality of the ranking. There are plenty of methods to compare two orderings, like the τ distance measure [11]. It is very hard, however, to produce a reference ranking by which the quality can be measured, because the dataset is very large and diverse. It gets even more complicated when static and dynamic ranking are combined: it is hard to compare the results of searches with and without static boosting, and it is difficult to balance the dynamic and static influences. With a lot more fine-tuning of the parameters it might be possible to improve the quality of the search results.
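The τ distance mentioned above counts the pairs of items that two rankings order differently; a minimal sketch:

```python
from itertools import combinations

def kendall_tau_distance(ranking_a, ranking_b):
    """Number of item pairs ordered differently by two rankings of the same items."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    disagreements = 0
    for x, y in combinations(ranking_a, 2):
        # a pair disagrees when one ranking puts x before y and the other after
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
            disagreements += 1
    return disagreements
```

Identical rankings yield 0; a fully reversed ranking of n items yields n(n-1)/2. The hard part, as noted, is not computing the distance but obtaining a trustworthy reference ranking to compare against.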
F U T U R E W O R K 8

In this chapter we discuss topics that require further research.

8.1 level of detail

The level of detail of the web graph could be improved by using a more intelligent system that is able to merge vertices that represent the same web page. One could, for example, remove any reference to sorting and page numbering. It is also possible to create custom filters for domains that do not perform well with the default approach, for example a filter that combines multiple Flickr URLs that lead to the same gallery.

8.2 tooling

The tools that have been used and written for this experiment are rather rudimentary. They are very scalable, but offer little compatibility with the Nutch framework. With the current tools it is not possible to expand the dataset without recomputing all the ranks; it would be much more efficient to compute the ranks based on the previous results. Using these systems in a production environment would mean that roughly 20% of the resources would go into building, maintaining and updating the graph and ranks.

8.3 link prioritization

The behavior of the smart surfer is determined by the link traversal probability distribution. This distribution can be improved; we could, for example, create a blacklist for pages that do not deserve rank. Another option is to combine multiple probability functions (ODP, Levenshtein, other topic comparisons). It might be necessary to obtain more data to make a better estimate, for example the position of a link on a web page: links at the bottom of a long page are less likely to be clicked by a real user.

8.4 odp classifier

The ODP classifier could also be improved. We could increase the number of categories, or find more suitable (less ambiguous) categories. Furthermore, the URL of a page might not be enough to determine its topic; the contents of the page contain many more clues, which would make the system more robust.
The structure of the application would in that case have to be changed, because the textual content of the destination of an outlink is not readily available.
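The option of combining multiple probability functions, raised in section 8.3, could be as simple as a weighted mixture of per-source link distributions. This is a sketch under the assumption that each input distribution already sums to one; the weights would need tuning:

```python
def combine_distributions(distributions, weights):
    """Weighted mixture of link-probability distributions for one source page."""
    total = sum(weights)
    targets = set()
    for d in distributions:
        targets.update(d)
    # missing targets contribute probability 0 from that distribution
    return {t: sum(w * d.get(t, 0.0) for d, w in zip(distributions, weights)) / total
            for t in targets}
```

Because the mixture of normalized distributions is itself normalized, the result can be plugged directly into the PageRank iteration as a link traversal distribution.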
O D P C L A S S I F I E R A

Category          Precision   Recall
Adult             81%         49%
Arts              59%         56%
Business          36%         83%
Computers         65%         37%
Games             67%         44%
Health            65%         35%
Home              64%         36%
Kids and Teens    32%         25%
News              44%         18%
Recreation        50%         28%
Science           63%         51%
Shopping          44%         11%
Society           59%         56%
Sports            75%         39%
Averages          50%         50%

Table 5: ODP classifier precision and recall.
Table 6: ODP classifier confusion matrix, with the true class on the rows and the predicted class on the columns, over the fourteen categories Adult, Arts, Business, Computers, Games, Health, Home, Kids and Teens, News, Recreation, Science, Shopping, Society and Sports.
S A M P L E W E B G R A P H B
Table 7: Probability matrix based on ODP topic matches, between the pages en.wikipedia.org/wiki/penguin, dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm, artis.nl/paginas/t/dierentuin/vogels/pinguins.html, adobe.com/go/getflashplayer, adobe.com/cfusion/showcase and warnerbros.com/happyfeet. Probabilities that are marked bold are part of the sample web graph.
L U C E N E S C O R I N G C

The score of query q for document d correlates to the cosine distance or dot product between the document and query vectors in a Vector Space Model (VSM) of Information Retrieval. A document whose vector is closer to the query vector in that model is scored higher. The score is computed as follows:

    score(q, d) = coord(q, d) · queryNorm(q) · Σ_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )

where

1. tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. The default computation for tf(t in d) in DefaultSimilarity is:

    tf(t in d) = √frequency

2. idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq(t) (the number of documents in which the term t appears). This means rarer terms deliver a higher contribution to the total score. The default computation for idf(t) in DefaultSimilarity is:

    idf(t) = 1 + log( numDocs / (docFreq + 1) )

3. coord(q, d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed by the Similarity in effect at search time.

4. queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor); it merely attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity is:

    queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / √sumOfSquaredWeights
The sum of squared weights (of the query terms) is computed by the query Weight object. For example, a boolean query computes this value as:

    sumOfSquaredWeights = q.getBoost()² · Σ_{t in q} ( idf(t) · t.getBoost() )²

5. t.getBoost() is a search-time boost of term t in the query q, as specified in the query text (see query syntax) or as set by application calls to setBoost(). Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multiple terms are represented in a query as multiple TermQuery objects, and so the boost of a term in the query is accessible by calling getBoost() on the sub-query.

6. norm(t, d) encapsulates a few (indexing-time) boost and length factors:

    Document boost: set by calling doc.setBoost() before adding the document to the index.

    Field boost: set by calling field.setBoost() before adding the field to a document.

    lengthNorm(field): computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing time.

When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

    norm(t, d) = doc.getBoost() · lengthNorm(field) · Π_{field f in d named t} f.getBoost()

However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that at search time it is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
L A R G E D A T A S E T R E S U L T S D

Figure 6: This figure shows one of the pages that people have scored.
More informationEvaluating the Usefulness of Sentiment Information for Focused Crawlers
Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,
More informationPageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. III (Jan.-Feb. 2017), PP 01-07 www.iosrjournals.org PageRank Algorithm Albi Dode 1, Silvester
More informationUNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.
UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index
More informationINTRODUCTION. Chapter GENERAL
Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which
More informationIn the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,
1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to
More informationWeb Search Engines: Solutions to Final Exam, Part I December 13, 2004
Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to
More informationWeb Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search
Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search
More informationRanking in a Domain Specific Search Engine
Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal
More informationRecent Researches on Web Page Ranking
Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through
More informationSearching the Web What is this Page Known for? Luis De Alba
Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse
More informationAn Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages
An Enhanced Page Ranking Algorithm Based on eights and Third level Ranking of the ebpages Prahlad Kumar Sharma* 1, Sanjay Tiwari #2 M.Tech Scholar, Department of C.S.E, A.I.E.T Jaipur Raj.(India) Asst.
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationCLOUD COMPUTING PROJECT. By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma
CLOUD COMPUTING PROJECT By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma Instructor: Prof. Reddy Raja Mentor: Ms M.Padmini To Implement PageRank Algorithm using Map-Reduce for Wikipedia and
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationTHE WEB SEARCH ENGINE
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com
More informationNYU CSCI-GA Fall 2016
1 / 45 Information Retrieval: Personalization Fernando Diaz Microsoft Research NYC November 7, 2016 2 / 45 Outline Introduction to Personalization Topic-Specific PageRank News Personalization Deciding
More informationThe PageRank Citation Ranking
October 17, 2012 Main Idea - Page Rank web page is important if it points to by other important web pages. *Note the recursive definition IR - course web page, Brian home page, Emily home page, Steven
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationTopology-Based Spam Avoidance in Large-Scale Web Crawls
Topology-Based Spam Avoidance in Large-Scale Web Crawls Clint Sparkman Joint work with Hsin-Tsang Lee and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M
More informationCrawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server
Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview
More informationSEO Factors Influencing National Search Results
SEO Factors Influencing National Search Results 1. Domain Age Domain Factors 2. Keyword Appears in Top Level Domain: Doesn t give the boost that it used to, but having your keyword in the domain still
More informationCS47300 Web Information Search and Management
CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page
More informationInternational Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining
Scientific Journal of Impact Factor (SJIF): 4.14 International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 A Review
More informationPagerank Scoring. Imagine a browser doing a random walk on web pages:
Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably
More informationEinführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme
Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationSEO ISSUES FOUND ON YOUR SITE (MARCH 29, 2016)
www.advantageserviceco.com SEO ISSUES FOUND ON YOUR SITE (MARCH 29, 2016) This report shows the SEO issues that, when solved, will improve your site rankings and increase traffic to your website. 16 errors
More informationCS6200 Information Retreival. The WebGraph. July 13, 2015
CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects
More informationResearch and implementation of search engine based on Lucene Wan Pu, Wang Lisha
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,
More informationUniversity of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015
University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing
More informationExperimental study of Web Page Ranking Algorithms
IOSR IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. II (Mar-pr. 2014), PP 100-106 Experimental study of Web Page Ranking lgorithms Rachna
More informationCS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece
More informationSearch Engine Optimization (SEO) using HTML Meta-Tags
2018 IJSRST Volume 4 Issue 9 Print ISSN : 2395-6011 Online ISSN : 2395-602X Themed Section: Science and Technology Search Engine Optimization (SEO) using HTML Meta-Tags Dr. Birajkumar V. Patel, Dr. Raina
More informationRanking Techniques in Search Engines
Ranking Techniques in Search Engines Rajat Chaudhari M.Tech Scholar Manav Rachna International University, Faridabad Charu Pujara Assistant professor, Dept. of Computer Science Manav Rachna International
More informationWeb search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)
' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search
More informationChapter 3: Google Penguin, Panda, & Hummingbird
Chapter 3: Google Penguin, Panda, & Hummingbird Search engine algorithms are based on a simple premise: searchers want an answer to their queries. For any search, there are hundreds or thousands of sites
More informationSEARCHMETRICS WHITEPAPER RANKING FACTORS Targeted Analysis for more Success on Google and in your Online Market
2018 SEARCHMETRICS WHITEPAPER RANKING FACTORS 2018 Targeted for more Success on Google and in your Online Market Table of Contents Introduction: Why ranking factors for niches?... 3 Methodology: Which
More informationLecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science
Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationPersonalizing PageRank Based on Domain Profiles
Personalizing PageRank Based on Domain Profiles Mehmet S. Aktas, Mehmet A. Nacar, and Filippo Menczer Computer Science Department Indiana University Bloomington, IN 47405 USA {maktas,mnacar,fil}@indiana.edu
More informationProximity Prestige using Incremental Iteration in Page Rank Algorithm
Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration
More informationWhy it Really Matters to RESNET Members
Welcome to SEO 101 Why it Really Matters to RESNET Members Presented by Fourth Dimension at the 2013 RESNET Conference 1. 2. 3. Why you need SEO How search engines work How people use search engines
More informationThe Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation
The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? In our experience, we find we can get over-excited when talking to clients or family or friends and sometimes we forget that not everyone
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second
More informationA Modified Algorithm to Handle Dangling Pages using Hypothetical Node
A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal
More informationAuthoritative K-Means for Clustering of Web Search Results
Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian
More informationModule 1: Internet Basics for Web Development (II)
INTERNET & WEB APPLICATION DEVELOPMENT SWE 444 Fall Semester 2008-2009 (081) Module 1: Internet Basics for Web Development (II) Dr. El-Sayed El-Alfy Computer Science Department King Fahd University of
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationFILTERING OF URLS USING WEBCRAWLER
FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,
More informationSearch Engines. Charles Severance
Search Engines Charles Severance Google Architecture Web Crawling Index Building Searching http://infolab.stanford.edu/~backrub/google.html Google Search Google I/O '08 Keynote by Marissa Mayer Usablity
More informationHow to organize the Web?
How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper
More informationLink Analysis in Web Mining
Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained
More informationInformation Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group
Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)
More informationClustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search
Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2
More informationAutomatically Constructing a Directory of Molecular Biology Databases
Automatically Constructing a Directory of Molecular Biology Databases Luciano Barbosa Sumit Tandon Juliana Freire School of Computing University of Utah {lbarbosa, sumitt, juliana}@cs.utah.edu Online Databases
More informationDisambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity
Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University
More information