FOCUSED SURFER MODELS: Ranking visual search results


FOCUSED SURFER MODELS
Ranking visual search results

Jasper Hafkenscheid
Faculty of Mathematics and Natural Sciences
University of Groningen
August 2010

Jasper Hafkenscheid: Focused surfer models, August 2010.
Supervisors: Marco Aiello, Nicolai Petkov, Mathijs Homminga.
Location: Groningen. Time frame: August 2010.

ABSTRACT

PageRank is a graph-based ranking algorithm that ranks the nodes in a graph based on their connections. It has been designed to determine the intrinsic value of a page, based on the link structure of the web. We research the effects of adaptations to the PageRank algorithm in the context of a gallery search engine. The search engine uses text-based search to select galleries. Galleries are web pages that contain a number of small images, which link to enlarged versions of those images. We evaluate the performance by looking at the ordering of the search results. The generic PageRank algorithm does not work well for this kind of application; it seems that the best galleries are not found on the pages with the highest PageRank. PageRank assumes that all links convey trust to other pages, but many links disrupt that concept (e.g. links to download or update your browser or required plug-ins). At the heart of the algorithm lies the random surfer model, which is based on a virtual user that navigates the web by following a random link on a page, and repeating this infinitely. The PageRank algorithm can be changed by altering the probability with which the random surfer chooses which link to follow. Our design choice is to replace the random surfer model with a focused surfer model, which increases the probability of following a link based on the similarity between the linking page and the target. We analyze the performance with the gallery search engine and compare the results with generic PageRank and uniform ranking. The search engine uses a dataset that consists of 15 million galleries; to acquire this dataset we have crawled 1.5 billion pages. We find that alterations to the system affect the outcome, and that the overall performance is increased. The PageRank algorithm is intended for information retrieval; it works very well for general-purpose ranking, but the random surfer model does not work well enough for all applications.


CONTENTS

1 introduction
2 background
    Information Retrieval
    Image Retrieval
    Ranking
    Web graph
    PageRank
        Concept
        Customization
3 related work
    Web-based Image Retrieval
    Intentional surfer model
    Topic-sensitive PageRank
    Hyperlink-Induced Topic Search
    On-Line Page Importance Computation
4 concept
    Importance of links
        Random surfer
        Open directory project (ODP)-biased surfer
        Levenshtein-distance surfer
    Granularity
    Self-links
    Spam
5 realization
    System architecture
        Procedure
            Build graph
            Iterate PageRank
            Build a rank-lookup index
            Rerank gallery index
            Measuring performance
    Data set
    Tools
        Apache Hadoop
        Lucene Java
        Nutch
        LingPipe
6 evaluation and results
    Sample web graph
    Real web graph
7 conclusion
    Surfer-models
    Dataset
    Information retrieval
    Measuring
8 future work
    Level of detail
    Tooling
    Link prioritization
    ODP classifier
A odp classifier
B sample web graph
C lucene scoring
D large dataset results
bibliography

LIST OF FIGURES

Figure 1  Example link-structure
Figure 2  A linked set of hubs and authorities
Figure 3  Architecture overview
Figure 4  MapReduce
Figure 5  Example web graph
Figure 6  Sample scoring page

LIST OF TABLES

Table 1  Ranking of sample web graph
Table 2  Static gallery ordering using generic PageRank
Table 3  Static gallery ordering using Levenshtein-based PageRank
Table 4  Static gallery ordering using ODP-based PageRank
Table 5  ODP classifier precision and recall
Table 6  ODP classifier confusion matrix
Table 7  Probability matrix based on ODP topic matches
Table 8  User feedback on ranking

ACRONYMS

ODP   Open directory project
DFS   Distributed file system
OPIC  On-line Page Importance Computation
HITS  Hyperlink-Induced Topic Search
TLD   Top Level Domain
TREC  Text REtrieval Conference


1 INTRODUCTION

The Internet has grown tremendously since the invention of HTML. HTML is the language for web pages, which can contain links to other HTML pages. These links are called hyperlinks. Users can follow the links to navigate the web, but as the Internet grew, it became harder to accurately find information. This was the motivation to develop search engines that index the web and allow searching based on keywords. Ordering the search results became more and more important as the web grew. PageRank is an algorithm that uses the hyperlinks on the pages to determine the value of a web page. It is the basis for the Google search engine. When the Internet is seen as a graph with vertices (pages) and edges (hyperlinks), it is possible to determine the value of a page with respect to the number of references. PageRank works by distributing rank via the outlinks of every page. The rank of a page is determined by the amount of rank it receives from the pages that refer to it. This recursive algorithm calculates the rank for every node in the graph. Nodes with a higher rank represent more valuable web pages, which are more likely to contain valuable information. These pages should be given a higher position in the search results.

We apply PageRank to an image gallery dataset, which consists of image galleries that are found on the web. We find that the performance is sub-optimal, because the quality of a photo gallery cannot be properly determined by the popularity of the web page. However, PageRank can still be used to determine which galleries are more valuable than others. To improve the PageRank computation, we replace the surfer model with a model that is smarter and that more accurately resembles real users. The concept is that links between pages with the same topic are more important than links to pages that concern a different topic. Pages that have a lot in common are rated as more likely candidates to follow than pages that cover different topics. We have developed an Open directory project (ODP)-biased surfer, which uses a classifier trained with ODP data to determine the similarity of two pages. ODP is a project that collects and categorizes links to web sites; this data is freely available. The other surfer model that we test is based on the Levenshtein distance measurement, applied to the similarity of the originating and target URLs. The concept behind this model is that pages that discuss the same topic are likely to mention it in their URL. To compare the results, we also calculate the PageRank with the original random surfer model. The fourth and last sorting method gives each page a rank of 1, which we call uniform ranking.

To gather results we process a dataset that consists of approximately 15 million galleries and 1.5 billion other pages. The calculated ranks are then combined with the text-match score to produce search results for 22 queries. Because there are no good validation methods for this dataset, we ask people what they think of the resulting ordering of the galleries.

To demonstrate the desired effect, we have also applied the algorithm to a sample web graph. We see that the new surfer models outperform the random surfer model. The ODP surfer model does not work better than the Levenshtein model, which can be explained by the poor performance of the classifier. The research still shows that the random surfer model is sub-optimal, and also shows that the uniform ranking is better in some areas, which was unexpected.

The remainder of this thesis is structured into seven chapters. Chapter 2 explains some of the basics of information retrieval and PageRank. Chapter 3 explains the position of this thesis with respect to the state of the art. In Chapter 4 we explain the concept and introduce the three surfer models. Then we describe how we computed the rank values and explain some of the concepts (Chapter 5). In Chapters 6 and 7 we discuss the results and explain them. The final chapter contains some future work in the field of alternate surfer models for PageRank.

2 BACKGROUND

There has been a lot of research regarding information retrieval and various ranking methods. Some of the most influential groundwork on query-independent ranking is PageRank. In this chapter we introduce these topics.

2.1 information retrieval

Information retrieval concerns the storage of and access to information. The representation and organization of the information items should provide the user with easy access to the information in which he is interested [4]. The typical scenario is that a user has an information need; it then becomes the task of the information retrieval system to provide the information. The information need is often translated into a query. The query normally consists of some keywords, but may include other parameters like a time period. The query is then fed into a search engine, which returns the requested data to the user. To provide the user with easy access, the results are ordered based on their relevance with respect to the information need. There is also a part of the information need that is not specified in the query; this could be a desired language or a desire to get recent information. These desires also have to be accounted for in the ordering of the search results. One of the differences between information retrieval and data retrieval is that an information retrieval system deals with natural language, whereas a data retrieval system often uses a database and deals with a well-defined structure and semantics.

2.2 image retrieval

Image retrieval is a type of information retrieval which involves images. The search can either be text-based or it can be based on image data. Most of the time the text data consists of meta-data that is added to the image (e.g. a caption or keywords). Content-based image retrieval uses data that is extracted from the image with computer-vision techniques. This allows for searching based on a different image or by defining image features like color or composition [19]. In this thesis we have not used computer-vision techniques; we perform image retrieval using the text that is found with the image.

2.3 ranking

Ranking is determining the order of elements. In our case, the elements are search result entries. The ordering of the results is based on the information need. This can be broken down into two components: a query and a generic component. The query is represented as a number of keywords, but the generic component describes features which are more difficult to grasp. These can include popularity or authoritativeness. We want to see the results that have the best match with the information need first, so they should be ranked higher than other results.

To rank the results, each element is given a score: a number that represents the quality of that element with respect to the information need. It is difficult to determine the score of a search result, because only some of the parameters of the information need can be calculated; other features are more difficult to measure. The score is the sum of two components: a query-dependent and a query-independent score. The query-independent score is not influenced by the query, but is based on the implicit component of the information need; it reflects the intrinsic quality of a page. Because these scores are calculated independently of the query, they can be calculated ahead of time. This is called off-line computation. It allows for more extensive analysis, because on-line methods have to be very fast. The query-dependent score is usually based on heuristics that consider the number and locations of matches of the various query words on the page itself, in the URL or in any anchor text referring to the page. Relevant factors are the number of matches, the order of the matches, the number of other words and the uniqueness of the terms. All these factors have to be combined into the final ranking. When the elements are used to perform a search based on a query, the query-independent score is used together with a query-dependent score to rank all results matching the query [9].

2.4 web graph

To determine a score for some of the implicit features of the information need, one can use the link structure of the web. We expect that valuable pages have more links pointing towards them. A graph is a network of nodes (vertices) and links between nodes (edges). In a directed graph the edges have a direction: they have an origin and a destination vertex. All the edges that originate from a vertex are the out-edges or outlinks of that vertex; the edges that point to a vertex are the in-edges or inlinks. The pages and links of the web can be modeled as a directed graph, the vertices being the pages and the edges being the links. This graph is one of the data structures that is used to determine the score of a page. There are several ranking methods that are based on the web graph. The most successful one is PageRank, which we will be using in this thesis. A number of other methods are discussed in Chapter 3.

2.5 pagerank

PageRank is a graph algorithm that can rank vertices in a graph [18, 5]. It was first used as a citation ranking mechanism, which ranked publications based on the number and quality of the citations. Larry Page and Sergey Brin developed it and later used it to build the Google search engine. If the graph is based on the link structure of the web (Figure 1), it can be used to rank web pages independently of a query.

2.5.1 Concept

The idea behind the PageRank algorithm is that a hyperlink to a web page is a vote of support. Pages that receive a lot of votes are given a better rank than other pages. PageRank does not only count the number of votes, but also takes the rank of the voting page into account. A page that receives a lot of votes from highly ranked pages receives a high rank itself. The PageRank of a page is defined recursively (Formula 2.1). The concept can also be explained by the random surfer model [14]. This model is based on a random surfer, who follows a random outlink on the page he is visiting. Every time he visits a page, its score is increased. This is repeated infinitely. If the surfer visits a page more often, its rank increases. The surfer sometimes gets bored and jumps to a random page (which models bookmarks and dead-ends); this probability is modeled by the dampening factor d. Pages that do not contain outlinks transfer their rank equally over all pages of the web. Pages that have not yet been downloaded also fall into this category. The algorithm can thus provide a score for pages that have not yet been downloaded; these scores can then be used to selectively download the web, instead of performing a breadth-first search.

PR(p_i) = (1 - d) + d \sum_{p_j \in M(p_i)} P(p_j, p_i) \frac{PR(p_j)}{L(p_j)}    (2.1)

where
M(p_i) is the set of pages linking to p_i,
L(p_j) is the number of outlinks of p_j,
d is the dampening factor, usually 0.85,
P(p_j, p_i) is the probability that the link between the pages is traversed; in the original formula this is always 1.
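As an illustration of Formula 2.1, the following minimal Java sketch performs the iteration in memory on a toy graph. This is not the thesis implementation (which runs as a MapReduce job, see Chapter 5); the graph, the names and the uniform probability P = 1 are illustrative only.

    import java.util.*;

    // A minimal in-memory sketch of Formula 2.1 with the random surfer model.
    public class PageRankSketch {

        static final double D = 0.85; // dampening factor d

        // One iteration: PR'(i) = (1 - d) + d * sum over inlinks j of P(j,i) * PR(j) / L(j).
        static Map<String, Double> iterate(Map<String, List<String>> outlinks,
                                           Map<String, Double> pr) {
            Map<String, Double> next = new HashMap<>();
            for (String page : outlinks.keySet()) next.put(page, 1.0 - D);
            for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) continue;                    // dangling pages omitted here
                double share = pr.get(e.getKey()) / out.size(); // PR(j) / L(j)
                for (String target : out)                       // P(j,i) = 1: random surfer
                    next.merge(target, D * share, Double::sum);
            }
            return next;
        }

        public static void main(String[] args) {
            Map<String, List<String>> g = new HashMap<>();
            g.put("A", List.of("B", "C"));
            g.put("B", List.of("C"));
            g.put("C", List.of("A"));
            Map<String, Double> pr = new HashMap<>();
            for (String p : g.keySet()) pr.put(p, 1.0);
            for (int i = 0; i < 25; i++) pr = iterate(g, pr);   // converges in a few dozen rounds
            System.out.println(pr);
        }
    }

Replacing the uniform share with a per-link probability P(j, i) turns this sketch into one of the focused surfer models of Chapter 4.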

2.5.2 Customization

There are several ways to alter the behavior of PageRank. The granularity of the web graph can allow pages to merge, which makes the graph more dense. This also results in multiple pages getting the same score. One can also handle the dangling-pages problem differently. Page and Brin suggested adding the random jump. This is a basic solution for the problem, but it does not mimic the behavior of real users. The probability that a user makes a random jump is 1 - d, which usually is 15%. The random jump also remedies the problem of rank-sinks by always allowing for a way out. (Rank-sink: a section of a graph that accumulates rank, due to the lack of outlinks.) The best way to influence the outcome of the algorithm is to change the personality vector P. The vector represents the probability that the user follows a specific outlink. Page and Brin included it to allow personalization of the search results. Personalization might not be the right word, because it would be far too costly to compute the PageRank values for every user.

Figure 1: Example link structure with PageRank values after the 1st, 2nd, 3rd and 25th iterations. Pages are vertices and directed edges represent links.

3 RELATED WORK

This chapter notes some other research in the field of search result ranking, and describes its relation to this thesis. We discuss Hyperlink-Induced Topic Search (HITS) and On-line Page Importance Computation (OPIC) and elaborate on various surfer models.

3.1 web-based image retrieval

Internet search engines have been available for quite some time, but most of them are designed for generic information retrieval. There is still a lot of research to be done in the field of image retrieval [12]. In this thesis, we focus on the ranking aspect of image retrieval. The generic selection and ranking of the search results are handled by traditional text search engine methods.

3.2 intentional surfer model

An improvement to the random surfer model can be made by using real-world data. This has to be gathered by analyzing the surfing behavior of real users. Google released their toolbar in December 2000 [2], which allows people to use the search service and translate pages easily. Another feature of the toolbar is that it sends information regarding the surfing behavior to Google. If enough people submit their data, then Google can accurately replace the random surfer model with an intentional surfer model [10]. The browser Google Chrome, Google Analytics and the Google advertising services may also provide valuable information that enables Google to improve their ranking. For our research we do not have access to this data, and therefore use other methods to develop a new surfer model. The advantage is that our models can operate on pages that no user has ever visited.

3.3 topic-sensitive pagerank

Topic-sensitive PageRank is an adaptation of the PageRank algorithm that has been researched by Haveliwala [8]. It uses a classifier which is trained using data from the ODP [17]. ODP is a project that gathers and categorizes links to web pages. Each link has a description and a title. The dataset is maintained by volunteers and is freely available. The classifier can determine into which of the 16 main categories a page belongs. Haveliwala applied it to all pages in the dataset and performed the PageRank calculation for each of the top-level categories. At search time the topic of the query is determined, and the corresponding PageRank data is used. This resulted in a significant improvement in the ordering of the results.

The key difference between our approach and the topic-sensitive PageRank method is that we only calculate one score for each link. The researchers rely on the classifier to analyze the query, and use the corresponding PageRank data. Experience shows that it can be difficult to accurately analyze very short texts: there are numerous ambiguous words in any language, and the demographic of this search service is not bound to one language.

3.4 hyperlink-induced topic search

HITS is a graph-based algorithm that computes two scores for all the results of a query [13]. The hub score estimates the value of a page's links to other pages; the authority score estimates the value of the content of the page. The results can be divided into two categories: hubs, which lead to authorities, and the authorities themselves (Figure 2). It can also be used to identify results that are not relevant for the query (by looking at the links in the graph). One of the differences with PageRank is that it is not computed off-line, but is a query-time process: it operates on the results of a search. It is not commonly used by search engines, but a similar algorithm has been in use by ask.com.

Figure 2: A linked set of hubs and authorities.
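To make the hub/authority interplay concrete, here is a small illustrative sketch of the HITS update rules; the graph representation and the normalization details are assumptions of this sketch, not taken from [13].

    import java.util.*;

    // Illustrative HITS updates: authority = sum of incoming hub scores,
    // hub = sum of outgoing authority scores, both normalised every round.
    public class HitsSketch {

        static Map<String, Double> authorities(Map<String, List<String>> out, int rounds) {
            Map<String, Double> hub = new HashMap<>(), auth = new HashMap<>();
            for (String p : out.keySet()) hub.put(p, 1.0);
            for (int r = 0; r < rounds; r++) {
                auth = new HashMap<>();
                for (String p : out.keySet())                   // auth(q) += hub(p) for p -> q
                    for (String q : out.get(p))
                        auth.merge(q, hub.get(p), Double::sum);
                for (String p : out.keySet()) {                 // hub(p) = sum of auth over outlinks
                    double h = 0;
                    for (String q : out.get(p)) h += auth.getOrDefault(q, 0.0);
                    hub.put(p, h);
                }
                normalise(auth);
                normalise(hub);
            }
            return auth;
        }

        static void normalise(Map<String, Double> m) {
            double s = 0;
            for (double v : m.values()) s += v * v;
            final double n = Math.sqrt(s);
            if (n > 0) m.replaceAll((k, v) -> v / n);
        }
    }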

3.5 on-line page importance computation

OPIC is, as the name suggests, an on-line method to rank pages [3]. The advantage is that it is able to do the calculations without storing a separate link structure. The score of a page is updated when the page is downloaded, so it is not a search-time algorithm. In the long run the scoring should be the same as with conventional PageRank. The Nutch platform (discussed in Chapter 5) includes an implementation of OPIC. It is used to determine which URLs should be downloaded first. At search time the scoring can be used to order the search results. The disadvantage is that pages need to be crawled multiple times to reach a meaningful score.

4 CONCEPT

The goal of this research is to measure how different adaptations of the PageRank algorithm influence the quality of the ranking. The alterations that we explore concern alternative surfer models: the probability with which a certain outlink is chosen is made to depend on the similarity between the two pages.

4.1 importance of links

The probability with which an outlink is traversed depends on its own score with respect to the other outlinks on that web page. We adapt the PageRank algorithm by determining the relevance of each of the outlinks on a page. The random surfer model is thus changed into a smart surfer model. For every surfer model the probability function is given. The formula returns a score with which the link from page p_i to page p_j will be followed. The scores are then divided by the sum of the scores of all outlinks, resulting in normalized probabilities.

4.1.1 Random surfer

In the random surfer model the scores are all equal to one. Each link is regarded as equally important. This method is used as reference and is later referred to as generic PageRank.

P(p_i, p_j) = 1

4.1.2 ODP-biased surfer

We have constructed a basic URL-classifier by training it with URLs from the ODP [17]. The ODP consists of approximately 5 million URLs, all of which have been categorized into a rich data structure. They are grouped by topic on multiple levels and are also divided over a number of languages. There are 16 main categories, 14 of which we used for our research. The URLs that are found in the regional and world categories have been re-mapped to their respective top-level category. This ensures that the dataset does not contain English pages only. Both the URL of the originating page and the destination page are processed by the classifier. The relevance of the link is based on how much the topics overlap. Pages that have the same topic have a good connection; others have a very weak one.

Table 5 shows the results of the ODP classifier in a 10-fold evaluation; we have used every tenth ODP record to test the classifier. The table shows that the accuracy is very high (93%). The precision (50%) and recall (50%) are much lower. The accuracy is not an appropriate measurement in this case, because of the non-uniform distribution of the classes. Business, Arts and Society together account for 50% of the dataset and have a low accuracy.

Accuracy: (TP + TN) / Sum
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
TP = true positive, TN = true negative, FP = false positive, FN = false negative

The classifier performs poorly, because the precision and recall are low. This can be explained by the ambiguity of the categories: a website about a computer game might be in the Computers, Games or Kids and Teens category. Another problem is the limited set of features that is used to train the classifier. We have chosen to use only the URL, but this is a very limited feature. Retrieving more information about a certain outlink is complex, because the page may or may not have been crawled. Even if the page has been crawled, it is still difficult to gather the information on that page, because we cannot store all the web pages that we have crawled. The goal of this thesis is not to build an accurate classifier but to study the effects of intelligent-surfer models on PageRank. Therefore we have not invested too much time and resources in optimizing this classifier. The sample web graph that is shown in Figure 5 shows that the classifier works. We are confident that the performance of this classifier can be improved by using more features or a less ambiguous class hierarchy.

P(p_i, p_j) = \sum_{c \in Cats} O(p_i, c) \cdot O(p_j, c)

where O(p_i, c) is the probability that page p_i belongs to category c, and Cats is the collection of 14 categories.

4.1.3 Levenshtein-distance surfer

The other surfer model is based on the Levenshtein distance measurement [15], which counts the number of characters that have to be added, changed or removed to transform a given text string into another string. This is a very simple method to determine how much two pages have in common. There is a chance that pages that discuss a certain topic mention that fact in their URL. If this is also the case for some of the outlinks, then this can be used to calculate the probability that someone will follow the link. This measurement does not require any information about the destination page other than its URL.

We expect that URLs that are on the same domain or on a very similar host are not much more relevant than links to other pages. We do not want to spread the rank to other pages on the same domain, but use it to assign a meaningful rank to other pages. We have therefore chosen to fix the importance of links to the same domain to 0.1 and links to a very similar host to 0.2. The second rule reduces the effects of links to domains that are similar but have a different Top Level Domain (TLD) (e.g. google.nl, google.com).

P(p_i, p_j) =
\begin{cases}
0.1 & \text{if } Domain(p_i) = Domain(p_j) \\
0.2 & \text{if } PL(Host(p_i), Host(p_j)) < 0.5 \\
PL(p_i, p_j) & \text{otherwise}
\end{cases}

Domain(p_i) returns the highest level of the domain, and Host(p_i) returns the full hostname; e.g. for http://hadoop.apache.org/, Domain = apache.org and Host = hadoop.apache.org.

PL(i, j) = \frac{LD(i, j)}{(length(i) + length(j)) / 2}

LD(i, j) = \min
\begin{cases}
LD(i-1, j-1) + cost & \text{(substitute: cost 1, copy: cost 0)} \\
LD(i-1, j) + 1 & \text{(insert)} \\
LD(i, j-1) + 1 & \text{(delete)}
\end{cases}
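A minimal Java sketch of this scoring rule follows; the host and domain helpers are simplified stand-ins for real URL parsing, not the thesis implementation.

    // Sketch of the Levenshtein-based link score of Section 4.1.3;
    // PL and LD follow the definitions given above.
    public class LevenshteinSurfer {

        static double linkScore(String from, String to) {
            if (domain(from).equals(domain(to))) return 0.1;   // same domain
            if (pl(host(from), host(to)) < 0.5) return 0.2;    // nearly identical hosts (e.g. other TLD)
            return pl(from, to);                               // normalised edit distance PL
        }

        // PL(i, j): LD divided by the average length of the two strings.
        static double pl(String a, String b) {
            return ld(a, b) / ((a.length() + b.length()) / 2.0);
        }

        // Plain dynamic-programming Levenshtein distance LD.
        static int ld(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++) {
                    int subst = d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                    d[i][j] = Math.min(subst, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
                }
            return d[a.length()][b.length()];
        }

        static String host(String url) {
            return url.replaceFirst("^https?://", "").split("/")[0];
        }

        static String domain(String url) {
            String[] p = host(url).split("\\.");
            return p.length < 2 ? host(url) : p[p.length - 2] + "." + p[p.length - 1];
        }
    }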

4.2 granularity

We have to choose a level of granularity for the web graph. We could decide to use domains for the nodes in the graph. This would mean that links to any page in a domain are merged into one single node in the graph, and all galleries that are hosted on that domain get the same rank. This is not a perfect solution, because of the large number of galleries that are hosted on photo sharing sites like Flickr or DeviantArt: these sharing sites have a huge number of galleries, but only a small percentage of them is of very good quality. The opposite strategy is to use the entire URL, including the query part (e.g. the page number or sorting method). This might cause problems because links to the same gallery are not combined. We have therefore chosen to use a simplified version of the URL: the query part is removed, and the remainder is used to identify galleries. This could still cause problems if multiple galleries share the same URL but use a different query parameter (e.g. albumid=2), but most sites do not use such a system, or use it for other purposes like sorting or page selection. Website developers often aim to make pretty URLs, which improve their performance in search engines and allow users to guess what the page is about when they only see the URL. These pretty URLs will not be merged, but other pages will be grouped into a single node in the graph.

4.3 self-links

Another point of attention is which links are to be included in the graph. The concept is that links convey importance to the destination page, but links that point to the same page are not relevant. This occurs a lot on forums that have links to the top of the page after each message. A well-known method to retain rank within a particular website is to create a lot of links on every page that link to all the pages on that website. This is also common in forums: each posting has a link to the profile of the poster, and popular threads have a number of pages that all link to each other.

4.4 spam

The last decision concerns the problem of spam. The web is contaminated by pages that are constructed to influence the ranking of search engines. The policy that we have applied is that pages with a large number of outlinks are not included in the graph. This should not greatly alter the overall computation, because the rank of such a page would be divided by the number of outlinks; each outlink only transfers a tiny amount of rank.
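The decisions of Sections 4.2 to 4.4 amount to a filter that runs while the graph is built. The following sketch summarizes them; the outlink threshold is an assumption of this sketch, as the thesis does not fix a concrete number.

    import java.net.URI;
    import java.net.URISyntaxException;

    // Sketch of the graph-construction rules of Sections 4.2-4.4.
    public class LinkFilter {

        static final int MAX_OUTLINKS = 500; // spam cut-off (assumed value)

        // Granularity (4.2): identify a page by scheme, host and path; drop the query part.
        static String simplify(String url) throws URISyntaxException {
            URI u = new URI(url);
            return new URI(u.getScheme(), u.getHost(), u.getPath(), null).toString();
        }

        // Decide whether a link becomes an edge in the web graph.
        static boolean keepLink(String from, String to, int outlinksOnPage)
                throws URISyntaxException {
            if (outlinksOnPage > MAX_OUTLINKS) return false;   // spam rule (4.4)
            return !simplify(from).equals(simplify(to));       // self-link rule (4.3)
        }
    }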


5 REALIZATION

In order to perform the necessary computations, we have built a system that is capable of calculating the PageRank of a large number of pages in a few hours. We will explain the architecture of the system and describe the tools that we used to build it.

5.1 system architecture

Figure 3 shows the architecture of the system. Some of the tools are provided by Nutch, and some have been written for the purpose of this thesis.

Figure 3: Architecture of the system. Steps 1 and 2 are provided by Nutch, which provides for the crawling and stores the data into its databases. In step 3 we build the graph and calculate the probabilities that define the surfer model. Step 4 performs the PageRank computations, which are stored to a rank-lookup index in step 5. Step 6 updates the Nutch index with boost values from the rank-lookup index.

Procedure

In this section we discuss the procedure that is required to statically order the galleries in a web graph using PageRank. We assume that a dataset is available, which has been produced by running the Nutch crawler for a considerable amount of time.

Build graph

In this step we parse the Nutch dataset and extract the web graph. Each link is represented by its origin, its target and the traversal probability; the probability is based on the formulas that are mentioned in the previous chapter. The outlinks are then grouped by source page. We have written a MapReduce processing job to perform this task.

Iterate PageRank

Now we iterate the PageRank algorithm until it reaches a stable state. In each iteration the rank of each page is calculated using the rank of the inlinks to that page. The amount of rank that is transferred is relative to the probability of the outlinks. This is also wrapped in a MapReduce job. Depending on the density of the graph, this can take a lot of iterations.

Build a rank-lookup index

We construct a rank index containing the rank of all galleries. This step simplifies the process of updating the rank in the gallery index. This is also a MapReduce job; at the end all the subindexes that are produced by the reducers are merged into one index.

Rerank gallery index

The gallery index has been built from the same Nutch dataset with a Nutch tool. We traverse the gallery index and set the boost value for every document according to the rank that is stored in the rank index. This is the only part of the process that is not performed by a MapReduce job, because the input for this step is an existing index. Indexes are great for searching, but building and reading must be done in one thread. The Lucene library uses a complex formula to score search results based on the search terms, boost values for different fields, and the number of occurrences of the terms (Appendix C). We calculate the boost value of the documents by taking the logarithm of the PageRank. This ensures that the boost values do not overpower the dynamic ranking: documents should still primarily be ranked by their dynamic score, and the goal is only to slightly influence the ordering by setting the boost values.
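A sketch of this reranking step against the Lucene API of that era (Document.setBoost existed up to Lucene 4.x): the field name "url", the RankLookup interface and the +1 inside the logarithm (which keeps the boost positive) are assumptions of this sketch.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    // Copy every gallery document into a new index with a boost derived
    // from its PageRank, looked up in the rank-lookup index of step 5.
    public class Reranker {

        interface RankLookup { double rankFor(String url); }

        static void rerank(IndexReader reader, IndexWriter writer, RankLookup ranks)
                throws Exception {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;
                Document doc = reader.document(i);
                double rank = ranks.rankFor(doc.get("url"));
                // the logarithm keeps the static boost from overpowering
                // the dynamic score; the +1 is an assumption of this sketch
                doc.setBoost((float) Math.log(1.0 + rank));
                writer.addDocument(doc);
            }
        }
    }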

Measuring performance

At this point we load the new gallery index into the search engine. We use the application to give insight into the new ranking. We have asked some people to rate the computed ranking. For a number of queries they are asked to count the number of ugly or bad galleries, and to give a score in five steps from very bad to excellent for the ranking of the first five results. The search terms that are used in this evaluation are selected by looking at the words that are used most often in search queries. We have also added some well-known people, some football clubs and some vague subjects. This should be a good sample set to determine the quality of the ranking.

5.2 data set

To be able to compare the results with other ranking methods, a generic dataset is required. The Text REtrieval Conference (TREC) provides such datasets. We have not been able to find a relevant dataset that could be used for this thesis, because all of the datasets focus on text retrieval, and not on the images that are found on the pages [16]. We use a previously-built dataset that is constructed by an enhanced Nutch crawler (section 5.3.3). It can detect galleries, which are pages with a group of thumbnails that link to higher-resolution versions of those thumbnails. The data is stored to a Distributed file system (DFS). It also stores all the outlinks of each page. These files are processed by a MapReduce task that creates vertices for each link. The set that we use for testing consists of pages crawled in a single year and contains about 15 million galleries. On average one percent of the crawled pages is a gallery. The starting points of the crawl consisted of a number of well-known web pages (targeted at galleries), so we can assume that the graph does not have too many components. We also use a sample web graph (Appendix B) that clearly illustrates the intended behavior.

5.3 tools

Whilst working on the application, we have been introduced to some excellent open-source tools. These tools have been vital for the success of this application.

5.3.1 Apache Hadoop

Apache Hadoop is an open-source MapReduce engine. MapReduce [6] is a programming model that allows data processing jobs to be split into small chunks. These chunks are processed on a server cluster, which enables it to process huge amounts of data in very little time. The system is able to detect failures and can resubmit a chunk when needed. The Hadoop framework also contains a DFS, which takes care of replication and fail-over. When the system assigns jobs to nodes, the location of the data is taken into account: jobs are preferably run on the node that has the data on its local disk. MapReduce processes the data in two phases. The map phase reads the source data and produces <key,value> pairs. The output of the map tasks is sorted by key, and then processed by the reducers. The sorting assures that all <key,value> pairs with the same key are bundled and processed by one reducer. A full explanation, including source code, is available at http://hadoop.apache.org/core/docs/current/mapred_tutorial.html.
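As an illustration of how one PageRank iteration maps onto the two phases, here is a Hadoop sketch; the input line format "url <TAB> rank <TAB> out1,out2,..." is invented for the example, since the thesis does not document its on-disk layout, and the uniform rank share stands in for the stored per-link probabilities.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One PageRank iteration as a MapReduce job (driver class omitted).
    public class PageRankIteration {

        public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split("\t");
                double rank = Double.parseDouble(f[1]);
                String links = f.length > 2 ? f[2] : "";
                ctx.write(new Text(f[0]), new Text("#" + links));  // pass the link list through
                if (links.isEmpty()) return;                       // dangling page
                String[] out = links.split(",");
                for (String target : out)                          // a focused surfer would
                    ctx.write(new Text(target),                    // weight this share by P
                            new Text(Double.toString(rank / out.length)));
            }
        }

        public static class RankReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text url, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                double sum = 0;
                String links = "";
                for (Text v : values) {
                    String s = v.toString();
                    if (s.startsWith("#")) links = s.substring(1); // recovered link list
                    else sum += Double.parseDouble(s);
                }
                double rank = 0.15 + 0.85 * sum;                   // (1 - d) + d * incoming rank
                ctx.write(url, new Text(rank + "\t" + links));
            }
        }
    }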

Figure 4: Architecture of the MapReduce framework. (source: Apache)

5.3.2 Lucene Java

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java [7]. It allows us to perform searches in a large indexed set of data. The information that is stored in a Lucene index is pre-processed to allow for very fast searching. An index contains documents, and documents contain fields. Fields have a name, a value and a number of indexing settings. These settings are:

Indexed: Fields that have this setting are indexed, resulting in an inverse lookup table.

Tokenized: The value of this field is split into tokens by a tokenizer. This can also prepare the text before indexing (e.g. stemming and stop-word removal).

Stored: This parameter stores the original text in the index, which can be useful if the original value has to be available at search time.

Documents also have a boost, which is a query-independent (static) value that is used when ranking the results of a query. During this project we have used Lucene indexes to store the image galleries and to build a lookup table for the ranks.
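A short sketch of how a gallery page could be stored with these settings, against the Lucene 2.x/3.x Field API; the field names are illustrative, not taken from the thesis.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class GalleryDocuments {

        static Document galleryDocument(String url, String pageText, String thumbUrl) {
            Document doc = new Document();
            // indexed and tokenized: the free text that queries match against
            doc.add(new Field("text", pageText, Field.Store.NO, Field.Index.ANALYZED));
            // indexed but not tokenized: exact lookups on the URL
            doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
            // stored only: shown in the result list, never searched
            doc.add(new Field("thumb", thumbUrl, Field.Store.YES, Field.Index.NO));
            doc.setBoost(1.0f); // static boost, later replaced by the PageRank-based value
            return doc;
        }
    }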

5.3.3 Nutch

Nutch is also an open-source software package that is developed by Apache. It combines Hadoop and Lucene and adds an extensible crawler. Together they can be used to crawl, index and search the internet. At almost every step of the process the authors have integrated extension points. The plug-ins determine which pages are crawled, how the information is processed and which search parameters are processed. Nutch takes care of all the bookkeeping that is involved in crawling the web. It stores downloaded pages and all the outlinks. There is a main crawldb that stores all the links that the crawler has found; this is used to determine which pages are to be crawled next. Once a selection has been made from those URLs, they are formed into a segment. The segment is then downloaded and the results are fed back into the crawldb and into the search index.

5.3.4 LingPipe

This package is one of the few products that we have used that is not developed by Apache; it is developed by Alias-i. LingPipe is a suite of Java libraries for the linguistic analysis of human language [1]. We have used it to construct a classifier which can categorize a web page into one of the 14 selected ODP categories. It does so by looking only at the URL. There are multiple types of classifiers available, but they all work by training them with examples. The suite also provides an evaluator which can be used for a k-fold evaluation. The evaluation results of the ODP classifier are shown in Tables 5 and 6. Classifiers can be saved to hard disk, which is useful if you need multiple instances.
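A hedged sketch of such a URL classifier, along the lines of the LingPipe 3.x API (DynamicLMClassifier.createNGramProcess and train(category, text)); the character n-gram size and the example training call are assumptions, not the thesis configuration.

    import com.aliasi.classify.Classification;
    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.lm.NGramProcessLM;

    // Character n-gram language-model classifier over the 14 ODP categories.
    public class OdpUrlClassifier {

        static final String[] CATEGORIES = { "Adult", "Arts", "Business",
                "Computers", "Games", "Health", "Home", "Kids and Teens",
                "News", "Recreation", "Science", "Shopping", "Society", "Sports" };

        public static void main(String[] args) {
            DynamicLMClassifier<NGramProcessLM> classifier =
                    DynamicLMClassifier.createNGramProcess(CATEGORIES, 5);
            // training: one call per ODP record, with the ODP topic as category
            classifier.train("Recreation", "http://www.artis.nl/dierentuin/vogels/");
            // classification: only the URL of a page is available as a feature
            Classification c = classifier.classify("http://www.dierenpark-emmen.nl/");
            System.out.println(c.bestCategory());
        }
    }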


6 EVALUATION AND RESULTS

Here we give and discuss the results of our experiments. We have calculated the ranks for each of the three surfer-model variations. The measurements that we can perform on the dataset are limited, because we have no reliable way to measure the performance. To be able to illustrate the desired effect, we constructed a sample web graph (Figure 5).

6.1 sample web graph

To explain the effects of the surfer models, we have constructed a sample web graph. We made sure that it illustrates the problem of the generic PageRank algorithm, but these structures are very common on the web. The web graph consists of seven pages and contains three galleries which are all about penguins. The Artis and Dierenpark Emmen galleries are about the animals in the respective zoos. They link to each other and are referred to by Wikipedia and the Arctic Council. The third gallery is about the computer-animated movie Happy Feet, which is also about penguins. The Happy Feet gallery has an inlink from the Adobe website, which is always very popular because of all the inlinks from pages that require the Flash plug-in.

Figure 5: The example web graph that is used to demonstrate the desired effect of the new surfer models. The graph consists of three galleries (artis.nl/paginas/dierentuin/vogels/pinguins.html, dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm, warnerbros.com/happyfeet) and four non-galleries (en.wikipedia.org/wiki/penguin, arctic-council.org/, adobe.com/getflashplayer, adobe.com/cfusion/showcase/).

The structure of the graph leads to a high rank for the Adobe plug-in page; this has been one of the highest scoring pages in our experiments. Table 1 shows the results of a calculation with 20 iterations on the sample web graph.

Table 1: Ranking of the sample web graph: the rank of each of the seven pages (en.wikipedia.org/wiki/penguin, dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm, artis.nl/paginas/t/dierentuin/vogels/pinguins.html, adobe.com/go/getflashplayer, adobe.com/cfusion/showcase, warnerbros.com/happyfeet, arctic-council.org) under the uniform, generic, Levenshtein and ODP models.

The generic PageRank values are as expected. The rank of the Wikipedia page is 0.15, because there are no inlinks that convey rank to this page. The getflashplayer page receives the highest rank, which is expected and is also the case on the real web. The Happy Feet page receives a large amount of rank from the Adobe showcase web page, and is by far the most popular gallery. The resulting gallery ranking is shown in Table 2.

1. warnerbros.com/happyfeet
2. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
3. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm (0.43)

Table 2: Static gallery ordering with random surfer model.

The PageRank based on the Levenshtein distance does not affect the ordering of the galleries in this example, but it does change the rank that is assigned to the nodes (Table 3). The amount of rank that is conveyed to the Happy Feet page is increased, because the other outlink of the showcase page is regarded as irrelevant: links that are added for navigational purposes within a website do not express value to that page. The outlinks from the Wikipedia page are rated as expected: the probabilities of the links to the Artis and Dierenpark Emmen galleries are equal, and the link to the Arctic Council is rated as less important.

1. warnerbros.com/happyfeet
2. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
3. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm (0.35)

Table 3: Static gallery ordering with Levenshtein surfer model.

The ODP-based PageRank algorithm clearly has a positive effect on the ranking. Links that take the user to a different topic are rated as less important, which can be seen in the corresponding probability matrix (Appendix B). As a result, the rank of the Happy Feet gallery is greatly reduced.

The galleries on the zoo websites are now the most highly ranked.

1. artis.nl/paginas/t/dierentuin/vogels/pinguins.html
2. dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm
3. warnerbros.com/happyfeet (0.17)

Table 4: Static gallery ordering with ODP surfer model.

6.2 real web graph

The example shows that the concept works, but the true test is to apply the algorithms to a real-world dataset. The queries have been selected to cover a number of different cases: queries that match a famous person, and a few queries that are very general, such as car or movie. The related queries are special; they generally use many more keywords in a boolean OR-query, which should result in galleries that concern the same topic. All the other queries are boolean AND-queries. The results can be seen in Table 8 (Appendix D).

When we look at the overall number of bad results, we see that the uniform scoring method performs best: it receives the highest quality score and the fewest bad results. The generic PageRank delivered the worst results; it performed better in only one query. The Levenshtein and ODP models performed roughly in between the others. However, if we group the queries into themes, the results show some differences. When searching for people, the Levenshtein scoring method performs slightly better, but in the soccer and travel-related queries the uniform scoring method is still the best. For the remaining broad queries it is up to the user to determine what they appreciate and what they mark as a bad result; the Levenshtein distance scoring method performs slightly better in these cases. The ODP method is slightly better than the generic PageRank method, but worse than the Levenshtein and uniform scoring methods. This is probably the result of an inaccurate classifier. It still shows that the random surfer model performs sub-optimally.


7 CONCLUSION

We have seen the results of the experiment, and in this chapter we draw some conclusions. We discuss the tested surfer models, the importance of a good dataset and the differences between image search and classical information retrieval. We conclude this chapter with some notes on the measurability of rankings on large datasets.

7.1 surfer-models

Previous tests with the OPIC algorithm have shown that generic PageRank has a detrimental effect on the quality of the search results when applied to this dataset. These results have been confirmed by the results of the random-surfer model. We expected that the ODP-surfer model would yield the best results, but it did not perform extremely well, although it was significantly better than the generic PageRank. This can probably be improved by using a better classifier. The Levenshtein-distance surfer was a long shot, and although it does not have a scientific background, it performed surprisingly well. This illustrates that even simple methods can improve the quality of the PageRank algorithm. The research that has been performed on Topic-sensitive PageRank [8] shows that the surfer model can be tuned to a specific topic. The authors had to calculate the PageRank 16 times, once for each category. Our research shows that this is not always required: if the links are weighted according to their relevance to the destination page, the algorithm performs better than the generic approach.

7.2 dataset

We have found that a well-managed dataset is required to construct the web graph. The Nutch crawler stores all the results, but it is up to the administrator to keep the data files. Even if all the files are available, it is still important to ensure that the dataset forms a web graph of one component, as many small components are not useful for the PageRank algorithm. As the dataset is quite large, it is important to reduce the number of nodes in the web graph, because otherwise the computations would become too complex. However, leaving links out can create a lot of separated components.

7.3 information retrieval

We have also found that the adaptations to the algorithm have a positive effect on the quality of the search results, but the uniform sorting method still performs better. This is explained by the fact that PageRank is good for information retrieval, but finding the best gallery for a given keyword requires more than that.

Another aspect that is not covered very well by the PageRank algorithm is time. For a number of queries that were used in the experiment, the age of a gallery is important. We would for example prefer images of FC Barcelona that were taken during their last game, but there are only a small number of links to those galleries, because they are very fresh.

7.4 measuring

It was difficult to get a good measurement of the quality of the ranking. There are plenty of methods to compare two orderings, like the τ distance measure [11]. It is very hard, however, to produce a reference ranking against which the quality can be measured, because the dataset is very large and diverse. It gets even more complicated when static and dynamic ranking are combined. It is hard to compare the results of searches with and without static boosting, and it is difficult to balance the dynamic and static influences. With a lot more fine-tuning of the parameters it might be possible to improve the quality of the search results.

8 FUTURE WORK

In this chapter we discuss topics that require further research.

8.1 level of detail

The level of detail of the web graph could be improved by using a more intelligent system. This system should be able to merge vertices that represent the same web page. One could for example remove any reference to sorting and page numbering. It is also possible to create custom filters for domains that do not perform well with the default approach, for example a filter that combines multiple Flickr URLs that lead to the same gallery.

8.2 tooling

The tools that have been used and written for this experiment are very rudimentary. They are very scalable, but offer little compatibility with the Nutch framework. With the current tools it is not possible to expand the dataset without recomputing all the ranks; it would be much more efficient to compute the ranks based on the previous results. Using these systems in a production environment would mean that roughly 20% of the resources would go into building, maintaining and updating the graph and ranks.

8.3 link prioritization

The behavior of the smart surfer is determined by the link traversal probability distribution. This distribution can be improved; we could for example create a blacklist for pages that do not deserve rank. Another option is to combine multiple probability functions (ODP, Levenshtein, other topic comparisons). It might be necessary to obtain more data to make a better estimate, for example the position of the link on a web page: links that are at the bottom of a long page are less likely to be clicked by a real user.

8.4 odp classifier

The ODP classifier could also be improved. We could increase the number of categories, or find more suitable (less ambiguous) categories. Furthermore, the URL of the page might not be enough to determine the topic. The contents of a page contain many more clues, which would make the system more robust. The structure of the application would in that case have to be changed, because the textual content of the destination of an outlink is not readily available.


A ODP CLASSIFIER

Table 5: ODP classifier precision and recall.

Category        Precision  Recall
Adult               81%      49%
Arts                59%      56%
Business            36%      83%
Computers           65%      37%
Games               67%      44%
Health              65%      35%
Home                64%      36%
Kids and Teens      32%      25%
News                44%      18%
Recreation          50%      28%
Science             63%      51%
Shopping            44%      11%
Society             59%      56%
Sports              75%      39%
Averages            50%      50%

Table 6: ODP classifier confusion matrix over the 14 categories (rows: true class; columns: predicted class).

B SAMPLE WEB GRAPH

Table 7: Probability matrix based on ODP topic matches between six of the seven pages of the sample web graph (en.wikipedia.org/wiki/penguin, dierenpark-emmen.nl/nl/dierenpark/favo_pinguins.htm, artis.nl/paginas/t/dierentuin/vogels/pinguins.html, adobe.com/go/getflashplayer, adobe.com/cfusion/showcase, warnerbros.com/happyfeet). Probabilities that are marked bold are part of the sample web graph.

C LUCENE SCORING

The score of query q for document d correlates to the cosine-distance or dot-product between the document and query vectors in a Vector Space Model (VSM) of Information Retrieval. A document whose vector is closer to the query vector in that model is scored higher. The score is computed as follows:

score(q, d) = coord(q, d) \cdot queryNorm(q) \cdot \sum_{t \in q} \big( tf(t \in d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t, d) \big)

where

1. tf(t ∈ d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score. The default computation for tf(t ∈ d) in DefaultSimilarity is:

tf(t \in d) = \sqrt{frequency}

2. idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq(t) (the number of documents in which the term t appears). This means rarer terms deliver a higher contribution to the total score. The default computation for idf(t) in DefaultSimilarity is:

idf(t) = 1 + \log\!\left(\frac{numDocs}{docFreq + 1}\right)

3. coord(q, d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search-time factor computed in coord(q, d) by the Similarity in effect at search time.

4. queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity is:

queryNorm(q) = queryNorm(sumOfSquaredWeights) = \frac{1}{\sqrt{sumOfSquaredWeights}}

The sum of squared weights (of the query terms) is computed by the query Weight object. For example, a boolean query computes this value as:

sumOfSquaredWeights = q.getBoost()^2 \cdot \sum_{t \in q} \big( idf(t) \cdot t.getBoost() \big)^2

5. t.getBoost() is a search-time boost of term t in the query q, as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing the boost of one term in a multi-term query; rather, multiple terms are represented in a query as multiple TermQuery objects, and so the boost of a term in the query is accessible by calling getBoost() on the sub-query.

6. norm(t, d) encapsulates a few (indexing-time) boost and length factors:

- Document boost: set by calling doc.setBoost() before adding the document to the index.
- Field boost: set by calling field.setBoost() before adding the field to a document.
- lengthNorm(field): computed when the document is added to the index, in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing time.

When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

norm(t, d) = doc.getBoost() \cdot lengthNorm(field) \cdot \prod_{f \in d \text{ named as } t} f.getBoost()

However, the resulting norm value is encoded as a single byte before being stored. At search time, the norm-byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss: it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75. Also notice that at search time it is too late to modify this norm part of the scoring, e.g. by using a different Similarity for search.

D LARGE DATASET RESULTS

Figure 6: One of the pages that people have scored in the user evaluation.


More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

A Survey on Web Information Retrieval Technologies

A Survey on Web Information Retrieval Technologies A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New York, Stony Brook Presented by Kajal Miyan Michigan State University Overview Web Information

More information

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES Introduction to Information Retrieval CS 150 Donald J. Patterson This content based on the paper located here: http://dx.doi.org/10.1007/s10618-008-0118-x

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

Design and Implementation of Agricultural Information Resources Vertical Search Engine Based on Nutch

Design and Implementation of Agricultural Information Resources Vertical Search Engine Based on Nutch 619 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 51, 2016 Guest Editors: Tichun Wang, Hongyang Zhang, Lei Tian Copyright 2016, AIDIC Servizi S.r.l., ISBN 978-88-95608-43-3; ISSN 2283-9216 The

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA

CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA CRAWLING THE WEB: DISCOVERY AND MAINTENANCE OF LARGE-SCALE WEB DATA An Implementation Amit Chawla 11/M.Tech/01, CSE Department Sat Priya Group of Institutions, Rohtak (Haryana), INDIA anshmahi@gmail.com

More information

The Topic Specific Search Engine

The Topic Specific Search Engine The Topic Specific Search Engine Benjamin Stopford 1 st Jan 2006 Version 0.1 Overview This paper presents a model for creating an accurate topic specific search engine through a focussed (vertical)

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. III (Jan.-Feb. 2017), PP 01-07 www.iosrjournals.org PageRank Algorithm Albi Dode 1, Silvester

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search

More information

Ranking in a Domain Specific Search Engine

Ranking in a Domain Specific Search Engine Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages An Enhanced Page Ranking Algorithm Based on eights and Third level Ranking of the ebpages Prahlad Kumar Sharma* 1, Sanjay Tiwari #2 M.Tech Scholar, Department of C.S.E, A.I.E.T Jaipur Raj.(India) Asst.

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

CLOUD COMPUTING PROJECT. By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma

CLOUD COMPUTING PROJECT. By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma CLOUD COMPUTING PROJECT By: - Manish Motwani - Devendra Singh Parmar - Ashish Sharma Instructor: Prof. Reddy Raja Mentor: Ms M.Padmini To Implement PageRank Algorithm using Map-Reduce for Wikipedia and

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

NYU CSCI-GA Fall 2016

NYU CSCI-GA Fall 2016 1 / 45 Information Retrieval: Personalization Fernando Diaz Microsoft Research NYC November 7, 2016 2 / 45 Outline Introduction to Personalization Topic-Specific PageRank News Personalization Deciding

More information

The PageRank Citation Ranking

The PageRank Citation Ranking October 17, 2012 Main Idea - Page Rank web page is important if it points to by other important web pages. *Note the recursive definition IR - course web page, Brian home page, Emily home page, Steven

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Topology-Based Spam Avoidance in Large-Scale Web Crawls

Topology-Based Spam Avoidance in Large-Scale Web Crawls Topology-Based Spam Avoidance in Large-Scale Web Crawls Clint Sparkman Joint work with Hsin-Tsang Lee and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M

More information

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server Authors: Sergey Brin, Lawrence Page Google, word play on googol or 10 100 Centralized system, entire HTML text saved Focused on high precision, even at expense of high recall Relies heavily on document

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview

More information

SEO Factors Influencing National Search Results

SEO Factors Influencing National Search Results SEO Factors Influencing National Search Results 1. Domain Age Domain Factors 2. Keyword Appears in Top Level Domain: Doesn t give the boost that it used to, but having your keyword in the domain still

More information

CS47300 Web Information Search and Management

CS47300 Web Information Search and Management CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page

More information

International Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining

International Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining Scientific Journal of Impact Factor (SJIF): 4.14 International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 e-issn (O): 2348-4470 p-issn (P): 2348-6406 A Review

More information

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Pagerank Scoring. Imagine a browser doing a random walk on web pages: Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

SEO ISSUES FOUND ON YOUR SITE (MARCH 29, 2016)

SEO ISSUES FOUND ON YOUR SITE (MARCH 29, 2016) www.advantageserviceco.com SEO ISSUES FOUND ON YOUR SITE (MARCH 29, 2016) This report shows the SEO issues that, when solved, will improve your site rankings and increase traffic to your website. 16 errors

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing

More information

Experimental study of Web Page Ranking Algorithms

Experimental study of Web Page Ranking Algorithms IOSR IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. II (Mar-pr. 2014), PP 100-106 Experimental study of Web Page Ranking lgorithms Rachna

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Search Engine Optimization (SEO) using HTML Meta-Tags

Search Engine Optimization (SEO) using HTML Meta-Tags 2018 IJSRST Volume 4 Issue 9 Print ISSN : 2395-6011 Online ISSN : 2395-602X Themed Section: Science and Technology Search Engine Optimization (SEO) using HTML Meta-Tags Dr. Birajkumar V. Patel, Dr. Raina

More information

Ranking Techniques in Search Engines

Ranking Techniques in Search Engines Ranking Techniques in Search Engines Rajat Chaudhari M.Tech Scholar Manav Rachna International University, Faridabad Charu Pujara Assistant professor, Dept. of Computer Science Manav Rachna International

More information

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) ' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search

More information

Chapter 3: Google Penguin, Panda, & Hummingbird

Chapter 3: Google Penguin, Panda, & Hummingbird Chapter 3: Google Penguin, Panda, & Hummingbird Search engine algorithms are based on a simple premise: searchers want an answer to their queries. For any search, there are hundreds or thousands of sites

More information

SEARCHMETRICS WHITEPAPER RANKING FACTORS Targeted Analysis for more Success on Google and in your Online Market

SEARCHMETRICS WHITEPAPER RANKING FACTORS Targeted Analysis for more Success on Google and in your Online Market 2018 SEARCHMETRICS WHITEPAPER RANKING FACTORS 2018 Targeted for more Success on Google and in your Online Market Table of Contents Introduction: Why ranking factors for niches?... 3 Methodology: Which

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Personalizing PageRank Based on Domain Profiles

Personalizing PageRank Based on Domain Profiles Personalizing PageRank Based on Domain Profiles Mehmet S. Aktas, Mehmet A. Nacar, and Filippo Menczer Computer Science Department Indiana University Bloomington, IN 47405 USA {maktas,mnacar,fil}@indiana.edu

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Why it Really Matters to RESNET Members

Why it Really Matters to RESNET Members Welcome to SEO 101 Why it Really Matters to RESNET Members Presented by Fourth Dimension at the 2013 RESNET Conference 1. 2. 3. Why you need SEO How search engines work How people use search engines

More information

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation

The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? A-Z of Digital Marketing Translation The Ultimate Digital Marketing Glossary (A-Z) what does it all mean? In our experience, we find we can get over-excited when talking to clients or family or friends and sometimes we forget that not everyone

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

Authoritative K-Means for Clustering of Web Search Results

Authoritative K-Means for Clustering of Web Search Results Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian

More information

Module 1: Internet Basics for Web Development (II)

Module 1: Internet Basics for Web Development (II) INTERNET & WEB APPLICATION DEVELOPMENT SWE 444 Fall Semester 2008-2009 (081) Module 1: Internet Basics for Web Development (II) Dr. El-Sayed El-Alfy Computer Science Department King Fahd University of

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

FILTERING OF URLS USING WEBCRAWLER

FILTERING OF URLS USING WEBCRAWLER FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,

More information

Search Engines. Charles Severance

Search Engines. Charles Severance Search Engines Charles Severance Google Architecture Web Crawling Index Building Searching http://infolab.stanford.edu/~backrub/google.html Google Search Google I/O '08 Keynote by Marissa Mayer Usablity

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

Link Analysis in Web Mining

Link Analysis in Web Mining Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Automatically Constructing a Directory of Molecular Biology Databases

Automatically Constructing a Directory of Molecular Biology Databases Automatically Constructing a Directory of Molecular Biology Databases Luciano Barbosa Sumit Tandon Juliana Freire School of Computing University of Utah {lbarbosa, sumitt, juliana}@cs.utah.edu Online Databases

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information