DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI
Department of Computer Science and Engineering
CS6007 INFORMATION RETRIEVAL
Anna University 2 & 16 Mark Questions & Answers
Year / Semester: IV / VII
Regulation: 2013
Academic year:

UNIT III WEB SEARCH ENGINE INTRODUCTION AND CRAWLING

Part A Question Bank

1. Define web server.
A web server is a computer connected to the Internet that runs a program responsible for storing, retrieving and distributing some of the web's files.

2. What is a web browser?
A web browser is a program used to communicate with web servers on the Internet, which enables it to download and display web pages. Netscape Navigator and Microsoft Internet Explorer are among the most popular browsers available in the market.

3. Explain paid submission as a search service.
In paid submission, the user submits a website for review by a search service for a preset fee, with the expectation that the site will be accepted and included in that company's search engine, provided it meets the stated guidelines for submission. Yahoo! is the major search engine that accepts this type of submission. While paid submission guarantees a timely review of the submitted site and notice of acceptance or rejection, you are not guaranteed inclusion or a particular placement order in the listings.

4. Explain paid inclusion programs of search services.
Paid inclusion programs allow you to submit your website for guaranteed inclusion in a search engine's database of listings for a set period of time. While paid inclusion guarantees indexing of submitted pages or sites in a search database, you are not guaranteed that the pages will rank well for particular queries.

5. Explain pay-for-placement as a search service.
In pay-for-placement, you can guarantee a ranking in a search listing for the terms of your choice. Also known as paid placement, paid listings, or sponsored listings, this program guarantees placement in search results. The leaders in pay-for-placement are Google, Yahoo! and Bing.

6. Define Search Engine Optimization.
Search Engine Optimization (SEO) is the act of modifying a website to increase its ranking in the organic, crawler-based listings of search engines. There are several ways to increase the visibility of your website through the major search engines on the Internet today.

The two most common forms of internet marketing are paid placement and natural placement.

7. Describe the benefits of SEO.
   - Increase your search engine visibility.
   - Generate more traffic from the major search engines.
   - Make sure your website and business get noticed and visited.
   - Grow your client base and increase business revenue.

8. Explain the difference between SEO and pay-per-click.
   - SEO results take 2 weeks to 4 months; pay-per-click produces results in 1-2 days.
   - With SEO it is very difficult to control the flow of traffic; pay-per-click can be turned on and off at any moment.
   - SEO requires ongoing learning and experience; pay-per-click makes it easier for a novice to reap results.
   - SEO is more difficult to target at local markets; pay-per-click has the ability to target local markets.
   - SEO is better for long-term, lower-margin campaigns; pay-per-click is better for short-term, high-margin campaigns.
   - SEO is generally more cost-effective and does not penalize you for more traffic; pay-per-click is generally more costly per visitor and per conversion.

9. What is a web crawler?
A web crawler is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

10. Define focused crawler.
A focused crawler, or topical crawler, is a web crawler that attempts to download only pages that are relevant to a pre-defined topic or set of topics.

11. What is hard and soft focused crawling?
In hard focused crawling, the classifier is invoked on a newly crawled document in the standard manner. When it returns the best matching category path, the out-neighbors of the page are checked into the crawl database if and only if some node on the best matching category path is marked as good.

In soft focused crawling, all out-neighbors of a visited page are checked into the crawl database, but their crawl priority is based on the relevance of the current page.

12. What is near-duplicate detection?
Near-duplicate detection is the task of identifying documents with almost identical content. Near-duplicate web documents are abundant: two such documents may differ from each other only in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search.

13. What are the requirements of XML information retrieval systems?
   - A query language that allows users to specify the nature of relevant components, in particular with respect to their structure.
   - Representation strategies providing a description not only of the content of XML documents, but also of their structure.
   - Ranking strategies that determine the most relevant elements and rank them appropriately for a given query.

PART-B (16 Marks)

1. Explain the concept of Web search (review).
The Web consists of pages, each of which can be addressed by an identifier called a Uniform Resource Locator (URL); a website is a set of pages published together. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. With powerful workstations and parallel processing, efficiency is not the bottleneck. A user can search for any information by passing a query in the form of keywords or a phrase; the search engine then searches for relevant information in its database and returns it to the user. Web search engines discover pages by crawling the Web, finding new pages by following hyperlinks. Access to particular web pages may be restricted in various ways; the set of pages which cannot be included in search engine indexes is often called the hidden (or deep) Web. The search engine looks for the keywords in the index of a predefined database instead of going directly to the Web to search for them. The pages in this database are gathered by a software component known as a web crawler. Once the web crawler has found the pages, the search engine shows the relevant web pages as results. These retrieved results generally include the title of the page, the size of the text portion, the first several sentences, and so on.

2. Explain the structure of the web.
Bow-Tie Structure of the Web: One of the intriguing findings of a large crawl of the Web was that the Web has a bow-tie structure, as shown in the figure. The central core of the Web (the knot of the bow-tie) is the strongly connected component (SCC), which means that for any two pages in the SCC, a user can navigate from one of them to the other and back by clicking on links embedded in the pages encountered. In other words, a user browsing a page in the SCC can always reach any other page in the SCC by traversing some path of links. The left bow, called IN, contains pages that have a directed path of links leading to the SCC. The right bow, called OUT, contains pages that can be reached from the SCC by following a directed path of links. A web page in Tubes has a directed path from IN to OUT bypassing the SCC, and a page in Tendrils can either be reached from IN or leads into OUT. The pages in Disconnected are not even weakly connected to the SCC; that is, even if we ignored the fact that hyperlinks only allow forward navigation, allowing them to be traversed backwards as well as forwards, we still could not reach the SCC from them.
Fig: Bow-Tie Shape of the Web
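The bow-tie decomposition rests on computing strongly connected components of the link graph. The sketch below is only an illustration on a hypothetical five-page toy graph (not part of the prescribed answer), using Kosaraju's two-pass depth-first search in Python:

```python
from collections import defaultdict

def strongly_connected_components(graph):
    """Kosaraju's algorithm: 'graph' maps each page to the list of pages it
    links to. Returns a list of SCCs (sets of pages)."""
    # Pass 1: order vertices by finish time on the original graph.
    visited, order = set(), []
    def dfs(node):
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                dfs(nxt)
        order.append(node)
    for node in graph:
        if node not in visited:
            dfs(node)

    # Pass 2: DFS on the reversed graph in reverse finish order.
    reverse = defaultdict(list)
    for node, links in graph.items():
        for nxt in links:
            reverse[nxt].append(node)
    assigned, components = set(), []
    def collect(node, component):
        assigned.add(node)
        component.add(node)
        for nxt in reverse[node]:
            if nxt not in assigned:
                collect(nxt, component)
    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components

# Toy web graph: A, B, C form the SCC; D is in IN; E is in OUT.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "E"], "D": ["A"], "E": []}
print(strongly_connected_components(toy_web))  # [{'D'}, {'A', 'B', 'C'}, {'E'}]
```

Pages outside the largest component can then be classified as IN, OUT, Tendrils or Disconnected by checking reachability to and from the SCC.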

3. Explain paid placement with a neat diagram.
In this scheme the search engine separates its query results list into two parts: (i) an organic list, which contains the free, unbiased results, displayed according to the search engine's ranking, and (ii) a sponsored list, which is paid for by advertising managed with the aid of an online auction mechanism. This method of payment is called pay-per-click (PPC), also known as cost-per-click (CPC), since payment is made by the advertiser each time a user clicks on the link in the sponsored listing. In most cases the organic and sponsored lists are kept separate, but an alternative model is to interleave the organic and sponsored results within a single listing.
Example: Pay-per-click is calculated by dividing the advertising cost by the number of clicks generated by an advertisement. The basic formula is:
Pay-per-click ($) = Advertising cost ($) / Ads clicked (#)
There are two primary models for determining pay-per-click: flat-rate and bid-based. In both cases, the advertiser must consider the potential value of a click from a given source. This value is based on the type of individual the advertiser is expecting to receive as a visitor to his or her website, and what the advertiser can gain from that visit, usually revenue, both in the short term as well as in the long term.
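A quick worked illustration of the pay-per-click formula (the cost and click figures are invented for the example):

```python
def pay_per_click(advertising_cost, ads_clicked):
    """Pay-per-click ($) = Advertising cost ($) / Ads clicked (#)."""
    return advertising_cost / ads_clicked

# Hypothetical campaign: $150 spent, 300 clicks received.
print(pay_per_click(150.0, 300))  # 0.5 dollars per click
```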

There are three types of paid search services:
a) Paid submission
b) Paid inclusion
c) Pay-for-placement
a) Paid Submission: The user submits a website for review by a search service for a preset fee, with the expectation that the site will be accepted and included in that company's search engine, provided it meets the stated guidelines for submission. Yahoo! is the major search engine that accepts this type of submission. While paid submission guarantees a timely review of the submitted site and notice of acceptance or rejection, you are not guaranteed inclusion or a particular placement order in the listings.
b) Paid Inclusion: This allows you to submit your website for guaranteed inclusion in a search engine's database of listings for a set period of time. While paid inclusion guarantees indexing of submitted pages or sites in a search database, you are not guaranteed that the pages will rank well for particular queries.
c) Pay-for-placement: You can guarantee a ranking in a search listing for the terms of your choice. Also known as paid placement, paid listings, or sponsored listings, this program guarantees placement in search results. The leaders in pay-for-placement are Google, Yahoo! and Bing.

4. Explain the search engine optimization/spam concept.
Spam: In a search engine whose scoring was based on term frequencies, a web page with numerous repetitions of chosen terms would rank highly. This led to the first generation of spam, which (in the context of web search) is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords. Many web content creators have commercial motives and therefore stand to gain from manipulating search engine results. Search engines soon became sophisticated enough in their spam detection to screen out a large number of repetitions of particular keywords. Spammers responded with a richer set of spam techniques, the best known of which we now describe. The first of these techniques is cloaking, shown in the figure: the spammer's web server returns different pages depending on whether the HTTP request comes from a web search engine's crawler or from a user's browser. When the user searches for these keywords and elects to view the page, he receives a web page that has altogether different content from that

indexed by the search engine.
SEO: Given that spamming is inherently an economically motivated activity, there has sprung up around it an industry of Search Engine Optimizers, or SEOs, who provide consultancy services for clients seeking to have their web pages rank highly for selected keywords. A page's ranking is measured by the position of the web page displayed in the search engine results. If a search engine puts your web page in the first position, then your page rank will be number 1 and it will be considered the page with the highest rank. SEO is the process of designing and developing a website to attain a high rank in search engine results. Conceptually, there are two kinds of optimization:
On-Page SEO - It includes providing good content, good keyword selection, putting keywords in the correct places, giving an appropriate title to every page, etc.
Off-Page SEO - It includes link building, increasing link popularity by submitting to open directories and search engines, link exchange, etc.
On-page SEO:
1. Title optimization: An HTML TITLE tag is put inside the head tag. The page title (not to be confused with the heading of a page) is what is displayed in the title bar of your browser window. Correct use of keywords in the title of every page of your website is extremely important.
2. Header and bold tags
3. Keyword usage
4. Link structure
5. Domain name strategy
6. Alt tags
7. Meta description
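As a small illustration of on-page title optimization, the following sketch extracts a page's TITLE with Python's standard html.parser and checks it against a keyword list; the 60-character length limit and the sample page are assumptions for the example, not official SEO rules:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text inside the <title> tag of an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def check_title(html, keywords, max_length=60):
    """Report whether the title exists, stays reasonably short, and mentions
    the target keywords (the length limit is an illustrative assumption)."""
    parser = TitleExtractor()
    parser.feed(html)
    title = parser.title.strip()
    return {
        "title": title,
        "within_length": len(title) <= max_length,
        "missing_keywords": [k for k in keywords
                             if k.lower() not in title.lower()],
    }

page = "<html><head><title>Information Retrieval Notes</title></head></html>"
print(check_title(page, ["information retrieval", "crawling"]))
```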

Off-Page SEO:
1. Anchor text
2. Link building
3. Paid links
Benefits:
1. Increase search engine visibility.
2. Generate more traffic from the major search engines.
3. Make sure your website and business get noticed and visited.
4. Grow your client base and increase business revenues.

5. Explain the Web search architectures with a diagram.
a) Centralized Architecture: Most search engines use a centralized crawler-indexer architecture. Crawlers are programs (software agents) that traverse the Web, sending new or updated pages to a main server where they are indexed. Crawlers are also called robots, spiders, wanderers, walkers, and knowbots. In spite of their name, a crawler does not actually move to and run on remote machines; rather, the crawler runs on a local system and sends requests to remote Web servers. The index is used in a centralized fashion to answer queries submitted from different places on the Web. The following figure shows the software architecture of a search engine based on the AltaVista architecture. It has two parts: one that deals with the users, consisting of the user interface and the query engine, and another that consists of the crawler and indexer modules.

Problems:
1) The main problem faced by this architecture is the gathering of the data, because of the highly dynamic nature of the Web, the saturated communication links, and the high load at Web servers.
2) Another important problem is the sheer volume of the data.
b) Distributed Architecture: There are several variants of the crawler-indexer architecture. Among them, the most important is Harvest. Harvest uses a distributed architecture to gather and distribute data, which is more efficient than the crawler architecture. The main drawback is that Harvest requires the coordination of several web servers. The Harvest distributed approach addresses several of the problems of the crawler-indexer architecture, such as: (1) Web servers receive requests from different crawlers, increasing their load; (2) Web traffic increases because crawlers retrieve entire objects, but most of their content is discarded; and (3) information is gathered independently by each crawler, without coordination between all the search engines. To solve these problems, Harvest introduces two main elements: gatherers and brokers. A gatherer collects and extracts indexing information from one or more Web servers. Gathering times are defined by the system and are periodic (i.e. there are harvesting times, as the name of the system suggests). A broker provides the indexing mechanism and the query interface to the gathered data. Brokers retrieve information from one or more gatherers or other brokers, incrementally updating their indices. Depending on the configuration of gatherers and brokers, different improvements in server load and network traffic can be achieved. A replicator can be used to replicate servers, enhancing user-base scalability. For example, the registration broker can be replicated in different geographic regions to allow faster access. Replication can also be used to divide the gathering process between many Web

servers. Finally, the object cache reduces network and server load, as well as response latency when accessing Web pages.

6. Explain the process of web crawling.
Web crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them.
Features of a Crawler:
Robustness: Ability to handle spider traps. The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
Performance and efficiency: The crawl system should make efficient use of various system resources, including processor, storage and network bandwidth.
Quality: The crawler should be biased towards fetching useful pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages.
Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.
Basic Operation: The crawler begins with one or more URLs that constitute a seed set. It picks a URL from this seed set, and then fetches the web page at that URL. The fetched page is then parsed, to extract both the text and the links from the page (each of which points to another URL). The extracted text is fed to a text indexer. The extracted links (URLs) are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. Initially, the URL frontier contains the seed set; as pages are fetched, the corresponding URLs are deleted from the URL frontier. The entire process may be viewed as traversing the web graph. In continuous crawling, the URL of a fetched page is added back to the frontier for fetching again in the future.
Web Crawler Architecture: The simple scheme outlined above demands several modules that fit together as shown in the figure:
1) The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous crawling, a URL may have been fetched previously but is back in the frontier for re-fetching).
2) A DNS resolution module that determines the web server from which to fetch the page specified by a URL.
3) A fetch module that uses the HTTP protocol to retrieve the web page at a URL.
4) A parsing module that extracts the text and set of links from a fetched web page.
5) A duplicate elimination module that determines whether an extracted link is already in the URL frontier or has recently been fetched.
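A minimal, single-threaded sketch of this basic operation in Python is shown below. It keeps a frontier and a seen-set, fetches with urllib and extracts links with html.parser; DNS caching, politeness delays, robots.txt handling and text indexing are deliberately omitted, and the seed URL is a placeholder:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Parsing module: collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)          # URL frontier
    seen = set(seed_urls)                # duplicate elimination
    fetched = {}                         # url -> raw HTML (stand-in for the indexer)
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip unreachable or non-HTML pages
        fetched[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:     # add unseen links back to the frontier
                seen.add(absolute)
                frontier.append(absolute)
    return fetched

# Example (the seed URL is a placeholder):
# pages = crawl(["https://example.com/"], max_pages=5)
```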

Crawling is performed by anywhere from one to potentially hundreds of threads, each of which loops through the logical cycle shown in the figure above. These threads may be run in a single process, or be partitioned amongst multiple processes running at different nodes of a distributed system. We begin by assuming that the URL frontier is in place and non-empty. We follow the progress of a single URL through the cycle of being fetched, passing through various checks and filters, then finally (for continuous crawling) being returned to the URL frontier.

7. What is a metacrawler? Explain the concept with a neat diagram.
Metasearchers / metacrawlers are Web servers that send a given query to several search engines, Web directories and other databases, collect the answers and unify them. Examples are MetaCrawler and SavvySearch. The main advantages of metasearchers are the ability to combine the results of many sources and the fact that the user can pose the same query to various sources through a single common interface. Metasearchers differ from each other in how ranking is performed in the unified result (in some cases no ranking is done), and how well they translate the user query to the specific query language of each search engine or Web directory (the query language common to all of them could be small). The following table shows the URLs of the main metasearch engines as well as the number of search engines, Web directories and other databases

that they search. Metasearchers can also run on the client, for example Copernic, EchoSearch, WebFerret, WebCompass, and WebSeeker. There are others that search several sources and show the different answers in separate windows, such as All4One, OneSeek, Proteus, and Search Spaniel. The advantages of metasearchers are that the results can be sorted by different attributes such as host, keyword, date, etc., which can be more informative than the output of a single search engine. Therefore browsing the results should be simpler. On the other hand, the result is not necessarily all the Web pages matching the query, as the number of results per search engine retrieved by the metasearcher is limited (it can be changed by the user, but there is an upper limit). Nevertheless, pages returned by more than one search engine should be more relevant. We expect that new metasearchers will do better ranking. A first step in this direction is the NEC Research Institute metasearch engine, Inquirus. The main difference is that Inquirus actually downloads and analyzes each Web page obtained and then displays each page, highlighting the places where the query terms were found. The results are displayed as soon as they are available, in a progressive manner, otherwise the waiting time would be too long. This technique also allows non-existent pages, or pages that have changed and do not contain the query any more, to be discarded, and, more importantly, provides for better ranking than normal search engines. On the other hand, this metasearcher is not available to the general public. The use of metasearchers is justified by coverage studies that show that a small percentage of Web pages are in all search engines. In fact, fewer than 1% of the Web pages indexed by AltaVista, HotBot, Excite, and Infoseek are in all of those search engines. This fact is quite surprising and has not been explained.
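A rough sketch of the unify-and-rank step of a metasearcher: the two engine adapters are hypothetical stand-ins that return canned results, and the rank-based scoring rule is an illustrative heuristic rather than any particular system's method:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-engine adapters: in a real metasearcher each of these
# would issue an HTTP request to one engine and normalise its response.
# Here they return canned (url, rank) lists so the sketch is runnable.
def engine_a(query):
    return [("http://a.example/1", 1), ("http://shared.example/x", 2)]

def engine_b(query):
    return [("http://shared.example/x", 1), ("http://b.example/9", 2)]

def metasearch(query, engines):
    """Send the same query to every engine in parallel and unify the results.
    Scoring heuristic (an assumption): a page appearing in several engines,
    or near the top of any list, scores higher."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda e: e(query), engines))
    scores = {}
    for results in result_lists:
        for url, rank in results:
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(metasearch("information retrieval", [engine_a, engine_b]))
```

Pages returned by more than one engine accumulate score from each list, which matches the observation above that such pages tend to be more relevant.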

8. Explain the focused crawling concept.
Some users would like a search engine that focuses on a specific topic of information. For instance, at a website about movies, users might want access to a search engine that leads to more information about movies. If built correctly, this type of vertical search can provide higher accuracy than general search because of the lack of extraneous information in the document collection. The computational cost of running a vertical search will also be much less than a full web search, simply because the collection will be much smaller.
The most accurate way to get web pages for this kind of engine would be to crawl a full copy of the Web and then throw out all unrelated pages. This strategy requires a huge amount of disk space and bandwidth, and most of the web pages will be discarded at the end. A less expensive approach is focused, or topical, crawling. A focused crawler attempts to download only those pages that are about a particular topic.
Focused crawlers rely on the fact that pages about a topic tend to have links to other pages on the same topic. If this were perfectly true, it would be possible to start a crawl at one on-topic page, then crawl all pages on that topic just by following links from a single root page. In practice, a number of popular pages for a specific topic are typically used as seeds. Focused crawlers require some automatic means for determining whether a page is about a particular topic. Text classifiers are tools that can make this kind of distinction. Once a page is downloaded, the crawler uses the classifier to decide whether the page is on topic. If it is, the page is kept, and links from the page are used to find other related sites. The anchor text in the outgoing links is an important clue of topicality. Also, some pages have more on-topic links than others. As links from a particular web page are visited, the crawler can keep track of the topicality of the downloaded pages and use this to determine whether to download other similar pages. Anchor text data and page link topicality data can be combined together in order to determine which pages should be crawled next.
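The priority-queue idea behind focused crawling can be sketched as follows; the keyword-overlap score is a toy stand-in for a trained text classifier, and fetch and extract_links are assumed helper functions supplied by the caller:

```python
import heapq

TOPIC_TERMS = {"movie", "film", "actor", "director", "cinema"}

def on_topic_score(text):
    """Toy 'classifier': fraction of topic terms present in the text.
    A real focused crawler would use a trained text classifier here."""
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)

def focused_crawl(seeds, fetch, extract_links, threshold=0.2, max_pages=100):
    """'fetch' and 'extract_links' are assumed helpers supplied by the caller:
    fetch(url) -> page text, extract_links(url, text) -> [(anchor_text, url)]."""
    frontier = [(-1.0, url) for url in seeds]    # max-priority via negated scores
    heapq.heapify(frontier)
    seen, kept = set(seeds), []
    while frontier and len(kept) < max_pages:
        _priority, url = heapq.heappop(frontier)
        text = fetch(url)
        score = on_topic_score(text)
        if score < threshold:
            continue                             # off-topic page: keep nothing, follow no links
        kept.append((url, score))
        for anchor, link in extract_links(url, text):
            if link not in seen:
                seen.add(link)
                # Anchor text and the parent page's topicality guide the priority.
                link_priority = score + on_topic_score(anchor)
                heapq.heappush(frontier, (-link_priority, link))
    return kept
```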

Web size measurement:
To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes: some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages; for instance, a request for a non-existent page at Yahoo! may return a valid HTML page rather than an error, politely informing the user that there is no such page. Such a "soft 404 error" is only one example of many ways in which web servers can generate an infinite number of valid web pages. Indeed, some of these are malicious spider traps devised to keep a crawler trapped within that site. We could ask the following better-defined question: given two search engines, what are the relative sizes of their indexes? Even this question turns out to be imprecise, because:
1. In response to queries a search engine can return web pages whose contents it has not (fully or even partially) indexed. For one thing, search engines generally index only the first few thousand words in a web page.
2. Search engines generally organize their indexes in various tiers and partitions, not all of which are examined on every search. For instance, a web page deep inside a website may be indexed but not retrieved on general web searches; it is, however, retrieved as a result on a search that a user has explicitly restricted to that website.
Thus, search engine indexes include multiple classes of indexed pages, so that there is no single measure of index size. These issues notwithstanding, a number of techniques have been devised for crude estimates of the ratio of the index sizes of two search engines, E1 and E2. The basic hypothesis underlying these techniques is that each search engine indexes a fraction of the Web chosen independently and uniformly at random. This involves some questionable assumptions: first, that there is a finite size for the Web from which each search engine chooses a subset, and second, that each engine chooses an independent, uniformly chosen subset.
CAPTURE-RECAPTURE METHOD: Pick a random page from one engine and test whether it is present in the other, and vice versa. If x is the fraction of pages sampled from E1 that are found in E2, and y is the fraction of pages sampled from E2 that are found in E1, then |E1| / |E2| = y / x. If our assumption about E1 and E2 being independent and uniform random subsets of the Web were true, and our sampling process unbiased, then this equation should give us an unbiased estimator for |E1| / |E2|. We distinguish between two scenarios here. Either the measurement is performed by someone with access to the index of one of the search engines (say an employee of E1), or the measurement is performed by an independent party with no access to the innards of either search engine. In the former case, we can simply pick a random document from one index. The latter case is more

challenging: we must pick a random page from one search engine from outside that search engine, and then verify whether the random page is present in the other search engine. To implement the sampling phase, we might generate a random page from the entire (idealized, finite) Web and test it for presence in each search engine. Unfortunately, picking a web page uniformly at random is a difficult problem. We briefly outline several attempts to achieve such a sample, pointing out the biases inherent in each; following this we describe in some detail one technique that much research has built on.
1. Random searches: Begin with a search log of web searches; send a random search from this log to E1 and pick a random page from the results. Since such logs are not widely available outside a search engine, one implementation is to trap all search queries going out of a work group (say scientists in a research center) that agrees to have all its searches logged. This approach has a number of issues, including the bias from the types of searches made by the work group. Further, a random document from the results of such a random search to E1 is not the same as a random document from E1.
2. Random IP addresses: A second approach is to generate random IP addresses and send a request to a web server residing at the random address, collecting all pages at that server. The biases here include the fact that many hosts might share one IP address (due to a practice known as virtual hosting) or not accept HTTP requests from the host where the experiment is conducted. Furthermore, this technique is more likely to hit one of the many sites with few pages, skewing the document probabilities; we may be able to correct for this effect if we understand the distribution of the number of pages on websites.
3. Random walks: If the web graph were a strongly connected directed graph, we could run a random walk starting at an arbitrary web page. This walk would converge to a steady-state distribution, from which we could in principle pick a web page with a fixed probability. This method, too, has a number of biases. First, the Web is not strongly connected, so that even with various corrective rules it is difficult to argue that we can reach a steady-state distribution starting from any page. Second, the time it takes for the random walk to settle into this steady state is unknown and could exceed the length of the experiment.
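A small simulation of the capture-recapture estimator on synthetic data (the index sizes and sample sizes are arbitrary choices for the demonstration):

```python
import random

def estimate_size_ratio(sample_e1, sample_e2, in_e1, in_e2):
    """Capture-recapture estimate of |E1| / |E2|.
    sample_e1, sample_e2: random pages drawn from each engine's index;
    in_e1(url), in_e2(url): membership tests (in practice, test queries
    against the other engine). Returns y / x as derived above."""
    x = sum(in_e2(url) for url in sample_e1) / len(sample_e1)  # fraction of E1 also in E2
    y = sum(in_e1(url) for url in sample_e2) / len(sample_e2)  # fraction of E2 also in E1
    return y / x

# Synthetic check with known index sizes (|E1| = 2000, |E2| = 1000):
web = [f"page{i}" for i in range(10_000)]
e1 = set(random.sample(web, 2000))
e2 = set(random.sample(web, 1000))
ratio = estimate_size_ratio(random.sample(sorted(e1), 500),
                            random.sample(sorted(e2), 500),
                            lambda u: u in e1, lambda u: u in e2)
print(f"estimated |E1|/|E2| = {ratio:.2f} (true value 2.00)")
```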

9. Explain random queries with an equation.
The idea is to pick a page (almost) uniformly at random from a search engine's index by posing a random query to it. It should be clear that picking a set of random terms from (say) a standard dictionary is not a good way of implementing this idea. For one thing, not all vocabulary terms occur equally often, so this approach will not result in documents being chosen uniformly at random from the search engine. For another, there are a great many terms in web documents that do not occur in a standard dictionary. To address the problem of vocabulary terms not in a standard dictionary, we begin by amassing a sample web dictionary. This could be done by crawling a limited portion of the Web, or by crawling a manually assembled representative subset of the Web such as Yahoo!. Consider a conjunctive query with two or more randomly chosen words from this dictionary.
Operationally, we proceed as follows: we use a random conjunctive query on E1 and pick from the top 100 returned results a page p at random. We then test p for presence in E2 by choosing 6-8 low-frequency terms in p and using them in a conjunctive query for E2. We can improve the estimate by repeating the experiment a large number of times. Both the sampling process and the testing process have a number of issues.
1. Our sample is biased towards longer documents.
2. Picking from the top 100 results of E1 induces a bias from the ranking algorithm of E1. Picking from all the results of E1 makes the experiment slower. This is particularly so because most web search engines put up defenses against excessive robotic querying.
3. During the checking phase, a number of additional biases are introduced: for instance, E2 may not handle 8-word conjunctive queries properly.
4. Either E1 or E2 may refuse to respond to the test queries, treating them as robotic spam rather than as bona fide queries.
5. There could be operational problems like connection time-outs.
A sequence of research has built on this basic paradigm to eliminate some of these issues; there is no perfect solution yet, but the level of sophistication in statistics for understanding the biases is increasing. The main idea is to address biases by estimating, for each document, the magnitude of the bias. From this, standard statistical sampling methods can generate unbiased samples. In the checking phase, the newer work moves away from conjunctive queries to phrase and other queries that appear to be better behaved. Finally, newer experiments use other sampling methods besides random queries. The best known of these is document random walk sampling, in which a document is chosen by a random walk on a virtual graph derived from documents.

In this graph, nodes are documents; two documents are connected by an edge if they share two or more words in common. The graph is never instantiated; rather, a random walk on it can be performed by moving from a document d to another by picking a pair of keywords in d, running a conjunctive query on a search engine and picking a random document from the results.
Near-duplicate detection:
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Many of these are legitimate copies; for instance, certain information repositories are mirrored simply to provide redundancy and access reliability. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads. The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and, if so, declare one of them to be a duplicate copy of the other. This simplistic approach fails to capture a crucial and widespread phenomenon on the Web: near duplication. In many cases, the contents of one web page are identical to those of another except for a few characters, say a notation showing the date and time at which the page was last modified. Even in such cases, we want to be able to declare the two pages to be close enough that we only index one copy. Short of exhaustively comparing all pairs of web pages, an infeasible task at the scale of billions of pages, how can we detect and filter out such near duplicates?
We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as SHINGLING. Given a positive integer k and a sequence of terms in a document d, define the k-shingles of d to be the set of all consecutive sequences of k terms in d. As an example, consider the following text: a rose is a rose is a rose. The 4-shingles for this text (k = 4 is a typical value used in the detection of near-duplicate web pages) are "a rose is a", "rose is a rose" and "is a rose is". The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, and then develop a method for efficiently computing and comparing the sets of shingles for all web pages. Let S(dj) denote the set of shingles of document dj. The Jaccard coefficient measures the degree of overlap between the sets S(d1) and S(d2) as |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|; denote this by J(S(d1), S(d2)). Our test for near duplication between d1 and d2 is to compute this Jaccard coefficient; if it exceeds a preset threshold (say, 0.9), we declare them near duplicates and eliminate one from indexing.
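A direct sketch of k-shingling and the Jaccard test; the second document and the 0.9 threshold mentioned above are used only for illustration:

```python
def k_shingles(text, k=4):
    """Set of all consecutive k-term sequences in the text."""
    terms = text.lower().split()
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(set1, set2):
    """|intersection| / |union| of two shingle sets."""
    if not set1 and not set2:
        return 1.0
    return len(set1 & set2) / len(set1 | set2)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a flower"
s1, s2 = k_shingles(d1), k_shingles(d2)
print(sorted(s1))                 # the three distinct 4-shingles of d1
print(round(jaccard(s1, s2), 2))  # near-duplicate if above a threshold such as 0.9
```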

However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.
Fig: Illustration of shingle sketches. We see two documents going through four stages of shingle sketch computation. In the first step (top row), we apply a 64-bit hash to each shingle from each document to obtain H(d1) and H(d2) (circles). Next, we apply a random permutation π to permute H(d1) and H(d2), obtaining Π(d1) and Π(d2) (squares). The third row shows only Π(d1) and Π(d2), while the bottom row shows the minimum values x1^π and x2^π for each document.
To avoid this, we use a form of hashing. First, we map every shingle into a hash value over a large space, say 64 bits. For j = 1, 2, let H(dj) be the corresponding set of 64-bit hash values derived from S(dj). We now invoke the following trick to detect document pairs whose sets H() have large Jaccard overlaps. Let π be a random permutation from the 64-bit integers to the 64-bit integers. Denote by Π(dj) the set of permuted hash values in H(dj); thus for each h in H(dj), there is a corresponding value π(h) in Π(dj). Let xj^π be the smallest integer in Π(dj). Then J(S(d1), S(d2)) = P(x1^π = x2^π).
We give the proof in a slightly more general setting: consider a family of sets whose elements are drawn from a common universe. View the sets as columns of a matrix A, with one row for each element in the universe. The element aij = 1 if element i is present in the set Sj that the j-th column represents. Let Π be a random permutation of the rows of A; denote by Π(Sj) the column that results from applying Π to the j-th column. Finally, let xj^π be the index of the first row in which the column Π(Sj) has a 1. We then prove that for any two columns j1 and j2, P(xj1^π = xj2^π) = J(Sj1, Sj2). If we can prove this, the theorem follows.
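The permutation trick can be simulated in a few lines. In this sketch, seeded hashing stands in for a true random permutation of 64-bit integers, and the two shingle sets are toy literals with a known Jaccard coefficient of 0.5:

```python
import random

def min_hashes(shingles, seeds):
    """For each seed (standing in for one random permutation of the 64-bit
    integers), keep the minimum hash value over the document's shingles."""
    return [min(hash((seed, s)) for s in shingles) for seed in seeds]

def estimate_jaccard(shingles1, shingles2, num_permutations=500):
    """Estimate J(S(d1), S(d2)) as the fraction of 'permutations' on which
    the two documents agree on the minimum value."""
    seeds = [random.getrandbits(64) for _ in range(num_permutations)]
    m1 = min_hashes(shingles1, seeds)
    m2 = min_hashes(shingles2, seeds)
    return sum(a == b for a, b in zip(m1, m2)) / num_permutations

# Two small shingle sets whose true Jaccard coefficient is 2/4 = 0.5:
s1 = {"a rose is a", "rose is a rose", "is a rose is"}
s2 = {"a rose is a", "rose is a rose", "is a rose was"}
print(round(estimate_jaccard(s1, s2), 2))  # close to 0.5 on most runs
```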

Fig: Two sets Sj1 and Sj2; their Jaccard coefficient is 2/5.

XML retrieval:
An XML document is an ordered, labeled tree. Each node of the tree is an XML ELEMENT and is written with an opening and a closing tag. An element can have one or more XML ATTRIBUTEs. In the XML document in Figure 1, the scene element is enclosed by the two tags <scene...> and </scene>. It has an attribute number with value vii and two child elements, title and verse. Figure 2 shows Figure 1 as a tree. The leaf nodes of the tree consist of text, e.g., Shakespeare, Macbeth, and Macbeth's castle. The tree's internal nodes encode either the structure of the document (title, act, and scene) or metadata functions (author).
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail...</verse>
    </scene>
  </act>
</play>
Figure 1: An XML document.

Figure 2: The XML document of Figure 1 as a simplified DOM object.
The standard for accessing and processing XML documents is the XML Document Object Model or DOM. The DOM represents elements, attributes and text within elements as nodes in a tree. XPath is a standard for enumerating paths in an XML document collection. We will also refer to paths as XML contexts. The XPath expression node selects all nodes of that name. Successive elements of a path are separated by slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that an arbitrary number of elements can intervene on a path: play//scene selects all scene elements occurring in a play element. An initial slash starts the path at the root element: /play/title selects the play's title. For notational convenience, we allow the final element of a path to be a vocabulary term and separate it from the element path by the symbol #, even though this does not conform to the XPath standard. For example, title#"Macbeth" selects all titles containing the term Macbeth. A schema puts constraints on the structure of allowable XML documents for a particular application. Two standards for schemas for XML documents are XML DTD (document type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.
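These path expressions map closely onto Python's xml.etree.ElementTree, which supports a limited XPath subset; a small sketch over the Figure 1 document (ElementTree writes the descendant axis as .// rather than a leading //):

```python
import xml.etree.ElementTree as ET

play = ET.fromstring("""
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail...</verse>
    </scene>
  </act>
</play>
""")

# act/scene : scene elements whose parent is an act element
print([s.get("number") for s in play.findall("act/scene")])

# play//scene : scene elements anywhere below the play root
print([s.find("title").text for s in play.findall(".//scene")])

# /play/title : the play's title (the parsed root is already <play>)
print(play.find("title").text)
```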

Fig: Tree representation of XML documents and queries.
A common format for XML queries is NEXI (Narrowed Extended XPath). As in XPath, double slashes indicate that an arbitrary number of elements can intervene on a path. The dot in a clause in square brackets refers to the element the clause modifies. The clause [.//yr = 2001 or .//yr = 2002] modifies //article; thus, the dot refers to //article in this case. Similarly, the dot in [about(., summer holidays)] refers to the section that the clause modifies.

10. What are the challenges in XML retrieval?
The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. If we query Shakespeare's plays for Macbeth's castle, should we return the scene, the act or the entire play? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit. One criterion for selecting the most appropriate part of a document is the structured document retrieval principle: a system should always retrieve the most specific part of a document answering the query. However, it can be hard to implement this principle algorithmically. Consider the query title#"Macbeth" applied to Figure 2. The title of the tragedy (Macbeth) and the title of the scene (Macbeth's castle) are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred.
In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web, etc. In structured retrieval, there are a number of different approaches to defining the indexing unit. One approach is to group nodes into non-overlapping pseudo-documents, as shown in the figure. In the example, books, chapters and sections have been designated to be indexing units, but without overlap.

Fig: Partitioning an XML document into non-overlapping indexing units.
The disadvantage of this approach is that pseudo-documents may not make sense to the user because they are not coherent units. For instance, the leftmost indexing unit in the figure merges three disparate elements: the class, author and title elements. The least restrictive approach is to index all elements. This is also problematic. Many XML elements are not meaningful search results, e.g., typographical elements like <b>definitely</b> or an ISBN number, which cannot be interpreted without context. Also, indexing all elements means that search results will be highly redundant. We call elements that are contained within each other nested. Returning redundant NESTED ELEMENTS in a list of returned hits is not very user-friendly. Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned, for example by discarding all small elements.
A challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idf) statistics. For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate. It makes little sense to compute a single document frequency for Gates in this example. One solution is to compute idf for XML-context/term pairs, e.g., to compute different idf weights for author#"Gates" and section#"Gates". Unfortunately, this scheme will run into sparse data problems.
In many cases, several different XML schemas occur in a collection, since the XML documents in an IR application often come from more than one source. This phenomenon is called schema heterogeneity or schema diversity and presents yet another challenge. As illustrated in the following figure, comparable elements may have

different names: creator in d2 vs. author in d3.
Fig: Schema heterogeneity: intervening nodes and mismatched names.
If we employ strict matching of trees, then q3 will retrieve neither d2 nor d3, although both documents are relevant. Some form of approximate matching of element names, in combination with semi-automatic matching of different document structures, can help here. Human editing of correspondences of elements in different schemas will usually do better than automatic methods. Schema heterogeneity is one reason for query-document mismatches like q3/d2 and q3/d3. Another reason is that users often are not familiar with the element names and the structure of the schemas of the collections they search, as mentioned earlier. We can also support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed. We call such queries extended queries.

11. What is the vector space model for XML retrieval?
We first take each text node (which in our setup is always a leaf) and break it into multiple nodes, one for each word. So the leaf node Bill Gates is split into two leaves, Bill and Gates. Next we define the dimensions of the vector space to be lexicalized subtrees of documents, that is, subtrees that contain at least one vocabulary term. A subset of these possible lexicalized subtrees is shown in the figure, but there are others, e.g., the subtree corresponding to the whole document with the leaf node Gates removed. We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them.

Fig: A mapping of an XML document (left) to a set of lexicalized subtrees (right).
If we create a separate dimension for each lexicalized subtree occurring in the collection, the dimensionality of the space becomes too large. A compromise is to index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs. We call such an XML-context/term pair a structural term and denote it by <c, t>: a pair of XML context c and vocabulary term t. The document in the above figure has nine structural terms. Seven are shown (e.g., "Bill" and Author#"Bill") and two are not shown: /Book/Author#"Bill" and /Book/Author#"Gates". The tree with the leaves Bill and Gates is a lexicalized subtree that is not a structural term. We ensure that retrieval results respect the user's preference for structurally close matches by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function:
CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise,
where |cq| and |cd| are the number of nodes in the query path and document path, respectively, and cq matches cd iff we can transform cq into cd by inserting additional nodes. Two examples from Figure 10.6 are CR(cq4, cd2) = 3/4 = 0.75 and CR(cq4, cd3) = 3/5 = 0.6, where cq4, cd2 and cd3 are the relevant paths from top to leaf node in q4, d2 and d3, respectively. The value of CR(cq, cd) is 1.0 if q and d are identical.
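A short sketch of the CR computation; the paths are given as lists of element names, and the concrete names for cq4, cd2 and cd3 are invented since the actual trees live in the referenced figure:

```python
def context_resemblance(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq can be turned into cd by
    inserting extra nodes, else 0. Paths are given as lists of element names."""
    def matches(query_path, doc_path):
        i = 0
        for node in doc_path:                 # greedy subsequence test
            if i < len(query_path) and query_path[i] == node:
                i += 1
        return i == len(query_path)
    if not matches(cq, cd):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))

# A 2-node query path against 3- and 4-node document paths (mirroring
# CR(cq4, cd2) = 3/4 and CR(cq4, cd3) = 3/5; the element names are illustrative):
cq4 = ["book", "title"]
cd2 = ["book", "chapter", "title"]
cd3 = ["book", "part", "chapter", "title"]
print(context_resemblance(cq4, cd2))  # 0.75
print(context_resemblance(cq4, cd3))  # 0.6
```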

The final score for a document is computed as a variant of the cosine measure, which we call SIMNOMERGE for reasons that will become clear shortly. SIMNOMERGE is defined as follows:
SIMNOMERGE(q, d) = Σ_{ck ∈ B} Σ_{cl ∈ B} CR(ck, cl) Σ_{t ∈ V} weight(q, t, ck) · weight(d, t, cl) / sqrt( Σ_{c ∈ B, t ∈ V} weight(d, t, c)² )
where V is the vocabulary of non-structural terms, B is the set of all XML contexts, and weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively. We compute the weights using one of the usual weightings, such as idft · wft,d. The inverse document frequency idft depends on which elements we use to compute dft. The similarity measure SIMNOMERGE(q, d) is not a true cosine measure, since its value can be larger than 1.0. The algorithm for computing SIMNOMERGE for all documents in the collection is shown in Figure 10.9.
Figure 10.9: The algorithm for scoring documents with SIMNOMERGE.
We give an example of how SIMNOMERGE computes query-document similarities in the figure. <c1, t> is one of the structural terms in the query. We successively retrieve all postings lists for structural terms <c, t> with the same vocabulary term t. Three example postings lists are shown. For the first one, we have CR(c1, c1) = 1.0, since the two contexts are identical. The next context has no context resemblance with c1: CR(c1, c2) = 0, and the corresponding postings list is ignored. The context match of c1 with c3 is 0.63 > 0 and it will be processed.

In this example, the highest ranking document is d9, since it receives contributions from both of the matching postings lists, weighted by CR(c1, c1) = 1.0 and CR(c1, c3) = 0.63 respectively. To simplify the figure, the query weight of <c1, t> is assumed to be 1.0.

12. What are web indexes? Explain the process with a suitable diagram.
Before an index can be used for query processing, it has to be created from the text collection. Building a small index is not particularly difficult, but as input sizes grow, some index construction tricks can be useful. In this section, we will look at simple in-memory index construction first, and then consider the case where the input data does not fit in memory. Finally, we will consider how to build indexes using more than one computer.
Pseudocode for a simple indexer is shown in the following figure. The process involves only a few steps. A list of documents is passed to the BuildIndex function, and the function parses each document into tokens. These tokens are words, perhaps with some additional processing, such as downcasing or stemming. The function removes duplicate tokens, using, for example, a hash table. Then, for each token, the function determines whether a new inverted list needs to be created in I, and creates one if necessary. Finally, the current document number, n, is added to the inverted list. The result is a hash table of tokens and inverted lists. The inverted lists are just lists of integer document numbers and contain no special information. This is enough to do very simple kinds of retrieval.
As described, this indexer can be used for many small tasks, for example indexing less than a few thousand documents. However, it is limited in two ways. First, it requires that all of the inverted lists be stored in memory, which may not be practical for larger collections. Second, this algorithm is sequential, with no obvious way to parallelize it. The primary barrier to parallelizing this algorithm is the hash table, which is accessed constantly in the inner loop. Adding locks to the hash table would allow parallelism for parsing, but that improvement alone will not be enough to make use of more than a handful of CPU cores. Handling large collections will require less reliance on memory and improved parallelism.
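A Python rendering of the simple indexer described above; tokenization here is plain lower-casing and whitespace splitting, with no stemming:

```python
def build_index(documents):
    """Simple in-memory indexer: documents is a list of text strings.
    Returns a dict mapping each token to a sorted list of document numbers."""
    index = {}                                    # token -> inverted list
    for n, document in enumerate(documents):
        tokens = set(document.lower().split())    # parse + downcase, drop duplicates
        for token in tokens:
            if token not in index:                # create a new inverted list if needed
                index[token] = []
            index[token].append(n)                # add current document number
    return index

docs = ["The cat sat on the mat", "The dog sat", "A cat and a dog"]
index = build_index(docs)
print(index["cat"])   # [0, 2]
print(index["sat"])   # [0, 1]
```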

Fig: Pseudocode for a simple indexer.

13. What is merging?
The classic way to solve the memory problem in the previous example is by merging. We can build the inverted list structure I until memory runs out. When that happens, we write the partial index I to disk, then start making a new one. At the end of this process, the disk is filled with many partial indexes, I1, I2, I3, ..., In. The system then merges these files into a single result. By definition, it is not possible to hold even two of the partial index files in memory at one time, so the input files need to be carefully designed so that they can be merged in small pieces. One way to do this is to store the partial indexes in alphabetical order. It is then possible for a merge algorithm to merge the partial indexes using very little memory. The following figure shows an example of this kind of merging procedure. Even though this figure shows only two indexes, it is possible to merge many at once. The algorithm is essentially the same as the standard merge sort algorithm. Since both I1 and I2 are sorted, at least one of them points to the next piece of data necessary to write to I. The data from the two files is interleaved to produce a sorted result.

Fig: An example of index merging. The first and second indexes are merged together to produce the combined index. Since I1 and I2 may have used the same document numbers, the merge function renumbers documents in I2.
This merging process can succeed even if there is only enough memory to store two words (w1 and w2), a single inverted list posting, and a few file pointers. In practice, a real merge function would read large chunks of I1 and I2, and then write large chunks to I in order to use the disk most efficiently. This merging strategy also suggests a possible parallel indexing strategy: if many machines build their own partial indexes, a single machine can combine all of those into a single, final index.
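A compact sketch of the merge step on two partial indexes held as sorted (term, postings) lists; a real implementation would stream chunks from disk rather than keep whole indexes in memory, as the text notes:

```python
def merge_indexes(i1, i2):
    """Merge two partial indexes whose terms are stored in sorted order.
    Each index is a list of (term, postings) pairs; postings from the second
    index are renumbered by 'offset' so document numbers stay unique."""
    offset = max((doc for _, postings in i1 for doc in postings), default=-1) + 1
    merged, a, b = [], 0, 0
    while a < len(i1) or b < len(i2):
        if b >= len(i2) or (a < len(i1) and i1[a][0] < i2[b][0]):
            merged.append(i1[a]); a += 1
        elif a >= len(i1) or i2[b][0] < i1[a][0]:
            term, postings = i2[b]
            merged.append((term, [d + offset for d in postings])); b += 1
        else:                                    # same term in both partial indexes
            term, p1 = i1[a]
            _, p2 = i2[b]
            merged.append((term, p1 + [d + offset for d in p2])); a += 1; b += 1
    return merged

i1 = [("cat", [0, 2]), ("dog", [1])]
i2 = [("cat", [0]), ("mat", [1])]
print(merge_indexes(i1, i2))
# [('cat', [0, 2, 3]), ('dog', [1]), ('mat', [4])]
```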


More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454 Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Executed by Rocky Sir, tech Head Suven Consultants & Technology Pvt Ltd. seo.suven.net 1

Executed by Rocky Sir, tech Head Suven Consultants & Technology Pvt Ltd. seo.suven.net 1 Executed by Rocky Sir, tech Head Suven Consultants & Technology Pvt Ltd. seo.suven.net 1 1. Parts of a Search Engine Every search engine has the 3 basic parts: a crawler an index (or catalog) matching

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 17 September 2018 Some slides courtesy Manning, Raghavan, and Schütze Other characteristics Significant duplication Syntactic

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Center for Information and Language Processing, University of Munich 2009.07.14 1/36 Outline 1 Recap

More information

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson Web Crawling Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Robust Crawling A Robust Crawl Architecture DNS Doc.

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Overview Overview Introduction Classic

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling

More information

How Does a Search Engine Work? Part 1

How Does a Search Engine Work? Part 1 How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0 What we ll examine Web crawling

More information

US Patent 6,658,423. William Pugh

US Patent 6,658,423. William Pugh US Patent 6,658,423 William Pugh Detecting duplicate and near - duplicate files Worked on this problem at Google in summer of 2000 I have no information whether this is currently being used I know that

More information

3 Media Web. Understanding SEO WHITEPAPER

3 Media Web. Understanding SEO WHITEPAPER 3 Media Web WHITEPAPER WHITEPAPER In business, it s important to be in the right place at the right time. Online business is no different, but with Google searching more than 30 trillion web pages, 100

More information

How to Get Your Website Listed on Major Search Engines

How to Get Your Website Listed on Major Search Engines Contents Introduction 1 Submitting via Global Forms 1 Preparing to Submit 2 Submitting to the Top 3 Search Engines 3 Paid Listings 4 Understanding META Tags 5 Adding META Tags to Your Web Site 5 Introduction

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Jargon Buster. Ad Network. Analytics or Web Analytics Tools. Avatar. App (Application) Blog. Banner Ad

Jargon Buster. Ad Network. Analytics or Web Analytics Tools. Avatar. App (Application) Blog. Banner Ad D I G I TA L M A R K E T I N G Jargon Buster Ad Network A platform connecting advertisers with publishers who want to host their ads. The advertiser pays the network every time an agreed event takes place,

More information

Why is Search Engine Optimisation (SEO) important?

Why is Search Engine Optimisation (SEO) important? Why is Search Engine Optimisation (SEO) important? With literally billions of searches conducted every month search engines have essentially become our gateway to the internet. Unfortunately getting yourself

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

CS47300 Web Information Search and Management

CS47300 Web Information Search and Management CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page

More information

Indexing and Hashing

Indexing and Hashing C H A P T E R 1 Indexing and Hashing This chapter covers indexing techniques ranging from the most basic one to highly specialized ones. Due to the extensive use of indices in database systems, this chapter

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho, Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University {cho, hector}@cs.stanford.edu Abstract In this paper we study how we can design an effective parallel crawler. As the size of the

More information

Europcar International Franchisee Websites Search Engine Optimisation

Europcar International Franchisee Websites Search Engine Optimisation Introduction Everybody would like their site to be found easily on search engines. There is no magic that can guarantee this, but there are some principles that by following will help in your search engine

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

XML: Extensible Markup Language

XML: Extensible Markup Language XML: Extensible Markup Language CSC 375, Fall 2015 XML is a classic political compromise: it balances the needs of man and machine by being equally unreadable to both. Matthew Might Slides slightly modified

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Site Audit Boeing

Site Audit Boeing Site Audit 217 Boeing Site Audit: Issues Total Score Crawled Pages 48 % 13533 Healthy (3181) Broken (231) Have issues (9271) Redirected (812) Errors Warnings Notices 15266 41538 38 2k 5k 4 k 11 Jan k 11

More information

Scalable overlay Networks

Scalable overlay Networks overlay Networks Dr. Samu Varjonen 1 Lectures MO 15.01. C122 Introduction. Exercises. Motivation. TH 18.01. DK117 Unstructured networks I MO 22.01. C122 Unstructured networks II TH 25.01. DK117 Bittorrent

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Web Crawling Instructor: Rada Mihalcea (some of these slides were adapted from Ray Mooney s IR course at UT Austin) The Web by the Numbers Web servers 634 million Users

More information

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans.

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans. 1 After WWW protocol was introduced in Internet in the early 1990s and the number of web servers started to grow, the first technology that appeared to be able to locate them were Internet listings, also

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

Collection Building on the Web. Basic Algorithm

Collection Building on the Web. Basic Algorithm Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring

More information

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher NBA 600: Day 15 Online Search 116 March 2004 Daniel Huttenlocher Today s Class Finish up network effects topic from last week Searching, browsing, navigating Reading Beyond Google No longer available on

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Chapter 13 XML: Extensible Markup Language

Chapter 13 XML: Extensible Markup Language Chapter 13 XML: Extensible Markup Language - Internet applications provide Web interfaces to databases (data sources) - Three-tier architecture Client V Application Programs Webserver V Database Server

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

A Novel Interface to a Web Crawler using VB.NET Technology

A Novel Interface to a Web Crawler using VB.NET Technology IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 6 (Nov. - Dec. 2013), PP 59-63 A Novel Interface to a Web Crawler using VB.NET Technology Deepak Kumar

More information

Search Engine Optimization. MBA 563 Week 6

Search Engine Optimization. MBA 563 Week 6 Search Engine Optimization MBA 563 Week 6 SEARCH ENGINE OPTIMIZATION (SEO) Search engine marketing 2 major methods TWO MAJOR METHODS - OBJECTIVE IS TO BE IN THE TOP FEW SEARCH RESULTS 1. Search engine

More information

Webinar Series. Sign up at February 15 th. Website Optimization - What Does Google Think of Your Website?

Webinar Series. Sign up at  February 15 th. Website Optimization - What Does Google Think of Your Website? Webinar Series February 15 th Website Optimization - What Does Google Think of Your Website? March 21 st Getting Found on Google using SEO April 18 th Crush Your Competitors with Inbound Marketing May

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

Site Audit SpaceX

Site Audit SpaceX Site Audit 217 SpaceX Site Audit: Issues Total Score Crawled Pages 48 % -13 3868 Healthy (649) Broken (39) Have issues (276) Redirected (474) Blocked () Errors Warnings Notices 4164 +3311 1918 +7312 5k

More information

AN SEO GUIDE FOR SALONS

AN SEO GUIDE FOR SALONS AN SEO GUIDE FOR SALONS AN SEO GUIDE FOR SALONS Set Up Time 2/5 The basics of SEO are quick and easy to implement. Management Time 3/5 You ll need a continued commitment to make SEO work for you. WHAT

More information

Web Development & Design Foundations with HTML5

Web Development & Design Foundations with HTML5 1 Web Development & Design Foundations with HTML5 CHAPTER 13 WEB PROMOTION 2 Learning Outcomes In this chapter, you will learn how to: Identify commonly used search engines and search indexes Describe

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

Search Quality. Jan Pedersen 10 September 2007

Search Quality. Jan Pedersen 10 September 2007 Search Quality Jan Pedersen 10 September 2007 Outline The Search Landscape A Framework for Quality RCFP Search Engine Architecture Detailed Issues 2 Search Landscape 2007 Source: Search Engine Watch: US

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Site Audit Virgin Galactic

Site Audit Virgin Galactic Site Audit 27 Virgin Galactic Site Audit: Issues Total Score Crawled Pages 59 % 79 Healthy (34) Broken (3) Have issues (27) Redirected (3) Blocked (2) Errors Warnings Notices 25 236 5 3 25 2 Jan Jan Jan

More information

DATA MINING II - 1DL460. Spring 2017

DATA MINING II - 1DL460. Spring 2017 DATA MINING II - 1DL460 Spring 2017 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Search Engine Optimization (SEO) using HTML Meta-Tags

Search Engine Optimization (SEO) using HTML Meta-Tags 2018 IJSRST Volume 4 Issue 9 Print ISSN : 2395-6011 Online ISSN : 2395-602X Themed Section: Science and Technology Search Engine Optimization (SEO) using HTML Meta-Tags Dr. Birajkumar V. Patel, Dr. Raina

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

FILTERING OF URLS USING WEBCRAWLER

FILTERING OF URLS USING WEBCRAWLER FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Search Enginge Optimization (SEO) Proposal

Search Enginge Optimization (SEO) Proposal Search Enginge Optimization (SEO) Proposal Proposal Letter Thank you for the opportunity to provide you with a quotation for the search engine campaign proposed by us for your website as per your request.our

More information

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB CIS 601: Graduate Seminar Prof. S. S. Chung Presented By:- Amol Chaudhari CSU ID 2682329 AGENDA About Introduction Contributions Background

More information

Notes on Bloom filters

Notes on Bloom filters Computer Science B63 Winter 2017 Scarborough Campus University of Toronto Notes on Bloom filters Vassos Hadzilacos A Bloom filter is an approximate or probabilistic dictionary. Let S be a dynamic set of

More information

THE HISTORY & EVOLUTION OF SEARCH

THE HISTORY & EVOLUTION OF SEARCH THE HISTORY & EVOLUTION OF SEARCH Duration : 1 Hour 30 Minutes Let s talk about The History Of Search Crawling & Indexing Crawlers / Spiders Datacenters Answer Machine Relevancy (200+ Factors)

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives Text Technologies for Data Science INFR11145 Web Search (2) Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Basics of Web search Brief History of web search SEOs Web Crawling (intro)

More information