DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI

DHANALAKSHMI COLLEGE OF ENGINEERING, CHENNAI
Department of Computer Science and Engineering
CS6007 INFORMATION RETRIEVAL
Anna University 2 & 16 Mark Questions & Answers
Year / Semester: IV / VII
Regulation: 2013
Academic year:

UNIT III WEB SEARCH ENGINE INTRODUCTION AND CRAWLING

Part A Question Bank

1. Define web server.
A web server is a computer connected to the Internet that runs a program responsible for storing, retrieving and distributing some of the web's files.

2. What is a web browser?
A web browser is a program used to communicate with web servers on the Internet, which enables it to download and display web pages. Netscape Navigator and Microsoft Internet Explorer are among the most popular browsers available in the market.

3. Explain paid submission as a search service.
In paid submission, the user submits a website for review by a search service for a preset fee, with the expectation that the site will be accepted and included in that company's search engine, provided it meets the stated guidelines for submission. Yahoo! is the major search engine that accepts this type of submission. While paid submission guarantees a timely review of the submitted site and notice of acceptance or rejection, you are not guaranteed inclusion or a particular placement order in the listings.

4. Explain paid inclusion programs of search services.
Paid inclusion programs allow you to submit your website for guaranteed inclusion in a search engine's database of listings for a set period of time. While paid inclusion guarantees indexing of submitted pages or sites in a search database, you are not guaranteed that the pages will rank well for particular queries.

5. Explain pay-for-placement as a search service.
In pay-for-placement, you can guarantee a ranking in a search listing for the terms of your choice. Also known as paid placement, paid listings, or sponsored listings, this program guarantees placement in search results. The leaders in pay-for-placement are Google, Yahoo! and Bing.

6. Define Search Engine Optimization.
Search Engine Optimization (SEO) is the act of modifying a website to increase its ranking in the organic, crawler-based listings of search engines. There are several ways to increase the visibility of your website through the major search engines on the Internet today.

The two most common forms of internet marketing are paid placement and natural placement.

7. Describe the benefits of SEO.
   - Increase your search engine visibility.
   - Generate more traffic from the major search engines.
   - Make sure your website and business get noticed and visited.
   - Grow your client base and increase business revenue.

8. Explain the difference between SEO and pay-per-click.
   - SEO results take 2 weeks to 4 months; pay-per-click produces results in 1-2 days.
   - With SEO it is very difficult to control the flow of traffic; pay-per-click can be turned on and off at any moment.
   - SEO requires ongoing learning and experience; pay-per-click makes it easier for a novice to reap results.
   - SEO is more difficult to target at local markets; pay-per-click has the ability to target local markets.
   - SEO is better for long-term, lower-margin campaigns; pay-per-click is better for short-term, high-margin campaigns.
   - SEO is generally more cost-effective and does not penalize you for more traffic; pay-per-click is generally more costly per visitor and per conversion.

9. What is a web crawler?
A web crawler is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

10. Define focused crawler.
A focused crawler, or topical crawler, is a web crawler that attempts to download only pages that are relevant to a pre-defined topic or set of topics.

11. What is hard and soft focused crawling?
In hard focused crawling, the classifier is invoked on a newly crawled document in the standard manner. When it returns the best matching category path, the out-neighbors of the page are checked into the crawl database if and only if some node on the best matching category path is marked as good.

In soft focused crawling, all out-neighbors of a visited page are checked into the crawl database, but their crawl priority is based on the relevance of the current page.

12. What is near-duplicate detection?
Near-duplicate detection is the task of identifying documents with almost identical content. Near-duplicate web documents are abundant: two such documents may differ from each other only in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search.

13. What are the requirements of XML information retrieval systems?
   - A query language that allows users to specify the nature of relevant components, in particular with respect to their structure.
   - Representation strategies providing a description not only of the content of XML documents, but also of their structure.
   - Ranking strategies that determine the most relevant elements and rank them appropriately for a given query.

PART-B (16 Marks)

1. Explain the concept of Web search (review).
The Web consists of pages, each of which can be addressed by an identifier called a Uniform Resource Locator (URL); a website is a set of pages published together. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. With powerful workstations and parallel processing, efficiency is not the bottleneck. A user can search for any information by passing a query in the form of keywords or a phrase; the search engine then searches for relevant information in its database and returns it to the user. Web search engines discover pages by crawling the Web, finding new pages by following hyperlinks. Access to particular web pages may be restricted in various ways; the set of pages which cannot be included in search engine indexes is often called the hidden (or deep) Web. The search engine looks for the keywords in the index of a predefined database instead of going directly to the Web to search for them. The pages in this database are gathered by a software component known as a web crawler. Once the web crawler has found the pages, the search engine shows the relevant web pages as results. These retrieved results generally include the title of the page, the size of the text portion, the first several sentences, and so on.

2. Explain the structure of the web.
Bow-Tie Structure of the Web: One of the intriguing findings of a large crawl of the Web was that the Web has a bow-tie structure, as shown in the figure. The central core of the Web (the knot of the bow-tie) is the strongly connected component (SCC), which means that for any two pages in the SCC, a user can navigate from one of them to the other and back by clicking on links embedded in the pages encountered. In other words, a user browsing a page in the SCC can always reach any other page in the SCC by traversing some path of links. The left bow, called IN, contains pages that have a directed path of links leading to the SCC. The right bow, called OUT, contains pages that can be reached from the SCC by following a directed path of links. A web page in Tubes has a directed path from IN to OUT bypassing the SCC, and a page in Tendrils can either be reached from IN or leads into OUT. The pages in Disconnected are not even weakly connected to the SCC; that is, even if we ignored the fact that hyperlinks only allow forward navigation, allowing them to be traversed backwards as well as forwards, we still could not reach the SCC from them.
Fig: Bow-Tie Shape of the Web
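The bow-tie decomposition rests on computing strongly connected components of the link graph. The sketch below is only an illustration on a hypothetical five-page toy graph (not part of the prescribed answer), using Kosaraju's two-pass depth-first search in Python:

```python
from collections import defaultdict

def strongly_connected_components(graph):
    """Kosaraju's algorithm: 'graph' maps each page to the list of pages it
    links to. Returns a list of SCCs (sets of pages)."""
    # Pass 1: order vertices by finish time on the original graph.
    visited, order = set(), []
    def dfs(node):
        visited.add(node)
        for nxt in graph.get(node, []):
            if nxt not in visited:
                dfs(nxt)
        order.append(node)
    for node in graph:
        if node not in visited:
            dfs(node)

    # Pass 2: DFS on the reversed graph in reverse finish order.
    reverse = defaultdict(list)
    for node, links in graph.items():
        for nxt in links:
            reverse[nxt].append(node)
    assigned, components = set(), []
    def collect(node, component):
        assigned.add(node)
        component.add(node)
        for nxt in reverse[node]:
            if nxt not in assigned:
                collect(nxt, component)
    for node in reversed(order):
        if node not in assigned:
            component = set()
            collect(node, component)
            components.append(component)
    return components

# Toy web graph: A, B, C form the SCC; D is in IN; E is in OUT.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "E"], "D": ["A"], "E": []}
print(strongly_connected_components(toy_web))  # [{'D'}, {'A', 'B', 'C'}, {'E'}]
```

Pages outside the largest component can then be classified as IN, OUT, Tendrils or Disconnected by checking reachability to and from the SCC.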

3. Explain paid placement with a neat diagram.
In this scheme the search engine separates its query results list into two parts: (i) an organic list, which contains the free, unbiased results, displayed according to the search engine's ranking, and (ii) a sponsored list, which is paid for by advertising managed with the aid of an online auction mechanism. This method of payment is called pay-per-click (PPC), also known as cost-per-click (CPC), since payment is made by the advertiser each time a user clicks on the link in the sponsored listing. In most cases the organic and sponsored lists are kept separate, but an alternative model is to interleave the organic and sponsored results within a single listing.
Example: Pay-per-click is calculated by dividing the advertising cost by the number of clicks generated by an advertisement. The basic formula is:
Pay-per-click ($) = Advertising cost ($) / Ads clicked (#)
There are two primary models for determining pay-per-click: flat-rate and bid-based. In both cases, the advertiser must consider the potential value of a click from a given source. This value is based on the type of individual the advertiser is expecting to receive as a visitor to his or her website, and what the advertiser can gain from that visit, usually revenue, both in the short term as well as in the long term.
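A quick worked illustration of the pay-per-click formula (the cost and click figures are invented for the example):

```python
def pay_per_click(advertising_cost, ads_clicked):
    """Pay-per-click ($) = Advertising cost ($) / Ads clicked (#)."""
    return advertising_cost / ads_clicked

# Hypothetical campaign: $150 spent, 300 clicks received.
print(pay_per_click(150.0, 300))  # 0.5 dollars per click
```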

There are three types of paid search services:
a) Paid submission
b) Paid inclusion
c) Pay-for-placement
a) Paid Submission: The user submits a website for review by a search service for a preset fee, with the expectation that the site will be accepted and included in that company's search engine, provided it meets the stated guidelines for submission. Yahoo! is the major search engine that accepts this type of submission. While paid submission guarantees a timely review of the submitted site and notice of acceptance or rejection, you are not guaranteed inclusion or a particular placement order in the listings.
b) Paid Inclusion: This allows you to submit your website for guaranteed inclusion in a search engine's database of listings for a set period of time. While paid inclusion guarantees indexing of submitted pages or sites in a search database, you are not guaranteed that the pages will rank well for particular queries.
c) Pay-for-placement: You can guarantee a ranking in a search listing for the terms of your choice. Also known as paid placement, paid listings, or sponsored listings, this program guarantees placement in search results. The leaders in pay-for-placement are Google, Yahoo! and Bing.

4. Explain the search engine optimization/spam concept.
Spam: In a search engine whose scoring was based on term frequencies, a web page with numerous repetitions of chosen terms would rank highly. This led to the first generation of spam, which (in the context of web search) is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords. Many web content creators have commercial motives and therefore stand to gain from manipulating search engine results. Search engines soon became sophisticated enough in their spam detection to screen out a large number of repetitions of particular keywords. Spammers responded with a richer set of spam techniques, the best known of which we now describe. The first of these techniques is cloaking, shown in the figure: the spammer's web server returns different pages depending on whether the HTTP request comes from a web search engine's crawler or from a user's browser. When the user searches for these keywords and elects to view the page, he receives a web page that has altogether different content from that

indexed by the search engine.
SEO: Given that spamming is inherently an economically motivated activity, there has sprung up around it an industry of Search Engine Optimizers, or SEOs, who provide consultancy services for clients seeking to have their web pages rank highly for selected keywords. A page's ranking is measured by the position of the web page displayed in the search engine results. If a search engine puts your web page in the first position, then your page rank will be number 1 and it will be considered the page with the highest rank. SEO is the process of designing and developing a website to attain a high rank in search engine results. Conceptually, there are two kinds of optimization:
On-Page SEO - It includes providing good content, good keyword selection, putting keywords in the correct places, giving an appropriate title to every page, etc.
Off-Page SEO - It includes link building, increasing link popularity by submitting to open directories and search engines, link exchange, etc.
On-page SEO:
1. Title optimization: An HTML TITLE tag is put inside the head tag. The page title (not to be confused with the heading of a page) is what is displayed in the title bar of your browser window. Correct use of keywords in the title of every page of your website is extremely important.
2. Header and bold tags
3. Keyword usage
4. Link structure
5. Domain name strategy
6. Alt tags
7. Meta description
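As a small illustration of on-page title optimization, the following sketch extracts a page's TITLE with Python's standard html.parser and checks it against a keyword list; the 60-character length limit and the sample page are assumptions for the example, not official SEO rules:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text inside the <title> tag of an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def check_title(html, keywords, max_length=60):
    """Report whether the title exists, stays reasonably short, and mentions
    the target keywords (the length limit is an illustrative assumption)."""
    parser = TitleExtractor()
    parser.feed(html)
    title = parser.title.strip()
    return {
        "title": title,
        "within_length": len(title) <= max_length,
        "missing_keywords": [k for k in keywords
                             if k.lower() not in title.lower()],
    }

page = "<html><head><title>Information Retrieval Notes</title></head></html>"
print(check_title(page, ["information retrieval", "crawling"]))
```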

Off-Page SEO:
1. Anchor text
2. Link building
3. Paid links
Benefits:
1. Increase search engine visibility.
2. Generate more traffic from the major search engines.
3. Make sure your website and business get noticed and visited.
4. Grow your client base and increase business revenues.

5. Explain the Web search architectures with a diagram.
a) Centralized Architecture: Most search engines use a centralized crawler-indexer architecture. Crawlers are programs (software agents) that traverse the Web, sending new or updated pages to a main server where they are indexed. Crawlers are also called robots, spiders, wanderers, walkers, and knowbots. In spite of their name, a crawler does not actually move to and run on remote machines; rather, the crawler runs on a local system and sends requests to remote Web servers. The index is used in a centralized fashion to answer queries submitted from different places on the Web. The following figure shows the software architecture of a search engine based on the AltaVista architecture. It has two parts: one that deals with the users, consisting of the user interface and the query engine, and another that consists of the crawler and indexer modules.

Problems:
1) The main problem faced by this architecture is the gathering of the data, because of the highly dynamic nature of the Web, the saturated communication links, and the high load at Web servers.
2) Another important problem is the sheer volume of the data.
b) Distributed Architecture: There are several variants of the crawler-indexer architecture. Among them, the most important is Harvest. Harvest uses a distributed architecture to gather and distribute data, which is more efficient than the crawler architecture. The main drawback is that Harvest requires the coordination of several web servers. The Harvest distributed approach addresses several of the problems of the crawler-indexer architecture, such as: (1) Web servers receive requests from different crawlers, increasing their load; (2) Web traffic increases because crawlers retrieve entire objects, but most of their content is discarded; and (3) information is gathered independently by each crawler, without coordination between all the search engines. To solve these problems, Harvest introduces two main elements: gatherers and brokers. A gatherer collects and extracts indexing information from one or more Web servers. Gathering times are defined by the system and are periodic (i.e. there are harvesting times, as the name of the system suggests). A broker provides the indexing mechanism and the query interface to the gathered data. Brokers retrieve information from one or more gatherers or other brokers, incrementally updating their indices. Depending on the configuration of gatherers and brokers, different improvements in server load and network traffic can be achieved. A replicator can be used to replicate servers, enhancing user-base scalability. For example, the registration broker can be replicated in different geographic regions to allow faster access. Replication can also be used to divide the gathering process between many Web

servers. Finally, the object cache reduces network and server load, as well as response latency when accessing Web pages.

6. Explain the process of web crawling.
Web crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them.
Features of a Crawler:
Robustness: Ability to handle spider traps. The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.
Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.
Performance and efficiency: The crawl system should make efficient use of various system resources, including processor, storage and network bandwidth.
Quality: The crawler should be biased towards fetching useful pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages.
Extensible: Crawlers should be designed to be extensible in many ways, to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.
Basic Operation: The crawler begins with one or more URLs that constitute a seed set. It picks a URL from this seed set, and then fetches the web page at that URL. The fetched page is then parsed, to extract both the text and the links from the page (each of which points to another URL). The extracted text is fed to a text indexer. The extracted links (URLs) are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. Initially, the URL frontier contains the seed set; as pages are fetched, the corresponding URLs are deleted from the URL frontier. The entire process may be viewed as traversing the web graph. In continuous crawling, the URL of a fetched page is added back to the frontier for fetching again in the future.
Web Crawler Architecture: The simple scheme outlined above demands several modules that fit together as shown in the figure:
1) The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous crawling, a URL may have been fetched previously but is back in the frontier for re-fetching).
2) A DNS resolution module that determines the web server from which to fetch the page specified by a URL.
3) A fetch module that uses the HTTP protocol to retrieve the web page at a URL.
4) A parsing module that extracts the text and set of links from a fetched web page.
5) A duplicate elimination module that determines whether an extracted link is already in the URL frontier or has recently been fetched.
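A minimal, single-threaded sketch of this basic operation in Python is shown below. It keeps a frontier and a seen-set, fetches with urllib and extracts links with html.parser; DNS caching, politeness delays, robots.txt handling and text indexing are deliberately omitted, and the seed URL is a placeholder:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Parsing module: collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)          # URL frontier
    seen = set(seed_urls)                # duplicate elimination
    fetched = {}                         # url -> raw HTML (stand-in for the indexer)
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                     # skip unreachable or non-HTML pages
        fetched[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:     # add unseen links back to the frontier
                seen.add(absolute)
                frontier.append(absolute)
    return fetched

# Example (the seed URL is a placeholder):
# pages = crawl(["https://example.com/"], max_pages=5)
```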

Crawling is performed by anywhere from one to potentially hundreds of threads, each of which loops through the logical cycle shown in the figure above. These threads may be run in a single process, or be partitioned amongst multiple processes running at different nodes of a distributed system. We begin by assuming that the URL frontier is in place and non-empty. We follow the progress of a single URL through the cycle of being fetched, passing through various checks and filters, then finally (for continuous crawling) being returned to the URL frontier.

7. What is a metacrawler? Explain the concept with a neat diagram.
Metasearchers / metacrawlers are Web servers that send a given query to several search engines, Web directories and other databases, collect the answers and unify them. Examples are MetaCrawler and SavvySearch. The main advantages of metasearchers are the ability to combine the results of many sources and the fact that the user can pose the same query to various sources through a single common interface. Metasearchers differ from each other in how ranking is performed in the unified result (in some cases no ranking is done), and how well they translate the user query to the specific query language of each search engine or Web directory (the query language common to all of them could be small). The following table shows the URLs of the main metasearch engines as well as the number of search engines, Web directories and other databases

that they search. Metasearchers can also run on the client, for example Copernic, EchoSearch, WebFerret, WebCompass, and WebSeeker. There are others that search several sources and show the different answers in separate windows, such as All4One, OneSeek, Proteus, and Search Spaniel. The advantages of metasearchers are that the results can be sorted by different attributes such as host, keyword, date, etc., which can be more informative than the output of a single search engine. Therefore browsing the results should be simpler. On the other hand, the result is not necessarily all the Web pages matching the query, as the number of results per search engine retrieved by the metasearcher is limited (it can be changed by the user, but there is an upper limit). Nevertheless, pages returned by more than one search engine should be more relevant. We expect that new metasearchers will do better ranking. A first step in this direction is the NEC Research Institute metasearch engine, Inquirus. The main difference is that Inquirus actually downloads and analyzes each Web page obtained and then displays each page, highlighting the places where the query terms were found. The results are displayed as soon as they are available, in a progressive manner, otherwise the waiting time would be too long. This technique also allows non-existent pages, or pages that have changed and do not contain the query any more, to be discarded, and, more importantly, provides for better ranking than normal search engines. On the other hand, this metasearcher is not available to the general public. The use of metasearchers is justified by coverage studies that show that a small percentage of Web pages are in all search engines. In fact, fewer than 1% of the Web pages indexed by AltaVista, HotBot, Excite, and Infoseek are in all of those search engines. This fact is quite surprising and has not been explained.
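A rough sketch of the unify-and-rank step of a metasearcher: the two engine adapters are hypothetical stand-ins that return canned results, and the rank-based scoring rule is an illustrative heuristic rather than any particular system's method:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-engine adapters: in a real metasearcher each of these
# would issue an HTTP request to one engine and normalise its response.
# Here they return canned (url, rank) lists so the sketch is runnable.
def engine_a(query):
    return [("http://a.example/1", 1), ("http://shared.example/x", 2)]

def engine_b(query):
    return [("http://shared.example/x", 1), ("http://b.example/9", 2)]

def metasearch(query, engines):
    """Send the same query to every engine in parallel and unify the results.
    Scoring heuristic (an assumption): a page appearing in several engines,
    or near the top of any list, scores higher."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda e: e(query), engines))
    scores = {}
    for results in result_lists:
        for url, rank in results:
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(metasearch("information retrieval", [engine_a, engine_b]))
```

Pages returned by more than one engine accumulate score from each list, which matches the observation above that such pages tend to be more relevant.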

8. Explain the focused crawling concept.
Some users would like a search engine that focuses on a specific topic of information. For instance, at a website about movies, users might want access to a search engine that leads to more information about movies. If built correctly, this type of vertical search can provide higher accuracy than general search because of the lack of extraneous information in the document collection. The computational cost of running a vertical search will also be much less than a full web search, simply because the collection will be much smaller.
The most accurate way to get web pages for this kind of engine would be to crawl a full copy of the Web and then throw out all unrelated pages. This strategy requires a huge amount of disk space and bandwidth, and most of the web pages will be discarded at the end. A less expensive approach is focused, or topical, crawling. A focused crawler attempts to download only those pages that are about a particular topic.
Focused crawlers rely on the fact that pages about a topic tend to have links to other pages on the same topic. If this were perfectly true, it would be possible to start a crawl at one on-topic page, then crawl all pages on that topic just by following links from a single root page. In practice, a number of popular pages for a specific topic are typically used as seeds. Focused crawlers require some automatic means for determining whether a page is about a particular topic. Text classifiers are tools that can make this kind of distinction. Once a page is downloaded, the crawler uses the classifier to decide whether the page is on topic. If it is, the page is kept, and links from the page are used to find other related sites. The anchor text in the outgoing links is an important clue of topicality. Also, some pages have more on-topic links than others. As links from a particular web page are visited, the crawler can keep track of the topicality of the downloaded pages and use this to determine whether to download other similar pages. Anchor text data and page link topicality data can be combined together in order to determine which pages should be crawled next.
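The priority-queue idea behind focused crawling can be sketched as follows; the keyword-overlap score is a toy stand-in for a trained text classifier, and fetch and extract_links are assumed helper functions supplied by the caller:

```python
import heapq

TOPIC_TERMS = {"movie", "film", "actor", "director", "cinema"}

def on_topic_score(text):
    """Toy 'classifier': fraction of topic terms present in the text.
    A real focused crawler would use a trained text classifier here."""
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)

def focused_crawl(seeds, fetch, extract_links, threshold=0.2, max_pages=100):
    """'fetch' and 'extract_links' are assumed helpers supplied by the caller:
    fetch(url) -> page text, extract_links(url, text) -> [(anchor_text, url)]."""
    frontier = [(-1.0, url) for url in seeds]    # max-priority via negated scores
    heapq.heapify(frontier)
    seen, kept = set(seeds), []
    while frontier and len(kept) < max_pages:
        _priority, url = heapq.heappop(frontier)
        text = fetch(url)
        score = on_topic_score(text)
        if score < threshold:
            continue                             # off-topic page: keep nothing, follow no links
        kept.append((url, score))
        for anchor, link in extract_links(url, text):
            if link not in seen:
                seen.add(link)
                # Anchor text and the parent page's topicality guide the priority.
                link_priority = score + on_topic_score(anchor)
                heapq.heappush(frontier, (-link_priority, link))
    return kept
```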

Web size measurement:
To a first approximation, comprehensiveness grows with index size, although it does matter which specific pages a search engine indexes: some pages are more informative than others. It is also difficult to reason about the fraction of the Web indexed by a search engine, because there is an infinite number of dynamic web pages; for instance, a request for a non-existent page at Yahoo! may return a valid HTML page rather than an error, politely informing the user that there is no such page. Such a "soft 404 error" is only one example of many ways in which web servers can generate an infinite number of valid web pages. Indeed, some of these are malicious spider traps devised to keep a crawler trapped within that site. We could ask the following better-defined question: given two search engines, what are the relative sizes of their indexes? Even this question turns out to be imprecise, because:
1. In response to queries a search engine can return web pages whose contents it has not (fully or even partially) indexed. For one thing, search engines generally index only the first few thousand words in a web page.
2. Search engines generally organize their indexes in various tiers and partitions, not all of which are examined on every search. For instance, a web page deep inside a website may be indexed but not retrieved on general web searches; it is, however, retrieved as a result on a search that a user has explicitly restricted to that website.
Thus, search engine indexes include multiple classes of indexed pages, so that there is no single measure of index size. These issues notwithstanding, a number of techniques have been devised for crude estimates of the ratio of the index sizes of two search engines, E1 and E2. The basic hypothesis underlying these techniques is that each search engine indexes a fraction of the Web chosen independently and uniformly at random. This involves some questionable assumptions: first, that there is a finite size for the Web from which each search engine chooses a subset, and second, that each engine chooses an independent, uniformly chosen subset.
CAPTURE-RECAPTURE METHOD: Pick a random page from one engine and test whether it is present in the other, and vice versa. If x is the fraction of pages sampled from E1 that are found in E2, and y is the fraction of pages sampled from E2 that are found in E1, then |E1| / |E2| = y / x. If our assumption about E1 and E2 being independent and uniform random subsets of the Web were true, and our sampling process unbiased, then this equation should give us an unbiased estimator for |E1| / |E2|. We distinguish between two scenarios here. Either the measurement is performed by someone with access to the index of one of the search engines (say an employee of E1), or the measurement is performed by an independent party with no access to the innards of either search engine. In the former case, we can simply pick a random document from one index. The latter case is more

challenging: we must pick a random page from one search engine from outside that search engine, and then verify whether the random page is present in the other search engine. To implement the sampling phase, we might generate a random page from the entire (idealized, finite) Web and test it for presence in each search engine. Unfortunately, picking a web page uniformly at random is a difficult problem. We briefly outline several attempts to achieve such a sample, pointing out the biases inherent in each; following this we describe in some detail one technique that much research has built on.
1. Random searches: Begin with a search log of web searches; send a random search from this log to E1 and pick a random page from the results. Since such logs are not widely available outside a search engine, one implementation is to trap all search queries going out of a work group (say scientists in a research center) that agrees to have all its searches logged. This approach has a number of issues, including the bias from the types of searches made by the work group. Further, a random document from the results of such a random search to E1 is not the same as a random document from E1.
2. Random IP addresses: A second approach is to generate random IP addresses and send a request to a web server residing at the random address, collecting all pages at that server. The biases here include the fact that many hosts might share one IP address (due to a practice known as virtual hosting) or not accept HTTP requests from the host where the experiment is conducted. Furthermore, this technique is more likely to hit one of the many sites with few pages, skewing the document probabilities; we may be able to correct for this effect if we understand the distribution of the number of pages on websites.
3. Random walks: If the web graph were a strongly connected directed graph, we could run a random walk starting at an arbitrary web page. This walk would converge to a steady-state distribution, from which we could in principle pick a web page with a fixed probability. This method, too, has a number of biases. First, the Web is not strongly connected, so that even with various corrective rules it is difficult to argue that we can reach a steady-state distribution starting from any page. Second, the time it takes for the random walk to settle into this steady state is unknown and could exceed the length of the experiment.
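A small simulation of the capture-recapture estimator on synthetic data (the index sizes and sample sizes are arbitrary choices for the demonstration):

```python
import random

def estimate_size_ratio(sample_e1, sample_e2, in_e1, in_e2):
    """Capture-recapture estimate of |E1| / |E2|.
    sample_e1, sample_e2: random pages drawn from each engine's index;
    in_e1(url), in_e2(url): membership tests (in practice, test queries
    against the other engine). Returns y / x as derived above."""
    x = sum(in_e2(url) for url in sample_e1) / len(sample_e1)  # fraction of E1 also in E2
    y = sum(in_e1(url) for url in sample_e2) / len(sample_e2)  # fraction of E2 also in E1
    return y / x

# Synthetic check with known index sizes (|E1| = 2000, |E2| = 1000):
web = [f"page{i}" for i in range(10_000)]
e1 = set(random.sample(web, 2000))
e2 = set(random.sample(web, 1000))
ratio = estimate_size_ratio(random.sample(sorted(e1), 500),
                            random.sample(sorted(e2), 500),
                            lambda u: u in e1, lambda u: u in e2)
print(f"estimated |E1|/|E2| = {ratio:.2f} (true value 2.00)")
```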

9. Explain random queries with an equation.
The idea is to pick a page (almost) uniformly at random from a search engine's index by posing a random query to it. It should be clear that picking a set of random terms from (say) a standard dictionary is not a good way of implementing this idea. For one thing, not all vocabulary terms occur equally often, so this approach will not result in documents being chosen uniformly at random from the search engine. For another, there are a great many terms in web documents that do not occur in a standard dictionary. To address the problem of vocabulary terms not in a standard dictionary, we begin by amassing a sample web dictionary. This could be done by crawling a limited portion of the Web, or by crawling a manually assembled representative subset of the Web such as Yahoo!. Consider a conjunctive query with two or more randomly chosen words from this dictionary.
Operationally, we proceed as follows: we use a random conjunctive query on E1 and pick from the top 100 returned results a page p at random. We then test p for presence in E2 by choosing 6-8 low-frequency terms in p and using them in a conjunctive query for E2. We can improve the estimate by repeating the experiment a large number of times. Both the sampling process and the testing process have a number of issues.
1. Our sample is biased towards longer documents.
2. Picking from the top 100 results of E1 induces a bias from the ranking algorithm of E1. Picking from all the results of E1 makes the experiment slower. This is particularly so because most web search engines put up defenses against excessive robotic querying.
3. During the checking phase, a number of additional biases are introduced: for instance, E2 may not handle 8-word conjunctive queries properly.
4. Either E1 or E2 may refuse to respond to the test queries, treating them as robotic spam rather than as bona fide queries.
5. There could be operational problems like connection time-outs.
A sequence of research has built on this basic paradigm to eliminate some of these issues; there is no perfect solution yet, but the level of sophistication in statistics for understanding the biases is increasing. The main idea is to address biases by estimating, for each document, the magnitude of the bias. From this, standard statistical sampling methods can generate unbiased samples. In the checking phase, the newer work moves away from conjunctive queries to phrase and other queries that appear to be better behaved. Finally, newer experiments use other sampling methods besides random queries. The best known of these is document random walk sampling, in which a document is chosen by a random walk on a virtual graph derived from documents.

In this graph, nodes are documents; two documents are connected by an edge if they share two or more words in common. The graph is never instantiated; rather, a random walk on it can be performed by moving from a document d to another by picking a pair of keywords in d, running a conjunctive query on a search engine and picking a random document from the results.
Near-duplicate detection:
The Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Many of these are legitimate copies; for instance, certain information repositories are mirrored simply to provide redundancy and access reliability. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads. The simplest approach to detecting duplicates is to compute, for each web page, a fingerprint that is a succinct (say 64-bit) digest of the characters on that page. Then, whenever the fingerprints of two web pages are equal, we test whether the pages themselves are equal and, if so, declare one of them to be a duplicate copy of the other. This simplistic approach fails to capture a crucial and widespread phenomenon on the Web: near duplication. In many cases, the contents of one web page are identical to those of another except for a few characters, say a notation showing the date and time at which the page was last modified. Even in such cases, we want to be able to declare the two pages to be close enough that we only index one copy. Short of exhaustively comparing all pairs of web pages, an infeasible task at the scale of billions of pages, how can we detect and filter out such near duplicates?
We now describe a solution to the problem of detecting near-duplicate web pages. The answer lies in a technique known as SHINGLING. Given a positive integer k and a sequence of terms in a document d, define the k-shingles of d to be the set of all consecutive sequences of k terms in d. As an example, consider the following text: a rose is a rose is a rose. The 4-shingles for this text (k = 4 is a typical value used in the detection of near-duplicate web pages) are "a rose is a", "rose is a rose" and "is a rose is". The first two of these shingles each occur twice in the text. Intuitively, two documents are near duplicates if the sets of shingles generated from them are nearly the same. We now make this intuition precise, and then develop a method for efficiently computing and comparing the sets of shingles for all web pages. Let S(dj) denote the set of shingles of document dj. The Jaccard coefficient measures the degree of overlap between the sets S(d1) and S(d2) as |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|; denote this by J(S(d1), S(d2)). Our test for near duplication between d1 and d2 is to compute this Jaccard coefficient; if it exceeds a preset threshold (say, 0.9), we declare them near duplicates and eliminate one from indexing.
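A direct sketch of k-shingling and the Jaccard test; the second document and the 0.9 threshold mentioned above are used only for illustration:

```python
def k_shingles(text, k=4):
    """Set of all consecutive k-term sequences in the text."""
    terms = text.lower().split()
    return {tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(set1, set2):
    """|intersection| / |union| of two shingle sets."""
    if not set1 and not set2:
        return 1.0
    return len(set1 & set2) / len(set1 | set2)

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a flower"
s1, s2 = k_shingles(d1), k_shingles(d2)
print(sorted(s1))                 # the three distinct 4-shingles of d1
print(round(jaccard(s1, s2), 2))  # near-duplicate if above a threshold such as 0.9
```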

However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.
Fig: Illustration of shingle sketches. We see two documents going through four stages of shingle sketch computation. In the first step (top row), we apply a 64-bit hash to each shingle from each document to obtain H(d1) and H(d2) (circles). Next, we apply a random permutation π to permute H(d1) and H(d2), obtaining Π(d1) and Π(d2) (squares). The third row shows only Π(d1) and Π(d2), while the bottom row shows the minimum values x1^π and x2^π for each document.
To avoid this, we use a form of hashing. First, we map every shingle into a hash value over a large space, say 64 bits. For j = 1, 2, let H(dj) be the corresponding set of 64-bit hash values derived from S(dj). We now invoke the following trick to detect document pairs whose sets H() have large Jaccard overlaps. Let π be a random permutation from the 64-bit integers to the 64-bit integers. Denote by Π(dj) the set of permuted hash values in H(dj); thus for each h in H(dj), there is a corresponding value π(h) in Π(dj). Let xj^π be the smallest integer in Π(dj). Then J(S(d1), S(d2)) = P(x1^π = x2^π).
We give the proof in a slightly more general setting: consider a family of sets whose elements are drawn from a common universe. View the sets as columns of a matrix A, with one row for each element in the universe. The element aij = 1 if element i is present in the set Sj that the j-th column represents. Let Π be a random permutation of the rows of A; denote by Π(Sj) the column that results from applying Π to the j-th column. Finally, let xj^π be the index of the first row in which the column Π(Sj) has a 1. We then prove that for any two columns j1 and j2, P(xj1^π = xj2^π) = J(Sj1, Sj2). If we can prove this, the theorem follows.
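The permutation trick can be simulated in a few lines. In this sketch, seeded hashing stands in for a true random permutation of 64-bit integers, and the two shingle sets are toy literals with a known Jaccard coefficient of 0.5:

```python
import random

def min_hashes(shingles, seeds):
    """For each seed (standing in for one random permutation of the 64-bit
    integers), keep the minimum hash value over the document's shingles."""
    return [min(hash((seed, s)) for s in shingles) for seed in seeds]

def estimate_jaccard(shingles1, shingles2, num_permutations=500):
    """Estimate J(S(d1), S(d2)) as the fraction of 'permutations' on which
    the two documents agree on the minimum value."""
    seeds = [random.getrandbits(64) for _ in range(num_permutations)]
    m1 = min_hashes(shingles1, seeds)
    m2 = min_hashes(shingles2, seeds)
    return sum(a == b for a, b in zip(m1, m2)) / num_permutations

# Two small shingle sets whose true Jaccard coefficient is 2/4 = 0.5:
s1 = {"a rose is a", "rose is a rose", "is a rose is"}
s2 = {"a rose is a", "rose is a rose", "is a rose was"}
print(round(estimate_jaccard(s1, s2), 2))  # close to 0.5 on most runs
```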

Fig: Two sets Sj1 and Sj2; their Jaccard coefficient is 2/5.

XML retrieval:
An XML document is an ordered, labeled tree. Each node of the tree is an XML ELEMENT and is written with an opening and a closing tag. An element can have one or more XML ATTRIBUTEs. In the XML document in Figure 1, the scene element is enclosed by the two tags <scene...> and </scene>. It has an attribute number with value vii and two child elements, title and verse. Figure 2 shows Figure 1 as a tree. The leaf nodes of the tree consist of text, e.g., Shakespeare, Macbeth, and Macbeth's castle. The tree's internal nodes encode either the structure of the document (title, act, and scene) or metadata functions (author).
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail...</verse>
    </scene>
  </act>
</play>
Figure 1: An XML document.

Figure 2: The XML document of Figure 1 as a simplified DOM object.
The standard for accessing and processing XML documents is the XML Document Object Model or DOM. The DOM represents elements, attributes and text within elements as nodes in a tree. XPath is a standard for enumerating paths in an XML document collection. We will also refer to paths as XML contexts. The XPath expression node selects all nodes of that name. Successive elements of a path are separated by slashes, so act/scene selects all scene elements whose parent is an act element. Double slashes indicate that an arbitrary number of elements can intervene on a path: play//scene selects all scene elements occurring in a play element. An initial slash starts the path at the root element: /play/title selects the play's title. For notational convenience, we allow the final element of a path to be a vocabulary term and separate it from the element path by the symbol #, even though this does not conform to the XPath standard. For example, title#"Macbeth" selects all titles containing the term Macbeth. A schema puts constraints on the structure of allowable XML documents for a particular application. Two standards for schemas for XML documents are XML DTD (document type definition) and XML Schema. Users can only write structured queries for an XML retrieval system if they have some minimal knowledge about the schema of the collection.
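These path expressions map closely onto Python's xml.etree.ElementTree, which supports a limited XPath subset; a small sketch over the Figure 1 document (ElementTree writes the descendant axis as .// rather than a leading //):

```python
import xml.etree.ElementTree as ET

play = ET.fromstring("""
<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail...</verse>
    </scene>
  </act>
</play>
""")

# act/scene : scene elements whose parent is an act element
print([s.get("number") for s in play.findall("act/scene")])

# play//scene : scene elements anywhere below the play root
print([s.find("title").text for s in play.findall(".//scene")])

# /play/title : the play's title (the parsed root is already <play>)
print(play.find("title").text)
```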

Fig: Tree representation of XML documents and queries.
A common format for XML queries is NEXI (Narrowed Extended XPath). As in XPath, double slashes indicate that an arbitrary number of elements can intervene on a path. The dot in a clause in square brackets refers to the element the clause modifies. The clause [.//yr = 2001 or .//yr = 2002] modifies //article; thus, the dot refers to //article in this case. Similarly, the dot in [about(., summer holidays)] refers to the section that the clause modifies.

10. What are the challenges in XML retrieval?
The first challenge in structured retrieval is that users want us to return parts of documents (i.e., XML elements), not entire documents as IR systems usually do in unstructured retrieval. If we query Shakespeare's plays for Macbeth's castle, should we return the scene, the act or the entire play? In this case, the user is probably looking for the scene. On the other hand, an otherwise unspecified search for Macbeth should return the play of this name, not a subunit. One criterion for selecting the most appropriate part of a document is the structured document retrieval principle: a system should always retrieve the most specific part of a document answering the query. However, it can be hard to implement this principle algorithmically. Consider the query title#"Macbeth" applied to Figure 2. The title of the tragedy (Macbeth) and the title of the scene (Macbeth's castle) are both good hits because they contain the matching term Macbeth. But in this case, the title of the tragedy, the higher node, is preferred.
In unstructured retrieval, it is usually clear what the right document unit is: files on your desktop, email messages, web pages on the web, etc. In structured retrieval, there are a number of different approaches to defining the indexing unit. One approach is to group nodes into non-overlapping pseudo-documents, as shown in the figure. In the example, books, chapters and sections have been designated to be indexing units, but without overlap.

Fig: Partitioning an XML document into non-overlapping indexing units.
The disadvantage of this approach is that pseudo-documents may not make sense to the user because they are not coherent units. For instance, the leftmost indexing unit in the figure merges three disparate elements: the class, author and title elements. The least restrictive approach is to index all elements. This is also problematic. Many XML elements are not meaningful search results, e.g., typographical elements like <b>definitely</b> or an ISBN number, which cannot be interpreted without context. Also, indexing all elements means that search results will be highly redundant. We call elements that are contained within each other nested. Returning redundant NESTED ELEMENTS in a list of returned hits is not very user-friendly. Because of the redundancy caused by nested elements, it is common to restrict the set of elements that are eligible to be returned, for example by discarding all small elements.
A challenge in XML retrieval related to nesting is that we may need to distinguish different contexts of a term when we compute term statistics for ranking, in particular inverse document frequency (idf) statistics. For example, the term Gates under the node author is unrelated to an occurrence under a content node like section if used to refer to the plural of gate. It makes little sense to compute a single document frequency for Gates in this example. One solution is to compute idf for XML-context/term pairs, e.g., to compute different idf weights for author#"Gates" and section#"Gates". Unfortunately, this scheme will run into sparse data problems.
In many cases, several different XML schemas occur in a collection, since the XML documents in an IR application often come from more than one source. This phenomenon is called schema heterogeneity or schema diversity and presents yet another challenge. As illustrated in the following figure, comparable elements may have

different names: creator in d2 vs. author in d3.
Fig: Schema heterogeneity: intervening nodes and mismatched names.
If we employ strict matching of trees, then q3 will retrieve neither d2 nor d3, although both documents are relevant. Some form of approximate matching of element names, in combination with semi-automatic matching of different document structures, can help here. Human editing of correspondences of elements in different schemas will usually do better than automatic methods. Schema heterogeneity is one reason for query-document mismatches like q3/d2 and q3/d3. Another reason is that users often are not familiar with the element names and the structure of the schemas of the collections they search, as mentioned earlier. We can also support the user by interpreting all parent-child relationships in queries as descendant relationships with any number of intervening nodes allowed. We call such queries extended queries.

11. What is the vector space model for XML retrieval?
We first take each text node (which in our setup is always a leaf) and break it into multiple nodes, one for each word. So the leaf node Bill Gates is split into two leaves, Bill and Gates. Next we define the dimensions of the vector space to be lexicalized subtrees of documents, that is, subtrees that contain at least one vocabulary term. A subset of these possible lexicalized subtrees is shown in the figure, but there are others, e.g., the subtree corresponding to the whole document with the leaf node Gates removed. We can now represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them.

Fig: A mapping of an XML document (left) to a set of lexicalized subtrees (right).
If we create a separate dimension for each lexicalized subtree occurring in the collection, the dimensionality of the space becomes too large. A compromise is to index all paths that end in a single vocabulary term, in other words, all XML-context/term pairs. We call such an XML-context/term pair a structural term and denote it by <c, t>: a pair of XML context c and vocabulary term t. The document in the above figure has nine structural terms. Seven are shown (e.g., "Bill" and Author#"Bill") and two are not shown: /Book/Author#"Bill" and /Book/Author#"Gates". The tree with the leaves Bill and Gates is a lexicalized subtree that is not a structural term. We ensure that retrieval results respect the user's preference for structurally close matches by computing a weight for each match. A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function:
CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise,
where |cq| and |cd| are the number of nodes in the query path and document path, respectively, and cq matches cd iff we can transform cq into cd by inserting additional nodes. Two examples from Figure 10.6 are CR(cq4, cd2) = 3/4 = 0.75 and CR(cq4, cd3) = 3/5 = 0.6, where cq4, cd2 and cd3 are the relevant paths from top to leaf node in q4, d2 and d3, respectively. The value of CR(cq, cd) is 1.0 if q and d are identical.
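A short sketch of the CR computation; the paths are given as lists of element names, and the concrete names for cq4, cd2 and cd3 are invented since the actual trees live in the referenced figure:

```python
def context_resemblance(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq can be turned into cd by
    inserting extra nodes, else 0. Paths are given as lists of element names."""
    def matches(query_path, doc_path):
        i = 0
        for node in doc_path:                 # greedy subsequence test
            if i < len(query_path) and query_path[i] == node:
                i += 1
        return i == len(query_path)
    if not matches(cq, cd):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))

# A 2-node query path against 3- and 4-node document paths (mirroring
# CR(cq4, cd2) = 3/4 and CR(cq4, cd3) = 3/5; the element names are illustrative):
cq4 = ["book", "title"]
cd2 = ["book", "chapter", "title"]
cd3 = ["book", "part", "chapter", "title"]
print(context_resemblance(cq4, cd2))  # 0.75
print(context_resemblance(cq4, cd3))  # 0.6
```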

The final score for a document is computed as a variant of the cosine measure, which we call SIMNOMERGE for reasons that will become clear shortly. SIMNOMERGE is defined as follows:
SIMNOMERGE(q, d) = Σ_{ck ∈ B} Σ_{cl ∈ B} CR(ck, cl) Σ_{t ∈ V} weight(q, t, ck) · weight(d, t, cl) / sqrt( Σ_{c ∈ B, t ∈ V} weight(d, t, c)² )
where V is the vocabulary of non-structural terms, B is the set of all XML contexts, and weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively. We compute the weights using one of the usual weightings, such as idft · wft,d. The inverse document frequency idft depends on which elements we use to compute dft. The similarity measure SIMNOMERGE(q, d) is not a true cosine measure, since its value can be larger than 1.0. The algorithm for computing SIMNOMERGE for all documents in the collection is shown in Figure 10.9.
Figure 10.9: The algorithm for scoring documents with SIMNOMERGE.
We give an example of how SIMNOMERGE computes query-document similarities in the figure. <c1, t> is one of the structural terms in the query. We successively retrieve all postings lists for structural terms <c, t> with the same vocabulary term t. Three example postings lists are shown. For the first one, we have CR(c1, c1) = 1.0, since the two contexts are identical. The next context has no context resemblance with c1: CR(c1, c2) = 0, and the corresponding postings list is ignored. The context match of c1 with c3 is 0.63 > 0 and it will be processed.

In this example, the highest ranking document is d9, since it receives contributions from both of the matching postings lists, weighted by CR(c1, c1) = 1.0 and CR(c1, c3) = 0.63 respectively. To simplify the figure, the query weight of <c1, t> is assumed to be 1.0.

12. What are web indexes? Explain the process with a suitable diagram.
Before an index can be used for query processing, it has to be created from the text collection. Building a small index is not particularly difficult, but as input sizes grow, some index construction tricks can be useful. In this section, we will look at simple in-memory index construction first, and then consider the case where the input data does not fit in memory. Finally, we will consider how to build indexes using more than one computer.
Pseudocode for a simple indexer is shown in the following figure. The process involves only a few steps. A list of documents is passed to the BuildIndex function, and the function parses each document into tokens. These tokens are words, perhaps with some additional processing, such as downcasing or stemming. The function removes duplicate tokens, using, for example, a hash table. Then, for each token, the function determines whether a new inverted list needs to be created in I, and creates one if necessary. Finally, the current document number, n, is added to the inverted list. The result is a hash table of tokens and inverted lists. The inverted lists are just lists of integer document numbers and contain no special information. This is enough to do very simple kinds of retrieval.
As described, this indexer can be used for many small tasks, for example indexing less than a few thousand documents. However, it is limited in two ways. First, it requires that all of the inverted lists be stored in memory, which may not be practical for larger collections. Second, this algorithm is sequential, with no obvious way to parallelize it. The primary barrier to parallelizing this algorithm is the hash table, which is accessed constantly in the inner loop. Adding locks to the hash table would allow parallelism for parsing, but that improvement alone will not be enough to make use of more than a handful of CPU cores. Handling large collections will require less reliance on memory and improved parallelism.
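A Python rendering of the simple indexer described above; tokenization here is plain lower-casing and whitespace splitting, with no stemming:

```python
def build_index(documents):
    """Simple in-memory indexer: documents is a list of text strings.
    Returns a dict mapping each token to a sorted list of document numbers."""
    index = {}                                    # token -> inverted list
    for n, document in enumerate(documents):
        tokens = set(document.lower().split())    # parse + downcase, drop duplicates
        for token in tokens:
            if token not in index:                # create a new inverted list if needed
                index[token] = []
            index[token].append(n)                # add current document number
    return index

docs = ["The cat sat on the mat", "The dog sat", "A cat and a dog"]
index = build_index(docs)
print(index["cat"])   # [0, 2]
print(index["sat"])   # [0, 1]
```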

Fig: Pseudocode for a simple indexer.

13. What is merging?
The classic way to solve the memory problem in the previous example is by merging. We can build the inverted list structure I until memory runs out. When that happens, we write the partial index I to disk, then start making a new one. At the end of this process, the disk is filled with many partial indexes, I1, I2, I3, ..., In. The system then merges these files into a single result. By definition, it is not possible to hold even two of the partial index files in memory at one time, so the input files need to be carefully designed so that they can be merged in small pieces. One way to do this is to store the partial indexes in alphabetical order. It is then possible for a merge algorithm to merge the partial indexes using very little memory. The following figure shows an example of this kind of merging procedure. Even though this figure shows only two indexes, it is possible to merge many at once. The algorithm is essentially the same as the standard merge sort algorithm. Since both I1 and I2 are sorted, at least one of them points to the next piece of data necessary to write to I. The data from the two files is interleaved to produce a sorted result.

Fig: An example of index merging. The first and second indexes are merged together to produce the combined index. Since I1 and I2 may have used the same document numbers, the merge function renumbers documents in I2.
This merging process can succeed even if there is only enough memory to store two words (w1 and w2), a single inverted list posting, and a few file pointers. In practice, a real merge function would read large chunks of I1 and I2, and then write large chunks to I in order to use the disk most efficiently. This merging strategy also suggests a possible parallel indexing strategy: if many machines build their own partial indexes, a single machine can combine all of those into a single, final index.
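A compact sketch of the merge step on two partial indexes held as sorted (term, postings) lists; a real implementation would stream chunks from disk rather than keep whole indexes in memory, as the text notes:

```python
def merge_indexes(i1, i2):
    """Merge two partial indexes whose terms are stored in sorted order.
    Each index is a list of (term, postings) pairs; postings from the second
    index are renumbered by 'offset' so document numbers stay unique."""
    offset = max((doc for _, postings in i1 for doc in postings), default=-1) + 1
    merged, a, b = [], 0, 0
    while a < len(i1) or b < len(i2):
        if b >= len(i2) or (a < len(i1) and i1[a][0] < i2[b][0]):
            merged.append(i1[a]); a += 1
        elif a >= len(i1) or i2[b][0] < i1[a][0]:
            term, postings = i2[b]
            merged.append((term, [d + offset for d in postings])); b += 1
        else:                                    # same term in both partial indexes
            term, p1 = i1[a]
            _, p2 = i2[b]
            merged.append((term, p1 + [d + offset for d in p2])); a += 1; b += 1
    return merged

i1 = [("cat", [0, 2]), ("dog", [1])]
i2 = [("cat", [0]), ("mat", [1])]
print(merge_indexes(i1, i2))
# [('cat', [0, 2, 3]), ('dog', [1]), ('mat', [4])]
```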


More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454 Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 18 October 2017 Some slides courtesy Croft et al. Web Crawler Finds and downloads web pages automatically provides the collection

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Executed by Rocky Sir, tech Head Suven Consultants & Technology Pvt Ltd. seo.suven.net 1

Executed by Rocky Sir, tech Head Suven Consultants & Technology Pvt Ltd. seo.suven.net 1 Executed by Rocky Sir, tech Head Suven Consultants & Technology Pvt Ltd. seo.suven.net 1 1. Parts of a Search Engine Every search engine has the 3 basic parts: a crawler an index (or catalog) matching

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Web Search Prof. Chris Clifton 17 September 2018 Some slides courtesy Manning, Raghavan, and Schütze Other characteristics Significant duplication Syntactic

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 20: Crawling Hinrich Schütze Center for Information and Language Processing, University of Munich 2009.07.14 1/36 Outline 1 Recap

More information

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson

Web Crawling. Introduction to Information Retrieval CS 150 Donald J. Patterson Web Crawling Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Robust Crawling A Robust Crawl Architecture DNS Doc.

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index.

A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program,

More information

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Web crawlers Retrieving web pages Crawling the web» Desktop crawlers» Document feeds File conversion Storing the documents Removing noise Desktop Crawls! Used

More information

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Overview Overview Introduction Classic

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling

More information

How Does a Search Engine Work? Part 1

How Does a Search Engine Work? Part 1 How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0 What we ll examine Web crawling

More information

US Patent 6,658,423. William Pugh

US Patent 6,658,423. William Pugh US Patent 6,658,423 William Pugh Detecting duplicate and near - duplicate files Worked on this problem at Google in summer of 2000 I have no information whether this is currently being used I know that

More information

3 Media Web. Understanding SEO WHITEPAPER

3 Media Web. Understanding SEO WHITEPAPER 3 Media Web WHITEPAPER WHITEPAPER In business, it s important to be in the right place at the right time. Online business is no different, but with Google searching more than 30 trillion web pages, 100

More information

How to Get Your Website Listed on Major Search Engines

How to Get Your Website Listed on Major Search Engines Contents Introduction 1 Submitting via Global Forms 1 Preparing to Submit 2 Submitting to the Top 3 Search Engines 3 Paid Listings 4 Understanding META Tags 5 Adding META Tags to Your Web Site 5 Introduction

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Jargon Buster. Ad Network. Analytics or Web Analytics Tools. Avatar. App (Application) Blog. Banner Ad

Jargon Buster. Ad Network. Analytics or Web Analytics Tools. Avatar. App (Application) Blog. Banner Ad D I G I TA L M A R K E T I N G Jargon Buster Ad Network A platform connecting advertisers with publishers who want to host their ads. The advertiser pays the network every time an agreed event takes place,

More information

Why is Search Engine Optimisation (SEO) important?

Why is Search Engine Optimisation (SEO) important? Why is Search Engine Optimisation (SEO) important? With literally billions of searches conducted every month search engines have essentially become our gateway to the internet. Unfortunately getting yourself

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

CS47300 Web Information Search and Management

CS47300 Web Information Search and Management CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page

More information

Indexing and Hashing

Indexing and Hashing C H A P T E R 1 Indexing and Hashing This chapter covers indexing techniques ranging from the most basic one to highly specialized ones. Due to the extensive use of indices in database systems, this chapter

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho, Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University {cho, hector}@cs.stanford.edu Abstract In this paper we study how we can design an effective parallel crawler. As the size of the

More information

Europcar International Franchisee Websites Search Engine Optimisation

Europcar International Franchisee Websites Search Engine Optimisation Introduction Everybody would like their site to be found easily on search engines. There is no magic that can guarantee this, but there are some principles that by following will help in your search engine

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

XML: Extensible Markup Language

XML: Extensible Markup Language XML: Extensible Markup Language CSC 375, Fall 2015 XML is a classic political compromise: it balances the needs of man and machine by being equally unreadable to both. Matthew Might Slides slightly modified

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Site Audit Boeing

Site Audit Boeing Site Audit 217 Boeing Site Audit: Issues Total Score Crawled Pages 48 % 13533 Healthy (3181) Broken (231) Have issues (9271) Redirected (812) Errors Warnings Notices 15266 41538 38 2k 5k 4 k 11 Jan k 11

More information

Scalable overlay Networks

Scalable overlay Networks overlay Networks Dr. Samu Varjonen 1 Lectures MO 15.01. C122 Introduction. Exercises. Motivation. TH 18.01. DK117 Unstructured networks I MO 22.01. C122 Unstructured networks II TH 25.01. DK117 Bittorrent

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Web Crawling Instructor: Rada Mihalcea (some of these slides were adapted from Ray Mooney s IR course at UT Austin) The Web by the Numbers Web servers 634 million Users

More information

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans.

A web directory lists web sites by category and subcategory. Web directory entries are usually found and categorized by humans. 1 After WWW protocol was introduced in Internet in the early 1990s and the number of web servers started to grow, the first technology that appeared to be able to locate them were Internet listings, also

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

Collection Building on the Web. Basic Algorithm

Collection Building on the Web. Basic Algorithm Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring

More information

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher NBA 600: Day 15 Online Search 116 March 2004 Daniel Huttenlocher Today s Class Finish up network effects topic from last week Searching, browsing, navigating Reading Beyond Google No longer available on

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Chapter 13 XML: Extensible Markup Language

Chapter 13 XML: Extensible Markup Language Chapter 13 XML: Extensible Markup Language - Internet applications provide Web interfaces to databases (data sources) - Three-tier architecture Client V Application Programs Webserver V Database Server

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

A Novel Interface to a Web Crawler using VB.NET Technology

A Novel Interface to a Web Crawler using VB.NET Technology IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 6 (Nov. - Dec. 2013), PP 59-63 A Novel Interface to a Web Crawler using VB.NET Technology Deepak Kumar

More information

Search Engine Optimization. MBA 563 Week 6

Search Engine Optimization. MBA 563 Week 6 Search Engine Optimization MBA 563 Week 6 SEARCH ENGINE OPTIMIZATION (SEO) Search engine marketing 2 major methods TWO MAJOR METHODS - OBJECTIVE IS TO BE IN THE TOP FEW SEARCH RESULTS 1. Search engine

More information

Webinar Series. Sign up at February 15 th. Website Optimization - What Does Google Think of Your Website?

Webinar Series. Sign up at  February 15 th. Website Optimization - What Does Google Think of Your Website? Webinar Series February 15 th Website Optimization - What Does Google Think of Your Website? March 21 st Getting Found on Google using SEO April 18 th Crush Your Competitors with Inbound Marketing May

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha

Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) Research and implementation of search engine based on Lucene Wan Pu, Wang Lisha Physics Institute,

More information

Site Audit SpaceX

Site Audit SpaceX Site Audit 217 SpaceX Site Audit: Issues Total Score Crawled Pages 48 % -13 3868 Healthy (649) Broken (39) Have issues (276) Redirected (474) Blocked () Errors Warnings Notices 4164 +3311 1918 +7312 5k

More information

AN SEO GUIDE FOR SALONS

AN SEO GUIDE FOR SALONS AN SEO GUIDE FOR SALONS AN SEO GUIDE FOR SALONS Set Up Time 2/5 The basics of SEO are quick and easy to implement. Management Time 3/5 You ll need a continued commitment to make SEO work for you. WHAT

More information

Web Development & Design Foundations with HTML5

Web Development & Design Foundations with HTML5 1 Web Development & Design Foundations with HTML5 CHAPTER 13 WEB PROMOTION 2 Learning Outcomes In this chapter, you will learn how to: Identify commonly used search engines and search indexes Describe

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

Search Quality. Jan Pedersen 10 September 2007

Search Quality. Jan Pedersen 10 September 2007 Search Quality Jan Pedersen 10 September 2007 Outline The Search Landscape A Framework for Quality RCFP Search Engine Architecture Detailed Issues 2 Search Landscape 2007 Source: Search Engine Watch: US

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

Site Audit Virgin Galactic

Site Audit Virgin Galactic Site Audit 27 Virgin Galactic Site Audit: Issues Total Score Crawled Pages 59 % 79 Healthy (34) Broken (3) Have issues (27) Redirected (3) Blocked (2) Errors Warnings Notices 25 236 5 3 25 2 Jan Jan Jan

More information

DATA MINING II - 1DL460. Spring 2017

DATA MINING II - 1DL460. Spring 2017 DATA MINING II - 1DL460 Spring 2017 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt17 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Search Engine Optimization (SEO) using HTML Meta-Tags

Search Engine Optimization (SEO) using HTML Meta-Tags 2018 IJSRST Volume 4 Issue 9 Print ISSN : 2395-6011 Online ISSN : 2395-602X Themed Section: Science and Technology Search Engine Optimization (SEO) using HTML Meta-Tags Dr. Birajkumar V. Patel, Dr. Raina

More information

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018

More information

FILTERING OF URLS USING WEBCRAWLER

FILTERING OF URLS USING WEBCRAWLER FILTERING OF URLS USING WEBCRAWLER Arya Babu1, Misha Ravi2 Scholar, Computer Science and engineering, Sree Buddha college of engineering for women, 2 Assistant professor, Computer Science and engineering,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Search Enginge Optimization (SEO) Proposal

Search Enginge Optimization (SEO) Proposal Search Enginge Optimization (SEO) Proposal Proposal Letter Thank you for the opportunity to provide you with a quotation for the search engine campaign proposed by us for your website as per your request.our

More information

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB CIS 601: Graduate Seminar Prof. S. S. Chung Presented By:- Amol Chaudhari CSU ID 2682329 AGENDA About Introduction Contributions Background

More information

Notes on Bloom filters

Notes on Bloom filters Computer Science B63 Winter 2017 Scarborough Campus University of Toronto Notes on Bloom filters Vassos Hadzilacos A Bloom filter is an approximate or probabilistic dictionary. Let S be a dynamic set of

More information

THE HISTORY & EVOLUTION OF SEARCH

THE HISTORY & EVOLUTION OF SEARCH THE HISTORY & EVOLUTION OF SEARCH Duration : 1 Hour 30 Minutes Let s talk about The History Of Search Crawling & Indexing Crawlers / Spiders Datacenters Answer Machine Relevancy (200+ Factors)

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives

Text Technologies for Data Science INFR11145 Web Search Walid Magdy Lecture Objectives Text Technologies for Data Science INFR11145 Web Search (2) Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Basics of Web search Brief History of web search SEOs Web Crawling (intro)

More information