Multimedia Retrieval. Chapter 3: Web Retrieval. Dr. Roger Weber, Computer Science / CS342 / Fall 2014


1 Multimedia Retrieval, Chapter 3: Web Retrieval
Dr. Roger Weber, roger.weber@credit-suisse.com
Computer Science / CS342 / Fall 2014

3.1 Motivation: The Problem of Web Retrieval
3.2 Size of the Internet and Coverage by Search Engines
3.3 Ordering of Documents
3.4 Context based Retrieval
3.5 Architecture of a Search Engine

2 3.1 Motivation: The Problem of Web Retrieval

Classical Retrieval vs. Web Retrieval:
- Collection: controlled set | uncontrolled, incomplete
- Size: small to large (1 MB - 20 GB [TREC]) | extremely large (text documents alone comprise more than 200 GB)
- Documents, Multimedia: homogeneous | heterogeneous (HTML, PDF, ASCII)
- Structure of documents: homogeneous | heterogeneous
- Links btw. documents: seldom (citations of other documents) | lots of links in documents
- Quality of documents: good to excellent | broad range of quality: poor grammar, wrong contents, incorrect, spamming, misspellings
- Queries: precise and more terms | short and imprecise, similarity search
- Results: small number of hits (<100) with good quality | large numbers of hits (>100,000) Page 3-2

3 The Internet grows rapidly The Internet is growing at a rate of 14% a year; every 5.32 years, the number of domains doubles (see figures above). Google's index contains more than 35 billion pages. In 2008, Google software engineers announced that they had discovered one trillion unique URLs. But the web is even larger! Many web sites address their content with infinite numbers of URIs. For example Google: if we assume a dictionary of 1 million terms, all combinations of two-term queries for Google yield a trillion unique URIs. Adding a third, a fourth, a fifth, and so on, term multiplies this number by 1 million each time. Page 3-3

4 The Problem of Ordering Lots of documents fulfill the typically small queries (2-5 terms). Mostly, result sets contain more than 100,000 documents with an RSV > 0, but not all result documents are relevant. Example: the query "ford" returns 911,000,000 hits with Google; the 1st rank is the homepage of the car manufacturer Ford. How is that possible? Google is based on Boolean retrieval! Search engines not only sort based on the RSV value: depending on the RSV function, only pages would appear which contain terms with the same frequencies as the query, contain the query terms most often, or contain all the query terms. Classical retrieval lacks a defense mechanism against spamming! Page 3-4

5 3.2 Size of the Internet and Coverage by Search Engines How to compute the number of web servers connected to the Internet [Giles99]: Assumption: IP addresses of web servers are evenly distributed in the 32-bit address space. Approach: choose N random IP addresses and access the root web page of each server. Let M be the number of found pages (i.e., found web servers); then M/N is the density of the coverage in the IP address space. Giles [1999]: M/N ≈ 1/269. This finally leads to 2^32 * M/N ≈ 16.0 million web servers [Date: July 1999]. Problem: this estimation also counts devices that are managed via a web front end, e.g., routers. Page 3-5
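The arithmetic behind this estimate is a one-liner; a minimal sketch, using the density 1/269 reported by Giles:

```python
# Estimate the number of web servers from random IP sampling [Giles99].
# Assumption: web-server IPs are evenly distributed in the 32-bit address space.
density = 1 / 269                # observed hit density M/N (Giles, 1999)
servers = 2**32 * density        # scale the sample density to the full space
print(f"estimated web servers: {servers / 1e6:.1f} million")  # ≈ 16.0 million
```

Note that the sample size N cancels out: only the observed density M/N matters for the estimate.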

6 Estimating the number of web pages by overlap analysis [Bharat98]: Assumption: search engines A and B index subspaces of the web independently of each other, and each index subspace is a random sample of the web. Then p(A ∩ B) = p(A) * p(B). With |A| = N * p(A), |B| = N * p(B), and |A ∩ B| = N * p(A ∩ B), it follows that N = |A| * |B| / |A ∩ B|. |A| and |B| are known for some search engines; the ratio |B| : |A ∩ B| is estimated with selected queries. Note: the assumption does not generally hold because search engines often start crawling at the same starting points (e.g., yahoo.com). Hence, the above estimation leads to a lower bound for the real number of web pages. Page 3-6

7 Estimating the number of web pages (2) Procedure: |A| and |B| are known for many search engines. Determine the document frequency for terms in a sufficiently large sample of web pages. Let l = 0; repeat k times: query search engine B with a random query and select an arbitrary page in the result; build a query with terms appearing in the selected document that have the smallest document frequencies (select several terms to steer the query result size); query the other search engine (A) with these terms (due to the selected terms with small document frequencies, the result set is small); increase l if the page is also found with search engine A. Estimate the ratio |B| : |A ∩ B| with k : l and determine N_AB = |A| * |B| / |A ∩ B| = |A| * k / l. Compute N_AB for different combinations of search engines and estimate the total number of web pages by the average over all N_AB values. Page 3-7
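The estimator from the procedure above can be sketched in a few lines; the index size and the sample counts are made-up illustration values:

```python
def estimate_web_size(size_a, k, l):
    """Capture-recapture estimate of the web's size [Bharat98].

    size_a: reported index size of engine A.
    k: number of random pages sampled from engine B.
    l: how many of the k sampled pages were also found in A.
    Since |A ∩ B| ≈ |B| * l / k, the estimate N = |A| * |B| / |A ∩ B|
    simplifies to |A| * k / l (the unknown |B| cancels out).
    """
    if l == 0:
        raise ValueError("no overlap observed; sample more pages")
    return size_a * k / l

# hypothetical figures: engine A reports 150M pages; 50 of 200 random
# pages drawn from engine B are also found in A
print(estimate_web_size(150e6, k=200, l=50))  # 600 million pages
```

Averaging this estimate over several engine pairs, as the slide describes, damps the error of any single pair.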

8 Some key figures
Dec 1997: >320M pages. Some search engines indexed about 1/3 of the Web; the biggest 6 engines together indexed around 60% of the Web.
Feb 1999: 800M pages. Some engines indexed 16% of the Web; the largest 11 engines together indexed around 42% of the Web. 2.8M public web servers, 16M web servers in total. A page had an average size of 18.7 KB (excluding embedded images); 14 TBytes of data.
Jan 2000: >1B pages. Coverage of some search engines between 10%-15%. 6.4M public web servers (with 2.2M mirrored servers); about 60% Apache, 25% MS-IIS.
End of 2000: 3-5B pages. Coverage of search engines still between 10%-15%; Google outperforms all others in terms of coverage: 30% (also includes pages that Google crawlers only know from references). TBytes of data (HTML format). BrightPlanet distinguishes between the surface and the deep Web; the "surface web" subsumes all public pages, while the deep web also subsumes private pages and dynamic pages (phone book, e-banking, etc.): 800B pages in the deep Web, more than 8000 TBytes of data.
2005: 8B pages. The deep Web has further grown; accurate estimates are no longer available.
2010: Google has >35B pages and reports to have seen more than 1 trillion unique URIs. Page 3-8

9 Last reported index sizes of search engines [Source: SearchEngineWatch, 2005]:
Search Engine | Reported Size | Page Depth
Google        | 8.1 billion   | 101K
MSN           | 5.0 billion   | 150K
Yahoo         | 4.2 billion   | 500K
Ask Jeeves    | 2.5 billion   | 101K+
Page depth: maximum length of the indexed part of a found document. Google, for instance, only indexes the first 101 kilobytes of a page. More recent figures: Google's index had 8B entries at the end of 2005 (according to its homepage); today, no index sizes are published any more (the end of the search index size war with Yahoo). Basic idea for more current figures: use a keyword like "the" with a known frequency to appear on any web page (67%). 2010: 23 billion hits, hence a size of about 34 billion pages (2006: 14.4 billion hits, about 22 billion pages). Page 3-9
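The keyword trick in the last bullet is simple proportional scaling; a sketch, where the 67% coverage of "the" is the slide's own assumption:

```python
def index_size_from_term(hits, coverage):
    """Scale the hit count of a very frequent term up to the full index size."""
    return hits / coverage

# 2010: "the" returned 23 billion hits and is assumed to occur on 67% of pages
print(f"{index_size_from_term(23e9, 0.67) / 1e9:.0f} billion pages")  # ≈ 34 billion
```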

10 3.3 Ordering of Documents Ranking of Google (as far as documented): even though it is based on Boolean retrieval, it has good precision. Today, most search engines use similar methods, but the details are kept secret. Ranking already starts with extracting the right information from documents: Google extracts the positions of terms in a document, the relative font size, the visual attributes (bold, italic), and the context of the page (terms in URL, title, meta-tags, text of references); the text of references (e.g., between <A>...</A>) is also assigned to the referenced document. The ranking consists of the following factors: proximity of terms, i.e., distance between occurrences of distinct query terms; positions in the document (URL, text of references, title, meta-tag, body); size of font and visual attributes; PageRank; further criteria (advertisements, pushed content). Page 3-10

11 3.3.1 Proximity of Terms Query: White House. Document 1: "the white car stands in front of the house" (not relevant). Document 2: "the president entered the White House" (relevant). The closer the query terms are, the more relevant the text is. Implementation in the Google prototype: each position pair is assigned one of 10 proximity values; the frequencies of these proximity values form the proximity vector; multiplying this vector with a weighting vector (inner vector product) yields the overall proximity value for the document. Page 3-11

12 Example: hit list["white"] = {1, 81, 156}, hit list["house"] = {2, 82, 115, 157}. Position pairs: {(1,2), (1,82), (1,115), (1,157), (81,2), (81,82), (81,115), ...}. The pairs are mapped to proximity classes; e.g., (1,2) and (81,82) are adjacent, while (1,157) is far away. This yields the proximity vector p = [3, 0, 0, 1, 1, 0, 0, 1, 2, 3], i.e., per class: 1 (adjacent): 3; 2 (close): 0; 3: 0; 4: 1; 5 (nearby): 1; 6: 0; 7: 0; 8 (distant): 1; 9: 2; 10 (far away): 3. Overall proximity of the document with w = [1.0, 0.9, 0.8, 0.7, ..., 0.1]: p^T w = 5.6. Page 3-12
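The computation can be sketched as follows; since the exact mapping from term distance to the ten proximity classes is not documented, the `bucket` function below is a hypothetical stand-in, so the resulting vector differs from the slide's example:

```python
def proximity_vector(hits_a, hits_b, bucket, n_classes=10):
    """Histogram of proximity classes over all position pairs."""
    p = [0] * n_classes
    for a in hits_a:
        for b in hits_b:
            p[bucket(abs(a - b)) - 1] += 1
    return p

def proximity_score(p, w):
    """Overall proximity value: inner product p^T w."""
    return sum(pi * wi for pi, wi in zip(p, w))

# hypothetical bucketing: distance 1 is "adjacent" (class 1), larger
# distances fall into coarser classes up to class 10 ("far away")
def bucket(distance):
    return 1 if distance <= 1 else min(1 + distance // 20, 10)

p = proximity_vector([1, 81, 156], [2, 82, 115, 157], bucket)
w = [1.0 - 0.1 * i for i in range(10)]    # w = [1.0, 0.9, ..., 0.1]
print(p, round(proximity_score(p, w), 1))
```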

13 3.3.2 Positions in the Document, Size of Font, and Visual Attributes Queries typically aim at the title (heading) of a document, e.g., "White House" instead of "Central Executive Place". Often users look for brands, persons, or firms. Text references in external documents describe the contents very accurately: e.g., the query "eth lausanne" is answered by Google with the home page of EPFL, although that page does not contain the term "ETH". Conclusion: documents containing the query terms in the title, with special visual attributes (large size, heading, bold), or in reference texts linking to that document appear more relevant than documents that contain the terms just in the body ("I work for ETH Lausanne"). Google counts the occurrences of terms along the dimensions described above, multiplies the frequencies with well-chosen weights, and sums these values to a second relevance value for the document. Further, it contains mechanisms to cut off spamming. Page 3-13

14 Implementation in the Google prototype: for each position type (<TITLE>, <META>, <P>, <B>, <I>, <H1>, <H2>, reference text), Google records the frequency, a limited (capped) frequency, and a weight. Impact: Google is able to find pages given a brand, name, or firm that are highly relevant to the user. Spamming: if a page contains a term too often, the page gets ignored (e.g., if a term contributes to more than 10% of the text then it's spam). Page 3-14
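A sketch of such a weighted count; the weights and the frequency cap below are purely hypothetical, since the real values were never published:

```python
# Position-based relevance value: capped term frequencies per position
# type, weighted and summed. All weights and the cap are assumptions.
WEIGHTS = {"title": 10.0, "meta": 5.0, "reference": 8.0,
           "h1": 6.0, "b": 3.0, "i": 2.0, "body": 1.0}
FREQ_CAP = 8   # capping means sheer repetition (spam) stops paying off

def position_score(freqs):
    """freqs: mapping position type -> raw term frequency in that position."""
    return sum(WEIGHTS[pos] * min(f, FREQ_CAP) for pos, f in freqs.items())

print(position_score({"title": 1, "body": 4, "b": 2}))  # 10 + 4 + 6 = 20.0
print(position_score({"body": 500}))                     # capped: 8.0, not 500
```

The cap implements the anti-spamming idea from the slide: beyond a point, repeating a term contributes nothing.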

15 3.3.3 PageRank Which page better suits the query "Uni Basel" and why? Page 3-15

16 A Preliminary Model (not yet PageRank) Idea: count the number of incoming links; the more incoming links, the more important that page is, because it is more likely that a surfer lands on that page. Example with incoming-link counts A = 1, B = 1, C = 6, D = 3: C is the most important page, ahead of D, A, and B. Problems: not every page is equally important, and neither is every link. Spamming. Page 3-16

17 Computing the PageRank of a Page Improved idea: a random surfer clicks with probability p on an outgoing link of the current page; with probability (1-p), the surfer selects an arbitrary page (bookmark, URL). The PageRank of a page is given by the probability that a random surfer lands on the page (after a number of steps). How to compute this probability: Notation: A is an arbitrary page; L(A) is the set of pages which have a reference to A; N(A) is the number of outgoing links of page A; PR(A) is the PageRank of page A; p is the probability that a surfer follows an outgoing link (p ∈ [0,1]). Definition of PageRank: PR(A) = (1 - p) + p * Σ_{B ∈ L(A)} PR(B) / N(B). Page 3-17

18 Intuitive Explanation of the Formula The value of a link is given by the PageRank of the source document divided by the number of outgoing links on that page. This simulates the freedom of the random surfer to follow any link on that page. The term (1-p) + p*... denotes the freedom of the surfer to follow a link with probability p or to jump to an arbitrary page with probability 1-p. Example: A and C can have the same PageRank although A has only one incoming link instead of two as C; a single link from an important page contributes as much as several links from less important pages. Page 3-18

19 Computation The formula is recursive! The PR() values can be computed by a fixed-point iteration; experiments showed that the computation is minimal compared to the crawling effort (only a few iterations required). Approach: 1. Assign arbitrary initial values PR(A) for all documents A. 2. Compute PR'(A) (left-hand side of the equation) according to the formula above for all documents A. 3. If |PR'(A) - PR(A)| becomes sufficiently small, then PR(A) = PR'(A) is the solution; otherwise let PR(A) = PR'(A) and repeat from step 2. Solving the fixed-point iteration takes only a few iterations (<100), and the computational effort is minimal (several hours). Page 3-19
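The fixed-point iteration can be sketched directly from the formula; the value p = 0.85 and the tiny graph are illustrative choices:

```python
def pagerank(links, p=0.85, eps=1e-8):
    """Fixed-point iteration for PR(A) = (1-p) + p * sum_{B in L(A)} PR(B)/N(B).

    links: dict mapping each page to the list of pages it links to.
    Pages without outgoing links are ignored as sources for brevity.
    """
    pages = set(links) | {q for qs in links.values() for q in qs}
    pr = {page: 1.0 for page in pages}           # arbitrary initial values
    while True:
        new = {page: 1 - p for page in pages}
        for src, outs in links.items():
            share = p * pr[src] / len(outs)      # each out-link gets PR(src)/N(src)
            for dst in outs:
                new[dst] += share
        if max(abs(new[q] - pr[q]) for q in pages) < eps:
            return new
        pr = new

# toy graph: A -> B, A -> C, B -> C, C -> A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # C collects the most rank in this graph
```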

20 Application PageRank derives a total ordering for web pages independent of the current query and its terms. Google uses PageRank in combination with other criteria. PageRank is robust against spamming, i.e., against manipulations to push the PageRank of a page: even if many links point to a page, this does not necessarily imply its importance and a high PageRank. Ordering documents only based on PageRank would be fatal: let A be the document with the highest PageRank; all queries with terms contained in A would be answered with A as the best document, even though more relevant documents exist. Page 3-20

21 3.3.4 Other Criteria to Order Documents Bought ranking positions: a number of search engines get money for placing pages at top positions (advertisements) for certain query terms. Examples: advertisements, RealName. Page 3-21

22 Length of URL A query such as "ford" may be answered with documents having long URLs (ces&level2=rentalsfromdealers). Search engines rank documents with short URLs at higher positions, as they are more likely to be homepages/entry pages for the information need. User feedback: direct hit used the feedback of users to rank documents. Internally, relevance based on feedback is stored similarly to the PageRank: if a document in the result list is clicked and visited, its relevance value is increased, otherwise decreased. Google experimented with feedback. Page 3-22

23 3.3.5 Overall Ranking All search engines use and combine different criteria. In Google, the most prominent ones are: proximity of terms; relevance values based on position in document, font size, and visual attributes; PageRank. The total relevance for a document results from summing up the relevance values of the individual criteria (with appropriate weighting). How to obtain those weights and which criteria to apply, however, remains a secret of the search engine providers. Page 3-23

24 3.4 Context based Retrieval Observations: the web contains many pages addressing a specialized topic (e.g., Star Wars). E.g.: a directory page lists many web sites devoted to the "Star Wars" movies (all sites cover the same topic); another lists web sites of different car brands and car manufacturers (all sites cover similar/related topics). Two approaches: What's Related, and Hubs and Authorities. Consequently: improve search results by explicitly taking context information into account (as described in the examples above). Similarly: determine the context of the query, possibly by asking the user for more information (Teoma, AskJeeves, Gigablast). Page 3-24

25 3.4.1 Hubs and Authorities A page is a so-called hub for a query Q if it contains many links to pages that are relevant to the query. A page is a so-called authority for a query Q if it is relevant for Q, i.e., it provides the necessary information to answer the information need. Typically, one can identify and distinguish hubs and authorities based on their link structure: a hub points to many pages relevant to query Q, while an authority is pointed to by such pages. Page 3-25

26 Additionally, we observe: a good hub points to many good authorities, and a good authority is referenced by many good hubs. Based on hub-authority relationships, a search engine becomes able to identify relevant documents which do not contain the query terms. Example: a query such as "looking for car manufacturers" will not lead a user to the homepages of Honda, VW, or Ford. With a hubs/authorities analysis, it becomes possible to answer even such queries directly. Idea of Kleinberg [1997]: HITS algorithm. The Web is a graph G = (V,E) with the vertices V denoting the set of documents and the edges E denoting the links (from source to destination). Let (p,q) ∈ E; then document p references document q. Step 1: for a query Q, determine the first t (e.g. t=200) documents with the help of a search engine. The set obtained in this step is called the root set. For this set, we observe that it contains many relevant documents, but that it does not contain all the good hubs and authorities. Page 3-26

27 Step 2: Extend the root set with documents referenced by a document in the root set and with documents pointing to documents in the root set. The resulting set of documents is the so-called base set. To limit the size of the base set, one can restrict the number of documents added to d (e.g. 50) per element of the root set. Links within the same domain are removed, as they frequently only serve as navigation aids. Page 3-27

28 Step 3: Compute the hub values h(p) and authority values a(p) for each document p. Thereby, the numbers of incoming and outgoing links play a central role in the computation. A simple solution could be: a(p) = |{q : (q,p) ∈ E}| and h(p) = |{q : (p,q) ∈ E}|. A better idea: a good hub references many good authorities, and a good authority is linked by many good hubs. a(p) and h(p) are always normalized: Σ_{p ∈ V} a(p)^2 = 1 and Σ_{p ∈ V} h(p)^2 = 1. Initialization: all pages start with the same values for a(p) and h(p). Iteration: the new values are computed based on the old ones: a(p) = Σ_{(q,p) ∈ E} h(q) and h(p) = Σ_{(p,q) ∈ E} a(q). Repeat the iteration (including normalization) until convergence is reached. Note: normalization of the new values must be established after each step. Page 3-28
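The iteration with normalization can be sketched over a toy edge set (the page names are made up for illustration):

```python
from math import sqrt

def hits(edges, iterations=20):
    """HITS iteration: a(p) = sum of h(q) over edges (q,p); h(p) = sum of
    a(q) over edges (p,q); both vectors are L2-normalized after every step."""
    nodes = {n for e in edges for n in e}
    a = {n: 1.0 for n in nodes}   # all pages start with the same values
    h = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        a = {p: sum(h[q] for (q, t) in edges if t == p) for p in nodes}
        h = {p: sum(a[q] for (s, q) in edges if s == p) for p in nodes}
        na = sqrt(sum(v * v for v in a.values())) or 1.0
        nh = sqrt(sum(v * v for v in h.values())) or 1.0
        a = {n: v / na for n, v in a.items()}
        h = {n: v / nh for n, v in h.items()}
    return a, h

# hypothetical base set: two hubs pointing at two authorities
edges = [("hub1", "auth1"), ("hub1", "auth2"), ("hub2", "auth1")]
a, h = hits(edges)
print(max(a, key=a.get), max(h, key=h.get))  # auth1 hub1
```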

29 Step 4: Compute the result. If the user asks for overview pages, return the k documents having the largest hub values h(p); if the user asks for content pages, return the k documents having the largest authority values a(p). Notes: the user is empowered to choose between hubs and authorities. The iterative algorithm takes only a few steps (10-20) to determine the values a(p) and h(p). Implementation: a simple procedure is based on the availability of a search engine: evaluating the query with the search engine determines the root set; determine referenced documents by downloading and parsing all documents in the root set; determine incoming links and their sources with the help of queries of the type "link:u", with u denoting the URL of a document in the root set. Page 3-29

30 Extensions of HITS (Henzinger, 1998) The HITS algorithm suffers from three fundamental problems: 1. If all pages in a domain reference the same external page, that page becomes too strong an authority. Similarly, if a page links to many different pages in the same domain, that page becomes too strong a hub. 2. Automatically established links, e.g., advertisements, banners, or links to the provider/host/designer of a web site, become wrong authorities. 3. Queries such as "jaguar car" tend to lead to pages about cars in general and to hubs containing links to different manufacturers. More precisely, the more frequent term "car" dominates the more infrequent term "jaguar". Page 3-30

31 Improvements: Problem 1: the same author (= same domain) has only one "vote" for an external page. Similarly, a document has only one "vote" when referencing documents in the same domain. If k pages p_i of the same domain reference a document q, we weight the links with a value of aw(p_i, q) = 1/k for each edge (p_i, q). If a page p references l pages q_i in the same domain, we weight the links with a value of hw(p, q_i) = 1/l for each edge (p, q_i). Adjust the iteration step with these weights: a(p) = Σ_{(q,p) ∈ E} aw(q,p) * h(q) and h(p) = Σ_{(p,q) ∈ E} hw(p,q) * a(q). Page 3-31
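The weighting rule can be sketched as follows; the URLs and the domain function are illustrative assumptions:

```python
from collections import Counter

def edge_weights(edges, domain):
    """aw(p,q) = 1/k when k pages of one domain all link to q;
    hw(p,q) = 1/l when p links to l pages inside one domain."""
    aw_groups = Counter((domain(p), q) for p, q in edges)
    hw_groups = Counter((p, domain(q)) for p, q in edges)
    aw = {(p, q): 1 / aw_groups[(domain(p), q)] for p, q in edges}
    hw = {(p, q): 1 / hw_groups[(p, domain(q))] for p, q in edges}
    return aw, hw

# hypothetical link graph; the domain is taken to be the host part of the URL
edges = [("a.com/1", "t.org/x"), ("a.com/2", "t.org/x"),
         ("a.com/1", "t.org/y"), ("b.net/1", "t.org/x")]
aw, hw = edge_weights(edges, domain=lambda url: url.split("/")[0])
print(aw[("a.com/1", "t.org/x")])  # 0.5: two a.com pages share one "vote"
print(aw[("b.net/1", "t.org/x")])  # 1.0: b.net's single vote keeps full weight
```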

32 Problems 2 and 3: to deal with these problems, we eliminate nodes from the graph that obviously do not match the query or its relevant documents. To this end, we perform an artificial query against the base set: The query is described with the first few terms of all documents in the root set (e.g., the first 1000 terms of all documents). Query and documents are mapped to vectors according to the tf-idf weighting scheme of vector space retrieval. The similarity s(p) between the query and a document p is given by the cosine measure. For a given threshold t, eliminate all nodes/documents in the graph with s(p) < t. We obtain a good threshold by applying one of the following methods: t = median of all s(p) values; t = median of the s(p) values of documents in the root set; t = 1/10 * max s(p). The similarities also enter the iteration: a(p) = Σ_{(q,p) ∈ E} aw(q,p) * s(q) * h(q) and h(p) = Σ_{(p,q) ∈ E} hw(p,q) * s(q) * a(q). Page 3-32

33 Discussion: The HITS algorithm outperforms conventional web search results. The extensions of the HITS algorithm further improve precision by more than 45%. The main issues with the HITS algorithm are large evaluation costs and long retrieval times (30 seconds up to several minutes). In contrast to PageRank, ranking with HITS depends on the query; PageRank gives a total order of the documents for all queries. Page 3-33

34 3.4.2 What's Related The basic idea of Alexa's What's Related was to identify related documents for existing documents. The definition of "related", however, was not based on similarity between documents but on similarity between the topics addressed in these documents, potentially from different angles. Analogously to What's Related, Google provides "Similar Pages" for its result entries. The approaches to compute the relationships differ significantly between the two systems: Alexa used crawlers and data mining tools to determine related pages; moreover, it spied on the surf patterns of its users (which pages has the user visited; which search results did the user investigate in more detail). Google relies entirely on the analysis of the link structure of web pages to derive the related documents. Two approaches were published by Google experts. Page 3-34

35 Companion Algorithm (Dean, Henzinger, 1999) The more complex approach is based on the extended HITS algorithm: given a URL u, the algorithm must find documents related to that page u. Notation: if page w references page v, we call w the parent page of v, and we call v the child page of w. Step 1: build a directed graph in the "neighborhood" of u. That graph contains the following nodes: u; at most b parent pages of u and, for each parent page, at most bf child pages; at most f child pages of u and, for each child page, at most fb parent pages. Step 2: Merge duplicates and "near-duplicates". Two documents are "near-duplicates" if they contain more than 10 links and 95% of the contained links appear in both documents. Page 3-35
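The near-duplicate test from Step 2 can be sketched directly; the thresholds (more than 10 links, 95% overlap) come from the slide, the link lists are made up:

```python
def near_duplicates(links_a, links_b):
    """Two documents are near-duplicates if both contain more than 10 links
    and at least 95% of the links appear in both documents."""
    a, b = set(links_a), set(links_b)
    if len(a) <= 10 or len(b) <= 10:
        return False
    shared = len(a & b)
    return shared / len(a) >= 0.95 and shared / len(b) >= 0.95

links = [f"http://example.org/{i}" for i in range(20)]   # hypothetical links
print(near_duplicates(links, links[:19] + ["http://other.org/"]))  # True (19/20)
print(near_duplicates(links, links[:12]))                          # False
```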

36 Step 3: Assign weights to the edges in the graph, similar to the extension of the HITS algorithm: if k edges of documents within the same domain point to the same external page, these edges obtain a weight of 1/k; and if a document contains l edges to pages within the same domain, each of these edges obtains a weight of 1/l. Step 4: Determine hubs and authorities for the nodes of the graph according to the extension of the HITS algorithm (but without similarity weighting), i.e., a(p) = Σ_{(q,p) ∈ E} aw(q,p) * h(q) and h(p) = Σ_{(p,q) ∈ E} hw(p,q) * a(q). Step 5: Determine the result: the pages with the highest authority weights (except for u) are the so-called related pages to u. Page 3-36

37 Co-citation Algorithm (Dean, Henzinger, 1999) This simpler approach counts how often a page u is referenced together with a page q. The page with the most frequent co-citations is the so-called most related page to u. Step 1: Determine at most b parent pages of u. Step 2: Determine for each parent page at most bf child pages, with the links to these child pages being close to the link to u. All these pages are siblings of u. Step 3: Determine the pages q_i that are referenced most frequently together with u. Step 4: If steps 1-3 result in fewer than 15 co-citations with a frequency of 2 or higher, repeat the search with successively shorter prefixes of the URL of u. Page 3-37
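Steps 1-3 can be sketched as a counting pass over a toy graph; the accessor signatures and the example pages are assumptions of this sketch:

```python
from collections import Counter

def co_citations(u, parents_of, children_of, b=8, bf=10):
    """Count pages referenced together with u (co-citation, Dean/Henzinger).
    parents_of/children_of are graph accessors; b and bf cap the fan-in/out."""
    counts = Counter()
    for parent in parents_of(u)[:b]:
        for sibling in children_of(parent)[:bf]:
            if sibling != u:
                counts[sibling] += 1
    return counts.most_common()

# toy graph as a dict: page -> list of pages it links to (hypothetical)
graph = {"p1": ["u", "q", "r"], "p2": ["u", "q"], "p3": ["u", "r"]}
parents = lambda page: [p for p, links in graph.items() if page in links]
children = lambda page: graph.get(page, [])
print(co_citations("u", parents, children))  # q and r are each co-cited twice
```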

38 Discussion: The approaches of Dean and Henzinger work much better than Alexa on average. Since no information about how Alexa works is available, no qualitative or quantitative assessments are possible. Although Henzinger worked for Google, it is not clear which algorithms Google uses to find related pages. Page 3-38

39 3.5 Architecture of a Search Engine A search engine consists of the following main components: Crawler/Repository, Feature Extractor, Indexer, Sorter, Feedback Component, User Interface. Page 3-39

40 Google architecture (Brin/Page), main components: URL Server, Crawler, Store Server, Repository, Indexer (extracting Anchors), URL Resolver, Links, Doc Index, Barrels, Sorter, Lexicon, PageRank, Searcher. Page 3-40

41 3.5.1 Main Problem: Scalability The data problem: search engines must deal with enormous data sizes. Assumption: the size of a web page is 10 KB; extraction returns 1 KB of data to index. Google's search index contains at least 35B pages (2010): cache size for documents: 350 TB; size of inverted lists: 35 TB. Google uses a huge number of (commodity) PCs, but the disk space per PC is only 100 GB. Problem: how to search through 35 TB if a single machine only stores 100 GB? How to organize such a cluster given the frequent updates and the enormous search frequencies? How do you assign identifiers for 35B pages? Page 3-41

42 The retrieval problem: there is no service window to update the software or the data; users search all the time. Google: 250M queries per day ≈ 10M queries per hour ≈ 3000 queries per second! Daily peaks may be much higher. With 35 TB of index data, a single entry (= term) in the inverted lists consumes more than 5 GB (e.g., the term "house" returns 3B hits with Google). A typical IDE disk performs at 50 MB/s: how long does it take to search through 5 GB of data? How do you reduce search times given the high IO load (it is not meaningful to cache the data in memory)? The query "house" returns 3B hits, and hits have to be sorted according to their relevance. With one term this can be pre-computed, but with several terms? An average PC needs quite some time to sort 3B hits even if the scores are already computed; computing the scores takes yet another "eternity". How can we decrease search times? By parallelization? You don't need the entire list, only the top hundreds: does that help? Page 3-42

43 The crawling problem: how do you fetch 35B pages in a reasonable time? DNS lookups are expensive; a central DNS server would not allow you to scale enough (the maximum number of downloads per second is limited). Google exchanged/exchanges its index once every month; this puts an upper limit on the time: downloading 35B pages in a month means roughly 13,500 pages per second! With a web page of 10 KB, this amounts to about 135 MB/s; you most probably need several gigabit connections to the Internet. And: servers on the Internet have different response times. Incremental crawling: important pages such as newspapers have to be read daily to support queries on "hot topics"; news information with an age of 1-2 months is not interesting any more; an incremental crawler has to update the index of important pages that change frequently. Google's incremental crawler is called freshbot. How do you select important pages, and how can you update the inverted lists while 3000 concurrent queries run? Do you require ACID properties? Page 3-43

44 Google: success due to addressing the scalability problem from the beginning. The following considerations are based on papers and presentations by Google and on speculations in web search forums. Page 3-44

45 3.5.2 Crawling tricks and musts DNS lookups: keep a local cache on each crawl server. Be nice to web servers! Follow these rules: only a few requests per minute to the same server! Do not follow cgi-links, as executing cgi scripts usually is expensive; moreover, you may easily activate/interact with/change the page (order goods, games, forums, ...). Read and obey robots.txt (e.g., disallow: /cgi-bin/) and filter "critical" URIs. Note: there are many more pages than you can ever crawl, so it does not matter if you miss some of them; but do not miss the important ones! A single machine is not sufficient for crawling! A single connection to the Internet is not sufficient! Page 3-45
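The robots.txt rule can be honored with Python's standard library; a minimal sketch that parses a hypothetical robots.txt fetched earlier (no network access needed here):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as fetched from the target server.
robots = RobotFileParser()
robots.parse("""\
User-agent: *
Disallow: /cgi-bin/
""".splitlines())

def may_crawl(url, agent="*"):
    """Check a URL against the parsed robots.txt before fetching it."""
    return robots.can_fetch(agent, url)

print(may_crawl("http://example.com/page.html"))     # True
print(may_crawl("http://example.com/cgi-bin/shop"))  # False
```

In a real crawler, the parsed rules would be cached per host, together with a per-host request budget to stay within the "few requests per minute" rule.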

46 3.5.3 Distributed Data Management Google File System (GFS): maintains file systems with more than 1 TB: very large files (>> size of a disk) can be managed; fault tolerance against crashes of machines and hard disks; partitioning of files to allow for massive parallelization. Google implemented a two-dimensional data management: data are partitioned along groups of documents, the so-called shards; each shard is stored on an arbitrary number of machines. This increases fault tolerance (important because commodity PCs fail often), and replication distributes the load among different machines. Page 3-46

47 Execute as much in parallel as possible: partitioning and replication provide parallelization inside a single query evaluation and across all concurrent query evaluations. The scalability of Google is available in two dimensions: partitioning enables support for growth in the number of documents; replication enables support for growth in the number of concurrent queries. Page 3-47

48 Query evaluation with Google for the query "cats and dogs": the DNS resolver distributes the load geographically across data centers; a Google Entry System forwards the query to a Google Web Server, which consults the Spell Checker and the Ad Server, evaluates the query on the Index Servers, and fetches the result snippets from the Document Servers. Page 3-48

49 3.5.5 Cluster Management Google: >1,000,000 PCs deployed in numerous data centers. A PC (or any of its components) fails at least once a year; with 1,000,000 PCs this means about 3,000 failures per day. Data (index, document cache) has to be refreshed at regular intervals: Google, for instance, refreshed the index once every month, while freshbot refreshes the index once a day. >100 TB of data are not distributed instantly; data has to be exchanged step by step, and Google must still answer queries in the meantime, using either the old or the new index data. Google's software is constantly improved; software distribution must be fully automated and must not result in any downtime. Page 3-49

50 Literature and Links Google: S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7, 1998. L. Page et al., The PageRank Citation Ranking: Bringing Order to the Web, work in progress. Luiz Andre Barroso, Jeffrey Dean, Urs Hölzle, Web Search for a Planet: The Google Cluster Architecture, April 2003 (IEEE Computer Society). Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, The Google File System, in: Proceedings of SOSP '03, October 19-22, 2003, Bolton Landing, New York, USA. What's Related: Alexa Research ("What's Related"); Jeffrey Dean, Monika R. Henzinger, Finding Related Web Pages in the World Wide Web, Proceedings of the 8th International World Wide Web Conference (WWW8), 1999. Page 3-50

51 Literature and Links (2) Size of the Internet: [Bharat98] Krishna Bharat, Andrei Broder, A technique for measuring the relative size and overlap of public Web search engines, WWW7, 1998. [Giles99] Steve Lawrence, Lee Giles, "Accessibility of information on the web", Nature, Vol. 400, 1999. [SEW] SearchEngineWatch. [BP] BrightPlanet study. Internet Domain Survey. Page 3-51


More information

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per Information Retrieval Web Search Algoritmi per IR Web Search Goal of a Search Engine Retrieve docs that are relevant for the user query Doc: file word or pdf, web page, email, blog, e-book,... Query: paradigm bag of words Relevant?!?

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling

More information

SEARCH ENGINE INSIDE OUT

SEARCH ENGINE INSIDE OUT SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454

Administrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454 Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Today we show how a search engine works

Today we show how a search engine works How Search Engines Work Today we show how a search engine works What happens when a searcher enters keywords What was performed well in advance Also explain (briefly) how paid results are chosen If we

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search

Logistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects

More information

Analytical survey of Web Page Rank Algorithm

Analytical survey of Web Page Rank Algorithm Analytical survey of Web Page Rank Algorithm Mrs.M.Usha 1, Dr.N.Nagadeepa 2 Research Scholar, Bharathiyar University,Coimbatore 1 Associate Professor, Jairams Arts and Science College, Karur 2 ABSTRACT

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Multimedia Retrieval. Chapter 3: Web Retrieval. Dr. Roger Weber, Computer Science / / 2018

Multimedia Retrieval. Chapter 3: Web Retrieval. Dr. Roger Weber, Computer Science / / 2018 omputer Science / 15731-01 / 2018 Multimedia Retrieval hapter 3: Web Retrieval r. Roger Weber, roger.weber@ubs.com 3.1 Motivation 3.2 Ranking in Web Retrieval 3.3 Link nalysis 3.4 Literature and Links

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.

Running Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez. Running Head: 1 How a Search Engine Works Sara Davis INFO 4206.001 Spring 2016 Erika Gutierrez May 1, 2016 2 Search engines come in many forms and types, but they all follow three basic steps: crawling,

More information

CS 345A Data Mining Lecture 1. Introduction to Web Mining

CS 345A Data Mining Lecture 1. Introduction to Web Mining CS 345A Data Mining Lecture 1 Introduction to Web Mining What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns Web Mining v. Data Mining Structure (or lack of

More information

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Reading Time: A Method for Improving the Ranking Scores of Web Pages Reading Time: A Method for Improving the Ranking Scores of Web Pages Shweta Agarwal Asst. Prof., CS&IT Deptt. MIT, Moradabad, U.P. India Bharat Bhushan Agarwal Asst. Prof., CS&IT Deptt. IFTM, Moradabad,

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

An Adaptive Approach in Web Search Algorithm

An Adaptive Approach in Web Search Algorithm International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 15 (2014), pp. 1575-1581 International Research Publications House http://www. irphouse.com An Adaptive Approach

More information

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next

More information

CS6200 Information Retreival. Crawling. June 10, 2015

CS6200 Information Retreival. Crawling. June 10, 2015 CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on

More information

The Topic Specific Search Engine

The Topic Specific Search Engine The Topic Specific Search Engine Benjamin Stopford 1 st Jan 2006 Version 0.1 Overview This paper presents a model for creating an accurate topic specific search engine through a focussed (vertical)

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted

More information

DATA MINING - 1DL105, 1DL111

DATA MINING - 1DL105, 1DL111 1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database

More information

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai. UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index

More information

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search

More information

Information Retrieval. Lecture 4: Search engines and linkage algorithms

Information Retrieval. Lecture 4: Search engines and linkage algorithms Information Retrieval Lecture 4: Search engines and linkage algorithms Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk Today 2

More information

A Survey of Google's PageRank

A Survey of Google's PageRank http://pr.efactory.de/ A Survey of Google's PageRank Within the past few years, Google has become the far most utilized search engine worldwide. A decisive factor therefore was, besides high performance

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search

Index Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

Google Scale Data Management

Google Scale Data Management Google Scale Data Management The slides are based on the slides made by Prof. K. Selcuk Candan, which is partially based on slides by Qing Li Google (..a course on that??) 2 1 Google (..a course on that??)

More information

Welcome to the class of Web Information Retrieval!

Welcome to the class of Web Information Retrieval! Welcome to the class of Web Information Retrieval! Tee Time Topic Augmented Reality and Google Glass By Ali Abbasi Challenges in Web Search Engines Min ZHANG z-m@tsinghua.edu.cn April 13, 2012 Challenges

More information

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Efficient query processing

Efficient query processing Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17 Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

Information Retrieval. Lecture 10 - Web crawling

Information Retrieval. Lecture 10 - Web crawling Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the

More information

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields Informa/on Retrieval CISC437/637, Lecture #23 Ben CartereAe Copyright Ben CartereAe 1 Text Search Consider a database consis/ng of long textual informa/on fields News ar/cles, patents, web pages, books,

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0

Website Name. Project Code: # SEO Recommendations Report. Version: 1.0 Website Name Project Code: #10001 Version: 1.0 DocID: SEO/site/rec Issue Date: DD-MM-YYYY Prepared By: - Owned By: Rave Infosys Reviewed By: - Approved By: - 3111 N University Dr. #604 Coral Springs FL

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Web Crawling Instructor: Rada Mihalcea (some of these slides were adapted from Ray Mooney s IR course at UT Austin) The Web by the Numbers Web servers 634 million Users

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com How does Google know about the Web? Inverted Index: Example 1 Fruitvale Station is a 2013

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

The PageRank Citation Ranking: Bringing Order to the Web

The PageRank Citation Ranking: Bringing Order to the Web The PageRank Citation Ranking: Bringing Order to the Web Marlon Dias msdias@dcc.ufmg.br Information Retrieval DCC/UFMG - 2017 Introduction Paper: The PageRank Citation Ranking: Bringing Order to the Web,

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

Introduction to Information Retrieval and Anatomy of Google. Information Retrieval Introduction

Introduction to Information Retrieval and Anatomy of Google. Information Retrieval Introduction Introduction to Information Retrieval and Anatomy of Google Information Retrieval Introduction Earlier we discussed methods for string matching Appropriate for small documents that fit in memory available

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but

More information

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Information Retrieval II

Information Retrieval II Information Retrieval II David Hawking 30 Sep 2010 Machine Learning Summer School, ANU Session Outline Ranking documents in response to a query Measuring the quality of such rankings Case Study: Tuning

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

CSCI 5417 Information Retrieval Systems Jim Martin!
