Multimedia Retrieval. Chapter 3: Web Retrieval. Dr. Roger Weber, Computer Science / CS342 / Fall 2014
1 Multimedia Retrieval, Chapter 3: Web Retrieval. Dr. Roger Weber, roger.weber@credit-suisse.com
3.1 Motivation: The Problem of Web Retrieval
3.2 Size of the Internet and Coverage by Search Engines
3.3 Ordering of Documents
3.4 Context based Retrieval
3.5 Architecture of a Search Engine
2 3.1 Motivation: The Problem of Web Retrieval
Classical Retrieval vs. Web Retrieval:
- Collection: controlled set / uncontrolled, incomplete
- Size: small to large (1 MB - 20 GB [TREC]) / extremely large (the text documents alone comprise more than 200 GB)
- Documents, multimedia: homogeneous / heterogeneous (HTML, PDF, ASCII)
- Structure of documents: homogeneous / heterogeneous
- Links between documents: seldom (citations of other documents) / lots of links in documents
- Quality of documents: good to excellent / broad range of quality: poor grammar, wrong contents, incorrect, spamming, misspellings
- Queries: precise and more terms / short and imprecise, similarity search
- Results: small number of hits (<100) with good quality / large numbers of hits (>100,000)
3 The Internet grows rapidly
The Internet is growing at a rate of 14% a year; every 5.32 years, the number of domains doubles (see figures above). Google's index contains more than 35 billion pages. In 2008, Google software engineers announced that they had discovered one trillion unique URLs. But the web is even larger: many web sites address their content with an infinite number of URIs. Take Google itself as an example: if we assume a dictionary of 1 million terms, all combinations of two-term queries against Google yield a trillion unique URIs. Adding a third, a fourth, a fifth, and so on, term multiplies this number by 1 million each time.
4 The Problem of Ordering
Lots of documents fulfill the typically small queries (2-5 terms). Mostly, result sets contain more than 100,000 documents with an rsv > 0, but not all result documents are relevant.
Example: the query "ford" returns 911,000,000 hits with Google, and the 1st rank is the homepage of the car manufacturer Ford. How is that possible? Google is based on Boolean retrieval!
Search engines not only sort based on the rsv-value. Depending on the rsv-function, only pages would appear which contain terms with the same frequencies as the query, contain the query terms most often, or contain all the query terms.
Classical retrieval lacks a defense mechanism against spamming!
5 3.2 Size of the Internet and Coverage by Search Engines
How to compute the number of web servers connected to the Internet [Giles99]:
- Assumption: IP addresses of web servers are evenly distributed in the 32-bit address space.
- Approach: choose N random IP addresses and access the root web page of each server. Let M be the number of found pages (i.e., found web servers); then M/N is the density of the coverage in the IP address space.
- Giles [1999]: M/N ≈ 1/269. This leads to 2^32 * M/N ≈ 16.0 million web servers [Date: July 1999].
- Problem: this estimation also counts devices that are managed via a web front end, e.g., routers.
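The arithmetic of this estimate is easy to check; a short sketch (the density 1/269 is the value reported by Giles):

```python
# Back-of-envelope check of the estimate above: with a hit density of
# M/N ~ 1/269 in the 32-bit IPv4 address space, the number of web
# servers is 2^32 * M/N.
address_space = 2 ** 32   # number of possible IPv4 addresses
density = 1 / 269         # fraction of sampled IPs answering with a root page

servers = address_space * density
print(f"estimated web servers: {servers / 1e6:.1f} million")
```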
6 Estimating the number of web pages: overlap analysis [Bharat98]
- Assumption: search engines index a subspace of the web independently of each other, and each index subspace is a random sample of the web. Then p(A ∩ B) = p(A) * p(B).
- Approach: |A| and |B| are known for some search engines, with |A| = N * p(A), |B| = N * p(B), and |A ∩ B| = N * p(A ∩ B). Estimate the ratio |B| : |A ∩ B| with selected queries; then N = |A| * |B| / |A ∩ B|.
- Note: the assumption does not generally hold because search engines often start crawling at the same starting points (e.g., yahoo.com). Hence, the above estimation yields a lower bound for the real number of web pages.
7 Estimating the number of web pages (2)
Procedure (|A| and |B| are known for many search engines):
- Determine the document frequency for terms in a sufficiently large sample of web pages.
- Let l = 0; repeat k times:
  - Query search engine B with a random query and select an arbitrary page in the result.
  - Build a query with terms appearing in the selected document that have the smallest document frequencies (select several terms to steer the query result size).
  - Query the other search engine (A) with these terms. Due to the selected terms with small document frequencies, the result set is small. Increase l if the page is also found with search engine A.
- Estimate the ratio |B| : |A ∩ B| with k : l.
- Determine N_AB = |A| * |B| / |A ∩ B| = |A| * k / l.
- Compute N_AB for different combinations of search engines and estimate the total number of web pages by the average over all N_AB values.
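A minimal sketch of this capture-recapture estimate; all numbers in the example call are hypothetical:

```python
def estimate_web_size(size_a, size_b, k, l):
    """Estimate the total number of web pages from the overlap of two
    search-engine indexes (the Bharat/Broder procedure above).

    size_a, size_b: reported index sizes |A| and |B| of the two engines
    k: number of sampled pages drawn from engine B
    l: how many of those samples were also found in engine A
    """
    if l == 0:
        raise ValueError("no overlap observed; sample more pages")
    overlap = size_b * l / k           # estimated |A intersect B|
    return size_a * size_b / overlap   # N = |A| * |B| / |A intersect B|

# Hypothetical numbers: engines with 100M and 80M pages; of 200 samples
# drawn from B, 25 were also found in A.
print(estimate_web_size(100e6, 80e6, 200, 25))
```

Note that the result equals |A| * k / l, as stated on the slide, since |B| cancels out.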
8 Some key figures
Dec 1997: > 320M pages. Some search engines indexed about 1/3 of the Web; the biggest 6 engines together indexed around 60% of the Web.
Feb 1999: 800M pages. Some engines indexed 16% of the Web; the largest 11 engines together indexed around 42% of the Web. 2.8M public web servers, 16M web servers in total. A page had an average size of 18.7 KB (excluding embedded images); 14 TBytes of data.
Jan 2000: > 1B pages. Coverage of some search engines between 10%-15%. 6.4M public web servers (with 2.2M mirrored servers); about 60% Apache, 25% MS-IIS.
End of 2000: 3-5B pages. Coverage of search engines still between 10%-15%; Google outperforms all others in terms of coverage: 30% (this also includes pages that Google's crawlers only know from references). BrightPlanet distinguishes between the surface and the deep Web: the "surface web" subsumes all public pages, while the deep web also subsumes private pages and dynamic pages (phone book, e-banking, etc.); 800B pages in the deep Web, more than 8000 TBytes of data.
2005: 8B pages. The deep Web has further grown; accurate estimates are no longer available.
2010: Google has >35B pages and reports to have seen more than 1 trillion unique URIs.
9 Last reported index sizes of search engines [Source: SearchEngineWatch, 2005]
- Google: 8.1 billion pages, page depth 101 KB
- MSN: 5.0 billion pages, page depth 150 KB
- Yahoo: 4.2 billion pages, page depth 500 KB
- Ask Jeeves: 2.5 billion pages, page depth 101 KB+
Page depth: maximum length of the indexed part of a found document. Google, for instance, only indexes the first 101 kilobytes of a page.
More recent figures: Google's index had 8B entries at the end of 2005 (according to its homepage); today, no index sizes are published any more (-> end of the search index size war with Yahoo).
Basic idea for more current figures: use a keyword like "the" with a known frequency of appearing on any web page (67%). 2010: 23 billion hits (2006: 14.4 billion), i.e., an index size of about 34 billion pages (2006: 22 billion).
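The keyword trick as a two-line computation, using the 2010 figures from above:

```python
# If "the" occurs on roughly 67% of all web pages and a query for it
# reports 23 billion hits (2010), the index holds about hits / 0.67 pages.
hits = 23e9
coverage = 0.67
index_size = hits / coverage
print(f"estimated index size: {index_size / 1e9:.0f} billion pages")
```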
10 3.3 Ordering of Documents
Ranking of Google (as far as documented): even though it is based on Boolean retrieval, it has good precision. Today, most search engines use similar methods, but the details are kept secret.
Ranking already starts with extracting the right information from documents. Google extracts the positions of terms in a document, the relative font size, the visual attributes (bold, italic), and the context of the page (terms in URL, title, meta-tags, text of references). The text of references (e.g., between <A>...</A>) is also assigned to the referenced document.
The ranking consists of the following factors:
- Proximity of terms, i.e., distance between occurrences of distinct query terms
- Positions in the document (URL, text of references, title, meta-tag, body), size of font, and visual attributes
- PageRank
- Further criteria (advertisements, pushed content)
11 3.3.1 Proximity of Terms
Query: "White House"
Document 1: "the white car stands in front of the house" (-> not relevant)
Document 2: "the president entered the White House" (-> relevant)
The closer the query terms are, the more relevant the text is.
Implementation in the Google prototype: for each position pair, a proximity value is assigned (10 values). The frequencies of these proximity values form the proximity vector; multiplying this vector with a weighting vector (inner vector product) yields the overall proximity value for the document.
12 Example: hit list["white"] = { 1, 81, 156 }, hit list["house"] = { 2, 82, 115, 157 }
Position pairs: {(1,2), (1,82), (1,115), (1,157), (81,2), (81,82), (81,115), ...}
The pairs are mapped to proximity classes; for instance, (1,2) and (81,82) are adjacent, while (1,157) is far away. The resulting frequencies per proximity class are:
- 1 (adjacent): 3
- 2 (close): 0
- 3: 0
- 4: 1
- 5 (nearby): 1
- 6: 0
- 7: 0
- 8 (distant): 1
- 9: 2
- 10 (far away): 3
This yields the proximity vector p = [3, 0, 0, 1, 1, 0, 0, 1, 2, 3]. With the weighting vector w = [1.0, 0.9, 0.8, 0.7, ..., 0.1], the overall proximity of the document is p^T w = 5.6.
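A sketch of the proximity computation. Note that the mapping of position distances onto the 10 proximity classes in the Google prototype is non-linear and not documented, so the simple distance clamping below is an assumption for illustration and does not reproduce the exact vector from the example above:

```python
def proximity_vector(hits_a, hits_b, n_buckets=10):
    """Build a proximity histogram over all position pairs of two terms.
    Assumption: the absolute distance is clamped into n_buckets classes
    (class 1 = adjacent, class n_buckets = far away); the real bucketing
    is undocumented and non-linear."""
    vec = [0] * n_buckets
    for a in hits_a:
        for b in hits_b:
            bucket = min(abs(a - b), n_buckets)  # 1 .. n_buckets
            vec[bucket - 1] += 1
    return vec

def proximity_score(vec, weights):
    # inner product of the proximity histogram and the weighting vector
    return sum(v * w for v, w in zip(vec, weights))

white = [1, 81, 156]
house = [2, 82, 115, 157]
vec = proximity_vector(white, house)
weights = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
print(vec, proximity_score(vec, weights))
```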
13 3.3.2 Positions in the Document, Size of Font, and Visual Attributes
Queries typically aim at the title (heading) of a document, e.g., "White House" instead of "Central Executive Place". Often users look for brands, persons, or firms.
Text references in external documents describe the contents very accurately. E.g., the query "eth lausanne" is answered by Google with the home page of EPFL, although that page does not contain the term "ETH".
Conclusion: documents containing the query terms in the title, with special visual attributes (large size, heading, bold), or in reference texts linking to that document appear more relevant than documents that contain the terms just in the body ("I work for ETH Lausanne").
Google counts the occurrences of terms along the dimensions described above, multiplies the frequencies with well-chosen weights, and sums these values to a second relevance value for the document. Further, it contains mechanisms to cut off spamming.
14 Implementation in the Google prototype: for each position type (<TITLE>, <META>, <P>, <B>, <I>, <H1>, <H2>, reference text), Google keeps the frequency, a limited frequency, and a weight (the concrete values are not reproduced here).
Impact: Google is able to find pages, given a brand, name, or firm, that are highly relevant to the user.
Spamming: if a page contains a term too often, the page gets ignored (e.g., if a term contributes more than 10% of the text, then it's spam).
15 3.3.3 PageRank
Which page better suits the query "Uni Basel", and why?
16 A Preliminary Model (not yet PageRank)
Idea: count the number of incoming links; the more incoming links, the more important the page, because it is more likely that a surfer lands on it. Example with incoming-link counts A=1, B=1, C=6, D=3: C is the most important page, before D, A, and B.
Problems:
- Not every page is equally important, and thus neither is every link.
- Spamming.
17 Computing the PageRank of a Page
Improved idea: a random surfer clicks with probability p an outgoing link on the current page; with probability (1-p), the surfer jumps to an arbitrary page (bookmark, URL). The PageRank of a page is the probability that a random surfer lands on that page (after a number of steps).
Notation:
- A: an arbitrary page
- L(A): set of pages which have a reference to A
- N(A): number of outgoing links of page A
- PR(A): PageRank of page A
- p: probability that a surfer follows an outgoing link (p ∈ [0,1])
Definition of PageRank:
PR(A) = (1 - p) + p * Σ_{B ∈ L(A)} PR(B) / N(B)
18 Intuitive Explanation of the Formula
The value of a link is given by the PageRank of the source document divided by the number of outgoing links on that page; this simulates the freedom of the random surfer to follow any link on that page. The term (1-p) + p*... denotes the freedom of the surfer to follow a link with probability p or to jump to an arbitrary page with probability 1-p.
Example (from the figure): A and C have the same PageRank although A has only one incoming link instead of two as C.
19 Computation
The formula is recursive! The PR() values can be computed by a fixed-point iteration; experiments showed that the computation cost is minimal compared to the crawling effort (only a few iterations required).
Approach:
1. Assign arbitrary initial values PR(A) for all documents A.
2. Compute PR'(A) (left-hand side of the equation) according to the formula above for all documents A.
3. If |PR'(A) - PR(A)| becomes sufficiently small, then PR(A) = PR'(A) is the solution; otherwise let PR(A) = PR'(A) and repeat from step 2.
Solving the fixed-point iteration takes only a few iterations (<100) and the computational effort is small (several hours).
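The fixed-point iteration can be sketched directly from the formula; this uses the un-normalized form from the slide, and the three-page example graph is made up:

```python
def pagerank(links, p=0.85, iters=100, tol=1e-10):
    """Fixed-point iteration for PR(A) = (1-p) + p * sum_{B->A} PR(B)/N(B).

    links: dict mapping each page to the list of pages it links to;
    every page is assumed to have at least one outgoing link here."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}   # arbitrary initial values
    for _ in range(iters):
        new = {}
        for page in pages:
            incoming = sum(pr[b] / len(links[b])
                           for b in pages if page in links[b])
            new[page] = (1 - p) + p * incoming
        done = max(abs(new[q] - pr[q]) for q in pages) < tol
        pr = new
        if done:
            break
    return pr

# Tiny made-up graph: B and C both link to A, A links back to B.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
print(pagerank(graph))
```

A page without incoming links (C) ends up with the minimum value 1-p = 0.15, as the formula predicts.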
20 Application
PageRank derives a total ordering for web pages independent of the current query and its terms. Google uses PageRank in combination with other criteria.
PageRank is robust against spamming, i.e., against manipulations to push the PageRank of a page: even if many links point to a page, this does not necessarily imply its importance and a high PageRank.
Ordering documents only based on PageRank would be fatal: let A be the document with the highest PageRank; all queries with terms contained in A would be answered with A as the best document, even though more relevant documents exist.
21 3.3.4 Other Criteria to Order Documents
Bought ranking positions: a number of search engines get money for placing pages at top positions for certain query terms, e.g., advertisements and RealName.
22 Length of URL
A query such as "ford" may be answered with documents at long, deeply nested URLs (e.g., ...ces&level2=rentalsfromdealers). Search engines rank documents with short URLs at higher positions, as they are more likely to be homepages/entry pages for the information need.
User feedback: direct hit used the feedback of users to rank documents. Internally, relevance based on feedback is stored similarly to the PageRank: if a document in the result list is clicked and visited, its relevance value is increased, otherwise decreased. Google also experimented with feedback.
23 3.3.5 Overall Ranking
All search engines use and combine different criteria. In Google, the most prominent ones are:
- Proximity of terms
- Relevance values based on position in document, font size, and visual attributes
- PageRank
The total relevance for a document results from summing up the relevance values of the individual criteria (with appropriate weighting). How to obtain those weights and which criteria to apply, however, remains a secret of the search engine providers.
24 3.4 Context based Retrieval
Observations: the web contains many pages addressing a specialized topic (e.g., Star Wars). For example, a directory page may list many web sites devoted to the "Star Wars" movies (all sites cover the same topic), or list the web sites of different car brands and car manufacturers (all sites cover similar/related topics). Two techniques exploit this: What's Related, and Hubs and Authorities.
Consequently: improve search results by explicitly taking context information into account (as described in the examples above). Similarly: determine the context of the query, possibly by asking the user for more information (Teoma, AskJeeves, Gigablast).
25 3.4.1 Hubs and Authorities
A page denotes a so-called hub for a query Q if it contains many links to pages that are relevant to the query. A page is a so-called authority for a query Q if it is relevant for Q, i.e., it provides the information needed to answer the information need. Typically, one can identify and distinguish hubs and authorities based on their link structure: a hub points to pages relevant to Q, while an authority is such a page.
26 Additionally, we observe: a good hub points to many good authorities, and a good authority is referenced by many good hubs.
Based on hub-authority relationships, a search engine becomes able to identify relevant documents which do not contain the query terms. Example: a query such as "looking for car manufacturers" will not lead a user to the homepages of Honda, VW, or Ford. With a hubs/authorities analysis, it becomes possible to answer even such queries directly.
Idea of Kleinberg [1997]: the HITS algorithm. The Web is a graph G = (V,E) with the vertices V denoting the set of documents and the edges E denoting the links (from source to destination). Let (p,q) ∈ E; then document p references document q.
Step 1: for a query Q, determine the first t (e.g., t=200) documents with the help of a search engine. The set obtained in this step is called the root set. For this set, we observe that it contains many relevant documents, but that it does not contain all the good hubs and authorities.
27 Step 2: extend the root set with documents referenced by a document in the root set and with documents pointing to documents in the root set. The resulting set of documents denotes the so-called base set. To limit the size of the base set, one can restrict the number of documents added to d (e.g., d=50) documents per element of the root set. Links within the same domain are removed, as they frequently only serve as navigation aids.
28 Step 3: compute the hub values h(p) and authority values a(p) for each document p. Thereby, the numbers of incoming and outgoing links play a central role in the computation.
A simple solution could be:
a(p) = Σ_{(q,p) ∈ E} 1,  h(p) = Σ_{(p,q) ∈ E} 1
A better idea: a good hub references many good authorities, and a good authority is linked by many good hubs. a(p) and h(p) are always normalized:
Σ_{p ∈ V} a(p)^2 = 1,  Σ_{p ∈ V} h(p)^2 = 1
Initialization: all pages start with the same values for a(p) and h(p).
Iteration: the new values are computed based on the old ones:
a(p) = Σ_{(q,p) ∈ E} h(q),  h(p) = Σ_{(p,q) ∈ E} a(q)
Repeat the iteration (including normalization) until convergence is reached. Note: normalization of the new values must be established after each iteration step.
29 Step 4: compute the result:
- if the user asks for overview pages, return the k documents having the largest hub values h(p)
- if the user asks for content pages, return the k documents having the largest authority values a(p)
Notes: the user is empowered to choose between hubs and authorities. The iterative algorithm takes only a few steps (10-20) to determine the values a(p) and h(p).
Implementation: a simple procedure is based on the availability of a search engine. Evaluating the query with the search engine determines the root set. Determine referenced documents by downloading and parsing all documents in the root set; determine incoming links and their sources with the help of queries of the type "link:u", with u denoting the URL of a document in the root set.
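Steps 3 and 4 can be sketched on a toy graph (the edges are hypothetical); the new values are computed from the old ones, then normalized, exactly as described above:

```python
import math

def hits(edges, iters=20):
    """HITS iteration: a(p) = sum over (q,p) of h(q), h(p) = sum over
    (p,q) of a(q), normalized so that sum a(p)^2 = sum h(p)^2 = 1."""
    nodes = {n for edge in edges for n in edge}
    a = {n: 1.0 for n in nodes}
    h = {n: 1.0 for n in nodes}
    for _ in range(iters):
        new_a = {p: sum(h[q] for (q, r) in edges if r == p) for p in nodes}
        new_h = {p: sum(a[q] for (r, q) in edges if r == p) for p in nodes}
        na = math.sqrt(sum(v * v for v in new_a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in new_h.values())) or 1.0
        a = {p: v / na for p, v in new_a.items()}
        h = {p: v / nh for p, v in new_h.items()}
    return a, h

# Hypothetical base set: "hub" links to three candidate authorities,
# and "hub2" adds a second vote for "auth1".
edges = [("hub", "auth1"), ("hub", "auth2"), ("hub", "auth3"),
         ("hub2", "auth1")]
a, h = hits(edges)
print(max(a, key=a.get), max(h, key=h.get))
```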
30 Extensions of HITS (Henzinger, 1998)
The HITS algorithm suffers from three fundamental problems:
1. If all pages in a domain reference the same external page, that page becomes too strong an authority. Similarly, if a page links to many different pages in the same domain, that page becomes too strong a hub.
2. Automatically established links, e.g., advertisements, banners, or links to the provider/host/designer of a web site, become wrong authorities.
3. Queries such as "jaguar car" tend to lead to pages about cars in general and to hubs containing links to different manufacturers. More precisely, the more frequent term "car" dominates the more infrequent term "jaguar".
31 Improvements for Problem 1: the same author (= same domain) has only one "vote" for an external page. Similarly, a document has only one "vote" when referencing documents in the same domain.
- If k pages p_i of the same domain reference a document q, we weight the links with a value of aw(p_i, q) = 1/k for each edge (p_i, q).
- If a page p references l pages q_i in the same domain, we weight the links with a value of hw(p, q_i) = 1/l for each edge (p, q_i).
Adjust the iteration step with these weights:
a(p) = Σ_{(q,p) ∈ E} aw(q,p) * h(q),  h(p) = Σ_{(p,q) ∈ E} hw(p,q) * a(q)
32 Problems 2 and 3: to deal with these problems, we eliminate nodes from the graph that obviously do not match the query or its relevant documents. To this end, we perform an artificial query against the base set:
- The query is described by the first few terms of all documents in the root set (e.g., the first 1000 terms of all documents).
- Query and documents are mapped to vectors according to the tf-idf weighting scheme of vector space retrieval.
- The similarity s(p) between query and document p is given by the cosine measure.
- For a given threshold t, eliminate all nodes/documents in the graph with s(p) < t. We obtain a good threshold with one of the following methods: t = median of all s(p) values; t = median of the s(p) values of documents in the root set; t = 1/10 * max s(p).
The similarity values also enter the iteration:
a(p) = Σ_{(q,p) ∈ E} aw(q,p) * s(q) * h(q),  h(p) = Σ_{(p,q) ∈ E} hw(p,q) * s(q) * a(q)
33 Discussion: the HITS algorithm outperforms conventional web search results, and the extensions of the HITS algorithm further improve precision by more than 45%. The main issues with the HITS algorithm are large evaluation costs and long retrieval times (30 seconds up to several minutes).
In contrast to PageRank, ranking with HITS depends on the query; PageRank gives a total order of the documents for all queries.
34 3.4.2 What's Related
The basic idea of Alexa's What's Related was to identify related documents for existing documents. The definition of "related", however, was not based on similarity between documents but on similarity between the topics addressed in these documents, potentially from different angles: a related page for a given site would be another site covering the same topic.
Analogously to What's Related, Google provides "Similar Pages" for its result entries. The approaches to compute the relationships differ significantly between the two systems:
- Alexa used crawlers and data mining tools to determine related pages. Moreover, it spied on the surf patterns of the users (which pages has the user visited; which search result did the user investigate in more detail; ...).
- Google relies entirely on the analysis of the link structure of web pages to derive the related documents. Two approaches were published by Google experts.
35 Companion Algorithm (Dean, Henzinger, 1999)
The more complex approach is based on the extended HITS algorithm: given a URL u, the algorithm must find documents related to the page u.
Notation: if page w references page v, we call w a parent page of v, and v a child page of w.
Step 1: build a directed graph in the "neighborhood" of u. That graph contains the following nodes: u; at most b parent pages of u and, for each parent page, at most bf of its child pages; at most f child pages of u and, for each child page, at most fb of its parent pages.
Step 2: merge duplicates and "near-duplicates". Two documents are "near-duplicates" if they contain more than 10 links each and 95% of the contained links appear in both documents.
36 Step 3: assign weights to the edges in the graph, similar to the extension of the HITS algorithm: if k edges of documents within the same domain point to the same external page, these edges obtain a weight of 1/k; and if a document contains l edges to pages within the same domain, each of these edges obtains a weight of 1/l.
Step 4: determine hubs and authorities for the nodes of the graph according to the extension of the HITS algorithm (but without similarity weighting), i.e.:
a(p) = Σ_{(q,p) ∈ E} aw(q,p) * h(q),  h(p) = Σ_{(p,q) ∈ E} hw(p,q) * a(q)
Step 5: determine the result: the pages with the highest authority weights (except for u) are the so-called related pages to u.
37 Co-citation Algorithm (Dean, Henzinger, 1999)
This simpler approach counts how often a page u is referenced together with a page q. The page with the most frequent co-citations is the so-called most related page to u.
Step 1: determine at most b parent pages of u.
Step 2: determine for each parent page at most bf child pages whose links are close to the link to u. All these pages are siblings of u.
Step 3: determine the pages q_i that are referenced most frequently together with u.
Step 4: if steps 1-3 result in fewer than 15 co-citations with a frequency of 2 or higher, repeat the search with successively shorter prefixes of the URL of u.
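Steps 1-3 can be sketched with a simple counter; the parent and sibling lists below are hypothetical stand-ins for what a crawler or "link:" queries would return:

```python
from collections import Counter

def most_related(u, parents, children_of):
    """Co-citation sketch: rank pages by how often they are referenced
    together with u.

    parents: pages linking to u (at most b of them)
    children_of: for each parent, its child pages near the link to u
    """
    cocitations = Counter()
    for parent in parents:
        for sibling in children_of[parent]:
            if sibling != u:
                cocitations[sibling] += 1
    return cocitations.most_common()

parents = ["p1", "p2", "p3"]
children = {"p1": ["u", "a", "b"], "p2": ["u", "a"], "p3": ["u", "b", "a"]}
print(most_related("u", parents, children))
```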
38 Discussion: the approaches of Dean and Henzinger work much better than Alexa on average. Due to the non-existent information about how Alexa works, no qualitative or quantitative assessments are possible. Although Henzinger worked for Google, it is not clear which algorithms Google uses to find related pages.
39 3.5 Architecture of a Search Engine
A search engine consists of the following main components:
- Crawler/Repository
- Feature Extractor
- Indexer
- Sorter
- Feedback Component
- User Interface
40 Google architecture (Brin/Page), main components: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Barrels, Sorter, Lexicon, PageRank, Searcher.
41 3.5.1 Main Problem: Scalability
The data problem: search engines must deal with enormous data sizes. Assumption: the size of a web page is 10 KB, and extraction returns 1 KB of data to index. Google's search index contains at least 35B pages (2010): the cache size for documents is 350 TB, and the size of the inverted lists is 35 TB. Google uses a large number of commodity PCs, but the space per PC is only about 100 GB.
Problems: how to search through 35 TB if a single machine only stores 100 GB? How to organize such a cluster given the frequent updates and the enormous search frequencies? How do you assign identifiers for 35B pages?
42 The retrieval problem: there is no service window to update the software or the data; users search all the time.
Google: 250M queries per day = 10M queries per hour = 3000 queries per second! Daily peaks may be much higher.
35 TB of index data means that a single entry (= term) in the inverted lists can consume more than 5 GB (e.g., the term "house" returns 3B hits with Google). A typical IDE disk performs at 50 MB/s: how long does it take to scan 5 GB of data? How do you reduce search times given the high I/O load (it is not meaningful to cache all data in memory)?
The query "house" returns 3B hits, which have to be sorted according to their relevance. With one term this can be pre-computed, but with several terms? An average PC needs quite some time to sort 3B hits even if the scores are already computed; the computation of the scores takes yet another "eternity". How can we decrease search times? By parallelization? You don't need the entire list, only the top hundreds: does that help?
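The disk arithmetic behind these questions, as a quick computation:

```python
# Scanning one 5 GB inverted list sequentially from a single 50 MB/s disk:
list_mb = 5 * 1024          # 5 GB expressed in MB
disk_mb_per_s = 50          # sequential throughput of a typical IDE disk
scan_s = list_mb / disk_mb_per_s
print(f"full scan on one disk: {scan_s:.1f} s")

# Partitioning the list over, say, 1000 machines that scan in parallel
# brings this down to roughly a tenth of a second per machine.
machines = 1000
print(f"per machine: {scan_s / machines * 1000:.1f} ms")
```

At 3000 queries per second, even the parallelized scan only works because replication spreads the concurrent load across further machines.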
43 The crawling problem: how do you fetch 35B pages in a reasonable time?
DNS lookups are expensive; a central DNS server would not allow you to scale enough (the maximum number of downloads per second is limited).
Google exchanged/exchanges its index once every month; this sets an upper limit on time: downloading 35B pages in a month means roughly 13,500 pages per second, and at 10 KB per page this is about 135 MB/s, so you most probably need several gigabit connections to the Internet. And: servers on the Internet have different response times.
Incremental crawling: important pages such as newspapers have to be read daily to support queries on "hot topics"; news with an age of 1-2 months is not interesting any more. An incremental crawler has to update the index of important pages that change frequently; Google's incremental crawler is called freshbot. How do you select the important pages, and how can you update the inverted lists while 3000 concurrent queries run? Do you require ACID properties?
44 Google: success due to addressing the scalability problem from the beginning. The following considerations are based on papers and presentations by Google, and on speculations in web search forums.
45 3.5.2 Crawling: tricks and musts
DNS lookup problem: keep a local DNS cache on each crawl server.
Be nice to web servers! Follow these rules:
- Only a few requests per minute to the same server!
- Do not follow cgi-links, as executing cgi scripts usually is expensive; moreover, you easily may activate/interact with/change the page (order goods, games, forums, ...).
- Read and obey robots.txt (e.g., "Disallow: /cgi-bin/").
- Filter "critical" URIs.
Note: there are many more pages than you can ever crawl, so it does not matter if you miss some of them; but do not miss the important ones!
A single machine is not sufficient for crawling! A single connection to the Internet is not sufficient!
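Obeying robots.txt does not have to be built from scratch; Python's standard library already parses it. A minimal sketch (the crawler name and URLs are made up; a real crawler would cache the parsed robots.txt per host and rate-limit per server):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt that disallows the cgi-bin directory, as in the
# example above, and check URLs before fetching them.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])
print(rp.can_fetch("MyCrawler", "http://example.com/cgi-bin/order"))
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))
```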
46 3.5.3 Distributed Data Management
Google File System (GFS) maintains file systems with very large files (>> size of a disk; more than 1 TB): it provides fault tolerance against crashes of machines and hard disks, and partitioning of files to allow for massive parallelization.
Google implemented a two-dimensional data management:
- Data are partitioned along groups of documents, the so-called shards.
- Each shard is stored on an arbitrary number of machines: this increases fault tolerance (the data stays highly available even on commodity PCs), and replication distributes the load among different machines.
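A hypothetical sketch of the two-dimensional scheme: documents are hash-partitioned into shards, and each shard is mapped to several replica machines. The shard count, replica count, and machine naming are invented for illustration; Google's actual assignment scheme is not public:

```python
import hashlib

def assign(doc_url, n_shards, n_replicas):
    """Map a document to a shard (partitioning dimension) and the shard
    to its replica machines (replication dimension)."""
    digest = hashlib.md5(doc_url.encode()).hexdigest()
    shard = int(digest, 16) % n_shards   # stable hash partitioning
    replicas = [f"shard{shard:02d}-replica{r}" for r in range(n_replicas)]
    return shard, replicas

shard, replicas = assign("http://example.com/page.html", 64, 3)
print(shard, replicas)
```

A query touches one replica per shard (parallelism inside one query), while different concurrent queries can hit different replicas of the same shard (parallelism across queries).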
47 Execute as parallel as possible: partitioning and replication provide parallelization inside a single query evaluation and across all concurrent query evaluations. The scalability of Google is thus available in two dimensions: partitioning supports growth in the number of documents; replication supports growth in the number of concurrent queries.
48 Query evaluation with Google (query: "cats and dogs"): the DNS resolver distributes the load geographically across data centers. A Google entry system accepts the query and passes it to a Google web server, which consults the spell checker and the ad server, fans the query out to the index servers, and finally fetches the result snippets from the document servers.
49 3.5.5 Cluster Management
Google: >1,000,000 PCs deployed in numerous data centers. A PC (or any of its components) fails at least once a year; with 1,000,000 PCs, that amounts to about 3,000 failures per day.
Data (index, document cache) has to be refreshed in regular intervals. Google, for instance, refreshed the index once every month; freshbot refreshes parts of the index once a day. >100 TB of data cannot be distributed instantly; the data has to be exchanged step by step. Google must still answer queries, but may use old or new index data.
Google's software is constantly improved. Software distribution must be fully automated and must not result in any downtime.
50 Literature and Links
Google:
- S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7, 1998.
- L. Page et al.: The PageRank Citation Ranking: Bringing Order to the Web, work in progress.
- Luiz Andre Barroso, Jeffrey Dean, Urs Hölzle: Web Search for a Planet: The Google Cluster Architecture, April 2003 (IEEE Computer Society).
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. In: Proceedings of SOSP '03, October 19-22, 2003, Bolton Landing, New York, USA.
What's Related:
- Alexa Research ("What's Related").
- Jeffrey Dean, Monika R. Henzinger: Finding Related Web Pages in the World Wide Web. Proceedings of the 8th International World Wide Web Conference (WWW8), 1999.
51 Literature and Links (2)
Size of the Internet:
- [Bharat98] Krishna Bharat, Andrei Broder: A technique for measuring the relative size and overlap of public Web search engines, WWW7, 1998.
- [Giles99] Steve Lawrence, Lee Giles: "Accessibility of information on the web", Nature, Vol. 400, 1999.
- [SEW] SearchEngineWatch.
- [BP] BrightPlanet study.
- Internet Domain Survey.
More informationInternational Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine
International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains
More informationAnatomy of a search engine. Design criteria of a search engine Architecture Data structures
Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection
More informationInformation Retrieval May 15. Web retrieval
Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically
More information10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues
COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization
More informationLecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule
Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question
More informationWeighted Page Rank Algorithm Based on Number of Visits of Links of Web Page
International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 Web Search Engine Crawling Indexing Computing
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 12: Distributed Information Retrieval CS 347 Notes 12 2 CS 347 Notes 12 3 CS 347 Notes 12 4 CS 347 Notes 12 5 Web Search Engine Crawling
More informationSEARCH ENGINE INSIDE OUT
SEARCH ENGINE INSIDE OUT From Technical Views r86526020 r88526016 r88526028 b85506013 b85506010 April 11,2000 Outline Why Search Engine so important Search Engine Architecture Crawling Subsystem Indexing
More informationCHAPTER THREE INFORMATION RETRIEVAL SYSTEM
CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost
More informationAdministrivia. Crawlers: Nutch. Course Overview. Issues. Crawling Issues. Groups Formed Architecture Documents under Review Group Meetings CSE 454
Administrivia Crawlers: Nutch Groups Formed Architecture Documents under Review Group Meetings CSE 454 4/14/2005 12:54 PM 1 4/14/2005 12:54 PM 2 Info Extraction Course Overview Ecommerce Standard Web Search
More informationCOMP5331: Knowledge Discovery and Data Mining
COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank
More informationToday we show how a search engine works
How Search Engines Work Today we show how a search engine works What happens when a searcher enters keywords What was performed well in advance Also explain (briefly) how paid results are chosen If we
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationLogistics. CSE Case Studies. Indexing & Retrieval in Google. Review: AltaVista. BigTable. Index Stream Readers (ISRs) Advanced Search
CSE 454 - Case Studies Indexing & Retrieval in Google Some slides from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm Logistics For next class Read: How to implement PageRank Efficiently Projects
More informationAnalytical survey of Web Page Rank Algorithm
Analytical survey of Web Page Rank Algorithm Mrs.M.Usha 1, Dr.N.Nagadeepa 2 Research Scholar, Bharathiyar University,Coimbatore 1 Associate Professor, Jairams Arts and Science College, Karur 2 ABSTRACT
More informationSearching the Web What is this Page Known for? Luis De Alba
Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse
More informationMultimedia Retrieval. Chapter 3: Web Retrieval. Dr. Roger Weber, Computer Science / / 2018
omputer Science / 15731-01 / 2018 Multimedia Retrieval hapter 3: Web Retrieval r. Roger Weber, roger.weber@ubs.com 3.1 Motivation 3.2 Ranking in Web Retrieval 3.3 Link nalysis 3.4 Literature and Links
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability
More informationCS6200 Information Retreival. The WebGraph. July 13, 2015
CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects
More informationHome Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit
Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing
More informationRunning Head: HOW A SEARCH ENGINE WORKS 1. How a Search Engine Works. Sara Davis INFO Spring Erika Gutierrez.
Running Head: 1 How a Search Engine Works Sara Davis INFO 4206.001 Spring 2016 Erika Gutierrez May 1, 2016 2 Search engines come in many forms and types, but they all follow three basic steps: crawling,
More informationCS 345A Data Mining Lecture 1. Introduction to Web Mining
CS 345A Data Mining Lecture 1 Introduction to Web Mining What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns Web Mining v. Data Mining Structure (or lack of
More informationReading Time: A Method for Improving the Ranking Scores of Web Pages
Reading Time: A Method for Improving the Ranking Scores of Web Pages Shweta Agarwal Asst. Prof., CS&IT Deptt. MIT, Moradabad, U.P. India Bharat Bhushan Agarwal Asst. Prof., CS&IT Deptt. IFTM, Moradabad,
More informationSearching the Web [Arasu 01]
Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web
More informationAn Adaptive Approach in Web Search Algorithm
International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 15 (2014), pp. 1575-1581 International Research Publications House http://www. irphouse.com An Adaptive Approach
More informationIndexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems
Indexing: Part IV CPS 216 Advanced Database Systems Announcements (February 17) 2 Homework #2 due in two weeks Reading assignments for this and next week The query processing survey by Graefe Due next
More informationCS6200 Information Retreival. Crawling. June 10, 2015
CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on
More informationThe Topic Specific Search Engine
The Topic Specific Search Engine Benjamin Stopford 1 st Jan 2006 Version 0.1 Overview This paper presents a model for creating an accurate topic specific search engine through a focussed (vertical)
More informationDATA MINING II - 1DL460. Spring 2014"
DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
More informationCS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University
CS377: Database Systems Text data and information retrieval Li Xiong Department of Mathematics and Computer Science Emory University Outline Information Retrieval (IR) Concepts Text Preprocessing Inverted
More informationDATA MINING - 1DL105, 1DL111
1 DATA MINING - 1DL105, 1DL111 Fall 2007 An introductory class in data mining http://user.it.uu.se/~udbl/dut-ht2007/ alt. http://www.it.uu.se/edu/course/homepage/infoutv/ht07 Kjell Orsborn Uppsala Database
More informationUNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.
UNIT-V WEB MINING 1 Mining the World-Wide Web 2 What is Web Mining? Discovering useful information from the World-Wide Web and its usage patterns. 3 Web search engines Index-based: search the Web, index
More informationWeb Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search
Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search
More informationInformation Retrieval. Lecture 4: Search engines and linkage algorithms
Information Retrieval Lecture 4: Search engines and linkage algorithms Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk Today 2
More informationA Survey of Google's PageRank
http://pr.efactory.de/ A Survey of Google's PageRank Within the past few years, Google has become the far most utilized search engine worldwide. A decisive factor therefore was, besides high performance
More informationLecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science
Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches
More informationIndex Construction. Dictionary, postings, scalable indexing, dynamic indexing. Web Search
Index Construction Dictionary, postings, scalable indexing, dynamic indexing Web Search 1 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationInformation Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group
Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)
More information5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search
Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page
More informationGoogle Scale Data Management
Google Scale Data Management The slides are based on the slides made by Prof. K. Selcuk Candan, which is partially based on slides by Qing Li Google (..a course on that??) 2 1 Google (..a course on that??)
More informationWelcome to the class of Web Information Retrieval!
Welcome to the class of Web Information Retrieval! Tee Time Topic Augmented Reality and Google Glass By Ali Abbasi Challenges in Web Search Engines Min ZHANG z-m@tsinghua.edu.cn April 13, 2012 Challenges
More informationA STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE
A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular
More informationAdministrative. Web crawlers. Web Crawlers and Link Analysis!
Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationEfficient query processing
Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking functions In general, document scoring functions are of the form The BM25 function, is one of the best performing:
More informationAN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES
Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes
More informationWeb Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document
More informationEXTRACTION OF RELEVANT WEB PAGES USING DATA MINING
Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,
More informationVannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17
Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision
More informationWeb Structure Mining using Link Analysis Algorithms
Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.
More informationOptimizing Search Engines using Click-through Data
Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches
More informationInformation Retrieval. Lecture 10 - Web crawling
Information Retrieval Lecture 10 - Web crawling Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Crawling: gathering pages from the
More informationInforma/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields
Informa/on Retrieval CISC437/637, Lecture #23 Ben CartereAe Copyright Ben CartereAe 1 Text Search Consider a database consis/ng of long textual informa/on fields News ar/cles, patents, web pages, books,
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationRecent Researches on Web Page Ranking
Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through
More informationA Framework for adaptive focused web crawling and information retrieval using genetic algorithms
A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably
More informationWebsite Name. Project Code: # SEO Recommendations Report. Version: 1.0
Website Name Project Code: #10001 Version: 1.0 DocID: SEO/site/rec Issue Date: DD-MM-YYYY Prepared By: - Owned By: Rave Infosys Reviewed By: - Approved By: - 3111 N University Dr. #604 Coral Springs FL
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Web Crawling Instructor: Rada Mihalcea (some of these slides were adapted from Ray Mooney s IR course at UT Austin) The Web by the Numbers Web servers 634 million Users
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationCC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan
CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com How does Google know about the Web? Inverted Index: Example 1 Fruitvale Station is a 2013
More informationCentralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge
Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum
More informationThe PageRank Citation Ranking: Bringing Order to the Web
The PageRank Citation Ranking: Bringing Order to the Web Marlon Dias msdias@dcc.ufmg.br Information Retrieval DCC/UFMG - 2017 Introduction Paper: The PageRank Citation Ranking: Bringing Order to the Web,
More informationWeb Search Engines: Solutions to Final Exam, Part I December 13, 2004
Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to
More informationIntroduction to Information Retrieval and Anatomy of Google. Information Retrieval Introduction
Introduction to Information Retrieval and Anatomy of Google Information Retrieval Introduction Earlier we discussed methods for string matching Appropriate for small documents that fit in memory available
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationMap Reduce. Yerevan.
Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate
More informationProximity Prestige using Incremental Iteration in Page Rank Algorithm
Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationCHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER
CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More information1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a
!"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and
More informationInformation Retrieval II
Information Retrieval II David Hawking 30 Sep 2010 Machine Learning Summer School, ANU Session Outline Ranking documents in response to a query Measuring the quality of such rankings Case Study: Tuning
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query
More informationSearching in All the Right Places. How Is Information Organized? Chapter 5: Searching for Truth: Locating Information on the WWW
Chapter 5: Searching for Truth: Locating Information on the WWW Fluency with Information Technology Third Edition by Lawrence Snyder Searching in All the Right Places The Obvious and Familiar To find tax
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationAnalysis of Link Algorithms for Web Mining
International Journal of Scientific and Research Publications, Volume 4, Issue 5, May 2014 1 Analysis of Link Algorithms for Web Monica Sehgal Abstract- As the use of Web is
More informationLecture 8: Linkage algorithms and web search
Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationEinführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme
Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants
More informationEffective Page Refresh Policies for Web Crawlers
For CS561 Web Data Management Spring 2013 University of Crete Effective Page Refresh Policies for Web Crawlers and a Semantic Web Document Ranking Model Roger-Alekos Berkley IMSE 2012/2014 Paper 1: Main
More informationSalford Systems Predictive Modeler Unsupervised Learning. Salford Systems
Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationLecture #3: PageRank Algorithm The Mathematics of Google Search
Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,
More informationMAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds
MAE 298, Lecture 9 April 30, 2007 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in
More informationF. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google
Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META
More information