~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~

1. Search Engines, history and different types

In the beginning there was Archie (1990, indexed computer files) and Gopher (1991, indexed plain-text documents). Lycos (1994) and AltaVista (1995) were amongst the early popular search engines. Google (1998) appeared a little later, and MSN Search débuted in 2005.

Google and AltaVista are examples of indexing search engines, which use varying criteria to rank relevant search engine results pages (SERPs). Clusty, a meta-search engine, clusters results based on textual and linguistic similarity; it attempts to provide relevant results more quickly than a general-purpose search engine, to discover relationships between items, and to surface deeper results that would be buried in a purely ranked list. Dogpile, also a meta-search engine, combines the results of several search engines, giving potentially greater coverage of the Web on the assumption that no single engine covers the entire web. Ask Jeeves (now simply Ask) specialises in natural language queries (NLQs). Yahoo! is an example of a directory rather than a search engine, populated by human editors. In the early days of the web this proved an ideal way to navigate, but it does not scale well as the web assumes huge proportions.

Some search engines have worked on a pay-per-click basis, with companies paying for clickthroughs, i.e. visits to their websites. Overture (originally GoTo.com) is an example of this, and is now part of Yahoo! Search Marketing. A potential extension of this concept is for companies to pay search engines only for completed sales; there would need to be a way to verify sales, but a far higher payment rate could be applied. This is the holy grail of targeted advertising.

2. Challenges faced by Search Engines

The web is growing faster than search engines can index it (especially directory-based engines such as Yahoo!). Many pages are updated frequently (e.g. news providers), which forces search engines to revisit them periodically. Queries are mostly limited to keyword searches; proximity searches offer an improvement, restricting search terms to the same sentence or paragraph. Dynamically-generated pages are difficult to index, since they do not persist, and the deep web cannot be indexed at all. Link farms, link selling (from sites with a high PageRank) and other manipulations attempt to subvert SERPs. A google bomb is a specific attempt to influence the ranking of a web page for a given phrase by adding links to the page with the phrase as the anchor text (e.g. "failure" ranking the biography of G. W. Bush highest).

3. How does PageRank work, and how can it be personalised?

PR(W) = T/N + (1 - T)( PR(W1)/O(W1) + PR(W2)/O(W2) + ... + PR(Wn)/O(Wn) )

where T is the teleportation probability and N is the number of pages (if T = 1, most of the equation disappears). The PageRank of a given page W is related to the sum of the PageRanks of all the pages W1 ... Wn linking to W, each divided by the number of outlinks O(Wi) of that page (only a portion of each page's PageRank contributes to W's PageRank).

PageRank assigns a user-independent importance to pages. It can be personalised by biasing the algorithm to prefer certain kinds of pages, namely those preferred by the user (e.g. by using features of the URL such as the domain). This raises the PageRank of pages the user prefers, thus personalising the ranking.

4. What is the idea behind the random surfer model?

PageRank is based on the random surfer model. The model has a parameter a, with 0 ≤ a ≤ 1. In each time step a surfer currently visiting a document follows a link with probability a or, with the remaining probability 1 - a, jumps to a randomly chosen page. The traditional random surfer model does not reflect some very common aspects of Web surfing, e.g. reverting to a page previously visited. Reverting to previous pages constitutes a substantial part of Web surfing behaviour and should be reflected in ranking schemes.

5. What is the PageRank rank sink problem and how is it overcome?

A rank sink is a loop between web pages, e.g. A → B → C → A, in which rank would accumulate when generating PageRank. The problem is overcome by using a rank source, or the concept of random jumps taken with some probability, which circumvents the infinite-loop scenario. This is described in Page, Brin, Motwani & Winograd's paper The PageRank Citation Ranking: Bringing Order to the Web (1998).

6. Calculating PageRank

PageRank is an iterative algorithm which is repeated until the PageRanks stabilise. The teleportation (or random surfer) element has an effect on the result: the higher the teleportation probability T, the more the PageRanks tend towards the same value. When T = 1, all PageRanks are equal; when T = 0, however, there is the possibility of a rank sink, which is to be avoided. Page & Brin suggested a value of 0.15 (i.e. 85% of the time a user will traverse by following links).

(i) calculate the adjacency matrix; (ii) normalise each page's entries by its number of outlinks to obtain the transition matrix B; (iii) initially give all pages equal weight w; (iv) recalculate the weights as w ← Bw; (v) iterate until the weights stabilise, i.e. until w = Bw (a fixed point).
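A minimal sketch of this iteration in Python (illustrative only: the example graph, function name and tolerance are assumptions, not from the notes, and every page is assumed to have at least one outlink):

import numpy as np

def pagerank(adj, T=0.15, tol=1e-8):
    """Iterative PageRank with teleportation probability T.
    adj[i][j] = 1 if page i links to page j; no dangling pages."""
    N = adj.shape[0]
    # Normalise each row by its number of outlinks: B[i][j] is the
    # probability of moving from page i to page j by following a link.
    B = adj / adj.sum(axis=1)[:, None]
    w = np.ones(N) / N                      # equal initial weights
    while True:
        # Teleport with probability T, otherwise follow a link.
        w_next = T / N + (1 - T) * (B.T @ w)
        if np.abs(w_next - w).max() < tol:  # stop once the ranks stabilise
            return w_next
        w = w_next

# The rank sink A -> B -> C -> A plus a page D linking into it: the
# teleportation term keeps D's rank above zero despite its lack of inlinks.
adj = np.array([[0, 1, 0, 0],   # A -> B
                [0, 0, 1, 0],   # B -> C
                [1, 0, 0, 0],   # C -> A
                [1, 0, 0, 0]],  # D -> A
               dtype=float)
print(pagerank(adj))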

7. What is TF-IDF? Compare PageRank and TF-IDF.

Term frequency (TF): a term that appears many times within a document is likely to be more important than a term that appears only once. Inverse document frequency (IDF): a term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents. The TF-IDF weight is a statistical measure used to evaluate how important a word is to a document. A high TF-IDF weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights thus tend to filter out common terms.

E.g. if the word Web appears 8 times in a 200-word document, its TF is 8/200 = 0.04. If only 2 documents out of 100 contain the word Web, its document frequency is 2/100 = 0.02, and its IDF is the inverse, 100/2 = 50 (in practice the IDF is usually log-scaled). The TF-IDF weight is therefore 0.04 × 50 = 2 (equivalently, 0.04/0.02 = 2).

PageRank: the rank of a web page is higher if many pages link to it, and links from highly ranked pages are given greater weight than links from less highly ranked pages. With TF-IDF, documents are ranked depending on how well they match a specific query; with PageRank, pages are ranked in order of importance, with no reference to any specific query.
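A short sketch of this computation (the collection and tokenisation are invented; the unscaled N/df form of IDF from the example above is used rather than the more usual log-scaled variant):

from collections import Counter

def tf_idf(term, doc_tokens, collection):
    """TF-IDF of `term` in one document; `collection` is a list of token lists."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)  # term frequency
    df = sum(term in doc for doc in collection)       # document frequency
    return tf * (len(collection) / df)                # TF x IDF, with IDF = N/df

# Reproduces the worked example: TF = 8/200, and 2 of 100 documents
# contain the term, so TF-IDF = 0.04 * 50 = 2.0.
doc = ["web"] * 8 + ["filler"] * 192                    # a 200-word document
collection = [doc, ["web", "page"]] + [["other"]] * 98  # 100 documents in total
print(tf_idf("web", doc, collection))                   # 2.0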

8. What is a Markov chain? Why are they useful?

A Markov chain (named after the Russian mathematician Andrey Markov) is a network with a probability function imposed upon it. It is a statistical model concerning states and transitions, with the property that it is memoryless, i.e. it does not remember its previous states (real-life surfing is not as idealised as this). A Markov chain has two components: (i) a network structure in which each node is called a state; (ii) a transition probability of traversing a link from one state to another. The sum of the outgoing probabilities pi from a state is always 1, i.e. Σ pi = 1.

A random walk is a path through the network. The probability of the path is the product of all the probabilities along it (< 1). We can write a Markov chain as a sparse matrix, with a probability in each cell where there is a link. Support s in a chain accepts only paths whose initial probability is above s; confidence c accepts only paths whose probability is above c.

9. What are Power Law Distributions? Give Examples.

Power laws describe quantities that vary as a power of one another, and appear as straight lines on a log-log graph. A power-law relationship is of the form y = ax^k, which can be expressed as log(y) = k log(x) + log(a), itself of the straight-line form y = mx + c. Examples include scale-free networks such as the web (a few highly-connected nodes, many low-connected nodes), Newtonian gravity (the inverse-square law), the Pareto principle (wealth distribution, aka the 80-20 rule), earthquake magnitudes (the Richter scale), storm strengths (the Beaufort scale) and city sizes.

Power laws often have long tails: the low-frequency, low-amplitude portion of the population. The term was coined in a 2004 article by Chris Anderson in Wired.

Hubs & Authorities: some nodes are focal points with many inlinks (authorities), others point to many of them (hubs, with many outlinks). Many are both.

Preferential attachment, aka "the rich get richer": popular sites tend to become more popular, thus accentuating the power-law distribution.
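A minimal Markov chain sketch in Python (the states and probabilities are invented for illustration; a real chain over web pages would use a sparse matrix, as noted above):

import numpy as np

# Row i holds the outgoing transition probabilities of state i;
# each row must sum to 1, as required of a Markov chain.
states = ["home", "products", "checkout"]
P = np.array([[0.20, 0.70, 0.10],
              [0.40, 0.10, 0.50],
              [0.90, 0.05, 0.05]])
assert np.allclose(P.sum(axis=1), 1.0)

def path_probability(path):
    """Probability of a random walk = product of its transition probabilities."""
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= P[states.index(a), states.index(b)]
    return prob

# P(home -> products -> checkout) = 0.7 * 0.5 = 0.35
print(path_probability(["home", "products", "checkout"]))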

10. The Bow-Tie theory

This model suggests that only a portion of the Web (the core) is well connected. There are many pages which link into the core but are not linked to from it (e.g. new pages), and many which are linked to from the core but do not link back (e.g. corporate websites). There are also tubes and tendrils which bypass the core, and disconnected pages.

11. What are recall/precision in the context of Search Engines?

Recall is the ability of a search engine to obtain all of the relevant documents. Precision is the ratio or fraction of a search output that is relevant for a particular query. There is a certain amount of subjectivity in these terms, and it is generally not possible to know the total number of relevant documents, which calculating recall requires.

12. Recommender Systems and Collaborative Filtering. What is collaborative filtering? What is the difference between user-based and item-based collaborative filtering?

Recommender systems are programs which make predictions, often implemented using collaborative filtering algorithms. Data can be gathered implicitly (e.g. recording purchases or page impressions) or explicitly (asking for ratings, e.g. IMDb, or for relative rankings).

Collaborative filtering (CF) is the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future.

User-based collaborative filtering predicts a user's interests from the profiles of similar users. A user-item matrix is constructed, and the correlation between near neighbours (users) can be calculated, treating each user's ratings as a vector. Item-based collaborative filtering (e.g. Amazon.com's "users who bought x also bought y") builds an item-item matrix which captures relationships between pairs of items, then uses that matrix together with the data on the current user to make inferences.

Difficulties with CF include sparsity (many users, but few ratings), scalability (huge numbers of users and items, e.g. customers and books) and the first-rater problem (i.e. cold starting).
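A minimal user-based CF sketch (the ratings matrix is invented; cosine similarity is one common choice for comparing user vectors, correlation being another):

import numpy as np

# User-item ratings matrix: rows = users, columns = items, 0 = unrated.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)

def cosine(u, v):
    """Similarity of two users, treating their ratings as vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predict(user, item):
    """Predict a rating as the similarity-weighted average of the
    ratings given to `item` by the other users who have rated it."""
    others = [u for u in range(len(R)) if u != user and R[u, item] > 0]
    sims = np.array([cosine(R[user], R[u]) for u in others])
    ratings = np.array([R[u, item] for u in others])
    return (sims @ ratings) / sims.sum()

# User 0 has not rated item 2; the prediction leans towards user 1's
# rating, since user 1's tastes are far more similar to user 0's.
print(predict(0, 2))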

13. Web Data Mining: contrast content mining, link mining and usage mining.

Content mining concerns extracting useful information or knowledge from web page contents. Usage mining refers to the discovery of user access patterns from Web usage logs. Link mining tries to discover useful knowledge from the structure of hyperlinks.

Content mining retrieves and analyses web content, classifying web objects (pages, websites, etc.). Objects can be clustered, and web crawlers can use a focused search to reveal patterns. DMOZ (the Open Directory Project) is an example of clustering by classification.

Applications of usage mining include website reorganisation, caching oft-accessed pages, recommendation and personalisation. Users can be identified by IP address (unreliable), by cookies (security and deleted-cookie issues) or by logins. Sessions can be identified by time-stamping or by navigation connections; sessions illustrate user trails through a website. Markov chains can be constructed from user trails; by adding start and end states, the chains are termed absorbing (i.e. self-contained). Pattern discovery tools can identify complex patterns in usage.

PageRank and Hubs & Authorities are the two best-known applications of link mining. Other forms of mining include association rule mining, market basket analysis and cluster analysis.

14. What is capture/recapture?

This is a simple mechanism for estimating the size of a population. We begin by capturing a sample of size c, tagging them, and then returning them to the pond. After allowing some time for mixing, a second recaptured sample of size r is taken and the number t of them which are tagged is recorded. An estimate of the number of fish in the pond is

N = (c × r) / t

Put another way, the ratio of tagged recaptures to the original capture (t/c) is estimated to be the same as the ratio of the recaptured sample size to the total population (r/N).

In the context of the size of the web, we can use two search engines' results to simulate capture/recapture, with the overlap between them providing the recapture value. We need to know the size of the web as reported by one of the search engines. The equation thus becomes

Size of web = (size of SE1 × number of pages returned by SE2) / overlap between SE1 and SE2

This is becoming a less precise metric as more of the web consists of dynamic pages; it is no longer possible to determine the size of the web in terms of pages.
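The estimator in a few lines of Python (the figures are invented to show the arithmetic, not real search-engine data):

def capture_recapture(c, r, t):
    """Population estimate N = c * r / t (tag c, recapture r, of which t are tagged)."""
    return c * r / t

# Pond: tag 100 fish, recapture 60, of which 12 are tagged -> N = 500.
print(capture_recapture(c=100, r=60, t=12))

# Web-size version: SE1 reports an index of 4e9 pages; across a query
# sample SE2 returns 1e6 pages, of which 5e5 overlap with SE1's results.
print(capture_recapture(c=4e9, r=1e6, t=5e5))  # ~8e9 pages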

15. RSS (Really Simple Syndication)

RSS is a way to publish and receive information without using e-mail. Its advantages are that recipients opt in to receiving an RSS feed, it is not filtered (unlike e-mail), and it can be updated as often as required. RSS feeds can be read using an RSS feed aggregator. RSS is pulled by the recipient (or its feed aggregator), not pushed by the publisher.

RSS is based on XML; the RSS specification describes the structure of the XML necessary to deliver RSS information. A minimal structure is

<rss version="2.0">
  <channel>
    <title>rss feed title</title>
    <link>http://www.myfeed.com/</link>
    <description>short description of feed</description>
    <item>
      <title>story headline</title>
      <description>summary of story</description>
      <link>http://www.myfeed.com/story</link>
    </item>
  </channel>
</rss>

Channels can also include an image, published date, editor, etc. There can be many items within a channel. Items can also include author, category, published date, etc.
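A small sketch of reading such a feed with Python's standard library (the document is the minimal structure above; a real aggregator would fetch it over HTTP and poll it periodically, matching the pull model described):

import xml.etree.ElementTree as ET

rss = """<rss version="2.0">
  <channel>
    <title>rss feed title</title>
    <link>http://www.myfeed.com/</link>
    <description>short description of feed</description>
    <item>
      <title>story headline</title>
      <description>summary of story</description>
      <link>http://www.myfeed.com/story</link>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(rss).find("channel")  # <rss> is the root element
print("Feed:", channel.findtext("title"))
for item in channel.findall("item"):          # one <item> per story
    print(item.findtext("title"), "->", item.findtext("link"))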