Lec 8: Adaptive Information Retrieval 2


Lec 8: Adaptive Information Retrieval 2. Advaith Siddharthan. Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website: http://nlp.stanford.edu/ir-book/

Linear Algebra Revision. Vectors: one-dimensional matrices. X = [x_0 x_1 x_2 ... x_n]. |X| = length of X = sqrt(x_0^2 + x_1^2 + x_2^2 + ... + x_n^2) = sqrt(Σ_i x_i^2). Often used to represent coordinates in space (x, y, z), but vectors can have any dimension.
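As a minimal sketch, the vector length above can be computed directly in Python (the example vectors are toy values, not from the lecture):

```python
import math

def vector_length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(xi * xi for xi in x))

print(vector_length([3, 4]))        # 5.0 (the classic 3-4-5 right triangle)
print(vector_length([1, 1, 1, 1]))  # 2.0 (works in any dimension)
```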

Scalar Multiplication. Two vectors can be multiplied using the dot product (also called the scalar product) to give a scalar number. X = [x_0 x_1 x_2 ... x_n], Y = [y_0 y_1 y_2 ... y_n]. X·Y = x_0·y_0 + x_1·y_1 + x_2·y_2 + ... + x_n·y_n = Σ_i x_i·y_i

Scalar Multiplication. Example: X = [1 2 3 4], Y = [5 6 7 8]. X·Y = 1·5 + 2·6 + 3·7 + 4·8 = 70. Geometrically, X·Y / |Y| = length of the projection of X on Y.

Geometric Interpretation. A·A = |A|^2, B·B = |B|^2, and A·B = |A|·|B|·cos(θ), where θ is the angle between A and B. cos(0°) = 1; cos(90°) = 0. The cosine function is a similarity metric.

Vector Product. The vector product is also called the cross product. A_{m×n} B_{n×p} = C_{m×p}. For example:

[2]           [2  10]
[3] [1 5]  =  [3  15]
[4]           [4  20]

C_ij = Row i · Column j. Rows of C are rows of B multiplied by a scalar value from A; columns of C are columns of A multiplied by a scalar value from B.

Matrix Multiplication. In general, C_ij = Row i of A · Column j of B.
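A short Python sketch of these three products, using the numbers from the slides (the function names are my own, not from the lecture):

```python
def dot(x, y):
    """Scalar (dot) product: sum of pairwise products."""
    return sum(xi * yi for xi, yi in zip(x, y))

def outer(a, b):
    """Product of a column vector a (m x 1) and a row vector b (1 x p): an m x p matrix."""
    return [[ai * bj for bj in b] for ai in a]

def matmul(A, B):
    """Matrix product: C[i][j] = row i of A dotted with column j of B."""
    return [[dot(row, col) for col in zip(*B)] for row in A]

print(dot([1, 2, 3, 4], [5, 6, 7, 8]))  # 70, as in the slide
print(outer([2, 3, 4], [1, 5]))          # [[2, 10], [3, 15], [4, 20]]
```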

Overview. 3 Lectures: (1) Information Retrieval: history and evolution; vector models. (2) Link Analysis: using anchor text for indexing; using hyperlinks as recommendations; PageRank; Personalised PageRank. (3) Adaptive and Interactive IR.

Properties of the internet. Google's indexes are big: 1998: 26 million pages; 2000: 1 billion pages; 2004: 8 billion pages; 2008: 1 trillion unique URLs. These numbers are now meaningless (auto-generated content, duplicates, etc.); probably around 20 billion pages are indexed.

Properties of the internet. Dynamic: page content changes around twice a month on average, and over a million pages are added every day. Indexing is a continuous process: news sites etc. have to be indexed constantly, and popular sites are indexed more often.

Vector Space Model. Documents and queries are vectors of normalised term counts (tf*idf). Comparison of query Q and document D: Cosine(Q, D) = Q·D / (|Q|·|D|). Returns ranked documents for a query, based entirely on the textual content of the documents and the query.
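A minimal sketch of the cosine comparison in Python; the two vectors stand in for tf*idf-weighted term vectors and their values are made up for illustration:

```python
import math

def cosine(q, d):
    """Cosine similarity: Q.D / (|Q| |D|). 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Toy term vectors over a 3-term vocabulary (hypothetical tf*idf weights).
print(cosine([1, 0, 1], [1, 1, 0]))  # shares one term out of two: ~0.5
print(cosine([2, 4, 0], [1, 2, 0]))  # same direction: ~1.0, length does not matter
```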

Problems. Not all documents on the web are reliable, and websites can cheat to improve their rank on queries. Indexing is done by an algorithm based on the content provided on the web page. How do we know which websites are reliable?

Problems with Term Counts. For the term IBM, how do you distinguish: IBM's home page (mostly graphical; IBM occurs only a few times in the HTML); IBM's copyright page (IBM occurs over 100 times); a rival's spam page (arbitrarily large term count for IBM)?

Hyperlinks for search. The web as a graph: anchor text pointing to page B provides a description of B, and a hyperlink from page A to B is a recommendation or endorsement of B. Ignore internal links? (Example anchor texts pointing to IBM.com: IBM computers, IBM Corporation, International Business Machines.)

Anchor Text. <a href="http://www.ibm.com"> IBM computers </a> has anchor text "IBM computers". "computer" occurs only once on the ibm.com HTML page; yahoo.com doesn't contain the word "portal"; apple.com doesn't contain the word "apple"! Gaps exist between the terms present on a website and terms useful for indexing it. These gaps can usually be filled by anchor text.

Anchor Text for indexing. We need tf*idf again: the most common words in anchor text are "Click" and "Here". Search engines give substantial weight to index terms obtained from anchor text, e.g. satchmo -> louisarmstronghouse.org.

Extended Anchor Text. The area around the anchor text is useful too: "Click here for information about mutual funds". Search engines make use of extended anchor text as well.

Links as recommendations. PageRank (Brin and Page, 1998): a link from A to B is a recommendation of B. Think of science: highly cited papers are considered of higher quality, and backlinks are like citations. But webpages aren't reviewed, so how do we know the citing page A is reliable? By counting links to A, of course!

PageRank. Consider a random surfer who clicks on links at random. [Diagram: a small web graph on pages A–F; each edge is labelled with the probability of being followed, i.e. 1/(number of out-links of its source): 1/3 for each of three out-links, 1/2 for each of two, 1/1 for a single out-link.]

PageRank If you continue this random walk You will visit some pages more frequently than others These are pages with lots of links from other pages with lots of links PageRank: Pages visited more often in a random walk are more important (reliable)

Teleporting. What if the random surfer reaches a page with no hyperlinks? Teleport: the surfer jumps from a page to any other page in the web graph at random. If there are N pages in the web graph, teleporting takes the surfer to each node with probability 1/N. Use the teleport operation: with probability 1 if there are no outgoing links; otherwise with probability α, where 0 < α < 1.

Need for Teleporting To avoid loops where you are forced to keep visiting the same sites in the random walk

Steady State. Given this model of a random surfer, the surfer spends a fixed fraction of the time at each page, which depends on: the hyperlink structure of the web; the value of α (usually 0.1). PageRank of page v: π(v) = fraction of the time spent at page v.

Page Rank Computation. Represent the web as an adjacency matrix: Adj(i,j) = 1 iff there is a link from i to j; Adj(i,j) = 0 iff there is no link from i to j. For a graph where A links to B and C, and B and C each link to A:

        A  B  C
A   [   0  1  1  ]
B   [   1  0  0  ]
C   [   1  0  0  ]

Transition Probabilities. 1) If a row has no 1s, replace each element by 1/N (teleport if no outgoing links). 2) Divide each 1 in Adj by the number of 1s in its row (probability of clicking on the link to that page):

A   [   0   1/2  1/2 ]
B   [  1/1   0    0  ]
C   [  1/1   0    0  ]

Transition Probabilities. Let's consider α = 1/2, N = 3. 3) Multiply everything by 1/2 = (1 - α) (probability of not teleporting). 4) Add 1/6 = (α/N) to every entry:

    [     1/6        1/4 + 1/6 = 5/12   1/4 + 1/6 = 5/12 ]
P = [ 1/2 + 1/6 = 2/3     1/6               1/6          ]
    [ 1/2 + 1/6 = 2/3     1/6               1/6          ]
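The four steps above can be sketched in Python for the three-page example (a toy illustration, not the lecture's own code):

```python
N = 3
alpha = 0.5  # teleport probability

# Adjacency matrix from the slides: A links to B and C; B and C link to A.
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]

P = []
for row in adj:
    ones = sum(row)
    if ones == 0:
        base = [1.0 / N] * N            # no out-links: teleport everywhere
    else:
        base = [x / ones for x in row]  # split probability over out-links
    # Mix in teleporting: (1 - alpha) * follow-a-link + alpha/N uniform jump.
    P.append([(1 - alpha) * p + alpha / N for p in base])

# P is now [[1/6, 5/12, 5/12], [2/3, 1/6, 1/6], [2/3, 1/6, 1/6]]
for row in P:
    print([round(p, 4) for p in row])
```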

Starting State. Imagine the surfer starts at page B. At the beginning, x_0 = [0, 1, 0]. The vectors x_n show the proportion of time spent on pages A, B, C at time n. At step one, x_1 = x_0 P, where

    [ 1/6  5/12  5/12 ]
P = [ 2/3  1/6   1/6  ]
    [ 2/3  1/6   1/6  ]

x_1 = [2/3, 1/6, 1/6] (e.g. the B entry: 0·5/12 + 1·1/6 + 0·1/6 = 1/6).

Iteration 2. At step one, x_1 = [2/3, 1/6, 1/6]. At step 2, x_2 = x_1 P:

x_2 = [2/18 + 2/18 + 2/18, 10/36 + 1/36 + 1/36, 10/36 + 1/36 + 1/36] = [1/3, 1/3, 1/3]

(e.g. the B entry: 2/3·5/12 + 1/6·1/6 + 1/6·1/6 = 1/3).

Iterating...

       A      B      C
x_0    0      1      0
x_1   2/3    1/6    1/6
x_2   1/3    1/3    1/3
x_3   1/2    1/4    1/4
x_4   5/12   7/24   7/24
...   ...    ...    ...
x_∞   4/9    5/18   5/18
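The iteration can be reproduced in a few lines of Python, using the matrix P derived above for α = 1/2 (a sketch of the power-iteration idea, not production PageRank code):

```python
def step(x, P):
    """One step of the random walk: x_{n+1} = x_n P (vector-matrix product)."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[1/6, 5/12, 5/12],
     [2/3, 1/6,  1/6],
     [2/3, 1/6,  1/6]]

x = [0.0, 1.0, 0.0]  # start at page B
for _ in range(50):  # 50 steps is far more than enough for a 3-page graph
    x = step(x, P)

# Converges to the steady state [4/9, 5/18, 5/18] ~ [0.444, 0.278, 0.278].
print([round(v, 4) for v in x])
```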

Solving by hand. B and C are symmetric, so the steady state has the form π = (1 - 2p, p, p) (the entries have to add up to 1). Solve πP = π for p; the equation for the first (A) entry is:

1/6·(1 - 2p) + 2/3·p + 2/3·p = 1 - 2p
1/6 - 1/3·p + 4/3·p = 1 - 2p
3p = 5/6, so p = 5/18

π = [4/9, 5/18, 5/18]

Example. Which sites have low / high PageRank? [Diagram: a web graph on pages D0–D6.]

Example (α = 0.14). π = [0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31], i.e. D0 = 0.05, D1 = 0.04, D2 = 0.11, D3 = 0.25, D4 = 0.21, D5 = 0.04, D6 = 0.31.

Web Search. Ranking documents for a query: vector similarity Cosine(Q, D), with terms from the document and anchor text, normalised using tf*idf; plus PageRank. PageRank is independent of the query: it is a property of the graph, a measure of reliability (collaborative trust). It has nothing to do with how often real users click on links; the random surfer was only used to calculate a property of the graph.

Properties of Page Rank New pages have to acquire Page Rank Either convince lots of sites to link to you Or convince a few high-pagerank sites Page Rank can change very fast One link on Yahoo or the BBC is enough Spamming PageRank costs money Need to create huge number of sites Google never sells PageRank

Top PageRank sites google.com adobe.com w3.org jigsaw.w3.org/css-validator cnn.com usa.gov get.adobe.com/flashplayer get.adobe.com/reader india.gov.in

Personalised PageRank. Why personalise? Tech sites tend to have many back links and high PageRank, which is a problem if you are not interested in IT. Try searching for: Apache, Snow Leopard, Java. PageRank reflects the interests of the web-creating majority. What if you are in the minority?

Personalised PageRank. Keep track of a user's favorite websites and increase the PageRank of these sites. During the iterative process, this PageRank will spread to sites that are linked, so PageRank will now reflect the user's interests. If you give wwf.org a large PageRank, this will spread to other wildlife sites, and you might then see real snow leopards when you search. BUT...

Personalised PageRank. PageRank vectors are very big and time-consuming to compute, even once. You don't want to compute one for each user, or continuously update it as their browsing behaviour changes: it is too computationally intensive.

Personalised PageRank Compromise Personalise by subject, not user Create a PageRank vector for each subject (Sports, Politics, etc) How?

Topic-specific PageRank. The random surfer follows a link OR teleports. What if they teleport only to sites relevant to a topic? Use a directory of sports pages from Yahoo or dmoz. We can then build π_sports, π_politics, etc.

User Modelling We can then model a user as a linear combination of Topics For example if we say a user's interests are 60% Sports 40% Politics Can we compute a PageRank for this?

User Modelling. We don't need to recompute PageRank: if each webpage has a Politics PageRank and a Sports PageRank precomputed, we can just use a linear combination of PageRanks for a user with mixed interests: π_user = 0.6·π_sports + 0.4·π_politics. Topic PageRanks are calculated offline by the server; the user profile, e.g. (0.4, 0.6, ...), is maintained client-side. An efficient method that can be used at runtime.
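The linear combination is cheap enough to do per query. A sketch in Python, where the two topic vectors are made-up toy values over the same three pages (not real PageRank scores):

```python
# Hypothetical precomputed topic PageRank vectors over the same set of pages.
pi_sports   = [0.10, 0.50, 0.40]
pi_politics = [0.60, 0.30, 0.10]

# User profile (60% sports, 40% politics), maintained client-side.
weights = {"sports": 0.6, "politics": 0.4}

# Per-page weighted sum: still a valid probability distribution (sums to 1).
pi_user = [weights["sports"] * s + weights["politics"] * p
           for s, p in zip(pi_sports, pi_politics)]

print([round(v, 2) for v in pi_user])  # [0.3, 0.42, 0.28]
```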