Lec 8: Adaptive Information Retrieval 2 Advaith Siddharthan Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website: http://nlp.stanford.edu/ir-book/
Linear Algebra Revision Vectors: One-Dimensional Matrices X = [x_0 x_1 x_2 ... x_n] |X| = length of X = sqrt(x_0^2 + x_1^2 + x_2^2 + ... + x_n^2) = sqrt(Σ_i x_i^2) Often used to represent coordinates in space (x, y, z), but vectors can have any dimension
Scalar Multiplication Two vectors can be multiplied using the dot product (also called scalar product) to give a scalar number. X = [x_0 x_1 x_2 ... x_n] Y = [y_0 y_1 y_2 ... y_n] X.Y = x_0*y_0 + x_1*y_1 + x_2*y_2 + ... + x_n*y_n = Σ_i x_i*y_i
Scalar Multiplication Two vectors can be multiplied using the dot product (also called scalar product) to give a scalar number. X = [1 2 3 4] Y = [5 6 7 8] X.Y = 1*5 + 2*6 + 3*7 + 4*8 = 70 X.Y / |Y| = length of the projection of X on Y
Geometric Interpretation A.A = |A|^2 B.B = |B|^2 A.B = |A| |B| cos(θ), where θ is the angle between A and B; cos(0) = 1, cos(90°) = 0 The cosine function is a similarity metric
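For concreteness, a small NumPy sketch (not from the slides) that checks the worked example: the dot product of [1 2 3 4] and [5 6 7 8], the vector lengths, the cosine between the two vectors, and the length of the projection of X on Y.

```python
import numpy as np

X = np.array([1, 2, 3, 4])
Y = np.array([5, 6, 7, 8])

dot = np.dot(X, Y)               # 1*5 + 2*6 + 3*7 + 4*8 = 70
length_X = np.linalg.norm(X)     # sqrt(1 + 4 + 9 + 16)
length_Y = np.linalg.norm(Y)

# Cosine of the angle between X and Y: X.Y / (|X| |Y|)
cosine = dot / (length_X * length_Y)

# Length of the projection of X on Y: X.Y / |Y|
projection_length = dot / length_Y

print(dot, cosine, projection_length)
```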
Vector Product A column vector multiplied by a row vector (the outer product) gives a matrix; more generally, A_{m×n} B_{n×p} = C_{m×p}. Example: [2 3 4]^T [1 5] = [[2 10], [3 15], [4 20]] C_ij = (row i of A) . (column j of B) Rows of C are the row of B multiplied by a scalar value from A; columns of C are the column of A multiplied by a scalar value from B
Matrix Multiplication C_ij = (row i of A) . (column j of B)
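A brief NumPy sketch of the column-times-row example above and of the general rule that C_ij is the dot product of row i of A with column j of B; the specific numbers are taken from the slide.

```python
import numpy as np

# Column vector (3x1) times row vector (1x2) gives a 3x2 matrix,
# matching the example on the slide: [2 3 4]^T [1 5]
A = np.array([[2], [3], [4]])
B = np.array([[1, 5]])
C = A @ B
# C = [[ 2 10]
#      [ 3 15]
#      [ 4 20]]

# In general, C[i, j] is the dot product of row i of A with column j of B
i, j = 1, 1
assert C[i, j] == np.dot(A[i, :], B[:, j])   # 3*5 = 15
```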
Overview 3 Lectures: Information Retrieval History and Evolution; Vector Models Link Analysis Using anchor text for indexing Using hyperlinks as recommendations PageRank Personalised PageRank Adaptive and Interactive IR
Properties of the internet Google indexes are big 1998: 26 Million pages 2000: 1 Billion pages 2004: 8 Billion pages 2008: 1 Trillion unique URLs These numbers are now meaningless: auto-generated content, duplicates, etc. Probably around 20 Billion are indexed
Properties of the internet Dynamic Page content changes around twice a month on average Over a million pages added every day Indexing is a continuous process News sites etc have to be indexed constantly Popular sites indexed more often
Vector Space Model Documents and queries are vectors - Normalised term counts (tf*idf) Comparison of query Q and document D - Cosine(Q, D) = Q.D / (|Q| |D|) Returns ranked documents for query - Based entirely on the textual content of the documents and query
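A minimal sketch of the vector space model as described above, assuming a very simplified tokenisation and tf*idf weighting (raw term counts times log(N/df)); the example documents and query are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf*idf vectors for a small collection (simplified weighting)."""
    N = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(term for doc in tokenised for term in set(doc))
    idf = {term: math.log(N / df[term]) for term in df}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in tokenised]
    return vectors, idf

def cosine(q, d):
    """Cosine(Q, D) = Q.D / (|Q| |D|), over sparse term->weight dictionaries."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

docs = ["ibm computers and servers", "apple computers", "snow leopard habitat"]
vectors, idf = tfidf_vectors(docs)
query = {t: idf.get(t, 0.0) for t in "ibm computers".split()}
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, vectors[i]), reverse=True)
print(ranking)   # documents ordered by cosine similarity to the query
```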
Problems Not all documents on the web are reliable Websites can cheat to improve their rank on queries Indexing is done by an algorithm based on the content provided on the web page How do we know which websites are reliable?
Problems with Term Counts For the term IBM, how do you distinguish IBM's home page (mostly graphical; IBM occurs only a few times in the html) IBM's copyright page (IBM occurs over 100 times) A Rival's spam page (Arbitrarily large term count for IBM)
Hyperlinks for search Web as a graph Anchor text pointing to page B provides a description of B A hyperlink from page A to B is a recommendation or endorsement of B Ignore internal links? [Figure: several pages link to IBM.com with anchor texts "IBM computers", "IBM Corporation", "International Business Machines"]
Anchor Text <a href="http://www.ibm.com">IBM computers</a> Anchor text: "IBM computers" "computer" occurs only once on ibm.com's HTML page yahoo.com doesn't contain the word "portal" apple.com doesn't contain the word "apple"! Gaps exist between the terms present on a website and useful terms for indexing These can usually be filled by anchor text
Anchor Text for indexing Need tf*idf again Most common words in anchor text are: "Click Here" Search engines give substantial weight to index terms obtained from anchor text Example: anchor text "satchmo" -> louisarmstronghouse.org
Extended Anchor Text Area around anchor text is useful too Click here for information about mutual funds Search engines make use of extended anchor text as well
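A rough sketch of how anchor text (and, by extension, extended anchor text) can be folded into an index: terms from inbound anchors are added to the target page's entry. The pages, anchors and the ANCHOR_WEIGHT value are all hypothetical; real engines weight these terms in more sophisticated ways.

```python
from collections import defaultdict

# Terms that appear on each page itself (hypothetical pages)
page_text = {
    "ibm.com": ["welcome", "products", "services"],
    "louisarmstronghouse.org": ["museum", "visit", "jazz"],
}

# (anchor text, target page) pairs harvested while crawling
inlinks = [
    ("IBM computers", "ibm.com"),
    ("International Business Machines", "ibm.com"),
    ("satchmo", "louisarmstronghouse.org"),
]

ANCHOR_WEIGHT = 0.5  # hypothetical weight given to anchor-text terms

index = defaultdict(lambda: defaultdict(float))   # term -> page -> weight
for page, terms in page_text.items():
    for term in terms:
        index[term.lower()][page] += 1.0
for anchor, target in inlinks:
    for term in anchor.lower().split():
        index[term][target] += ANCHOR_WEIGHT

# Finds louisarmstronghouse.org even though "satchmo" never appears on the page
print(dict(index["satchmo"]))
```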
Links as recommendations PageRank (Brin and Page, 1998) A link from A to B is a recommendation of B Think of science Highly cited papers are considered of higher quality Backlinks are like citations But webpages aren't reviewed, so how do we know the citer A is reliable? By counting links to A of course!
PageRank Consider a random surfer - Clicks on links at random [Figure: web graph over pages A–F; each outgoing link from a page is followed with equal probability, e.g. 1/3 from a page with three outgoing links, 1/2 from a page with two, 1/1 from a page with one]
PageRank If you continue this random walk You will visit some pages more frequently than others These are pages with lots of links from other pages with lots of links PageRank: Pages visited more often in a random walk are more important (reliable)
Teleporting What if the random surfer reaches a page with no hyperlinks? Teleport: the surfer jumps from the page to any other page in the web graph at random If there are N pages in the web graph, teleporting takes the surfer to each node with probability 1/N Use the teleport operation: if there are no outgoing links, with probability 1; otherwise, with probability α, where 0 < α < 1
Need for Teleporting To avoid loops where you are forced to keep visiting the same sites in the random walk
Steady State Given this model of a random surfer The surfer spends a fixed fraction of the time at each page, which depends on The hyperlink structure of the web The value of α (usually 0.1) PageRank of page ν: π(ν) = fraction of the time spent at page ν
Page Rank Computation Represent the Web as an adjacency matrix Adj(i,j) = 1 iff there is a link from i to j Adj(i,j) = 0 iff there is no link from i to j Example graph: A links to B and C; B links to A; C links to A
      A  B  C
A  [  0  1  1 ]
B  [  1  0  0 ]
C  [  1  0  0 ]
Transition Probabilities 1) If a row has no 1s, replace each element by 1/N (teleport if no outgoing links) 2) Divide each 1 in Adj by the number of 1s in its row (probability of clicking on the link to that page)
[ 0    1/2  1/2 ]
[ 1    0    0   ]
[ 1    0    0   ]
Transition Probabilities Let's consider α = 1/2, N = 3 3) Multiply everything by 1/2 = (1-α) (probability of not teleporting) 4) Add 1/6 = (α/N) to every entry
P = [ 1/6             1/4+1/6 = 5/12   1/4+1/6 = 5/12 ]
    [ 1/2+1/6 = 2/3   1/6              1/6            ]
    [ 1/2+1/6 = 2/3   1/6              1/6            ]
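A small sketch that follows the four steps above to build the transition matrix P from an adjacency matrix, reproducing the three-page example with α = 1/2.

```python
import numpy as np

def transition_matrix(adj, alpha):
    """Build the random-surfer transition matrix P from an adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    N = adj.shape[0]
    P = np.empty_like(adj)
    for i in range(N):
        out = adj[i].sum()
        if out == 0:
            P[i] = 1.0 / N                              # dead end: teleport with probability 1
        else:
            P[i] = (1 - alpha) * adj[i] / out + alpha / N
    return P

adj = [[0, 1, 1],   # A links to B and C
       [1, 0, 0],   # B links to A
       [1, 0, 0]]   # C links to A
P = transition_matrix(adj, alpha=0.5)
print(P)            # [[1/6 5/12 5/12], [2/3 1/6 1/6], [2/3 1/6 1/6]]
```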
Starting State Imagine the surfer starts at page B At the beginning, x_0 = [0, 1, 0] The vectors x_n show the proportion of time spent on pages A, B, C at time n At step one, x_1 = x_0 P = [0, 1, 0]
[ 1/6   5/12  5/12 ]
[ 2/3   1/6   1/6  ]
[ 2/3   1/6   1/6  ]
x_1 = [2/3, 1/6, 1/6] (e.g. the second component is 0*5/12 + 1*1/6 + 0*1/6 = 1/6)
Iteration 2 At step one, x_1 = [2/3, 1/6, 1/6] At step 2, x_2 = x_1 P = [2/3, 1/6, 1/6]
[ 1/6   5/12  5/12 ]
[ 2/3   1/6   1/6  ]
[ 2/3   1/6   1/6  ]
x_2 = [2/18+2/18+2/18, 10/36+1/36+1/36, 10/36+1/36+1/36] = [1/3, 1/3, 1/3] (e.g. the second component is 2/3*5/12 + 1/6*1/6 + 1/6*1/6 = 1/3)
Iterating...
        A      B      C
x_0     0      1      0
x_1     2/3    1/6    1/6
x_2     1/3    1/3    1/3
x_3     1/2    1/4    1/4
x_4     5/12   7/24   7/24
...     ...    ...    ...
π       4/9    5/18   5/18
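The table above can be reproduced by iterating x ← xP; a minimal sketch, starting from x_0 = [0, 1, 0] as on the slides.

```python
import numpy as np

def power_iterate(P, x0, n_steps=50):
    """Repeatedly apply x <- x P to approach the steady-state distribution."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x @ P
    return x

P = np.array([[1/6, 5/12, 5/12],
              [2/3, 1/6,  1/6],
              [2/3, 1/6,  1/6]])
x0 = [0, 1, 0]                      # surfer starts at page B
print(power_iterate(P, x0))         # converges to [4/9, 5/18, 5/18] ≈ [0.444, 0.278, 0.278]
```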
Solving by hand
P = [ 1/6   5/12  5/12 ]
    [ 2/3   1/6   1/6  ]
    [ 2/3   1/6   1/6  ]
B and C are symmetric, so π = (1-2p, p, p) (the entries have to add up to 1) Solve πP = π to get p = 5/18: for the first component, 1/6*(1-2p) + 2/3*p + 2/3*p = 1-2p, i.e. 1/6 - 1/3*p + 4/3*p = 1-2p, so 3p = 5/6 and p = 5/18 π = [4/9, 5/18, 5/18]
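Equivalently, the steady state is a left eigenvector of P with eigenvalue 1; a NumPy sketch (not from the slides) that confirms the hand calculation.

```python
import numpy as np

P = np.array([[1/6, 5/12, 5/12],
              [2/3, 1/6,  1/6],
              [2/3, 1/6,  1/6]])

# The steady state satisfies pi P = pi, i.e. pi is a left eigenvector of P
# with eigenvalue 1 (equivalently, an eigenvector of P transposed).
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                  # normalise so the entries sum to 1
print(pi)                           # [4/9, 5/18, 5/18]
```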
Example Which sites have low / high PageRank? [Figure: web graph over pages D0–D6]
Example (α = 0.14) π = [0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31], i.e. D0 = 0.05, D1 = 0.04, D2 = 0.11, D3 = 0.25, D4 = 0.21, D5 = 0.04, D6 = 0.31
Web Search Ranking documents for a query Vector similarity: Cosine(Q, D) Terms from document and anchor text Terms normalised using tf*idf PageRank Independent of query: property of the graph Measure of reliability: collaborative trust Has nothing to do with how often real users click on links; the random surfer was only used to calculate a property of the graph
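The slides do not say how the two signals are combined; as a purely illustrative assumption, one could blend them with a weighted sum (the weight and the scores below are made up).

```python
def combined_score(cosine_sim, pagerank, weight=0.7):
    """Blend query-dependent similarity with query-independent PageRank.
    The linear combination and the weight are illustrative assumptions,
    not how any particular search engine actually combines the signals."""
    return weight * cosine_sim + (1 - weight) * pagerank

# Hypothetical (cosine similarity, PageRank) scores for three documents
candidates = {"doc1": (0.80, 0.05), "doc2": (0.60, 0.40), "doc3": (0.75, 0.10)}
ranking = sorted(candidates, key=lambda d: combined_score(*candidates[d]), reverse=True)
print(ranking)
```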
Properties of Page Rank New pages have to acquire Page Rank Either convince lots of sites to link to you Or convince a few high-pagerank sites Page Rank can change very fast One link on Yahoo or the BBC is enough Spamming PageRank costs money Need to create huge number of sites Google never sells PageRank
Top PageRank sites google.com adobe.com w3.org jigsaw.w3.org/css-validator cnn.com usa.gov get.adobe.com/flashplayer get.adobe.com/reader india.gov.in
Personalised PageRank Why personalise? Tech sites tend to have many back links and high PageRank Problem if you are not interested in IT: try searching for Apache, Snow Leopard, Java PageRank reflects the interests of the web-creating majority What if you are in the minority?
Personalised PageRank Keep track of a user's favorite websites Increase the PageRank of these sites During the iterative process, this PageRank will spread to sites that are linked PageRank will now reflect the user's interests If you give wwf.org a large PageRank, this will spread to other wildlife sites You might then see real snow leopards when you search? BUT?
Personalised PageRank PageRank vectors are very big and time consuming to compute, even once. You don't want to compute it for each user, or continuously update it as their browsing behaviour changes Too computationally intensive
Personalised PageRank Compromise Personalise by subject, not user Create a PageRank vector for each subject (Sports, Politics, etc) How?
Topic-specific PageRank Random surfer: follow a link OR teleport Teleport only to sites relevant to the topic? Use a directory of sports pages from Yahoo or dmoz We can then build π_sports, π_politics, etc.
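A sketch of the topic-specific variant: the teleport step is restricted to pages listed under the topic in a directory, so the random surfer keeps being pulled back to, e.g., sports pages. The example graph and topic set are hypothetical.

```python
import numpy as np

def topic_transition_matrix(adj, alpha, topic_pages):
    """Like the ordinary PageRank transition matrix, but all teleports land on topic pages."""
    adj = np.asarray(adj, dtype=float)
    N = adj.shape[0]
    teleport = np.zeros(N)
    teleport[list(topic_pages)] = 1.0 / len(topic_pages)   # uniform over the topic set only
    P = np.empty_like(adj)
    for i in range(N):
        out = adj[i].sum()
        if out == 0:
            P[i] = teleport                                 # dead end: teleport with probability 1
        else:
            P[i] = (1 - alpha) * adj[i] / out + alpha * teleport
    return P

# Hypothetical 3-page web; suppose only page 2 is listed under "Sports" in a directory
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
P_sports = topic_transition_matrix(adj, alpha=0.1, topic_pages=[2])
print(P_sports)   # iterate x <- x P_sports as before to obtain pi_sports
```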
User Modelling We can then model a user as a linear combination of Topics For example if we say a user's interests are 60% Sports 40% Politics Can we compute a PageRank for this?
User Modelling We don't need to recompute PageRank If each webpage has a Politics PageRank and a Sports PageRank precomputed, we can just use a linear combination of PageRanks for a user with mixed interests: π(0.6 sports + 0.4 politics) = 0.6 π_sports + 0.4 π_politics Topic PageRanks are calculated offline by the server The user profile is maintained client-side (0.4, 0.6, ...) An efficient method that can be used at runtime
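A sketch of the runtime combination: topic PageRank vectors are assumed to be precomputed offline, and the client-side profile weights are applied at query time. The vectors and weights below are made up for illustration.

```python
import numpy as np

# Precomputed offline, one PageRank vector per topic (values here are made up)
topic_pageranks = {
    "sports":   np.array([0.10, 0.30, 0.25, 0.35]),
    "politics": np.array([0.40, 0.20, 0.25, 0.15]),
}

# User profile maintained client-side: 60% sports, 40% politics
user_profile = {"sports": 0.6, "politics": 0.4}

# pi_user = 0.6 * pi_sports + 0.4 * pi_politics, computed at query time
pi_user = sum(w * topic_pageranks[topic] for topic, w in user_profile.items())
print(pi_user)
```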