Information Retrieval Additional Reference Introduction to Information Retrieval, Manning, Raghavan and Schütze, online book: http://nlp.stanford.edu/ir-book/
Why Study Information Retrieval?
Google searches billions of pages and gives back personalised results in < 1 second; worth $200,000,000,000. Siri, etc.
IR is increasingly blending into IE. IR uses very fast technology, so it is a filter for performing IE at web scale:
First, retrieve relevant documents.
Second, analyse these to find relevant information.
Library Index Card
Organising documents Fields associated with a document Author, Title, Year, Publisher, Number of pages, etc. Subject Areas Curated by Librarians Creating a classification scheme for all the books in a library is a lot of work How do you search?
Search Can you search on more than one field? You could use different card collections ordered by each search field Field1: Author Field2: Title Field3: Subject
Edge-notched Cards (1896)
Key Notions Terms Values assigned to fields for each document E.g. Fields, Author, Title, Subject Index Terms Terms that have been indexed on Query Index Terms that can be combined by boolean logic operators: AND, OR, NOT Retrieval Finding documents that match query
Edge-notched Cards (ALASKA or GREENLAND) and NATURE Put pin through NATURE Put pin through ALASKA Collect the cards that fall out Remove ALASKA Pin Put pin through GREENLAND Collect cards that fall out
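The pin-and-card procedure above is exactly boolean retrieval over sets. A minimal sketch in Python (the toy collection and document names are invented for illustration):

```python
# Each index term maps to the set of documents labeled with it
# (hypothetical toy collection).
index = {
    "ALASKA":    {"doc1", "doc2"},
    "GREENLAND": {"doc3"},
    "NATURE":    {"doc1", "doc3", "doc4"},
}

# (ALASKA OR GREENLAND) AND NATURE:
# union for OR, intersection for AND.
result = (index["ALASKA"] | index["GREENLAND"]) & index["NATURE"]
print(sorted(result))  # ['doc1', 'doc3']
```

Each pin drop corresponds to one set operation: OR collects the union of the cards that fall, AND keeps only cards that fall for both pins.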
Boolean Search Very little has changed for Information Retrieval over closed document collections Documents labeled with terms from a domain-specific ontology Search with boolean operators permitted over these terms
MESH: Medical Subject Headings
C11 Eye Diseases
  C11.93 Asthenopia
  C11.187 Conjunctival Diseases
    C11.187.169 Conjunctival Neoplasms
    C11.187.183 Conjunctivitis
      C11.187.183.220 Conjunctivitis, Allergic
        C11.187.183.220.889 Trachoma
  C11.187.781 Pterygium
  C11.187.810 Xerophthalmia ...
www.nlm.nih.gov/mesh
ACM Classification for CS B Hardware B.3 Memory structures B.3.1 Semiconductor Memories Dynamic memory (DRAM) Read-only memory (ROM) Static memory (SRAM) B.3.2 Design Styles B.3.3 Performance Analysis Simulation Worst-case analysis www.acm.org/class/
Limitations Manual effort by trained catalogers: required to create the classification scheme and to annotate documents with subject classes (concepts) Users need to be aware of the subject classes BUT high-precision search works well for closed collections of documents (libraries, etc.)
The Internet NOT a closed collection Billions of webpages Documents change on daily basis Not possible to index or search by manually constructed subject classes How does Indexing work? How does Search work?
Simple Indexing Model Bag-of-Words Documents and queries are represented as a bag of words Ignore order of words Ignore morphology/syntax (cat vs cats etc) Just count the number of matches between words in document and query This already works rather well!
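A bag of words is just a multiset of the document's words. A minimal sketch in Python (the punctuation stripping is a simplifying assumption, not part of the model):

```python
from collections import Counter

# A bag-of-words representation: word order is discarded,
# only word counts remain.
doc = "Athletes face dope raids: UK dope body."
bag = Counter(w.strip(".:,").lower() for w in doc.split())

print(bag["dope"])      # 2
print(bag["athletes"])  # 1
```

Note that "dope" and "dopers" would be different keys: morphology is ignored, exactly as the slide says.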
Vector Space Model Ranks Documents for relevance to query Documents and queries are vectors What do vectors look like? How do you compute relevance?
Term Frequency
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics

     Athletes face dope raids UK body Olympics urged snitch dopers at to on
D1      1      1    2    1    1   1      0       0     0      0    0  0  0
D2      1      0    0    0    0   0      1       1     1      1    1  1  1
Q       1      0    1    0    0   0      1       0     0      0    0  0  0

Q . D1 = 3 (Athletes + 2*dope)
Q . D2 = 2 (Athletes + Olympics)
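The table above can be reproduced with term-count vectors and a dot product. A sketch in Python (the regex tokenizer is an illustrative choice):

```python
import re
from collections import Counter

def tf_vector(text):
    """Term-frequency vector as a word -> count mapping."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def dot(doc, query):
    """Dot product of two sparse count vectors."""
    return sum(doc[w] * c for w, c in query.items())

d1 = tf_vector("Athletes face dope raids: UK dope body.")
d2 = tf_vector("Athletes urged to snitch on dopers at Olympics.")
q  = tf_vector("Athletes dope Olympics")

print(dot(d1, q))  # 3  (athletes + 2*dope)
print(dot(d2, q))  # 2  (athletes + olympics)
```

"dopers" in D2 does not match "dope" in the query, so D2 scores only on "athletes" and "olympics".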
Similarity Metrics
Each cell ct_i,j is the number of times term i occurs in document j or the query (a simplification, more later...)

        Doc1    Doc2    Doc3   ...  DocN    Query
Term1   ct_1,1  ct_1,2  ct_1,3 ...  ct_1,N  q_1
Term2   ct_2,1  ct_2,2  ct_2,3 ...  ct_2,N  q_2
...
TermM   ct_M,1  ct_M,2  ct_M,3 ...  ct_M,N  q_M
Similarity Metrics
Dot Product:
Sim(Doc_n, Query) = Doc_n . Query = ct_1,n q_1 + ct_2,n q_2 + ... + ct_M,n q_M = Σ_j ct_j,n q_j
But the dot product can be large just because a document is very long, so normalise by the vector lengths: the cosine of the vectors.
Comparison Metrics
Cosine(Q, D) = Q . D / (|Q| |D|)
A number between 0 and 1: the cosine of the angle between the document and query vectors (diagram for M = 3)
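Applying the cosine to the athletes/dope example shows the length normalisation at work. A sketch in Python (lower-cased, punctuation-free tokens are assumed for simplicity):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

d1 = Counter("athletes face dope raids uk dope body".split())
d2 = Counter("athletes urged to snitch on dopers at olympics".split())
q  = Counter("athletes dope olympics".split())

print(cosine(d1, q))  # ~0.577: D1 ranks above D2
print(cosine(d2, q))  # ~0.408
```

D1 still ranks first, but the scores are now comparable across documents of different lengths.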
Problems? D1) Athletes face dope raids: UK dope body. D2) Athletes urged to snitch on dopers at Olympics. Q) Athletes dope Olympics But both documents are about the London Olympics and about doping UK, Olympics London Olympics, dopers dope, etc. Indexing on words, not subject classes (concepts)
Problems? Dimensions are not independent Drug and Dope are closer together than Dope and London Apache could mean the server, the helicopter or the tribe. These should be different dimensions Therefore, the cosine is not necessarily an accurate reflection of similarity
Index terms What makes a good index term? The term should describe some aspect of the document The term should not be generic enough that it also describes all the other documents in the collection A good index term distinguishes a document from the rest of the collection
Text Coverage
Coverage with the N most frequent words:
N = 1: 5% (the)
N = 10: 42% (the, and, a, he, but, ...)
N = 100: 65%
N = 1000: 90%
N = 10000: 99%
The most frequent words are not informative! The least frequent words are typos or too specialised.
Inverse Document Frequency In a vector model, different words should have different weights Search for query: Tom and Jerry A match on documents with Tom or Jerry should count for more than and The more documents a word appears in, the less its use as an index term Documents are characterised by words which are relatively rare in other docs
Inverse Document Frequency
idf_i = log( |D| / |{d : t_i ∈ d}| )
Numerator |D| = number of documents in the collection
Denominator = number of documents containing term t_i
tf*idf
Normalise term frequency by the length of the document (term i, document j):
tf_i,j = n_i,j / Σ_k n_k,j
idf_i = log( |D| / |{d : t_i ∈ d}| )
tf*idf_i,j = tf_i,j * idf_i
tf*idf is high for a term in a document if: its frequency in the document is high and its frequency in the rest of the collection is low
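The two formulas combine into a short function. A sketch in Python over a toy three-document collection (the documents are invented; a whitespace tokenizer is assumed):

```python
import math
from collections import Counter

def tf_idf(docs):
    """tf*idf weight for every term in every document of a toy collection."""
    N = len(docs)
    counts = [Counter(d.lower().split()) for d in docs]
    df = Counter()                      # document frequency of each term
    for c in counts:
        df.update(set(c))
    weights = []
    for c in counts:
        total = sum(c.values())         # document length (in tokens)
        weights.append({t: (n / total) * math.log(N / df[t])
                        for t, n in c.items()})
    return weights

w = tf_idf(["tom and jerry", "tom and huck", "and then"])
print(w[0]["and"])    # 0.0 -- "and" occurs in every document
print(w[0]["jerry"])  # highest weight in doc 0: rare in the collection
```

As the "Tom and Jerry" query example predicts, "and" gets weight zero because it appears in every document, while "jerry" gets the highest weight in its document.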
Cheating Hidden text Keyword Stuffing
Cheating the system Indexing done by algorithm, not humans No control over documents in collection Websites try to show up at the top of a search How to identify reliable websites?
Linear Algebra Revision
Vectors are one-dimensional matrices: X = [x_0 x_1 x_2 ... x_n]
|X| = length of X = sqrt(x_0^2 + x_1^2 + x_2^2 + ... + x_n^2) = sqrt(Σ_i x_i^2)
Vectors are used to represent coordinates in n-dimensional space
Scalar Multiplication
Two vectors can be multiplied using the dot product (also called scalar product) to give a scalar number.
X = [x_0 x_1 x_2 ... x_n], Y = [y_0 y_1 y_2 ... y_n]
X . Y = x_0 y_0 + x_1 y_1 + x_2 y_2 + ... + x_n y_n = Σ_i x_i y_i
Scalar Multiplication
Example: X = [1 2 3 4], Y = [5 6 7 8]
X . Y = 1*5 + 2*6 + 3*7 + 4*8 = 70
X . Y = (length of the projection of X on Y) * |Y|
Geometric Interpretation
A . A = |A|^2, B . B = |B|^2
A . B = |A| |B| cos(θ)
cos(0) = 1, cos(90°) = 0
The cosine function is a similarity metric
[Diagram: vectors A and B separated by angle θ]
Matrix Product
A (m×n) . B (n×p) = C (m×p)
Example (outer product of a column vector and a row vector):
[2]            [2  10]
[3] . [1 5] =  [3  15]
[4]            [4  20]
C_ij = (Row i of A) . (Column j of B)
Rows of C are the row of B multiplied by a scalar from A; columns of C are the column of A multiplied by a scalar from B
Problems with Term Counts For the term IBM, how do you distinguish IBM's home page (mostly graphical; IBM occurs only a few times in the html) IBM's copyright page (IBM occurs over 100 times) A Rival's spam page (Arbitrarily large term count for IBM)
Hyperlinks for search
Web as a graph
Anchor text pointing to page B provides a description of B
A hyperlink from page A to B is a recommendation or endorsement of B
Ignore internal links?
Example anchor texts pointing to IBM.com: "IBM computers", "IBM Corporation", "International Business Machines"
Links as recommendations PageRank (Brin and Page, 1998) A link from A to B is a recommendation of B Think of science Highly cited papers are considered of higher quality Backlinks are like citations But webpages aren't reviewed, so how do we know the citer A is reliable? By counting links to A of course!
PageRank
Consider a random surfer who clicks on links at random.
[Diagram: web graph of pages A–F; each outgoing link is followed with equal probability, e.g. 1/3 each for a page with three outlinks, 1/2 each for two, 1/1 for one]
PageRank If you continue this random walk You will visit some pages more frequently than others These are pages with lots of links from other pages with lots of links PageRank: Pages visited more often in a random walk are more important (reliable)
Teleporting
What if the random surfer reaches a page with no hyperlinks?
Teleport: the surfer jumps from a page to any other page in the web graph at random
If there are N pages in the web graph, teleporting takes the surfer to each node with probability 1/N
Use the teleport operation:
with probability α = 1 if the node has no outgoing links
otherwise with some probability 0 < α < 1
Need for Teleporting To avoid loops where you are forced to keep visiting the same sites in the random walk
Steady State Given this model of a random surfer The surfer spends a fixed fraction of the time at each page that depends on The hyperlink structure of the web The value of α (usually 0.1) PageRank of a page: the fraction of the time spent at that page
PageRank Computation
Represent the Web as an adjacency matrix:
Adj(i,j) = 1 iff there is a link from i to j
Adj(i,j) = 0 iff there is no link from i to j
Example graph: A links to B and C; B and C each link back to A.

        A  B  C
Adj = A 0  1  1
      B 1  0  0
      C 1  0  0
Transition Probabilities
Divide each 1 in Adj by the number of 1s in its row (the probability of clicking on a link to that page):

 0    1/2  1/2
1/1    0    0
1/1    0    0
Transition Probabilities
Let's consider teleport probability α = 1/2, N = 3:
1) Multiply each cell by 1/2 (1 − α, the probability of not teleporting)
2) Add 1/6 (α/N, the probability of teleporting to that page) to every cell

    1/6           1/4+1/6 = 5/12  1/4+1/6 = 5/12
P = 1/2+1/6 = 2/3 1/6             1/6
    1/2+1/6 = 2/3 1/6             1/6
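The two steps above can be sketched directly. A short Python version using exact fractions (it assumes every page has at least one outlink; dangling pages would need the α = 1 teleport rule instead):

```python
from fractions import Fraction

alpha, N = Fraction(1, 2), 3
Adj = [[0, 1, 1],          # A links to B and C
       [1, 0, 0],          # B links to A
       [1, 0, 0]]          # C links to A

P = []
for row in Adj:
    outlinks = sum(row)    # assumes at least one outlink per page
    # row-normalise to link-following probabilities, then mix in teleporting
    P.append([(1 - alpha) * Fraction(cell, outlinks) + alpha / N
              for cell in row])

print(P[0])  # [1/6, 5/12, 5/12]
print(P[1])  # [2/3, 1/6, 1/6]
```

The output matches the matrix P on the slide cell for cell.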
Starting State
Imagine the surfer starts at page B
At the beginning, x_0 = [0, 1, 0]
The vector x_n shows the proportion of time spent on pages A, B, C at time n
At step one:
                      1/6  5/12  5/12
x_1 = x_0 P = [0,1,0] 2/3  1/6   1/6   = [2/3, 1/6, 1/6]
                      2/3  1/6   1/6
e.g. the B component: 0*5/12 + 1*1/6 + 0*1/6 = 1/6
Iteration 2
At step one, x_1 = [2/3, 1/6, 1/6]
At step 2:
                            1/6  5/12  5/12
x_2 = x_1 P = [2/3,1/6,1/6] 2/3  1/6   1/6
                            2/3  1/6   1/6
    = [2/18+2/18+2/18, 10/36+1/36+1/36, 10/36+1/36+1/36] = [1/3, 1/3, 1/3]
e.g. the B component: 2/3*5/12 + 1/6*1/6 + 1/6*1/6 = 1/3
Iterating...

       A     B     C
x_0    0     1     0
x_1   2/3   1/6   1/6
x_2   1/3   1/3   1/3
x_3   1/2   1/4   1/4
x_4   5/12  7/24  7/24
...   ...   ...   ...
x_∞   4/9   5/18  5/18
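The iteration on the slides is just repeated vector-matrix multiplication (power iteration). A sketch in Python reproducing the table and its limit:

```python
# Transition matrix from the slides (alpha = 1/2, N = 3)
P = [[1/6, 5/12, 5/12],
     [2/3, 1/6,  1/6],
     [2/3, 1/6,  1/6]]

x = [0.0, 1.0, 0.0]   # x_0: surfer starts at page B
for _ in range(50):
    # x_{n+1} = x_n P
    x = [sum(x[i] * P[i][j] for i in range(3)) for j in range(3)]

print(x)  # converges to the steady state [4/9, 5/18, 5/18]
```

Fifty iterations are far more than needed here: with α = 1/2 the error shrinks by at least half per step, so convergence is fast.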
Example
Which sites have low / high PageRank?
[Diagram: web graph of pages D0–D6]
Example (α = 0.14)
π = [0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31]
D0 = 0.05, D1 = 0.04, D2 = 0.11, D3 = 0.25, D4 = 0.21, D5 = 0.04, D6 = 0.31
Properties of PageRank New pages have to acquire PageRank Either convince lots of sites to link to you Or convince a few high-PageRank sites PageRank can change very fast One link on Yahoo or the BBC is enough Spamming PageRank costs money Need to create a huge number of sites Google never sells PageRank
Web Search in a nutshell Ranking documents for a query Vector similarity: Cosine(Q, D) Terms from document and anchor text Terms normalised using tf*idf PageRank Independent of query: a property of the graph Measure of reliability: collaborative trust Has nothing to do with how often real users click on links: the random surfer was only used to calculate a property of the graph
Topics not covered... Personalisation of search Increasingly IR takes into account your search history to personalise IR to your needs and interests. IR also takes into account usage data to identify: What links others clicked on for similar search queries, etc.
Social IR Performing IR on Social Networks searching twitter, etc Using Social Networks for IR adding collaborative aspects to web search
Social Model of IR (Diagram by Sebastian Marius Kirsch)
Features of Social IR Individuals appear in two roles: information producers and information consumers Queries and documents are essentially interchangeable Queries and/or documents may be used to model an information need or an area of expertise Most systems will use only some of the relations in the model For a social IR system, modelling relations between individuals is mandatory
Information Spaces Graph of Users Graph of Documents
Information Spaces Users follows/is followed by others on twitter, friends on facebook etc. User writes or views Documents
Social Graph Algorithms PageRank Can be used to judge reliability in same manner Tweeter is reliable if retweeted by other reliable tweeters
How Google ranks tweets Tweets: 140-character microblog posts sent out by Twitter members The key is to identify "reputed followers," Twitterers "follow" the comments of other Twitterers they've selected, and are themselves "followed." If lots of people follow you, and then you follow someone-- then even though this [new person] does not have lots of followers, his tweet is deemed valuable One user following another in social media is analogous to one page linking to another on the Web. Both are a form of recommendation...
Social Graph Algorithms PageRank for identifying authorities Not a new idea Has been used to identify most influential scientists based on citation networks.
Pagerank for Social IR
Calculate a PageRank π_i for each user i, based on who is following whom
Calculate a PageRank π_j for each document j, based on which document links to which
If user i wrote document j, then the reliability of j is some combination of π_i and π_j
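The slide leaves the combination unspecified. One minimal sketch, assuming a simple linear mix (the 0.5 weight and the linear form are illustrative assumptions, not from the slides):

```python
def social_score(pi_user, pi_doc, weight=0.5):
    """Hypothetical reliability of a document: a weighted mix of the
    author's user-graph PageRank (pi_user) and the document's own
    link-graph PageRank (pi_doc). The linear form is an assumption."""
    return weight * pi_user + (1 - weight) * pi_doc

print(social_score(0.4, 0.2))  # 0.3
```

Other combinations (products, learned weights, rank fusion) would fit the slide's description equally well.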
Collaborative Filtering for IR In addition to reliability, you can filter search results using friend networks 5.7 degrees of separation on Facebook Rerank search results to recommend documents viewed by friends, or people you follow, etc.
Facebook Graph Search Restaurants liked by my Italian friends in Aberdeen Filter friends by country (Italy) and Location (Aberdeen) Only use ratings by these friends Which restaurants are liked by the locals? If in Sofia, Find Restaurants in Sofia Only use ratings by Facebook users living in Sofia Pictures of Jane Look for photos of people called Jane, starting with my friends, Janes who went to my school, my university, are friends of my friends, etc. Filter out photos I don't have permission to view