Information Retrieval


Additional reference: Introduction to Information Retrieval, Manning, Raghavan and Schütze, online book: http://nlp.stanford.edu/ir-book/

Why Study Information Retrieval?
Google searches billions of pages, gives back personalised results in under 1 second, and is worth $200,000,000,000; Siri, etc. IR is increasingly blending into IE (Information Extraction): IR uses very fast technology, so it acts as a filter for performing IE at web scale. First, retrieve relevant documents; second, analyse these to find relevant information.

Library Index Card

Organising Documents
Fields associated with a document: Author, Title, Year, Publisher, Number of pages, etc. Subject areas are curated by librarians; creating a classification scheme for all the books in a library is a lot of work. How do you search?

Search
Can you search on more than one field? You could use different card collections, each ordered by a different search field: Field 1: Author; Field 2: Title; Field 3: Subject.

Edge-notched Cards (1896)

Key Notions
Terms: values assigned to fields for each document (e.g. for the fields Author, Title, Subject).
Index terms: terms that have been indexed on.
Query: index terms that can be combined by boolean logic operators: AND, OR, NOT.
Retrieval: finding documents that match the query.

Edge-notched Cards
Query: (ALASKA or GREENLAND) and NATURE.
Put a pin through NATURE, then a pin through ALASKA, and collect the cards that fall out. Remove the ALASKA pin, put a pin through GREENLAND, and collect the cards that fall out.
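The pin procedure above is ordinary boolean retrieval over sets of document IDs. A minimal sketch in Python (the term-to-document assignments are invented for illustration):

```python
# Each index term maps to the set of document IDs carrying that term,
# like one drawer of edge-notched cards. (Example data is invented.)
index = {
    "ALASKA":    {1, 2, 5},
    "GREENLAND": {3, 5},
    "NATURE":    {2, 3, 4},
}

# (ALASKA or GREENLAND) and NATURE
result = (index["ALASKA"] | index["GREENLAND"]) & index["NATURE"]
print(sorted(result))  # [2, 3]
```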

Boolean Search
Very little has changed for Information Retrieval over closed document collections: documents are labeled with terms from a domain-specific ontology, and search with boolean operators is permitted over these terms.

MeSH: Medical Subject Headings
C11 Eye Diseases
  C11.93 Asthenopia
  C11.187 Conjunctival Diseases
    C11.187.169 Conjunctival Neoplasms
    C11.187.183 Conjunctivitis
      C11.187.183.220 Conjunctivitis, Allergic
        C11.187.183.220.889 Trachoma
    C11.187.781 Pterygium
    C11.187.810 Xerophthalmia
...
www.nlm.nih.gov/mesh

ACM Classification for CS
B Hardware
  B.3 Memory Structures
    B.3.1 Semiconductor Memories: Dynamic memory (DRAM), Read-only memory (ROM), Static memory (SRAM)
    B.3.2 Design Styles
    B.3.3 Performance Analysis: Simulation, Worst-case analysis
www.acm.org/class/

Limitations
Manual effort by trained catalogers is required to create the classification scheme and to annotate documents with subject classes (concepts). Users need to be aware of the subject classes. BUT: high-precision search works well for closed collections of documents (libraries, etc.).

The Internet
NOT a closed collection: billions of webpages, and documents change on a daily basis. It is not possible to index or search by manually constructed subject classes. How does indexing work? How does search work?

Simple Indexing Model: Bag-of-Words
Documents and queries are represented as a bag of words: ignore the order of words, ignore morphology/syntax (cat vs cats, etc.), and just count the number of matches between words in the document and the query. This already works rather well!

Vector Space Model Ranks Documents for relevance to query Documents and queries are vectors What do vectors look like? How do you compute relevance?

Term Frequency
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics

     athletes  face  dope  raids  uk  body  olympics  urged  snitch  dopers  at  to  on
D1      1       1     2     1     1    1       0        0      0       0     0   0   0
D2      1       0     0     0     0    0       1        1      1       1     1   1   1
Q       1       0     1     0     0    0       1        0      0       0     0   0   0

Q · D1 = 3 (athletes + 2*dope)
Q · D2 = 2 (athletes + olympics)
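The two dot products above can be checked in a few lines of Python; the tokenisation here is a rough sketch (lowercase, strip trailing punctuation):

```python
from collections import Counter

def bag(text):
    # Bag-of-words: lowercase, strip punctuation, ignore word order.
    return Counter(w.strip(".:").lower() for w in text.split())

d1 = bag("Athletes face dope raids: UK dope body.")
d2 = bag("Athletes urged to snitch on dopers at Olympics.")
q  = bag("Athletes dope Olympics")

def dot(a, b):
    # Dot product of two term-count vectors.
    return sum(a[w] * b[w] for w in a)

print(dot(q, d1))  # 3  (athletes + 2*dope)
print(dot(q, d2))  # 2  (athletes + olympics)
```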

Similarity Metrics
Each cell is the number of times the term occurs in the document or query (a simplification, more later...):

         Doc1   Doc2   Doc3  ...  DocN   Query
Term1    ct11   ct12   ct13  ...  ct1N    q1
Term2    ct21   ct22   ct23  ...  ct2N    q2
...
TermM    ctM1   ctM2   ctM3  ...  ctMN    qM

Similarity Metrics: Dot Product
Sim(Doc_n, Query) = Doc_n · Query = ct_1n * q_1 + ct_2n * q_2 + ... + ct_Mn * q_M = Σ_j ct_jn * q_j
But there can be a large dot product just because documents are very long, so normalise by the lengths: the cosine of the vectors.

Comparison Metrics
Cosine(Q, D) = Q · D / (|Q| |D|)
A number between 0 and 1: the cosine of the angle between the document and query vectors (diagram for M=3).
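Applying the cosine formula to the term-frequency vectors of the athletes/dope example gives ranked rather than raw scores. A minimal sketch (term order: athletes, face, dope, raids, uk, body, olympics, urged, snitch, dopers, at, to, on):

```python
import math

def cosine(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

d1 = [1, 1, 2, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
q  = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

print(round(cosine(q, d1), 3))  # 0.577  (= 3 / (sqrt(3) * 3))
print(round(cosine(q, d2), 3))  # 0.408  (= 2 / (sqrt(3) * sqrt(8)))
```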

Problems?
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics
But both documents are about the London Olympics and about doping (UK, Olympics → London Olympics; dopers → dope; etc.). We are indexing on words, not subject classes (concepts).

Problems?
Dimensions are not independent: Drug and Dope are closer together than Dope and London. Apache could mean the server, the helicopter or the tribe; these should be different dimensions. Therefore, the cosine is not necessarily an accurate reflection of similarity.

Index Terms
What makes a good index term? The term should describe some aspect of the document, but should not be so generic that it also describes all the other documents in the collection. A good index term distinguishes a document from the rest of the collection.

Text Coverage
Coverage with the N most frequent words:
N = 1: 5% (the)
N = 10: 42% (the, and, a, he, but...)
N = 100: 65%
N = 1000: 90%
N = 10000: 99%
The most frequent words are not informative! The least frequent words are typos or too specialised.

Inverse Document Frequency
In a vector model, different words should have different weights. Search for the query: Tom and Jerry. A match on documents containing Tom or Jerry should count for more than a match on and. The more documents a word appears in, the less its use as an index term. Documents are characterised by words which are relatively rare in other docs.

Inverse Document Frequency
idf_i = log( |D| / |{d : t_i ∈ d}| )
Numerator: |D|, the number of documents in the collection. Denominator: the number of documents containing term t_i.

tf*idf
Normalise term frequency by the length of the document (term i, document j):
tf_i,j = n_i,j / Σ_k n_k,j
idf_i = log( |D| / |{d : t_i ∈ d}| )
tf*idf_i,j = tf_i,j * idf_i
tf*idf is high for a term in a document if its frequency in the document is high and its frequency in the rest of the collection is low.
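The formulas above can be sketched directly. With only the two example documents, athletes occurs in every document and so gets idf = 0, while dope is frequent in D1 and absent from D2:

```python
import math

d1 = "athletes face dope raids uk dope body".split()
d2 = "athletes urged to snitch on dopers at olympics".split()
docs = [d1, d2]

def tf(term, doc):
    # Term frequency normalised by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(|D| / number of documents containing the term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tf_idf("dope", d1, docs), 3))      # 0.198: frequent in d1, rare elsewhere
print(round(tf_idf("athletes", d1, docs), 3))  # 0.0: occurs in every document
```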

Cheating
Hidden text, keyword stuffing.

Cheating the System
Indexing is done by algorithm, not by humans, and there is no control over the documents in the collection. Websites try to show up at the top of a search. How do we identify reliable websites?

Linear Algebra Revision
Vectors are one-dimensional matrices: X = [x_0 x_1 x_2 ... x_n]
|X| = length of X = sqrt(x_0^2 + x_1^2 + x_2^2 + ... + x_n^2) = sqrt(Σ_i x_i^2)
Vectors are used to represent coordinates in n-dimensional space.

Scalar Product
Two vectors can be multiplied using the dot product (also called the scalar product) to give a scalar number.
X = [x_0 x_1 x_2 ... x_n], Y = [y_0 y_1 y_2 ... y_n]
X · Y = x_0*y_0 + x_1*y_1 + x_2*y_2 + ... + x_n*y_n = Σ_i x_i*y_i

Scalar Product: Example
X = [1 2 3 4], Y = [5 6 7 8]
X · Y = 1*5 + 2*6 + 3*7 + 4*8 = 70
X · Y = (length of the projection of X on Y) * (length of Y)

Geometric Interpretation
A · A = |A|^2, B · B = |B|^2
A · B = |A| |B| cos(θ)
cos(0°) = 1, cos(90°) = 0: the cosine function is a similarity metric.

Matrix Product
A (m×n) multiplied by B (n×p) gives C (m×p), where C_ij = Row_i(A) · Column_j(B). For example, a 3×1 column vector times a 1×2 row vector:

[2]             [2  10]
[3] * [1 5]  =  [3  15]
[4]             [4  20]

Rows of C are the row vector [1 5] multiplied by a scalar from the column vector; columns of C are the column vector [2 3 4] multiplied by a scalar from the row vector.
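The column-times-row example above can be checked directly, since each entry is just the product of one entry from each vector:

```python
a = [2, 3, 4]   # column vector (3 x 1)
b = [1, 5]      # row vector (1 x 2)

# C_ij = a_i * b_j
c = [[x * y for y in b] for x in a]
print(c)  # [[2, 10], [3, 15], [4, 20]]
```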

Problems with Term Counts
For the term IBM, how do you distinguish:
IBM's home page (mostly graphical; IBM occurs only a few times in the HTML),
IBM's copyright page (IBM occurs over 100 times),
a rival's spam page (arbitrarily large term count for IBM)?

Hyperlinks for Search
The Web as a graph: anchor text pointing to page B provides a description of B, and a hyperlink from page A to B is a recommendation or endorsement of B. Ignore internal links?
(Example anchor texts pointing to IBM.com: IBM computers, IBM Corporation, International Business Machines.)

Links as Recommendations
PageRank (Brin and Page, 1998): a link from A to B is a recommendation of B. Think of science: highly cited papers are considered of higher quality, and backlinks are like citations. But webpages aren't reviewed, so how do we know the citer A is reliable? By counting links to A, of course!

PageRank
Consider a random surfer who clicks on links at random. (Diagram: a web graph in which each edge is labelled with the probability of following that link, e.g. a page with three outgoing links gives each probability 1/3.)

PageRank
If you continue this random walk, you will visit some pages more frequently than others: pages with lots of links from other pages with lots of links. PageRank: pages visited more often in a random walk are more important (reliable).

Teleporting
What if the random surfer reaches a page with no hyperlinks? Teleport: the surfer jumps from a page to any other page in the web graph at random. If there are N pages in the web graph, teleporting takes the surfer to each node with probability 1/N. Use the teleport operation with probability 1 if the node has no outgoing links, and otherwise with some probability 0 < α < 1.

Need for Teleporting To avoid loops where you are forced to keep visiting the same sites in the random walk

Steady State
Given this model of a random surfer, the surfer spends a fixed fraction of the time at each page, which depends on the hyperlink structure of the web and the value of α (usually 0.1). The PageRank of a page is the fraction of the time spent at that page.

PageRank Computation
Represent the Web as an adjacency matrix: Adj(i,j) = 1 iff there is a link from i to j, and 0 otherwise. For a graph where A links to B and C, and B and C each link back to A:

       A  B  C
A      0  1  1
B      1  0  0
C      1  0  0

Transition Probabilities
Divide each 1 in Adj by the number of 1s in its row (the probability of clicking on a link to that page):

       A    B    C
A      0    1/2  1/2
B      1    0    0
C      1    0    0

Transition Probabilities with Teleporting
Let the teleport probability be α = 1/2, with N = 3:
1) Multiply each cell by 1/2 (1 - α, the probability of not teleporting).
2) Add 1/6 (α/N, the probability of teleporting to that page) to every cell.

P =  1/6             1/4 + 1/6 = 5/12   1/4 + 1/6 = 5/12
     1/2 + 1/6 = 2/3 1/6                1/6
     1/2 + 1/6 = 2/3 1/6                1/6

Starting State
Imagine the surfer starts at page B: at the beginning, x_0 = [0, 1, 0]. The vectors x_n show the proportion of time spent on pages A, B, C at time n.
At step one, x_1 = x_0 P = [2/3, 1/6, 1/6]
(e.g. the B entry is 0*5/12 + 1*1/6 + 0*1/6 = 1/6).

Iteration 2
At step one, x_1 = [2/3, 1/6, 1/6].
At step two, x_2 = x_1 P = [2/18 + 2/18 + 2/18, 10/36 + 1/36 + 1/36, 10/36 + 1/36 + 1/36] = [1/3, 1/3, 1/3]
(e.g. the B entry is 2/3 * 5/12 + 1/6 * 1/6 + 1/6 * 1/6 = 1/3).

Iterating...
       A      B      C
x_0    0      1      0
x_1    2/3    1/6    1/6
x_2    1/3    1/3    1/3
x_3    1/2    1/4    1/4
x_4    5/12   7/24   7/24
...    ...    ...    ...
x_∞    4/9    5/18   5/18
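The whole computation, building the transition matrix with teleporting and then running power iteration, fits in a few lines. A sketch for the 3-page example with α = 1/2:

```python
# Power iteration for the 3-page example (A links to B and C;
# B and C link back to A), with teleport probability alpha = 1/2.
N = 3
alpha = 0.5
adj = [[0, 1, 1],   # A -> B, A -> C
       [1, 0, 0],   # B -> A
       [1, 0, 0]]   # C -> A

# Row-normalise, scale by (1 - alpha), and add alpha/N teleport mass.
P = [[(1 - alpha) * a / sum(row) + alpha / N for a in row] for row in adj]

x = [0.0, 1.0, 0.0]   # start the surfer at page B
for _ in range(50):
    x = [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]

print([round(v, 3) for v in x])  # [0.444, 0.278, 0.278], i.e. [4/9, 5/18, 5/18]
```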

Example
Which sites have low / high PageRank? (Diagram: a web graph over pages D0–D6.)

Example (α = 0.14)
x = [0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31]
D0 = 0.05, D1 = 0.04, D2 = 0.11, D3 = 0.25, D4 = 0.21, D5 = 0.04, D6 = 0.31

Properties of PageRank
New pages have to acquire PageRank: either convince lots of sites to link to you, or convince a few high-PageRank sites. PageRank can change very fast: one link on Yahoo or the BBC is enough. Spamming PageRank costs money: you need to create a huge number of sites. Google never sells PageRank.

Web Search in a Nutshell
Ranking documents for a query: vector similarity Cosine(Q, D), with terms taken from the document and anchor text, normalised using tf*idf. PageRank is independent of the query: a property of the graph, and a measure of reliability (collaborative trust). It has nothing to do with how often real users click on links; the random surfer was only used to calculate a property of the graph.

Topics Not Covered...
Personalisation of search: increasingly, IR takes into account your search history to personalise results to your needs and interests. IR also takes usage data into account to identify what links others clicked on for similar search queries, etc.

Social IR
Performing IR on social networks: searching Twitter, etc. Using social networks for IR: adding collaborative aspects to web search.

Social Model of IR (Diagram by Sebastian Marius Kirsch)

Features of Social IR
Individuals appear in two roles: information producers and information consumers. Queries and documents are essentially interchangeable: queries and/or documents may be used to model an information need or an area of expertise. Most systems will use only some of the relations in the model, but for a social IR system, modelling relations between individuals is mandatory.

Information Spaces Graph of Users Graph of Documents

Information Spaces
A user follows / is followed by others on Twitter, has friends on Facebook, etc. A user writes or views documents.

Social Graph Algorithms
PageRank can be used to judge reliability in the same manner: a tweeter is reliable if retweeted by other reliable tweeters.

How Google Ranks Tweets
Tweets: 140-character microblog posts sent out by Twitter members. The key is to identify "reputed followers". Twitterers "follow" the comments of other Twitterers they've selected, and are themselves "followed". If lots of people follow you, and then you follow someone, then even though this [new person] does not have lots of followers, his tweet is deemed valuable. One user following another in social media is analogous to one page linking to another on the Web: both are a form of recommendation...

Social Graph Algorithms
PageRank for identifying authorities is not a new idea: it has been used to identify the most influential scientists based on citation networks.

PageRank for Social IR
Calculate a PageRank Π_i for each user i, based on who is following whom, and a PageRank Π_j for each document j, based on which document links to which. If user i wrote document j, then the reliability of j is some combination of Π_i and Π_j.
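The slide leaves the exact combination open. One plausible choice, purely as an illustration, is a weighted average of the two scores (the function name and the weight lam are assumptions, not from the lecture):

```python
# Hypothetical combination of user PageRank (pi_user) and document
# PageRank (pi_doc). The weighted-average scheme is an illustrative
# assumption; the lecture does not specify a formula.
def reliability(pi_user, pi_doc, lam=0.5):
    return lam * pi_user + (1 - lam) * pi_doc

print(round(reliability(0.31, 0.11), 2))  # 0.21
```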

Collaborative Filtering for IR
In addition to reliability, you can filter search results using friend networks (5.7 degrees of separation on Facebook): rerank search results to recommend documents viewed by friends, or by people you follow, etc.

Facebook Graph Search
"Restaurants liked by my Italian friends in Aberdeen": filter friends by country (Italy) and location (Aberdeen), and only use ratings by these friends.
"Which restaurants are liked by the locals?": if in Sofia, find restaurants in Sofia and only use ratings by Facebook users living in Sofia.
"Pictures of Jane": look for photos of people called Jane, starting with my friends, Janes who went to my school or my university, friends of my friends, etc.; filter out photos I don't have permission to view.