Information Retrieval

1 Information Retrieval
Additional reference: Introduction to Information Retrieval, Manning, Raghavan and Schütze (online book).

2 Why Study Information Retrieval?
Google searches billions of pages, gives back personalised results in under 1 second, and is worth $200,000,000,000.
Siri, etc.
IR is increasingly blending into information extraction (IE).
IR uses very fast technology, so it serves as a filter for performing IE at web scale:
First, retrieve relevant documents.
Second, analyse these to find relevant information.

3 Library Index Card

4 Library Index Card

5 Organising documents
Fields associated with a document: Author, Title, Year, Publisher, Number of pages, etc.
Subject areas are curated by librarians.
Creating a classification scheme for all the books in a library is a lot of work.
How do you search?

7 Search
Can you search on more than one field?
You could use different card collections, each ordered by one search field:
Field 1: Author
Field 2: Title
Field 3: Subject

8 Edge-notched Cards (1896)

9 Key Notions
Terms: values assigned to fields for each document (e.g. for the fields Author, Title, Subject).
Index terms: terms that have been indexed on.
Query: index terms that can be combined by Boolean logic operators: AND, OR, NOT.
Retrieval: finding documents that match the query.

10 Edge-notched Cards
Query: (ALASKA or GREENLAND) and NATURE
Put a pin through NATURE.
Put a pin through ALASKA.
Collect the cards that fall out.
Remove the ALASKA pin.
Put a pin through GREENLAND.
Collect the cards that fall out.
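The pin procedure amounts to set operations over index terms. A minimal sketch in Python, assuming a small made-up card collection (the card ids and the HISTORY and ICELAND terms are hypothetical, not from the slides):

```python
# Hypothetical card collection: card id -> set of index terms notched on it.
cards = {
    1: {"ALASKA", "NATURE"},
    2: {"GREENLAND", "NATURE"},
    3: {"ALASKA", "HISTORY"},
    4: {"NATURE", "ICELAND"},
}

def having(term, pool):
    """Return the ids in pool whose card carries the term (the cards that 'fall out')."""
    return {cid for cid in pool if term in cards[cid]}

# (ALASKA or GREENLAND) and NATURE
nature = having("NATURE", set(cards))                             # AND: restrict to NATURE cards
result = having("ALASKA", nature) | having("GREENLAND", nature)   # OR: union of both passes
print(sorted(result))  # prints [1, 2]
```

AND corresponds to restricting the pool of cards, OR to taking the union of two passes over it.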

11 Boolean Search
Very little has changed for information retrieval over closed document collections.
Documents are labelled with terms from a domain-specific ontology.
Search with Boolean operators is permitted over these terms.

12 MeSH: Medical Subject Headings
C11 Eye Diseases
C11.93 Asthenopia
C Conjunctival Diseases
C Conjunctival Neoplasms
C Conjunctivitis
C Conjunctivitis, Allergic
C Trachoma
C Pterygium
C Xerophthalmia
...

13 ACM Classification for CS
B Hardware
B.3 Memory structures
B.3.1 Semiconductor Memories: Dynamic memory (DRAM), Read-only memory (ROM), Static memory (SRAM)
B.3.2 Design Styles
B.3.3 Performance Analysis: Simulation, Worst-case analysis

14 Limitations
Manual effort by trained cataloguers is required to create the classification scheme and to annotate documents with subject classes (concepts).
Users need to be aware of the subject classes.
BUT high-precision searches work well for closed collections of documents (libraries, etc.).

15 The Internet
NOT a closed collection.
Billions of webpages.
Documents change on a daily basis.
It is not possible to index or search by manually constructed subject classes.
How does indexing work? How does search work?

16 Simple Indexing Model: Bag-of-Words
Documents and queries are represented as a bag of words.
Ignore the order of words.
Ignore morphology and syntax (cat vs cats, etc.).
Just count the number of matches between words in the document and the query.
This already works rather well!

17 Vector Space Model
Ranks documents by relevance to the query.
Documents and queries are vectors.
What do the vectors look like? How do you compute relevance?

18 Term Frequency
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics
Term       D1   D2   Q
Athletes    1    1   1
face        1    0   0
dope        2    0   1
raids       1    0   0
UK          1    0   0
body        1    0   0
Olympics    0    1   1
urged       0    1   0
snitch      0    1   0
dopers      0    1   0
at          0    1   0
to          0    1   0
on          0    1   0
Q · D1 = 3 (Athletes + 2 × dope)
Q · D2 = 2 (Athletes + Olympics)
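The two dot products on this slide can be checked with a short bag-of-words sketch. The tokeniser here (lower-casing and keeping runs of letters) is an assumption made for illustration; the slides do not specify one.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring case, punctuation and word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def dot(a, b):
    """Dot product of two sparse count vectors (a missing term counts as 0)."""
    return sum(count * b[term] for term, count in a.items())

d1 = bag_of_words("Athletes face dope raids: UK dope body.")
d2 = bag_of_words("Athletes urged to snitch on dopers at Olympics.")
q = bag_of_words("Athletes dope Olympics")

print(dot(q, d1), dot(q, d2))  # prints 3 2
```

Because no stemming is applied, "dopers" does not match "dope".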

19 Similarity Metrics
Each cell is the number of times the word occurs in the document or query (a simplification; more later).
         Doc1     Doc2     Doc3     ...   DocN     Query
Term1    ct_11    ct_12    ct_13    ...   ct_1N    q_1
Term2    ct_21    ct_22    ct_23    ...   ct_2N    q_2
...
TermM    ct_M1    ct_M2    ct_M3    ...   ct_MN    q_M

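20 Similarity Metrics: Dot Product
$\mathrm{Sim}(\mathrm{DOC}_n, \mathrm{QUERY}) = \mathrm{DOC}_n \cdot \mathrm{QUERY} = ct_{1n} q_1 + ct_{2n} q_2 + \dots + ct_{Mn} q_M = \sum_{j=1}^{M} ct_{jn} q_j$
But a dot product can be large just because a document is very long, so normalise by the vector lengths. This gives the cosine of the angle between the vectors.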

21 Comparison Metrics
$\mathrm{Cosine}(Q, D) = \dfrac{Q \cdot D}{|Q|\,|D|}$
A number between 0 and 1 (term counts are never negative).
It is the cosine of the angle between the document and query vectors (diagram for M = 3).
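As a rough sketch of the cosine measure, the function below works on sparse count vectors and is applied to the Athletes example from the term frequency slide. The whitespace tokenisation and lower-casing are assumptions, and the function assumes neither vector is empty.

```python
import math
from collections import Counter

def cosine(q, d):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * d[term] for term, count in q.items())
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    return dot / (norm_q * norm_d)  # assumes both vectors are non-empty

q = Counter("athletes dope olympics".split())
d1 = Counter("athletes face dope raids uk dope body".split())
d2 = Counter("athletes urged to snitch on dopers at olympics".split())

print(round(cosine(q, d1), 3), round(cosine(q, d2), 3))  # prints 0.577 0.408
```

Here |D1| = 3 and |D2| = √8, so D1 still ranks above D2 after length normalisation.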

22 Problems?
D1) Athletes face dope raids: UK dope body.
D2) Athletes urged to snitch on dopers at Olympics.
Q) Athletes dope Olympics
But both documents are about the London Olympics and about doping: UK and Olympics point to the London Olympics, dopers to dope, etc.
Indexing is on words, not subject classes (concepts).

23 Problems?
Dimensions are not independent: Drug and Dope are closer together than Dope and London.
Apache could mean the server, the helicopter or the tribe. These should be different dimensions.
Therefore, the cosine is not necessarily an accurate reflection of similarity.

24 Index terms
What makes a good index term?
The term should describe some aspect of the document.
The term should not be so generic that it also describes all the other documents in the collection.
A good index term distinguishes a document from the rest of the collection.

25 Text Coverage
Coverage with the N most frequent words:
N = 1: 5% (the)
N = 10: 42% (the, and, a, he, but, ...)
...
The most frequent words are not informative!
The least frequent words are typos or too specialised.

26 Inverse Document Frequency
In a vector model, different words should have different weights.
Search query: Tom and Jerry. A match on documents with Tom or Jerry should count for more than a match on and.
The more documents a word appears in, the less useful it is as an index term.
Documents are characterised by words which are relatively rare in other documents.

27 Inverse Document Frequency
$\mathrm{idf}_i = \log \dfrac{|D|}{|\{d : t_i \in d\}|}$
Numerator: number of documents in the collection.
Denominator: number of documents containing term $t_i$.

28 tf*idf
Normalise term frequency by the length of the document (term i and document j):
$\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$
$\mathrm{idf}_i = \log \dfrac{|D|}{|\{d : t_i \in d\}|}$
$(\mathrm{tf} \cdot \mathrm{idf})_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$
tf*idf is high for a term in a document if its frequency in the document is high and its frequency in the rest of the collection is low.
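A minimal sketch of these three formulas, assuming a tiny made-up collection (the three documents and their ids are hypothetical) and the natural logarithm, since the slides do not fix a base:

```python
import math
from collections import Counter

# Hypothetical collection: document id -> list of tokens.
docs = {
    "d1": "tom and jerry".split(),
    "d2": "tom and the cat".split(),
    "d3": "the dog and the mouse".split(),
}

def tf(term, tokens):
    """Term count normalised by document length: n_ij / sum_k n_kj."""
    return Counter(tokens)[term] / len(tokens)

def idf(term):
    """log(|D| / number of documents containing the term).
    Assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / containing)

def tf_idf(term, doc_id):
    return tf(term, docs[doc_id]) * idf(term)

for term in ("tom", "and", "jerry"):
    print(term, round(tf_idf(term, "d1"), 3))
```

In this collection "and" appears in every document, so its idf (and hence its tf*idf) is 0, while "jerry", which appears in only one document, gets the largest weight in d1.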

29 Cheating
Hidden text
Keyword stuffing

30 Cheating the system
Indexing is done by algorithm, not by humans.
There is no control over the documents in the collection.
Websites try to show up at the top of a search.
How can reliable websites be identified?

31 Linear Algebra Revision
Vectors are one-dimensional matrices: $X = [x_0\ x_1\ x_2\ \dots\ x_n]$
$|X| = \text{length of } X = \sqrt{x_0^2 + x_1^2 + x_2^2 + \dots + x_n^2} = \sqrt{\sum_i x_i^2}$
Vectors are used to represent coordinates in n-dimensional space.

32 Scalar Multiplication
Two vectors can be multiplied using the dot product (also called the scalar product) to give a scalar number.
$X = [x_0\ x_1\ x_2\ \dots\ x_n]$, $Y = [y_0\ y_1\ y_2\ \dots\ y_n]$
$X \cdot Y = x_0 y_0 + x_1 y_1 + x_2 y_2 + \dots + x_n y_n = \sum_i x_i y_i$

33 Scalar Multiplication
Geometrically, $X \cdot Y$ = (length of the projection of X on Y) × (length of Y).

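34 Geometric Interpretation
$A \cdot A = |A|^2$, $B \cdot B = |B|^2$
$A \cdot B = |A|\,|B| \cos\theta$
$\cos 0^\circ = 1$, $\cos 90^\circ = 0$
The cosine function is a similarity metric.
(Diagram: vectors A and B separated by the angle θ.)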

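35 Matrix Product
An m × n matrix A multiplied by an n × p matrix B gives an m × p matrix C: $A_{m \times n}\, B_{n \times p} = C_{m \times p}$
$C_{ij} = (\text{row } i \text{ of } A) \cdot (\text{column } j \text{ of } B)$
When a column vector A is multiplied by a row vector B such as [1 5], the rows of C are the row of B multiplied by a scalar value from A, and the columns of C are the column of A multiplied by a scalar value from B.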

37 Problems with Term Counts
For the term IBM, how do you distinguish:
IBM's home page (mostly graphical; IBM occurs only a few times in the HTML)
IBM's copyright page (IBM occurs over 100 times)
A rival's spam page (arbitrarily large term count for IBM)

38 Hyperlinks for search
Web as a graph.
Anchor text pointing to page B provides a description of B.
A hyperlink from page A to B is a recommendation or endorsement of B.
Ignore internal links?
(Diagram: anchor texts "IBM computers", "IBM Corporation" and "International Business Machines" pointing to IBM.com.)

39 Links as recommendations
PageRank (Brin and Page, 1998).
A link from A to B is a recommendation of B.
Think of science: highly cited papers are considered to be of higher quality, and backlinks are like citations.
But webpages aren't reviewed, so how do we know the citer A is reliable? By counting links to A, of course!

40 PageRank
Consider a random surfer who clicks on links at random.
(Diagram: web graph over pages A to F; each edge is labelled with the probability of following that link: 1/3, 1/2 or 1/1.)

41 PageRank
If you continue this random walk, you will visit some pages more frequently than others.
These are pages with lots of links from other pages that themselves have lots of links.
PageRank: pages visited more often in a random walk are more important (reliable).

42 Teleporting
What if the random surfer reaches a page with no hyperlinks?
Teleport: the surfer jumps from a page to any other page in the web graph at random.
If there are N pages in the web graph, teleporting takes the surfer to each node with probability 1/N.
Use the teleport operation:
with probability α = 1 if the node has no outgoing links;
otherwise with some probability 0 < α < 1.

43 Need for Teleporting To avoid loops where you are forced to keep visiting the same sites in the random walk

44 Steady State
Given this model of a random surfer, the surfer spends a fixed fraction of the time at each page, which depends on:
The hyperlink structure of the web
The value of α (usually 0.1)
PageRank of page i: the fraction of the time spent at page i.

45 PageRank Computation
Represent the web as an adjacency matrix:
Adj(i,j) = 1 iff there is a link from i to j
Adj(i,j) = 0 iff there is no link from i to j
(Diagram: web graph over pages A, B and C.)
Adj =
     A  B  C
A    0  1  1
B    1  0  0
C    1  0  0

46 Transition Probabilities
Divide each 1 in Adj by the number of 1s in its row (the probability of clicking on the link to that page).
Probability of following a link:
     A    B    C
A    0    1/2  1/2
B    1/1  0    0
C    1/1  0    0

47 Transition Probabilities
Let's consider teleport probability α = 1/2, N = 3.
3) Multiply cells by 1/2 (1 − α, the probability of not teleporting).
4) Add 1/6 (α/N, the probability of teleporting to that page) to every cell.
Transition probabilities P:
     A                  B                  C
A    1/6                1/4 + 1/6 = 5/12   1/4 + 1/6 = 5/12
B    1/2 + 1/6 = 2/3    1/6                1/6
C    1/2 + 1/6 = 2/3    1/6                1/6
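The construction of P can be sketched as follows, using exact fractions so the entries can be compared with the slide. The adjacency matrix is the one implied by the link-following probabilities (A links to B and C; B and C each link to A). The function name `transition_matrix` is an illustrative choice, and the handling of pages with no out-links follows the teleporting rule (α = 1 for such pages).

```python
from fractions import Fraction

# Adjacency matrix for pages A, B, C (row = source page, column = target page).
adj = [
    [0, 1, 1],  # A links to B and C
    [1, 0, 0],  # B links to A
    [1, 0, 0],  # C links to A
]

def transition_matrix(adj, alpha):
    """Row-normalise adj, scale by (1 - alpha), then add alpha / N to every cell."""
    n = len(adj)
    rows = []
    for row in adj:
        out_links = sum(row)
        if out_links == 0:
            # No outgoing links: always teleport, uniformly over all pages.
            rows.append([Fraction(1, n)] * n)
        else:
            rows.append([(1 - alpha) * Fraction(a, out_links) + alpha / n for a in row])
    return rows

P = transition_matrix(adj, Fraction(1, 2))
for row in P:
    print([str(p) for p in row])
```

Each row of P sums to 1, as required for a matrix of transition probabilities.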

48 Starting State
Imagine the surfer starts at page B. At the beginning, x_0 = [0, 1, 0].
The vectors x_n show the proportion of time spent on pages A, B, C at time n.
At step one, $x_1 = x_0 P = [0, 1, 0] \begin{pmatrix} 1/6 & 5/12 & 5/12 \\ 2/3 & 1/6 & 1/6 \\ 2/3 & 1/6 & 1/6 \end{pmatrix} = [2/3, 1/6, 1/6]$
For example, the second entry is 0 × 5/12 + 1 × 1/6 + 0 × 1/6 = 1/6.

49 Iteration 2
At step one, x_1 = [2/3, 1/6, 1/6].
At step 2, $x_2 = x_1 P = [2/3, 1/6, 1/6] \begin{pmatrix} 1/6 & 5/12 & 5/12 \\ 2/3 & 1/6 & 1/6 \\ 2/3 & 1/6 & 1/6 \end{pmatrix}$
x_2 = [2/18 + 2/18 + 2/18, 10/36 + 1/36 + 1/36, 10/36 + 1/36 + 1/36] = [1/3, 1/3, 1/3]
For example, the second entry is 2/3 × 5/12 + 1/6 × 1/6 + 1/6 × 1/6 = 1/3.

50 Iterating...
       A      B      C
x_0    0      1      0
x_1    2/3    1/6    1/6
x_2    1/3    1/3    1/3
x_3    1/2    1/4    1/4
x_4    5/12   7/24   7/24
...
X      4/9    5/18   5/18   (steady state)
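The iteration can be reproduced with a few lines of Python. This is a sketch using exact fractions and the transition matrix P from the α = 1/2 example; the helper name `step` is an illustrative choice.

```python
from fractions import Fraction as F

# Transition matrix for pages A, B, C with teleport probability 1/2.
P = [
    [F(1, 6), F(5, 12), F(5, 12)],
    [F(2, 3), F(1, 6), F(1, 6)],
    [F(2, 3), F(1, 6), F(1, 6)],
]

def step(x, P):
    """One step of the random walk: the row vector x multiplied by the matrix P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

x = [F(0), F(1), F(0)]  # the surfer starts at page B
history = [x]
for _ in range(4):
    x = step(x, P)
    history.append(x)

for n, xn in enumerate(history):
    print(f"x_{n}", [str(v) for v in xn])

# The steady state is left unchanged by a further step.
steady = [F(4, 9), F(5, 18), F(5, 18)]
print(step(steady, P) == steady)  # prints True
```

The values oscillate around the steady state and close in on it: 2/3, 1/3, 1/2, 5/12, ... for page A, converging to 4/9.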

51 Example
Which sites have low / high PageRank?
(Diagram: web graph over pages D0 to D6.)

52 Example (α = 0.14)
PageRank vector = [0.05, 0.04, 0.11, 0.25, 0.21, 0.04, 0.31]
D0 = 0.05, D1 = 0.04, D2 = 0.11, D3 = 0.25, D4 = 0.21, D5 = 0.04, D6 = 0.31

53 Properties of PageRank
New pages have to acquire PageRank: either convince lots of sites to link to you, or convince a few high-PageRank sites.
PageRank can change very fast: one link on Yahoo or the BBC is enough.
Spamming PageRank costs money: you need to create a huge number of sites.
Google never sells PageRank.

54 Web Search in a nutshell
Ranking documents for a query:
Vector similarity: Cosine(Q, D), with terms taken from the document and anchor text and weighted using tf*idf.
PageRank: independent of the query (a property of the graph) and a measure of reliability (collaborative trust). It has nothing to do with how often real users click on links; the random surfer was only used to calculate a property of the graph.

55 Topics not covered...
Personalisation of search: increasingly, IR takes into account your search history to personalise results to your needs and interests.
IR also takes into account usage data to identify which links others clicked on for similar search queries, etc.

56 Social IR
Performing IR on social networks: searching Twitter, etc.
Using social networks for IR: adding collaborative aspects to web search.

57 Social Model of IR (Diagram by Sebastian Marius Kirsch)

58 Features of Social IR
Individuals appear in two roles: information producers and information consumers.
Queries and documents are essentially interchangeable.
Queries and/or documents may be used to model an information need or an area of expertise.
Most systems will use only some of the relations in the model.
For a social IR system, modelling relations between individuals is mandatory.

59 Information Spaces Graph of Users Graph of Documents

60 Information Spaces
A user follows or is followed by others on Twitter, is friends with others on Facebook, etc.
A user writes or views documents.

61 Social Graph Algorithms
PageRank can be used to judge reliability in the same manner: a tweeter is reliable if retweeted by other reliable tweeters.

62 How Google ranks tweets
Tweets: 140-character microblog posts sent out by Twitter members.
The key is to identify "reputed followers". Twitterers "follow" the comments of other Twitterers they've selected, and are themselves "followed".
If lots of people follow you, and then you follow someone, then even though this [new person] does not have lots of followers, his tweet is deemed valuable.
One user following another in social media is analogous to one page linking to another on the Web. Both are a form of recommendation...

63 Social Graph Algorithms
PageRank for identifying authorities.
Not a new idea: it has been used to identify the most influential scientists based on citation networks.

64 PageRank for Social IR
Calculate PageRank for each user i = Π_i, based on who is following whom.
PageRank for each document j = Π_j, based on which document links to which.
If user i wrote document j, then the reliability of j is some combination of Π_i and Π_j.

65 Collaborative Filtering for IR
In addition to reliability, you can filter search results using friend networks.
5.7 degrees of separation on Facebook.
Rerank search results to recommend documents viewed by friends, or people you follow, etc.

66 Facebook Graph Search
Restaurants liked by my Italian friends in Aberdeen: filter friends by country (Italy) and location (Aberdeen); only use ratings by these friends.
Which restaurants are liked by the locals? If in Sofia, find restaurants in Sofia; only use ratings by Facebook users living in Sofia.
Pictures of Jane: look for photos of people called Jane, starting with my friends, then Janes who went to my school or my university, are friends of my friends, etc. Filter out photos I don't have permission to view.

Lec 8: Adaptive Information Retrieval 2

Lec 8: Adaptive Information Retrieval 2 Lec 8: Adaptive Information Retrieval 2 Advaith Siddharthan Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website: http://nlp.stanford.edu/ir-book/ Linear Algebra Revision Vectors:

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 21: Link Analysis Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-06-18 1/80 Overview

More information

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system. Introduction to Information Retrieval Ethan Phelps-Goodman Some slides taken from http://www.cs.utexas.edu/users/mooney/ir-course/ Information Retrieval (IR) The indexing and retrieval of textual documents.

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

TODAY S LECTURE HYPERTEXT AND

TODAY S LECTURE HYPERTEXT AND LINK ANALYSIS TODAY S LECTURE HYPERTEXT AND LINKS We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

CSE 494: Information Retrieval, Mining and Integration on the Internet

CSE 494: Information Retrieval, Mining and Integration on the Internet CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:

More information

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page Agenda Math 104 1 Google PageRank algorithm 2 Developing a formula for ranking web pages 3 Interpretation 4 Computing the score of each page Google: background Mid nineties: many search engines often times

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Calculating Web Page Authority Using the PageRank Algorithm. Math 45, Fall 2005 Levi Gill and Jacob Miles Prystowsky

Calculating Web Page Authority Using the PageRank Algorithm. Math 45, Fall 2005 Levi Gill and Jacob Miles Prystowsky Calculating Web Page Authority Using the PageRank Algorithm Math 45, Fall 2005 Levi Gill and Jacob Miles Prystowsky Introduction In 1998 a phenomenon hit the World Wide Web: Google opened its doors. Larry

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

CSCI 5417 Information Retrieval Systems! What is Information Retrieval? CSCI 5417 Information Retrieval Systems! Lecture 1 8/23/2011 Introduction 1 What is Information Retrieval? Information retrieval is the science of searching for information in documents, searching for

More information

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search Web Search Ranking (COSC 488) Nazli Goharian nazli@cs.georgetown.edu 1 Evaluation of Web Search Engines: High Precision Search Traditional IR systems are evaluated based on precision and recall. Web search

More information

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~ . Search Engines, history and different types In the beginning there was Archie (990, indexed computer files) and Gopher (99, indexed plain text documents). Lycos (994) and AltaVista (995) were amongst

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group Simone.Teufel@cl.cam.ac.uk Lent

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto Lecture 04.02 Map-Reduce Algorithms By Marina Barsky Winter 2017, University of Toronto Example 1: Language Model Statistical machine translation: Need to count number of times every 5-word sequence occurs

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring This lecture: IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring 1 Ch. 6 Ranked retrieval Thus far, our queries have all

More information

Link Analysis in the Cloud

Link Analysis in the Cloud Cloud Computing Link Analysis in the Cloud Dell Zhang Birkbeck, University of London 2017/18 Graph Problems & Representations What is a Graph? G = (V,E), where V represents the set of vertices (nodes)

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 12: Link Analysis January 28 th, 2016 Wolf-Tilo Balke and Younes Ghammad Institut für Informationssysteme Technische Universität Braunschweig An Overview

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 2:00pm-3:30pm, Tuesday, December 15th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information
