Lecture 2: EITN01 Web Intelligence and Information Retrieval


Anders Ardö
EIT Electrical and Information Technology, Lund University
January 23, 2013

Outline

Previous lecture
  Indexing
  IR models
  Weighting
  Evaluation

Representation/Indexing (fig 1.2)
  [Figure: indexing pipeline. Documents -> assign document IDs -> break text into words (lexical analysis) -> stoplist -> stemming* -> term weighting* -> index / database. Intermediate data: document numbers and *field numbers, words, non-stoplist words, stemmed words, terms with weights. * indicates an optional operation.]

Information Retrieval - IR
  [Figure]

IR models - vector space
  Bag-of-words:
    syntax irrelevant
    document structure irrelevant
    position within document irrelevant
    meta-information irrelevant
  Document and query are n-dimensional vectors
  Similarity = cosine of the angle between the vectors
  [Figure: query vector and document vector]

Probabilistic vs. vector space model
  Vector space model: rank documents according to their similarity to the query.
  The notion of similarity does not translate directly into an assessment of "is the document a good document to give to the user or not?"
  The most similar document can be highly relevant or completely non-relevant.
  Probability theory is arguably a cleaner formalization of what we really want an IR system to do: give relevant documents to the user.
  Does not improve results much, experimental, not really used.

LSI
  LSI takes documents that are semantically similar (= talk about the same topics) but are not similar in the vector space (because they use different words) and re-represents them in a reduced vector space (SVD) in which they have higher similarity.
  Addresses the problem of synonymy.
  The dimensionality reduction forces us to omit a lot of detail.
  Fewer dimensions: more collapsing of axes, better recall, worse precision.
  More dimensions: less collapsing, worse recall, better precision.
  LSI has been tested and found to be modestly effective with traditional test collections.

Weighting TF*IDF
  Terms should be important in the document
  Terms present in many documents are not important
  (importance in document) * (how rare is the term)
  tf * idf
    tf: term frequency
    idf: inverse document frequency
  Various normalizations
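The TF*IDF weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the slides mention "various normalizations" without fixing one, so the choices here (raw term count for tf, log(N / df) for idf, whitespace tokenization) are assumptions, as are the function name `tf_idf` and the sample documents.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df).

    Assumed variant: tf is the raw count in the document and
    idf is log(N / df); other normalizations are equally valid.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # df: number of documents that contain the term at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical three-document collection
docs = ["dinosaur bird dinosaur", "bird fish", "bird reptile"]
weights = tf_idf(docs)
```

In this toy collection "bird" occurs in every document, so its idf (and weight) is zero, while "dinosaur" is both frequent in the first document and rare in the collection, so it gets the highest weight there.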

Recall/Precision
  [Figure]

Lecture 2 agenda
  Chapters 3, 5, 7, 11.5 in Modern Information Retrieval
  PageRank: "The PageRank Citation Ranking: Bringing Order to the Web", L. Page, S. Brin, R. Motwani, and T. Winograd (1999)
  HITS: "Authoritative Sources in a Hyperlinked Environment", J. Kleinberg (1999)

Query languages - aspects
  keyword
  context
    phrase
    proximity (position)
  Boolean
  natural language
  pattern matching
    truncation
    wild cards
    regular expressions
    fuzzy search
  structure
    fields

Databases
  Relational
  Free text
  Free text - fields structured
  Graph-based

Query languages
  Expressiveness
  Standardization
  Examples
    Structured Query Language - SQL
    Common Query Language - CQL
    XQuery
    Proprietary/home grown
      Google
      Bing
    Cypher (Graph Query Language)

SQL example
  SQL table country:
    name         region      area  population
    Afghanistan  SouthAsia   ...   ...
    Albania      Europe      ...   ...
    Algeria      MiddleEast  ...   ...
    Andorra      Europe      ...   ...

  SELECT name,population,area FROM country;
  SELECT name,region FROM country WHERE area > 20000;
  SELECT * FROM country WHERE name LIKE 'Al%' ORDER BY region;

SQL JOIN example
  SQL table country:
    cid  name         region      area
    1    Afghanistan  SouthAsia   ...
    2    Albania      Europe      ...
    3    Algeria      MiddleEast  ...
    4    Andorra      Europe      468

  SQL table cities:
    cid  name
    1    Kabul
    1    Herat
    2    Tirana
    2    Fier
    4    Canilo
    5    Rome

  SELECT country.name,cities.name,country.region
  FROM country INNER JOIN cities ON country.cid=cities.cid
  WHERE country.region='Europe';

  Result:
    Albania, Tirana, Europe
    Albania, Fier, Europe
    Andorra, Canilo, Europe
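The JOIN example can be reproduced with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `area` column is left out because most of its values are missing from the slide, the cid values 2 to 4 are inferred from the join result shown above, and the `ORDER BY` clause is added only to make the row order deterministic.

```python
import sqlite3

# In-memory database holding the two tables from the JOIN slide
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (cid INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE cities  (cid INTEGER, name TEXT);
    INSERT INTO country VALUES
        (1, 'Afghanistan', 'SouthAsia'),
        (2, 'Albania',     'Europe'),      -- cid inferred from the join result
        (3, 'Algeria',     'MiddleEast'),  -- cid assumed (sequence)
        (4, 'Andorra',     'Europe');      -- cid inferred from the join result
    INSERT INTO cities VALUES
        (1, 'Kabul'), (1, 'Herat'), (2, 'Tirana'),
        (2, 'Fier'),  (4, 'Canilo'), (5, 'Rome');
""")

# Same INNER JOIN as on the slide; the region is passed as a bound parameter
rows = conn.execute("""
    SELECT country.name, cities.name, country.region
    FROM country INNER JOIN cities ON country.cid = cities.cid
    WHERE country.region = ?
    ORDER BY country.name, cities.rowid
""", ("Europe",)).fetchall()

for row in rows:
    print(", ".join(row))
```

Rome (cid 5) has no matching row in `country`, so the inner join drops it; Kabul and Herat match Afghanistan but are filtered out by the region condition.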

XQuery example
  XQuery is to XML what SQL is to relational database tables.

  <bookstore>
    <book category="web">
      <title lang="en">Learning XML</title>
      <author>Erik T. Ray</author>
      <year>2003</year>
      <price>39.95</price>
    </book>
    ...
  </bookstore>

  for $x in /bookstore/book
  where $x/price>30
  order by $x/title
  return $x/title

Common Query Language - CQL
  used in the Web
  standardized
  simple
  intuitive
  powerful
  intended to "do what you mean"

Simple CQL - example 1
  fish
  "complete dinosaur"
  dinosaur or bird
  "complete dinosaur" and fish
  (bird or dinosaur) and (feathers or scales)
  dinosaur not reptile
  foo and bar or baz = (foo and bar) or baz
  foo or bar and baz = (foo or bar) and baz
  foo and (bar or baz)

CQL - example 2
  title = dinosaur
  title = ((dinosaur and bird) or dinobird)
  bath.title="the complete dinosaur"
  title all "complete dinosaur"
  title any "dinosaur bird reptile"
  title exact "the complete dinosaur"
  publicationyear < 1980
  title any "dinosaur bird reptile" and publicationyear > 1980 and author = Leakey
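The two precedence lines in CQL example 1 say that CQL Boolean operators all bind equally and are applied left to right. The toy evaluator below illustrates only that rule; it is an assumption-laden sketch (flat queries of single-word terms, no parentheses, phrases, or index names), and the name `cql_matches` is hypothetical, not part of any CQL library.

```python
def cql_matches(query, doc_terms):
    """Evaluate a flat CQL-style Boolean query strictly left to right.

    Supports only 'term op term op term ...' with op in and/or/not.
    """
    tokens = query.split()
    result = tokens[0] in doc_terms
    # The remaining tokens come in (operator, term) pairs
    for op, term in zip(tokens[1::2], tokens[2::2]):
        hit = term in doc_terms
        if op == "and":
            result = result and hit
        elif op == "or":
            result = result or hit
        elif op == "not":  # CQL "not" is binary: left AND NOT right
            result = result and not hit
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


# A document that contains only the term "foo"
only_foo = {"foo"}
# Parsed as (foo or bar) and baz
left_to_right = cql_matches("foo or bar and baz", only_foo)
# What conventional AND-before-OR precedence would give: foo or (bar and baz)
conventional = "foo" in only_foo or ("bar" in only_foo and "baz" in only_foo)
```

For this document the left-to-right reading rejects the query while the conventional reading would accept it, which is why parentheses are worth writing explicitly in mixed CQL queries.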

Query operations
  relevance feedback
    "Give me more like this"
    New relevance feedback vector =
        original query vector
      + mean of relevant documents (vectors) in hit set
      - mean of non-relevant documents (vectors) in hit set
  query expansion
    Add or remove search terms
    Change Boolean operators
    Change wild cards

Network protocols
  Dedicated protocol
    Z39.50
    MySQL network protocol
  CGI-script - Web Common Gateway Interface
  Web services - REST/SOAP
  Search/Retrieval via URL - SRU (OpenSearch)
  Open Archives Initiative - OAI

SRU
  An SRU request is an HTTP URL consisting of an SRU base URL and a search-part.
    base URL:
    search-part: version=1.1&operation=searchRetrieve&query=dinosaur
  Query expressed in CQL
  URL-encode the CQL query (query=... in the search-part):
    title= kirkegård  becomes  title%3d%20kirkeg%c3%a5rd
  optional parameters: startRecord, maximumRecords, recordPacking, recordSchema
  response is an XML record
  (SRW is a variation of SRU using SOAP with XML over HTTP)
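Building the SRU request URL described above amounts to percent-encoding the CQL query and appending it to the search-part. The sketch below uses Python's standard `urllib.parse.quote`; the base URL `http://example.org/sru` and the helper name `sru_url` are hypothetical placeholders, since the slide's base URL is not available.

```python
from urllib.parse import quote


def sru_url(base_url, cql_query, version="1.1"):
    """Return an SRU searchRetrieve URL for the given CQL query."""
    # safe="" makes quote() encode every reserved character, including "="
    encoded_query = quote(cql_query, safe="")
    search_part = (
        f"version={version}&operation=searchRetrieve&query={encoded_query}"
    )
    return f"{base_url}?{search_part}"


# Hypothetical base URL; the CQL query is the one from the slide
url = sru_url("http://example.org/sru", "title= kirkegård")
print(url)
```

`quote` emits uppercase hex digits (`%3D`, `%C3%A5`); percent-encoding is case-insensitive, so this is equivalent to the lowercase form shown on the slide.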

Vector space relevance
  Bag-of-words: only words
    syntax irrelevant
    position irrelevant
    document structure irrelevant
    meta-information irrelevant
  Document and query are n-dimensional vectors
  Relevance = similarity = cosine between query vector and document vector
  [Figure: query vector and document vector]

Relevance ranking in the Web
  Which Web page/report/article/document is most relevant?
  No unambiguous answer!
  In relation to
    Popularity
    Quality
    Information need
    Query
    Expectation
    Collection
  Types of relevance
    Objective - system based
      Algorithmic
    Subjective - user based
      Topicality
      Pertinence
      Situational

Relevance ranking
  Which Web page/report/article/document is most relevant?
  Is it important to know? YES
    Find relevant information fast
    We are lazy
    LOTS of information on the Internet
    Advertisement
    SEO - Search Engine Optimization
    Evaluation of universities/researchers - bibliometrics

Why can't I find X in Google?
  X isn't relevant enough!
  Results are sorted by relevance (ranking)
  We never look at more than 2-3 pages
  There is a LOT of data in Google
  (Make your search more specific)

Relevance ranking in the Web
  How do you calculate relevance (importance)?
  In relation to what?
  What criteria/information should be used?
    Text similarity
    Link structure
    Meta-data
  Need a numeric value (for sorting)

Vector space - cosine similarity
  Text only
  The similarity between two vectors is calculated with the cosine formula

  $$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_i w_{i,j}\, w_{i,q}}{\sqrt{\sum_i w_{i,j}^2}\,\sqrt{\sum_i w_{i,q}^2}}$$
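The cosine formula translates directly into code. The sketch below assumes sparse vectors stored as `{term: weight}` dictionaries (a representation chosen here for illustration, not one prescribed by the slides) and returns 0.0 when either vector is all zeros, which is also an assumption.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two sparse term-weight vectors."""
    # Dot product: only terms present in both vectors contribute
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_d * norm_q)


# Hypothetical weights for one document and one query
document = {"dinosaur": 1.0, "bird": 1.0}
query = {"dinosaur": 1.0}
score = cosine_similarity(document, query)
```

Because the result is a single number per document, it can be used directly as the sort key for ranking, which is the "numeric value (for sorting)" requirement above.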

Link based relevance - PageRank
  Web: use the linking structure
  Why? Motivation
    Creating a link is a deliberate act that requires some effort.
    The more links to a page, the more important (popular) it is
    Related to citation analysis from bibliometrics
  PageRank (Google): a page has high rank if the sum of the ranks of the pages linking to it is high

Simple PageRank

  $$PR_i = \sum_{v \in \text{inlinks}} PR_v$$

  $$PR_i = \sum_{v \in \text{inlinks}} \frac{PR_v}{\#\text{outlinks}(v)}$$

  Iterate
  [Figure: example graph with pages P1 to P4 and their PageRank values at iterations t-1 and t]

PageRank - How to calculate?

  $$PR(u)_t = d \sum_{v \in B_u} \frac{PR(v)_{t-1}}{N_v} + (1 - d)\,E(u)$$

  PR(u)_t  PageRank for u at time t
  B_u      pages that link to u
  N_v      number of links from v
  d        constant, usually 0.85
  E(u)     initial PageRank, for example 1/#pages

PageRank
  Random surfer model
    Click on a random link in the page
    Eventually gets bored and jumps to a random page
  Converges to a stable solution
  Problems
    size of the Web
    pages without links - dangling pages (rank sinks)
    converging
    link-spamming
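The iterative formula PR(u)_t = d * sum over v in B_u of PR(v)_{t-1} / N_v + (1 - d) * E(u) can be sketched as below. The link graph is hypothetical (the figure's actual links are not recoverable), E(u) is taken as 1/#pages as the slide suggests, and every page is assumed to have at least one out-link so that dangling pages need no special handling.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the PageRank formula on a {page: [pages it links to]} graph.

    Assumes every page has at least one out-link (no dangling pages).
    """
    pages = list(links)
    e = 1.0 / len(pages)                      # E(u) = 1 / #pages
    pr = {u: e for u in pages}                # start from E(u)
    # B_u: pages that link to u
    inlinks = {u: [v for v in pages if u in links[v]] for u in pages}
    for _ in range(iterations):
        # Each new value depends only on the previous iteration (t - 1)
        pr = {
            u: d * sum(pr[v] / len(links[v]) for v in inlinks[u]) + (1 - d) * e
            for u in pages
        }
    return pr


# Hypothetical four-page graph
links = {
    "P1": ["P2", "P3"],
    "P2": ["P3"],
    "P3": ["P1"],
    "P4": ["P3"],
}
ranks = pagerank(links)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))
```

With no dangling pages the ranks keep summing to 1 in every iteration, and a page nobody links to (P4 here) keeps only the (1 - d) * E(u) share from the random jump.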

Interpretation of PageRank
  Random surfer...
    gets bored after several clicks...
    jumps to a random page
  PageRank reflects the probability that a random surfer will land on the page
  Model: Markov chain with pages as states; transitions are links between pages

PageRank variants

  $$PR(u)_t = d \sum_{v \in B_u} \frac{PR(v)_{t-1}}{N_v} + (1 - d)\,E(u)$$

  Personalized
  Concept specific
  Decentralized calculation (P2P), page u from server s

  $$PR(u) = PR_{global}(s) \cdot PR_{local}(u)$$

PageRank - Pros and Cons
  Pros
    Easily understandable
    Based on a theoretical model
    Determined by users
    Works well in practice
  Cons
    Favors old pages
    Time consuming to compute
    Vulnerable to manipulation - link farms
    Google owns PageRank

PageRank + TF*IDF relevance ranking
  Combine PageRank with the vector space model

  $$PR(D) \cdot \mathrm{sim}(q, D) \quad \text{or} \quad f(PR(D)) \cdot \mathrm{sim}(q, D)$$

  In practice
    proximity
    structure: title, link-anchor text
    meta-data: keywords, description
    and ...

HITS - Intro
  Another link based algorithm to find good documents
  A good hub is a document that points to many documents
  A good authority is a document that many documents point to
  A mutually reinforcing relationship

HITS - What - How to calculate?
  Authorities: pages that many pages link to [A(u)]
  Hubs: pages with many out-links [H(u)]

  Hub = sum of authorities

  $$H(u)_t = \sum_{v \in L(u)} A(v)_{t-1}$$

  L(u): pages that u links to

  Authority = sum of hubs

  $$A(u)_t = \sum_{v \in B(u)} H(v)_{t-1}$$

  B(u): pages that link to u
  [Figure: hubs pointing to authorities]

HITS
  Not all the Web! Use a subset S
    S' = result set of a query
    S = S' + L(S') + B(S')
  [Figure: S' inside S]

Other methods
  Use behavior
    Clicks
    Co-occurrence - book orders (Amazon)
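The two HITS update rules can be sketched as follows. Both scores are computed from the previous iteration's values, as in the formulas above. The Euclidean normalization after each step is an addition not shown on the slides (without some normalization the scores grow without bound), and the small graph is hypothetical.

```python
import math


def hits(links, iterations=20):
    """Iterate hub and authority scores on a {page: [pages it links to]} graph."""
    pages = list(links)
    hub = {u: 1.0 for u in pages}
    auth = {u: 1.0 for u in pages}
    for _ in range(iterations):
        # A(u)_t = sum of H(v)_{t-1} over pages v that link to u   (B(u))
        new_auth = {u: sum(hub[v] for v in pages if u in links[v]) for u in pages}
        # H(u)_t = sum of A(v)_{t-1} over pages v that u links to  (L(u))
        new_hub = {u: sum(auth[v] for v in links[u]) for u in pages}
        # Assumed normalization step: scale each score vector to unit length
        auth_norm = math.sqrt(sum(a * a for a in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(h * h for h in new_hub.values())) or 1.0
        auth = {u: a / auth_norm for u, a in new_auth.items()}
        hub = {u: h / hub_norm for u, h in new_hub.items()}
    return hub, auth


# Hypothetical subset S: two hub-like pages pointing at two authority-like pages
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
```

Here h1 ends up the better hub because it points to both authorities, and a1 the better authority because both hubs point to it, which is the mutually reinforcing relationship described above.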
