Text Algorithms (4AP) Information Retrieval. Jaak Vilo 2008 fall. MTAT Text Algorithms. Materials


Materials
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, ACM Press. New edition in May 2009.
- Google Books: Information Retrieval
- ESSCaSS 08: Ricardo Baeza-Yates and Filippo Menczer


- Given a set of documents, find those relevant to topic X.
- The user formulates a query; documents are returned and retrieved by the user.
- Looking at the first 100 results: how many are relevant to the topic, and how many of all relevant documents fit in the first 100?
- Given an interesting document (just one?), how to find similar ones?
- Which keywords characterise documents similar to other documents?
- How to present the answer to the user? Topic hierarchies, self-organising maps (see WebSom), ...


What was searched for?
- car, bus, train, tram, trolleybus
- petrol, diesel, wood, hydrogen, natural gas, electricity

Which document should be the most similar?
- Similarity of document and query
- Ranking of documents
- Inflections (declensions/conjugations)
- Ontologies (structure of concepts)
- Importance of the document itself (e.g. PageRank)

Information retrieval (IR)
- Finding relevant information
- From unstructured document database(s)
- Relevance, measures
- Presenting information (UI, relevance)
- Free text queries (Natural Language Processing)
- User feedback

Information Retrieval is the "science of search": the study of systems for indexing, searching, and recalling data, particularly text or other unstructured forms.


History of IR
- 1960-70s: Small text retrieval systems; basic boolean and vector space retrieval models.
- 1980s: Large document database systems, many run by companies (e.g. Lexis-Nexis, Dialog, MEDLINE).
- 1990s: Searching FTPable documents on the Internet (e.g. Archie, WAIS); searching the World Wide Web (e.g. Lycos, Yahoo, Altavista).

History cont.
- 2000s: Link analysis for Web search (e.g. Google)
- Automated information extraction (e.g. Whizbang, Fetch, Burning Glass)
- Question answering (e.g. TREC Q/A track, Ask Jeeves)
- Multimedia IR (image, video, audio and music)
- Cross-language IR (e.g. DARPA Tides)
- Document summarization


Cont. 2000s
- Recommender systems (e.g. MovieLens, Pandora, LastFM)
- Automated text categorization and clustering
- iTunes: top songs
- Amazon: "people who bought this also bought"
- Bloglines: similar blogs
- Del.icio.us: most popular bookmarks
- Flickr.com: most viewed pictures
- NYTimes: most e-mailed articles

IR is the discipline that deals with the retrieval, representation, storage and organization of, and access to, structured, semi-structured and unstructured data (information objects) in response to a query (topic statement). The query itself may be structured (e.g. a boolean expression) or unstructured (e.g. a sentence or a document).

Concepts
- Information Retrieval: the study of systems for representing, indexing (organising), searching (retrieving), and recalling (delivering) data.
- Information Filtering: given a large amount of data, return the data that the user wants to see.
- Information Need: what the user really wants to know; a query is an approximation to the information need.
- Query: a string of words that characterizes the information that the user seeks.
- Browsing: a sequence of user interaction tasks that characterizes the information that the user seeks.

Information retrieval can also be seen as the process of applying algorithms over unstructured, semi-structured or structured data in order to satisfy a given (explicit) information query. Efficiency matters with respect to:
- algorithms
- query building
- data organization/structure

Data vs. Information Retrieval

Information Retrieval:
- Set of keywords (loose semantics)
- Semantics of the information need
- Errors are tolerable

Data Retrieval:
- Regular expression (well-defined query)
- Constraints for the objects in the answer set
- A single error makes the retrieval task a failure

Summary (figure): Information need -> User query -> IR system -> Ranked list of documents, with feedback from the user. The system compares the information need with the information and generates a ranking which reflects relevance.

Source: Lecture 2: Query Languages & Operations, 2ID10: Information Retrieval, Lora Aroyo

IR introduction

Topics:
1. IR Models
2. IR Query Languages & Operations
3. Searcher Feedback
4. Language Modeling for IR
5. Search Engines
6. Semantic in IR
8. Multimedia IR
9. Structured Content

IR research issues and applications of IR:
- classification and categorization (catalogues)
- systems and languages (NL-based systems)
- user interfaces and visualization

The Web phenomenon:
- universal repository of knowledge
- free (low cost) universal access
- no central editorial board
- IR is the key to finding the solutions


Logical View of Documents (figure)
- A document is text + structure.
- Text operations applied to the document text: accents and spacing, stop-words, noun groups, stemming, automatic or manual indexing.
- Document representation continuum: from full text to index terms, with intermediate representations (transformations) in between.
- Text operations reduce the complexity of documents.

The Retrieval Process (figure)
- Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index (inverted file), DB Manager Module, Text Database.
- The user specifies the user need through the User Interface.
- Text Operations define the logical view of the texts and of the user need.
- Query Operations generate the query from the logical view.
- Indexing builds the index (inverted file) over the Text Database, which is managed by the DB Manager Module.
- Searching uses the index to produce the retrieved docs; Ranking returns the ranked docs to the user.
- User feedback changes the query.


Inverted index: document level

T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Q: "what", "is", "it"
{0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}

Inverted index: word level

T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"

"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}

Q: "what is it"
"what": {(0, 2), (1, 0)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
Each pair is (document number, word position), so the phrase matches only where the three words occur at consecutive positions of the same document.
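The two index levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: T1 is read as "what is it", words are split on whitespace, and the helper names (`build_document_index`, `build_word_index`, `search_all`, `search_phrase`) are hypothetical, not part of the lecture material. The phrase search is one possible way to use the word-level positions.

```python
from collections import defaultdict

# The three example texts from the slide (T1 read as "what is it").
texts = ["it is what it is", "what is it", "it is a banana"]


def build_document_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return index


def build_word_index(docs):
    """Map each word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return index


def search_all(index, words):
    """Return ids of documents containing every query word (set intersection)."""
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()


def search_phrase(word_index, words):
    """Return ids of documents where the words occur at consecutive positions."""
    if not words:
        return set()
    later = [set(word_index.get(w, [])) for w in words[1:]]
    hits = set()
    for doc_id, pos in word_index.get(words[0], []):
        if all((doc_id, pos + k + 1) in later[k] for k in range(len(later))):
            hits.add(doc_id)
    return hits


doc_index = build_document_index(texts)
word_index = build_word_index(texts)
print(sorted(search_all(doc_index, ["what", "is", "it"])))      # [0, 1]
print(sorted(search_phrase(word_index, ["what", "is", "it"])))  # [1]
```

The document-level query reproduces the intersection {0, 1} from the slide, while the phrase query narrows it to T1, the only text where the three words are adjacent.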


The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query "find the documents where word X occurs". Once a forward index, which stores the list of words per document, has been built, it is inverted to obtain the inverted index. Answering the query from the forward index would require iterating sequentially through every document and every word in it to find the matching documents, and the time, memory, and processing resources needed for that are not always technically realistic. Instead of listing the words per document, as the forward index does, the inverted index lists the documents per word.
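The inversion step can be illustrated with a small Python sketch. The `forward_index` dictionary and the function names `invert` and `scan_forward` are hypothetical, chosen only to contrast the sequential scan with the direct lookup.

```python
from collections import defaultdict

# Hypothetical forward index: document id -> list of words in that document.
forward_index = {
    0: ["it", "is", "what", "it", "is"],
    1: ["what", "is", "it"],
    2: ["it", "is", "a", "banana"],
}


def invert(forward):
    """Turn {doc: [words]} into {word: sorted list of docs}."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(docs) for word, docs in inverted.items()}


def scan_forward(forward, word):
    """Answer 'which documents contain word?' by scanning every document."""
    return sorted(doc_id for doc_id, words in forward.items() if word in words)


inverted_index = invert(forward_index)
# Both calls give the same answer; the inverted index needs one dictionary lookup,
# while the forward index has to be scanned document by document.
print(scan_forward(forward_index, "banana"))  # [2]
print(inverted_index["banana"])               # [2]
```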

Measures

Precision is the fraction of the retrieved documents that are relevant to the user's information need.
Recall is the fraction of the documents relevant to the query that are successfully retrieved.

Measures: Precision & Recall

                          Retrieved   Not retrieved
User need (relevant)      TP          FN
Not needed (irrelevant)   FP          TN
Total                     TP+FP       TN+FN

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

$\text{Precision} = \frac{TP}{TP + FP} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$

$\text{Recall} = \frac{TP}{TP + FN} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$

Measures: Precision & Recall

$\text{Precision} = \frac{TP}{TP + FP}$

$\text{Recall} = \frac{TP}{TP + FN}$

Recall is also known as sensitivity. Specificity, $\frac{TN}{TN + FP}$, is a different quantity from precision.

Measure: F-Measure

The traditional F-measure, or balanced F-score, is the harmonic mean of precision and recall:

$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Two other commonly used weighted variants are the $F_2$ measure, which weights recall twice as much as precision, and the $F_{0.5}$ measure, which weights precision twice as much as recall.
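As a small worked example, the measures can be computed from the four counts. The counts below (30 true positives, 20 false positives, 10 false negatives) are made up for illustration, and the weighted form $F_\beta = (1 + \beta^2) P R / (\beta^2 P + R)$ is the usual generalisation behind $F_2$ and $F_{0.5}$; it is not spelled out on the slide, so treat it as an added assumption.

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of relevant documents that were retrieved."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r.

    beta = 1 gives the balanced F-measure, beta = 2 favours recall,
    beta = 0.5 favours precision.
    """
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


# Hypothetical counts for one query.
tp, fp, fn = 30, 20, 10
p = precision(tp, fp)   # 30 / 50
r = recall(tp, fn)      # 30 / 40
print(f"precision={p:.3f} recall={r:.3f}")
print(f"F1={f_beta(p, r):.3f} F2={f_beta(p, r, 2):.3f} F0.5={f_beta(p, r, 0.5):.3f}")
```

Because recall is higher than precision in this made-up case, $F_2$ comes out above $F_1$ and $F_{0.5}$ below it.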

ROC: Receiver Operating Characteristic
AUC: Area Under Curve
(Figure: ROC curves of 3 systems compared; axes TP (relevant) against FP (irrelevant).)

Vector space model
- Document: a vector of words, i.e. a sparse vector over all possible words.
- Similarity between query and document:
  - the scalar product, or
  - the angle between the two vectors.

Scalar product

Query Q is a document with perhaps just a single word.

Similarity of query and document:

$M(Q, D_i) = Q \cdot D_i$, where $X \cdot Y = \sum_i x_i y_i$

Weighted version
- The more often the word occurs, the more relevant the document.
- Same word vectors, but count the occurrences: $M(X, Y) = \sum_i w_{q,i} \, w_{d,i}$, where the weight $w$ of a word is different in each document.
- Extension: add weight for a word in a "more important" context.
- Can you add term weights on query words?
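A minimal sketch of both similarity measures, assuming sparse vectors are stored as word-count dictionaries and reusing the three example texts from the inverted index slides. The function names are hypothetical, and raw counts stand in for the weights $w_{q,i}$ and $w_{d,i}$.

```python
import math
from collections import Counter


def to_vector(text):
    """Sparse word-count vector: only words that occur are stored."""
    return Counter(text.lower().split())


def dot(x, y):
    """Scalar product of two sparse vectors, sum over i of x_i * y_i."""
    if len(x) > len(y):
        x, y = y, x  # iterate over the shorter vector
    return sum(weight * y.get(term, 0) for term, weight in x.items())


def cosine(x, y):
    """Cosine of the angle between two vectors (0.0 if either is empty)."""
    norm = math.sqrt(dot(x, x) * dot(y, y))
    return dot(x, y) / norm if norm else 0.0


docs = [to_vector(t) for t in ["it is what it is", "what is it", "it is a banana"]]
query = to_vector("what is it")

for doc_id, doc in enumerate(docs):
    print(doc_id, dot(query, doc), round(cosine(query, doc), 3))

ranking = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
print(ranking)  # [1, 0, 2]
```

The raw scalar product prefers T0 because its words repeat, whereas the cosine ranks T1 first; this is one way to see why the angle between vectors is often used instead of the unnormalised product.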

Limitations of vector space
- Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality).
- Search keywords must precisely match document terms; word substrings might result in a "false positive match".
- Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
- The order in which the terms appear in the document is lost in the vector space representation.

Term Weight Calculation
- tf factor, the term frequency within a document: quantifies intra-document (intra-cluster) content similarity, i.e. how well a term describes a document.
- idf factor, the inverse document frequency: quantifies inter-document (inter-cluster) separation (dissimilarity), based on the frequency of the term in the docs of the collection.
- $w_{ij} = tf(i,j) \cdot idf(i)$

TF and IDF Factors

Let
- $N$ be the total number of docs in the collection,
- $n_i$ be the number of docs which contain term $k_i$,
- $freq(i,j)$ be the raw frequency of $k_i$ within doc $d_j$.

A normalized frequency (tf factor) is given by

$f(i,j) = \frac{freq(i,j)}{\max_l freq(l,j)}$

where the maximum is computed over all terms occurring in doc $d_j$.

The idf factor is computed as

$idf(i) = \log \frac{N}{n_i}$

The log makes the values of tf and idf comparable; it can also be interpreted as the amount of information associated with term $k_i$.

The vector model with tf-idf weights is a good ranking strategy in general collections, and it is simple and fast to compute.
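The two factors can be combined into weights $w_{ij} = f(i,j) \cdot idf(i)$ as in the sketch below. It assumes whitespace tokenisation and the natural logarithm (the slide does not fix the base), and the function name `tf_idf_weights` is hypothetical.

```python
import math
from collections import Counter

docs = ["it is what it is", "what is it", "it is a banana"]


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, with weight = f(i,j) * idf(i)."""
    counts = [Counter(text.split()) for text in documents]
    n_docs = len(counts)                      # N

    doc_freq = Counter()                      # n_i: number of docs containing term i
    for c in counts:
        doc_freq.update(c.keys())

    weights = []
    for c in counts:
        max_freq = max(c.values()) if c else 1
        weights.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        })
    return weights


weights = tf_idf_weights(docs)
for doc_id, w in enumerate(weights):
    print(doc_id, {term: round(value, 3) for term, value in sorted(w.items())})
```

Words that occur in every document ("it", "is") get weight 0 because $\log(N / n_i) = \log 1 = 0$, while "banana", which occurs in a single document, gets the largest weight.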

Pros & Cons

Advantages:
- Term weighting improves the quality of the answer set.
- Partial matching allows retrieval of docs that approximate the query conditions.
- The cosine ranking formula sorts documents according to their degree of similarity to the query.

Disadvantages:
- Assumes independence of index terms; it is not clear whether this is bad, though.

Ontology

Ontology: a conceptualisation of things. An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

Example (figure): vehicles, with the subclasses watercraft, cars and airplanes; a seaplane belongs to more than one of them.

Ontology-driven search
- Query => map to an ontology
- Use the ontology to guide what you really want
- Map documents to the same ontology
- Fetch the documents most relevant to the term, ontology, etc.
- Example: GoPubMed

Importance of a document

Can we say that some documents are a priori more important than others?
- Type of document (law, news, chat, ...)
- Good source
- Relevant (often cited, popular)

What is a Markov Chain?

A Markov chain has two components:
1) A network structure, much like a web site, where each node is called a state. So the complete web is the set of all possible states.
2) A transition probability of traversing a link, given that the chain is in a state. For each state, the sum of the outgoing probabilities is one.

A sequence of steps through the chain is called a random walk.

The Random Surfer
- Assume the web is a Markov chain.
- Surfers randomly click on links, where the probability of following a particular outlink from page A is 1/m, and m is the number of outlinks from A.
- The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.
- Using the theory of Markov chains it can be shown that, if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.

Dangling Pages (figure: pages A, B and C)
- Problem: A and B have no outlinks.
- Solution: Assume A and B have links to all web pages with equal probability.


Rank Sink
- Problem: Pages in a loop accumulate rank but do not distribute it.
- Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

PageRank (PR) Definition

$PR(P) = \frac{d}{N} + (1 - d) \left( \frac{PR(P_1)}{O(P_1)} + \frac{PR(P_2)}{O(P_2)} + \dots + \frac{PR(P_n)}{O(P_n)} \right)$

- $P$ is a web page
- $P_i$ are the web pages that have a link to $P$
- $O(P_i)$ is the number of outlinks from $P_i$
- $d$ is the teleportation probability
- $N$ is the size of the web
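The definition can be turned into a simple iterative computation. The sketch below is one possible reading, not a reference implementation: the function name `pagerank`, the fixed iteration count and the tiny example graph (C links to A and B, which have no outlinks) are assumptions made for illustration. Dangling pages are treated as linking to every page with equal probability, following the fix described for dangling pages.

```python
def pagerank(links, d=0.15, iterations=100):
    """Iterate PR(P) = d/N + (1 - d) * sum(PR(Pi) / O(Pi)) over pages Pi linking to P.

    links maps each page to the list of pages it links to;
    d is the teleportation probability.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}
        for p in pages:
            # A dangling page (no outlinks) is assumed to link to every page.
            targets = links.get(p) or pages
            share = (1 - d) * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical graph: C links to A and B; A and B are dangling pages.
ranks = pagerank({"C": ["A", "B"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

With this normalisation the ranks sum to 1, so each value can be read as the probability that the random surfer is on that page.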

Example Web Graph (figure: pages A, B and C)

Iteratively Computing PageRank
- Replace d/N in the definition of PR(P) by d, so that the PR values are scaled to sum to N rather than to 1.
- d is normally set to 0.15, but for simplicity let's set it to 0.5.
- Set the initial PR values to 1.
- Solve the following equations iteratively:

$PR(A) = 0.5 + 0.5 \, PR(C)$

$PR(B) = 0.5 + 0.5 \left( \frac{PR(A)}{2} \right)$

$PR(C) = 0.5 + 0.5 \left( \frac{PR(A)}{2} + PR(B) \right)$
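The iteration can be tried out directly. The sketch assumes the three equations read PR(A) = 0.5 + 0.5 PR(C), PR(B) = 0.5 + 0.5 (PR(A)/2) and PR(C) = 0.5 + 0.5 (PR(A)/2 + PR(B)), which is what the unnormalised definition gives for d = 0.5 on a graph where A links to B and C, B links to C, and C links to A. Both the link structure and the function name `iterate_example` are inferred for illustration.

```python
def iterate_example(steps, d=0.5):
    """Run the three update equations for the given number of steps."""
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}   # initial PR values are 1
    history = [dict(pr)]
    for _ in range(steps):
        # All three values are updated from the previous iteration's values.
        pr = {
            "A": d + (1 - d) * pr["C"],
            "B": d + (1 - d) * (pr["A"] / 2),
            "C": d + (1 - d) * (pr["A"] / 2 + pr["B"]),
        }
        history.append(dict(pr))
    return history


history = iterate_example(50)
for step in (0, 1, 2, 3, 50):
    row = history[step]
    print(step, round(row["A"], 4), round(row["B"], 4), round(row["C"], 4))
```

Solving the three equations by hand gives the fixed point PR(A) = 14/13, PR(B) = 10/13, PR(C) = 15/13, which sums to N = 3; the iteration approaches these values quickly because d = 0.5 halves the error at every step.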

Example Computation of PR (table with columns: Iteration, PR(A), PR(B), PR(C))

Large Matrix Computation
- Computing PageRank can be done via matrix multiplication, where the matrix has 30 million rows and columns.
- The matrix is sparse, as the average number of outlinks is between 7 and 8.
- Setting d = 0.15 or above requires at most 100 iterations to converge.
- Researchers are still trying to speed up the computation.


PageRank - Motivation
- A link from page A to page B is a vote by the author of A for B, or a recommendation of the page.
- The number of incoming links to a page is a measure of the importance and authority of the page.
- Also take into account the quality of the recommendation, so a page is more important if the sources of its incoming links are important.


The Anatomy of a Large-Scale Hypertextual Web Search Engine

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at


- Personalized PageRank: teleportation to a set of pages defining the preferences of a particular user
- Topic-sensitive PageRank [Haveliwala 02]: teleportation to a set of pages defining a particular topic
- TrustRank [Gyöngyi 04]: teleportation to trustworthy pages
- Many papers on analyzing PageRank and on numerical methods for efficient computation


Future? Or current?
- Recommendations (tagging)
- Common behaviour (spread of news/epidemics)
- Social networks
- Focus
- Generalisation
- Rich get richer; Googlearchy?
- Your contribution?


Materials: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. http://people.ischool.berkeley.edu/~hearst/irbook/