Information retrieval systems


Suzanna Parks
6 years ago

2 Information retrieval systems
Information retrieval (IR): n. searching for documents or for information in documents.
Question answering: respond with a specific answer to a question (e.g., Wolfram Alpha).
Document retrieval: find documents relevant to a query, ranked by relevance (e.g., Google).
Text analytics/data mining: general organization of large textual databases (e.g., Lexis-Nexis, OpenText, MedSearch).
CSC401/2511 Spring

3 Terminology
Information retrieval has slightly different terminology than the tasks we've seen previously:
Document: a book, article, web page, or paragraph (depending on the task and data).
Collection: a corpus of documents.
Term: a word type.
Stop word: a functional (non-content) word (e.g., the).

4 Query types
Different kinds of questions can be asked.
Factoid questions, e.g., "How often were the peace talks in Ireland delayed or disrupted as a result of acts of violence?"
Narrative (open-ended) questions, e.g., "Can you tell me about contemporary interest in the Greek philosophy of stoicism?"
Complex/hybrid questions, e.g., "Who was involved in the Schengen agreement to eliminate border controls in Western Europe and what did they hope to accomplish?"

5 Question answering (QA)
"Which woman has won more than 1 Nobel prize?" (Marie Curie)
Question answering (QA) usually involves a specific answer to a question.

6 Document retrieval vs IR
One strategy is to turn question answering into information retrieval (IR) and let the human complete the task.

7 Question answering (QA)

8 Knowledge-based QA
1. Build a structured semantic representation of the query. Extract times, dates, locations, and entities using regular expressions. Fit to well-known templates.
2. Query databases with these semantics: ontologies (Wikipedia infoboxes), restaurant review databases, calendars, movie schedules.

9 IR-based QA

10 IR-based QA

11 IR-based QA
(Figure labels: Information retrieval, Question answering.)

12 IBM's Watson
(Figure labels: Human 1; Human 2; Game Control System; Clue Grid; Clues, Scores & Other Game Data; Watson's Game Controller; Decisions to Buzz and Bet; Strategy; Text-to-Speech; Clue & Category; Answers & Confidences; Watson's QA Engine; 2,880 IBM Power750 compute cores; 15 TB of memory; content equivalent to ~1,000,000 books.)
Source: "A Brief Overview and Thoughts for Healthcare Education and Performance Improvement" by the IBM Watson team.

13 IBM's Watson: search
"This man became the 44th President of the United States in 2008"

14 IBM's Watson: search
Title-oriented search: In some cases, the solution is in the title of highly-ranked documents. E.g., "This pizza delivery boy celebrated New Year's at Applied Cryogenics."

15 BASICS

16 Queries
A query is a textual key which orders a specific subset of documents (or answers) in a collection. Historically, queries were highly structured in a logical language, but in modern search engines they are more often streams of syntactically disconnected keywords.
A boolean query is a logical combination of boolean membership predicates, e.g., Brutus AND Caesar AND NOT Calpurnia.

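17 Term-document incidence
(Table: a 0/1 incidence matrix. Rows are the terms ANTHONY, BRUTUS, CAESAR, CALPURNIA, CLEOPATRA, MERCY, WORSER; columns are the plays Anthony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.)
For the query Brutus AND Caesar AND NOT Calpurnia, take the row for Brutus, the row for Caesar, and the complement of the row for Calpurnia, and combine them with a bitwise AND.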

18 Boolean queries and big collections
If we have 1 million documents, each with 1,000 tokens, that is 1 billion tokens, so there are at most 1 billion 1s in the matrix. If we have 500,000 distinct terms, the term-document incidence matrix will have 500,000,000,000 elements, so at least 99.8% of its cells are 0. In practice there will be far fewer than 1 billion 1s, because terms repeat within documents. The matrix is very sparse and a waste of space. Can there be a better way?


19 Inverted index
Given a query word, the inverted index for that word gives us all documents that contain that word in either the title, the abstract (summary), some hidden metadata, or the entire text. More sophisticated versions also include the frequency and positions of the query word in each document.
(Figure: the query "Python" goes into the inverted index, which returns documents.)
How does one construct such indices?

20 Inverted index construction
1. Collect the documents to be indexed. (Friends, Romans, countrymen. So let it be with Caesar.)
2. Tokenize the text. (Friends | Romans | countrymen | So)
3. Do preprocessing and normalization, resulting in the indexing terms. (friend | roman | countryman | so)
4. Create a dictionary (hash) of documents given terms.
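The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: normalization here is plain lowercasing (a stand-in for real stemming or lemmatization, so "friends" is not reduced to "friend"), the document IDs are made up, and the third document is a hypothetical addition so that one term appears in two documents.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each normalized term to a sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Steps 2 and 3: tokenize on letters and normalize by lowercasing only.
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    # Step 4: the dictionary from terms to (sorted) document lists.
    return {term: sorted(ids) for term, ids in index.items()}

# Step 1: collect documents. IDs and document 3 are hypothetical.
docs = {
    1: "Friends, Romans, countrymen",
    2: "So let it be with Caesar",
    3: "Caesar was ambitious",
}
index = build_inverted_index(docs)
print(index["caesar"])
```

Keeping each document list sorted is a deliberate choice: it is what makes a linear-time merge of two lists possible at query time.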

21 Simple conjunctive query
Given the query Brutus AND Calpurnia:
1. Locate Brutus in the dictionary and retrieve its document list.
2. Locate Calpurnia in the dictionary and retrieve its document list.
3. Intersect the two document lists and return the result to the user.
The intersection is linear in the lengths of the document lists (if the lists are sorted).
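The linear-time intersection in step 3 can be sketched as a two-pointer merge. The function name and the example document lists are assumptions chosen for illustration; the only requirement is that both lists are sorted.

```python
def intersect(list_a, list_b):
    """Intersect two sorted document-ID lists in O(len(list_a) + len(list_b))."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance whichever pointer sits on the smaller ID
        else:
            j += 1
    return result

# Hypothetical document lists for the two query terms.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, so the loop never visits a list element twice.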


22 Constructing indices
Spiders (i.e., bots, crawlers) start with root (seed) URLs and follow all links on these pages recursively. Novel pages are processed and indexed.
Despite the exponential growth in memory across depth, breadth-first search is quite popular. Depth-first search needs memory that is only linear in depth, but can get lost.
Trivia: If you click on the first contentful link in any Wikipedia page, you will eventually be led to the Philosophy article.

23 Increasing entropy?
Boolean retrieval is precise and was very popular for decades (it is still used for structured data, e.g., eHealth records). The amount and value of unstructured data (i.e., text) has grown faster than that of structured data on the web.
(Figure: data volume and market cap for unstructured vs. structured data, 1996 and 2016. Data from Chris Manning.)

24 Zipf's law on the web
These variables have Zipfian distributions: the number of links to and from a page, the length of web pages, and the number of web page hits. (Graph from Ray Mooney.)

25 New challenges for IR on the web
Distributed data: Documents are spread over millions of web servers.
Volatile data: Documents change or disappear frequently and rapidly.
Large volume: Petabytes of data.
Poor quality: No editorial control, false information, poor writing, typographic errors.
Heterogeneity: Various media, languages, encodings.
Unstructured: No uniform structure, HTML errors, duplicate documents.

26 Detecting duplicates duplicates
The user will become annoyed when many top-ranking hits are identical or similar. Nearly-identical pages can be determined by hashing. E.g., don't index en.m.wikipedia.org/wiki/ if you've indexed en.wikipedia.org/wiki/.
Zero marginal relevance occurs when a highly relevant document becomes irrelevant by being ranked below a (near-)duplicate.

27 Detecting duplicates duplicates
Compute similarity with some edit-distance measure. Syntactic similarity (e.g., overlap of bigrams) is easier to measure than semantic similarity. If this measure is above some threshold for some pair of documents, we consider them duplicates.
Jaccard coefficient: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
It is a measure of similarity on $[0, 1]$, with $J(A, A) = 1$ and $J(A, B) = 0$ iff $A \cap B = \emptyset$.

28 Jaccard coefficient on 2-grams
Documents:
$d_1$: Jack London went to Toronto
$d_2$: Jack London went to the city of Toronto
$d_3$: Jack went from Toronto to London
$J(d_1, d_2) = \frac{3}{8} = 0.375$ (the two documents share 3 of their 8 distinct bigrams)
$J(d_1, d_3) = 0$ (no bigram is shared)
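The two values on this slide can be checked with a short sketch. The helper names are assumptions, and tokenization is a plain whitespace split, which is enough for these three sentences.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a whitespace-tokenized text."""
    words = text.split()
    return set(zip(words, words[1:]))

def jaccard(a, b):
    """Jaccard coefficient |A intersect B| / |A union B| of two sets."""
    union = a | b
    # Two empty sets are treated as identical here (an assumption).
    return len(a & b) / len(union) if union else 1.0

d1 = bigrams("Jack London went to Toronto")
d2 = bigrams("Jack London went to the city of Toronto")
d3 = bigrams("Jack went from Toronto to London")

print(jaccard(d1, d2))  # 3 shared bigrams out of 8 distinct ones
print(jaccard(d1, d3))  # same words, but no shared bigram
```

Note that the first and third documents share almost all of their words; using bigrams rather than single words is what makes their similarity drop to zero.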

29 LINKS

30 Link analysis
When we're crawling the web and indexing, we want to retain some record of similarity between (non-duplicate) documents in terms of their link structure. This will help in searching.

31 Bibliometrics: citation analysis
Impact factor: Developed in 1972 to measure the quality and influence of scientific journals. Measures how often articles are cited.
Bibliographic coupling: Measure of similarity between documents according to the intersection of their citations (Kessler, 1963).
(Figure: documents A and B.)

32 Bibliometrics: citation analysis
Co-citation: Measure of similarity between documents according to the intersection of the documents that cite them (Small, 1973).
(Figure: documents A and B.)

33 Links are not citations
Many links are navigational within a website. Many pages with high in-degree are portals without much content. Some links are not necessarily endorsements.
Relevance of citations in scientific settings is (theoretically) enforced by peer review. Can we mimic the enforcement of relevance usually done by human experts in scientific articles?


34 Authorities and hubs
Authorities are pages recognized as significant, trustworthy, and useful for a topic. In-degree (number of incoming links) is an estimate of authority. Should incoming links from authoritative pages count more than others?
Hubs are index pages that provide lots of links to relevant content pages. E.g., reddit.com is a hub page for memes.

35 HITS (hits hits hits hits)
The HITS algorithm (Kleinberg, 1998) attempts to learn hubs and authorities on a given topic given relevant web subgraphs. Hubs and authorities tend to form bipartite graphs.
(Figure: a bipartite graph with hubs on one side and authorities on the other.)

36 HITS
First, find the (top $N$) most relevant pages for a query; this is the root set, $R$ (we'll see how to do this next lecture).
Next, look at the link structure relative to $R$. The base set, $S$, is $R$ plus all pages that link to, or are linked from, pages in $R$.
(Figure: the root set $R$ inside the base set $S$.)

37 HITS: Authorities and in-degree
Even for $S$, nodes with high in-degree may not be authorities; they may just be generically popular pages. Authority should be determined by strong hubs.
Iteratively (slowly) converge on a mutually reinforcing set of hubs and authorities. For every page $p \in S$, maintain:
Authority score: $a_p$ (initialized to $1/|S|$)
Hub score: $h_p$ (initialized to $1/|S|$)


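38 HITS update rules
Authorities $p$ are pointed to ($\to$) by lots of good hubs $q$:
$a_p = \sum_{q : q \to p} h_q$
E.g., if pages 1, 2, and 3 all point to page 4, then $a_4 = h_1 + h_2 + h_3$.
Hubs point to lots of good authorities:
$h_q = \sum_{p : q \to p} a_p$
E.g., if page 4 points to pages 1, 2, and 3, then $h_4 = a_1 + a_2 + a_3$.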
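A minimal sketch of these two update rules follows. It makes several assumptions that the slides do not state: scores are renormalized to sum to 1 after every round so they stay bounded, the hub update uses the freshly updated authority scores, the number of iterations is fixed rather than tested for convergence, and the tiny link graph is hypothetical.

```python
def normalize(scores):
    """Rescale scores to sum to 1 (left unchanged if they sum to 0)."""
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()} if total else scores

def hits(links, iterations=50):
    """links: dict mapping each page q to the list of pages p that q points to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    auth = {p: 1.0 / len(pages) for p in pages}
    hub = dict(auth)
    for _ in range(iterations):
        # Authority update: a_p is the sum of h_q over all q that point to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        auth = normalize(new_auth)
        # Hub update: h_q is the sum of a_p over all p that q points to.
        hub = normalize({q: sum(auth[p] for p in links.get(q, ())) for q in pages})
    return auth, hub

# Hypothetical graph: hub1 points to two pages, hub2 to only one of them.
links = {"hub1": ["pageA", "pageB"], "hub2": ["pageA"]}
auth, hub = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
```

In this toy graph, pageA ends up with the highest authority because both hubs point to it, and hub1 ends up the stronger hub because it points to both authorities.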

39 PageRank
PageRank (Brin & Page, 1998) is an alternative to HITS that does not distinguish between hubs and authorities.

40 PageRank: initial idea
Assume that in-degree does not account for the authority of the source of a link. For page $p$, the page rank is:
$R(p) = c \sum_{q : q \to p} \frac{R(q)}{N_q}$
where $N_q$ is the total number of out-links from $q$ and $c$ is a normalizing constant. A page's rank flows out equally among its outgoing links.

41 PageRank: flow of authority
PageRank would iteratively adjust all $R(p)$ until the overall page ranking converged.
(Figure: steady state.)

42 PageRank: problem
Groups of purely self-referential pages (linked from the outside) are sinks that absorb all the rank in the system during the iterative rank assignment process.

43 PageRank: rank source
An ethereal rank source $E$ continually replenishes the rank of each page $p$ by a fixed amount $E(p)$:
$R(p) = c \left( \sum_{q : q \to p} \frac{R(q)}{N_q} + E(p) \right)$
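The rank-source formula can be iterated directly. The sketch below is illustrative only: the constant source value $E(p) = 0.15$ for every page, the fixed iteration count, the choice of $c$ so that all ranks sum to 1, and the example graph (which contains a self-referential pair of pages) are all assumptions.

```python
def pagerank(links, source=0.15, iterations=100):
    """links: dict mapping each page q to the list of pages it links to."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page starts the round with the rank source E(p).
        new_rank = {p: source for p in pages}
        # Each page q passes R(q) / N_q along each of its N_q out-links.
        for q, targets in links.items():
            for p in targets:
                new_rank[p] += rank[q] / len(targets)
        # c is chosen so that the ranks sum to 1.
        c = 1.0 / sum(new_rank.values())
        rank = {p: c * r for p, r in new_rank.items()}
    return rank

# Hypothetical graph: pages "c" and "d" only link to each other (a sink),
# but are reachable from outside through a -> b -> c.
links = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["c"]}
rank = pagerank(links)
print({p: round(r, 3) for p, r in rank.items()})
```

Because of the rank source, pages outside the sink (such as "a", which has no incoming links at all) keep a nonzero rank instead of being drained completely.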

44 Complete ranking
A complete ranking involves combining: PageRank; preferences using HTML tags (e.g., title or abstract are often highly informative); and similarity of query words and documents.
How do we relate query words and documents in the first place?

45 VECTOR SPACES

46 The vector space model
In the vector space model, queries and documents are both represented by unit-length vectors in word space. Each dimension can be a word in the vocabulary. The domain of each dimension can be, e.g.,
$\{0, 1\}$: absent/present.
$\mathbb{N}$: term frequency of word $w_i$ in document $d_j$ ($tf_{ij}$).
$\mathbb{R}_{\geq 0}$: damped weight, e.g., $1 + \log(tf_{ij})$ if $tf_{ij} > 0$, and $0$ if $tf_{ij} = 0$.
Note: vectors in the above domains can easily be normalized to unit vectors, if desired.
Note 2: you don't always want these vectors to be unit length; sometimes you explicitly don't.


47 The vector space model
If the query and the available documents can be represented by vectors, we can determine similarity according to the cosine method. Vectors that are near each other (within a certain angular radius) are considered relevant.
(Figure: document vectors and a query vector $q$; the document whose vector makes the smallest angle with $q$ is the closest to the query.)

48 The cosine measure
The cosine measure (a.k.a. "normalized correlation coefficient") is
$\cos(\vec{q}, \vec{d}) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \sqrt{\sum_{i=1}^{n} d_i^2}}$
where $\vec{q}$ and $\vec{d}$ are $n$-dimensional vectors for the query and document, respectively.

49 The cosine measure
The cosine measure (a.k.a. "normalized correlation coefficient") is
$\cos(\vec{q}, \vec{d}) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \sqrt{\sum_{i=1}^{n} d_i^2}}$
where $\vec{q}$ and $\vec{d}$ are $n$-dimensional vectors for the query and document, respectively.
Larger values of $\cos(\vec{q}, \vec{d})$ mean stronger correlation, so $\vec{q}$ is closer to $\vec{d}_1$ than to $\vec{d}_2$ iff $\cos(\vec{q}, \vec{d}_1) > \cos(\vec{q}, \vec{d}_2)$.
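As a small illustration, the cosine measure can be computed directly from its definition. The function name, the convention of returning 0 for an all-zero vector, and the example vectors are assumptions.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_q * norm_d)

# Hypothetical term-count vectors over a three-word vocabulary.
query = [1, 1, 0]
doc_1 = [2, 2, 0]   # same direction as the query, different length
doc_2 = [0, 1, 3]   # shares only one term with the query
print(cosine(query, doc_1), cosine(query, doc_2))
```

The first document scores higher even though its raw counts differ from the query's, because the measure depends only on direction and not on vector length.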

50 Term weighting
What if we want to weight words in the vector space model?
Term frequency, $tf_{ij}$: number of occurrences of word $w_i$ in document $d_j$.
Document frequency, $df_i$: number of documents in which $w_i$ appears.
Collection frequency, $cf_i$: total occurrences of $w_i$ in the collection.

51 Term frequency
Higher values of $tf_{ij}$ (for contentful words) suggest that word $w_i$ is a good indicator of the content of document $d_j$. When comparing the relevance of a document $d_j$ to a keyword $w_i$, $tf_{ij}$ should be maximized.
We often dampen $tf_{ij}$ to temper these comparisons. E.g., even if $tf_{ij} = 3\,tf_{ik}$, empirically we don't want to say that document $d_j$ is thrice as relevant as document $d_k$.
$tf_{damped} = 1 + \log(tf)$, if $tf > 0$.

52 Document frequency
The document frequency, $df_i$, is the number of documents in which $w_i$ appears. Meaningful words may occur repeatedly in a related document, but functional (or less meaningful) words may be distributed evenly over all documents.
(Table: collection frequency and document frequency for the words kernel and try.)
E.g., kernel occurs about as often as try in total, but it occurs in fewer documents, so it represents a more specific concept.

53 Inverse document frequency
Very specific words, $w_i$, would give smaller values of $df_i$. To maximize specificity, the inverse document frequency is
$idf_i = \log\left(\frac{D}{df_i}\right)$
where $D$ is the total number of documents and we scale with $\log$, as before. This measure gives full weight to words that occur in 1 document, and zero weight to words that occur in all documents (since $\log(D/D) = 0$).

54 tf.idf
We combine the term frequency and the inverse document frequency to give us a joint measure of relatedness between words and documents:
$tf.idf(w_i, d_j) = \begin{cases} (1 + \log(tf_{ij})) \log\frac{D}{df_i} & \text{if } tf_{ij} \geq 1 \\ 0 & \text{if } tf_{ij} = 0 \end{cases}$
where $D$ is the total number of documents.
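The tf.idf weight can be sketched as follows. The function name, the use of natural logarithms, and the three toy documents are assumptions for illustration; terms that do not occur in a document are simply absent from its weight dictionary, which corresponds to the zero case of the formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict name -> list of tokens. Returns dict name -> {term: weight}."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        weights[name] = {
            term: (1 + math.log(count)) * math.log(num_docs / df[term])
            for term, count in tf.items()
        }
    return weights

# Hypothetical three-document collection.
docs = {
    "d1": "the kernel panics the kernel".split(),
    "d2": "try the door".split(),
    "d3": "try the kernel".split(),
}
weights = tf_idf(docs)
print(weights["d1"])
```

Here "the" occurs in every document and so gets weight 0 everywhere, while "panics" occurs in a single document and gets the largest idf factor.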

55 Aspects of tf.idf
The $tf.idf$ score has been criticised for being ad hoc, but it has been shown to be robust and effective in a wide range of applications where a rough estimate of relatedness is needed. However:
The effectiveness of $tf.idf$ can vary depending on the number of words in a query.
Ambiguous words can cause spurious matches.
Vectors of terms collected independently in this way can't capture semantic connections between words. We somehow need vectors of ideas/concepts.

56 Latent semantic indexing
Co-occurrence: n. when two or more terms occur in the same documents more often than by chance. Note: this is not quite the same thing as collocations.
Consider the following:
            Term 1   Term 2      Term 3   Term 4
Query       user     interface
Document 1  user     interface   HCI      interaction
Document 2                       HCI      interaction
Document 2 appears to be related to the query although it contains none of the query terms. The query and document 2 are semantically related.

57 Latent semantic indexing
Latent semantic indexing projects queries and documents into a space with latent (i.e., hidden) semantic dimensions. Co-occurring terms are projected onto the same latent dimensions. Non-co-occurring terms are not.
In latent space, a query and a document can have high cosine similarity even if they do not share any terms. The latent semantic space has fewer dimensions than the original space (which had 1 dimension per term).

58 Latent semantic indexing
There are many different mappings from high-dimensional spaces to low-dimensional spaces. The low dimensions are not usually selected from among the existing dimensions (in this case). Recall, however, feature selection from Lecture 3-1.
We must learn some function that projects our original dimensions onto some smaller set of new dimensions.

59 Latent semantic indexing example
This original space has 5 dimensions (one per term). The reduced space has two dimensions, perhaps vaguely referring to outer space and to vehicles, respectively.
(Table: term-by-document matrix with rows cosmonaut, astronaut, moon, car, truck and columns $d_1$ to $d_6$.)
What are some ways of projecting onto fewer dimensions?

60 1. Principal components analysis (PCA)
PCA is an eigendecomposition of the variance in the data. It gives us a sequence of orthogonal vectors, each of which captures the maximum amount of variance in the remaining data.
(Figure: a point cloud with two orthogonal component axes.) Most of the variance is along the first component; it captures the largest amount of change in the data. We can rotate the data so that it's expressed in the new dimensions.

61 1. Principal components analysis (PCA)
If we express the data only in terms of the first component:
it will be easier to learn a model (we have fewer parameters);
we will lose some information (what if the classes we want to distinguish are all differentiated along the second component? Note: this should be rare).

62 2. Singular value decomposition
Singular value decomposition (SVD) can be used both as a method of co-occurrence analysis between words that aids in similarity judgements, and as a method for dimensionality reduction.
Just as linear regression projects 2-dimensional data onto a 1-dimensional line, SVD projects an $n$-dimensional matrix, $A$, onto a $k$-dimensional matrix, $\hat{A}$, where $k < n$. The matrix $\hat{A}$ is produced such that some maximal amount of information in $A$ is retained.

63 2. Singular value decomposition
The SVD projection is computed by decomposing the term-by-document matrix $A_{t \times d}$ into the product of three matrices, $T_{t \times n}$, $S_{n \times n}$, and $D_{d \times n}$, where $t$ is the number of words (terms), $d$ is the number of documents, and $n = \min(t, d)$. Specifically,
$A_{t \times d} = T_{t \times n} S_{n \times n} (D_{d \times n})^{\top}$

64 2. Singular value decomposition
(Figure.)

65 SVD example
$A = T S D^{\top}$
(Matrices: $A$ has rows cosmonaut, astronaut, moon, car, truck and columns $d_1$ to $d_6$; $T$ has one row per term; $D^{\top}$ has one column per document $d_1$ to $d_6$.)
What do these matrices mean?

66 SVD example
(Matrix $A$: rows cosmonaut, astronaut, moon, car, truck; columns $d_1$ to $d_6$.)
$A$ is the matrix of term frequencies, $tf_{ij}$. E.g., moon occurs once in $d_1$ and once in $d_2$.

67 SVD example
Matrices $T$ and $D$ represent terms and documents, respectively, in this new space. E.g., the first row of $T$ corresponds to the first row of $A$, and so on.
$T$ and $D$ are orthonormal, so all columns are orthogonal to each other and $T^{\top} T = D^{\top} D = I$.
(Matrices: $T$ with rows cosm, astro, moon, car, truck; $D^{\top}$ with columns $d_1$ to $d_6$.)

68 SVD example
The matrix $S$ contains the singular values of $A$ in descending order. The $i^{th}$ singular value hints at the amount of variation on the $i^{th}$ axis.
$S = \mathrm{diag}(2.16, 1.59, 1.28, 1.00, 0.39)$
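To make the leading singular value concrete, the sketch below finds the top singular triplet by power iteration in plain Python. This is not a full SVD and not necessarily how the lecture's matrices were computed. The entries of `A` are an assumption: they follow the usual textbook form of this cosmonaut/astronaut example and are consistent with "moon occurs once in $d_1$ and once in $d_2$", but the slide's own matrix values are not reproduced in this transcript.

```python
import math

# ASSUMED term-by-document counts.
# Rows: cosmonaut, astronaut, moon, car, truck. Columns: d1..d6.
A = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]

def matvec(matrix, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in matrix]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def top_singular_triplet(matrix, iterations=100):
    """Power iteration on A^T A: returns (sigma_1, first term vector, first document vector)."""
    transposed = [list(col) for col in zip(*matrix)]
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        v = matvec(transposed, matvec(matrix, v))  # v <- A^T A v
        length = norm(v)
        v = [x / length for x in v]
    u = matvec(matrix, v)        # A v = sigma_1 * u
    sigma = norm(u)
    return sigma, [x / sigma for x in u], v

sigma, u, v = top_singular_triplet(A)
# Rank-1 approximation of A built from the leading triplet only.
rank_1 = [[sigma * ui * vj for vj in v] for ui in u]
print(round(sigma, 2))
```

Under the assumed matrix, the leading singular value agrees with the first entry of $S$ above. Keeping only the first few triplets is what produces the reduced space.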

69 SVD example
By restricting $T$, $S$, and $D$ to their first $k < n$ columns, their product gives us $\hat{A}$, a best least-squares approximation of $A$.
(Matrices: the truncated $T$ with rows cosm, astro, moon, car, truck; the truncated $S$; the truncated $D^{\top}$ with columns $d_1$ to $d_6$.)

70 SVD in practice
(Figure: words projected into a low-dimensional space, forming clusters for body parts, place names, and animals.)
Rohde et al. (2006) An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence. Communications of the ACM 8:

71 SVD in practice
Trained on the Google news corpus with over 300 billion words.

72 SVD projection
Recall that the purpose of doing SVD was to find co-occurrence patterns among terms and documents. When SVD finds the optimal projection to a lower-dimensional space, the result is that words that have similar co-occurrence patterns are projected onto the same dimensions. E.g., car and truck could be projected onto the same dimensions.
Therefore, we can identify similarities between queries and documents even if they share no terms in common.
Note: PCA typically uses SVD (e.g., it does in Matlab).

73 Similarities in lower dimensions
Consider $B = S_{2 \times 2} D^{\top}_{2 \times d}$ from the previous example.
(Matrix $B$: two rows, columns $d_1$ to $d_6$.)
If we normalize the columns of $B$ and multiply its transpose with it, we get $B^{\top} B$.
(Matrix $B^{\top} B$: rows and columns $d_1$ to $d_6$, with 1 on the diagonal.)

74 Similarities in lower dimensions
This matrix gives us the correlation coefficients between documents in their latent (hidden) semantic space, regardless of their constituent terms.
(Matrix: correlation coefficients between documents $d_1$ to $d_6$, with 1 on the diagonal.)
E.g., some pairs of documents are very similar in this space, whereas others are not.

75 Latent semantic indexing in IR
The process of projecting different (but related) terms to common underlying dimensions in the semantic space can be thought of as soft clustering. This is desirable: if a user poses a particular query, we want to identify all documents that might be of use, regardless of the specific terms used.
Occasionally this process results in some spurious matches, especially with polysemous (many-meaning) words, but it remains a useful tool in practice.

76 Aside: neural networks (cut for time)
We've seen word embeddings before, from word2vec. Everything is the same, but we swap out SVD for NNs. Sometimes, small amounts of labeled data can be used to retrain the network.
Mitra B, Craswell N. (2017) Neural Models for Information Retrieval.

77 Aside: neural networks (cut for time)
Often word2vec or GloVe vectors are trained on some massive global corpus. Global word embeddings risk capturing only coarse representations of topics dominant in the corpus.
Diaz F, Mitra B, Craswell N. (2016) Query Expansion with Locally-Trained Word Embeddings, Proc. of ACL, doi: /v1/p

78 Aside: neural networks (cut for time)
Query expansion involves reweighting likelihoods, usually through deleted interpolation, as we saw with HMMs:
$p_q^{1}(w) = \lambda\, p_q(w) + (1 - \lambda)\, p_{q^+}(w)$
$p_{q^+}$ comes from taking the $|V| \times k$ term embedding matrix $U$ and the $|V| \times 1$ query term vector $q$, taking the top terms from $U U^{\top} q$, and normalizing their weights.
We can also recompute the probability of a document as the softmax over KL divergences between distributions for query terms and document terms. This emphasizes documents that are more similar to the queries.
Diaz F, Mitra B, Craswell N. (2016) Query Expansion with Locally-Trained Word Embeddings, Proc. of ACL, doi: /v1/p

79
Mitra B, Craswell N. (2017) Neural Models for Information Retrieval. http://arxiv.org/abs/1705.
Zhang Y, Rahman MM, Braylan A, et al. (2016) Neural Information Retrieval: A Literature Review.

80 MISCELLANEOUS

81 Naïve Bayes
Naïve Bayes can be used for ad hoc document retrieval. Imagine each of the $D$ documents is a class with one training example: the document itself. Rank documents $d_j$ based on the posterior probability of a query $q$ being generated by those documents, $P(q; d_j)$.
This is sometimes called a language modelling approach.

82 Naïve Bayes: generative model
$q$: "The greatest and best song"
(Figure: each document's language model assigns a probability $P(q; d_j)$ to the query; the ranked retrievals have probabilities 0.3, 0.28, 0.1, and 0.08.)

83 Smoothing
Since each language model, $d_j$, will be very sparse, we need to smooth.
Linear interpolation: $P(w; d_j) = \lambda \hat{P}(w; d_j) + (1 - \lambda) \hat{P}(w)$
where $\hat{P}(w; d_j)$ is the probability of $w$ in the language model learned on $d_j$, and $\hat{P}(w)$ is the probability of $w$ given the entire corpus.
Add-$\delta$: same as in Assignment 2.
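A minimal sketch of ranking by query likelihood with linear interpolation is given below. It assumes unigram language models with maximum-likelihood estimates, an interpolation weight of 0.5, non-empty documents, and a hypothetical three-document collection; none of these choices come from the slides.

```python
from collections import Counter

def rank_by_query_likelihood(query, docs, lam=0.5):
    """Score each document by the smoothed probability that it generates the query."""
    corpus_counts = Counter(w for words in docs.values() for w in words)
    corpus_size = sum(corpus_counts.values())
    scores = {}
    for name, words in docs.items():
        doc_counts = Counter(words)
        score = 1.0
        for w in query:
            p_doc = doc_counts[w] / len(words)        # estimate from this document only
            p_corpus = corpus_counts[w] / corpus_size  # estimate from the whole corpus
            score *= lam * p_doc + (1 - lam) * p_corpus
        scores[name] = score
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Hypothetical collection and query.
docs = {
    "d1": "the greatest song of all".split(),
    "d2": "the best and greatest".split(),
    "d3": "a song about nothing".split(),
}
ranking, scores = rank_by_query_likelihood("greatest song".split(), docs)
print(ranking[0], scores)
```

Without the corpus term, the second and third documents would score exactly 0 because each is missing one query word; with interpolation they keep a small nonzero score and can still be ranked.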

84 Experimental results
Linear interpolation performs almost as well as vector-space ranking (VSR). Add-$\delta$ (called Laplace here), not so much. (From Ray Mooney.)

85 Evaluation
How can we decide which of two search engines is better at responding to a query?
Precision is the proportion of returned documents that are relevant.
Recall is the proportion of relevant documents that are returned.
Rank is also important: we want the correct documents to be near the top of the ordered list.

86 Recall is not enough
If there are 20 relevant documents in a collection, each of the following systems has a recall of $\frac{5}{20} = 25\%$.
(Table: ranked results by rank for the systems Boogle, Ging, and Whoopie.)

87 Precision is not enough
Each of the following systems has a precision of 50%.
(Table: ranked results by rank for the systems Boogle, Ging, and Whoopie.)

88 Cutoff precision
Measure precision down to some cutoff, e.g., the first 5.
(Table: ranked results by rank for the systems Boogle, Ging, and Whoopie.)
Cutoff precision (CP): Boogle 0%, Ging 100%, Whoopie 60%.

89 Uninterpolated Average Precision
Measure precision at each relevant document returned.
Rank   Boogle precision
6      1/6
7      2/7
8      3/8
9      4/9
10     5/10
$UAP = \frac{\frac{1}{6} + \frac{2}{7} + \frac{3}{8} + \frac{4}{9} + \frac{5}{10}}{5} \approx 0.354$
Normalize at the end by the number of relevant documents.
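The Boogle computation can be reproduced with a short sketch. The function names are assumptions, and the relevance list encodes the ranking implied by the table: nothing relevant in ranks 1 to 5, then a relevant document at each of ranks 6 to 10. The average is normalized by the number of relevant documents returned, as in the worked example.

```python
def uninterpolated_average_precision(relevance):
    """relevance: list of booleans in rank order (True = relevant document)."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant document
    return sum(precisions) / len(precisions) if precisions else 0.0

def cutoff_precision(relevance, cutoff=5):
    """Precision over the first `cutoff` returned documents."""
    return sum(relevance[:cutoff]) / cutoff

# Ranking implied by the table: relevant documents only at ranks 6..10.
boogle = [False] * 5 + [True] * 5
print(round(uninterpolated_average_precision(boogle), 3))
print(cutoff_precision(boogle))
```

Both measures penalize this ranking for placing its relevant documents late, which plain precision (50% over all ten results) does not show.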

90 Uninterpolated Average Precision
(Table: ranked results and UAP for the systems Boogle, Ging, and Whoopie.)

91 Precision versus recall
UAP is useful when your primary concern is relevance order. Returning all documents available will give you 100% recall, but would be useless to the user. Returning one document may give you high precision, but very low recall.
The F-score combines precision $P$ and recall $R$ with a parameter $\alpha$ that weighs the two:
$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$
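As a quick numeric check of the F-score, here is a small sketch. The function name and the convention of returning 0 when precision or recall is 0 are assumptions; with $\alpha = 0.5$ the formula reduces to the harmonic mean $2PR/(P+R)$.

```python
def f_score(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# A system with 50% precision and 25% recall.
print(f_score(0.5, 0.25))             # equal weighting
print(f_score(0.5, 0.25, alpha=0.9))  # weighting that favours precision
```

Raising $\alpha$ moves the score toward the precision value, and lowering it moves the score toward the recall value.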

92 Conclusion
Some slides and material are based on those of Ray J. Mooney (UTexas, CS371R); Hinrich Schütze, Christina Lioma, and Chris Manning (Stanford, CS276); and Dan Jurafsky (Stanford, CS124).

This lecture: Introduction to information retrieval. Making money with information retrieval. Some technical basics. Link analysis. CSC401/2511 Spring 2017