Information Retrieval and Applications

Size: px

Start display at page:

Download "Information Retrieval and Applications"

Oswald Reeves
6 years ago
Views:

1 Information Retrieval and Applications Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern, Heimo Gursch KTI, TU Graz Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

2 Outline 1 Basics Inverted Index Probabilistic IR 2 Advanced Concepts Latent Semantic Indexing Relevance Feedback Query Processing 3 Evaluation 4 Applications of IR Classification Content-based Recommender systems 5 Solutions Open Source Solutions Commercial Solutions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

3 Basics Basics Inverted Index, Text-Preprocessing, Language-based & Probabilistic IR Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

4 Basics Functionnal features Search for single keyword Search for multiple keywords Search for phrases, i. e. keywords in particular order Boolean combination of keywords (AND, OR, NOT) Rank results by number of keyword occurrences Find the positions of keywords in a document Compose snippets of text around the keyword positions Motivation How can this goals be achieved? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

5 Basics Structured, Semi-Structured Data Structured data Format and type of information is known IR task: normalize representations between different sources Example: SQL databases Semi-structured data Combination of structured and unstructured data IR task: extract information from the unstructured part and combine it with the structured information Example: , document with meta-data Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

6 Basics Unstructured data Unstructured data Format of information is not know, information is hidden in (human-readable) text IR task: extract information at all Example: blog-posts, Wikipedia articles This lecture focuses on unstructured data. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

7 Basics Simple Approach Sequential Search Advantage Simple Disadvantages Slow and hard to scale Usage Small texts Frequently changing text Not enough space for index available Example grep on the UNIX command line Not feasible for large texts. A different approach is needed. But how? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

8 Definition Basics Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). The term term is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases,...) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

9 Basics Assumptions Basic assumptions of Information Retrieval Collection Fixed set of documents Goal Retrieve documents with information that is relevant to user s information need and helps him complete a task. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

10 Basics Overview Indexed Searching Split process Indexing Convert all input documents into a data-structure suitable for fast searching Searching Take an input query and map it onto the pre-processed data-structure to retrieve the matching documents Inverted index Reverses the relationship between documents and contained words (terms) Central data-structure of a search engine Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

11 Three Example Documents Basics Document 1 Sepp was stopping off in Hawaii. Document 2 Sepp is a co-worker of Heidi. Document 3 Heidi stopped going to Hawaii. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

12 Basics Inverted Index Inverted Index Matrix Term-document incidence Matrix entries 0 The document does not contain the corresponding term 1 The document does contain the corresponding term Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

13 Basics Inverted Index Example Matrix Document 1 Document 2 Document 3 co-worker going Hawaii Heidi Seppl stopped stopping off Matrix entries 0 The document does not contain the corresponding term 1 The document does contain the corresponding term Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

14 Basics Inverted Index Properties of the Inverted Index Matrix Vectorization of Documents Column Words of an document Row All documents where a word occurs Possible Queries Documents containing all terms of a query Take row vectors for each term and do a bitwise AND Documents contain any terms of a query Take row vectors for each term and do a bitwise OR Documents not containing a term Use a bitwise NOT Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

15 Basics Inverted Index Examples All documents containing the term Hawaii Take the row vector of Hawaii, look for a one: Document 1 & Document 3 All documents containing the terms Seppl & Heidi 1 Take the rows for the terms Seppl & Heidi 2 Combine them by a bitwise AND 3 The position in the result vector indicates the correct document: Document 2 More complex queries Use Boolean algebra on the row vectors Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

16 Basics Inverted Index What we have so far... Advantages Search for terms Formulate boolean queries Works very fast Disadvantages Cannot search for phrases (i. e. terms in a particular order) No information where the term occurred in the document No snippets No ranking possible, documents either match or do not match Matrix is usually very sparse, possible waste of memory Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

17 Basics Inverted Index Make the inverted index memory efficient Dictionary Store a (sorted a ) list of all terms in the documents. a The sorting is not necessary for it to work, but it is necessary to find terms fast Postings For every term: store a list in which documents the term occurs Posting list Linked lists generally preferred to arrays Dynamic space allocation Insertion of terms into documents easy Space overhead of pointers Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

18 Example Basics Inverted Index Dictionary Postings co-worker Document 2 going Document 3 Hawaii Document 1, Document 3 Heidi Document 2, Document 3 Seppl Document 1, Document 2 stopped Document 3 stopping off Document 1 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

19 Basics Inverted Index Boolean Retrieval AND Intersection of postings for two or more terms OR Union of postings for two or more terms NOT All documents not in the postings of a term Make it efficient Sort the dictionary Sort the postings by document IDs But: Tricky and long boolean combinations can still get difficult! Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

20 Basics Inverted Index Properties Prerequisite Every document needs a unique, sortable ID, i. e. an unique integer number Advantages Very efficient storage Search for terms Formulate boolean queries Disadvantages Cannot search for phrases (i. e. terms in a particular order) No information where the term occurred in the document No snippets No ranking possible, documents either match or do not match Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

21 Basics Inverted Index Positional Information Store not only that a document contains a term, but also where the term occurs in the document. Dictionary Postings term 1 Document ID, Position(s) in document term 2 Document ID, Position(s) in document term n Document ID, Position(s) in document Search for phrases (i. e. consecutive words) 1 Get each word from the dictionary and retrieve its posting list 2 If the document IDs are equal for two or more words, search for consecutive positions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

22 Basics Inverted Index Example Dictionary Postings co-worker (Doc 2, Pos 4) going (Doc 3, Pos 3) Hawaii (Doc 1, Pos 5); (Doc 3, Pos 5) Heidi (Doc 2, Pos 6); (Doc 3, Pos 1) Seppl (Doc 1, Pos 1); (Doc 2, Pos 1) stopped (Doc 3, Pos 2) stopping off (Doc 1, Pos 3) Search for the phrase Heidi stopped 1 Heidi (Doc 2, 6); (Doc 3, Pos1) 2 stopped (Doc 3, Pos 2) 3 Consecutive positions for Document 3 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

23 Basics Inverted Index Properties Advantages Still efficient storage Search for terms Formulate boolean queries Search for phrases (i. e. terms in a particular order) Information where the term occurred in the document Possible creation of snippets Disadvantages Ranking a bit tricky, but possible Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

24 Basics Inverted Index Term Frequencies Implicit term frequencies By storing the positions where the terms occur in a document, also the information how often it occurs is stored. As there have to be more position markers if the term occurs more often. Problem: Counting needs time Explicit term frequencies Store in the posting list, how often the term occurs in the document Possible to leave the positional information out, if not needed Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

25 Basics Inverted Index Term Frequencies Structure Store in the posting list, how often the term occurs in the document (This is often called Term Frequency (TF)) Possible to leave the positional information out, if not needed Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

26 Basics Inverted Index Properties Advantages Still efficient storage Search for terms Formulate boolean queries Search for phrases (i. e. terms in a particular order) Information where the term occurred in the document Possible creation of snippets Documents can be ranked by the number of terms they contain Disadvantages Takes up space, not only storing the original documents, but also the inverted index Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

27 Steps for building an index Basics Inverted Index Step 1 Build term-document table Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... oman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

28 Basics Inverted Index Steps for building an index Step 1 Build term-document table Step 2 Sort by terms Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... oman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

29 Basics Inverted Index Steps for building an index Step 1 Build term-document table Step 2 Sort by terms Step 3 Add term frequency, multiple entries from single document get merged Correspondingly Get the positions of the terms in the documents Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... oman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

30 Basics Inverted Index Steps for building an index Step 1 Build term-document table Step 2 Sort by terms Step 3 Add term frequency, multiple entries from single document get merged Correspondingly Get the positions of the terms in the documents Step 4 : Result is split into dictionary file and description file. Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

31 Basics Inverted Index Preprocessing of Text To make a text indexable, a couple of steps are necessary: 1 Detect the language of the text 2 Detect sentence boundaries 3 Detect term (word) boundaries 4 Stemming and normalization 5 Remove stop words Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

32 Basics Inverted Index Language Detection Letter Frequencies Percentage of occurrence Letter English German A E I O U N-Gram Frequencies Better more elaborate methods use statistics over more than one letter, e. g. statistics over two, three or even more consecutive letters. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

33 Basics Inverted Index Sentence Detection First approach Every period marks the end of a sentence Problem Utilize other features Periods also mark abbreviations, decimal points, -addresses, etc. Is the next letter capitalized? Is the term in front of the period a known abbreviation? Is the period surround by digits?... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

34 Basics Inverted Index Tokenization Problem Goal Split sentences into tokens Throw away punctuations Possible Pit Falls White spaces are no safe delimiter for word boundaries(e. g. New York) Non-alphanumerical characters can separate words, but need not (e. g. co-worker) Different forms (e. g. white space vs. whitespace) German compound nouns (e. g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

35 Basics Inverted Index Tokenization Solutions Supervised Learning Train a model with annotated training data, use the trained model to tokenize unknown text Hidden-Markov-Models and conditional random fields are commonly used Dictionary Approach Build a dictionary (i. e. list ) of tokens Go over the sentence and always take the longest fitting token (greedy algorithm!) Remark: Works well for Asian languages without white spaces and short words. Problematic for European languages Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

36 Basics Inverted Index Normalization Some definitions Token Sequence of characters representing a useful semantic unit, instance of a type Task Type Common concept for tokens, element of the vocabulary Term Type that is stored in the dictionary of an index 1 Map all possible tokens to the corresponding type 2 Store all representing terms of one type in the dictionary Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

37 Example Basics Inverted Index Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

38 Basics Inverted Index Posible Ways for Normalization Ignore cases Rule based removal of hyphens, periods, white spaces, accents, diacritics (Be aware of possible different meaning after removal) Provide a list of all possible representations of a type Map different spelling version by using a phonetic algorithm. Phonetic algorithms translate the spelling into their pronunciation equivalent (e. g. Double Metaphone, Soundex). Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

39 Basics Inverted Index Stemming & Lemmatization Definition Goal of both Reduce different grammatical forms of a word to their common infinitive a Stemming Usage of heuristics to chop off / replace last part of words (Example: Porter s algorithm). Lemmatization Usage of proper grammatical rules to recreate the infinitive. a The common infinitive is the form of a word how it is written in a dictionary. This form is called lemma, hence the name Lemmatization. Examples Map do, doing, done, to commone infinitive do Map digitalizing, digitalized to digital Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

40 Basics Inverted Index Stop Words Definition Extremely common words that appear in nearly every text Problem As stop words are so conmen, their occurrence does not characterize a text Solution Just drop them, i. e. do not put them in the index directory Common stop words Solution Write a list with stop words (List might be topic specific!) Usual suspects: articles (a, an, the), conjunction (and, or, but,...), pre- and postposition (in, for, from), etc. Stop word list Ignore word that are given on a list (black list) Problem Special names and phrases (The Who, Let It Be,...) Solution Make another list... (white list) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

41 Information flow Basics Probabilistic IR The classic search model Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

42 Basics Probabilistic IR Probabilistic Information Retrial Every transformation can hold a possible information loss 1 Task to query 2 Query to document Information loss leads to uncertainty Probability theory can deal with uncertainty Basic idea behind probabilistic IR Model the uncertainty, which originates in the imperfect transformations Use this uncertainty as a measurement how relevant a document is, given a certain query Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

43 Basics Probabilistic IR Probabilistic Information Retrial Models Binary Independence Model Okapi BM25 Bayesian networks for text retrieval Language Models Coming now... Binary: documents and queries represented as binary term incidence vectors Independence: terms are assumed to be independent of each other Takes number of term frequencies and document length into account Parameters to tune influence of term frequencies and document length Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

44 Basics Probabilistic IR Language Models Basic Idea Train a probabilistic model M d for document d, i. e. as many trained models as documents M d ranks documents on how possible it is that the user s query q was created by the document d Results are document models (or documents, respectively) which have a high probability to generate the user s query Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

45 Basics Probabilistic IR Examples for Language Models Successive Probability of query terms P(t 1 t 2... t n ) = P(t 1 )P(t 2 t 1 )P(t 3 t 1 t 2 )... P(t n t 1... t n 1 ) Unigram Languge Model P uni (t 1 t 2... t n ) = P(t 1 )P(t 2 )... P(t n ) Bigram Languge Model P(t 1 t 2... t n ) = P(t 1 )P(t 2 t 1 )P(t 3 t 2 )... P(t n t n 1 ) Probability P(t n ) Probabilities are generated from the term occurrence in the document in question And many more... A lot more, and more complex models are available Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

46 Advanced Concepts Advanced Concepts LSA, Relevance Feedback, Query Processing,... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

47 Advanced Concepts Latent Semantic Indexing Index Construction Going beyond simple indexing Term-document matrices are very large But the number of topics that people talk about is small (in some sense) Clothes, movies, politics,... Can we represent the term-document space by a lower dimensional latent space? Latent Semantic Indexing (LSI) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

48 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking This is taking a 3-D point and projecting it into 2-D ( x, y, z) ( 10, 10, 10) ( x, y) ( 10, 10) The arrow in this picture acts like the overhead projector Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

49 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking We can project from any number of dimensions into any other number of dimensions. Increasing dimensions adds redundant information But sometimes useful Support Vector Machines (kernel methods) do this effectively Latent Semantic Indexing always reduces the number of dimensions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

50 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking Latent Semantic Indexing always reduces the number of dimensions (x,y) ( 10, 10) (x) ( 10 ) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

51 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking Latent Semantic Indexing can project on an arbitrary axis, not just a principal axis Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

52 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking Our documents were just points in an N-dimensional term space We can project them also Brutus MacBeth Brutus MacBeth Othello Othello Hamlet Hamlet Julius Caesar Julius Caesar Antony and Cleopatra Antony and Cleopatra Tempest Antony Tempest Antony Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

53 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing LSI - expected behaviour A term vector that is projected on new vectors may uncover deeper meanings For example transforming the 3 axes of a term matrix from ball, bat and cave to An axis that merges ball and bat An axis that merges bat and cave Should be able to separate differences in meaning of the term bat Bonus: less dimensions is faster Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

54 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition SVD overview For an m n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows: M = UΣV T... U is an m m unitary matrix... Σ is an m n diagonal matrix... V T is an n n unitary matrix The columns of U are orthogonal eigenvectors of AA T The columns of V are orthogonal eigenvectors of A T A Eigenvalues λ 1...λ r of AA T are the eigenvalues of A T A. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

55 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Matrix Decomposition SVD enables lossy compression of a term-document matrix Reduces the dimensionality or the rank Arbitrarily reduce the dimensionality by putting zeros in the bottom right of sigma This is a mathematically optimal way of reducing dimensions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

56 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Reduced SVD If we retain only k singular values, and set the rest to 0, then we don t need the matrix parts in green Then Σ is k k, U is M k, V T is k n, and A k is m n This is referred to as the reduced SVD It is the convenient (space-saving) and usual form for computational applications Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

57 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Approximation Error How good (bad) is this approximation? It s the best possible, measured by the Frobenius norm of the error Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

58 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Reduced SVD Whereas the term-doc matrix A may have m = 50000, n = 10million (and rank close to 50000) We can construct an approximation A 100 with rank 100. Of all rank 100 matrices, it would have the lowest Frobenius error. Latent Semantic Indexing Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

59 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing Application of SVD - LSI From term-document matrix A, we compute the approximation A k. There is a row for each term and a column for each document in A k Thus documents live in a space of k << r dimensions These dimensions are not the original axes Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

60 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing Approximation Error Each row and column of A gets mapped into the k-dimensional LSI space, by the SVD. Claim this is not only the mapping with the best (Frobenius error) approximation to A, but in fact improves retrieval. A query q is also mapped into this space Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

61 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing LSI - Results Similar terms map to similar location in low dimensional space Noise reduction by dimension reduction Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

62 Advanced Concepts Relevance Feedback Relevance Feedback Integrate feedback from the user Given a search result for a user s query... allow the user to rate the retrieved results (good vs. not-so-good match) This information can then be used to re-rank the results to match the user s preference Often automatically inferred from the user s behaviour using the search query logs... and ultimately improve the search experience The query logs (optimally) contain the queries plus the items the user has clicked on (the user s session) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

63 Advanced Concepts Relevance Feedback Relevance Feedback Pseudo relevance feedback No user interaction is needed for the pseudo relevance feedback... the first n results are simply assumed to be a good match As if the users has clicked on the first results and marked them as good match often improves the quality of the search results Pseudo relevance feedback is sometimes also called blind relevance feedback Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

64 Advanced Concepts Relevance Feedback More Like This A common feature of search engines is the more like this search Given a search results... the user wants items similar to one of the presented results This can be seen as: taking a document as query which is a common implementation (but restricting the query to the most discriminative terms) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

65 Advanced Concepts Query Processing Query Reformulation Modify the query, during of after the user interaction Query suggestion & completion (Automatic) query correction (Automatic) query expansion Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

66 Advanced Concepts Query Processing Query Suggestion The user start typing a query... automatically gets suggestions to complete the currently typed word to complete a whole phrase (e.g. the next word) suggest related queries Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

67 Advanced Concepts Query Processing Query Correction Query misspellings Our principal focus here E.g., the query Britni Spears We can either Retrieve documents indexed by the correct spelling, Rewrite the input query (replace, expand), britni spears (britni OR britney) spears Return several suggested alternative queries with the correct spelling - Did you mean Britney Spears? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

68 Query Correction Advanced Concepts Query Processing Isolated word correction Fundamental premise there is a lexicon from which the correct spellings come Basic choices for this A standard lexicon such as Webster s English Dictionary An industry-specific lexicon hand-maintained The lexicon of the indexed corpus E.g., all words on the web All names, acronyms etc. (Including the mis-spellings) The use of a query log Use other users behaviour Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

69 Advanced Concepts Query Processing Query Correction Isolated word correction Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q What s closest? We ll study several alternatives 1 Edit distance (Levenshtein distance) 2 Weighted edit distance 3 n-gram overlap Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

70 Query Correction Advanced Concepts Query Processing Word correction: edit distance Given two strings S 1 and S 2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) Examples From dof to dog is 1 From cat to act is 2 (Just 1 with transpose.) From cat to dog is 3. See for a nice example plus an applet. Notebook with various distances (Levenshtein being the best known): Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

71 Query Correction Advanced Concepts Query Processing Word correction: edit distance Given two strings S 1 and S 2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) Examples From dof to dog is 1 From cat to act is 2 (Just 1 with transpose.) From cat to dog is 3. See for a nice example plus an applet. Notebook with various distances (Levenshtein being the best known): Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

72 Advanced Concepts Query Processing Query Correction Word correction: weighted edit distance As above, but the weight of an operation depends on the character(s) involved Meant to capture OCR or keyboard errors, e.g. m more likely to be mis-typed as n than as q Therefore, replacing m by n is a smaller edit distance than by q This may be formulated as a probability model Requires weight matrix as input Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

73 Advanced Concepts Query Processing Query Correction Word correction: n-gram overlap Enumerate all the n-grams in the query string as well as in the lexicon Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams Threshold by number of matching n-grams Variants weight by keyboard layout, etc. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

74 Advanced Concepts Query Processing Query Correction Word correction: n-gram overlap, example Suppose the text is november Trigrams are: nov ove vem emb mbe ber The query is december Trigrams are: dec ece cem emb mbe ber Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

75 Advanced Concepts Query Processing Query Correction Word correction: n-gram overlap, example Suppose the text is november Trigrams are: nov ove vem emb mbe ber The query is december Trigrams are: dec ece cem emb mbe ber So 3 trigrams overlap (of 6 in each term) How can we turn this into a normalised measure of overlap? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

76 Advanced Concepts Query Processing Query Correction n-gram overlap, Jaccard Coefficient A commonly-used measure of overlap Let X and Y be two sets; then the Jaccard Coefficient is Equals 1 when X and Y have the same elements Equals 0 when they are disjoint X and Y don t have to be of the same size Always assigns a number between 0 and 1 Now threshold to decide if you have a match E.g., if JC > 0.8 declare a match Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

77 Advanced Concepts Query Processing Query Expansion The user has entered a query... automatically (related) words are added to the query Global query expansion - only use the query (plus corpus or background knowledge) Local query expansion - conduct the search and analyse the search results Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

78 Global Query Expansion Advanced Concepts Query Processing Uses only the query Example: Use of a thesaurus Query: [buy car] Use synonyms from the thesaurus Expanded query: [(buy OR purchase) (car OR auto)] Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

79 Advanced Concepts Query Processing Local query expansion Uses the query plus the search results... conceptually similar to the pseudo relevance feedback Look for terms in the first n search results and add them to the query Re-run the query with the added terms Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

80 Evaluation Evaluation Assess the quality of search Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

81 Evaluation Overview: Evaluation Two basic questions / goals Are the retrieved document relevant for the user s information need? Have all relevant document been retrieved? In practice these two goal contradict each other. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

82 Evaluation Precision & Recall Evaluation Scenario Fixed document collection Number of queries Set of documents marked as relevant for each query Main Evaluation Metrics Precision = Recall = #(relevant items retrieved) #(retrieved items) #(relevant items retrieved) #(relevant items) F 1 = 2PrecisionRecall Precision+Recall Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

83 Precision & Recall Evaluation Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

84 Evaluation Advanced Evaluation Measures Mean Average Precision - MAP... mean of all average precision of multiple queries Map = 1 Q q Q AP q AP q = 1 n n k=1 Precision(k) (Normalised) discounted cumulative gain (ndcg)... where there is a reference ranking the relevant results Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

85 Applications of IR Selected Applications of IR Classification & Recommender Systems Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

86 Applications of IR Classification Overview Classification Supervised Learning: labelled training data is needed (effort for labelling) Classification problems: classify an example into a given set of categories Algorithm learns model from the labelled training set and applies this model on unknown data Used in real applications Clustering Unsupervised Learning Find groups in the data, where the similarity within the group is high and the similarity between groups is low Hardly ever used in real applications Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

87 Applications of IR Classification Supervised learning Given examples of a function (x, f (x)) Predict function f (x) for new examples x Depending on f (x) we have: 1 Classification if f (x) is a discrete function 2 Regression if f (x) is a continuous function 3 Probability estimation if f (x) is a probability distribution Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

88 Applications of IR Classification Classification Definition Given: Training examples (x, f (x)) for some unknown discrete function f Find: A good approximation for f x is data, e. g. a vector, text documents, s, words in text, etc. Classification approaches Logistic regression Decision trees Support vector machines Neural networks Statistical inference such as Bayesian learning Nearest neighbour algorithms (Here: knn-algorithm) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

89 General knn-algorithm Applications of IR Classification 1 For each desired class: at least one example must be provided 2 Take a unclassified document: Look at its k nearest neighbours 3 Assign the unknown element to the class with the most neighbours in it 4 Goto 2 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

90 Document Classification Applications of IR Classification 1 For each desired class: at least one document must be provided 2 Find k nearest neighbours of a document: 1 Take all terms of a document 2 Combine them to a query (e. g. with a logical OR) 3 Take the first k results form an index 3 Assign the unknown document to the class with the most neighbours in it 4 Goto 2 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

91 Applications of IR Content-based Recommender systems Overview Recommender systems should provide usable suggestion to users Recommender systems should show new, unknown but similar documents Recommender systems can utilize different information sources Content-based Recommender Finds similar documents on the basis of document content Collaborative Filering Employs user rating to find new and useful documents Knowledge-based Recommender Makes decisions based on pre-programmed rules Hybrid Recommender Combination of two or three other approaches Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

92 Applications of IR Types of Content-based Recommender Content-based Recommender systems Document-to-Document Recommender Document-to-User Recommender Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

93 Applications of IR Content-based Recommender systems Building a Recommender by Using an Index Document-to-Document Recommender 1 Take the terms from a document 2 Construct a (weighted) query with them 3 Query the index 4 Results are possible candidates for recommending Document-to-User Recommender 1 Create a user model from Terms typed by the user Documents viewed by the user Simplest user model is just a set of terms 2 Construct a (weighted) query with them 3 Query the index 4 Results are possible candidates for recommending Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

94 Solutions Solutions Open-source and commercial IR solutions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

95 Solutions Open Source Solutions Open Source Search Engine Apache Lucene Very popular, high quality Backbone of many other search engines, e.g. Apache Solr, ElasticSearch Lucene is the core search engine Solr provides a service & configuration infrastructure (plus a few extra features) Implemented in Java, bindings for many other programming languages Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

96 Solutions Open Source Solutions Solr Index Management API HTTP Server Various formats (XML, binary, JavaScript,...) Document life-cycle There is no update Delete (done automatically by Solr) Insert Implications An unique id is necessary Use batch updates Commit, rollback (and optimize) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

97 Solutions Open Source Solutions Solr Text Processing Scope During indexing & query Tokenization Split text into tokens Lower-case alignment Stemming (e.g. ponies, pony poni, triplicate triplic,...) Synonyms (via Thesaurus) Stop-word filtering Multi-word splitting (e.g. Wi-Fi Wi, Fi) n-grams, soundex, umlauts Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

98 Solutions Open Source Solutions Solr Query Processing Query parsers Lucene query parser (rich syntax) AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query Boosting of individual parts Example: ((boltzmann OR schroedinger) NOT einstein) Dismax query parser No query syntax Searches over multiple fields (separate boost for each field) Configure the amount of terms to be mandatory Distance between terms is used for ranking (phrase boosting) Dismax is a good starting point, but may become expensive Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

99 Solutions Open Source Solutions Search Features Query filter Additional query No impact on ranking Results are cached Boosting query Only in Dismax Query elevation Fix certain queries Request handler Pre-define clauses Invariants Function queries Score is computed on field values Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

100 Solutions Open Source Solutions Solr Search Result Ranking Relevance Sort on field value (only single term per document) Available data & features Sequence of IDs & score Stored fields Snippets (plus highlighting) Facets Count the search hits Types: field value, dates, queries Sort, prefix,... Could be used for term suggestion (aka. query suggestion) Field collapsing (grouping) Spell checking (did-you-mean) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

101 Solutions Open Source Solutions Additional Solr Features Query by Example More like this Stats Per field Min, max, sum, missing,... Admin-GUI Webapp to troubleshoot queries Browse schema JMX Read properties & statistics Can be accessed remotely Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

102 Solutions Open Source Solutions Solr Integration Deployment Within a web application server Embedded Monitor Log output Access Various language bindings Java, Ruby, JavaScript, PHP,... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

103 Solutions Open Source Solutions Other Open-Source Search Engines Xapian Support for many (programming) languages Used by many open-source projects Lemur & Indri Language models Academic background Terrier Many algorithms Academic background Whoosh for Pythonistas Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

104 What customers want Solutions Commercial Solutions Indexing pipeline And also: user interface, access control mechanisms, logging, etc. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

105 Solutions Commercial Solutions Commercial Search Engines Autonomy Sinequa Attivio Grantees not to truncate documents Offers a wide range on source connectivity Expert Search Creates new views, i. e. do not only show result list, but combine results to new unit Use SQL statements in queries IntraFIND Lucene-based And many, many more... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

106 Solutions Commercial Solutions The End Next: Preprocessing Platform & Practical Applications Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

Recap of last time CS276A Information Retrieval

Recap of last time CS276A Information Retrieval Index compression Space estimation Lecture 4 This lecture Tolerant retrieval Wild-card queries Spelling correction Soundex Wild-card queries Wild-card queries: