Information Retrieval and Applications

1 Information Retrieval and Applications
Knowledge Discovery and Data Mining 2 (VU)
Roman Kern, Heimo Gursch
KTI, TU Graz
Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101

2 Outline
1 Basics: Inverted Index, Probabilistic IR
2 Advanced Concepts: Latent Semantic Indexing, Relevance Feedback, Query Processing
3 Evaluation
4 Applications of IR: Classification, Content-based Recommender Systems
5 Solutions: Open Source Solutions, Commercial Solutions

3 Basics: Inverted Index, Text Preprocessing, Language-based & Probabilistic IR

4 Basics: Functional features
Search for a single keyword
Search for multiple keywords
Search for phrases, i.e. keywords in a particular order
Boolean combination of keywords (AND, OR, NOT)
Rank results by number of keyword occurrences
Find the positions of keywords in a document
Compose snippets of text around the keyword positions
Motivation: how can these goals be achieved?

5 Basics: Structured and Semi-Structured Data
Structured data
  Format and type of information are known
  IR task: normalize representations between different sources
  Example: SQL databases
Semi-structured data
  Combination of structured and unstructured data
  IR task: extract information from the unstructured part and combine it with the structured information
  Example: e-mail, document with meta-data

6 Basics: Unstructured Data
Unstructured data
  Format of information is not known, information is hidden in (human-readable) text
  IR task: extract any information at all
  Example: blog posts, Wikipedia articles
This lecture focuses on unstructured data.

7 Basics: Simple Approach
Sequential search
  Advantage: simple
  Disadvantages: slow and hard to scale
  Usage: small texts, frequently changing text, not enough space available for an index
  Example: grep on the UNIX command line
Not feasible for large texts. A different approach is needed. But how?

8 Basics: Definition
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
The word "term" is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases, ...).

9 Basics: Assumptions
Basic assumptions of Information Retrieval
  Collection: fixed set of documents
  Goal: retrieve documents with information that is relevant to the user's information need and helps the user complete a task

10 Basics: Overview of Indexed Searching
Split the process
  Indexing: convert all input documents into a data structure suitable for fast searching
  Searching: take an input query and map it onto the pre-processed data structure to retrieve the matching documents
Inverted index
  Reverses the relationship between documents and contained words (terms)
  Central data structure of a search engine

11 Basics: Three Example Documents
Document 1: Seppl was stopping off in Hawaii.
Document 2: Seppl is a co-worker of Heidi.
Document 3: Heidi stopped going to Hawaii.

12 Basics, Inverted Index: Inverted Index Matrix
Term-document incidence matrix
Matrix entries
  0: the document does not contain the corresponding term
  1: the document does contain the corresponding term

13 Basics, Inverted Index: Example Matrix
Term           Document 1   Document 2   Document 3
co-worker      0            1            0
going          0            0            1
Hawaii         1            0            1
Heidi          0            1            1
Seppl          1            1            0
stopped        0            0            1
stopping off   1            0            0

14 Basics, Inverted Index: Properties of the Inverted Index Matrix
Vectorization of documents
  Column: words of a document
  Row: all documents in which a word occurs
Possible queries
  Documents containing all terms of a query: take the row vectors for each term and combine them with a bitwise AND
  Documents containing any term of a query: take the row vectors for each term and combine them with a bitwise OR
  Documents not containing a term: use a bitwise NOT

15 Basics, Inverted Index: Examples
All documents containing the term Hawaii
  Take the row vector of Hawaii and look for the ones: Document 1 and Document 3
All documents containing the terms Seppl and Heidi
  1 Take the rows for the terms Seppl and Heidi
  2 Combine them with a bitwise AND
  3 The position of the one in the result vector indicates the matching document: Document 2
More complex queries
  Use Boolean algebra on the row vectors
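
The bitwise queries on the incidence matrix can be sketched in a few lines of Python. This is only an illustration: the rows are derived from the three example documents, each row is stored as an integer whose leftmost bit stands for Document 1, and the names `INCIDENCE`, `ALL_DOCS` and `matching_docs` are made up for this sketch.

```python
# Term-document incidence rows for the three example documents.
# The leftmost bit stands for Document 1, the rightmost for Document 3.
DOCS = ["Document 1", "Document 2", "Document 3"]

INCIDENCE = {
    "co-worker":    0b010,
    "going":        0b001,
    "Hawaii":       0b101,
    "Heidi":        0b011,
    "Seppl":        0b110,
    "stopped":      0b001,
    "stopping off": 0b100,
}

ALL_DOCS = (1 << len(DOCS)) - 1  # mask 0b111, needed to keep NOT within 3 bits


def matching_docs(row: int) -> list[str]:
    """Translate a result bit vector back into document names."""
    n = len(DOCS)
    return [DOCS[i] for i in range(n) if row & (1 << (n - 1 - i))]


both = INCIDENCE["Seppl"] & INCIDENCE["Heidi"]    # AND
either = INCIDENCE["Seppl"] | INCIDENCE["Heidi"]  # OR
not_hawaii = ~INCIDENCE["Hawaii"] & ALL_DOCS      # NOT, masked to 3 bits
```

Calling `matching_docs(both)` yields only Document 2, in line with the example on the slide.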

16 Basics, Inverted Index: What we have so far...
Advantages
  Search for terms
  Formulate Boolean queries
  Works very fast
Disadvantages
  Cannot search for phrases (i.e. terms in a particular order)
  No information about where the term occurred in the document
  No snippets
  No ranking possible, documents either match or do not match
  Matrix is usually very sparse, which can waste memory

17 Basics, Inverted Index: Make the inverted index memory efficient
Dictionary
  Store a sorted list of all terms in the documents. (Sorting is not required for the index to work, but it is necessary to find terms fast.)
Postings
  For every term, store a list of the documents in which the term occurs
Posting list
  Linked lists are generally preferred to arrays
  Dynamic space allocation
  Insertion of terms into documents is easy
  Space overhead of pointers

18 Basics, Inverted Index: Example
Dictionary     Postings
co-worker      Document 2
going          Document 3
Hawaii         Document 1, Document 3
Heidi          Document 2, Document 3
Seppl          Document 1, Document 2
stopped        Document 3
stopping off   Document 1

19 Basics, Inverted Index: Boolean Retrieval
AND: intersection of the postings for two or more terms
OR: union of the postings for two or more terms
NOT: all documents not in the postings of a term
Make it efficient
  Sort the dictionary
  Sort the postings by document ID
But: tricky and long Boolean combinations can still get difficult!
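
Why sorted postings help can be seen in a small Python sketch. It assumes postings are plain sorted lists of integer document IDs; the function names and the `INDEX` dictionary are illustrative, not part of the lecture. AND walks both lists in step, so the cost is linear in the combined length of the two lists.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """AND: walk two sorted postings lists in step and keep common IDs."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def union(p1: list[int], p2: list[int]) -> list[int]:
    """OR: all IDs that occur in either list, kept sorted."""
    return sorted(set(p1) | set(p2))


def complement(p: list[int], all_ids: list[int]) -> list[int]:
    """NOT: all known document IDs that are missing from the postings."""
    present = set(p)
    return [doc_id for doc_id in all_ids if doc_id not in present]


# Postings of the example index, documents numbered 1 to 3.
INDEX = {"Hawaii": [1, 3], "Heidi": [2, 3], "Seppl": [1, 2]}
ALL_IDS = [1, 2, 3]
```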

20 Basics, Inverted Index: Properties
Prerequisite: every document needs a unique, sortable ID, e.g. a unique integer
Advantages
  Very efficient storage
  Search for terms
  Formulate Boolean queries
Disadvantages
  Cannot search for phrases (i.e. terms in a particular order)
  No information about where the term occurred in the document
  No snippets
  No ranking possible, documents either match or do not match

21 Basics, Inverted Index: Positional Information
Store not only that a document contains a term, but also where the term occurs in the document.
Dictionary   Postings
term 1       Document ID, Position(s) in document
term 2       Document ID, Position(s) in document
term n       Document ID, Position(s) in document
Search for phrases (i.e. consecutive words)
  1 Get each word from the dictionary and retrieve its posting list
  2 If the document IDs are equal for two or more words, search for consecutive positions

22 Basics, Inverted Index: Example
Dictionary     Postings
co-worker      (Doc 2, Pos 4)
going          (Doc 3, Pos 3)
Hawaii         (Doc 1, Pos 5); (Doc 3, Pos 5)
Heidi          (Doc 2, Pos 6); (Doc 3, Pos 1)
Seppl          (Doc 1, Pos 1); (Doc 2, Pos 1)
stopped        (Doc 3, Pos 2)
stopping off   (Doc 1, Pos 3)
Search for the phrase "Heidi stopped"
  1 Heidi: (Doc 2, Pos 6); (Doc 3, Pos 1)
  2 stopped: (Doc 3, Pos 2)
  3 Consecutive positions in Document 3
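
The two phrase-search steps can be sketched as follows. The index is the example above, written as a nested dictionary `term -> {doc_id: [positions]}`; this layout and the name `phrase_search` are assumptions made for the sketch, not a prescribed storage format.

```python
POSITIONAL_INDEX = {
    "co-worker":    {2: [4]},
    "going":        {3: [3]},
    "Hawaii":       {1: [5], 3: [5]},
    "Heidi":        {2: [6], 3: [1]},
    "Seppl":        {1: [1], 2: [1]},
    "stopped":      {3: [2]},
    "stopping off": {1: [3]},
}


def phrase_search(terms: list[str], index: dict) -> list[int]:
    """Return sorted IDs of documents that contain the terms consecutively."""
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    # Step 1: keep only documents that contain every term.
    common = set(postings[0])
    for posting in postings[1:]:
        common &= set(posting)
    hits = []
    for doc_id in sorted(common):
        # Step 2: the phrase starts at `start` if term k occurs at start + k.
        if any(
            all(start + k in postings[k][doc_id] for k in range(1, len(terms)))
            for start in postings[0][doc_id]
        ):
            hits.append(doc_id)
    return hits
```

For example, `phrase_search(["Heidi", "stopped"], POSITIONAL_INDEX)` finds Document 3, while the reversed phrase finds nothing.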

23 Basics, Inverted Index: Properties
Advantages
  Still efficient storage
  Search for terms
  Formulate Boolean queries
  Search for phrases (i.e. terms in a particular order)
  Information about where the term occurred in the document
  Snippets can be created
Disadvantages
  Ranking is a bit tricky, but possible

24 Basics, Inverted Index: Term Frequencies
Implicit term frequencies
  Storing the positions where a term occurs in a document also stores how often it occurs, because a term that occurs more often has more position markers
  Problem: counting needs time
Explicit term frequencies
  Store in the posting list how often the term occurs in the document
  The positional information can be left out if it is not needed

25 Basics, Inverted Index: Term Frequencies, Structure
Store in the posting list how often the term occurs in the document (this is often called Term Frequency, TF)
The positional information can be left out if it is not needed
Dictionary   Postings
term 1       Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2       Doc ID, TF, Position(s);
term n       Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...

26 Basics, Inverted Index: Properties
Advantages
  Still efficient storage
  Search for terms
  Formulate Boolean queries
  Search for phrases (i.e. terms in a particular order)
  Information about where the term occurred in the document
  Snippets can be created
  Documents can be ranked by the number of terms they contain
Disadvantages
  Takes up space: the inverted index has to be stored in addition to the original documents

30 Basics, Inverted Index: Steps for building an index
Step 1: Build the term-document table
Step 2: Sort by terms
Step 3: Add the term frequency; multiple entries from a single document get merged. Correspondingly, get the positions of the terms in the documents
Step 4: The result is split into a dictionary file and a description file
Dictionary   Postings
term 1       Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
term 2       Doc ID, TF, Position(s);
term n       Doc ID, TF, Position(s); Doc ID, TF, Position(s); ...
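
A minimal in-memory sketch of the four steps, assuming the documents are already tokenized (with "stopping off" treated as one token, as in the example index) and that stop words are kept. The function name `build_index` and the tuple layout `(doc_id, tf, positions)` are illustrative choices; a real indexer would write the dictionary and the postings to separate files.

```python
from collections import defaultdict

# The three example documents, pre-tokenized by hand for this sketch.
EXAMPLE_DOCS = {
    1: ["Seppl", "was", "stopping off", "in", "Hawaii"],
    2: ["Seppl", "is", "a", "co-worker", "of", "Heidi"],
    3: ["Heidi", "stopped", "going", "to", "Hawaii"],
}


def build_index(docs: dict[int, list[str]]):
    # Step 1: term-document table, one (term, doc_id, position) row per token.
    table = [
        (term, doc_id, pos)
        for doc_id, tokens in docs.items()
        for pos, term in enumerate(tokens, start=1)
    ]
    # Step 2: sort by term, then by document ID, then by position.
    table.sort()
    # Step 3: merge rows of the same term and document, collecting positions.
    merged = defaultdict(lambda: defaultdict(list))
    for term, doc_id, pos in table:
        merged[term][doc_id].append(pos)
    # Step 4: split into a sorted dictionary and the postings.
    # The term frequency is simply the number of stored positions.
    dictionary = sorted(merged)
    postings = {
        term: [(doc_id, len(positions), positions)
               for doc_id, positions in merged[term].items()]
        for term in dictionary
    }
    return dictionary, postings


dictionary, postings = build_index(EXAMPLE_DOCS)
```

With this input, `postings["Hawaii"]` contains one entry each for Document 1 and Document 3, both at position 5, which agrees with the positional example.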

31 Basics, Inverted Index: Preprocessing of Text
To make a text indexable, a couple of steps are necessary:
  1 Detect the language of the text
  2 Detect sentence boundaries
  3 Detect term (word) boundaries
  4 Stemming and normalization
  5 Remove stop words

32 Basics, Inverted Index: Language Detection
Letter frequencies
  [Table: percentage of occurrence of the letters A, E, I, O and U in English and German]
N-gram frequencies
  Better, more elaborate methods use statistics over more than one letter, e.g. over two, three or even more consecutive letters

33 Basics, Inverted Index: Sentence Detection
First approach: every period marks the end of a sentence
Problem: periods also mark abbreviations, decimal points, e-mail addresses, etc.
Utilize other features
  Is the next letter capitalized?
  Is the term in front of the period a known abbreviation?
  Is the period surrounded by digits?
  ...

34 Basics, Inverted Index: Tokenization, Problem
Goal
  Split sentences into tokens
  Throw away punctuation
Possible pitfalls
  White space is not a safe delimiter for word boundaries (e.g. New York)
  Non-alphanumerical characters can separate words, but need not (e.g. co-worker)
  Different forms (e.g. white space vs. whitespace)
  German compound nouns (e.g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company)

35 Basics, Inverted Index: Tokenization, Solutions
Supervised learning
  Train a model with annotated training data, use the trained model to tokenize unknown text
  Hidden Markov Models and conditional random fields are commonly used
Dictionary approach
  Build a dictionary (i.e. a list) of tokens
  Go over the sentence and always take the longest fitting token (greedy algorithm!)
  Remark: works well for Asian languages without white space and with short words, problematic for European languages
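
The dictionary approach can be sketched as a greedy longest-match loop. The function name, the fallback of emitting a single character when nothing in the dictionary fits, and the toy dictionary are all assumptions of this sketch.

```python
def greedy_tokenize(text: str, dictionary: set[str]) -> list[str]:
    """Split text by always taking the longest dictionary entry that fits."""
    max_len = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy dictionary, chosen to show where greediness goes wrong.
TOY_DICTIONARY = {"the", "them", "theme", "me", "men", "end"}
```

On the input "theend" the sketch returns "the" and "end". On "themend" it greedily takes "theme" and is left with the unmatched letters "n" and "d", although "them" followed by "end" would cover the whole string. This is the kind of failure that makes the approach problematic when words are long and overlap.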

36 Basics, Inverted Index: Normalization
Some definitions
  Token: sequence of characters representing a useful semantic unit, instance of a type
  Type: common concept for tokens, element of the vocabulary
  Term: type that is stored in the dictionary of an index
Task
  1 Map all possible tokens to the corresponding type
  2 Store all representing terms of one type in the dictionary

38 Basics, Inverted Index: Possible Ways for Normalization
Ignore case
Rule-based removal of hyphens, periods, white space, accents and diacritics (be aware that the meaning may change after removal)
Provide a list of all possible representations of a type
Map different spelling versions by using a phonetic algorithm. Phonetic algorithms translate the spelling into its pronunciation equivalent (e.g. Double Metaphone, Soundex).

39 Basics, Inverted Index: Stemming & Lemmatization
Definition
  Goal of both: reduce different grammatical forms of a word to their common infinitive. (The common infinitive is the form of a word as it is written in a dictionary. This form is called the lemma, hence the name lemmatization.)
  Stemming: use heuristics to chop off or replace the last part of words (example: Porter's algorithm)
  Lemmatization: use proper grammatical rules to recreate the infinitive
Examples
  Map do, doing, done to the common infinitive do
  Map digitalizing, digitalized to digital

40 Basics, Inverted Index: Stop Words
Definition: extremely common words that appear in nearly every text
Problem: as stop words are so common, their occurrence does not characterize a text
Solution: just drop them, i.e. do not put them in the index
Common stop words
  Write a list of stop words (the list might be topic specific!)
  Usual suspects: articles (a, an, the), conjunctions (and, or, but, ...), pre- and postpositions (in, for, from), etc.
Stop word list
  Ignore words that are given on a list (black list)
  Problem: special names and phrases (The Who, Let It Be, ...)
  Solution: make another list... (white list)

41 Basics, Probabilistic IR: Information flow
[Figure: the classic search model]

42 Basics, Probabilistic IR: Probabilistic Information Retrieval
Every transformation can cause a loss of information
  1 Task to query
  2 Query to document
Information loss leads to uncertainty
Probability theory can deal with uncertainty
Basic idea behind probabilistic IR
  Model the uncertainty that originates in the imperfect transformations
  Use this uncertainty as a measure of how relevant a document is, given a certain query

43 Basics, Probabilistic IR: Probabilistic Information Retrieval, Models
Binary Independence Model
  Binary: documents and queries are represented as binary term incidence vectors
  Independence: terms are assumed to be independent of each other
Okapi BM25
  Takes term frequencies and document length into account
  Parameters tune the influence of term frequencies and document length
Bayesian networks for text retrieval
Language Models (coming now...)

44 Basics, Probabilistic IR: Language Models
Basic idea
  Train a probabilistic model M_d for each document d, i.e. as many trained models as documents
  M_d ranks documents by how likely it is that the user's query q was generated by the document d
  Results are document models (or documents, respectively) that have a high probability of generating the user's query

45 Basics, Probabilistic IR: Examples for Language Models
Successive probability of query terms
  $P(t_1 t_2 \dots t_n) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_1 t_2) \dots P(t_n \mid t_1 \dots t_{n-1})$
Unigram language model
  $P_{uni}(t_1 t_2 \dots t_n) = P(t_1) P(t_2) \dots P(t_n)$
Bigram language model
  $P(t_1 t_2 \dots t_n) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_2) \dots P(t_n \mid t_{n-1})$
Probability $P(t_n)$
  Probabilities are generated from the term occurrences in the document in question
And many more...
  Many more, and more complex, models are available
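
As a rough illustration of the unigram case, the sketch below estimates each P(t) as the relative frequency of t in one document and multiplies the probabilities of the query terms. The function names and the toy token list are assumptions. The sketch applies no smoothing, so a single unseen query term drives the score to zero; the slides do not say how this is handled.

```python
from collections import Counter


def unigram_model(tokens: list[str]) -> dict[str, float]:
    """Estimate P(t) as the relative frequency of t in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def query_likelihood(query: list[str], model: dict[str, float]) -> float:
    """P_uni(t_1 ... t_n) = P(t_1) * P(t_2) * ... * P(t_n), without smoothing."""
    probability = 1.0
    for term in query:
        probability *= model.get(term, 0.0)
    return probability


# Hypothetical toy document: P(hawaii) = 0.5, P(heidi) = P(seppl) = 0.25.
doc_tokens = ["hawaii", "heidi", "hawaii", "seppl"]
model = unigram_model(doc_tokens)
score = query_likelihood(["hawaii", "heidi"], model)  # 0.5 * 0.25
```

Ranking a collection then amounts to building one such model per document and sorting the documents by their score for the query.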

46 Advanced Concepts: LSA, Relevance Feedback, Query Processing, ...

47 Advanced Concepts, Latent Semantic Indexing: Index Construction
Going beyond simple indexing
  Term-document matrices are very large
  But the number of topics that people talk about is small (in some sense): clothes, movies, politics, ...
  Can we represent the term-document space by a lower-dimensional latent space?
  Latent Semantic Indexing (LSI)

48 Advanced Concepts, Latent Semantic Indexing: Introduction
Mathematically speaking
  This is taking a 3-D point and projecting it into 2-D: (x, y, z) = (10, 10, 10) is mapped to (x, y) = (10, 10)
  The arrow in the picture acts like an overhead projector

49 Advanced Concepts, Latent Semantic Indexing: Introduction
Mathematically speaking
  We can project from any number of dimensions into any other number of dimensions
  Increasing dimensions adds redundant information, but is sometimes useful: Support Vector Machines (kernel methods) do this effectively
  Latent Semantic Indexing always reduces the number of dimensions

50 Advanced Concepts, Latent Semantic Indexing: Introduction
Mathematically speaking
  Latent Semantic Indexing always reduces the number of dimensions, e.g. (x, y) = (10, 10) is mapped to (x) = (10)

51 Advanced Concepts, Latent Semantic Indexing: Introduction
Mathematically speaking
  Latent Semantic Indexing can project onto an arbitrary axis, not just a principal axis

52 Advanced Concepts, Latent Semantic Indexing: Introduction
Mathematically speaking
  Our documents were just points in an N-dimensional term space
  We can project them as well
  [Figure labels, each shown twice: Brutus, MacBeth, Othello, Hamlet, Julius Caesar, Antony and Cleopatra, Tempest, Antony]

53 Advanced Concepts, Latent Semantic Indexing: LSI, expected behaviour
A term vector that is projected onto new vectors may uncover deeper meanings
For example, transforming the 3 axes of a term matrix from ball, bat and cave to
  An axis that merges ball and bat
  An axis that merges bat and cave
Should be able to separate the different meanings of the term bat
Bonus: fewer dimensions are faster to process

54 Advanced Concepts, Latent Semantic Indexing: Singular Value Decomposition, overview
For an $m \times n$ matrix $A$ of rank $r$ there exists a factorization (Singular Value Decomposition, SVD) as follows:
  $A = U \Sigma V^T$
  $U$ is an $m \times m$ unitary matrix
  $\Sigma$ is an $m \times n$ diagonal matrix
  $V^T$ is an $n \times n$ unitary matrix
The columns of $U$ are orthogonal eigenvectors of $A A^T$
The columns of $V$ are orthogonal eigenvectors of $A^T A$
The eigenvalues $\lambda_1 \dots \lambda_r$ of $A A^T$ are also the eigenvalues of $A^T A$

55 Advanced Concepts, Latent Semantic Indexing: Singular Value Decomposition, matrix decomposition
SVD enables lossy compression of a term-document matrix
  Reduces the dimensionality or the rank
  The dimensionality can be reduced arbitrarily by putting zeros in the bottom right of $\Sigma$
  This is a mathematically optimal way of reducing dimensions

56 Advanced Concepts, Latent Semantic Indexing: Singular Value Decomposition, reduced SVD
If we retain only $k$ singular values and set the rest to 0, the matrix parts that only meet those zeros (shown in green on the slide) are no longer needed
Then $\Sigma$ is $k \times k$, $U$ is $m \times k$, $V^T$ is $k \times n$, and $A_k$ is $m \times n$
This is referred to as the reduced SVD
It is the convenient (space-saving) and usual form for computational applications

57 Advanced Concepts, Latent Semantic Indexing: Singular Value Decomposition, approximation error
How good (bad) is this approximation?
It's the best possible, measured by the Frobenius norm of the error
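
The optimality claim can be written out as follows. The notation is introduced here for illustration and is not taken from the slides: $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ are the singular values of $A$ (the square roots of the eigenvalues $\lambda_i$ of $A A^T$), and $u_i$, $v_i$ are the corresponding columns of $U$ and $V$.

```latex
\[
  \|M\|_F = \sqrt{\sum_{i}\sum_{j} M_{ij}^2},
  \qquad
  A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T,
\]
\[
  \min_{X \,:\, \operatorname{rank}(X) \le k} \|A - X\|_F
  \;=\; \|A - A_k\|_F
  \;=\; \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}.
\]
```

In words: the error of the rank-$k$ approximation is determined entirely by the singular values that were set to zero, so dropping the smallest ones loses the least.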

58 Advanced Concepts, Latent Semantic Indexing: Singular Value Decomposition, reduced SVD
The term-document matrix $A$ may have $m = 50{,}000$ and $n = 10$ million (and rank close to 50,000)
We can construct an approximation $A_{100}$ with rank 100
Of all rank-100 matrices, it has the lowest Frobenius error
This leads to Latent Semantic Indexing

59 Advanced Concepts, Latent Semantic Indexing: Application of SVD, LSI
From the term-document matrix $A$, we compute the approximation $A_k$
There is a row for each term and a column for each document in $A_k$
Thus documents live in a space of $k \ll r$ dimensions
These dimensions are not the original axes

60 Advanced Concepts, Latent Semantic Indexing: Approximation error
Each row and column of $A$ gets mapped into the $k$-dimensional LSI space by the SVD
Claim: this is not only the mapping with the best (Frobenius error) approximation to $A$, but it in fact improves retrieval
A query $q$ is also mapped into this space

61 Advanced Concepts, Latent Semantic Indexing: LSI, results
Similar terms map to similar locations in the low-dimensional space
Noise reduction by dimension reduction

62 Advanced Concepts, Relevance Feedback: Relevance Feedback
Integrate feedback from the user
  Given a search result for a user's query...
  ... allow the user to rate the retrieved results (good vs. not-so-good match)
  This information can then be used to re-rank the results to match the user's preference
  ... and ultimately improve the search experience
Feedback is often inferred automatically from the user's behaviour using the search query logs
  The query logs (optimally) contain the queries plus the items the user has clicked on (the user's session)

63 Advanced Concepts, Relevance Feedback: Pseudo relevance feedback
No user interaction is needed for pseudo relevance feedback...
  ... the first n results are simply assumed to be a good match
  As if the user had clicked on the first results and marked them as a good match
  This often improves the quality of the search results
Pseudo relevance feedback is sometimes also called blind relevance feedback

64 Advanced Concepts, Relevance Feedback: More Like This
A common feature of search engines is the "more like this" search
Given a search result...
  ... the user wants items similar to one of the presented results
This can be seen as taking a document as the query
  ... which is a common implementation (but restricting the query to the most discriminative terms)

65 Advanced Concepts, Query Processing: Query Reformulation
Modify the query during or after the user interaction
  Query suggestion & completion
  (Automatic) query correction
  (Automatic) query expansion

66 Advanced Concepts, Query Processing: Query Suggestion
The user starts typing a query...
  ... and automatically gets suggestions
  to complete the currently typed word
  to complete a whole phrase (e.g. the next word)
  to try related queries

67 Advanced Concepts, Query Processing: Query Correction
Query misspellings are our principal focus here, e.g. the query Britni Spears
We can either
  Retrieve documents indexed by the correct spelling,
  Rewrite the input query (replace, expand), e.g. britni spears becomes (britni OR britney) spears,
  Return several suggested alternative queries with the correct spelling: Did you mean Britney Spears?

68 Advanced Concepts, Query Processing: Query Correction, isolated word correction
Fundamental premise: there is a lexicon from which the correct spellings come
Basic choices for this lexicon
  A standard lexicon such as Webster's English Dictionary
  An industry-specific lexicon, hand-maintained
  The lexicon of the indexed corpus, e.g. all words on the web, all names, acronyms etc. (including the misspellings)
The use of a query log
  Use other users' behaviour

69 Advanced Concepts, Query Processing: Query Correction, isolated word correction
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
What is "closest"? We will study several alternatives
  1 Edit distance (Levenshtein distance)
  2 Weighted edit distance
  3 n-gram overlap

70 Advanced Concepts, Query Processing: Query Correction, edit distance
Given two strings S_1 and S_2, the edit distance is the minimum number of operations needed to convert one into the other
Operations are typically character-level: insert, delete, replace, (transposition)
Examples
  From dof to dog is 1
  From cat to act is 2 (just 1 with transposition)
  From cat to dog is 3
Levenshtein is the best known of the various edit distances
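
The distance can be computed with the standard dynamic-programming table, sketched here row by row. The optional `transpositions` flag adds a swap of two adjacent characters as a single operation (the restricted variant often called optimal string alignment); the function name and the flag are choices made for this sketch.

```python
def edit_distance(s1: str, s2: str, transpositions: bool = False) -> int:
    """Minimum number of insert, delete and replace operations (optionally
    also swaps of two adjacent characters) that turn s1 into s2."""
    prev2 = None                      # table row for s1[:i-2]
    prev = list(range(len(s2) + 1))   # table row for the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        cur = [i]                     # deleting i characters yields ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            best = min(
                prev[j] + 1,          # delete c1
                cur[j - 1] + 1,       # insert c2
                prev[j - 1] + cost,   # replace c1 by c2, or keep it
            )
            if (transpositions and i > 1 and j > 1
                    and c1 == s2[j - 2] and s1[i - 2] == c2):
                best = min(best, prev2[j - 2] + 1)  # swap adjacent characters
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]
```

The three examples from the slide come out as expected: `edit_distance("dof", "dog")` is 1, `edit_distance("cat", "act")` is 2 (1 with `transpositions=True`), and `edit_distance("cat", "dog")` is 3.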

72 Advanced Concepts, Query Processing: Query Correction, weighted edit distance
As above, but the weight of an operation depends on the character(s) involved
Meant to capture OCR or keyboard errors, e.g. m is more likely to be mistyped as n than as q
Therefore, replacing m by n contributes a smaller edit distance than replacing m by q
This may be formulated as a probability model
Requires a weight matrix as input

73 Advanced Concepts, Query Processing: Query Correction, n-gram overlap
Enumerate all the n-grams in the query string as well as in the lexicon
Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams
Threshold by the number of matching n-grams
Variants: weight by keyboard layout, etc.

75 Advanced Concepts, Query Processing: Query Correction, n-gram overlap, example
Suppose the text is november
  Trigrams are: nov, ove, vem, emb, mbe, ber
The query is december
  Trigrams are: dec, ece, cem, emb, mbe, ber
So 3 trigrams overlap (of 6 in each term)
How can we turn this into a normalised measure of overlap?

76 Advanced Concepts, Query Processing: Query Correction, n-gram overlap, Jaccard coefficient
A commonly used measure of overlap
Let $X$ and $Y$ be two sets; then the Jaccard coefficient is $JC(X, Y) = |X \cap Y| / |X \cup Y|$
  Equals 1 when X and Y have the same elements
  Equals 0 when they are disjoint
  X and Y don't have to be of the same size
  Always assigns a number between 0 and 1
Now threshold to decide if you have a match, e.g. if JC > 0.8, declare a match
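
A small sketch of the coefficient applied to character trigrams. The helper names are made up, and returning 1.0 for two empty sets is a convention chosen here (the sets have the same elements), not something the slides specify.

```python
def ngrams(word: str, n: int = 3) -> set[str]:
    """All character n-grams of a word, as a set."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(x: set, y: set) -> float:
    """|X intersect Y| / |X union Y|; two empty sets count as identical."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


november = ngrams("november")
december = ngrams("december")
overlap = jaccard(november, december)  # 3 shared trigrams out of 9 distinct ones
```

For november and december the coefficient is 3/9, i.e. about 0.33, so with a threshold of 0.8 the two words would not be declared a match.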

77 Advanced Concepts Query Processing Query Expansion The user has entered a query... and (related) words are automatically added to the query. Global query expansion: only use the query (plus corpus or background knowledge). Local query expansion: conduct the search and analyse the search results.

78 Advanced Concepts Query Processing Global Query Expansion Uses only the query. Example: use of a thesaurus. Query: [buy car]. Use synonyms from the thesaurus. Expanded query: [(buy OR purchase) (car OR auto)]
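The thesaurus example can be sketched in a few lines. The `THESAURUS` dictionary and the function `expand_query` are hypothetical; a real system would typically load synonyms from a curated resource.

```python
# Hypothetical thesaurus: each term maps to a list of synonyms.
THESAURUS = {"buy": ["purchase"], "car": ["auto"]}


def expand_query(query, thesaurus):
    """Replace every query term by an OR group of the term and its synonyms."""
    groups = []
    for term in query.split():
        alternatives = [term] + thesaurus.get(term, [])
        if len(alternatives) > 1:
            groups.append("(" + " OR ".join(alternatives) + ")")
        else:
            groups.append(term)  # no synonyms known, keep the term as it is
    return " ".join(groups)


print(expand_query("buy car", THESAURUS))  # (buy OR purchase) (car OR auto)
```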

79 Advanced Concepts Query Processing Local query expansion Uses the query plus the search results... conceptually similar to pseudo relevance feedback. Look for terms in the first n search results and add them to the query. Re-run the query with the added terms.
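A minimal sketch of local query expansion on a toy corpus. The ranking function (counting query term occurrences), the stop-word list and the selection of the most frequent new terms are simplifying assumptions; the slides do not prescribe how terms are chosen.

```python
from collections import Counter

# Toy corpus and stop-word list, for illustration only.
DOCS = [
    "buy a used car from a local dealer",
    "car dealer offers cheap used car",
    "how to bake bread at home",
]
STOP_WORDS = {"a", "at", "from", "how", "to"}


def search(query_terms, docs):
    """Toy ranking: order documents by the number of query term occurrences."""
    scored = []
    for doc in docs:
        tokens = doc.split()
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def expand_locally(query_terms, docs, top_n=2, extra_terms=2):
    """Add the most frequent new terms of the first top_n results to the query."""
    first_results = search(query_terms, docs)[:top_n]
    counts = Counter(
        token
        for doc in first_results
        for token in doc.split()
        if token not in query_terms and token not in STOP_WORDS
    )
    # Sort by frequency, then alphabetically, so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]


expanded = expand_locally(["car"], DOCS)
print(expanded)                # ['car', 'dealer', 'used']
print(search(expanded, DOCS))  # re-run the query with the added terms
```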

80 Evaluation Evaluation Assess the quality of search

81 Evaluation Overview: Evaluation Two basic questions / goals: Are the retrieved documents relevant for the user's information need? Have all relevant documents been retrieved? In practice these two goals contradict each other.

82 Evaluation Precision & Recall Evaluation Scenario: Fixed document collection. Number of queries. Set of documents marked as relevant for each query. Main Evaluation Metrics: $Precision = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})}$, $Recall = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})}$, $F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
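The three metrics can be computed directly from the set of retrieved documents and the set of documents marked as relevant. The document identifiers below are made up, and returning 0.0 for empty sets is a convention assumed by this sketch.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Four documents retrieved, three documents relevant, two in common.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Retrieving more documents can only keep or raise recall but tends to lower precision, which is the tension between the two goals of evaluation.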

83 Evaluation Precision & Recall [Figure: Precision & Recall]

84 Evaluation Advanced Evaluation Measures Mean Average Precision (MAP)... mean of the average precision over multiple queries: $MAP = \frac{1}{|Q|} \sum_{q \in Q} AP_q$, with $AP_q = \frac{1}{n} \sum_{k=1}^{n} Precision(k)$. (Normalised) discounted cumulative gain (nDCG)... where there is a reference ranking of the relevant results.
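A sketch of MAP under one common reading of the formula, which the slide itself does not spell out: n is taken to be the number of relevant documents for the query, and Precision(k) the precision at the rank of the k-th relevant document in the result list. Rankings and relevance judgements below are made up.

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the ranks of the relevant documents."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    # Relevant documents that were never retrieved contribute a precision of 0.
    return sum(precisions) / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over all (ranking, relevant) pairs."""
    return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)


RUNS = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
print(round(mean_average_precision(RUNS), 3))  # 0.667
```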

85 Applications of IR Selected Applications of IR Classification & Recommender Systems

86 Applications of IR Classification Overview Classification: Supervised learning: labelled training data is needed (effort for labelling). Classification problems: classify an example into a given set of categories. The algorithm learns a model from the labelled training set and applies this model to unknown data. Used in real applications. Clustering: Unsupervised learning. Find groups in the data, where the similarity within the group is high and the similarity between groups is low. Hardly ever used in real applications.

87 Applications of IR Classification Supervised learning Given examples of a function (x, f(x)), predict the function f(x) for new examples x. Depending on f(x) we have: 1 Classification if f(x) is a discrete function 2 Regression if f(x) is a continuous function 3 Probability estimation if f(x) is a probability distribution

88 Applications of IR Classification Classification Definition Given: training examples (x, f(x)) for some unknown discrete function f. Find: a good approximation for f. x is data, e.g. a vector, text documents, e-mails, words in text, etc. Classification approaches: Logistic regression. Decision trees. Support vector machines. Neural networks. Statistical inference such as Bayesian learning. Nearest neighbour algorithms (here: kNN algorithm).

89 Applications of IR Classification General kNN algorithm 1 For each desired class: at least one example must be provided 2 Take an unclassified document: look at its k nearest neighbours 3 Assign the unknown element to the class with the most neighbours in it 4 Go to 2

90 Applications of IR Classification Document Classification 1 For each desired class: at least one document must be provided 2 Find the k nearest neighbours of a document: 2.1 Take all terms of the document 2.2 Combine them into a query (e.g. with a logical OR) 2.3 Take the first k results from an index 3 Assign the unknown document to the class with the most neighbours in it 4 Go to 2
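The steps above can be sketched without a real search index by scoring each labelled document with the number of terms it shares with the unclassified document, which roughly corresponds to an OR query ranked by term matches. The training data, the scoring and the function name `knn_classify` are assumptions for illustration; a production setup would query an index such as Lucene instead.

```python
from collections import Counter

# Hypothetical labelled documents (at least one per class).
TRAINING = [
    ("cheap flights and hotel deals", "travel"),
    ("book a hotel near the beach", "travel"),
    ("football match ends in a draw", "sport"),
    ("tennis player wins the final match", "sport"),
]


def knn_classify(text, training, k=3):
    """Assign the class that is most frequent among the k best-matching documents."""
    query = set(text.split())  # all terms of the document, combined with OR
    scored = []
    for doc, label in training:
        overlap = len(query & set(doc.split()))
        if overlap > 0:  # an OR query only returns documents with a matching term
            scored.append((overlap, label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0] if votes else None


print(knn_classify("hotel deals near the airport", TRAINING))  # travel
print(knn_classify("football final match", TRAINING))          # sport
```

Note that the stop word "the" already creates one spurious neighbour in the first call, which hints at why stop-word filtering and term weighting matter in practice.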

91 Applications of IR Content-based Recommender systems Overview Recommender systems should provide usable suggestions to users. Recommender systems should show new, unknown but similar documents. Recommender systems can utilize different information sources. Content-based Recommender: finds similar documents on the basis of document content. Collaborative Filtering: employs user ratings to find new and useful documents. Knowledge-based Recommender: makes decisions based on pre-programmed rules. Hybrid Recommender: combination of two or three other approaches.

92 Applications of IR Content-based Recommender systems Types of Content-based Recommender: Document-to-Document Recommender, Document-to-User Recommender

93 Applications of IR Content-based Recommender systems Building a Recommender by Using an Index Document-to-Document Recommender: 1 Take the terms from a document 2 Construct a (weighted) query with them 3 Query the index 4 Results are possible candidates for recommending. Document-to-User Recommender: 1 Create a user model from terms typed by the user and documents viewed by the user (the simplest user model is just a set of terms) 2 Construct a (weighted) query with them 3 Query the index 4 Results are possible candidates for recommending
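A minimal document-to-document sketch, assuming a tiny in-memory inverted index and term frequency as the query weight. The corpus, `build_index` and `recommend` are hypothetical names; real systems would use tf-idf style weights and an engine such as Lucene.

```python
from collections import Counter, defaultdict

# Toy corpus, for illustration only.
DOCS = {
    "d1": "solr is a search server built on lucene",
    "d2": "lucene is a search library written in java",
    "d3": "recipes for baking bread and cake",
}


def build_index(docs):
    """Inverted index: term -> {document id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.split()).items():
            index[term][doc_id] = freq
    return index


def recommend(doc_id, docs, index, top_n=2):
    """Use the terms of one document as a weighted query against the index."""
    query = Counter(docs[doc_id].split())  # term -> weight (here: term frequency)
    scores = Counter()
    for term, weight in query.items():
        for other_id, freq in index.get(term, {}).items():
            if other_id != doc_id:  # never recommend the document itself
                scores[other_id] += weight * freq
    return [other_id for other_id, _ in scores.most_common(top_n)]


doc_index = build_index(DOCS)
print(recommend("d1", DOCS, doc_index))  # ['d2']
```

A document-to-user variant would follow the same pattern, with the query built from the user model (for example the set of terms typed by the user) instead of a document.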

94 Solutions Solutions Open-source and commercial IR solutions

95 Solutions Open Source Solutions Open Source Search Engine Apache Lucene: Very popular, high quality. Backbone of many other search engines, e.g. Apache Solr, ElasticSearch. Lucene is the core search engine. Solr provides a service & configuration infrastructure (plus a few extra features). Implemented in Java, bindings for many other programming languages.

96 Solutions Open Source Solutions Solr Index Management API: HTTP server. Various formats (XML, binary, JavaScript,...). Document life-cycle: There is no update; an update is a delete (done automatically by Solr) followed by an insert. Implications: A unique id is necessary. Use batch updates. Commit, rollback (and optimize).

97 Solutions Open Source Solutions Solr Text Processing Scope: during indexing & query. Tokenization: split text into tokens. Lower-case alignment. Stemming (e.g. ponies, pony → poni; triplicate → triplic; ...). Synonyms (via thesaurus). Stop-word filtering. Multi-word splitting (e.g. Wi-Fi → Wi, Fi). n-grams, soundex, umlauts.

98 Solutions Open Source Solutions Solr Query Processing Query parsers Lucene query parser (rich syntax): AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query. Boosting of individual parts. Example: ((boltzmann OR schroedinger) NOT einstein). Dismax query parser: No query syntax. Searches over multiple fields (separate boost for each field). Configure the number of terms that are mandatory. Distance between terms is used for ranking (phrase boosting). Dismax is a good starting point, but may become expensive.

99 Solutions Open Source Solutions Search Features Query filter Additional query No impact on ranking Results are cached Boosting query Only in Dismax Query elevation Fix certain queries Request handler Pre-define clauses Invariants Function queries Score is computed on field values

100 Solutions Open Source Solutions Solr Search Result Ranking Relevance Sort on field value (only single term per document) Available data & features Sequence of IDs & score Stored fields Snippets (plus highlighting) Facets Count the search hits Types: field value, dates, queries Sort, prefix,... Could be used for term suggestion (a.k.a. query suggestion) Field collapsing (grouping) Spell checking (did-you-mean)

101 Solutions Open Source Solutions Additional Solr Features Query by Example More like this Stats Per field Min, max, sum, missing,... Admin-GUI Webapp to troubleshoot queries Browse schema JMX Read properties & statistics Can be accessed remotely

102 Solutions Open Source Solutions Solr Integration Deployment Within a web application server Embedded Monitor Log output Access Various language bindings Java, Ruby, JavaScript, PHP,...

103 Solutions Open Source Solutions Other Open-Source Search Engines Xapian Support for many (programming) languages Used by many open-source projects Lemur & Indri Language models Academic background Terrier Many algorithms Academic background Whoosh for Pythonistas

104 Solutions Commercial Solutions What customers want Indexing pipeline And also: user interface, access control mechanisms, logging, etc.

105 Solutions Commercial Solutions Commercial Search Engines Autonomy Sinequa Attivio Guarantees not to truncate documents Offers a wide range of source connectivity Expert Search Creates new views, i.e. does not only show a result list, but combines results into a new unit Use SQL statements in queries IntraFIND Lucene-based And many, many more...

106 Solutions Commercial Solutions The End Next: Preprocessing Platform & Practical Applications

Recap of last time CS276A Information Retrieval

Recap of last time CS276A Information Retrieval Recap of last time CS276A Information Retrieval Index compression Space estimation Lecture 4 This lecture Tolerant retrieval Wild-card queries Spelling correction Soundex Wild-card queries Wild-card queries:

More information

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.

3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 3-1. Dictionaries and Tolerant Retrieval Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Dictionary data structures for inverted indexes Sec. 3.1 The dictionary

More information

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1

Tolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1 Tolerant Retrieval Searching the Dictionary Tolerant Retrieval Information Retrieval & Extraction Misbhauddin 1 Query Retrieval Dictionary data structures Tolerant retrieval Wild-card queries Soundex Spelling

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval

Recap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval Ch. 2 Recap of the previous lecture Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval The type/token distinction Terms are normalized types put in the dictionary Tokenization

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query

More information

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze

Dictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze Dictionaries and tolerant retrieval 1 Ch. 2 Recap of the previous lecture The type/token distinction Terms are normalized types put in the dictionary Tokenization problems: Hyphens, apostrophes, compounds,

More information

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Nayak & Raghavan (CS- 276, Stanford)

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Dictionaries and Tolerant retrieval

Dictionaries and Tolerant retrieval Dictionaries and Tolerant retrieval Slides adapted from Stanford CS297:Introduction to Information Retrieval A skipped lecture The type/token distinction Terms are normalized types put in the dictionary

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation

More information

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology

Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Nayak & Raghavan (CS- 276, Stanford)

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

CS105 Introduction to Information Retrieval

CS105 Introduction to Information Retrieval CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material

More information

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok

Boolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok Boolean Retrieval Manning, Raghavan and Schütze, Chapter 1 Daniël de Kok Boolean query model Pose a query as a boolean query: Terms Operations: AND, OR, NOT Example: Brutus AND Caesar AND NOT Calpuria

More information

Digital Libraries: Language Technologies

Digital Libraries: Language Technologies Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................

More information

Soir 1.4 Enterprise Search Server

Soir 1.4 Enterprise Search Server Soir 1.4 Enterprise Search Server Enhance your search with faceted navigation, result highlighting, fuzzy queries, ranked scoring, and more David Smiley Eric Pugh *- PUBLISHING -J BIRMINGHAM - MUMBAI Preface

More information

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39

boolean queries Inverted index query processing Query optimization boolean model September 9, / 39 boolean model September 9, 2014 1 / 39 Outline 1 boolean queries 2 3 4 2 / 39 taxonomy of IR models Set theoretic fuzzy extended boolean set-based IR models Boolean vector probalistic algebraic generalized

More information

Models for Document & Query Representation. Ziawasch Abedjan

Models for Document & Query Representation. Ziawasch Abedjan Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

http://www.xkcd.com/628/ Results Summaries Spelling Correction David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture3-tolerantretrieval.ppt http://www.stanford.edu/class/cs276/handouts/lecture8-evaluation.ppt

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries

OUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries WECHAT GROUP 1 OUTLINE Documents Terms General + Non-English English Skip pointers Phrase queries 2 Phrase queries We want to answer a query such as [stanford university] as a phrase. Thus The inventor

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 02 The Term Vocabulary and Postings Lists Constructing Inverted Indexes The major steps in constructing

More information

Information Retrieval

Information Retrieval Information Retrieval Suan Lee - Information Retrieval - 06 Scoring, Term Weighting and the Vector Space Model 1 Recap of lecture 5 Collection and vocabulary statistics: Heaps and Zipf s laws Dictionary

More information

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

CSCI 5417 Information Retrieval Systems! What is Information Retrieval? CSCI 5417 Information Retrieval Systems! Lecture 1 8/23/2011 Introduction 1 What is Information Retrieval? Information retrieval is the science of searching for information in documents, searching for

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

60-538: Information Retrieval

60-538: Information Retrieval 60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are

More information

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index

Indexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index Text Technologies for Data Science INFR11145 Indexing Instructor: Walid Magdy 03-Oct-2017 Lecture Objectives Learn about and implement Boolean search Inverted index Positional index 2 1 Indexing Process

More information

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview

More information

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components Overview Lecture 3: Index Representation and Tolerant Retrieval Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group 1 Recap 2

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 01 Boolean Retrieval Example IR Problem Let s look at a simple IR problem Suppose you own a copy of Shakespeare

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Information Retrieval

Information Retrieval Information Retrieval Dictionaries & Tolerant Retrieval Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by Jörg

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval Outline ❶ Course details ❷ Information retrieval ❸ Boolean retrieval 2 Course details

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 2011-05-03 1/ 36 Take-away

More information

Information Retrieval. Information Retrieval and Web Search

Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency

Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following

More information

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak

Introducing Information Retrieval and Web Search. borrowing from: Pandu Nayak Introducing Information Retrieval and Web Search borrowing from: Pandu Nayak Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually

More information

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)

CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1) CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processing and Information Retrieval Indexing and Vector Space Models Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it

More information

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.

Introduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Introduction to Information Retrieval and Boolean model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Unstructured (text) vs. structured (database) data in late

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 2: Boolean retrieval 2 Blanks on slides, you may want to fill in Last Time: Ngram Language Models Unigram LM: Bag of words Ngram

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Mustafa Jarrar: Lecture Notes on Information Retrieval University of Birzeit, Palestine 2014 Introduction to Information Retrieval Dr. Mustafa Jarrar Sina Institute, University of Birzeit mjarrar@birzeit.edu

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Information Retrieval. hussein suleman uct cs

Information Retrieval. hussein suleman uct cs Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information

More information
