Information Retrieval and Applications
|
|
- Oswald Reeves
- 6 years ago
- Views:
Transcription
1 Information Retrieval and Applications Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern, Heimo Gursch KTI, TU Graz Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
2 Outline 1 Basics Inverted Index Probabilistic IR 2 Advanced Concepts Latent Semantic Indexing Relevance Feedback Query Processing 3 Evaluation 4 Applications of IR Classification Content-based Recommender systems 5 Solutions Open Source Solutions Commercial Solutions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
3 Basics Basics Inverted Index, Text-Preprocessing, Language-based & Probabilistic IR Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
4 Basics Functionnal features Search for single keyword Search for multiple keywords Search for phrases, i. e. keywords in particular order Boolean combination of keywords (AND, OR, NOT) Rank results by number of keyword occurrences Find the positions of keywords in a document Compose snippets of text around the keyword positions Motivation How can this goals be achieved? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
5 Basics Structured, Semi-Structured Data Structured data Format and type of information is known IR task: normalize representations between different sources Example: SQL databases Semi-structured data Combination of structured and unstructured data IR task: extract information from the unstructured part and combine it with the structured information Example: , document with meta-data Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
6 Basics Unstructured data Unstructured data Format of information is not know, information is hidden in (human-readable) text IR task: extract information at all Example: blog-posts, Wikipedia articles This lecture focuses on unstructured data. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
7 Basics Simple Approach Sequential Search Advantage Simple Disadvantages Slow and hard to scale Usage Small texts Frequently changing text Not enough space for index available Example grep on the UNIX command line Not feasible for large texts. A different approach is needed. But how? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
8 Definition Basics Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). The term term is very common in Information Retrieval and typically refers to single words (also common: bi-grams, phrases,...) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
9 Basics Assumptions Basic assumptions of Information Retrieval Collection Fixed set of documents Goal Retrieve documents with information that is relevant to user s information need and helps him complete a task. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
10 Basics Overview Indexed Searching Split process Indexing Convert all input documents into a data-structure suitable for fast searching Searching Take an input query and map it onto the pre-processed data-structure to retrieve the matching documents Inverted index Reverses the relationship between documents and contained words (terms) Central data-structure of a search engine Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
11 Three Example Documents Basics Document 1 Sepp was stopping off in Hawaii. Document 2 Sepp is a co-worker of Heidi. Document 3 Heidi stopped going to Hawaii. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
12 Basics Inverted Index Inverted Index Matrix Term-document incidence Matrix entries 0 The document does not contain the corresponding term 1 The document does contain the corresponding term Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
13 Basics Inverted Index Example Matrix Document 1 Document 2 Document 3 co-worker going Hawaii Heidi Seppl stopped stopping off Matrix entries 0 The document does not contain the corresponding term 1 The document does contain the corresponding term Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
14 Basics Inverted Index Properties of the Inverted Index Matrix Vectorization of Documents Column Words of an document Row All documents where a word occurs Possible Queries Documents containing all terms of a query Take row vectors for each term and do a bitwise AND Documents contain any terms of a query Take row vectors for each term and do a bitwise OR Documents not containing a term Use a bitwise NOT Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
15 Basics Inverted Index Examples All documents containing the term Hawaii Take the row vector of Hawaii, look for a one: Document 1 & Document 3 All documents containing the terms Seppl & Heidi 1 Take the rows for the terms Seppl & Heidi 2 Combine them by a bitwise AND 3 The position in the result vector indicates the correct document: Document 2 More complex queries Use Boolean algebra on the row vectors Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
16 Basics Inverted Index What we have so far... Advantages Search for terms Formulate boolean queries Works very fast Disadvantages Cannot search for phrases (i. e. terms in a particular order) No information where the term occurred in the document No snippets No ranking possible, documents either match or do not match Matrix is usually very sparse, possible waste of memory Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
17 Basics Inverted Index Make the inverted index memory efficient Dictionary Store a (sorted a ) list of all terms in the documents. a The sorting is not necessary for it to work, but it is necessary to find terms fast Postings For every term: store a list in which documents the term occurs Posting list Linked lists generally preferred to arrays Dynamic space allocation Insertion of terms into documents easy Space overhead of pointers Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
18 Example Basics Inverted Index Dictionary Postings co-worker Document 2 going Document 3 Hawaii Document 1, Document 3 Heidi Document 2, Document 3 Seppl Document 1, Document 2 stopped Document 3 stopping off Document 1 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
19 Basics Inverted Index Boolean Retrieval AND Intersection of postings for two or more terms OR Union of postings for two or more terms NOT All documents not in the postings of a term Make it efficient Sort the dictionary Sort the postings by document IDs But: Tricky and long boolean combinations can still get difficult! Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
20 Basics Inverted Index Properties Prerequisite Every document needs a unique, sortable ID, i. e. an unique integer number Advantages Very efficient storage Search for terms Formulate boolean queries Disadvantages Cannot search for phrases (i. e. terms in a particular order) No information where the term occurred in the document No snippets No ranking possible, documents either match or do not match Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
21 Basics Inverted Index Positional Information Store not only that a document contains a term, but also where the term occurs in the document. Dictionary Postings term 1 Document ID, Position(s) in document term 2 Document ID, Position(s) in document term n Document ID, Position(s) in document Search for phrases (i. e. consecutive words) 1 Get each word from the dictionary and retrieve its posting list 2 If the document IDs are equal for two or more words, search for consecutive positions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
22 Basics Inverted Index Example Dictionary Postings co-worker (Doc 2, Pos 4) going (Doc 3, Pos 3) Hawaii (Doc 1, Pos 5); (Doc 3, Pos 5) Heidi (Doc 2, Pos 6); (Doc 3, Pos 1) Seppl (Doc 1, Pos 1); (Doc 2, Pos 1) stopped (Doc 3, Pos 2) stopping off (Doc 1, Pos 3) Search for the phrase Heidi stopped 1 Heidi (Doc 2, 6); (Doc 3, Pos1) 2 stopped (Doc 3, Pos 2) 3 Consecutive positions for Document 3 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
23 Basics Inverted Index Properties Advantages Still efficient storage Search for terms Formulate boolean queries Search for phrases (i. e. terms in a particular order) Information where the term occurred in the document Possible creation of snippets Disadvantages Ranking a bit tricky, but possible Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
24 Basics Inverted Index Term Frequencies Implicit term frequencies By storing the positions where the terms occur in a document, also the information how often it occurs is stored. As there have to be more position markers if the term occurs more often. Problem: Counting needs time Explicit term frequencies Store in the posting list, how often the term occurs in the document Possible to leave the positional information out, if not needed Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
25 Basics Inverted Index Term Frequencies Structure Store in the posting list, how often the term occurs in the document (This is often called Term Frequency (TF)) Possible to leave the positional information out, if not needed Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
26 Basics Inverted Index Properties Advantages Still efficient storage Search for terms Formulate boolean queries Search for phrases (i. e. terms in a particular order) Information where the term occurred in the document Possible creation of snippets Documents can be ranked by the number of terms they contain Disadvantages Takes up space, not only storing the original documents, but also the inverted index Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
27 Steps for building an index Basics Inverted Index Step 1 Build term-document table Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... oman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
28 Basics Inverted Index Steps for building an index Step 1 Build term-document table Step 2 Sort by terms Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... oman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
29 Basics Inverted Index Steps for building an index Step 1 Build term-document table Step 2 Sort by terms Step 3 Add term frequency, multiple entries from single document get merged Correspondingly Get the positions of the terms in the documents Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... oman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
30 Basics Inverted Index Steps for building an index Step 1 Build term-document table Step 2 Sort by terms Step 3 Add term frequency, multiple entries from single document get merged Correspondingly Get the positions of the terms in the documents Step 4 : Result is split into dictionary file and description file. Dictionary Postings term 1 Doc ID, TF, Position(s); Doc ID, TF, Position(s);... term 2 Doc ID, TF, Position(s); term n Doc ID, TF, Position(s); Doc ID, TF, Position(s);... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
31 Basics Inverted Index Preprocessing of Text To make a text indexable, a couple of steps are necessary: 1 Detect the language of the text 2 Detect sentence boundaries 3 Detect term (word) boundaries 4 Stemming and normalization 5 Remove stop words Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
32 Basics Inverted Index Language Detection Letter Frequencies Percentage of occurrence Letter English German A E I O U N-Gram Frequencies Better more elaborate methods use statistics over more than one letter, e. g. statistics over two, three or even more consecutive letters. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
33 Basics Inverted Index Sentence Detection First approach Every period marks the end of a sentence Problem Utilize other features Periods also mark abbreviations, decimal points, -addresses, etc. Is the next letter capitalized? Is the term in front of the period a known abbreviation? Is the period surround by digits?... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
34 Basics Inverted Index Tokenization Problem Goal Split sentences into tokens Throw away punctuations Possible Pit Falls White spaces are no safe delimiter for word boundaries(e. g. New York) Non-alphanumerical characters can separate words, but need not (e. g. co-worker) Different forms (e. g. white space vs. whitespace) German compound nouns (e. g. Donaudampfschiffahrtsgesellschaft, meaning Danube Steamboat Shipping Company) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
35 Basics Inverted Index Tokenization Solutions Supervised Learning Train a model with annotated training data, use the trained model to tokenize unknown text Hidden-Markov-Models and conditional random fields are commonly used Dictionary Approach Build a dictionary (i. e. list ) of tokens Go over the sentence and always take the longest fitting token (greedy algorithm!) Remark: Works well for Asian languages without white spaces and short words. Problematic for European languages Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
36 Basics Inverted Index Normalization Some definitions Token Sequence of characters representing a useful semantic unit, instance of a type Task Type Common concept for tokens, element of the vocabulary Term Type that is stored in the dictionary of an index 1 Map all possible tokens to the corresponding type 2 Store all representing terms of one type in the dictionary Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
37 Example Basics Inverted Index Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
38 Basics Inverted Index Posible Ways for Normalization Ignore cases Rule based removal of hyphens, periods, white spaces, accents, diacritics (Be aware of possible different meaning after removal) Provide a list of all possible representations of a type Map different spelling version by using a phonetic algorithm. Phonetic algorithms translate the spelling into their pronunciation equivalent (e. g. Double Metaphone, Soundex). Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
39 Basics Inverted Index Stemming & Lemmatization Definition Goal of both Reduce different grammatical forms of a word to their common infinitive a Stemming Usage of heuristics to chop off / replace last part of words (Example: Porter s algorithm). Lemmatization Usage of proper grammatical rules to recreate the infinitive. a The common infinitive is the form of a word how it is written in a dictionary. This form is called lemma, hence the name Lemmatization. Examples Map do, doing, done, to commone infinitive do Map digitalizing, digitalized to digital Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
40 Basics Inverted Index Stop Words Definition Extremely common words that appear in nearly every text Problem As stop words are so conmen, their occurrence does not characterize a text Solution Just drop them, i. e. do not put them in the index directory Common stop words Solution Write a list with stop words (List might be topic specific!) Usual suspects: articles (a, an, the), conjunction (and, or, but,...), pre- and postposition (in, for, from), etc. Stop word list Ignore word that are given on a list (black list) Problem Special names and phrases (The Who, Let It Be,...) Solution Make another list... (white list) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
41 Information flow Basics Probabilistic IR The classic search model Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
42 Basics Probabilistic IR Probabilistic Information Retrial Every transformation can hold a possible information loss 1 Task to query 2 Query to document Information loss leads to uncertainty Probability theory can deal with uncertainty Basic idea behind probabilistic IR Model the uncertainty, which originates in the imperfect transformations Use this uncertainty as a measurement how relevant a document is, given a certain query Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
43 Basics Probabilistic IR Probabilistic Information Retrial Models Binary Independence Model Okapi BM25 Bayesian networks for text retrieval Language Models Coming now... Binary: documents and queries represented as binary term incidence vectors Independence: terms are assumed to be independent of each other Takes number of term frequencies and document length into account Parameters to tune influence of term frequencies and document length Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
44 Basics Probabilistic IR Language Models Basic Idea Train a probabilistic model M d for document d, i. e. as many trained models as documents M d ranks documents on how possible it is that the user s query q was created by the document d Results are document models (or documents, respectively) which have a high probability to generate the user s query Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
45 Basics Probabilistic IR Examples for Language Models Successive Probability of query terms P(t 1 t 2... t n ) = P(t 1 )P(t 2 t 1 )P(t 3 t 1 t 2 )... P(t n t 1... t n 1 ) Unigram Languge Model P uni (t 1 t 2... t n ) = P(t 1 )P(t 2 )... P(t n ) Bigram Languge Model P(t 1 t 2... t n ) = P(t 1 )P(t 2 t 1 )P(t 3 t 2 )... P(t n t n 1 ) Probability P(t n ) Probabilities are generated from the term occurrence in the document in question And many more... A lot more, and more complex models are available Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
46 Advanced Concepts Advanced Concepts LSA, Relevance Feedback, Query Processing,... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
47 Advanced Concepts Latent Semantic Indexing Index Construction Going beyond simple indexing Term-document matrices are very large But the number of topics that people talk about is small (in some sense) Clothes, movies, politics,... Can we represent the term-document space by a lower dimensional latent space? Latent Semantic Indexing (LSI) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
48 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking This is taking a 3-D point and projecting it into 2-D ( x, y, z) ( 10, 10, 10) ( x, y) ( 10, 10) The arrow in this picture acts like the overhead projector Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
49 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking We can project from any number of dimensions into any other number of dimensions. Increasing dimensions adds redundant information But sometimes useful Support Vector Machines (kernel methods) do this effectively Latent Semantic Indexing always reduces the number of dimensions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
50 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking Latent Semantic Indexing always reduces the number of dimensions (x,y) ( 10, 10) (x) ( 10 ) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
51 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking Latent Semantic Indexing can project on an arbitrary axis, not just a principal axis Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
52 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing - Introduction Mathematically speaking Our documents were just points in an N-dimensional term space We can project them also Brutus MacBeth Brutus MacBeth Othello Othello Hamlet Hamlet Julius Caesar Julius Caesar Antony and Cleopatra Antony and Cleopatra Tempest Antony Tempest Antony Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
53 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing LSI - expected behaviour A term vector that is projected on new vectors may uncover deeper meanings For example transforming the 3 axes of a term matrix from ball, bat and cave to An axis that merges ball and bat An axis that merges bat and cave Should be able to separate differences in meaning of the term bat Bonus: less dimensions is faster Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
54 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition SVD overview For an m n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows: M = UΣV T... U is an m m unitary matrix... Σ is an m n diagonal matrix... V T is an n n unitary matrix The columns of U are orthogonal eigenvectors of AA T The columns of V are orthogonal eigenvectors of A T A Eigenvalues λ 1...λ r of AA T are the eigenvalues of A T A. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
55 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Matrix Decomposition SVD enables lossy compression of a term-document matrix Reduces the dimensionality or the rank Arbitrarily reduce the dimensionality by putting zeros in the bottom right of sigma This is a mathematically optimal way of reducing dimensions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
56 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Reduced SVD If we retain only k singular values, and set the rest to 0, then we don t need the matrix parts in green Then Σ is k k, U is M k, V T is k n, and A k is m n This is referred to as the reduced SVD It is the convenient (space-saving) and usual form for computational applications Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
57 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Approximation Error How good (bad) is this approximation? It s the best possible, measured by the Frobenius norm of the error Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
58 Advanced Concepts Latent Semantic Indexing Singular Value Decomposition Reduced SVD Whereas the term-doc matrix A may have m = 50000, n = 10million (and rank close to 50000) We can construct an approximation A 100 with rank 100. Of all rank 100 matrices, it would have the lowest Frobenius error. Latent Semantic Indexing Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
59 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing Application of SVD - LSI From term-document matrix A, we compute the approximation A k. There is a row for each term and a column for each document in A k Thus documents live in a space of k << r dimensions These dimensions are not the original axes Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
60 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing Approximation Error Each row and column of A gets mapped into the k-dimensional LSI space, by the SVD. Claim this is not only the mapping with the best (Frobenius error) approximation to A, but in fact improves retrieval. A query q is also mapped into this space Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
61 Advanced Concepts Latent Semantic Indexing Latent Semantic Indexing LSI - Results Similar terms map to similar location in low dimensional space Noise reduction by dimension reduction Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
62 Advanced Concepts Relevance Feedback Relevance Feedback Integrate feedback from the user Given a search result for a user s query... allow the user to rate the retrieved results (good vs. not-so-good match) This information can then be used to re-rank the results to match the user s preference Often automatically inferred from the user s behaviour using the search query logs... and ultimately improve the search experience The query logs (optimally) contain the queries plus the items the user has clicked on (the user s session) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
63 Advanced Concepts Relevance Feedback Relevance Feedback Pseudo relevance feedback No user interaction is needed for the pseudo relevance feedback... the first n results are simply assumed to be a good match As if the users has clicked on the first results and marked them as good match often improves the quality of the search results Pseudo relevance feedback is sometimes also called blind relevance feedback Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
64 Advanced Concepts Relevance Feedback More Like This A common feature of search engines is the more like this search Given a search results... the user wants items similar to one of the presented results This can be seen as: taking a document as query which is a common implementation (but restricting the query to the most discriminative terms) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
65 Advanced Concepts Query Processing Query Reformulation Modify the query, during of after the user interaction Query suggestion & completion (Automatic) query correction (Automatic) query expansion Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
66 Advanced Concepts Query Processing Query Suggestion The user start typing a query... automatically gets suggestions to complete the currently typed word to complete a whole phrase (e.g. the next word) suggest related queries Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
67 Advanced Concepts Query Processing Query Correction Query misspellings Our principal focus here E.g., the query Britni Spears We can either Retrieve documents indexed by the correct spelling, Rewrite the input query (replace, expand), britni spears (britni OR britney) spears Return several suggested alternative queries with the correct spelling - Did you mean Britney Spears? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
68 Query Correction Advanced Concepts Query Processing Isolated word correction Fundamental premise there is a lexicon from which the correct spellings come Basic choices for this A standard lexicon such as Webster s English Dictionary An industry-specific lexicon hand-maintained The lexicon of the indexed corpus E.g., all words on the web All names, acronyms etc. (Including the mis-spellings) The use of a query log Use other users behaviour Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
69 Advanced Concepts Query Processing Query Correction Isolated word correction Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q What s closest? We ll study several alternatives 1 Edit distance (Levenshtein distance) 2 Weighted edit distance 3 n-gram overlap Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
70 Query Correction Advanced Concepts Query Processing Word correction: edit distance Given two strings S 1 and S 2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) Examples From dof to dog is 1 From cat to act is 2 (Just 1 with transpose.) From cat to dog is 3. See for a nice example plus an applet. Notebook with various distances (Levenshtein being the best known): Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
71 Query Correction Advanced Concepts Query Processing Word correction: edit distance Given two strings S 1 and S 2, the minimum number of operations to convert one to the other Operations are typically character-level Insert, Delete, Replace, (Transposition) Examples From dof to dog is 1 From cat to act is 2 (Just 1 with transpose.) From cat to dog is 3. See for a nice example plus an applet. Notebook with various distances (Levenshtein being the best known): Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
72 Advanced Concepts Query Processing Query Correction Word correction: weighted edit distance As above, but the weight of an operation depends on the character(s) involved Meant to capture OCR or keyboard errors, e.g. m more likely to be mis-typed as n than as q Therefore, replacing m by n is a smaller edit distance than by q This may be formulated as a probability model Requires weight matrix as input Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
73 Advanced Concepts Query Processing Query Correction Word correction: n-gram overlap Enumerate all the n-grams in the query string as well as in the lexicon Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams Threshold by number of matching n-grams Variants weight by keyboard layout, etc. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
74 Advanced Concepts Query Processing Query Correction Word correction: n-gram overlap, example Suppose the text is november Trigrams are: nov ove vem emb mbe ber The query is december Trigrams are: dec ece cem emb mbe ber Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
75 Advanced Concepts Query Processing Query Correction Word correction: n-gram overlap, example Suppose the text is november Trigrams are: nov ove vem emb mbe ber The query is december Trigrams are: dec ece cem emb mbe ber So 3 trigrams overlap (of 6 in each term) How can we turn this into a normalised measure of overlap? Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
76 Advanced Concepts Query Processing Query Correction n-gram overlap, Jaccard Coefficient A commonly-used measure of overlap Let X and Y be two sets; then the Jaccard Coefficient is Equals 1 when X and Y have the same elements Equals 0 when they are disjoint X and Y don t have to be of the same size Always assigns a number between 0 and 1 Now threshold to decide if you have a match E.g., if JC > 0.8 declare a match Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
77 Advanced Concepts Query Processing Query Expansion The user has entered a query... automatically (related) words are added to the query Global query expansion - only use the query (plus corpus or background knowledge) Local query expansion - conduct the search and analyse the search results Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
78 Global Query Expansion Advanced Concepts Query Processing Uses only the query Example: Use of a thesaurus Query: [buy car] Use synonyms from the thesaurus Expanded query: [(buy OR purchase) (car OR auto)] Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
79 Advanced Concepts Query Processing Local query expansion Uses the query plus the search results... conceptually similar to the pseudo relevance feedback Look for terms in the first n search results and add them to the query Re-run the query with the added terms Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
80 Evaluation Evaluation Assess the quality of search Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
81 Evaluation Overview: Evaluation Two basic questions / goals Are the retrieved document relevant for the user s information need? Have all relevant document been retrieved? In practice these two goal contradict each other. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
82 Evaluation Precision & Recall Evaluation Scenario Fixed document collection Number of queries Set of documents marked as relevant for each query Main Evaluation Metrics Precision = Recall = #(relevant items retrieved) #(retrieved items) #(relevant items retrieved) #(relevant items) F 1 = 2PrecisionRecall Precision+Recall Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
83 Precision & Recall Evaluation Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
84 Evaluation Advanced Evaluation Measures Mean Average Precision - MAP... mean of all average precision of multiple queries Map = 1 Q q Q AP q AP q = 1 n n k=1 Precision(k) (Normalised) discounted cumulative gain (ndcg)... where there is a reference ranking the relevant results Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
85 Applications of IR Selected Applications of IR Classification & Recommender Systems Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
86 Applications of IR Classification Overview Classification Supervised Learning: labelled training data is needed (effort for labelling) Classification problems: classify an example into a given set of categories Algorithm learns model from the labelled training set and applies this model on unknown data Used in real applications Clustering Unsupervised Learning Find groups in the data, where the similarity within the group is high and the similarity between groups is low Hardly ever used in real applications Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
87 Applications of IR Classification Supervised learning Given examples of a function (x, f (x)) Predict function f (x) for new examples x Depending on f (x) we have: 1 Classification if f (x) is a discrete function 2 Regression if f (x) is a continuous function 3 Probability estimation if f (x) is a probability distribution Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
88 Applications of IR Classification Classification Definition Given: Training examples (x, f (x)) for some unknown discrete function f Find: A good approximation for f x is data, e. g. a vector, text documents, s, words in text, etc. Classification approaches Logistic regression Decision trees Support vector machines Neural networks Statistical inference such as Bayesian learning Nearest neighbour algorithms (Here: knn-algorithm) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
89 General knn-algorithm Applications of IR Classification 1 For each desired class: at least one example must be provided 2 Take a unclassified document: Look at its k nearest neighbours 3 Assign the unknown element to the class with the most neighbours in it 4 Goto 2 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
90 Document Classification Applications of IR Classification 1 For each desired class: at least one document must be provided 2 Find k nearest neighbours of a document: 1 Take all terms of a document 2 Combine them to a query (e. g. with a logical OR) 3 Take the first k results form an index 3 Assign the unknown document to the class with the most neighbours in it 4 Goto 2 Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
91 Applications of IR Content-based Recommender systems Overview Recommender systems should provide usable suggestion to users Recommender systems should show new, unknown but similar documents Recommender systems can utilize different information sources Content-based Recommender Finds similar documents on the basis of document content Collaborative Filering Employs user rating to find new and useful documents Knowledge-based Recommender Makes decisions based on pre-programmed rules Hybrid Recommender Combination of two or three other approaches Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
92 Applications of IR Types of Content-based Recommender Content-based Recommender systems Document-to-Document Recommender Document-to-User Recommender Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
93 Applications of IR Content-based Recommender systems Building a Recommender by Using an Index Document-to-Document Recommender 1 Take the terms from a document 2 Construct a (weighted) query with them 3 Query the index 4 Results are possible candidates for recommending Document-to-User Recommender 1 Create a user model from Terms typed by the user Documents viewed by the user Simplest user model is just a set of terms 2 Construct a (weighted) query with them 3 Query the index 4 Results are possible candidates for recommending Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
94 Solutions Solutions Open-source and commercial IR solutions Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
95 Solutions Open Source Solutions Open Source Search Engine Apache Lucene Very popular, high quality Backbone of many other search engines, e.g. Apache Solr, ElasticSearch Lucene is the core search engine Solr provides a service & configuration infrastructure (plus a few extra features) Implemented in Java, bindings for many other programming languages Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
96 Solutions Open Source Solutions Solr Index Management API HTTP Server Various formats (XML, binary, JavaScript,...) Document life-cycle There is no update Delete (done automatically by Solr) Insert Implications An unique id is necessary Use batch updates Commit, rollback (and optimize) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
97 Solutions Open Source Solutions Solr Text Processing Scope During indexing & query Tokenization Split text into tokens Lower-case alignment Stemming (e.g. ponies, pony poni, triplicate triplic,...) Synonyms (via Thesaurus) Stop-word filtering Multi-word splitting (e.g. Wi-Fi Wi, Fi) n-grams, soundex, umlauts Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
98 Solutions Open Source Solutions Solr Query Processing Query parsers Lucene query parser (rich syntax) AND, OR, NOT, range queries, wildcards, fuzzy query, phrase query Boosting of individual parts Example: ((boltzmann OR schroedinger) NOT einstein) Dismax query parser No query syntax Searches over multiple fields (separate boost for each field) Configure the amount of terms to be mandatory Distance between terms is used for ranking (phrase boosting) Dismax is a good starting point, but may become expensive Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
99 Solutions Open Source Solutions Search Features Query filter Additional query No impact on ranking Results are cached Boosting query Only in Dismax Query elevation Fix certain queries Request handler Pre-define clauses Invariants Function queries Score is computed on field values Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
100 Solutions Open Source Solutions Solr Search Result Ranking Relevance Sort on field value (only single term per document) Available data & features Sequence of IDs & score Stored fields Snippets (plus highlighting) Facets Count the search hits Types: field value, dates, queries Sort, prefix,... Could be used for term suggestion (aka. query suggestion) Field collapsing (grouping) Spell checking (did-you-mean) Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
101 Solutions Open Source Solutions Additional Solr Features Query by Example More like this Stats Per field Min, max, sum, missing,... Admin-GUI Webapp to troubleshoot queries Browse schema JMX Read properties & statistics Can be accessed remotely Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
102 Solutions Open Source Solutions Solr Integration Deployment Within a web application server Embedded Monitor Log output Access Various language bindings Java, Ruby, JavaScript, PHP,... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
103 Solutions Open Source Solutions Other Open-Source Search Engines Xapian Support for many (programming) languages Used by many open-source projects Lemur & Indri Language models Academic background Terrier Many algorithms Academic background Whoosh for Pythonistas Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
104 What customers want Solutions Commercial Solutions Indexing pipeline And also: user interface, access control mechanisms, logging, etc. Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
105 Solutions Commercial Solutions Commercial Search Engines Autonomy Sinequa Attivio Grantees not to truncate documents Offers a wide range on source connectivity Expert Search Creates new views, i. e. do not only show result list, but combine results to new unit Use SQL statements in queries IntraFIND Lucene-based And many, many more... Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
106 Solutions Commercial Solutions The End Next: Preprocessing Platform & Practical Applications Roman Kern, Heimo Gursch (KTI, TU Graz) Information Retrieval and Applications / 101
Recap of last time CS276A Information Retrieval
Recap of last time CS276A Information Retrieval Index compression Space estimation Lecture 4 This lecture Tolerant retrieval Wild-card queries Spelling correction Soundex Wild-card queries Wild-card queries:
More information3-1. Dictionaries and Tolerant Retrieval. Most slides were adapted from Stanford CS 276 course and University of Munich IR course.
3-1. Dictionaries and Tolerant Retrieval Most slides were adapted from Stanford CS 276 course and University of Munich IR course. 1 Dictionary data structures for inverted indexes Sec. 3.1 The dictionary
More informationTolerant Retrieval. Searching the Dictionary Tolerant Retrieval. Information Retrieval & Extraction Misbhauddin 1
Tolerant Retrieval Searching the Dictionary Tolerant Retrieval Information Retrieval & Extraction Misbhauddin 1 Query Retrieval Dictionary data structures Tolerant retrieval Wild-card queries Soundex Spelling
More informationIntroduction to Information Retrieval
Introduction Inverted index Processing Boolean queries Course overview Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural
More informationInformation Retrieval
Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid
More informationRecap of the previous lecture. This lecture. A naïve dictionary. Introduction to Information Retrieval. Dictionary data structures Tolerant retrieval
Ch. 2 Recap of the previous lecture Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval The type/token distinction Terms are normalized types put in the dictionary Tokenization
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query
More informationDictionaries and tolerant retrieval. Slides by Manning, Raghavan, Schutze
Dictionaries and tolerant retrieval 1 Ch. 2 Recap of the previous lecture The type/token distinction Terms are normalized types put in the dictionary Tokenization problems: Hyphens, apostrophes, compounds,
More informationDictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology
Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Nayak & Raghavan (CS- 276, Stanford)
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability
More informationDictionaries and Tolerant retrieval
Dictionaries and Tolerant retrieval Slides adapted from Stanford CS297:Introduction to Information Retrieval A skipped lecture The type/token distinction Terms are normalized types put in the dictionary
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationDictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology
Dictionaries and tolerant retrieval CE-324 : Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Nayak & Raghavan (CS- 276, Stanford)
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationCS105 Introduction to Information Retrieval
CS105 Introduction to Information Retrieval Lecture: Yang Mu UMass Boston Slides are modified from: http://www.stanford.edu/class/cs276/ Information Retrieval Information Retrieval (IR) is finding material
More informationBoolean Retrieval. Manning, Raghavan and Schütze, Chapter 1. Daniël de Kok
Boolean Retrieval Manning, Raghavan and Schütze, Chapter 1 Daniël de Kok Boolean query model Pose a query as a boolean query: Terms Operations: AND, OR, NOT Example: Brutus AND Caesar AND NOT Calpuria
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More informationSoir 1.4 Enterprise Search Server
Soir 1.4 Enterprise Search Server Enhance your search with faceted navigation, result highlighting, fuzzy queries, ranked scoring, and more David Smiley Eric Pugh *- PUBLISHING -J BIRMINGHAM - MUMBAI Preface
More informationboolean queries Inverted index query processing Query optimization boolean model September 9, / 39
boolean model September 9, 2014 1 / 39 Outline 1 boolean queries 2 3 4 2 / 39 taxonomy of IR models Set theoretic fuzzy extended boolean set-based IR models Boolean vector probalistic algebraic generalized
More informationModels for Document & Query Representation. Ziawasch Abedjan
Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview
More informationInformation Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationhttp://www.xkcd.com/628/ Results Summaries Spelling Correction David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture3-tolerantretrieval.ppt http://www.stanford.edu/class/cs276/handouts/lecture8-evaluation.ppt
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationOUTLINE. Documents Terms. General + Non-English English. Skip pointers. Phrase queries
WECHAT GROUP 1 OUTLINE Documents Terms General + Non-English English Skip pointers Phrase queries 2 Phrase queries We want to answer a query such as [stanford university] as a phrase. Thus The inventor
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 02 The Term Vocabulary and Postings Lists Constructing Inverted Indexes The major steps in constructing
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 06 Scoring, Term Weighting and the Vector Space Model 1 Recap of lecture 5 Collection and vocabulary statistics: Heaps and Zipf s laws Dictionary
More informationCSCI 5417 Information Retrieval Systems! What is Information Retrieval?
CSCI 5417 Information Retrieval Systems! Lecture 1 8/23/2011 Introduction 1 What is Information Retrieval? Information retrieval is the science of searching for information in documents, searching for
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More information60-538: Information Retrieval
60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are
More informationIndexing. Lecture Objectives. Text Technologies for Data Science INFR Learn about and implement Boolean search Inverted index Positional index
Text Technologies for Data Science INFR11145 Indexing Instructor: Walid Magdy 03-Oct-2017 Lecture Objectives Learn about and implement Boolean search Inverted index Positional index 2 1 Indexing Process
More informationRepresentation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s
Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 5: Index Compression Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-17 1/59 Overview
More informationOverview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components
Overview Lecture 3: Index Representation and Tolerant Retrieval Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group 1 Recap 2
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Dell Zhang Birkbeck, University of London 2016/17 IR Chapter 01 Boolean Retrieval Example IR Problem Let s look at a simple IR problem Suppose you own a copy of Shakespeare
More informationInformation Retrieval: Retrieval Models
CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models
More informationInformation Retrieval
Information Retrieval Dictionaries & Tolerant Retrieval Gintarė Grigonytė gintare@ling.su.se Department of Linguistics and Philology Uppsala University Slides based on previous IR course given by Jörg
More informationChapter 2. Architecture of a Search Engine
Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them
More informationInformation Retrieval
Introduction to Information Retrieval Information Retrieval and Web Search Lecture 1: Introduction and Boolean retrieval Outline ❶ Course details ❷ Information retrieval ❸ Boolean retrieval 2 Course details
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Institute for Natural Language Processing, University of Stuttgart 2011-05-03 1/ 36 Take-away
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationPart I: Data Mining Foundations
Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationMultimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency
Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following
More informationIntroducing Information Retrieval and Web Search. borrowing from: Pandu Nayak
Introducing Information Retrieval and Web Search borrowing from: Pandu Nayak Information Retrieval Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually
More informationCSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1)
CSE 7/5337: Information Retrieval and Web Search Introduction and Boolean Retrieval (IIR 1) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze
More informationNatural Language Processing and Information Retrieval
Natural Language Processing and Information Retrieval Indexing and Vector Space Models Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it
More informationIntroduction to Information Retrieval and Boolean model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H.
Introduction to Information Retrieval and Boolean model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Unstructured (text) vs. structured (database) data in late
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 Information Retrieval Lecture 2: Boolean retrieval 2 Blanks on slides, you may want to fill in Last Time: Ngram Language Models Unigram LM: Bag of words Ngram
More informationIntroduction to Information Retrieval
Mustafa Jarrar: Lecture Notes on Information Retrieval University of Birzeit, Palestine 2014 Introduction to Information Retrieval Dr. Mustafa Jarrar Sina Institute, University of Birzeit mjarrar@birzeit.edu
More informationNatural Language Processing
Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationInformation Retrieval. hussein suleman uct cs
Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information
More informationInformation Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes
CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 01 Boolean Retrieval 1 01 Boolean Retrieval - Information Retrieval - 01 Boolean Retrieval 2 Introducing Information Retrieval and Web Search -
More informationInformation Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer
More informationBoolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology
Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures
More informationIntroduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.
Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How
More informationVK Multimedia Information Systems
VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Results Exercise 01 Exercise 02 Retrieval
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 03 Dictionaries and Tolerant Retrieval 1 03 Dictionaries and Tolerant Retrieval - Information Retrieval - 03 Dictionaries and Tolerant Retrieval
More informationCPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016
CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:
More informationMinoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University
Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 1: Boolean Retrieval Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-04-09 Schütze: Boolean
More informationInformation Retrieval
Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 1: Boolean retrieval Information Retrieval Information Retrieval (IR) is finding
More informationClassic IR Models 5/6/2012 1
Classic IR Models 5/6/2012 1 Classic IR Models Idea Each document is represented by index terms. An index term is basically a (word) whose semantics give meaning to the document. Not all index terms are
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 6-: Scoring, Term Weighting Outline Why ranked retrieval? Term frequency tf-idf weighting 2 Ranked retrieval Thus far, our queries have all been Boolean. Documents
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More information- Content-based Recommendation -
- Content-based Recommendation - Institute for Software Technology Inffeldgasse 16b/2 A-8010 Graz Austria 1 Content-based recommendation While CF methods do not require any information about the items,
More informationRecap: lecture 2 CS276A Information Retrieval
Recap: lecture 2 CS276A Information Retrieval Stemming, tokenization etc. Faster postings merges Phrase queries Lecture 3 This lecture Index compression Space estimation Corpus size for estimates Consider
More informationAdministrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks
Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More informationToday s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan
Today s topic CS347 Clustering documents Lecture 8 May 7, 2001 Prabhakar Raghavan Why cluster documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics
More informationDocument indexing, similarities and retrieval in large scale text collections
Document indexing, similarities and retrieval in large scale text collections Eric Gaussier Univ. Grenoble Alpes - LIG Eric.Gaussier@imag.fr Eric Gaussier Document indexing, similarities & retrieval 1
More informationBoolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology
Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures
More informationInformation Retrieval
Introduction to Information Retrieval CS276 Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 1: Boolean retrieval Information Retrieval Information Retrieval (IR)
More informationAdvanced Retrieval Information Analysis Boolean Retrieval
Advanced Retrieval Information Analysis Boolean Retrieval Irwan Ary Dharmawan 1,2,3 iad@unpad.ac.id Hana Rizmadewi Agustina 2,4 hagustina@unpad.ac.id 1) Development Center of Information System and Technology
More informationBoolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology
Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2013 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,
More informationPattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42
Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationCS347. Lecture 2 April 9, Prabhakar Raghavan
CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Today s topics Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card
More informationChapter 6. Queries and Interfaces
Chapter 6 Queries and Interfaces Keyword Queries Simple, natural language queries were designed to enable everyone to search Current search engines do not perform well (in general) with natural language
More informationToday s topics CS347. Inverted index storage. Inverted index storage. Processing Boolean queries. Lecture 2 April 9, 2001 Prabhakar Raghavan
Today s topics CS347 Lecture 2 April 9, 2001 Prabhakar Raghavan Inverted index storage Compressing dictionaries into memory Processing Boolean queries Optimizing term processing Skip list encoding Wild-card
More informationUnstructured Data Management. Advanced Topics in Database Management (INFSCI 2711)
Unstructured Data Management Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI,
More informationA Neuro Probabilistic Language Model Bengio et. al. 2003
A Neuro Probabilistic Language Model Bengio et. al. 2003 Class Discussion Notes Scribe: Olivia Winn February 1, 2016 Opening thoughts (or why this paper is interesting): Word embeddings currently have
More informationJan Pedersen 22 July 2010
Jan Pedersen 22 July 2010 Outline Problem Statement Best effort retrieval vs automated reformulation Query Evaluation Architecture Query Understanding Models Data Sources Standard IR Assumptions Queries
More informationrpaf ktl Pen Apache Solr 3 Enterprise Search Server J community exp<= highlighting, relevancy ranked sorting, and more source publishing""
Apache Solr 3 Enterprise Search Server Enhance your search with faceted navigation, result highlighting, relevancy ranked sorting, and more David Smiley Eric Pugh rpaf ktl Pen I I riv IV I J community
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationInformation Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured
More informationLecture 8 May 7, Prabhakar Raghavan
Lecture 8 May 7, 2001 Prabhakar Raghavan Clustering documents Given a corpus, partition it into groups of related docs Recursively, can induce a tree of topics Given the set of docs from the results of
More informationInformation Retrieval CS-E credits
Information Retrieval CS-E4420 5 credits Tokenization, further indexing issues Antti Ukkonen antti.ukkonen@aalto.fi Slides are based on materials by Tuukka Ruotsalo, Hinrich Schütze and Christina Lioma
More informationBoolean Model. Hongning Wang
Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze)
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures
More informationLatent Semantic Indexing
Latent Semantic Indexing Thanks to Ian Soboroff Information Retrieval 1 Issues: Vector Space Model Assumes terms are independent Some terms are likely to appear together synonyms, related words spelling
More informationVALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER
VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur 603 203 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER CS6007-INFORMATION RETRIEVAL Regulation 2013 Academic Year 2018
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 4: Index Construction 1 Plan Last lecture: Dictionary data structures Tolerant retrieval Wildcards Spell correction Soundex a-hu hy-m n-z $m mace madden mo
More informationInformation Retrieval
Introduction to Information Retrieval Boolean retrieval Basic assumptions of Information Retrieval Collection: Fixed set of documents Goal: Retrieve documents with information that is relevant to the user
More information