Introduction to Information Retrieval
Mohsen Kamyar
Fourth Annual Workshop of the Web Technology Laboratory, Bahman 1391 (January/February 2013)
Outline
Outline in classic categorization:
- Information vs. Data Retrieval
- IR Models
- Evaluation Techniques
- Query Types and Issues
- Main Text Issues
- Interesting Topics
Outline
Outline in practical categorization; these topics are covered implicitly:
- What is the main framework for IR?
- What do these words mean: crawling, indexing, ranking, query answering?
- What are the alternatives to each of these steps?
- How effective is our assembly of the general framework?
IR vs. DR
In Data Retrieval, we simply retrieve all objects that satisfy clearly defined conditions. Is that enough? Often, the user cannot define his/her need, and it may be impossible to state the conditions precisely. In IR, on the other hand, we should return results with respect to the user's needs. But what are the user's needs? In this view the main activities are:
- determining relevance to user needs -> RELEVANCE
- determining user needs -> PROFILING
IR Models
Classic Models:
- Boolean Models, extended by set-theoretic models: Fuzzy Models, Extended Boolean Models
- Vector Models, extended by algebraic models: Generalized Vector Model, Latent Semantic Indexing, Neural Networks
- Probabilistic Models, extended by: Inference Networks, Belief Networks
IR Models: Preliminaries
- Term frequency: the number of occurrences of a term in a document divided by the total number of terms in that document.
- Document frequency: the number of documents that contain term i divided by the total number of documents.
- tf-idf: the product of the term frequency and the logarithm of the inverse document frequency.
- Euclidean distance: (x_1^2 + x_2^2 + ...)^(1/2).
- Cosine similarity: (X . Y) / (|X| |Y|).
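The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names and the toy data are assumptions made for the example, and documents are represented simply as lists of tokens.

```python
import math

def tf(term, doc):
    # term frequency: occurrences of the term divided by total terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log of (total documents / documents containing the term)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    # tf-idf: term frequency multiplied by the log of the inverse document frequency
    return tf(term, doc) * idf(term, docs)

def cosine(x, y):
    # cosine similarity: (X . Y) / (|X| |Y|)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```

Note that parallel vectors get cosine similarity 1 regardless of their lengths, which is exactly why the cosine measure is preferred over Euclidean distance for comparing documents of different sizes.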
Boolean Models
In this view, terms, documents, and queries are related through Boolean variables: a term is either present in a document or it is not, and queries are always conjunctions of terms.
- In fuzzy models we use term frequency or tf-idf weights and apply a fuzzy inference model to determine the relevance of documents to queries.
- In extended Boolean models we again use term frequency or tf-idf weights, and use the Euclidean distance for disjunctive queries and the complementary Euclidean distance for conjunctive queries.
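The extended Boolean idea above can be sketched as follows, using the p = 2 (Euclidean) case: a disjunctive query scores a document by its distance from the all-zero point, a conjunctive query by its closeness to the all-one ideal point. This is an illustrative sketch under those assumptions; the function names are invented for the example, and the inputs are per-term weights in [0, 1].

```python
import math

def sim_or(weights):
    # disjunctive (OR) query: normalized Euclidean distance from the origin (0, ..., 0);
    # the farther the document is from "no term matches", the better
    return math.sqrt(sum(w * w for w in weights) / len(weights))

def sim_and(weights):
    # conjunctive (AND) query: one minus the normalized distance from the
    # ideal point (1, ..., 1); the closer to "all terms match fully", the better
    return 1 - math.sqrt(sum((1 - w) ** 2 for w in weights) / len(weights))
```

With all weights 1 both similarities are 1, and with all weights 0 both are 0, so the model degrades gracefully to the strict Boolean case at the extremes.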
Vector Models
In this view we use term frequency or tf-idf weights to construct a vector for each document and query, and we use the cosine measure to compare them.
Latent Semantic Indexing: in this model we decompose the term-document matrix using the SVD algorithm (the most famous algorithm in this area, though other algorithms can be used; the main concern is the decomposition itself). SVD gives three matrices U, S, V:
- U is a matrix of singular vectors that form the orthogonal components of our space; it captures the dependence between terms and these components.
- S is the matrix of singular values.
- V again holds singular vectors, but captures the dependence of documents on the orthogonal components.
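The LSI decomposition described above can be sketched with NumPy. The toy term-document matrix and the choice of k are assumptions made for the example; the point is only to show the U, S, V factors and the rank-k truncation.

```python
import numpy as np

# toy term-document matrix: rows are terms, columns are documents (assumed counts)
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])

# SVD: A = U * S * V^T, with U and V holding orthogonal singular vectors
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# keep only the k largest singular values: the "latent" orthogonal concepts
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# documents are then compared (e.g. by cosine) in the reduced concept space
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]  # one k-dimensional column per document
```

Truncating to k concepts is what lets LSI match documents that share no literal terms with the query but lie close together in the latent space.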
Vector Models
Why orthogonal components? If two objects are orthogonal, they are independent; in retrieval, if two objects are independent, we can retrieve information about object o_i with no concern about o_j. In the real world keywords are not independent (but many retrieval models assume they are).
Neural Network Models: in these models we construct a three-layer network: in layer one, one node per query term; in layer two, one node per term; in layer three, one node per document. We start with an initial arc weighting similar to the generalized vector model, and then correct the weights in a supervised manner.
Probabilistic Models
In this model we use conditional probabilities. The ingredients are:
- the query terms,
- the dataset (the set of documents),
- a set R of documents relevant to the query (it is not a real, observed set).
We then construct conditional probabilities of a document's relevance to R. These probabilities are expanded using the probabilities of relevance of keywords to the query and of keywords to documents. Finally we construct a set of problems (linear/convex/nonlinear programming) and solve for the unknown conditional probabilities.
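One concrete instance of this idea is the classic Binary Independence Model scoring rule with its standard initial estimates (since R is unknown, P(term | relevant) is taken as 0.5 and P(term | non-relevant) is approximated by the document frequency). The sketch below is an assumption-laden illustration, not the slides' own formulation; documents are represented as sets of terms, and the smoothing constants are the conventional 0.5/1 choice.

```python
import math

def bim_score(query_terms, doc, docs):
    # Binary Independence Model with the standard initial estimates:
    #   P(term present | relevant)     = 0.5         (R is unknown)
    #   P(term present | non-relevant) ~ df / N      (term's document frequency)
    N = len(docs)
    score = 0.0
    for t in set(query_terms):
        if t not in doc:
            continue  # only terms shared by query and document contribute
        df = sum(1 for d in docs if t in d)
        p = 0.5
        q = (df + 0.5) / (N + 1)  # smoothed to keep the log finite
        score += math.log(p * (1 - q) / (q * (1 - p)))
    return score
```

Rare query terms get large weights and very common ones get weights near zero, which is the probabilistic justification for idf-style weighting.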
Structured Text Retrieval Models
In these models the text is not flat; it has a structure. For example, consider a paper or a book (these structures are quite simple), or an HTML document (which can have a complex structure). Here we must take multiple factors into account: for instance, a term appearing in the subject field should be treated differently from the same term in the abstract field, and so on. There are two main views:
- Non-Overlapping Lists
- Proximal Nodes (hierarchical model)
What after model selection?
Is it done? Do we know everything? The model is not the only thing we need. In IR we should select a model and then tune it for our application; in many cases we need to change the model after our first attempts fail. So we select a model according to our overall knowledge of the application, then determine the application's characteristics, and use those characteristics together with evaluation results to tune our approach.
Evaluation Techniques
The main evaluation measures are:
- Precision: the number of retrieved desired objects divided by the number of retrieved objects.
- Recall: the number of retrieved desired objects divided by the number of desired objects.
Alternative measures, computed for the j-th object in the retrieved list:
- Harmonic mean: F_j = 2 / (1/r_j + 1/P_j)
- E measure: E_j = 1 - (1 + b^2) / (b^2/r_j + 1/P_j)
In all evaluations we should use standard datasets (for specific applications we can build our own dataset, but this is very difficult; constructing datasets is a wide field of research). Famous ones are TREC, MEDLINE, CACM, and ISI.
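The measures above translate directly into code. A minimal sketch, with function names invented for the example; the retrieved and relevant sets are given as plain collections of document ids.

```python
def precision(retrieved, relevant):
    # fraction of retrieved objects that are desired
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of desired objects that were retrieved
    return len(set(retrieved) & set(relevant)) / len(relevant)

def f_measure(retrieved, relevant):
    # harmonic mean: F = 2 / (1/r + 1/P)
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 2 / (1 / r + 1 / p) if p and r else 0.0

def e_measure(retrieved, relevant, b=1.0):
    # E = 1 - (1 + b^2) / (b^2/r + 1/P); b > 1 emphasizes recall, b < 1 precision
    p, r = precision(retrieved, relevant), recall(retrieved, relevant)
    return 1 - (1 + b * b) / (b * b / r + 1 / p) if p and r else 1.0
```

With b = 1 the E measure reduces to 1 - F, so the two measures carry the same information at that setting.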
Query Types and Issues
Query types:
- Keyword-based queries: single-word queries, context queries, Boolean queries, natural language
- Pattern-matching queries: e.g., stemming for text, video retrieval given a sample video, etc.
- Structural queries
Query Types and Issues
Query issues:
- User relevance feedback: depending on our model, we need a weight-correction scheme.
- Query expansion: queries are usually very short and by themselves may not be useful for retrieval, so we should expand them. One of the main techniques is local clustering; a sample of local clustering for the web is the HITS algorithm. Another technique for text is expansion based on a thesaurus, which can be something like WordNet or a statistical model derived from the data.
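Thesaurus-based expansion, the second technique mentioned above, can be sketched very simply. The tiny hand-written thesaurus and the function name below are assumptions made for the example; in practice the synonym map would come from WordNet or from co-occurrence statistics mined from the collection.

```python
# illustrative hand-written thesaurus; in a real system this would be
# WordNet synsets or a statistical synonym model learned from the data
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand_query(terms, thesaurus=THESAURUS):
    # append each term's synonyms to the query, skipping duplicates
    expanded = list(terms)
    for t in terms:
        for syn in thesaurus.get(t, []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded
```

The expanded term list is then fed to the retrieval model exactly like an ordinary query, trading some precision for recall.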
Main Text Issues
Preprocessing:
- Elimination of stopwords
- Detecting noun groups
- Detecting n-grams
- Stemming
Not covered in classic IR:
- POS tagging
- Anaphora resolution
Compression:
- Statistical methods
- Dictionary methods
Indexing:
- Inverted files method
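Two of the items above, stopword elimination and the inverted files method, can be illustrated together in a short sketch. The stopword list and function names are assumptions for the example; real systems use much larger stopword lists and compressed posting lists.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and"}  # tiny illustrative stopword list

def build_inverted_index(docs):
    # maps each non-stopword term to the sorted list of document ids containing it
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            if term not in STOPWORDS:
                index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def lookup(index, term):
    # posting list for a term; empty if the term never occurs (or is a stopword)
    return index.get(term.lower(), [])
```

Query answering then reduces to merging posting lists (intersection for AND, union for OR), which is why the inverted file is the dominant index structure in text retrieval.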
Interesting Topics
- User Interfaces and Visualization: a main problem is the presentation of results, for example in both syntactic and semantic search engines.
- Parallel and Distributed IR
- Multimedia IR: determining similarity in multimedia data.
- Profiling
- Searching the Web: the heterogeneity of its domain, and the various bombing techniques and other frauds aimed at search engines.