Introduction to Information Retrieval

Introduction to Information Retrieval
Mohsen Kamyar
Fourth Annual Workshop of the Web Technology Laboratory, Bahman 1391 (January/February 2013)

Outline
The outline follows the classic categorization:
- Information vs. Data Retrieval
- IR Models
- Evaluation Techniques
- Query Types and Issues
- Main Text Issues
- Interesting Topics

Outline
In a practical categorization, these topics are covered implicitly:
- What is the main framework for IR?
- What do the words crawling, indexing, ranking, and query answering mean?
- What are the alternatives to each of these steps?
- How effective is our assembled version of the general framework?

IR vs. DR
In data retrieval we simply retrieve all objects that satisfy clearly defined conditions. Is that enough? Often the user cannot define his or her need, and it can be impossible to state the conditions precisely. In IR, on the other hand, we should return results with respect to user needs. But what are user needs? In this view the main activities are:
- determining relevance to user needs -> RELEVANCE
- determining the user needs themselves -> PROFILING

IR Models
- Classic models:
  - Boolean models, extended by the set-theoretic models: fuzzy models and extended Boolean models
  - Vector models, extended by the algebraic models: the generalized vector model, latent semantic indexing, and neural networks
  - Probabilistic models, extended by the inference network and belief network models

IR Models: Preliminaries
- Term frequency: the number of occurrences of a term in a document divided by the total number of terms in that document.
- Document frequency: the number of documents that contain term i divided by the total number of documents.
- tf-idf: the term frequency multiplied by the logarithm of the inverse document frequency.
- Euclidean distance: (x^2 + y^2 + ...)^(1/2)
- Cosine distance: (X · Y) / (|X| · |Y|)
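
As a concrete illustration of these definitions, here is a minimal Python sketch that computes tf-idf vectors and the cosine measure for a tiny corpus; the documents, the tokenization, and the logarithm base are illustrative assumptions, not part of the slides.

import math
from collections import Counter

# Toy corpus: each document is already tokenized (hypothetical data).
docs = [
    "information retrieval retrieval models".split(),
    "vector space models".split(),
    "boolean retrieval".split(),
]
N = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(tokens):
    # tf = occurrences / document length, idf = log(N / df);
    # terms unseen in the corpus are simply skipped.
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(N / df[t])
            for t, c in counts.items() if t in df}

def cosine(u, v):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tf_idf("retrieval models".split())
for i, doc in enumerate(docs):
    print(i, round(cosine(query, tf_idf(doc)), 3))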

Boolean Models
In this view terms, documents, and queries are related through Boolean variables: a term is either present in a document or not, and queries are always conjunctions of terms. In fuzzy models we use term frequency or tf-idf weights together with a fuzzy inference model to determine the relevance of documents to queries. In extended Boolean models we again use frequency or tf-idf weights, with a Euclidean distance for disjunctive queries and its complement for conjunctive queries.
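
A minimal sketch of pure Boolean retrieval on a toy corpus (hypothetical data): a conjunctive query is answered by intersecting the posting sets of its terms.

# Toy corpus (hypothetical): document id -> text.
docs = {
    1: "information retrieval with boolean queries",
    2: "vector space models for retrieval",
    3: "boolean logic and set operations",
}

# term -> set of documents containing it (a Boolean incidence structure).
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    # A conjunctive query is the intersection of the terms' posting sets.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("boolean", "retrieval"))   # -> {1}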

Vector Models
In this view we use term frequency or tf-idf weights to construct a vector for each document and for the query, and we rank with the cosine measure.
Latent Semantic Indexing: in this model we decompose the term-document matrix, typically with the SVD algorithm (the best-known choice; other decompositions can be used, since the main idea is the decomposition itself). SVD yields three matrices U, S, and V. U contains singular vectors that form the orthogonal components of the space and describes how terms depend on these components. S is the matrix of singular values. V again contains singular vectors and describes how documents depend on the orthogonal components.
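
A minimal sketch of latent semantic indexing with NumPy, assuming a small hypothetical term-document matrix; both the documents and the query are projected onto the k strongest singular directions before they are compared.

import numpy as np

# Hypothetical term-document matrix A: rows are terms, columns are documents.
A = np.array([
    [2, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 2, 0],
], dtype=float)

# SVD: A = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # number of latent components kept
Uk = U[:, :k]

# Project documents and a query (a term-count vector) into the k-dim space.
docs_k = Uk.T @ A                      # one column per document
q = np.array([1.0, 0.0, 1.0, 0.0])     # query containing terms 0 and 2
q_k = Uk.T @ q

# Rank documents by cosine similarity in the latent space.
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(sims)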

Vector Models
Why orthogonal components? If two objects are orthogonal they are independent, and in retrieval we can then retrieve information about object o_i without any concern about o_j. In the real world keywords are not independent, although many retrieval models assume they are.
Neural network models: in these models we construct a three-layer network, with one node per query term in the first layer, one node per vocabulary term in the second, and one node per document in the third. The arcs start with weights similar to those of the general vector model, and the weights are then corrected in a supervised manner.
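
A minimal sketch of one forward pass through such a network, assuming a hypothetical term-document weight matrix W initialized as in the vector model (e.g. normalized tf-idf weights); the supervised weight correction mentioned above is omitted.

import numpy as np

# Hypothetical weights from term nodes to document nodes
# (rows: 3 terms, columns: 3 documents).
W = np.array([
    [0.8, 0.0, 0.2],
    [0.5, 0.7, 0.0],
    [0.0, 0.6, 0.9],
])

# Query layer: which terms appear in the query (terms 0 and 1 here).
query = np.array([1.0, 1.0, 0.0])

term_activation = query               # query-term nodes activate the term nodes
doc_activation = term_activation @ W  # term nodes spread activation to documents
print(doc_activation)                 # initial ranking signal, before any training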

Probabilistic Models
In this model we use conditional probabilities. The ingredients are:
- the query terms,
- the dataset (the set of documents),
- a set R of documents relevant to the query (not a real, observable set).
We then write the conditional probability that a document belongs to R. These probabilities are expanded using the probabilities that keywords are relevant to the query and to the documents. Finally we obtain a set of problems (linear, convex, or nonlinear programs) whose solution determines the unknown conditional probabilities.
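
The slide does not fix a particular estimator; one classic instantiation is the binary independence model, sketched below with hypothetical probability estimates: p[t] = P(term t present | relevant), u[t] = P(term t present | non-relevant), and documents are ranked by the sum of log odds over query terms they contain.

import math

def bim_score(doc_terms, query_terms, p, u):
    # Retrieval status value of the binary independence model:
    # sum of log odds over query terms that occur in the document.
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            score += math.log((p[t] * (1 - u[t])) / (u[t] * (1 - p[t])))
    return score

# Hypothetical probability estimates for two query terms.
p = {"retrieval": 0.8, "boolean": 0.3}
u = {"retrieval": 0.2, "boolean": 0.1}

print(bim_score({"retrieval", "models"}, ["retrieval", "boolean"], p, u))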

Structured Text Retrieval Models
In these models the text is not flat; it has structure. A paper or a book has a fairly simple structure, while an HTML document can have a complex one. These models must account for multiple factors: a term appearing in the subject field, for example, should be treated differently from the same term in the abstract field, and so on. There are two main views: non-overlapping lists and proximal nodes (a hierarchical model).

What Comes After Model Selection?
Is it done? Do we know everything? The model is not the only thing we need. In IR we select a model and then tune it for our application, and in many cases we have to change the model after our first attempts fail. So we select a model according to our overall knowledge of the application, then determine the application's characteristics, and use those characteristics together with evaluation results to tune our approach.

Evaluation Techniques
The main evaluation measures are:
- Precision: the number of retrieved relevant objects divided by the number of retrieved objects.
- Recall: the number of retrieved relevant objects divided by the total number of relevant objects.
Alternative measures, computed after the j-th object in the retrieved list:
- Harmonic mean: F_j = 2 / (1/r_j + 1/P_j)
- E measure: E_j = 1 - (1 + b^2) / (b^2/r_j + 1/P_j)
All evaluations should use standard datasets (for specific applications we can build our own dataset, but this is very difficult; constructing datasets is a wide research field in itself). Famous collections include TREC, MEDLINE, CACM, and ISI.
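
A minimal sketch of these measures on a single ranked result list, with hypothetical retrieved and relevant document ids; P_j and r_j are precision and recall after the first j retrieved documents.

retrieved = [3, 7, 1, 9, 12]          # ranked list returned by the system
relevant = {1, 3, 5, 12}              # ground-truth relevant documents

def precision_recall_at(j):
    hits = sum(1 for d in retrieved[:j] if d in relevant)
    return hits / j, hits / len(relevant)

def f_measure(p, r):
    # Harmonic mean F_j = 2 / (1/r_j + 1/P_j).
    return 0.0 if p == 0 or r == 0 else 2 / (1 / r + 1 / p)

def e_measure(p, r, b=1.0):
    # E_j = 1 - (1 + b^2) / (b^2/r_j + 1/P_j); b trades off recall vs. precision.
    return 1.0 if p == 0 or r == 0 else 1 - (1 + b * b) / (b * b / r + 1 / p)

for j in range(1, len(retrieved) + 1):
    p, r = precision_recall_at(j)
    print(j, round(p, 2), round(r, 2),
          round(f_measure(p, r), 2), round(e_measure(p, r), 2))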

Query Types and Issues
Query types:
- Keyword-based queries: single-word queries, context queries, Boolean queries, natural-language queries
- Pattern-matching queries: for example stemming-based matching for text, or video retrieval given a sample video
- Structural queries

Query Types and Issues
Query issues:
- User relevance feedback: depending on our model, we need a weight-correction scheme (a sketch follows after this list).
- Query expansion: queries are usually very short and therefore not informative enough for retrieval, so we should expand them. One of the main techniques is local clustering; one example of local clustering for the web is the HITS algorithm. Another technique for text is expansion based on a thesaurus, which can be something like WordNet or a statistical model derived from the data.
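
One classic weight-correction scheme for the vector model (not named on the slide) is Rocchio's formula; the following is a minimal sketch assuming the query and the documents are already tf-idf vectors stored as dictionaries, with illustrative values for alpha, beta, and gamma.

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant).
    new_q = {t: alpha * w for t, w in query.items()}
    for docs, sign, weight in ((relevant_docs, 1, beta), (nonrelevant_docs, -1, gamma)):
        if not docs:
            continue
        for doc in docs:
            for t, w in doc.items():
                new_q[t] = new_q.get(t, 0.0) + sign * weight * w / len(docs)
    # Negative weights are usually clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

# Hypothetical feedback: one relevant and one non-relevant document.
q = {"retrieval": 0.7}
rel = [{"retrieval": 0.5, "indexing": 0.4}]
nonrel = [{"boolean": 0.6}]
print(rocchio(q, rel, nonrel))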

Main Text Issues
- Preprocessing: elimination of stop words, detecting noun groups, detecting n-grams, stemming. POS tagging and anaphora resolution operate at a level not covered in classic IR.
- Compression: statistical methods and dictionary methods.
- Indexing: the inverted files method (a sketch follows after this list).
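
A minimal sketch of stop-word removal followed by construction of an inverted file, assuming a toy corpus and a hypothetical stop-word list; a real system would also stem terms and detect noun groups or n-grams.

from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "in"}   # hypothetical stop-word list

def preprocess(text):
    # Lowercase, split on whitespace, and drop stop words.
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_inverted_index(docs):
    # term -> postings list of (doc_id, positions of the term in that doc).
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(preprocess(text)):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return index

docs = {1: "The retrieval of information", 2: "Information retrieval and indexing"}
print(build_inverted_index(docs)["retrieval"])   # -> [(1, [0]), (2, [1])]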

Interesting Topics
- User interfaces and visualization: one of the main problems is the presentation of results, for example in both syntactic and semantic search engines.
- Parallel and distributed IR.
- Multimedia IR: determining similarity in multimedia data.
- Profiling.
- Searching the web: the heterogeneity of its domain, and the various techniques for bombing and other frauds aimed at search engines.