Introduction to Information Retrieval
|
|
- Kristian Melton
- 5 years ago
- Views:
Transcription
1 Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1
2 Text Databases and IR Text databases (document databases) Large collections of documents from various sources: news articles, research papers, books, digital libraries, messages, and Web pages, library database, etc. Data stored is usually semi-structured Information retrieval A field developed in parallel with database systems Information is organized into (a large number of) documents Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents Advanced Distributed Computing 2
3 Information Retrieval Typical IR systems Online library catalogs Online document management systems Information retrieval vs. database systems Some DB problems are not present in IR, e.g., update, transaction management, complex objects Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance Advanced Distributed Computing 3
4 Basic Measures for Text Retrieval Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses) { Relevant} { Retrieved} precision = { Retrieved} Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved precision recall { Relevant} { Retrieved} = { Relevant} Advanced Distributed Computing 4
5 Precision vs. Recall(1) Relevant Relevant & Retrieved All Documents precision precision recall Retrieved relevantirrelevant retrieved & irrelevant retrieved & relevant retrieved { Relevant} { Retrieved} = { Retrieved} { Relevant} { Retrieved} = { Relevant} Not retrieved & irrelevant not retrieved but relevant not retrieved Advanced Distributed Computing 5
6 Recall vs. Precision Return relevant documents but miss many useful ones too precision 1 0 recall 1 The ideal Return mostly relevant documents but include many junks too Advanced Distributed Computing 6
7 IR Techniques(1) Basic Concepts A document can be described by a set of representative keywords called index terms. Different index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document. (e.g.: frequency, tf-idf) DBMS Analogy Index Terms Attributes Weights Attribute Values Advanced Distributed Computing 7
8 IR Techniques(2) Index Terms (Attribute) Selection: Stop list Word stem Index terms weighting methods Terms Documents Frequency Matrices Information Retrieval Models: Boolean Model Vector Model Probabilistic Model Advanced Distributed Computing 8
9 Stop Words From a given Stop Word List [a, about, again, are, the, to, of, ] Remove them from the documents Or, determine stop words Given a large enough corpus of common English Sort the list of words in decreasing order of their occurrence frequency in the corpus Zipf s law: Frequency * rank constant most frequent words tend to be short most frequent 20% of words account for 60% of usage Advanced Distributed Computing 9
10 Zipf s Law -- An illustration Rank(R) Term Frequency (F) R*F (10**6) 1 the 69, of 36, and 28, to 26, a 23, in 21, that 10, is 10, was 9, he 9, Advanced Distributed Computing 10
11 Resolving Power of Word Non-significant high-frequency terms Non-significant low-frequency terms Presumed resolving power of significant words Words in decreasing frequency order Advanced Distributed Computing 11
12 Simple Indexing Scheme Based on Zipf s Law Use term frequency information only: Compute frequency of term k in document i, Freq ik Determine total collection frequency TotalFreq k = Freq ik for i = 1, 2,, n Arrange terms in order of collection frequency Set thresholds - eliminate high and low frequency terms Use remaining terms as index terms Advanced Distributed Computing 12
13 Stemming Stemming: transforming words to root form Computing, Computer, Computation comput Suffix based methods Remove ability from computability +ness, +ive, remove Suffix list + context rules Advanced Distributed Computing 13
14 Thesaurus Rules A thesaurus aims at classification of words in a language for a word, it gives related terms which are broader than, narrower than, same as (synonyms) and opposed to (antonyms) of the given word (other kinds of relationships may exist, e.g., composed of) Static Thesaurus Tables [anneal, strain], [antenna, receiver], Roget s thesaurus WordNet at Preinceton Advanced Distributed Computing 14
15 Thesaurus Rules can also be Learned From a search engine query log After typing queries, browse If query1 and query2 leads to the same document Then, Similar(query1, query2) If query1 leads to Document with title keyword K, Then, Similar(query1, K) Then, transitivity Advanced Distributed Computing 15
16 Indexing Techniques Inverted index Maintains two hash- or B+-tree indexed tables: document_table: a set of document records <doc_id, postings_list> term_table: a set of term records, <term, postings_list> Answer query: Find all docs associated with one or a set of terms + easy to implement do not handle well synonymy and polysemy, and posting lists could be too long (storage could be very large) Signature file Associate a signature with each document A signature is a representation of an ordered list of terms that describe the document Order is obtained by frequency analysis, stemming and stop lists Advanced Distributed Computing 16
17 Boolean Model Consider that index terms are either present or absent in a document As a result, the index term weights are assumed to be all binaries A query is composed of index terms linked by three connectives: not, and, and or e.g.: car and repair, plane or airplane The Boolean model predicts that each document is either relevant or non-relevant based on the match of a document to the query Advanced Distributed Computing 17
18 Boolean Model: Keyword-Based Retrieval A document is represented by a string, which can be identified by a set of keywords Queries may use expressions of keywords E.g., car and repair shop, tea or coffee, DBMS but not Oracle Queries and retrieval should consider synonyms, e.g., repair and maintenance Major difficulties of the model Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining Polysemy: The same keyword may mean different things in different contexts, e.g., mining Advanced Distributed Computing 18
19 The Vector-Space Model The distinct terms are available; call them index terms or the vocabulary The index terms represent important terms for an application a vector to represent the document <T1,T2,T3,T4,T5> or <W(T1),W(T2),W(T3),W(T4),W(T5)> computer science collection T1=architecture T2=bus T3=computer T4=database T5=xml index terms or vocabulary of the collection Advanced Distributed Computing 19
20 The Vector-Space Model Assumptions: words are uncorrelated Given: 1. N documents and a Query 2. Query considered a document too 2. Each represented by t terms 3. Each term j in document i has weight d ij 4. We will deal with how to compute the weights later T 1 T 2. T t D 1 d 11 d 12 d 1t D 2 d 21 d 22 d 2t : : : : : : : : D n d n1 d n2 d nt Q q... 1 q 2 q t Advanced Distributed Computing 20
21 Graphic Representation Example: D 1 = 2T 1 + 3T 2 + 5T 3 D 1 = 2T 1 + 3T 2 + 5T 3 5 T 3 D 2 = 3T 1 + 7T 2 + T 3 Q = 0T 1 + 0T 2 Q = 0T 1 + 0T 2 + 2T T 1 + 2T 3 D 2 = 3T 1 + 7T 2 + T 3 T 2 7 Is D 1 or D 2 more similar to Q? How to measure the degree of similarity? Distance? Angle? Projection? Advanced Distributed Computing 21
22 Similarity Measure - Inner Product Similarity between documents D i and query Q can be computed as the inner vector product: t sim ( D i, Q ) = (D i Q) Binary: weight = 1 if word present, 0 o/w Non-binary: weight represents degree of similary k= 1 t = = Example: TF/IDF we explain later j 1 d * q ij j Advanced Distributed Computing 22
23 Inner Product -- Examples Binary: D = 1, 1, 1, 0, 1, 1, 0 Q = 1, 0, 1, 0, 0, 1, 1 sim(d, Q) = 3 retrieval database architecture computer text management information Size of vector = size of vocabulary = 7 Weighted D 1 = 2T 1 + 3T 2 + 5T 3 Q = 0T 1 + 0T 2 + 2T 3 sim(d 1, Q) = 2*0 + 3*0 + 5*2 = 10 Advanced Distributed Computing 23
24 Properties of Inner Product The inner product similarity is unbounded Favors long documents long document a large number of unique terms, each of which may occur many times measures how many terms matched but not how many terms not matched Advanced Distributed Computing 24
25 Cosine Similarity Measures Cosine similarity measures the cosine of the angle between two vectors Inner product normalized by D 1 the vector lengths θ 2 CosSim(D i, Q) = t k = 1 t d ( d ik ik 2 q ) k k = 1 k = 1 t q k 2 t 2 θ 1 D 2 t 3 Q t 1 Advanced Distributed Computing 25
26 Cosine Similarity: an Example D 1 = 2T 1 + 3T 2 + 5T 3 CosSim(D 1, Q) = 5 / 38 = 0.81 D 2 = 3T 1 + 7T 2 + T 3 CosSim(D 2, Q) = 1 / 59 = 0.13 Q = 0T 1 + 0T 2 + 2T 3 D 1 is 6 times better than D 2 using cosine similarity but only 5 times better using inner product Advanced Distributed Computing 26
27 Document and Term Weights Document term weights are calculated using frequencies in documents (tf) and in collection (idf) tf ij = frequency of term j in document i df j = document frequency of term j = number of documents containing term j idf j = inverse document frequency of term j = log 2 (N/ df j ) (N: number of documents in collection) Inverse document frequency -- an indication of term values as a document discriminator. Advanced Distributed Computing 27
28 Term Weight Calculations Weight of the jth term in ith document: d ij = tf ij idf j = tf ij log 2 (N/ df j ) TF Term Frequency A term occurs frequently in the document but rarely in the remaining of the collection has a high weight Let max l {tf lj } be the term frequency of the most frequent term in document j Normalization: term frequency = tf ij /max l {tf lj } Advanced Distributed Computing 28
29 An example of TF Document=(A Computer Science Student Uses Computers) Vector Model based on keywords (Computer, Engineering, Student) Tf(Computer) = 2 Tf(Engineering)=0 Tf(Student) = 1 Max(Tf)=2 TF weight for: Computer = 2/2 = 1 Engineering = 0/2 = 0 Student = ½ = 0.5 Advanced Distributed Computing 29
30 Inverse Document Frequency Df j gives the number of times term j appeared among N documents IDF = 1/DF Typically use log 2 (N/ df j ) for IDF Example: given 1000 documents, computer appeared in 200 of them, IDF= log 2 (1000/ 200) =log 2 (5) Advanced Distributed Computing 30
31 TF IDF d ij = (tf ij /max l {tf lj }) idf j = (tf ij /max l {tf lj }) log 2 (N/ df j ) Can use this to obtain non-binary weights Used in the SMART Information Retrieval System by the late Gerald Salton and MJ McGill, Cornell University to tremendous success, 1983 Advanced Distributed Computing 31
32 Implementation based on Inverted Files In practice, document vectors are not stored directly; an inverted organization provides much better access speed. The index file can be implemented as a hash file, a sorted list, or a B-tree. Index terms df computer database D 7, 4 D 1, 3 science 4 D 2, 4 system 1 D 5, D j, tf j Advanced Distributed Computing 32
33 Latent Semantic Indexing (1) Basic idea The size of the term frequency matrix is very large Use a singular value decomposition (SVD) techniques to reduce the size of frequency table Retain the K most significant rows of the frequency table Method Create a term x document weighted frequency matrix A SVD construction: A = U * S * V Define K and obtain U k,, S k, and V k. Create query vector q. Project q into the term-document space: Dq = q * U k * S k -1 Calculate similarities: cos α = Dq. D / Dq * D Advanced Distributed Computing 33
34 Latent Semantic Indexing (2) Weighted Frequency Matrix Query Terms: - Insulation -Joint Advanced Distributed Computing 34
35 Probabilistic Model Basic assumption: Given a user query, there is a set of documents which contains exactly the relevant documents and no other (ideal answer set) Querying process as a process of specifying the properties of an ideal answer set. Since these properties are not known at query time, an initial guess is made This initial guess allows the generation of a preliminary probabilistic description of the ideal answer set which is used to retrieve the first set of documents An interaction with the user is then initiated with the purpose of improving the probabilistic description of the answer set Advanced Distributed Computing 35
36 Reference Richardo Baeza-Yates, Berthir Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 机械工业出版社影印出版, 2004 Advanced Distributed Computing 36
Information Retrieval. (M&S Ch 15)
Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion
More informationIn = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most
In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100
More informationCS 6320 Natural Language Processing
CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic
More informationCHAPTER-26 Mining Text Databases
CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationDesigning and Building an Automatic Information Retrieval System for Handling the Arabic Data
American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far
More informationQ: Given a set of keywords how can we return relevant documents quickly?
Keyword Search Traditional B+index is good for answering 1-dimensional range or point query Q: What about keyword search? Geo-spatial queries? Q: Documents on Computer Science? Q: Nearby coffee shops?
More informationMultimedia Information Systems
Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive
More informationBasic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert
More informationInternational Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.
A B S T R A C T International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Information Retrieval Models and Searching Methodologies: Survey Balwinder Saini*,Vikram Singh,Satish
More informationCS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University
CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and
More informationInformation Retrieval: Retrieval Models
CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models
More informationModern Information Retrieval
Modern Information Retrieval Chapter 3 Modeling Part I: Classic Models Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Chap 03: Modeling,
More informationInformation Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Information Retrieval CS 6900 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Boolean Retrieval vs. Ranked Retrieval Many users (professionals) prefer
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval Mohsen Kamyar چهارمین کارگاه ساالنه آزمایشگاه فناوری و وب بهمن ماه 1391 Outline Outline in classic categorization Information vs. Data Retrieval IR Models Evaluation
More informationmodern database systems lecture 4 : information retrieval
modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation
More informationInformation Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes
CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten
More informationSocial Media Computing
Social Media Computing Lecture 4: Introduction to Information Retrieval and Classification Lecturer: Aleksandr Farseev E-mail: farseev@u.nus.edu Slides: http://farseev.com/ainlfruct.html At the beginning,
More informationContents 1. INTRODUCTION... 3
Contents 1. INTRODUCTION... 3 2. WHAT IS INFORMATION RETRIEVAL?... 4 2.1 FIRST: A DEFINITION... 4 2.1 HISTORY... 4 2.3 THE RISE OF COMPUTER TECHNOLOGY... 4 2.4 DATA RETRIEVAL VERSUS INFORMATION RETRIEVAL...
More informationInformation Retrieval CSCI
Information Retrieval CSCI 4141-6403 My name is Anwar Alhenshiri My email is: anwar@cs.dal.ca I prefer: aalhenshiri@gmail.com The course website is: http://web.cs.dal.ca/~anwar/ir/main.html 5/6/2012 1
More information60-538: Information Retrieval
60-538: Information Retrieval September 7, 2017 1 / 48 Outline 1 what is IR 2 3 2 / 48 Outline 1 what is IR 2 3 3 / 48 IR not long time ago 4 / 48 5 / 48 now IR is mostly about search engines there are
More informationRepresentation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s
Summary agenda Summary: EITN01 Web Intelligence and Information Retrieval Anders Ardö EIT Electrical and Information Technology, Lund University March 13, 2013 A Ardö, EIT Summary: EITN01 Web Intelligence
More informationInformation Retrieval. hussein suleman uct cs
Information Management Information Retrieval hussein suleman uct cs 303 2004 Introduction Information retrieval is the process of locating the most relevant information to satisfy a specific information
More informationCHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING
43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic
More informationLecture 7: Relevance Feedback and Query Expansion
Lecture 7: Relevance Feedback and Query Expansion Information Retrieval Computer Science Tripos Part II Ronan Cummins Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk
More informationChapter 3 - Text. Management and Retrieval
Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 3 - Text Management and Retrieval Literature: Baeza-Yates, R.;
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,
More informationInformation Retrieval & Text Mining
Information Retrieval & Text Mining Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References 2 Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
More informationSYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT
SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND
More informationBalancing Manual and Automatic Indexing for Retrieval of Paper Abstracts
Balancing Manual and Automatic Indexing for Retrieval of Paper Abstracts Kwangcheol Shin 1, Sang-Yong Han 1, and Alexander Gelbukh 1,2 1 Computer Science and Engineering Department, Chung-Ang University,
More informationText Analytics (Text Mining)
CSE 6242 / CX 4242 Apr 1, 2014 Text Analytics (Text Mining) Concepts and Algorithms Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer,
More information10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues
COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization
More informationExam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:
English Student no:... Page 1 of 14 Contact during the exam: Geir Solskinnsbakk Phone: 735 94218/ 93607988 Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:
More informationInformation Retrieval
s Information Retrieval Information system management system Model Processing of queries/updates Queries Answer Access to stored data Patrick Lambrix Department of Computer and Information Science Linköpings
More informationInstructor: Stefan Savev
LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information
More informationData Modelling and Multimedia Databases M
ALMA MATER STUDIORUM - UNIERSITÀ DI BOLOGNA Data Modelling and Multimedia Databases M International Second cycle degree programme (LM) in Digital Humanities and Digital Knoledge (DHDK) University of Bologna
More informationInformation Retrieval
Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationInformation Retrieval. Information Retrieval and Web Search
Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent
More informationRelevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline
Relevance Feedback and Query Reformulation Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price IR on the Internet, Spring 2010 1 Outline Query reformulation Sources of relevance
More informationModels for Document & Query Representation. Ziawasch Abedjan
Models for Document & Query Representation Ziawasch Abedjan Overview Introduction & Definition Boolean retrieval Vector Space Model Probabilistic Information Retrieval Language Model Approach Summary Overview
More informationChapter 27 Introduction to Information Retrieval and Web Search
Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval
More informationEncoding Words into String Vectors for Word Categorization
Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationBoolean Model. Hongning Wang
Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer
More informationInformation Retrieval and Data Mining Part 1 Information Retrieval
Information Retrieval and Data Mining Part 1 Information Retrieval 2005/6, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Information Retrieval - 1 1 Today's Question 1. Information
More informationCMSC 476/676 Information Retrieval Midterm Exam Spring 2014
CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 Name: You may consult your notes and/or your textbook. This is a 75 minute, in class exam. If there is information missing in any of the question
More informationOutline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity
Outline Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity Lecture 10 CS 410/510 Information Retrieval on the Internet Query reformulation Sources of relevance for feedback Using
More informationOutline of the course
Outline of the course Introduction to Digital Libraries (15%) Description of Information (30%) Access to Information (30%) User Services (10%) Additional topics (15%) Buliding of a (small) digital library
More informationMidterm Exam Search Engines ( / ) October 20, 2015
Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points
More informationJames Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!
James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation
More informationCSE 494: Information Retrieval, Mining and Integration on the Internet
CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:
More informationCSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 1)"
CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 1)" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Retrieval Models" Provide
More informationDigital Libraries: Language Technologies
Digital Libraries: Language Technologies RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Recall: Inverted Index..........................................
More informationNatural Language Processing
Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search Relevance Feedback. Query Expansion Instructor: Rada Mihalcea Intelligent Information Retrieval 1. Relevance feedback - Direct feedback - Pseudo feedback 2. Query expansion
More informationTowards Understanding Latent Semantic Indexing. Second Reader: Dr. Mario Nascimento
Towards Understanding Latent Semantic Indexing Bin Cheng Supervisor: Dr. Eleni Stroulia Second Reader: Dr. Mario Nascimento 0 TABLE OF CONTENTS ABSTRACT...3 1 INTRODUCTION...4 2 RELATED WORKS...6 2.1 TRADITIONAL
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability
More informationModern Information Retrieval
Modern Information Retrieval Chapter 5 Relevance Feedback and Query Expansion Introduction A Framework for Feedback Methods Explicit Relevance Feedback Explicit Feedback Through Clicks Implicit Feedback
More informationWeb Information Retrieval using WordNet
Web Information Retrieval using WordNet Jyotsna Gharat Asst. Professor, Xavier Institute of Engineering, Mumbai, India Jayant Gadge Asst. Professor, Thadomal Shahani Engineering College Mumbai, India ABSTRACT
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationTag-based Social Interest Discovery
Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture
More informationLevel of analysis Finding Out About Chapter 3: 25 Sept 01 R. K. Belew
Overview The fascination with the subliminal, the camouflaged, and the encrypted is ancient. Getting a computer to munch away at long strings of letters from the Old Testament is not that different from
More informationElementary IR: Scalable Boolean Text Search. (Compare with R & G )
Elementary IR: Scalable Boolean Text Search (Compare with R & G 27.1-3) Information Retrieval: History A research field traditionally separate from Databases Hans P. Luhn, IBM, 1959: Keyword in Context
More informationRelevance Feedback & Other Query Expansion Techniques
Relevance Feedback & Other Query Expansion Techniques (Thesaurus, Semantic Network) (COSC 416) Nazli Goharian nazli@cs.georgetown.edu Slides are mostly based on Informion Retrieval Algorithms and Heuristics,
More informationVector Space Models: Theory and Applications
Vector Space Models: Theory and Applications Alexander Panchenko Centre de traitement automatique du langage (CENTAL) Université catholique de Louvain FLTR 2620 Introduction au traitement automatique du
More informationCS54701: Information Retrieval
CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 6: Index Compression Paul Ginsparg Cornell University, Ithaca, NY 15 Sep
More informationInformation Retrieval on the Internet (Volume III, Part 3, 213)
Information Retrieval on the Internet (Volume III, Part 3, 213) Diana Inkpen, Ph.D., University of Toronto Assistant Professor, University of Ottawa, 800 King Edward, Ottawa, ON, Canada, K1N 6N5 Tel. 1-613-562-5800
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationA Model for Information Retrieval Agent System Based on Keywords Distribution
A Model for Information Retrieval Agent System Based on Keywords Distribution Jae-Woo LEE Dept of Computer Science, Kyungbok College, 3, Sinpyeong-ri, Pocheon-si, 487-77, Gyeonggi-do, Korea It2c@koreaackr
More informationVannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17
Information Retrieval Vannevar Bush Director of the Office of Scientific Research and Development (1941-1947) Vannevar Bush,1890-1974 End of WW2 - what next big challenge for scientists? 1 Historic Vision
More informationInformation Retrieval
Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid
More informationMultimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency
Multimedia Information Extraction and Retrieval Term Frequency Inverse Document Frequency Ralf Moeller Hamburg Univ. of Technology Acknowledgement Slides taken from presentation material for the following
More informationLecture 5: Information Retrieval using the Vector Space Model
Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query
More informationSuresha M Department of Computer Science, Kuvempu University, India
International Journal of Scientific Research in Computer Science, Engineering and Information echnology 2017 IJSRCSEI Volume 2 Issue 6 ISSN : 2456-3307 Document Analysis using Similarity Measures : A Case
More informationCIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.
CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing University of Florida, CISE Department Prof. Daisy Zhe Wang Text To Knowledge IR and Boolean Search Text to Knowledge (IE)
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationFeature selection. LING 572 Fei Xia
Feature selection LING 572 Fei Xia 1 Creating attribute-value table x 1 x 2 f 1 f 2 f K y Choose features: Define feature templates Instantiate the feature templates Dimensionality reduction: feature selection
More informationCOMP6237 Data Mining Searching and Ranking
COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001
More informationClassic IR Models 5/6/2012 1
Classic IR Models 5/6/2012 1 Classic IR Models Idea Each document is represented by index terms. An index term is basically a (word) whose semantics give meaning to the document. Not all index terms are
More informationInformation Retrieval
Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing
More informationInverted Indexes. Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5
Inverted Indexes Indexing and Searching, Modern Information Retrieval, Addison Wesley, 2010 p. 5 Basic Concepts Inverted index: a word-oriented mechanism for indexing a text collection to speed up the
More informationBasic techniques. Text processing; term weighting; vector space model; inverted index; Web Search
Basic techniques Text processing; term weighting; vector space model; inverted index; Web Search Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis Query processing
More informationThis lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring
This lecture: IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring 1 Ch. 6 Ranked retrieval Thus far, our queries have all
More informationAuthoritative K-Means for Clustering of Web Search Results
Authoritative K-Means for Clustering of Web Search Results Gaojie He Master in Information Systems Submission date: June 2010 Supervisor: Kjetil Nørvåg, IDI Co-supervisor: Robert Neumayer, IDI Norwegian
More informationAN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY
Asian Journal Of Computer Science And Information Technology 2: 3 (2012) 26 30. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal
More informationOptimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.
Optimal Query Assume that the relevant set of documents C r are known. Then the best query is: q opt 1 C r d j C r d j 1 N C r d j C r d j Where N is the total number of documents. Note that even this
More informationDefinitions. Lecture Objectives. Text Technologies for Data Science INFR Learn about main concepts in IR 9/19/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Definitions Instructor: Walid Magdy 19-Sep-2017 Lecture Objectives Learn about main concepts in IR Document Information need Query Index BOW 2 1 IR in a nutshell
More informationA Study on Information Retrieval Methods in Text Mining
A Study on Information Retrieval Methods in Text Mining 1 Dr.M.Suresh Babu, Professor & Head, Department of Computer Applications, Madanapalle Institute of Technology & Science, Madanapalle,India.. 2 Mr.
More informationIndexing and Query Processing. What will we cover?
Indexing and Query Processing CS 510 Winter 2007 1 What will we cover? Key concepts and terminology Inverted index structures Organization, creation, maintenance Compression Distribution Answering queries
More informationVK Multimedia Information Systems
VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Information Retrieval Basics: Agenda Vector
More informationIMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL
IMPROVING THE RELEVANCY OF DOCUMENT SEARCH USING THE MULTI-TERM ADJACENCY KEYWORD-ORDER MODEL Lim Bee Huang 1, Vimala Balakrishnan 2, Ram Gopal Raj 3 1,2 Department of Information System, 3 Department
More informationMaking Retrieval Faster Through Document Clustering
R E S E A R C H R E P O R T I D I A P Making Retrieval Faster Through Document Clustering David Grangier 1 Alessandro Vinciarelli 2 IDIAP RR 04-02 January 23, 2004 D a l l e M o l l e I n s t i t u t e
More informationExam IST 441 Spring 2014
Exam IST 441 Spring 2014 Last name: Student ID: First name: I acknowledge and accept the University Policies and the Course Policies on Academic Integrity This 100 point exam determines 30% of your grade.
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 06 Scoring, Term Weighting and the Vector Space Model 1 Recap of lecture 5 Collection and vocabulary statistics: Heaps and Zipf s laws Dictionary
More informationDepartment of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _
COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.
More informationAutomatic Document; Retrieval Systems. The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey
I. Automatic Document; Retrieval Systems The conventional library classifies documents by numeric subject codes which are assigned manually (Dewey decimal aystemf Library of Congress system). Cross-indexing
More informationGlossary. ASCII: Standard binary codes to represent occidental characters in one byte.
Glossary ASCII: Standard binary codes to represent occidental characters in one byte. Ad hoc retrieval: standard retrieval task in which the user specifies his information need through a query which initiates
More informationRetrieving Model for Design Patterns
Retrieving Model for Design Patterns 51 Retrieving Model for Design Patterns Sarun Intakosum and Weenawadee Muangon, Non-members ABSTRACT The purpose of this research is to develop a retrieving model for
More information