Information Retrieval

Natural Language Processing SoSe 2014 Information Retrieval Dr. Mariana Neves June 18th, 2014 (based on the slides of Dr. Saeedeh Momtazi)

Outline
- Introduction
- Indexing Block
  - Document Crawling
  - Text Processing
  - Index Storing
- Querying Block
  - Query Processing
  - Retrieval Models
  - Result Representation
- Evaluation

Information Retrieval
- The most popular uses of computers (and the Internet)
  - Search
  - Communication
- Information Retrieval (IR)
  - The field of computer science most involved with R&D for search
- Primary focus of IR is on text and documents
  - Mostly textual content
  - Some structured data
    - Papers: title, author, date, publisher
    - Email: subject, sender, receiver, date

IR Dimensions
- Web search: most common (search engines)
- Vertical search: restricted domain/topic (books, movies, suppliers)
- Enterprise search: corporate intranet (emails, web pages, documentation, code, wikis, tags, directories, presentations, spreadsheets)
- Desktop search: personal enterprise search
- P2P search: no centralized control (file sharing, shared locality)
- Forum search
- ...

IR Architecture
- Indexing
  - Text Acquisition
  - Text Transformation
  - Index Creation
- Querying
  - User Interaction
  - Ranking

Indexing Architecture
(figure taken from the slides of Dr. Saeedeh Momtazi)

Web Crawler
- Identifying and acquiring documents for the search engine
- Following links to find documents
- Finding huge numbers of web pages efficiently (coverage)
- Keeping the crawled data up-to-date (freshness)

Text Processing
- Parsing
- Tokenizing
- Stopword filtering
- Stemming

Parsing
- Recognizing structural elements
  - Titles
  - Links
  - Headings
  - ...
- Using the syntax of markup languages to identify structures

Tokenization
- Basic model
  - Considering white-space as delimiter
- Main issues that should be considered and normalized
  - Capitalization: apple vs. Apple
  - Apostrophes: O'Conner vs. owner's
  - Hyphens
  - Non-alpha characters
  - Word segmentation (e.g., in Chinese or German)
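The normalization issues above can be made concrete with a small tokenizer. This is only a sketch under stated assumptions: the function name `tokenize` and the regular expression are illustrative choices, not part of the lecture. It lowercases the text and keeps apostrophes and hyphens inside a word, and it does not attempt word segmentation for languages such as Chinese or German.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Apostrophes and hyphens inside a word are kept, so "O'Conner" and
    "nine-car" each stay a single token. Other punctuation is dropped.
    """
    # A token is a run of letters/digits, optionally joined by ' or -.
    pattern = r"[a-z0-9]+(?:['-][a-z0-9]+)*"
    return re.findall(pattern, text.lower())

# "Apple" and "apple" map to the same token after lowercasing.
print(tokenize("Apple sells apples; O'Conner is the owner's friend."))
```

Whether to keep "owner's" as one token or reduce it to "owner" is a design decision; this sketch keeps it whole and leaves further reduction to a later stemming step.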

Stopword filtering
- Removing the stop words from documents
- Stop words: the most common words in a language
  - and, or, the, in
  - Around 400 stop words for English
- Advantages
  - Effective: stop words are not considered important query terms to be searched
  - Efficient: reduces storage size and time complexity
- Disadvantages
  - Problems with queries in which stop words carry the meaning
    - "To be or not to be"
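A minimal sketch of stopword filtering follows. The stop list here is a tiny hand-picked set used only for illustration (a real English list would hold a few hundred words), and the name `remove_stopwords` is hypothetical.

```python
# A tiny illustrative stop list, not a complete one.
STOPWORDS = {"a", "and", "be", "in", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stop list, keeping their order."""
    return [t for t in tokens if t not in stopwords]

# Ordinary text keeps its content words.
print(remove_stopwords(["the", "fish", "in", "the", "tank"]))

# The query "to be or not to be" consists only of stop words,
# so nothing is left to search for.
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
```

The second call shows the disadvantage named above: the whole query disappears.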

Stemming
- Grouping words derived from a common stem
  - computer, computers, computing, compute
  - fish, fishing, fisherman
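To illustrate the idea, here is a deliberately crude suffix-stripping stemmer. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions chosen so that the first example group above collapses to one stem.

```python
# Suffixes are tried in this order; longer ones come before their own endings.
SUFFIXES = ("ing", "ers", "er", "s", "e")

def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ["computer", "computers", "computing", "compute", "fish", "fishing"]:
    print(w, "->", crude_stem(w))
```

A rule set this small leaves "fisherman" unchanged, so it would not be grouped with "fish"; handling such derivations needs a richer rule set or a dictionary-based approach.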

Index Storing
- Storing document-word information in an inverted format
- Converting document-term statistics to term-document for indexing
- Increasing the efficiency of the retrieval engine

Index Storing
Index terms with document IDs

  Term           Document IDs
  environments   1
  fish           1, 2, 3, 4
  fishkeepers    2
  found          1
  fresh          2
  freshwater     1, 4
  from           4

Index Storing
Index terms with document ID and frequency (document:frequency)

  Term           Postings
  environments   1:1
  fish           1:2  2:3  3:2  4:2
  fishkeepers    2:1
  found          1:1
  fresh          2:1
  freshwater     1:1  4:1
  from           4:1

Index Storing
Index terms with document ID and position (document,position)

  Term           Postings
  environments   1,8
  fish           1,2  1,4  2,7  2,18  2,23  3,2  3,6  4,3  4,13
  fishkeepers    2,1
  found          1,5
  fresh          2,13
  freshwater     1,14  4,2
  from           4,8
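The three index variants can all be derived from one positional index. The sketch below assumes whitespace tokenization and two short made-up documents; they are not the collection behind the tables above, and the name `build_positional_index` is hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; positions start at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Made-up example documents.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fishkeepers often keep tropical fish",
}
index = build_positional_index(docs)

# Variant 1: document IDs only.
print(sorted(index["fish"]))
# Variant 2: document ID with term frequency.
print({doc_id: len(pos) for doc_id, pos in index["fish"].items()})
# Variant 3: document ID with positions.
print(index["fish"])
```

Storing positions costs the most space but is the only variant that supports phrase and proximity queries; the other two can be computed from it.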

Querying Architecture
(figure taken from the slides of Dr. Saeedeh Momtazi)

Query Processing
- Applying techniques similar to those used in text processing for documents
  - Tokenization, stopword filtering, stemming
- Spell checking
  - Correcting the query if it contains a spelling error
- Query suggestion
  - Providing alternative words for the original query (based on query logs)
- Query expansion and relevance feedback
  - Modifying the original query with additional terms

Query Expansion
- Short search queries are under-specified
  - Source: http://www.tamingthebeast.net/blog/web-marketing/search-query-lengths.h
  - Google: 26.45% (1 word), 23.66% (2 words), 19.34% (3 words), 13.17% (4 words), 7.69% (5 words), 4.12% (6 words), 2.26% (7 words)
  - 54.5% of user queries are greater than 3 words
- Keyword queries are often poor descriptions of the actual information need
  - Query: car inventor
  - Relevant text: "Nicolas Joseph Cugnot built the first automobile"
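The "car inventor" example shows why added terms help: the relevant sentence contains neither query word. A minimal sketch of term-based expansion follows; the related-terms table is a hypothetical hand-made stand-in for whatever a real system would derive from query logs, a thesaurus, or relevance feedback.

```python
# Hypothetical table of related terms, for illustration only.
RELATED_TERMS = {
    "car": ["automobile"],
    "inventor": ["built", "invented"],
}

def expand_query(terms, related=RELATED_TERMS):
    """Append related terms to the query, keeping order and avoiding duplicates."""
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

print(expand_query(["car", "inventor"]))
```

With the expanded query, a document containing "built" and "automobile" can now match even though it never mentions "car" or "inventor".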

Retrieval Models
- Calculating the scores for documents using a ranking algorithm
  - Boolean model
  - Vector space model
  - Probabilistic model
  - Language model

Boolean Model
- Two possible outcomes for query processing
  - TRUE or FALSE
- All matching documents are considered equally relevant
- Query usually specified using Boolean operators
  - AND, OR, NOT

Boolean Model
- Search for news articles about President Lincoln
  - Query: lincoln
  - Also matches documents about cars, places, and people

Boolean Model
- Search for news articles about President Lincoln
  - Query: president AND lincoln
  - Also matches: "Ford Motor Company today announced that Darryl Hazel will succeed Brian Kelley as president of Lincoln Mercury"

Boolean Model
- Search for news articles about President Lincoln
  - Query: president AND lincoln AND NOT (automobile OR car)
  - Example text: "President Lincoln's body departs Washington in a nine-car funeral train."

Boolean Model
- Search for news articles about President Lincoln
  - Query: president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car)
  - Example text: "President's Day - Holiday activities - crafts, mazes, word searches, ... 'The Life of Washington' Read the entire book online! Abraham Lincoln Research Site ..."

Boolean Model
- Advantages
  - Results are predictable and relatively easy to explain
- Disadvantages
  - Relevant documents have no order
  - Complex queries are difficult to write
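Boolean queries map directly onto set operations over an inverted index: AND is intersection, OR is union, and NOT is the complement with respect to all documents. The sketch below uses three made-up one-line documents and hypothetical helper names (`build_index`, `postings`).

```python
def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

# Made-up documents, for illustration only.
docs = {
    1: "president lincoln biography",
    2: "president of lincoln mercury car division",
    3: "lincoln car dealer",
}
index = build_index(docs)
all_docs = set(docs)

def postings(term):
    """Return the documents containing the term (empty set if unseen)."""
    return index.get(term, set())

# president AND lincoln
q1 = postings("president") & postings("lincoln")
# president AND lincoln AND NOT (automobile OR car)
q2 = q1 & (all_docs - (postings("automobile") | postings("car")))

print(sorted(q1))
print(sorted(q2))
```

The result of each query is an unordered set, which illustrates the disadvantage above: every matching document is treated as equally relevant.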

Vector Space Model
- Very popular model, even today
- Documents and query represented by a vector of term weights
  - t is the number of index terms (i.e., very large)

$$D_i = (d_{i1}, d_{i2}, \ldots, d_{it}), \qquad Q = (q_1, q_2, \ldots, q_t)$$

- Collection represented by a matrix of term weights

$$
\begin{array}{c|cccc}
 & \text{Term}_1 & \text{Term}_2 & \cdots & \text{Term}_t \\
\hline
\text{Doc}_1 & d_{11} & d_{12} & \cdots & d_{1t} \\
\text{Doc}_2 & d_{21} & d_{22} & \cdots & d_{2t} \\
\vdots & \vdots & \vdots & & \vdots \\
\text{Doc}_n & d_{n1} & d_{n2} & \cdots & d_{nt}
\end{array}
$$

Vector Space Model
- Ranking each document by the distance between the points representing query and document
- Using cosine similarity
  - Cosine of the angle between document and query vectors
  - Normalized dot-product

$$\mathrm{Cosine}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2 \cdot \sum_{j=1}^{t} q_j^2}}$$

- Having no explicit definition of relevance as a retrieval model
  - Implicit: closer documents are more relevant
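The cosine formula translates almost line for line into code. This is a minimal sketch using plain Python lists as term-weight vectors; the document vectors and the query are made-up values, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math

def cosine(doc_vec, query_vec):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm = math.sqrt(sum(d * d for d in doc_vec) * sum(q * q for q in query_vec))
    return dot / norm if norm else 0.0

# Made-up weights over three index terms.
documents = {"D1": [2.0, 1.0, 0.0], "D2": [0.0, 1.0, 3.0]}
query = [1.0, 0.0, 0.0]

# Rank documents by decreasing similarity to the query.
ranking = sorted(documents, key=lambda name: cosine(documents[name], query), reverse=True)
print(ranking)
```

Because the score is normalized by vector length, a long document does not outrank a short one merely by repeating terms more often.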

Vector Space Model
- Term frequency weight (tf)
  - Measuring the importance of term k in document i
- Inverse document frequency (idf)
  - Measuring the importance of term k in the collection

$$\mathrm{idf}_k = \log \frac{N}{n_k}$$

  - N is the number of documents in the collection and n_k is the number of documents that contain term k
- Final weighting: multiplying tf and idf, called tf.idf
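A small sketch of tf.idf weighting under two assumptions that the slide leaves open: tf is taken as the raw count of the term in the document, and the logarithm is the natural logarithm. The documents are made-up token lists and the name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / n_k)."""
    n_docs = len(docs)
    # n_k: number of documents that contain term k.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw counts used as tf (an assumption)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Made-up, already tokenized documents.
docs = [["tropical", "fish", "fish"], ["fish", "tank"], ["tropical", "plants"]]
weights = tf_idf(docs)
print(weights[0])
print(weights[1])
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the ranking regardless of how often it appears.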

Vector Space Model
- Advantages
  - Simple computational framework for ranking
  - Any similarity measure or term weighting scheme can be used
- Disadvantages
  - Assumption of term independence

Results Output
- Constructing the display of ranked documents for a query
- Generating snippets to show how queries match documents
- Highlighting important words and passages
- Providing clustering (if required)

Evaluation Metrics
- Evaluation of unranked sets
  - Precision
  - Recall
  - F-measure
  - Accuracy
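The slide only names the set-based metrics. The sketch below uses their standard definitions, which are an assumption here since the slide gives no formulas: precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and the F-measure (in its balanced F1 form) is their harmonic mean. Accuracy is left out because it also needs the size of the whole collection.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Made-up document IDs: 4 retrieved, 3 relevant, 2 in common.
p, r, f1 = precision_recall_f1(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r, f1)
```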

Evaluation Metrics
- Evaluation of ranked sets
  - Precision-recall curve
  - Mean average precision
  - Precision at n
  - Mean reciprocal rank

Average Precision
- Average precision
  - Calculating the precision of the system after retrieving each relevant document
  - Averaging these precision values
- Mean average precision
  - Reporting the mean of the average precisions over all queries in the query set

$$\mathrm{AveragePrecision} = \frac{\sum_{k=1}^{K} P@k}{\text{number of relevant documents}}$$

(k is the rank of relevant documents in the retrieved list)

Average Precision
Example: relevant documents are retrieved at ranks 2, 3, and 4, giving precisions 1/2, 2/3, and 3/4.

$$AP = \frac{\frac{1}{2} + \frac{2}{3} + \frac{3}{4}}{3} = 0.64$$
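The worked example can be checked with a few lines of code. This sketch represents a ranked result list as booleans in rank order (True for a relevant document). It assumes all relevant documents appear in the list unless `num_relevant` is given; the function names are hypothetical.

```python
def average_precision(relevance, num_relevant=None):
    """Average of the precision values taken at the rank of each relevant hit."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    total = num_relevant if num_relevant is not None else hits
    return sum(precisions) / total if total else 0.0

def mean_average_precision(runs):
    """Mean of the average precisions over the result lists of all queries."""
    return sum(average_precision(run) for run in runs) / len(runs)

# Relevant documents at ranks 2, 3, and 4.
ap = average_precision([False, True, True, True])
print(round(ap, 2))
```

Passing `num_relevant` matters when some relevant documents were never retrieved: they then count in the denominator and lower the score.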

Precision at n
- Used when good results are wanted among the first n retrieved documents
- Calculating precision at n
  - Considering the top n documents only
  - Ignoring the rest of the documents

Precision at 5

$$P@5 = \frac{3}{5} = 0.6$$
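A sketch of precision at n, again with a ranked list given as booleans. The positions of the three relevant documents in the example are assumed, since the ranking behind the figure is not recoverable from the text; only the count (3 of the top 5) comes from the slide.

```python
def precision_at_n(relevance, n):
    """Fraction of the top n results that are relevant."""
    return sum(1 for is_relevant in relevance[:n] if is_relevant) / n

# Assumed ranking: 3 relevant documents among the first 5; rank 6 is ignored.
ranking = [True, False, True, False, True, True]
print(precision_at_n(ranking, 5))
```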

Reciprocal Rank
- Used when only one correct answer is needed (no matter how many relevant items are in the retrieved list, the user is satisfied as soon as the first correct item appears)
- Reciprocal Rank
  - Taking the inverse of the rank at which the first correct answer is returned
- Mean Reciprocal Rank
  - Reporting the mean of the reciprocal rank over all queries in the query set

$$\mathrm{ReciprocalRank} = \frac{1}{R}$$

(R is the position of the first correct item in the ranked list)

Reciprocal Rank

$$\mathrm{ReciprocalRank} = \frac{1}{2} = 0.5$$
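The same boolean representation of a ranked list gives a short sketch for reciprocal rank and its mean over queries. Returning 0.0 when no correct item is found is a common convention but an assumption here; the function names are hypothetical.

```python
def reciprocal_rank(relevance):
    """1 / R, where R is the rank of the first correct item (0.0 if none)."""
    for rank, is_correct in enumerate(relevance, start=1):
        if is_correct:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Mean of the reciprocal ranks over the result lists of all queries."""
    return sum(reciprocal_rank(run) for run in runs) / len(runs)

# First correct answer at rank 2.
print(reciprocal_rank([False, True, False]))
# Two queries: first correct answers at ranks 2 and 1.
print(mean_reciprocal_rank([[False, True], [True]]))
```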

Further Reading