Information Retrieval

Similar documents
Information Retrieval

Natural Language Processing

Chapter 2. Architecture of a Search Engine

Information Retrieval

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 1)"

Chapter 6: Information Retrieval and Web Search. An introduction

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Information Retrieval. (M&S Ch 15)

CSE 5243 INTRO. TO DATA MINING

CS54701: Information Retrieval

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

CS 6320 Natural Language Processing

Multimedia Information Systems

Introduction to Information Retrieval

Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

Search Engine Architecture II

Instructor: Stefan Savev

Basic techniques. Text processing; term weighting; vector space model; inverted index; Web Search

Information Retrieval

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Information Retrieval. Information Retrieval and Web Search

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Information Retrieval and Web Search

Tag-based Social Interest Discovery

Introduction to Information Retrieval

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

Chapter 27 Introduction to Information Retrieval and Web Search

Information Retrieval. Techniques for Relevance Feedback

Exam IST 441 Spring 2014

Exam IST 441 Spring 2011

Information Retrieval: Retrieval Models

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Search Engines Chapter 2 Architecture Felix Naumann

Information Retrieval

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

Knowledge Discovery and Data Mining 1 (VO) ( )

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Exam IST 441 Spring 2013

Information Retrieval. hussein suleman uct cs

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Vannevar Bush. Information Retrieval. Prophetic: Hypertext. Historic Vision 2/8/17

Search Engines. Information Retrieval in Practice

Search Engines Exercise 5: Querying. Dustin Lange & Saeedeh Momtazi 9 June 2011

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Representation of Documents and Infomation Retrieval

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

A Security Model for Multi-User File System Search. in Multi-User Environments

68A8 Multimedia DataBases Information Retrieval - Exercises

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

Information Retrieval and Web Search

Chapter 4. Processing Text

Mining Web Data. Lijun Zhang

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

Web Search Engine Question Answering

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Web Information Retrieval using WordNet

Text Analytics (Text Mining)

CS47300: Web Information Search and Management

Information Retrieval

How Does a Search Engine Work? Part 1

Information Retrieval

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

CS506/606 - Topics in Information Retrieval

Problem 1: Complexity of Update Rules for Logistic Regression

Query Refinement and Search Result Presentation

vector space retrieval many slides courtesy James Amherst

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

Search Engine Architecture. Search Engine Architecture

Retrieval Evaluation. Hongning Wang

Prior Art Retrieval Using Various Patent Document Fields Contents

Boolean Model. Hongning Wang

International Journal of Advance Foundation and Research in Science & Engineering (IJAFRSE) Volume 1, Issue 2, July 2014.

Models for Document & Query Representation. Ziawasch Abedjan

Static Pruning of Terms In Inverted Files

Contextual Information Retrieval Using Ontology-Based User Profiles

Architecture and Implementation of Database Systems (Summer 2018)

Digital Libraries: Language Technologies

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Part I: Data Mining Foundations

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm

Building Search Applications

Combining Information Retrieval and Relevance Feedback for Concept Location

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

CS/INFO 1305 Summer 2009

CS/INFO 1305 Information Retrieval

modern database systems lecture 4 : information retrieval

Text Analytics (Text Mining)

Unstructured Data. CS102 Winter 2019

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

CHAPTER 31 WEB SEARCH TECHNOLOGIES FOR TEXT DOCUMENTS

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

THE WEB SEARCH ENGINE

Session 10: Information Retrieval

Data Modelling and Multimedia Databases M

Transcription:

Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi)

Outline Introduction Indexing Block 2 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Outline Introduction Indexing Block 3 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Information Retrieval The most popular usages of computer (and Internet) Information Retrieval (IR) 4 Search, Communication The field of computer science that is mostly involved with R&D for search

Information Retrieval 5 Primary focus of IR is on text and documents Mostly textual content A bit structured data Papers: title, author, date, publisher Email: subject, sender, receiver, date

IR Dimensions Web search 6 Search engines

IR Dimensions Vertical search (Restricted domain/topic) 7 Books, movies, suppliers

IR Dimensions Enterprise search, Corporate intranet 8 Emails, web pages, documentations, codes, wikis, tags, directories, presentations, spreadsheets (http://www.n-axis.in/practices-moss.php)

IR Dimensions Desktop search 9 Personal enterprise search (http://www.heise.de/download/copernic-desktop-search-120141912.html)

IR Dimensions P2P search (no centralized control) 10 File sharing, shared locality Forum search

IR Architecture 11 Indexing Text Acquisition Text Transformation Index Creation Querying User Interaction Ranking

Outline Introduction Indexing Block 12 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Indexing Architecture Crawling documents... Text aquisition Document storage Processing documents... Text transformation Storing index terms... Index creation 13 Inverted index

Outline Introduction Indexing Block 14 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Web Crawler 15 Identifying and acquiring documents for the search engine Following links to find documents Finding huge numbers of web pages efficiently (coverage) Keeping the crawled data up-to-date (freshness)

Googlebot (http://static.googleusercontent.com/media/www.google.com/en//insidesearch/howsearchworks/assets/searchinfographic.pdf) 16

Outline Introduction Indexing Block 17 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Text Processing Document parsing (not syntatic parsing!) Tokenizing Stopword filtering Stemming (https://developer.apple.com/library/mac/documentation/userexperience/conceptual/searchkitconcepts/searchkit_basics/searchkit_basics.html) 18

Parsing Recognizing structural elements Titles Links Headings... Using the syntax of markup languages to identify structures (http://htmlparser.sourceforge.net/) 19

Tokenization Basic model Considering white-space as delimiter Main issues that should be considered and normalized Capitalization Apostrophes 20 apple vs. Apple O Conner vs. owner s Hyphens Non-alpha characters Word segmentation (e.g., in Chinese or German)

Stopwords filtering Removing the stop words from documents Stopwords: the most common words in a language and, or, the, in Around 400 stop words for English (https://developer.apple.com/library/mac/documentation/userexperience/conceptual/searchkitconcepts/searchkit_basics/searchkit_basics.html) 21

Stopwords filtering Advantages Effective Efficient Reduce storage size and time complexity Disadvantages Problem with queries with higher impact of stop words 22 Do not consider as important query terms to be searched To be or not to be

Stemming Grouping words derived from a common stem computer, computers, computing, compute fish, fishing, fisherman (http://amysorr.cmswiki.wikispaces.net/stem%20contract%20&%20word%20lists) 23

Outline Introduction Indexing Block 24 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Index Storing Storing document-words information in an inverse format 25 Converting document-term statistics to term-document for indexing Increasing the efficiency of the retrieval engine

Index Storing (https://developer.apple.com/library/mac/documentation/userexperience/conceptual/searchkitconcepts/searchkit_basics/searchkit_basics.html) 26

Index Storing Index-term with document ID environments fish fishkeepers found fresh freshwater from 27 1 1 2 3 4 2 1 2 1 4 4

Index Storing Index-term with document ID and frequency environments fish fishkeepers found fresh freshwater from 28 1:1 1:2 2:3 2:1 1:1 2:1 1:1 4:1 4:1 3:2 4:2

Index Storing Index-term with document ID and position environments 1,8 1,2 1,4 2,7 2,18 2,23 3,2 3,6 4,3 4,13 fish fishkeepers 2,1 found 1,5 fresh 2,13 freshwater 1,14 4,2 from 4,8 29

Outline Introduction Indexing Block 30 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Querying Architecture Document storage User interaction Processing input query Presenting results... 31 Ranking (retrieval) Retrieving documents... Inverted index

Outline Introduction Indexing Block 32 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Query Processing Applying similar techniques used in text processing for documents Tokenization, stopwords filtering, stemming the best book for natural language processing best book natur languag process 33 (http://text-processing.com/demo/stem/)

Query Processing Spell checking Correcting the query in case of spelling errors the best bok for natural language procesing the best book for natural language processing 34

Query Processing Query suggestion Providing alternatives words to the original query (based on query logs) the best book for natural language processing the best book for NLP 35

Query Processing Query expansion and relevance feedback Modifying the original query with additional terms the best book for natural language processing the best [book volume] for [natural language processing NLP] 36

Query Expansion Short search queries are under-specified 26.45% (1 word) 23.66% (2 words) 19.34 (3 words) 13.17% (4 words) 7.69% (5 words) 4.12% (6 words) 2.26% (7 words) Google: 54.5% of user queries are greater than 3 words (http://www.tamingthebeast.net/blog/web-marketing/search-query-lengths.htm) (http://www.smallbusinesssem.com/google-query-length/3273/) 37

Query Expansion 38 car inventor (https://en.wikipedia.org/?title=car)

Google search (http://static.googleusercontent.com/media/www.google.com/en//insidesearch/howsearchworks/assets/searchinfographic.pdf) 39

Outline Introduction Indexing Block 40 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Retrieval Models 41 Calculating the scores for documents using a ranking algorithm Boolean model Vector space model Probabilistic model Language model

Boolean Model Two possible outcomes for query processing TRUE or FALSE All matching documents are considered equally relevant Query usually specified using Boolean operators 42 AND, OR, NOT

Boolean Model Search for news articles about President Lincoln 43 lincoln cars places people

Boolean Model Search for news articles about President Lincoln president AND lincoln 44 Ford Motor Company today announced that Darryl Hazel will succeed Brian Kelley as president of Lincoln Mercury

Boolean Model Search for news articles about President Lincoln president AND lincoln AND NOT (automobile OR car) 45 President Lincoln s body departs Washington in a ninecar funeral train.

Boolean Model Search for news articles about President Lincoln president AND lincoln AND (biography OR life OR birthplace OR gettysburg) AND NOT (automobile OR car) 46 President s Day - Holiday activities - crafts, mazes, mazes word searches,... The Life of Washington Read the entire searches The Washington book online! Abraham Lincoln Research Site...

Boolean Model Advantages 47 Results are predictable and relatively easy to explain Disadvantages Relevant documents have no order Complex queries are difficult to write

Vector Space Model Very popular model, even today Documents and query represented by a vector of term weights 48 t is number of index terms (i.e., very large) Di = (di1, di2,..., dit) Q = (q1, q2,..., qt)

Vector Space Model 49 Document collection represented by a matrix of term weights Term1 Term2... Termt Doc1 d11 d12... d1t Doc2 d21 d22... d2t... Docn... dn1... dn2...... dnt

Vector Space Model Ranking each document by distance between query and document Using cosine similarity (normalized dot product) Cosine of angle between document and query vectors t d ij q j Cosine ( D i, Q)= j=1 d 2ij q 2j j 50 j Having no explicit definition of relevance as a retrieval model Implicit: Closer documents are more relevant.

Vector Space Model Term frequency weight (tf) Measuring the importance of term k in document i Inverse document frequency (idf) Measuring importance of the term k in collection idf k =log 51 N nk Final weighting: multiplying tf and idf, called tf.idf

Vector Space Model Advantages Simple computational framework for ranking Any similarity measure or term weighting scheme can be used Disadvantages 52 Assumption of term independence

Google ranking (http://static.googleusercontent.com/media/www.google.com/en//insidesearch/howsearchworks/assets/searchinfographic.pdf) 53

Outline Introduction Indexing Block 54 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Results 55 Constructing the display of ranked documents for a query Generating snippets to show how queries match documents Highlighting important words and passages Providing clustering (if required)

Relevance feedback Good results even after only one iteration (http://dme.rwth-aachen.de/en/research/projects/relevance-feedback) 56

Relevance feedback Push the vector Towards the relevant documents Away from the irrelevant documents R S β γ qi +1 = qi + r j s R j =1 S k =1 k R = number of relevant documents S = number of irrelevant document β+γ=1; both defined experimentally 57 β=0.75; γ=0.25 (Salton and Buckley 1990)

Relevance feedback Evaluation only on the residual collection 58 Documents not shown to the user

Google result (http://static.googleusercontent.com/media/www.google.com/en//insidesearch/howsearchworks/assets/searchinfographic.pdf) 59

Outline Introduction Indexing Block 60 Document Crawling Text Processing Index Storing Querying Block Query Processing Retrieval Models Result Representation Evaluation

Evaluation Metrics 61 Evaluation of unranked sets Precision Recall F-measure Accuracy Two categories: relevant or not relevant to the query

Evaluation Metrics Total of T documents in response to a query R relevant documents N irrelevant documents Total U relevant documents in the collection Precision= 62 R T Recall= R U

Homonymy, Polysemy, Synonymy Homonymy and polysemy reduce the precision 63 Documents of the wrong sense will be judged as irrelevant

Homonymy, Polysemy, Synonymy Synonymy and hyponymy reduce the recall 64 Miss relevant documents that do not contain the keyword

Evaluation Metrics 65 Evaluation of ranked sets Precision-recall curve Mean average precision Precision at n Mean reciprocal rank

Evaluation Metrics Precision-recall curve 66 Based on the rank of the results (http://nlp.stanford.edu/ir-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html)

Average Precision Average precision Calculating precision of the system after retrieving each relevant document Averaging these precision values K P@ k AveragePrecision = k =1 number relevant documents (k is the rank of relevant documents in the retrieved list) 67

Average Precision 1/2 2/3 3/4 68 1 2 3 + + 2 3 4 AP= =0.64 3

Mean average Precision Mean average precision 69 Reporting the mean of the average precisions over all queries in the query set

Precision at n 70 Evaluate results for the first n items of the retrieved documents Calculating precision at n Considering the top n documents only Ignoring the rest of documents

Precision at 5 3 P @ 5= =0.6 5 71

Reciprocal Rank Evaluate only one correct answer Reciprocal Rank Inversing the score of the rank at which the first correct answer is returned ReciprocalRank = 72 1 R (R is the position of the first correct item in the ranked list)

Reciprocal Rank 1 ReciprocalRank = =0.5 2 73

Reciprocal Rank Mean Reciprocal Rank 74 Reporting the mean of the reciprocal rank over all queries in the query set

Further Reading Speech and Language Processing 75 Chapter 23.1

Further Reading 76