INLS 509: Introduction to Information Retrieval

Similar documents
60-538: Information Retrieval

Information Retrieval

Automatic people tagging for expertise profiling in the enterprise

Information Retrieval CS6200. Jesse Anderton College of Computer and Information Science Northeastern University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Information Retrieval Spring Web retrieval

Search Engines Information Retrieval in Practice

Information Retrieval

Top 10 pre-paid SEO tools

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University

User Interfaces for Information Retrieval on the WWW

Introduction to Ad-hoc Retrieval

Information Retrieval May 15. Web retrieval

Boolean Model. Hongning Wang

Natural Language Processing

CS54701: Information Retrieval

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Information Retrieval: Retrieval Models

vector space retrieval many slides courtesy James Amherst

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16

Information Retrieval

Annotation and Evaluation

Instructor: Stefan Savev

Aggregated Search. Jaime Arguello School of Information and Library Science University of North Carolina at Chapel Hill

Introduction & Administrivia

Prior Art Retrieval Using Various Patent Document Fields Contents

Information Retrieval CSCI

What's an SEO Strategy Without Social Media?

Search Engine Optimization

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Search Engines. Information Retrieval in Practice

Information Retrieval (Part 1)

BUYER S GUIDE WEBSITE DEVELOPMENT

Towards open-domain QA. Question answering. TReC QA framework. TReC QA: evaluation

The answer (circa 2001)

Introduction to Information Retrieval. Hongning Wang

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Chapter 4. Processing Text

Unstructured Data. CS102 Winter 2019

RANK TRACKER User Guide

Multimedia Information Systems

Retrieval Evaluation. Hongning Wang

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword.

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

Myths about Links, Links and More Links:

Enhanced Performance of Search Engine with Multitype Feature Co-Selection of Db-scan Clustering Algorithm

The left menu is very flexible, allowing you to get to administrations screens with fewer clicks and faster load times.

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Search Engines. Information Retrieval in Practice

Hyperlink-Extended Pseudo Relevance Feedback for Improved. Microblog Retrieval

KDD 10 Tutorial: Recommender Problems for Web Applications. Deepak Agarwal and Bee-Chung Chen Yahoo! Research

CURZON PR BUYER S GUIDE WEBSITE DEVELOPMENT

Information Retrieval and Web Search

Promoting Website CS 4640 Programming Languages for Web Applications

COMP6237 Data Mining Searching and Ranking

The Internet. Tim Capes. November 7, 2011

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data

CSE 494: Information Retrieval, Mining and Integration on the Internet

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

How Co-Occurrence can Complement Semantics?

Information Retrieval

Heuristic Review of iinview An in-depth analysis! May 2014

A World Wide Web-based HCI-library Designed for Interaction Studies

RPMA - Roadway Project Mapping Application

Crossing the AI Chasm. What We Learned from building Apache PredictionIO (incubating)

I'm a new dad 30/06/10

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

SEO According to Google

Desktop Crawls. Document Feeds. Document Feeds. Information Retrieval

CS 572: Information Retrieval. Lecture 1: Course Overview and Introduction 11 January 2016

Overview of Web Mining Techniques and its Application towards Web

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

Form Identifying. Figure 1 A typical HTML form

CHAPTER 5 EXPERT LOCATOR USING CONCEPT LINKING

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Today s topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Home Page Scavenger Hunt World Book Kids For Desktops and Laptops

Bisnode View Why is it so damn hard to piece together information across the enterprise?

Analysis, Dekalb Roofing Company Web Site

Get Twitter Followers in an Easy Way Step by Step Guide

UNIVERSITY OF BOLTON WEB PUBLISHER GUIDE JUNE 2016 / VERSION 1.0

E-Shop: A Vertical Search Engine for Domain of Online Shopping

GROW YOUR BUSINESS WITH HELP FROM GOOGLE Ranking Factors

Full Website Audit. Conducted by Mathew McCorry. Digimush.co.uk

World Wide Web has specific challenges and opportunities

TRINET INTERNET SOLUTIONS, INC.

VK Multimedia Information Systems

Furl Furled Furling. Social on-line book marking for the masses. Jim Wenzloff Blog:

Part 11: Collaborative Filtering. Francesco Ricci

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

CS506/606 - Topics in Information Retrieval

Unwanted Message Filtering From Osn User Walls And Implementation Of Blacklist (Implementation Paper)

the magazine of the Marketing Research and Intelligence Association YEARS OF RESEARCH INTELLIGENCE A FUTURESPECTIVE

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation

Top-To-Bottom (And Beyond) On-Page Optimization Guidebook

Conclusions. Chapter Summary of our contributions

Detecting Spam Web Pages

Transcription:

INLS 509: Introduction to Information Retrieval Jaime Arguello jarguell@email.unc.edu January 11, 2016

Outline Introductions What is information retrieval (IR)? What is a search engine? Why is information retrieval difficult? How do search engines predict relevance? How good is a search engine? 2

Introductions Hello, my name is ___. However, I'd rather be called ___. (optional) I'm in the ___ program. I'm taking this course because I want to ___. 3

What is Information Retrieval? Information retrieval (IR) is the science and practice of developing and evaluating systems that match information seekers with the information they seek. 4

What is Information Retrieval? This course mainly focuses on search engines Given a query and a corpus, find relevant items query: a user's expression of their information need corpus: a repository of retrievable items relevance: satisfaction of the user's information need 5
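
To make the query/corpus/relevance framing concrete, here is a minimal sketch of the search task as a function: given a query string and a corpus of retrievable items, return the items predicted to be relevant, ranked by a crude score. The toy corpus, the function name, and the term-counting rule are all illustrative assumptions, not anything the slides prescribe.

```python
# Minimal sketch of the search task: given a query and a corpus,
# return items ranked by a (very crude) guess at relevance.
# Corpus contents and the scoring rule are hypothetical illustrations.

corpus = {
    "d1": "how to bathe a cat without getting scratched",
    "d2": "gerard salton pioneered modern information retrieval",
    "d3": "population growth statistics by country",
}

def search(query, corpus):
    query_terms = query.lower().split()
    scores = {}
    for doc_id, text in corpus.items():
        doc_terms = set(text.lower().split())
        # crude relevance guess: count how many query terms the item contains
        scores[doc_id] = sum(term in doc_terms for term in query_terms)
    # rank items with at least one matching term by descending score
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: scores[d], reverse=True)

print(search("bathing a cat", corpus))  # ['d1'] (it contains "a" and "cat")
```

Everything the rest of the course adds, better term weighting, more evidence, evaluation, is a refinement of the guess this sketch makes.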

What is Information Retrieval? Gerard Salton, 1968: Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information. 6

Information Retrieval structure 7

Information Retrieval document structure 8

Information Retrieval document structure However, the main content of the page is in the form of natural language text, which has little structure that a computer can understand 9

Information Retrieval document structure However, the main content of the page is in the form of natural language text, which has little structure that a computer can understand. As it turns out, it's not necessary for a computer to understand natural language text for it to determine that this document is likely to be relevant to a particular query (e.g., Gerard Salton) 10

Information Retrieval collection structure 11

Information Retrieval analysis: classification and information extraction 12

Information Retrieval organization: cataloguing http://www.dmoz.org 13

Information Retrieval organization: cataloguing http://www.dmoz.org 14

Information Retrieval analysis and organization: reading-level 15

Information Retrieval organization: recommendations http://www.yelp.com/biz/cosmic-cantina-chapel-hill (not actual page) 16

What is Information Retrieval? Gerard Salton, 1968: Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information. 17

Information Retrieval storage How might a web search engine view these pages differently in terms of storage? 18

Information Retrieval retrieval Efficiency: retrieving results in this lifetime (or, better yet, in 0.18 seconds) Effectiveness: retrieving results that satisfy the user's information need (more on this later) We will focus more on effectiveness However, we will also discuss in some detail how search engines retrieve results as fast as they do 19
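
The slides promise to explain later how search engines retrieve results so quickly; as a preview, here is a hedged sketch of the data structure behind that speed, an inverted index mapping each term to the documents that contain it, so a query only touches candidate documents rather than the whole corpus. The toy corpus below is an assumption for illustration.

```python
from collections import defaultdict

# Sketch of an inverted index: term -> set of document ids.
# Real engines store compressed posting lists with counts and positions.
corpus = {
    "d1": "information retrieval in practice",
    "d2": "retrieval of web information",
    "d3": "cats and dogs",
}

index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def candidates(query):
    # only documents containing at least one query term are considered,
    # which is what keeps retrieval fast as the corpus grows
    terms = query.lower().split()
    return set().union(*(index[t] for t in terms if t in index))

print(candidates("information retrieval"))  # {'d1', 'd2'}
```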

Information Retrieval retrieval effectiveness 20

Outline Introductions What is information retrieval (IR)? What is a search engine? Why is information retrieval difficult? How do search engines predict relevance? How good is a search engine? 21

Many Types of Search Engines 22

Many Types of Search Engines 23

The Search Task Given a query and a corpus, find relevant items query: user's expression of their information need corpus: a repository of retrievable items relevance: satisfaction of the user's information need 24

Search Engines web search query results corpus web pages 25

Search Engines digital library search query results corpus scientific publications 26

Search Engines news search query results corpus news articles 27

Search Engines local business search query results corpus curated/synthesized business listings 28

Search Engines desktop search query results corpus files on my laptop 29

Search Engines micro-blog search query results corpus tweets 30

Search Engines people/profile search query results corpus profiles 31

Information Retrieval Tasks and Applications digital library search web search enterprise search news search local business search image search video search (micro-)blog search community Q&A search desktop search question-answering federated search social search expert search product search patent search recommender systems opinion mining 32

The Search Task in this course Given a query and a corpus, find relevant items query: user's expression of their information need a textual description of what the user wants corpus: a repository of retrievable items a collection of textual documents relevance: satisfaction of the user's information need the document contains information the user wants 33

Outline Introductions What is information retrieval (IR)? What is a search engine? Why is information retrieval difficult? How do search engines predict relevance? How good is a search engine? 34

Why is IR Difficult? Information retrieval is an uncertain process users don't know what they want users don't know how to convey what they want computers can't elicit information like a librarian computers can't understand natural language text the search engine can only guess what is relevant the search engine can only guess if a user is satisfied over time, we can only guess how users adjust their short- and long-term behavior for the better 35

Queries and Relevance 36

Queries and Relevance soft surroundings trains interlocking dog sheets belly dancing music christian dior large bag best western airport sea tac www.bajawedding.com marie selby botanical gardens big chill down coats www.magichat.co.uk marie selby botanical gardens broadstone raquet club seadoo utopia seasons white plains condo priority club.com aircat tools epicurus evil instructions hinds county city of jackson last searches on aol a to z riverbank run (AOL query-log) 37

Queries and Relevance A query is an impoverished description of the user's information need Highly ambiguous to anyone other than the user 38

Queries and Relevance Query 435: curbing population growth (the input to the system) Description (what is in the user's head): What measures have been taken worldwide and what countries have been effective in curbing population growth? A relevant document must describe an actual case in which population measures have been taken and their results are known. Reduction measures must have been actively pursued. Passive events such as disease, which involuntarily reduce population, are not relevant. (from TREC 2005 HARD Track) 39

Queries and Relevance Query 435: curbing population growth Description: ??? (from TREC 2005 HARD Track) 40

Queries and Relevance Query 435: curbing population growth Can we imagine a relevant document without all these query terms? 41

Queries and Relevance Query 435: curbing population growth The same concept can be expressed in different ways 42

Queries and Relevance Query 435: curbing population growth Can we imagine a non-relevant document with all these query terms? 43

Queries and Relevance Query 435: curbing population growth The query concept can have different senses 44
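
These two question slides can be made concrete with the crude term-overlap scoring from the earlier sketch; the example documents below are invented for illustration and are not from the lecture.

```python
# Illustrative only: exact term overlap both misses a relevant document
# (the same concept expressed in different words) and rewards a
# non-relevant one (the same terms used in a different sense).
query = "curbing population growth"

relevant_doc = "china slowed its birth rate with the one child policy"
non_relevant_doc = "curbing the growth of the bacterial population in yogurt"

def overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

print(overlap(query, relevant_doc))      # 0: no shared terms, yet relevant
print(overlap(query, non_relevant_doc))  # 3: all query terms, yet not relevant
```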

Queries and Relevance This is why IR is difficult (and fascinating!) Croft, Metzler, & Strohman: Understanding how people compare text and designing computer algorithms to accurately perform this comparison is at the core of information retrieval. IR does not seek a deep understanding of the document text It uses statistical properties of the text to predict whether a document is relevant to a query (easier and oftentimes sufficient) 45
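
As a hedged illustration of "statistical properties of the text", here is a sketch of the kind of shallow scoring the course builds toward: weight each matching term by how often it appears in the document (term frequency) and how rare it is across the corpus (inverse document frequency). The specific TF-IDF weighting below is a standard textbook choice, not something this particular lecture specifies, and the documents are made up.

```python
import math
from collections import Counter

# Sketch of a TF-IDF style score: no understanding of the text, just
# counts of how often terms occur and how rare they are in the corpus.
docs = {
    "d1": "curbing population growth in developing countries",
    "d2": "population of the world continues to grow",
    "d3": "how to bathe a cat",
}

N = len(docs)
doc_terms = {d: Counter(text.lower().split()) for d, text in docs.items()}
df = Counter()                     # in how many documents each term appears
for terms in doc_terms.values():
    df.update(set(terms))

def score(query, doc_id):
    total = 0.0
    for term in query.lower().split():
        tf = doc_terms[doc_id][term]       # term frequency in this document
        if tf == 0:
            continue
        idf = math.log(N / df[term])       # rarer terms count for more
        total += tf * idf
    return total

ranking = sorted(docs, key=lambda d: score("curbing population growth", d),
                 reverse=True)
print(ranking)  # d1 first: it matches more (and rarer) query terms
```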

Outline Introductions What is information retrieval (IR)? What is a search engine? Why is information retrieval difficult? How do search engines predict relevance? How good is a search engine? 46

Predicting Relevance What types of evidence can we use to predict that a document is relevant to a query? query-document evidence: a property of the query-document pair (e.g., a measure of similarity) document evidence: a property of the document (same for all queries) 47

Query-Document Evidence Query: bathing a cat... 48

Query-Document Evidence Query: bathing a cat The important query terms occur frequently Both terms occur Terms occur close together Terms occur in the title Terms occur in the URL www.wikihow.com/bathe-your-cat Any other ideas?... 49

Query-Document Evidence Terms occur in hyperlinks pointing to the page Same language as query Other terms semantically related to query-terms (e.g., feline, wash)... 50

Query-Document Evidence Does not contain .com Not one of the most popular queries Does not contain the term news... 51

Query-Document Evidence We can also use previous user interactions, e.g.: The query is similar to other queries associated with clicks on this document The document is similar to other documents associated with clicks for this query... 52

Document Evidence Lots of in-links (endorsements) Non-spam properties: grammatical sentences, no profanity Has good formatting Any other ideas?... 53

Document Evidence Author attributes Peer-reviewed by many Reading-level appropriate for user community Has pictures Recently modified (fresh) Normal length From domain with other high-quality documents... 54
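
One simple way to picture how these signals come together is a weighted sum over features, some computed from the query-document pair and some from the document alone. The feature names and weights below are hypothetical; production engines combine many more signals, and the weights are usually learned rather than hand-set.

```python
# Sketch: combine query-document and document evidence into one
# predicted-relevance score. Features and weights are made up.

def predict_relevance(features, weights):
    return sum(weights[name] * value for name, value in features.items())

weights = {
    "query_terms_in_title": 2.0,   # query-document evidence
    "query_term_frequency": 1.0,   # query-document evidence
    "query_term_proximity": 1.5,   # query-document evidence
    "in_link_count":        0.5,   # document evidence (same for all queries)
    "spam_score":          -3.0,   # document evidence
}

page_features = {
    "query_terms_in_title": 1.0,
    "query_term_frequency": 0.8,
    "query_term_proximity": 0.6,
    "in_link_count":        0.4,
    "spam_score":           0.0,
}

print(predict_relevance(page_features, weights))  # 3.9
```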

Predicting Relevance IR does not require a deep understanding of information We can get by using shallow sources of evidence, which can be generated from the query-document pair or just the document itself. 55

Outline Introductions What is information retrieval (IR)? What is a search engine? Why is information retrieval difficult? How do search engines predict relevance? How good is a search engine? 56

The Search Task Output: a ranking of items in descending order of predicted relevance (simplifies the task) Assumption: the user scans the results from top to bottom and stops when he/she is satisfied or gives up 57

Evaluating a Ranking So, how good is a particular ranking? Suppose we know which documents are truly relevant to the query... 58

Evaluating a Ranking A B Which ranking is better? 59

Evaluating a Ranking A B In general, a ranking with all the relevant documents at the top is best (A is better than B) 60

Evaluating a Ranking A B Which ranking is better? 61

Evaluating a Ranking A B Oftentimes the (relative) quality of a ranking is unclear and depends on the task 62

Evaluating a Ranking A B Web search:?????? 63

Evaluating a Ranking A B Web search: A is better than B Many documents (redundantly) satisfy the user; the user doesn't want all of them, so the higher the first relevant document, the better 64

Evaluating a Ranking A B Patent search:?????? 65

Evaluating a Ranking A B Patent search: B is better than A User wants to see everything in the corpus that is related to the query (high cost in missing something) 66

Evaluating a Ranking A B Exploratory search:?????? 67

Evaluating a Ranking A B Exploratory search: A is better than B Satisfying the information need requires information found in different documents 68

Evaluating a Ranking evaluation metrics Given a ranking with known relevant/non-relevant documents, an evaluation metric outputs a quality score Many, many metrics Different metrics make different assumptions Choosing the right one requires understanding the task Often, we use several (sanity check) 69
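
To make "evaluation metric" concrete, here is a sketch of two common ones, precision at k and average precision, applied to the kind of rankings compared on the previous slides. The judged rankings below are invented; 1 marks a relevant document and 0 a non-relevant one.

```python
# Sketch of two ranking metrics over relevance judgments (1 = relevant,
# 0 = non-relevant). Which metric is "right" depends on the search task.

def precision_at_k(rels, k):
    # fraction of the top k results that are relevant
    return sum(rels[:k]) / k

def average_precision(rels, total_relevant):
    # average the precision at each rank holding a relevant document;
    # rewards rankings that put relevant documents near the top
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / total_relevant if total_relevant else 0.0

ranking_a = [1, 1, 0, 0, 0]   # both relevant documents at the top
ranking_b = [0, 1, 0, 1, 0]   # relevant documents spread further down

print(precision_at_k(ranking_a, 3), precision_at_k(ranking_b, 3))        # 2/3 vs 1/3
print(average_precision(ranking_a, 2), average_precision(ranking_b, 2))  # 1.0 vs 0.5
```

A web-search metric would favor ranking_a strongly; a recall-oriented task like patent search would care more about whether every relevant document is found at all, which is why the choice of metric has to follow the task.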

Summary The goal of information retrieval is to match information seekers with the information they seek. IR involves analysis, organization, storage, and retrieval There are many types of search engines There is uncertainty at every step of the search process Simple heuristics don't work, so IR systems make predictions about relevance! IR systems use superficial evidence to make predictions Users expect different things, depending on the task Evaluation requires understanding the user community. My goal is to convince you that IR is a fascinating science 70