CS47300: Web Information Search and Management

Similar documents
CS54701: Information Retrieval

Ad-hoc IR: Introduction. Ad-hoc IR: Introduction. Basic Concepts of IR: Outline. Content Based Filtering Filtering. Ad-hoc IR: Terminologies

CS34800 Information Systems

CS54701: Information Retrieval

CS47300: Web Information Search and Management

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Information Retrieval: Retrieval Models

CS54701: Information Retrieval

CS47300: Web Information Search and Management

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. (M&S Ch 15)

CS 6320 Natural Language Processing

Federated Text Search

Information Retrieval

Information Retrieval

Chapter 27 Introduction to Information Retrieval and Web Search

CS490W: Web Information Search & Management. CS-490W Web Information Search and Management. Luo Si. Department of Computer Science Purdue University

CS-490WIR Web Information Retrieval and Management. Luo Si

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Search Engine Architecture. Hongning Wang

Exam IST 441 Spring 2014

Introduction to Information Retrieval

Chapter 4. Processing Text

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

Information Retrieval CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

INLS W: Information Retrieval Systems Design and Implementation. Fall 2009.

Information Retrieval II

Information Retrieval

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Exam IST 441 Spring 2013

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

Overview. Lecture 3: Index Representation and Tolerant Retrieval. Type/token distinction. IR System components

Information Retrieval

vector space retrieval many slides courtesy James Amherst

Assignment No. 1. Abdurrahman Yasar. June 10, QUESTION 1

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

CS/INFO 1305 Summer 2009

Prior Art Retrieval Using Various Patent Document Fields Contents

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Boolean Model. Hongning Wang

Representation/Indexing (fig 1.2) IR models - overview (fig 2.1) IR models - vector space. Weighting TF*IDF. U s e r. T a s k s

A Deep Relevance Matching Model for Ad-hoc Retrieval

modern database systems lecture 4 : information retrieval

dr.ir. D. Hiemstra dr. P.E. van der Vet

Jan Pedersen 22 July 2010

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

CS506/606 - Topics in Information Retrieval

60-538: Information Retrieval

CS 572: Information Retrieval. Lecture 1: Course Overview and Introduction 11 January 2016

CSA4020. Multimedia Systems:

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Relational Approach. Problem Definition

CS47300 Web Information Search and Management

Information Retrieval and Web Search

Text Analytics. Index-Structures for Information Retrieval. Ulf Leser

Multimedia Information Systems

Information Retrieval and Organisation

Real-time Query Expansion in Relevance Models

Introduction to Information Retrieval. Lecture Outline

Presented by Kit Na Goh

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

itunes U Guidelines Creating your itunes U site

Slice Intelligence!

Introduction to Information Retrieval. Hongning Wang

Boolean retrieval & basics of indexing CE-324: Modern Information Retrieval Sharif University of Technology

In the previous lecture we went over the process of building a search. We identified the major concepts of a topic. We used Boolean to define the

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

CS/INFO 1305 Information Retrieval

Retrieval Evaluation. Hongning Wang

CS646 (Fall 2016) Homework 1

INFORMATION COMUNICATION TECHNOLOGY SKS 1362

CS47300: Web Information Search and Management

Questioning Yahoo! Answers

Information Retrieval. Information Retrieval and Web Search

CS-E4420 Information Retrieval

Exam IST 441 Spring 2011

Retrieval of Highly Related Documents Containing Gene-Disease Association

Promoting Website CS 4640 Programming Languages for Web Applications

Toward a Knowledge-Based Solution for Information Discovery in Complex and Dynamic Domains

A short introduction to the development and evaluation of Indexing systems

Ontology Research Group Overview

Text Analytics (Text Mining)

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

The Evolution of Search:

Retrieval Evaluation

Chapter 6: Information Retrieval and Web Search. An introduction

This lecture: IIR Sections Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes Vector space scoring

Text Analytics (Text Mining)

CS 533 Information Retrieval Systems

CSCI 5417 Information Retrieval Systems. Jim Martin!

The IR Black Box. Anomalous State of Knowledge. The Information Retrieval Cycle. Different Types of Interactions. Upcoming Topics.

Semantic Annotation, Search and Analysis

Juggling the Jigsaw Towards Automated Problem Inference from Network Trouble Tickets

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Transcription:

CS47300: Web Information Search and Management Prof. Chris Clifton 22 August 2018 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 43 Administrivia Piazza is active https://piazza.com/purdue/fall2018/cs473/home Blackboard is now available Don t expect to find much there Office Hours for Prof. Clifton: TBD, but I ll be available tomorrow 12:30-1:30, LWSN 2142F TA office hours (and locations) TBD The department is still working on office assignments 44 Jan-17 20 Christopher W. Clifton 1

Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and concepts Overview of retrieval models Text representation Indexing Text preprocessing Evaluation Evaluation methodology Evaluation metrics Ad-hoc IR: Terminologies Terminologies: Query Representative data of user s information need: text (default) and other media Document Data candidate to satisfy user s information need: text (default) and other media Database Collection Corpus A set of documents Corpora A set of databases Valuable corpora from TREC (Text Retrieval Evaluation Conference) Jan-17 20 Christopher W. Clifton 2

Ad-hoc Information Retrieval: Ad-hoc IR: Introduction Search a collection of documents to find relevant documents that satisfy different information needs (i.e., queries) Example: Web search 47 Ad-hoc IR: Introduction Ad-hoc Information Retrieval: Search a collection of documents to find relevant documents that satisfy different information needs (i.e., queries) Relatively Stable Changes Queries are created and used dynamically; change fast Ad-hoc : formed or used for specific or immediate problems or needs Merriam-Webster s collegiate Dictionary Ad-hoc IR vs. Filtering Filtering: Queries are stable (e.g., Asian High-Tech) while the collection changes (e.g., news) More for filtering in later lectures Jan-17 20 Christopher W. Clifton 3

Content Based Filtering Information Needs are Stable System should make a delivery decision on the fly when a document arrives User Profile: Asian High-Tech Filtering System AD-hoc IR: Basic Process Information Need Representation Representation Query Retrieval Model Indexed Objects Retrieved Objects Evaluation/Feedback Jan-17 20 Christopher W. Clifton 4

AD-hoc IR: Overview of Retrieval Model Retrieval Models Boolean Vector space Basic vector space Extended Boolean Probabilistic models Statistical language models Galago) Two Poisson model Bayesian inference networks Citation/Link analysis models Page rank Hub & authorities SMART, LUCENE Lemur Project (Indri, Okapi Inquery Google Clever AD-hoc IR: Overview of Retrieval Model Purpose of the Retrieval Model Determine whether a document is relevant to query Relevance is difficult to define Varies by judgers Varies by context (i.e., jointly by a set of documents and queries) Different retrieval methods estimate relevance differently Word occurrence of document and query In probabilistic framework, P(query document) or P(Relevant query,document) Estimate semantic consistency between query and document 52 Jan-17 20 Christopher W. Clifton 5

Types of Retrieval Models Exact Match (Document Selection) Example: Boolean Retrieval Method Query defines the exact retrieval criterion Relevance is a binary variable; a document is either relevant (i.e., match query) or irrelevant (i.e., mismatch) Result is a set of documents Documents are unordered Often in reverse-chronological order (e.g., Pubmed) Exact Match Return Ignore Types of Retrieval Models Best Match (Document Ranking) Example: Most probabilistic models Query describes the desired retrieval criterion Degree of relevance is a continuous/integral variable; each document matches query to some degree Result in a ranked list ( top ones match better) Often return a partial list (e.g., rank threshold) Doc1 0.99 + Best Match Return Doc2 0.90 + Doc3 0.85 + Doc4 0.82 - Doc5 0.81 + Rank Doc6 0.79 -. Jan-17 20 Christopher W. Clifton 6

Types of Retrieval Models Exact Match (Selection) vs. Best Match (Ranking) Best Match is usually more accurate/effective Do not need precise query; representative query generates good results Users have control to explore the rank list: view more if need every piece; view less if need one or two most relevant Exact Match Hard to define the precise query; too strict (terms are too specific) or too coarse (terms are too general) Users have no control over the returned results Still prevalent in some markets (e.g., legal retrieval) AD-hoc IR: Basic Process Information Need Representation Representation Query Retrieval Model Indexed Objects Retrieved Objects Evaluation/Feedback Jan-17 20 Christopher W. Clifton 7

Text Representation: What you see It never leaves my side, April 6, 2002 Reviewer:"dage456" (Carmichael, CA USA) - See all my reviewsit fits in the palm of your hand and is the size of a deflated wallet (wonder where the money went). I have had my ipod now for 4 months and cannot imagine how I used to get by with my old rio 600 with is 64 megs of ram and.. usb connection. Because of its size this little machine goes with my everywhere and its ten hour battery life means I can listen to stuff all day long. Pros: size, both physical and capacity. design: It looks beautiful controls: simple and very easy to use connection: FIREWIRE!! Cons: needs the ability to bookmark. I use my ipod mostly for audiobooks. the ipod needs to include a bookmark feature for those like me. From Amazon Customer Review of IPod Text Representation: What computer sees <table><tr><td valign="top"> Reviewer:</td> <td><a href="http://www.amazon.com/exec/obidos/tg/cm/member-glance/- /AJF9GJKJ8UGNX/1/ref=cm_cr_auth/002-1193904-0468830?%5Fencoding=UTF8"><span style =" font-weight: bold;">"dage456"</span></a> (Carmichael, CA USA) - <a href="http://www.amazon.com/gp/cdp/memberreviews/ajf9gjkj8ugnx/ref=cm_cr_auth/002-1193904-0468830?ie=utf8 > See all my reviews</a></td></tr></table>it fits in the palm of your hand and is the size of a deflated wallet (wonder where the money went). <p>i have had my ipod now for 4 months and cannot imagine how I used to get by with my old rio 600 with is 64 megs of ram and.. usb connection. Because of its size this little machine goes with my everywhere and its ten hour battery life means I can listen to stuff all day long.<p>pros: size, both physical and capacity.<br>design: It looks beautiful<br>controls: simple and very easy to use<p>connection: FIREWIRE!!<p>Cons: needs the ability to bookmark. I use my ipod mostly for audiobooks. the ipod needs to include a bookmark feature for those like me.<br /><br /> From Amazon Customer Review of IPod Jan-17 20 Christopher W. Clifton 8

Text Representation: TREC Format <DOC> <DOCNO> AP900101-0001 </DOCNO> <FILEID>AP-NR-01-01-90 2345EDT</FILEID> <FIRST>r i PM-Iran-Population Bjt 01-01 0777</FIRST> <SECOND>PM-Iran-Population, Bjt,0800</SECOND> <HEAD>Iran Moves To Curb A Baby Boom That Threatens Its Economic Future</HEAD> <HEAD>An AP Extra</HEAD> <BYLINE>By ED BLANCHE</BYLINE> <BYLINE>Associated Press Writer</BYLINE> <DATELINE>NICOSIA, Cyprus (AP) </DATELINE> <TEXT> Iran's government is intensifying a birth control program _ despite opposition from radicals _ because the country's fast-growing population is imposing strains on a struggling economy. </TEXT> </DOC> Text Representation: Indexing Indexing Associate document/query with a set of keys Manual or human Indexing Indexers assign keywords or key concepts (e.g., libraries, Medline, Yahoo!); often small vocabulary Significant human efforts, may not be thorough Automatic Indexing Index program assigns words, phrases or other features; often large vocabulary No human effort low cost Jan-17 20 Christopher W. Clifton 9

Text Representation: Indexing Controlled Vocabulary vs. Full Text Controlled Vocabulary Indexing Assign words from a small vocabulary or a node from an ontology Often manually but can be done by learning algorithms Full Indexing: Often index with an uncontrolled vocabulary of full text Automatically while good algorithm can generate more representative keywords/ key concepts Text Representation: Indexing Controlled Vocabulary Mutation of a mutl homolog in hereditary colon cancer. Papadopoulos N, Nicolaides NC, Wei YF, Ruben SM, Carter KC, Rosen CA, Haseltine WA, Fleischmann RD, Fraser CM, Adams MD, et al. Johns Hopkins Oncology Center, Baltimore, MD 21231. Some cases of hereditary nonpolyposis colorectal cancer (HNPCC) are due to alterations in a muts-related mismatch repair gene. A search of a large database of expressed sequence tags derived from random complementary DNA clones revealed three additional human mismatch repair genes, all related to the bacterial mutl gene. One of these genes (hmlh1) resides on chromosome 3p21, within 1 centimorgan of markers previously linked to cancer susceptibility in HNPCC kindreds. Mutations of hmlh1 that would disrupt the gene product were identified in such kindreds, demonstrating that this gene is responsible for the disease. These results suggest that defects in any of several mismatch repair genes can cause HNPCC. Jan-17 20 Christopher W. Clifton 10

Text Representation: Indexing Controlled Vocabulary Text Representation: Indexing Controlled Vocabulary Jan-17 20 Christopher W. Clifton 11

Text Representation: Indexing Controlled Vocabulary Pros and cons of controlled vocabulary indexing Advantages Many available vocabularies/ontologies (e.g., MeSH, Open Directory, UMLS) Normalization of indexing terms: less vocabulary mismatch, more consistent semantics Easy to use by RDBMS (e.g., semantic Web) Support concept based retrieval and browsing Disadvantages Substantial efforts to be assigned manually Inconvenient for users not familiar with the controlled vocabulary Coarse representation of semantic meaning Jan-17 20 Christopher W. Clifton 12