Information Retrieval

Similar documents
Information Retrieval

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Search Engine Architecture II

Chapter 6: Information Retrieval and Web Search. An introduction

Query Refinement and Search Result Presentation

Information Retrieval CSCI

Information Retrieval CS6200. Jesse Anderton College of Computer and Information Science Northeastern University

CSE 5243 INTRO. TO DATA MINING

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Chapter 2. Architecture of a Search Engine

Introduction to Information Retrieval. (COSC 488) Spring Nazli Goharian. Course Outline

TEXT CHAPTER 5. W. Bruce Croft BACKGROUND

Information Retrieval Spring Web retrieval

Natural Language Processing

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Information Retrieval

CS506/606 - Topics in Information Retrieval

Information Retrieval

Outline. Structures for subject browsing. Subject browsing. Research issues. Renardus

Exam IST 441 Spring 2014

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

Query Phrase Expansion using Wikipedia for Patent Class Search

Search Engine Architecture. Hongning Wang

CS54701: Information Retrieval

Search Engines. Information Retrieval in Practice

Information Retrieval May 15. Web retrieval

60-538: Information Retrieval

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Welcome to the class of Web Information Retrieval!

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Opportunities from Open Source Search

Searching. Outline. Copyright 2006 Haim Levkowitz. Copyright 2006 Haim Levkowitz

Information Retrieval

The Security Role for Content Analysis

INLS 509: Introduction to Information Retrieval

Introduction to Information Retrieval. Hongning Wang

INFS 427: AUTOMATED INFORMATION RETRIEVAL (1 st Semester, 2018/2019)

Part I: Data Mining Foundations

Natural Language Processing as Key Component to Successful Information Products

Information Retrieval and Organisation

Information Retrieval. (M&S Ch 15)

Chapter 6. Queries and Interfaces

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

CS 6320 Natural Language Processing

SEO and UAEX.EDU GETTING YOUR WEB PAGES FOUND IN GOOGLE

Learning to find transliteration on the Web

Exam IST 441 Spring 2013

TREC OpenSearch Planning Session

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

21. Search Models and UIs for IR

SE Workshop PLAN. What is a Search Engine? Components of a SE. Crawler-Based Search Engines. How Search Engines (SEs) Work?

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

State of the Art and Trends in Search Engine Technology. Gerhard Weikum

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Announcements. Project #2 has been posted Research Paper assignment will be posted soon We will begin Javascript on Friday

Open Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria

Information Retrieval: Retrieval Models

Module 1: Internet Basics for Web Development (II)

Searching the Deep Web

Chapter 27 Introduction to Information Retrieval and Web Search

Why is Search Engine Optimisation (SEO) important?

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Introducing XAIRA. Lou Burnard Tony Dodd. An XML aware tool for corpus indexing and searching. Research Technology Services, OUCS

Oleksandr Kuzomin, Bohdan Tkachenko

Qvidian Proposal Automation provides several options to help you accelerate the population of your content database.

: Semantic Web (2013 Fall)

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Northeastern University in TREC 2009 Million Query Track

The World Bank Enterprise Search Program. Luisita Guanlao The World Bank Group May 10, 2005

Exam IST 441 Spring 2011

INTRODUCTION. Chapter GENERAL

Overview of Web Mining Techniques and its Application towards Web

Databases and Information Retrieval Integration TIETS42. Kostas Stefanidis Autumn 2016

TIPS FOR USING GOOGLE

Information Retrieval

Domain Specific Search Engine for Students

Automatic Search Engine Evaluation with Click-through Data Analysis. Yiqun Liu State Key Lab of Intelligent Tech. & Sys Jun.

Searching the Web What is this Page Known for? Luis De Alba

DATA MINING II - 1DL460. Spring 2014"

Information Retrieval and Web Search

Evaluation. David Kauchak cs160 Fall 2009 adapted from:

extreme searching: how to avoid extreme frustration and bird walks presented by Kathy Schrock Overview The Problems

Estimating Embedding Vectors for Queries

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Search Engines. Information Retrieval in Practice

MULTIMEDIA DATABASES OVERVIEW

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

SEARCH ENGINE INSIDE OUT

GETCA ( G.A.S.E.T ) Project Business Plan

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Information Retrieval and Knowledge Organisation

Searching the Deep Web

Transcription:

Introduction

Information Retrieval Information retrieval is a field concerned with the structure, analysis, organization, storage, searching and retrieval of information Gerard Salton, 1968 J. Pei: Information Retrieval and Web Search -- Introduction 2

IR/Search Applications Usual search: typing in a query, receiving answers in the form of a list of documents in ranked order Vertical search: the domain of the search is restricted to a particular topic Enterprise search: finding the required information in the huge variety of computer files scattered across a corporate intranet Desktop search: the personal version of enterprise search The information sources are the files stored on an individual computer, including email messages and Web pages that have recently been browsed Peer-to-peer search: finding information in networks of nodes or computers without any centralized control J. Pei: Information Retrieval and Web Search -- Introduction 3

IR/Search Tasks Ad hoc search Filtering/tracking: detecting stories of interest based on a user s interests and providing an alert using email or some other mechanism Classification/categorization: using a defined set of labels or classes and automatically assigning those labels to documents Yahoo Directory http://dir.yahoo.com Question answering: aiming at more specific questions and returning specific answers instead of a list of documents Example: Which Canadian university was the first to offer academic courses on a year-round trimester system, and since when? answer: SFU, since 1965 J. Pei: Information Retrieval and Web Search -- Introduction 4

Three Dimensions of IR Content Applications Tasks Text Web search Ad-hoc search Images Vertical search Filtering Video Enterprise search Classification Scanned documents Desktop search Audio Peer-to-peer search Music Literature search Question answering J. Pei: Information Retrieval and Web Search -- Introduction 5

Relevance Is a text document relevant in the sense of containing the information that the user was looking for? Exact matching? Often poor results Vocabulary mismatch problem Topical relevance: a document is topically relevant to a query if it is on the same topic A news story about a tornado in Kansas is topically relevant to the query severe weather events User relevance: a document is user relevant if the user is interested in If a user has read the story, or if the story is too old or in a language that the user does not understand, the user may not be interested in J. Pei: Information Retrieval and Web Search -- Introduction 6

Retrieval Models A formal representation of the process of matching a query and a document Typically, retrieval models capture the statistical properties of text rather than the linguistic structure More sophisticated models incorporate linguistic features, but tend to be of secondary importance An important difference from NLP J. Pei: Information Retrieval and Web Search -- Introduction 7

Simple Models vs. Complex Models Who shot Abraham Lincoln? NLP methods use sophisticated linguistic processing techniques, such as syntactic and semantic analysis A surprisingly simple, data-driven, redundancybased method Simply searching for the phrase shot Abraham Lincoln on the web Look for what frequently appears to its left Simple models are as good as sophisticated ones when large amount of data is used J. Pei: Information Retrieval and Web Search -- Introduction 8

Evaluation How well does a document ranking match users expectations? Precision: the proportion of retrieved documents that are relevant Recall: the proportion of relevant documents that are retrieved Assumption: all the relevant documents for a given query are known may not hold all the time Using test collections (corpora): samples of typical queries and a list of relevant documents for each query The best-known test collections: TREC (http://trec.nist.gov) Clickthrough data: records of documents that were clicked on during a search session J. Pei: Information Retrieval and Web Search -- Introduction 9

Information Needs What are the courses at SFU talking about document indexes? Issue a query course, SFU, document indexes to a search engine Do the query results answer the question? Information need: the topic about which the user desires to know more Unfortunately, often cannot be fed into a search engine J. Pei: Information Retrieval and Web Search -- Introduction 10

Queries Query: what the user conveys to the computer in an attempt to communicate the information need Multiple queries may be formed to capture an information need A query may not capture the information need sufficiently One-word queries are very common in web search Query refinement by query suggestion, expansion and relevance feedback J. Pei: Information Retrieval and Web Search -- Introduction 11

Search Engines Practical application of IR techniques to large scale (text) databases Web search engines, e.g., Google, Bing, Crawl many terabytes of data, provide sub-second response times to millions of queries everyday around the world Enterprise search engines, e.g., Verity, Autonomy, Process the large variety of information sources in a company, and use company-specific knowledge (e.g., data mining) as part of search and related tasks Desktop search engines Rapidly incorporate new documents, web pages, and email, and provide an intuitive interface for search this heterogeneous mix of information One real system can be configured as different search engines We will build a vertical search engine in our course project J. Pei: Information Retrieval and Web Search -- Introduction 12

Open Source Search Engines Many systems exist Lucene: a popular Java-based search engine Has been used for a wide range of commercial applications The IR techniques there are relatively simple Lemur: an open source toolkit that includes the Indri C++-based search engine Has primarily been used by IR researchers to compare advanced search techniques J. Pei: Information Retrieval and Web Search -- Introduction 13

Big Issues in Search Engines Performance and scalability Response time: the delay between submitting a query and receiving the result list Throughput: the number of queries that can be processed in a given time Indexing speed: the rate at which text documents can be transformed into indexes for searching Coverage: how much of the existing information has been indexed and stored in the search engine database Recency and freshness: the age of the information in the search engine database Customizability and adaptability: how much a search engine can be used in different applications? J. Pei: Information Retrieval and Web Search -- Introduction 14

Core IR and Search Engine Issues J. Pei: Information Retrieval and Web Search -- Introduction 15

To-Do List Read Chapter 1 in the textbook How do people compare Google and Bing? Find out from the web by search Student background survey J. Pei: Information Retrieval and Web Search -- Introduction 16