CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Similar documents
CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Information Retrieval

Information Retrieval CS6200. Jesse Anderton College of Computer and Information Science Northeastern University

Informa(on Retrieval. Informa(on Retrieval

Information Retrieval

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Search Engines. Informa1on Retrieval in Prac1ce. Annota1ons by Michael L. Nelson

Natural Language Processing

CSE 5243 INTRO. TO DATA MINING

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Modeling Rich Interac1ons in Session Search Georgetown University at TREC 2014 Session Track

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Information Retrieval and Web Search

About the Course. Reading List. Assignments and Examina5on

Developing an Analy.cs Dashboard for Coursera MOOC Discussion Forums CNI Fall 2014 Membership Mee.ng

Welcome! COMS 4118 Opera3ng Systems I Spring 2018

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

Introduc3on to Data Management

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

CS 241 Data Organization using C

Informa(on Retrieval. Administra*ve. Sta*s*cal MT Overview. Problems for Sta*s*cal MT

CS 3030 Scripting Languages Syllabus

Machine Learning Crash Course: Part I

CMSC Introduction to Database Systems

Course work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes?

represen/ng the world in 1s and 0s CS 4100/5100 Founda/ons of AI

Introduction to Information Retrieval. Hongning Wang

Prepared for COMPANY X

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Chapter 2. Architecture of a Search Engine

CS60092: Informa0on Retrieval

CS506/606 - Topics in Information Retrieval

Information Retrieval

60-538: Information Retrieval

Informa(on Retrieval

CS 3030 Scripting Languages Syllabus

Informa(on Retrieval

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Introduction & Administrivia

Jason Polakis, Marco Lancini, Georgios Kontaxis, Federico Maggi, So5ris Ioannidis, Angelos Keromy5s, Stefano Zanero.

CENG505 Advanced Computer Graphics Lecture 1 - Introduction. Instructor: M. Abdullah Bülbül

CS 241 Data Organization. August 21, 2018

Introduc)on to. CS60092: Informa0on Retrieval

ECSE 425 Lecture 1: Course Introduc5on Bre9 H. Meyer

INLS 509: Introduction to Information Retrieval

Open Data Kit. A set of open source tools to help organiza3ons collect, aggregate and visualize their rich data.

CSC 261/461 Database Systems. Fall 2017 MW 12:30 pm 1:45 pm CSB 601

Information Retrieval

RESTful Design for Internet of Things Systems

Evaluating and Improving Software Usability

CSC 5991 Cyber Security Prac1ce

Search Engine Architecture. Hongning Wang

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

CS60092: Informa0on Retrieval

Welcome to CS 241 Systems Programming at Illinois

Informa(on Retrieval

This lecture. Measures for a search engine EVALUATING SEARCH ENGINES. Measuring user happiness. Measures for a search engine

CS 4604: Introduc0on to Database Management Systems

Cisco Exam Dumps PDF for Guaranteed Success

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Josh Bloch Charlie Garrod Darya Melicher

CSE 344 JANUARY 3 RD - INTRODUCTION

CLOUD SERVICES. Cloud Value Assessment.

OLA Super Conference : Introducing New Media to Hansard: Building a FoundaCon for Knowledge Management

: Semantic Web (2013 Fall)

Part 1: Search Engine Op2miza2on for Libraries (Public, Academic, School) and Special Collec2ons

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

CSE373: Data Structures and Algorithms Lecture 1: Introduc<on; ADTs; Stacks/Queues

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

VK Multimedia Information Systems

BOSTON UNIVERSITY Metropolitan College MET CS342 Data Structures with Java Dr. V.Shtern (Fall 2011) Course Syllabus

Information Retrieval

Welcome to CS 241 Systems Programming at Illinois

HBase... And Lewis Carroll! Twi:er,

If you click the links in this document or on the class website and get a logon screen:

Design and Debug: Essen.al Concepts CS 16: Solving Problems with Computers I Lecture #8

Exam IST 441 Spring 2014

Introduction to Computer Systems

Introduction to Web Design & Computer Principles

Informa(on Retrieval

Enterprise Architecture CS 4720 Web & Mobile Systems

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Introduction to Data Structures

CIS 3308 Web Application Programming Syllabus

San José State University Department of Computer Science CS-174, Server-side Web Programming, Section 2, Spring 2018

What makes an applica/on a good applica/on? How is so'ware experienced by end- users? Chris7an Campo EclipseCon 2012

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #1: Introduc/on

Computer Science Department

CS 572: Information Retrieval. Lecture 1: Course Overview and Introduction 11 January 2016

You must pass the final exam to pass the course.

CS 200, Section 1, Programming I, Fall 2017 College of Arts & Sciences Syllabus

CSCI 4250/6250 Fall 2013 Computer and Network Security. Instructor: Prof. Roberto Perdisci

Information Retrieval Spring Web retrieval

CS503 Advanced Programming I CS305 Computer Algorithms I

Lecture 27: Learning from relational data

CMPS 182: Introduction to Database Management Systems. Instructor: David Martin TA: Avi Kaushik. Syllabus

Lecture 1: Course Overview

Introduction to Data Management CSE 344. Lecture 1: Introduction

COSC 115: Introduction to Web Authoring Fall 2013

Search Evaluation. Tao Yang CS293S Slides partially based on text book [CMS] [MRS]

Transcription:

CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University

Course Goals To help you to understand search engines, evaluate and compare them, and modify them for specific applica.ons Provide broad coverage of the important issues in informa.on retrieval and search engines Books (Kindle edi.ons OK): Search Engines: Informa2on Retrieval in Prac2ce CroM, Metzler, and Strohman Introduc2on to Informa2on Retrieval Manning, Raghavan, and Schütze Not required, but readings will be recommended

Topics Overview Architecture of a search engine Data acquisi.on Text representa.on Informa.on extrac.on Indexing Query processing Ranking Evalua.on Classifica.on and clustering Social search More

Course Evalua.on Six assignments (50%) All are worth equal points; lowest- scoring assignment dropped Some require wri\en answers and some short programming exercises. Others involve more extensive programming and a short wri\en report on your design choices and experimental results. You may use any programming language installed on college Linux machines, so we can (compile and) run your code. You may save work by using reasonable libraries that don t simply implement the goal of the assignment E.g., HTTP request libraries are OK; web- crawling libraries are not. Two exams Midterm (20%) in class Final (30%)

Late Policy Assignments are due at the beginning of class on the announced due date. You may take a single homework extension of four calendar days, to be used at your discre.on. AMer the first late assignment, unexcused late assignments will be penalized 10% per calendar day late. We normally will not accept assignments amer the date on which the following assignment is due or amer the solu.ons have been handed out, whichever comes first. If you are in circumstances that would cause you to turn in an assignment late, please contact me in advance to ask for an extension for cause.

Academic Honesty All work submi\ed for credit must be your own. You may discuss the assignments with your classmates, the TA, and the instructor. You must acknowledge the people with whom you discussed your work, and you must write up your own solu.ons. Any wri\en sources used (apart from the text) must also be acknowledged; however, you may not consult any solu.ons from previous years assignments whether they are student or faculty generated.

Me Contact Office hours: Thursdays 3 5, Room 356 dasmith@ccs.neu.edu TAs TBA Course website and Piazza list h\p://www.ccs.neu.edu/course/cs6200f15/

Informa.on Retrieval Informa2on retrieval is a field concerned with the structure, analysis, organiza2on, storage, searching, and retrieval of informa2on. (Salton, 1968) General defini.on that can be applied to many types of informa.on and search applica.ons Primary focus of IR since the 50s has been on text and documents

What is a Document? Examples: web pages, email, books, news stories, scholarly papers, text messages, tweets, MS Office, PDF, Facebook pages, blogs, forum pos.ngs, IM sessions, etc. Common proper.es text content some structure (e.g.,.tle, author, date for papers; subject, sender, des.na.on for email)

Documents vs. Database Records Database records (or tuples in rela.onal databases) are typically made up of well- defined fields (or aeributes) e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc. Easy to compare fields with well- defined seman.cs to queries in order to find matches Text is more difficult

Documents vs. Records Example bank database query Find records with balance > $50,000 in branches located in Somerville, MA. Matches easily found by comparison with field values of records Example search engine query bank scandals in western mass This text must be compared to the text of en.re news stories

Comparing Text Comparing the query text to the document text and determining what is a good match is the core issue of informa.on retrieval Exact matching of words is not enough Many different ways to write the same thing in a natural language like English e.g., does a news story containing the text bank director in Amherst steals funds match the query? Some stories will be be\er matches than others

Dimensions of IR IR is more than just text, and more than just web search although these are central People doing IR work with different media, different types of search applica.ons, and different tasks

Dimensions of IR Content Applica-ons Tasks Text Web search Ad hoc search Images Ver.cal search Filtering Video Enterprise search Classifica.on Scanned docs Desktop search Ques.on answering Audio Forum search Summariza.on Music P2P search Literature search

Big Issues in IR Relevance What is it? Simple (and simplis.c) defini.on: A relevant document contains the informa.on that a person was looking for when they submi\ed a query to the search engine Many factors influence users decision about what is relevant: e.g., task, context, novelty, style, other documents they ve already read Topical relevance (same topic) vs. user relevance (everything else)

Big Issues in IR Relevance Retrieval models define a view of relevance Ranking algorithms used in search engines are based on retrieval models Most models based on sta.s.cal proper.es of text rather than deep linguis.c analysis i.e., coun.ng simple text features such as words instead of parsing and analyzing the sentences

Big Issues in IR Evalua.on Experimental procedures and measures for comparing system output with user expecta.ons IR evalua.on methods now used in many fields e.g. Neslix compe..on Typically use test collec2on of documents, queries, and relevance judgments Most commonly used are TREC collec.ons Recall and precision are two examples of effec.veness measures

Big Issues in IR Users and Informa.on Needs Search evalua.on is user- centered Keyword queries are omen poor descrip.ons of actual informa.on needs Interac.on and context are important for understanding user intent Query refinement techniques such as query expansion, query sugges2on, relevance feedback improve ranking

IR and Search Engines A search engine is the prac.cal applica.on of informa.on retrieval techniques to large scale text collec.ons Web search engines are best- known examples, but many others Open source search engines are important for research and development e.g., Lucene, Lemur/Indri, Galago Big issues include main IR issues but also some others

IR and Search Engines Informa.on Retrieval Search Engines Relevance - Effec2ve ranking Evalua.on - Tes2ng and measuring Informa.on needs - User interac2on Performance - Efficient search and indexing Incorpora.ng new data - Coverage and freshness Scalability - Growing with data and users Adaptability - Tuning for applica2ons Specific problems - e.g. Spam

Search Engine Issues Performance Measuring and improving the efficiency of search e.g., reducing response 2me, increasing query throughput, increasing indexing speed Indexes are data structures designed to improve search efficiency designing and implemen.ng them are major issues for search engines

Search Engine Issues Dynamic data The collec.on for most real applica.ons is constantly changing in terms of updates, addi.ons, dele.ons e.g., web pages Acquiring or crawling the documents is a major task Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed) Upda.ng the indexes while processing queries is also a design issue

Search Engine Issues Scalability Making everything work with millions of users every day, and many terabytes of documents Distributed processing is essen.al Adaptability Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applica.ons

Spam Search Engine Issues For web search, spam in all its forms is one of the major issues Affects the efficiency of search engines and, more seriously, the effec.veness of the results Prolifera.on of spam varie.es e.g. spamdexing or term spam, link spam, search engine op.miza.on New subfield called adversarial IR, since spammers are adversaries with different goals

Topics Overview Architecture of a search engine For background, read chapters 1 and 2 of Search Engines by CroM, Metzler, and Strohman