CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Similar documents
CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Information Retrieval

Information Retrieval CS6200. Jesse Anderton College of Computer and Information Science Northeastern University

Informa(on Retrieval. Informa(on Retrieval

Information Retrieval

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Search Engines. Informa1on Retrieval in Prac1ce. Annota1ons by Michael L. Nelson

CSE 5243 INTRO. TO DATA MINING

Natural Language Processing

About the Course. Reading List. Assignments and Examina5on

Modeling Rich Interac1ons in Session Search Georgetown University at TREC 2014 Session Track

Developing an Analy.cs Dashboard for Coursera MOOC Discussion Forums CNI Fall 2014 Membership Mee.ng

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Information Retrieval and Web Search

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Introduc3on to Data Management

Welcome! COMS 4118 Opera3ng Systems I Spring 2018

Behrang Mohit : txt proc! Review. Bag of word view. Document Named

Informa(on Retrieval. Administra*ve. Sta*s*cal MT Overview. Problems for Sta*s*cal MT

Machine Learning Crash Course: Part I

Introduction to Information Retrieval. Hongning Wang

CSCI 5417 Information Retrieval Systems! What is Information Retrieval?

CSC 5991 Cyber Security Prac1ce

Exam IST 441 Spring 2014

CS506/606 - Topics in Information Retrieval

Informa(on Retrieval

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Jason Polakis, Marco Lancini, Georgios Kontaxis, Federico Maggi, So5ris Ioannidis, Angelos Keromy5s, Stefano Zanero.

CS60092: Informa0on Retrieval

represen/ng the world in 1s and 0s CS 4100/5100 Founda/ons of AI

ECSE 425 Lecture 1: Course Introduc5on Bre9 H. Meyer

Informa(on Retrieval

Course work. Today. Last lecture index construc)on. Why compression (in general)? Why compression for inverted indexes?

Exam IST 441 Spring 2013

Chapter 2. Architecture of a Search Engine

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

OLA Super Conference : Introducing New Media to Hansard: Building a FoundaCon for Knowledge Management

Exam IST 441 Spring 2011

Introduc)on to. CS60092: Informa0on Retrieval

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Open Data Kit. A set of open source tools to help organiza3ons collect, aggregate and visualize their rich data.

Evaluating and Improving Software Usability

CS6200 Information Retrieval. Jesse Anderton College of Computer and Information Science Northeastern University

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

CS60092: Informa0on Retrieval

RESTful Design for Internet of Things Systems

Information Retrieval

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #1: Introduc/on

CS 4604: Introduc0on to Database Management Systems

CSE373: Data Structures and Algorithms Lecture 1: Introduc<on; ADTs; Stacks/Queues

CLOUD SERVICES. Cloud Value Assessment.

MINUTES Date 5/7/2015. Monthly Faculty Meeting Department of Electrical Engineering

Introduction & Administrivia

Part 1: Search Engine Op2miza2on for Libraries (Public, Academic, School) and Special Collec2ons

CS490W: Web Information Search & Management. CS-490W Web Information Search and Management. Luo Si. Department of Computer Science Purdue University

Lecture 1: Course Overview

Prepared for COMPANY X

CENG505 Advanced Computer Graphics Lecture 1 - Introduction. Instructor: M. Abdullah Bülbül

INFS 427: AUTOMATED INFORMATION RETRIEVAL (1 st Semester, 2018/2019)

Search Engine Architecture. Hongning Wang

Information Retrieval and Organisation

Microsoft FAST Search Server 2010 for SharePoint for Application Developers Course 10806A; 3 Days, Instructor-led

HBase... And Lewis Carroll! Twi:er,

INLS 509: Introduction to Information Retrieval

CMSC Introduction to Database Systems

CS 3030 Scripting Languages Syllabus

CS-490WIR Web Information Retrieval and Management. Luo Si

VK Multimedia Information Systems

Josh Bloch Charlie Garrod Darya Melicher

Informa(on Retrieval

60-538: Information Retrieval

What makes an applica/on a good applica/on? How is so'ware experienced by end- users? Chris7an Campo EclipseCon 2012

Information Retrieval

Cisco Exam Dumps PDF for Guaranteed Success

VIS II: Design Communication Graphic Design Basics, Photoshop and InDesign Spring 2018

Ges$one Avanzata dell Informazione Part A Full- Text Informa$on Management. Full- Text Indexing

Cyber Security and Power System Communica4ons Essen4al Parts of a Smart Grid Infrastructure. Talal El Awar

CSE 344 JANUARY 3 RD - INTRODUCTION

Information Retrieval

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Best Prac*ces in Accessibility and Universal Design for Learning. Rozy Parlette, Instruc*onal Designer Center for Instruc*on and Research Technology

CS 200, Section 1, Programming I, Fall 2017 College of Arts & Sciences Syllabus

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa(on Retrieval

This lecture. Measures for a search engine EVALUATING SEARCH ENGINES. Measuring user happiness. Measures for a search engine

F.P. Brooks, No Silver Bullet: Essence and Accidents of Software Engineering CIS 422

You must pass the final exam to pass the course.

Embedding System Dynamics in Agent Based Models for Complex Adap;ve Systems

Introduc)on to Database Systems CSE 444. Lecture #1 March 29, 2010

CS6200 Informa.on Retrieval. David Smith College of Computer and Informa.on Science Northeastern University

CS 572: Information Retrieval. Lecture 1: Course Overview and Introduction 11 January 2016

GDPR ESSENTIALS END-USER COMPLIANCE TRAINING. Copyright 2018 Logical Operations, Inc. All rights reserved.

Informa(on Retrieval

Intelligent Systems Knowledge Representa6on

Informa(on Retrieval

Informa(on Extrac(on and Named En(ty Recogni(on. Introducing the tasks: Ge3ng simple structured informa8on out of text

Lecture 27: Learning from relational data

Instruc)onal Staff. Instructors Ron Cytron & Roger Chamberlain. Head TA Andrew Buckley. TAs there are many! See the web page

Volume Visualiza0on. Today s Class. Grades & Homework feedback on Homework Submission Server

Transcription:

CS6200 Informa.on Retrieval David Smith College of Computer and Informa.on Science Northeastern University

Course Goals To help you to understand search engines, evaluate and compare them, and modify them for specific applica.ons Provide broad coverage of the important issues in informa.on retrieval and search engines Books (Kindle edi.ons OK): Search Engines: Informa2on Retrieval in Prac2ce CroM, Metzler, and Strohman Introduc2on to Informa2on Retrieval Manning, Raghavan, and Schütze Not required, but readings will be recommended

Topics Overview Architecture of a search engine Data acquisi.on Text representa.on Informa.on extrac.on Indexing Query processing Ranking Evalua.on Classifica.on and clustering Social search More

Course Evalua.on Homework Three assignments 40% of grade Wri\en and some short programming exercises Projects Three projects 60% of grade More extensive programming and a short wri\en report on your design choices Any programming language installed on college Linux machines, reasonable library usage Undergrads are not required to do the final project (except for fun and credit)

Late Policy Assignments are due at the beginning of class on the announced due date. You will be granted one homework extension of four calendar days, to be used at your discre.on, no ques.ons asked. This policy does not apply to projects. AMer the first late assignment, unexcused late assignments will be penalized 20% per calendar day late. We normally will not accept assignments amer the date on which the following assignment is due or amer the solu.ons have been handed out, whichever comes first. If you will have a valid reason for turning in an assignment late, please contact me in advance to obtain full credit.

Academic Honesty All work submi\ed for credit must be your own. You may discuss the homework problems or projects with your classmates, the TA, and the instructor. You must acknowledge the people with whom you discussed your work, and you must write up your own solu.ons. Any wri\en sources used (apart from the text) must also be acknowledged; however, you may not consult any solu.ons from previous years assignments whether they are student or faculty generated.

Contact Me Office hours: Th 3 5, Room 356 dasmith@ccs.neu.edu TAs Jesse Anderton: WVH 472, Wednesdays 1 3 jesse@ccs.neu.edu Ting Chen: WVH 472, Fridays 3 5.ngchen@ccs.neu.edu Course website and Piazza list h\p://www.ccs.neu.edu/course/cs6200f13/

Informa.on Retrieval Informa2on retrieval is a field concerned with the structure, analysis, organiza2on, storage, searching, and retrieval of informa2on. (Salton, 1968) General defini.on that can be applied to many types of informa.on and search applica.ons Primary focus of IR since the 50s has been on text and documents

What is a Document? Examples: web pages, email, books, news stories, scholarly papers, text messages, tweets, MS Office, PDF, Facebook pages, blogs, forum pos.ngs, IM sessions, etc. Common proper.es text content some structure (e.g.,.tle, author, date for papers; subject, sender, des.na.on for email)

Documents vs. Database Records Database records (or tuples in rela.onal databases) are typically made up of well- defined fields (or aeributes) e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc. Easy to compare fields with well- defined seman.cs to queries in order to find matches Text is more difficult

Documents vs. Records Example bank database query Find records with balance > $50,000 in branches located in Amherst, MA. Matches easily found by comparison with field values of records Example search engine query bank scandals in western mass This text must be compared to the text of en.re news stories

Comparing Text Comparing the query text to the document text and determining what is a good match is the core issue of informa.on retrieval Exact matching of words is not enough Many different ways to write the same thing in a natural language like English e.g., does a news story containing the text bank director in Amherst steals funds match the query? Some stories will be be\er matches than others

Dimensions of IR IR is more than just text, and more than just web search although these are central People doing IR work with different media, different types of search applica.ons, and different tasks

Dimensions of IR Content Applica-ons Tasks Text Web search Ad hoc search Images Ver.cal search Filtering Video Enterprise search Classifica.on Scanned docs Desktop search Ques.on answering Audio Forum search Music P2P search Literature search

Big Issues in IR Relevance What is it? Simple (and simplis.c) defini.on: A relevant document contains the informa.on that a person was looking for when they submi\ed a query to the search engine Many factors influence a person s decision about what is relevant: e.g., task, context, novelty, style Topical relevance (same topic) vs. user relevance (everything else)

Big Issues in IR Relevance Retrieval models define a view of relevance Ranking algorithms used in search engines are based on retrieval models Most models based on sta.s.cal proper.es of text rather than deep linguis.c analysis i.e., coun.ng simple text features such as words instead of parsing and analyzing the sentences

Big Issues in IR Evalua.on Experimental procedures and measures for comparing system output with user expecta.ons IR evalua.on methods now used in many fields e.g. Nevlix compe..on Typically use test collec2on of documents, queries, and relevance judgments Most commonly used are TREC collec.ons Recall and precision are two examples of effec.veness measures

Big Issues in IR Users and Informa.on Needs Search evalua.on is user- centered Keyword queries are omen poor descrip.ons of actual informa.on needs Interac.on and context are important for understanding user intent Query refinement techniques such as query expansion, query sugges2on, relevance feedback improve ranking

IR and Search Engines A search engine is the prac.cal applica.on of informa.on retrieval techniques to large scale text collec.ons Web search engines are best- known examples, but many others Open source search engines are important for research and development e.g., Lucene, Lemur/Indri, Galago Big issues include main IR issues but also some others

IR and Search Engines Informa.on Retrieval Search Engines Relevance - Effec2ve ranking Evalua.on - Tes2ng and measuring Informa.on needs - User interac2on Performance - Efficient search and indexing Incorpora.ng new data - Coverage and freshness Scalability - Growing with data and users Adaptability - Tuning for applica2ons Specific problems - e.g. Spam

Search Engine Issues Performance Measuring and improving the efficiency of search e.g., reducing response 2me, increasing query throughput, increasing indexing speed Indexes are data structures designed to improve search efficiency designing and implemen.ng them are major issues for search engines

Search Engine Issues Dynamic data The collec.on for most real applica.ons is constantly changing in terms of updates, addi.ons, dele.ons e.g., web pages Acquiring or crawling the documents is a major task Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed) Upda.ng the indexes while processing queries is also a design issue

Search Engine Issues Scalability Making everything work with millions of users every day, and many terabytes of documents Distributed processing is essen.al Adaptability Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applica.ons

Spam Search Engine Issues For web search, spam in all its forms is one of the major issues Affects the efficiency of search engines and, more seriously, the effec.veness of the results Prolifera.on of spam varie.es e.g. spamdexing or term spam, link spam, op.miza.on New subfield called adversarial IR, since spammers are adversaries with different goals

Topics Overview Architecture of a search engine For background, read chapters 1 and 2 of Search Engines by CroM, Metzler, and Strohman