Introduction & Administrivia


Information Retrieval
Evangelos Kanoulas, ekanoulas@uva.nl

Section 1: Unstructured data

Big Data
- Growth of global data volume: data everywhere!
- Web data: observation, interaction, transaction
- Smartphones, personal devices, traces in the real world
- Sensors, Internet of Things
- Scientific and technical challenges: how to make sense of data?
- Data centers, virtualization, storage (non-RDBMS), MapReduce, indexing & search, large-scale machine learning

The Rise of Unstructured Data
- Business: 80% of business is conducted on unstructured data
- Consumer

Media & Sources
What types of unstructured information exist?
- Text: Web pages, books, articles, papers, reports, letters, blogs, ...
- Conversational: emails, tweets, comments, ...
- Graphics & images, presentations
- Speech & video
- Maps & satellite imagery
- Local business information, yellow pages
Mismatch: the given representation in a specific medium vs. the semantic description of the information. This semantic gap needs to be bridged to establish relevance.

Internet Users December 26

The Use of Search Engines
- 70-80% of users use search engines to find Web sites
- More than 60% of online shoppers use search engines (and many more other search technologies) [compete.com, US

Section 2: A Historic Perspective

The Library
The knowledge repositories of our civilization:
- Library of Alexandria (280 BC): 700,000 scrolls
- Vatican Library (1500): 3,600 codices
- Herzog-August-Bibliothek (1661): 116,000 books
- British Museum (1845): 240,000 books
- Library of Congress (1990): 100,000,000 docs

The Library
Organise information using a subject catalogue:
- Sort cards by author
- Sort cards by title
- Sort cards by subject
How to do this? Librarians argued over which subject catalogue was the best to use.

At the same time
While librarians were coping with the information explosion:
- Could machines help?
- Could computers help?

Pioneers: Memex
Vannevar Bush, 1945
"Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, memex will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."

Semantic Gap
Hans Peter Luhn, 1957 & 1961
- Words of similar or related meaning are grouped into notional families
- Encoding of documents in terms of notional elements
- Matching by measuring the degree of notional similarity
- A common language for annotating documents
- The faculty of interpretation is beyond the talent of machines
- Statistical cues extracted by machines to assist the human indexer
Reference: H. P. Luhn: A statistical approach to mechanical literature searching, New York, IBM Research Center, 1957.

Vector Space Model
G. Salton, 1960s-1970s
- Represent queries and documents as high-dimensional vectors in a word vector space
- Each word can be associated with a weight
- Underlying mathematical framework: geometric
Reference: G. Salton, Automatic text processing: The transformation, analysis and retrieval of information by computer. Reading, MA:
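To make the geometric view concrete, here is a minimal sketch of vector space ranking. It assumes raw term frequencies as the word weights, whitespace tokenization, and cosine similarity as the matching function; the slide does not fix any of these choices, and the names `tf_vector`, `cosine`, `rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map a text to a sparse vector: term -> raw term frequency (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    q = tf_vector(query)
    return sorted(docs, key=lambda d: cosine(q, tf_vector(docs[d])), reverse=True)


# Toy collection, invented for illustration only.
docs = {
    "d1": "information retrieval with vectors",
    "d2": "cooking pasta at home",
}
ranking = rank("information retrieval", docs)
```

A real system would typically replace raw frequencies with a weighting such as tf-idf and add proper tokenization and stemming.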

Probabilistic Relevance Model
M. E. Maron and J. L. Kuhns, 1960
S. E. Robertson and K. Spärck Jones, 1976
J. M. Ponte and W. B. Croft, 1998
- View documents and queries as probability distributions over the underlying word space; match between probability distributions
- Underlying mathematical framework: probabilistic
References:
Robertson, S. E., & Spärck Jones, K.: Relevance weighting of search terms, Journal of the American Society for Information Science, 27:129-146, 1976.
Ponte, Jay M., and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proc. SIGIR, pp. 275-281. ACM Press.
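One way to illustrate the probabilistic view is query-likelihood scoring: each document is treated as a unigram distribution over words, and documents are ranked by the probability that this distribution generates the query. The sketch below assumes Jelinek-Mercer smoothing with a collection model and a mixing weight `lam = 0.5`; these are illustrative assumptions, not the exact estimators of the papers listed above, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | doc) under a smoothed unigram model.

    query, doc, collection are lists of tokens; collection is the
    concatenation of all documents. lam mixes the document model with
    the collection model (Jelinek-Mercer smoothing, an assumption here).
    """
    doc_tf = Counter(doc)
    col_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_col = col_tf[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            # Term never occurs in the collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection, invented for illustration only.
docs = [
    ["probabilistic", "retrieval", "model"],
    ["graph", "link", "analysis"],
]
collection = [token for doc in docs for token in doc]
query = ["retrieval", "model"]
scores = [query_log_likelihood(query, doc, collection) for doc in docs]
```

Smoothing matters here: without the collection component, any document missing a single query term would receive probability zero.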

Web Search Engines
L. Page, S. Brin, A. Singhal, many more, 2000 to today
- Underlying mathematical framework: graph theoretic & Markov chains
- Exploit link structure of the Web
- Exploit usage data
- Most successful company of all time: Google
- Index the entire Web, 10-100Bs of Web pages
- Query response 200ms, 2 trillion queries p.a. in 2013
- New engineering discipline: data engineering
Reference: L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web, 1999
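The graph-theoretic and Markov chain framework can be sketched with a small PageRank computation by power iteration. This is a minimal illustration, not a description of any production system: the damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from pages without outlinks, and the toy link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to. Pages without
    outlinks (dangling pages) spread their rank uniformly over all pages,
    which is one common convention and an assumption of this sketch.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        # Every page receives the teleportation share plus the dangling share.
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each page passes its damped rank evenly along its outlinks.
        for u, targets in links.items():
            if targets:
                share = damping * rank[u] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy link graph, invented for illustration: a and b link to c, c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
toy_rank = pagerank(toy_links)
```

The ranks form the stationary distribution of a random surfer who follows a link with probability `damping` and jumps to a random page otherwise, so they sum to 1.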

The Future?
- Can we make information retrieval systems more intelligent?
- Can they comprehend and combine the information available? Machine reading, text understanding, statistics + semantics
- Can they understand (or anticipate) user intention? Use of queries, but also context and user preferences

Section 4: Your near Future

Your IR Team
- Evangelos Kanoulas
- Anne Schuth
- Tomáš Tunys
- Tom Kenter

Lectures: tentative plan (subject to change)
Dates:
- Week 1: Monday, Jan 5; Tuesday, Jan 6; Thursday, Jan 8
- Week 2: Monday, Jan 12; Tuesday, Jan 13; Thursday, Jan 15
- Week 3: Monday, Jan 19; Tuesday, Jan 20; Thursday, Jan 22
- Week 4: Monday, Jan 26; Tuesday, Jan 27
Topics:
- Evaluation: Introduction & Administrivia; Offline Evaluation; Online Evaluation; Click Models
- Relevance Models and Scoring Functions: Relevance Models; Topic Models & Semantic Distance (word2vec); Semantic Matching
- Combining Evidence: Offline Learning to Rank; Online Learning to Rank; Link Analysis
- Applications of Information Retrieval: Question Answering (factoid & not); Temporal Information Retrieval & Contextual Suggestion

Work & Credit
Two programming assignments (individual; 30% of your grade):
- Evaluation measures (due Thursday, Jan. 15)
- Language models (due Thursday, Jan. 22)
Three programming projects (groups of 5; 70% of your grade):
- Evaluation (due Thursday, Jan. 15)
- Relevance models (due Thursday, Jan. 22)
- Learning to rank (due Thursday, Jan. 29)
No final exam

Pre-requisites and Outcomes
Pre-requisites:
- Python programming skills
- Basic knowledge of Information Retrieval: crawling, parsing & stemming, indexing, compression, scoring functions
- Basic knowledge of NLP and Machine Learning
Outcomes:
- Practical familiarity with a range of text analysis technologies
- Understanding of the theoretical models underlying these tools
- Competence (and courage!) in reading research literature

Learning resources
Lecture notes are the primary resource. There is no textbook as such, but the following texts are useful:
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Available free online)
- Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010.
- W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010.
- Information Retrieval Surveys (Available free online)
Citations to other readings will be given as required.