Introduction & Administrivia Information Retrieval Evangelos Kanoulas ekanoulas@uva.nl
Section 1: Unstructured data
Big Data
Growth of global data volume: data everywhere!
- Web data: observation, interaction, transaction
- Smartphones, personal devices, traces in the real world
- Sensors, internet of things
Scientific and technical challenges: how to make sense of data?
- Data centers, virtualization, storage (non-RDBMS), MapReduce, indexing & search, large-scale machine learning
The Rise of Unstructured Data
- Business: 80% of business is conducted on unstructured data
- Consumer
Media & Sources
What types of unstructured information exist?
- Text: Web pages, books, articles, papers, reports, letters, blogs, ...
- Conversational: emails, tweets, comments, ...
- Graphics & images, presentations
- Speech & video
- Maps & satellite imagery
- Local business information, yellow pages
Mismatch: the given representation in a specific medium vs. a semantic description of the information
The semantic gap needs to be bridged to establish relevance.
[Figure: Internet Users, December 26]
The Use of Search Engines
- 70-80% of users use search engines to find Web sites
- More than 60% of online shoppers use search engines (and many more use other search technologies) [compete.com, US
Section 2: A Historic Perspective
The Library: the knowledge repositories of our civilization
- Library of Alexandria (280 BC): 700,000 scrolls
- Vatican Library (1500): 3,600 codices
- Herzog-August-Bibliothek (1661): 116,000 books
- British Museum (1845): 240,000 books
- Library of Congress (1990): 100,000,000 docs
The Library
Organise information using a subject catalogue
- Sort cards by author
- Sort cards by title
- Sort cards by subject
How to do this?
Librarians argued over which was the best subject catalogue to use
At the same time
While librarians were coping with the information explosion:
- Could machines help?
- Could computers help?
Pioneers: Memex
Vannevar Bush, 1945
"Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, memex will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."
Semantic Gap
Hans Peter Luhn, 1957 & 1961
- Words of similar or related meaning are grouped into notional families
- Encoding of documents in terms of notional elements
- Matching by measuring the degree of notional similarity
- A common language for annotating documents
- "the faculty of interpretation is beyond the talent of machines."
- Statistical cues extracted by machines to assist the human indexer
Reference: H. P. Luhn: A statistical approach to mechanical literature searching, New York, IBM Research Center, 1957.
Vector Space Model
G. Salton, 1960s-1970s
- Represent queries and documents by a high-dimensional vector in a word vector space
- Each word can be associated with a weight
- Underlying mathematical framework: geometric
Reference: G. Salton, Automatic text processing: The transformation, analysis and retrieval of information by computer. Reading, MA:
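The vector space idea can be illustrated with a minimal Python sketch. It assumes raw term frequencies as the word weights and cosine similarity as the geometric match; both are illustrative choices, not necessarily the weighting used in Salton's work. The names `to_vector`, `cosine`, `rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Map a text to a sparse vector: word -> weight (here: raw term frequency)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse word vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    q = to_vector(query)
    return sorted(docs, key=lambda d: cosine(q, to_vector(docs[d])), reverse=True)


# Toy collection (assumed for illustration only).
docs = {
    "d1": "the library keeps a subject catalogue",
    "d2": "search engines index the web",
    "d3": "the web links pages",
}

if __name__ == "__main__":
    print(rank("library catalogue", docs))
```

In practice the raw counts would usually be replaced by a weighting scheme such as tf-idf; the geometry of the match stays the same.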
Probabilistic Relevance Model
M. E. Maron and J. L. Kuhns, 1960
S. E. Robertson and K. Spärck Jones, 1976
J. M. Ponte and W. B. Croft, 1998
- View documents and queries as probability distributions over the underlying word space; match between probability distributions
- Underlying mathematical framework: probabilistic
References:
Robertson, S. E., & Spärck Jones, K.: Relevance weighting of search terms, Journal of the American Society for Information Science, 27:129-146, 1976.
Ponte, Jay M., and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proc. SIGIR, pp. 275-281. ACM Press.
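One way to make the probabilistic view concrete is query-likelihood scoring with unigram language models, sketched below in Python. The sketch assumes Jelinek-Mercer smoothing against the whole collection with a mixing weight `lam`; this is a common textbook choice and not the estimator used by Ponte and Croft. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | document model) under a smoothed unigram language model.

    query, doc and collection are lists of tokens. lam is the (assumed)
    Jelinek-Mercer weight given to the document model.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / len(collection)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term never seen anywhere in the collection
        score += math.log(p)
    return score


# Toy collection (assumed for illustration only).
docs = {
    "d1": "the library keeps a subject catalogue".split(),
    "d2": "search engines index the web".split(),
}
collection = [token for tokens in docs.values() for token in tokens]

if __name__ == "__main__":
    query = ["library", "catalogue"]
    for doc_id, tokens in docs.items():
        print(doc_id, query_log_likelihood(query, tokens, collection))
```

Smoothing matters here: without the collection term, any document missing a single query word would receive probability zero.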
Web Search Engines
L. Page, S. Brin, A. Singhal, many more, 2000-today
- Underlying mathematical framework: graph theoretic & Markov chains
- Exploit link structure of the Web
- Exploit usage data
- Most successful company of all time: Google
- Index the entire Web, tens to hundreds of billions of Web pages
- Query response in 200 ms, 2 trillion queries per year in 2013
- New engineering discipline: data engineering
Reference: L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web, 1999
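The link-structure idea behind PageRank can be sketched as a power iteration over a random-surfer Markov chain. The Python sketch below makes several assumptions that are not stated on the slide: a damping factor of 0.85, a fixed number of iterations, and uniform redistribution of the rank of pages without out-links. The function name and the toy graph are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed, illustrative defaults.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy Web graph (assumed for illustration only).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

if __name__ == "__main__":
    for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```

In this toy graph, page "c" collects links from three pages and ends up with the highest score, while "d", which nobody links to, keeps only the baseline share.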
The Future?
Can we make information retrieval systems more intelligent?
- Can they comprehend and combine the information available?
  machine reading, text understanding, statistics + semantics
- Can they understand (or anticipate) user intention?
  use of queries, but also context, user preferences
Section 4: Your near Future
Your IR Team
Evangelos Kanoulas, Anne Schuth, Tomáš Tunys, Tom Kenter
Lectures: tentative plan (subject to change)
Dates
- Week 1: Monday, Jan 5; Tuesday, Jan 6; Thursday, Jan 8
- Week 2: Monday, Jan 12; Tuesday, Jan 13; Thursday, Jan 15
- Week 3: Monday, Jan 19; Tuesday, Jan 20; Thursday, Jan 22
- Week 4: Monday, Jan 26; Tuesday, Jan 27
Topics
- Evaluation: Introduction & Administrivia; Offline Evaluation; Online Evaluation; Click Models
- Relevance Models and Scoring Functions: Relevance models; Topic Models & Semantic Distance (word2vec); Semantic Matching
- Combining Evidence: Offline learning to rank; Online learning to rank; Link Analysis
- Applications of Information Retrieval: Question Answering (factoid & not); Temporal Information Retrieval & Contextual Suggestion
Work & Credit
Two programming assignments (individual; 30% of your grade)
- Evaluation measures (due Thursday, Jan. 15)
- Language models (due Thursday, Jan. 22)
Three programming projects (groups of 5; 70% of your grade)
- Evaluation (due Thursday, Jan. 15)
- Relevance models (due Thursday, Jan. 22)
- Learning to rank (due Thursday, Jan. 29)
No final exam
Pre-requisites and Outcomes
Pre-requisites
- Python programming skills
- Basic knowledge in Information Retrieval: crawling, parsing & stemming, indexing, compression, scoring functions
- Basic knowledge in NLP and Machine Learning
Outcomes
- Practical familiarity with a range of text analysis technologies
- Understanding of the theoretical models underlying these tools
- Competence (and courage!) in reading research literature
Learning resources
Lecture notes are the primary resources
No textbook as such, but the following texts are useful:
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. (Available free online)
- Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010.
- W. Bruce Croft, Donald Metzler, Trevor Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2010.
- Information Retrieval Surveys (Available free online)
Citations to other readings will be given as required