Text Technologies for Data Science (INFR11145): Definitions. Lecture objectives: learn about the main concepts in IR. Instructor: Walid Magdy. 19 September 2017.

Transcription:

Text Technologies for Data Science (INFR11145)
Definitions
Instructor: Walid Magdy
19-Sep-2017

Lecture Objectives
Learn about the main concepts in IR:
  - Document
  - Information need
  - Query
  - Index
  - BOW

IR in a nutshell
[Figure: a user sends a query to a search engine, which searches a collection of documents (e.g. "A System and Method for ...") and returns the relevant documents.]

IR, basic form
Given a query Q, find the relevant documents D.

Two main issues in IR
  - Effectiveness
      - need to find relevant documents
      - needle in a haystack
      - very different from relational DBs (SQL)
  - Efficiency
      - need to find them quickly
      - vast quantities of data (tens of billions of pages)
      - thousands of queries per second (Google: 40,000)
      - data constantly changes, need to keep up
      - compared with other NLP areas, IR is very fast

IR main components
  - Documents
  - Queries
  - Relevant documents

Documents
  - The element to be retrieved
  - Unstructured nature
  - Unique ID
  - N documents form a collection
  - Examples:
      - web pages, emails, books, pages, sentences, tweets
      - photos, videos, musical pieces, code
      - answers to questions
      - product descriptions, advertisements
  - May be in a different language
  - May not have words at all (e.g. DNA)

Queries
  - Free text to express the user's information need
  - The same information need can be described by multiple queries:
      - "Latest news on the hurricane in the US"
      - "Florida storm"
      - "Irma"
  - The same query can represent multiple information needs:
      - "Apple"
      - "Jaguar"

Queries: different forms
  - Web search: keywords, narrative
  - Image search: keywords, sample image
  - QA: question
  - Music search: humming a tune
  - Filtering/recommendation: user's interest/history
  - Scholar search: structured (author, title, ...)
  - Advanced search:
      #wsyn(0.9 #field (title, #phrase (homer,simpson))
            0.7 #and (#> (pagerank,3), #ow3 (homer,simpson))
            0.4 #passage (homer, simpson, dan, castellaneta))

Relevance
  - At an abstract level, IR is about:
      - does item D match item Q?
      - or: is item D relevant to item Q?
  - Relevance is a tricky notion:
      - will the user like it / click on it?
      - will it help the user achieve a task? (satisfy the information need)
      - is it novel (not redundant)?
  - Relevance = what is the topic about?
      - i.e. D and Q share similar meaning
      - about the same topic / subject / issue

What is the challenge in relevance?
  - No clear semantics. Contrast the query "William Shakespeare":
      - the author's history?
      - a list of plays?
      - a play by him?
  - Inherent ambiguity of language:
      - synonymy: "Florida storm" = "Irma hurricane"
      - polysemy: "Apple", "Jaguar"
  - Relevance is highly subjective:
      - Rel: yes/no
      - Rel: perfect/excellent/good/fair/bad
  - On the web: counter SEOs / spam

Relevant items are similar
  - Key idea:
      - use of similar vocabulary -> similar meaning
      - similar documents -> relevant to the same queries
  - Similarity:
      - string match
      - word overlap
      - P(D|Q): retrieval model
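The "word overlap" notion of similarity above can be sketched in a few lines. This is only an illustration, not the measure used in the lecture: the Jaccard coefficient, the whitespace tokenizer, and the example documents are all assumptions chosen for brevity.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def jaccard(doc, query):
    """Word-overlap similarity: shared vocabulary divided by combined vocabulary."""
    d, q = set(tokenize(doc)), set(tokenize(query))
    if not d and not q:
        return 0.0
    return len(d & q) / len(d | q)


# Hypothetical mini-collection, loosely based on the examples in the slides.
docs = [
    "Florida storm Irma latest news",
    "Apple releases a new phone",
]
query = "latest news on the hurricane in Florida"

scores = [jaccard(d, query) for d in docs]
for doc, score in zip(docs, scores):
    print(f"{score:.2f}  {doc}")
```

Note that "hurricane" in the query and "storm" in the first document contribute nothing to the score, which is the synonymy problem from the previous slide in miniature.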

IR vs. DB
  What we're retrieving
      Databases: Structured data. Clear semantics based on a formal model.
      IR: Mostly unstructured. Free text with some metadata.
  Queries we're posing
      Databases: Formally defined (relational algebra, SQL). Unambiguous.
      IR: Free text ("natural language"), Boolean.
  Results we get
      Databases: Exact (always "correct").
      IR: Imprecise (need to measure relevance).
  Interaction with system
      Databases: One-shot queries.
      IR: Interaction is important.
  (Tamer Elsayed, QU)

How does IR see documents?

Bag-of-words trick
  - Can you guess what this is about?
      - "per is salary hour 5,594 Neymar"
        -> "Neymar's salary per hour is 5,594"
      - "obesity French is of full cause and fat fries"
        -> "French fries is full of fat and cause obesity"
  - Re-ordering doesn't destroy the topic
      - individual words are the building blocks
      - bag of words: a composition of meanings

Bag-of-words trick
  - Most search engines use BOW
      - treat documents and queries as bags of words
      - a bag is a set with repetitions
      - match = degree of overlap between D and Q
  - Retrieval models
      - statistical models (functions) that use words as features
      - decide which documents are most likely to be relevant
      - what should be the top results for Q?
  - BOW makes these models tractable
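A minimal sketch of the bag-of-words idea, assuming whitespace tokenization and lowercasing (neither is specified in the slides). It uses Python's `collections.Counter` as the "set with repetitions" and counts shared word occurrences as the degree of overlap; the helper names are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Represent text as a bag: each word mapped to its number of occurrences."""
    return Counter(text.lower().split())


def overlap(doc_bag, query_bag):
    """Degree of overlap: number of word occurrences the two bags share."""
    return sum((doc_bag & query_bag).values())


scrambled = bag_of_words("obesity French is of full cause and fat fries")
ordered = bag_of_words("French fries is full of fat and cause obesity")

# Word order is discarded, so both sentences map to the same bag.
print(scrambled == ordered)

# A bag keeps repetitions, unlike a plain set.
print(bag_of_words("to be or not to be")["to"])

# Matching a query against a document is a comparison of two bags.
print(overlap(bag_of_words("French fries is full of fat"), bag_of_words("fat fries")))
```

A real retrieval model would weight the shared words rather than simply count them; the raw count here only illustrates why the representation makes matching cheap.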

Bag-of-words: Criticism
  - Word meaning is lost without context
      - true, but BOW doesn't really discard context
      - it discards surface form / well-formedness of text
  - What about negations, etc.?
      - {not, climate change is real} vs. {climate change is not real}
      - still discusses the same subject (climate change, real)
      - restrict negations to words: {"not real"}
  - Does not work for all languages
      - no natural word unit for Chinese, images, music
      - solve by segmentation or feature induction
      - break/aggregate until units reflect aboutness
      - e.g. German decompounding, Chinese segmentation

IR Black Box
[Figure: the query (online) and the documents (offline) each pass through a representation/transformation function; a comparison function (the retrieval model) matches the query representation against the BOW index and returns hits. Source: Tamer Elsayed, QU]
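One way to read "restrict negations to words" is to glue a negator onto the word that follows it before building the bag, so that "not real" survives as a single unit. The sketch below is an assumption about how that could be done, not the lecture's method; the negator list and the underscore convention are hypothetical.

```python
def merge_negations(tokens, negators=("not", "no", "never")):
    """Attach each negator to the next token, e.g. ['not', 'real'] -> ['not_real'].

    The negator list is a hypothetical English-only example.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in negators and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the token that was just merged
        else:
            merged.append(tokens[i])
            i += 1
    return merged


affirmed = merge_negations("climate change is real".split())
negated = merge_negations("climate change is not real".split())
print(affirmed)
print(negated)
```

Both sentences still share "climate" and "change", so they remain about the same subject, while "real" and "not_real" are now distinct features.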

Systems perspective on IR
  - Indexing process (offline): get the data into the system
      - acquire the data from crawling, feeds, etc.
      - store the originals (if needed)
      - transform to BOW and index
  - Search (retrieval) process (online): satisfy users' requests
      - assist the user in formulating the query
      - retrieve a set of results
      - help the user browse / re-formulate
      - log the user's actions, adjust the retrieval model

Indexing Process
[Figure: documents (e.g. "A System and Method for ...") flow through the following stages. Source: Addison Wesley, 2008]
  - Documents acquisition: what data do we want?
      - web-crawling, provider feeds, RSS feeds, desktop/email
  - Document data store: document unique ID
      - what can you store? disk space? rights? compression?
  - Text transformation: format conversion. international?
      - which part contains meaning?
      - word units? stopping? stemming?
  - Index creation -> Index
      - a lookup table for quickly finding all docs containing a word
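The index described above, "a lookup table for quickly finding all docs containing a word", is commonly built as an inverted index. The following is a minimal sketch under stated assumptions: the stop list, the punctuation stripping, and the three example documents are hypothetical, and stemming is left out entirely.

```python
from collections import defaultdict

# Hypothetical, tiny stop list; real systems use longer, language-specific lists.
STOPWORDS = {"a", "and", "for", "is", "of", "the"}


def transform(text):
    """Text transformation: strip punctuation, lowercase, drop stop words."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]


def build_index(collection):
    """Index creation: map each term to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index


# Document data store: unique ID -> original text.
collection = {
    1: "A System and Method for indexing documents",
    2: "French fries is full of fat",
    3: "Indexing the web is hard",
}

index = build_index(collection)
print(sorted(index["indexing"]))
```

Looking up a word is then a single dictionary access, independent of how many documents were indexed, which is what makes the offline cost of building the index worthwhile.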

Search Process
[Figure: the user interacts with the components below, which draw on the document data store, the index, and the log data. Source: Addison Wesley, 2008]
  - User interaction: help the user formulate the query by suggesting what they could search for
  - Ranking: fetch a set of results, present them to the user
  - Evaluation: log the user's actions (clicks, hovering, giving up)
  - Iterate!

Summary
  - Information Retrieval (IR): core technology
      - selling point: IR is very fast, provides context
  - Main issues: effectiveness and efficiency
  - Documents, queries, relevance
  - Bag-of-words trick
  - Search system architecture:
      - indexing: get data into the system
      - searching: help users find relevant data
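The online half of the architecture (fetch results from the index, rank them, log what happened) can be sketched as follows. This is an assumption-laden toy: ranking by the number of distinct query terms matched stands in for a real retrieval model, and `build_index`, `search`, `query_log`, and the example collection are all hypothetical names and data.

```python
from collections import Counter


def build_index(collection):
    """Offline step: map each lowercased word to the IDs of documents containing it."""
    index = {}
    for doc_id, text in collection.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, k=10):
    """Online step: score documents by how many distinct query terms they contain."""
    scores = Counter()
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


collection = {
    1: "florida storm irma",
    2: "irma hurricane hits florida coast",
    3: "apple jaguar",
}
index = build_index(collection)

query_log = []  # stand-in for the log data used later for evaluation
query = "florida hurricane irma"
results = search(index, query)
query_log.append((query, [doc_id for doc_id, _ in results]))
print(results)
```

Only documents that share at least one term with the query are ever touched, so the cost of a search depends on the query terms' posting lists rather than on the size of the whole collection.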

Questions

  - Next time:
      - Laws of text (Zipf)
      - Vector space models
  - Reading: Search Engines: Information Retrieval in Practice, chapters 1 & 2
  - Videos: The Zipf Mystery, Vsauce
  - Tools: Perl regular expressions: https://perldoc.perl.org/perlre.html