Text Technologies. What you will learn. Information Retrieval (IR) Overview. How to build a search engine. How to evaluate a search algorithm


Text Technologies
Victor Lavrenko, vlavrenk@inf
Monday 16th September 2013

What you will learn
- How to build a search engine
  - which search results to rank at the top
  - how to do it fast and on a massive scale
- How to evaluate a search algorithm
  - is system A really better than system B?
- How to work with text
  - do two web pages discuss the same topic?
  - handle misspellings, morphology, synonyms
  - build algorithms for languages you don't know

Overview
- Information Retrieval
- Two main issues in IR: speed and accuracy
- Documents, queries, relevance
- Bag-of-words trick
- Overview of Search System Architectures
- Other IR tasks
- Course mechanics

Information Retrieval (IR)
"Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information." (Salton, 1968)
- IR: core technology for text processing
- widely used in NLP/DBMS applications
- driving force behind web technologies

IR in a nutshell
- Find documents in response to the user's query

Two main issues in IR
- Effectiveness
  - need to find relevant documents
  - needle in a haystack: very different from relational DBs (SQL)
- Efficiency
  - need to find them quickly:
    - vast quantities of data (100b pages)
    - thousands of queries per second
    - data constantly changes, need to keep up
  - compared with other NLP areas, IR is very fast

Documents
- "documents" has a very wide meaning:
  - web pages, emails, Word/PDF/Excel, news
  - photos, videos, musical pieces, code
  - answers to questions
  - product descriptions, advertisements
  - may be in a different language
  - may not have words at all (e.g. DNA)
- IR: match A against a large set of Bs
  - problem arises in many different domains

Queries
- web search: query = a few keywords ("homer simpson")
- query = expression of an information need
  - describes what you want to find
  - can have many forms:
    - keywords, narrative, example document
    - question, photo, scribble, humming a tune

    #wsum(0.9 #field(title, #phrase(homer, simpson))
          0.7 #and(#>(pagerank, 3), #ow3(homer, simpson))
          0.4 #passage(homer, simpson, dan, castellaneta))

Relevance
- at an abstract level, IR is about:
  - does item D match item Q?
  - is item D relevant to item Q?
- relevance is a tricky notion
  - will the user like it / click on it?
  - will it help the user achieve a task?
  - or is it novel (not redundant)?
- common take: relevance = topicality / aboutness
  - i.e. D, Q share similar meaning
  - about the same topic / subject / issue

Why is matching a challenge?
- no clear semantics, contrast:
  - author = X123456 ("Shakespeare, William") vs.
  - "play, frequently attributed to Shakespeare, is in fact"
- inherent ambiguity of language:
  - synonymy: "banking crisis" = "financial meltdown"
  - polysemy: "Homer" can be Simpson or Greek author
- relevance highly subjective
  - Anomalous State of Knowledge (Belkin)
- relevance not observable (when we need it)
- on the web: counter SEOs / spam

How do search engines do it?
- not with relational DBs
  - ok in niche domains (libraries)
  - tagging works for multi-media
  - spammers, loses clarity with scale
  - semantic web: inconsistent ontologies
- not by understanding the language
  - NLP brittle in unrestricted domains
  - good w. fixed structure/vocabulary (e.g. takeovers)
  - computationally expensive
[figure labels: homer, simpson, cartoon]

Relevant Items are Similar
- Key idea:
  - use similar vocabulary → similar meaning
  - similar documents → relevant to same queries
- Similarity
  - string match
  - word overlap
  - P(same model)
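The "word overlap" notion of similarity can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the function names `words` and `word_overlap` are hypothetical, the tokenizer is a crude regular expression, and Jaccard overlap (shared words divided by all distinct words) is just one assumed way of turning overlap into a score.

```python
import re


def words(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap(a, b):
    """Jaccard overlap: shared words divided by all distinct words."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


# Hypothetical toy collection, for illustration only.
query = "homer simpson"
docs = [
    "Homer Simpson is a cartoon character",
    "Homer was an ancient Greek author",
    "The banking crisis deepened on Monday",
]

# Rank documents by decreasing overlap with the query.
ranked = sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)
for d in ranked:
    print(round(word_overlap(query, d), 3), d)
```

The sketch also shows why matching is hard: `word_overlap("banking crisis", "financial meltdown")` is 0.0 although the two phrases are synonymous, and the polysemous "Homer" gives the Greek-author document a non-zero score for a query about the cartoon.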

Bag-of-words trick
- Can you guess what this is about:
  - "beating falls 355 Dow another takes points"
    - "Dow takes another beating, falls 355 points"
  - "said fat fries McDonalds French obesity"
    - does "French" refer to France here? why?
- Re-ordering doesn't destroy meaning
  - individual words: building blocks
  - bag of words: a composition of meanings

Bag-of-words trick (2)
- Most search engines use BOW
  - treat documents, queries as bags of words
  - a bag is a set with repetitions (multi-set, urn)
  - match = degree of overlap between D, Q
- Retrieval models
  - statistical models that use words as features
  - decide which docs are most likely to be relevant
    - what should be the top 10 for "homer simpson"?
  - BOW makes these models tractable

Bag-of-words: criticisms
- word meaning lost without context
  - true, but BOW doesn't really discard context
  - it discards surface form / well-formedness of text
- what about negations, etc.?
  - "not, but he loves me" vs. "but he loves me not"
  - still discusses the same subject (him, me, love)
  - propagate negations to words: "but he not_loves me"
- does not work for all languages
  - no natural word unit: Chinese, images, music
  - circumvent by segmentation or feature induction
  - break/aggregate until units reflect aboutness

Systems perspective on IR
- get the data into the system
  - acquire the data from crawling, feeds, etc.
  - store the originals (if needed)
  - transform to BOW and index
- satisfy users' requests
  - assist user in formulating query
  - retrieve a set of results
  - help user browse / re-formulate
  - log user's actions, adjust retrieval model
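The "transform to BOW" step, and the bag-of-words ideas it relies on, can be sketched with Python's `collections.Counter`, which behaves as a multiset. This is a minimal sketch under stated assumptions: `bag_of_words`, `overlap` and `propagate_negation` are hypothetical names, the `NEGATIONS` list is a tiny illustrative set, and attaching a negation to the next word is only one simple way to realise the "propagate negations to words" idea.

```python
import re
from collections import Counter

NEGATIONS = {"not", "never", "no"}  # assumption: small illustrative list


def bag_of_words(text):
    """Return the multiset (bag) of lowercase tokens in the text."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def overlap(d, q):
    """Degree of overlap: size of the multiset intersection of two bags."""
    return sum((d & q).values())


def propagate_negation(text):
    """Attach each negation word to the token that follows it."""
    out, pending = [], False
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        if tok in NEGATIONS:
            pending = True
        elif pending:
            out.append("not_" + tok)
            pending = False
        else:
            out.append(tok)
    if pending:  # a trailing negation has nothing to attach to; keep it
        out.append("not")
    return " ".join(out)


# Re-ordering the words leaves the bag unchanged.
headline = bag_of_words("Dow takes another beating, falls 355 points")
shuffled = bag_of_words("beating falls 355 Dow another takes points")
print(headline == shuffled)

# Negation becomes part of the word, so it survives the loss of word order.
print(propagate_negation("but he does not love me"))
```

With negation folded into the token, "not_love" and "love" become different features, so the two sentences no longer produce identical bags even though word order is discarded.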

Indexing Process
- web-crawling, provider feeds, RSS feeds, desktop/email
  - what data do we want?
- what can you store? disk space? rights? compression?
- format conversion. international? which part contains meaning? word units? stopping? stemming?
- a lookup table for quickly finding all docs containing a word
(Addison Wesley, 2008)

Search Process
- help user formulate the query by suggesting what he could search for
- fetch a set of results, present to the user
- log user's actions: clicks, hovering, giving up
- iterate!
(Addison Wesley, 2008)

Applications of IR
- not always matching queries to documents

TDT Event Detection
- Put stories on same topic together
- Cross-language, cross-source, unsupervised
- sources: NBC, NPR, ABC, AP, ...
Victor Lavrenko, 2008
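Returning to the indexing process: the "lookup table for quickly finding all docs containing a word" is commonly called an inverted index. The sketch below is a minimal in-memory version, assuming a toy collection held in a dictionary; `tokenize`, `build_index`, `search_all` and the `STOPWORDS` list are hypothetical names chosen for illustration, stemming is left out, and a real engine would also compress the index and store it on disk.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "in", "is"}  # assumption: tiny illustrative list


def tokenize(text):
    """Split into lowercase word units and drop stopwords ("stopping")."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_index(docs):
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search_all(index, query):
    """Return ids of documents that contain every query word (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy collection, for illustration only.
docs = {
    1: "Homer Simpson is a cartoon character",
    2: "Homer is the author of the Iliad",
    3: "Dow takes another beating, falls 355 points",
}
index = build_index(docs)
print(index["homer"])
print(search_all(index, "homer simpson"))
```

A query is answered by intersecting the short lists stored for its words, so no document has to be scanned at query time; that is what keeps lookups fast as the collection grows.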

Market Event Detection
- Microsoft (MSFT) stock
- News: "Software giant Microsoft saw its shares dip a few percentage points this morning after U.S. District Judge Thomas Penfield Jackson issued his "findings of fact" in the government's ongoing antitrust case against the Seattle wealth-creation machine..."
- Words like "Jackson" and "antitrust" are more likely in the stories preceding the plunge.

  word        P(word)   P(word | MSFT)
  shares      0.074     0.071
  antitrust   0.009     0.044
  judge       0.006     0.039
  trading     0.032     0.029
  against     0.025     0.027
  Jackson     0.001     0.025

  P(MSFT | Jackson) = P(Jackson | MSFT) P(MSFT) / P(Jackson)

Victor Lavrenko, 2008

Summary
- Information Retrieval (IR): core technology
  - selling point: IR is very fast, provides context
- Main issues: effectiveness and efficiency
- Documents, queries, relevance
- Bag-of-words trick
- Search system architecture:
  - indexing: get data into the system
  - searching: help users find relevant data

Administrivia

Course Information
- Lectures: Mon/Thu 12-1pm in 7 George Sq, F.21
- Coursework due 4pm Mondays in weeks 4, 6, 8, 10
- Practical labs: TBD in weeks 2, 4, 6, 8
  - sign up: TBD
  - TA: Dominik Wurzer
- Textbook: Search Engines: Information Retrieval in Practice
  - very useful but not required (attend all lectures)
- Syllabus / assignments / notes / readings: www.inf.ed.ac.uk/teaching/courses/tts
- Ask all questions on the forum: www.forums.ed.ac.uk/viewforum.php?f=821

Assessment
- (70%) written exam: April 2014
- (30%) four programming assignments
  - due 4pm Monday in weeks 4, 6, 8, 10
  - must use Python
    - tutorial: TBD
  - last year's topics (subject to change)
    - intelligent web crawler
    - plagiarism detector
    - image search engine
    - PageRank on emails
- all deadline extensions through ITO

Ethics
- Submitted work must be your own
  - ok to discuss assignments with each other
  - not ok to share code / data / figures
  - suggestion: talk, don't write anything down
- Always cite your sources
  - including web, fellow students, other courses
  - exception: lectures from this course
  - when in doubt: cite

This is not an easy class
- Hours spent on TTS coursework in past years:
  - on average: 60 hours
  - 1 in 5 students spent over 100 hours
- Relative to other Informatics courses:
  - 70% felt TTS required substantially more work
  - 85% felt they learned more in TTS
- Common Marking Scale:
  - doing everything that's asked gets a mark of 70% (A)
  - marks above 70% reserved for extra achievement
- Exam average 2011-13: 55%, 51%, 53%

c.o. James Allan