Information Retrieval. Administrative. Statistical MT Overview. Problems for Statistical MT

Introduction to Information Retrieval
CS457 Fall 2011
David Kauchak
adapted from: http://www.stanford.edu/class/cs276/handouts/lecture1-intro.ppt

Administrative
- Projects
  - Status 2 on Friday
  - Paper next Friday
  - Work on the paper in parallel if you're not done with experiments by early next week
- CS lunch today!

Statistical MT Overview
- Training: bilingual data and monolingual data -> training -> learned parameters -> model
- Translation: foreign sentence -> find the best translation given the foreign sentence and the model -> English sentence

Problems for Statistical MT
- Preprocessing
- Language modeling
- Translation modeling
- Decoding
- Parameter optimization

Phrase-Based Statistical MT
  Morgen fliege ich nach Kanada zur Konferenz
  Tomorrow I will fly to the conference in Canada
- Foreign input is segmented into phrases (a phrase is any sequence of words)
- Each phrase is probabilistically translated into English
  - P(to the conference | zur Konferenz)
  - P(into the meeting | zur Konferenz)
- Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an intro.

Information retrieval (IR)
- What comes to mind when I say information retrieval?
- Where have you seen IR? What are some real-world examples/uses?
  - Search engines
  - File search (e.g. OS X Spotlight, Windows Instant Search, Google Desktop)
  - Databases?
  - Catalog search (e.g. library)
  - Intranet search (i.e. corporate networks)

Web search
[Figures: July 2006, Feb 2011]

Challenges
Why is information retrieval hard?
- Lots and lots of data
  - efficiency
  - storage
  - discovery (web)
- Data is unstructured
- Querying/understanding user intent
- SPAM
- Data quality

Information Retrieval
Information Retrieval is finding material in documents of an unstructured nature that satisfy an information need from within large collections of digitally stored content

Information Retrieval
Information Retrieval is finding material in text documents of an unstructured nature that satisfy an information need from within large collections of digitally stored content

Example information needs:
- Find all documents about computer science
- Find all course web pages at Middlebury
- What is the cheapest flight from LA to NY?
- Who was the 15th president?

What is the difference between an information need and a query?

  Information need                               Query
  Find all documents about computer science      computer science
  Find all course web pages at Middlebury        Middlebury AND college AND url-contains class
  Who was the 15th president?                    WHO=president NUMBER=15

IR vs. databases
Structured data tends to refer to information in tables

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Unstructured (text) vs. structured (database) data in 1996
[Figure]

Unstructured (text) vs. structured (database) data in 2006
[Figure]

Challenges
Why is information retrieval hard?
- Lots and lots of data
  - efficiency
  - storage
  - discovery (web)
- Data is unstructured
- Understanding user intent
- SPAM
- Data quality

Efficiency
200 million tweets/day over 4 years = ~300 billion tweets
How much data is this?
- ~40 TB of data uncompressed for the text itself
- ~400 TB of data including additional meta-data
300 billion web pages?
- assume web pages are 100 times longer than tweets: 4 PB of data, 1000 4 TB disks
- assume web pages are 1000 times longer than tweets: 40 PB of data, 10,000 4 TB disks
- assume web pages are 10,000 times longer than tweets: 400 PB of data, 100,000 4 TB disks
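The arithmetic behind these estimates can be checked with a short sketch. It assumes 140 bytes per tweet (one byte per character) and decimal units (1 TB = 10^12 bytes); neither figure is stated on the slide, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope check of the storage estimates.
TWEETS_PER_DAY = 200_000_000
DAYS = 4 * 365
BYTES_PER_TWEET = 140          # assumption: 140 one-byte characters per tweet
DISK_BYTES = 4 * 10**12        # a 4 TB disk in decimal (SI) units


def disks_needed(total_bytes: int, disk_bytes: int = DISK_BYTES) -> int:
    """Whole disks needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // disk_bytes)


total_tweets = TWEETS_PER_DAY * DAYS
tweet_text_bytes = total_tweets * BYTES_PER_TWEET
print(f"{total_tweets / 1e9:.0f} billion tweets, "
      f"{tweet_text_bytes / 1e12:.1f} TB of text")

# The slide rounds the text size to 40 TB before scaling up to web pages.
ROUNDED_TEXT_BYTES = 40 * 10**12
for factor in (100, 1_000, 10_000):
    page_bytes = ROUNDED_TEXT_BYTES * factor
    print(f"{factor:>6}x longer: {page_bytes / 1e15:.0f} PB, "
          f"{disks_needed(page_bytes):,} disks")
```

Under these assumptions the unrounded figures are 292 billion tweets and about 40.9 TB of text, which the slide rounds to ~300 billion and ~40 TB.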

Efficiency
- Can we store all of the documents in memory?
- How long will it take to do a naïve search of the data?
- To search over a small data collection, almost any approach will work (e.g. grep)
- At web scale, there are many challenges:
  - queries need to be really fast!
  - massive parallelization
  - redundancy (hard-drives fail, networks fail, ...)

Unstructured data in 1680
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- All of Shakespeare's plays: how can we answer this query quickly?
- Key idea: we can pre-compute some information about the plays/documents that will make queries much faster
- What information do we need?
- Indexing: for each word, keep track of which documents it occurs in

Inverted index
For each term/word, store a list of all documents that contain it
What data structures might we use for this?

  array:        Brutus: 2 4 8 16 32 64 128
  linked list:  Brutus: 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  hashtable:    Brutus: 4 128 32 64 2 8 16
  (each entry is a docid)
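As a minimal sketch of the idea, the following builds an inverted index as a dictionary from term to a sorted list of docIDs. The three toy documents and the `build_inverted_index` helper are hypothetical, and the tokenization (lowercasing and stripping punctuation) is a deliberately crude stand-in for real text preprocessing.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of docIDs that contain it."""
    postings: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            term = token.strip(".,;:!?\"'()")   # crude punctuation removal
            if term:
                postings[term].add(doc_id)
    # Sort each postings list once, at index-construction time.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical toy collection, keyed by docID.
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar praised Calpurnia.",
    3: "Brutus and Caesar spoke.",
}
index = build_inverted_index(docs)
print(index["brutus"])   # [1, 3]
print(index["caesar"])   # [1, 2, 3]
```

A Python list plays the role of the array option here; a set is used only while building so that repeated words in one document do not create duplicate entries.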

Inverted index
The most common approach is to use a linked list representation

  Dictionary    Postings lists (each docID entry is a posting)
  Brutus        2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
  Calpurnia     1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
  Caesar        13 -> 16

Inverted index construction
  Documents to be indexed: Friends, Romans, countrymen.
  -> text preprocessing: friend, roman, countryman
  -> indexer
  -> inverted index:
       friend      2 4
       roman       1 2
       countryman  13 16

Boolean retrieval
Support queries that are boolean expressions:
- A boolean query uses AND, OR and NOT to join query terms
  - Caesar AND Brutus AND NOT Calpurnia
  - Pomona AND College
  - (Mike OR Michael) AND Jordan AND NOT (Nike OR Gatorade)
- Given only these operations, what types of questions can't we answer?
  - Phrases, e.g. "Middlebury College"
  - Proximity, e.g. Michael within 2 words of Jordan
  - Regular expression-like

Boolean retrieval
- Primary commercial retrieval tool for 3 decades
- Professional searchers (e.g., lawyers) still like boolean queries
- Why?
  - You know exactly what you're getting: a query either matches or it doesn't
  - Through trial and error, can frequently fine tune the query appropriately
  - Don't have to worry about underlying heuristics (e.g. PageRank, term weightings, synonyms, etc.)
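One way to read a Boolean query is as set algebra over postings: AND is intersection, OR is union, and AND NOT is set difference. The sketch below uses the Brutus, Calpurnia and Caesar postings from the slide diagram; the `postings` helper and the use of Python sets (rather than linked lists) are illustrative assumptions.

```python
# Postings from the slide diagram, stored as sets of docIDs for clarity.
index = {
    "brutus": {2, 4, 8, 16, 32, 64, 128},
    "calpurnia": {1, 2, 3, 5, 8, 13, 21, 34},
    "caesar": {13, 16},
}


def postings(term: str) -> set[int]:
    """Look up a term, returning an empty set if it is not in the dictionary."""
    return index.get(term.lower(), set())


# Caesar AND Brutus AND NOT Calpurnia
result = (postings("Caesar") & postings("Brutus")) - postings("Calpurnia")
print(sorted(result))    # [16]

# Brutus OR Caesar
either = postings("Brutus") | postings("Caesar")
print(sorted(either))    # [2, 4, 8, 13, 16, 32, 64, 128]
```

Phrase and proximity queries cannot be expressed this way, because the sets record only which documents contain a term, not where it occurs.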

Example: WestLaw http://www.westlaw.com/
- Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
- Tens of terabytes of data; 700,000 users
- Majority of users still use boolean queries
- Example query: What is the statute of limitations in cases involving the federal tort claims act?
    LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  - LIMIT! = all words starting with LIMIT
  - /3 = within 3 words, /S = in same sentence

Query processing: AND
What needs to happen to process: Brutus AND Caesar
- Locate Brutus and Caesar in the Dictionary
- Retrieve postings lists
- Merge the two postings

The merge
- Walk through the two postings simultaneously
- What assumption are we making about the postings lists?
- For efficiency, when we construct the index, we ensure that the postings lists are sorted
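A minimal sketch of this walk for an AND query follows, assuming both postings lists are sorted in increasing docID order. The function name `intersect` is an arbitrary choice, and Python lists stand in for the linked lists; the example postings are the Brutus and Calpurnia lists from the earlier diagram.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Intersect two sorted postings lists by walking both at once."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID occurs in both lists: keep it and advance both pointers
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # p1 is behind, so its current docID cannot match
        else:
            j += 1   # p2 is behind
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
calpurnia = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, calpurnia))   # [2, 8]
```

Each loop iteration advances at least one pointer, so the walk never takes more steps than the combined length of the two lists. That bound only holds because the lists are sorted.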

The merge
- Walk through the two postings simultaneously
- What is the running time? O(length1 + length2)

Boolean queries: More general merges
Which of the following queries can we still do in time O(length1 + length2)?
- Brutus AND NOT Caesar
- Brutus OR NOT Caesar

From boolean to Google
What are we missing?
- Phrases: "Middlebury College"
- Proximity: Find Gates NEAR Microsoft
- Ranking search results
- Incorporate link structure
- Document importance

Positional indexes
In the postings, store a list of the positions in the document where the term occurred

From boolean to Google
- Phrases: "Middlebury College"
- Proximity: Find Gates NEAR Microsoft
- Ranking search results
- Incorporate link structure
- Document importance

Rank documents by text similarity
Ranked information retrieval!
- Simple version: Vector space ranking (e.g. TF-IDF)
  - include occurrence frequency
  - weighting (e.g. IDF)
  - rank results by similarity between query and document
- Realistic version: many more things in the pot
  - treat different occurrences differently (e.g. title, header, link text, ...)
  - many other weightings
  - document importance
  - spam
  - hand-crafted/policy rules

IR with TF-IDF
How can we change our inverted index to make ranked queries (e.g. TF-IDF) fast?
- Store the TF initially in the index
- In addition, store in the index the number of documents the term occurs in
- IDFs
  - We can either compute these on the fly using the number of documents each term occurs in
  - Or we can make another pass through the index and update the weights for each entry
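The following sketch shows one simple way to rank with such an index, under stated assumptions: each posting stores the raw term frequency, the document frequency is the length of the postings map, IDF is computed on the fly as log(N / df), and a document's score is the sum of tf * idf over the query terms. This is a simplified stand-in for full vector space ranking (no length normalization or cosine similarity), and the toy documents and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_tf_index(docs: dict[int, str]) -> dict[str, dict[int, int]]:
    """Map term -> {docID: term frequency in that document}."""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def rank(query: str, index: dict[str, dict[int, int]], num_docs: int):
    """Score documents by summed tf * idf, highest score first."""
    scores: dict[int, float] = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue                      # term not in the collection
        idf = math.log(num_docs / len(postings))   # computed on the fly
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Break ties by docID so the ordering is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy collection.
docs = {
    1: "caesar caesar brutus",
    2: "caesar calpurnia",
    3: "brutus senate senate",
}
index = build_tf_index(docs)
ranking = rank("caesar brutus", index, len(docs))
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Document 1 ranks first because it contains both query terms and repeats one of them. Precomputing the weights in a second pass over the index, the other option on the slide, would trade index-build time for less work per query.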

From boolean to Google
- Phrases: "Middlebury College"
- Proximity: Find Gates NEAR Microsoft
- Ranking search results
  - include occurrence frequency
  - weighting
  - treat different occurrences differently (e.g. title, header, link text, ...)
- Incorporate link structure
- Document importance

The Web as a Directed Graph
  Page A --hyperlink--> Page B
- A hyperlink between pages denotes author perceived relevance AND importance
- How can we use this information?

Query-independent ordering
- First generation: using link counts as simple measures of popularity
- Two basic suggestions:
  - Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (3+2=5)
  - Directed popularity: score of a page = number of its in-links (3)

What is pagerank?

The random surfer model
- Imagine a user surfing the web randomly using a web browser
- The pagerank score of a page is the probability that the user will visit that page
[Image: http://images.clipartof.com/small/7872-clipart-picture-of-a-world-earth-Globe-Mascot-Cartoon-Character-Surfing-On-A-Blue-And-Yellow-Surfboard.jpg]
Problems?

Random surfer model
- We want to model the behavior of a random user interfacing with the web through a browser
- Model is independent of content (i.e. just graph structure)
- What types of behavior should we model and how?
  - Where to start
  - Following links on a page
  - Typing in a url (bookmarks)
  - What happens if we get a page with no outlinks
  - Back button on browser

Random surfer model
- Start at a random page
- Go out of the current page along one of the links on that page, equiprobably (e.g. 1/3, 1/3, 1/3 for a page with three links)
- Teleporting
  - If a page has no outlinks, always jump to a random page
  - With some fixed probability, randomly jump to any other page; otherwise follow links

The questions
- Given a graph and a teleporting probability, we have some probability of visiting every page
- What is that probability for each page in the graph?

Pagerank summary
- Preprocessing:
  - Given a graph of links, build matrix P
  - From it compute the steady-state probability of each state
  - An entry is a number between 0 and 1: the pagerank of a page
- Query processing:
  - Retrieve pages meeting query
  - Integrate pagerank score with other scoring (e.g. TF-IDF)
  - Rank pages by this combined score
[Image: http://3.bp.blogspot.com/_zago7gjcqai/rkyo5ucmbdi/AAAAAAAACLo/zsHdSlKc-q4/s640/searchology-web-graph.png]
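As a rough sketch of the preprocessing step, the code below approximates the steady-state visit probabilities by repeatedly applying the random surfer rules (power iteration) instead of building the matrix P explicitly. Several details are assumptions rather than something the slides specify: the teleport probability of 0.15, the fixed 100 iterations, teleporting uniformly to any page including the current one, and the four-page toy graph.

```python
def pagerank(links: dict[str, list[str]], teleport: float = 0.15,
             iterations: int = 100) -> dict[str, float]:
    """Approximate steady-state visit probabilities of the random surfer.

    links maps every page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # start at a random page
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for page in pages:
            out = links[page]
            if not out:
                # No outlinks: always jump to a random page.
                for q in pages:
                    new_rank[q] += rank[page] / n
            else:
                # With probability `teleport`, jump to a random page...
                for q in pages:
                    new_rank[q] += teleport * rank[page] / n
                # ...otherwise follow one of the outlinks, equiprobably.
                for q in out:
                    new_rank[q] += (1 - teleport) * rank[page] / len(out)
        rank = new_rank
    return rank


# Hypothetical toy graph; D has no outlinks and no in-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

The scores sum to 1 because each iteration only redistributes probability mass. Page D is reachable only by teleporting, so it receives the lowest score.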

Pagerank problems?
Can still fool pagerank
- link farms
  - Create a bunch of pages that are tightly linked and on topic, then link a few pages to off-topic pages
- link exchanges
  - I'll pay you to link to me
  - I'll link to you if you'll link to me
- buy old URLs
- post on blogs, etc. with URLs
- Create crappy content (but still may seem relevant)

IR Evaluation
- Like any research area, an important component is how to evaluate a system
- What are important features for an IR system?
- How might we automatically evaluate the performance of a system? Compare two systems?
- What data might be useful?

Measures for a search engine
- How fast does it index (how frequently can we update the index)
- How fast does it search
- How big is the index
- Expressiveness of query language
- UI
- Is it free?
- Quality of the search results

Data for evaluation
- Test queries
- Documents
- IR system
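Given test queries and documents, one common way to quantify the quality of search results is to compare a system's ranked output against a set of documents judged relevant for the query. The sketch below computes precision at K, precision, recall and F1; the ranked list and the relevance judgments are hypothetical, as are the function names.

```python
def precision_at_k(ranked: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the top k results that are relevant (k must be > 0)."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def precision_recall_f1(retrieved: list[int], relevant: set[int]):
    """Set-based precision, recall and their harmonic mean (F1)."""
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical system output for one test query, best result first,
# and hypothetical human relevance judgments for that query.
ranked = [3, 7, 1, 9, 4]
relevant = {1, 3, 5}

print(precision_at_k(ranked, relevant, 3))      # 2 of the top 3 are relevant
print(precision_recall_f1(ranked, relevant))    # precision 2/5, recall 2/3
```

Running the same test queries through two systems and comparing these numbers gives a simple automatic comparison, provided the relevance judgments are fixed in advance.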

Many other evaluation measures
- F1
- Precision at K
- 11-point average precision
- mean average precision (MAP) score
- normalized discounted cumulative gain (NDCG)

IR Research

$$$$
- How do search engines make money?
- How much money do they make?

Online advertising $
[Figure source: http://www.iab.net/about_the_iab/recent_press_releases/press_release_archive/press_release/pr-060509]

Where the $ comes from
[Chart categories: keyword search, display, classifieds, other]
[Figure source: http://www.informationweek.com/news/internet/reporting/showarticle.jhtml?articleid=207800456]

3 major types of online ads
- Banner ads
- Keyword linked ads
- Context linked ads

Banner ads
- standardized set of sizes

Paid search components
- User
- Advertiser
- Ad platform/exchange
- Publisher
- Ad server

Paid search query
[Diagram: User -> query -> Publisher, Ad server, Ad platform/exchange]

What is required of the advertiser?
- set of keywords
- ad copy
- landing page
- bids ($)
[Diagram: Advertiser -> Ad platform/exchange, Publisher, Ad server]

A bit more structure than this
  Advertiser (millions of keywords)
    campaign1 (<100K keywords)
      adgroup1, adgroup2, adgroup3 (<100 keywords each)
        keyword1, keyword2

Adgroups
- Adgroups are the key structure
- Adcopy and landing pages are associated at the adcopy level
- Keywords should be tightly themed
  - promotes targeting
  - makes google, yahoo, etc. happy

Creating an AdWords Ad
[Figure]

Behind the scenes
[Diagram: several Advertisers, each with keywords; query; Ad platform/exchange; Publisher; Ad server]

Behind the scenes
[Diagram: several Advertisers, each with keywords; query; Ad platform/exchange; Publisher; Ad server. Matching problem: which advertisers match the query?]

Behind the scenes
For all the matches:
- advertiser A: bid $
- advertiser B: bid $
- advertiser C: bid $
- Other data (site content, ad content, account, ...)
-> Search engine ad ranking

Behind the scenes: keyword auction
Site bids for keyword: dog the bounty hunter
- Web site A: bid $
- Web site B: bid $
- Web site C: bid $
- Other data (site content, ad content, account, ...)
-> Search engine ad ranking
-> Display ranking: Web site B, Web site A, Web site C

Search ad ranking
- Bids are CPC (cost per click)
- How do you think Google determines ad ranking?
    score = CPC * CTR * quality score * randomness
  - CPC * CTR: cost/click * clicks/impression = cost/impression
  - quality score: Is it a good web page? Good adcopy? Adcopy related to keyword? Enhances user experience, promoting return users
  - randomness: don't want people reverse engineering the system; data gathering