Introduction to Search Engine Technology CS , Technion, Winter 2011/12

Similar documents
Information Retrieval Spring Web retrieval

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Information Retrieval May 15. Web retrieval

Information Retrieval

Mining Web Data. Lijun Zhang

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information retrieval

Information Retrieval

Mining Web Data. Lijun Zhang

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

SEARCH ENGINE INSIDE OUT

Link Structure Analysis

A Survey on Web Information Retrieval Technologies

Searching the Web What is this Page Known for? Luis De Alba

Chapter 27 Introduction to Information Retrieval and Web Search

CS 345A Data Mining Lecture 1. Introduction to Web Mining

DATA MINING II - 1DL460. Spring 2014"

DATA MINING - 1DL105, 1DL111

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

Information Retrieval. Lecture 9 - Web search basics

Lecture #3: PageRank Algorithm The Mathematics of Google Search

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

THE WEB SEARCH ENGINE

Jargon Buster. Ad Network. Analytics or Web Analytics Tools. Avatar. App (Application) Blog. Banner Ad

Information Retrieval

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

Search Engines. Information Retrieval in Practice

6 WAYS Google s First Page

Link Analysis and Web Search

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Chapter 2. Architecture of a Search Engine

CS47300: Web Information Search and Management

Lec 8: Adaptive Information Retrieval 2

Ranking of ads. Sponsored Search

Searching the Web [Arasu 01]

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Review: Searching the Web [Arasu 2001]

Web Search Basics. Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University

NBA 600: Day 15 Online Search 116 March Daniel Huttenlocher

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

SEO. Definitions/Acronyms. Definitions/Acronyms

Query Evaluation Strategies

5. search engine marketing

Information Retrieval. (M&S Ch 15)

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

World Wide Web has specific challenges and opportunities

Web Search Basics Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

Chapter 6: Information Retrieval and Web Search. An introduction

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Part I: Data Mining Foundations

An Oracle White Paper October Oracle Social Cloud Platform Text Analytics

CS Search Engine Technology

Web Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India

seosummit seosummit April 24-26, 2017 Copyright 2017 Rebecca Gill & ithemes

COMP Page Rank

Informa/on Retrieval. Text Search. CISC437/637, Lecture #23 Ben CartereAe. Consider a database consis/ng of long textual informa/on fields

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Overture Advertiser Workbook. Chapter 4: Tracking Your Results

Detecting Spam Web Pages

Bruno Martins. 1 st Semester 2012/2013

SEO and Monetizing The Content. Digital 2011 March 30 th Thinking on a different level

An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia

Advertising Network Affiliate Marketing Algorithm Analytics Auto responder autoresponder Backlinks Blog

Module 1: Internet Basics for Web Development (II)

An Introduction to Search Engines and Web Navigation

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Lecture 8: Linkage algorithms and web search

THE QUICK AND EASY GUIDE

Brief (non-technical) history

The Web: Concepts and Technology. January 15: Course Overview

SME Developing and managing your online presence. Presented by: Rasheed Girvan Global Directories

Information Retrieval

Information Retrieval and Web Search Engines

THE HISTORY & EVOLUTION OF SEARCH

WebReach Product Glossary

google SEO UpdatE the RiSE Of NOt provided and hummingbird october 2013

Lecture 5: Information Retrieval using the Vector Space Model

Information Retrieval. Lecture 11 - Link analysis

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

INTRODUCTION. Chapter GENERAL

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

Administrative. Web crawlers. Web Crawlers and Link Analysis!

modern database systems lecture 4 : information retrieval

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

What Is Voice SEO and Why Should My Site Be Optimized For Voice Search?

DATA MINING II - 1DL460. Spring 2017

Information Retrieval and Web Search

SEO WITH SHOPIFY: DOES SHOPIFY HAVE GOOD SEO?

Query Evaluation Strategies

60-538: Information Retrieval

New Technology Briefing

Search Engine Optimization. MBA 563 Week 6

Crawler. Crawler. Crawler. Crawler. Anchors. URL Resolver Indexer. Barrels. Doc Index Sorter. Sorter. URL Server

Information Retrieval II

Web Search Basics. Berlin Chen Department t of Computer Science & Information Engineering National Taiwan Normal University

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Information Networks. Hacettepe University Department of Information Management DOK 422: Information Networks

Transcription:

Introduction to Search Engine Technology CS 236621, Technion, Winter 2011/12 Ronny Lempel Yahoo! Labs, Haifa What this Course is All About Will try to sample the science that drives Web search engines, highlighting several areas of ongoing world-wide (academic and industrial) research Algorithms, data structures, system issues and theoretical problems Just the tip of the iceberg won t be able to discuss everything This is an introductory lecture - many of the areas will be further elaborated upon during the course 22 October 2011 CS236621 Search Engine Technology 2 1

Administration Web page: http://webcourse.cs.technion.ac.il/236621/winter2011-2012/en/ Reception hour: by prior appointment, before class Email: cs236620@yahoo.com Teaching Assistant: Naama Kraus (tutorial on Thursday, 9:30) Grading: (20 points, after week 2) Problem sheet (25 points, after week 6) Problem sheet + programming assignment tweaking the Lucene open-source search library (20 points, after week 10) Problem sheet (35 points, after semester) Final assigment Some HW assignments will be in pairs, some will be individual 22 October 2011 CS236621 Search Engine Technology 3 Mission Impossible? Search engines: Crawl and index tens of billions of documents Answer hundreds of millions of queries per day Devote less than 1 second of processing time per query execution Users: Submit very short queries (averaging about 2.6 terms) Expect to receive the most relevant results on the Web In a blink of an eye In terms of 1990 IR research - almost unimaginable! 22 October 2011 CS236621 Search Engine Technology 4 2

The Static Web Corpus Bulk 10s of billions of static pages; doubling every 8-24 months Lack of stability Estimates vary widely (half life of page is several months) Lexicon Size 10s-100s of millions of words Heterogeneity Type of documents.. Text, pictures, audio, video, scripts, Quality From junk to the best content on earth Languages 100+ Duplication Syntactic 30%-40% (near) duplicates Semantic?? Complex graph topology 8 links/page in the average; bow-tie structure Spam 100s of millions of pages 22 October 2011 CS236621 Search Engine Technology 5 Web Searchers - Observations Make ill defined queries Short (2.54 terms average, 80% contain less than 3 words) Use imprecise (and often misspelled) terms Unfamiliar with query syntax (80% queries without operator) Wide variance in information needs, expectations, education/knowledge, IP bandwidth, human patience Different modalities (mobile, desktop) = different needs, even with same person Specific behavior 85% look over one result screen only (mostly above the fold ) 78% of queries are not modified Overall, we as users are investing low cognitive effort per query (formulating and looking at results) 22 October 2011 CS236621 Search Engine Technology 6 3

Evolution of Search Engines Pre-search: human edited directories High quality, authoritative, reliable But inherently non-scalable and doesn t cater to niches First generation use only on page, text data 1994, the Yahoo! directory Information Retrieval techniques such as TF/IDF, Vector Space model adapted to HTML Scalable, automatic 1995-1997 AV, Excite, Lycos, etc But ranking is poor and easy to spam 22 October 2011 CS236621 Search Engine Technology 7 Evolution of Search Engines (2) Second generation - use off-page, web-specific data Link analysis Anchor-text (how authors refer to a page when linking to it) Click-through data (what results people click on) From 1998. Made popular by Google and used by everyone now Ranking is much improved, especially for navigational queries But popularity rules and results are very homogenous Spam evolving into a cat-and-mouse game 22 October 2011 CS236621 Search Engine Technology 8 4

Evolution of Search Engines (3) Third generation answer the need behind the query, focusing on the need rather than on the keywords Simple context determination Spatial (user location, target location) Query stream (previous queries) Shortcuts via regular expressions Integrate multiple sources of data Maps, videos, images, news First half of decade: Google, Yahoo!, Ask Search assistance tools that help users create good queries and streamline search efforts Spell correction, expand/narrow suggestions, facets From 2006 and on: some of this happens as you type, before the query is actually evaluated (AJAX) 22 October 2011 CS236621 Search Engine Technology 9 Evolution of Search Engines (3.5) More recent efforts (2005 to date): Integration of search and text analysis, e.g. understanding concepts in documents and matching those to queries Named entities, directives/modifiers ( download, reviews etc.) Social search making use of UGC (user-generated content), social networks, etc. Optimizing the experience delivered by the search result page as a whole rather than as a collection of 10 independent links Page real-estate is scarce, need to make every pixel count Emphasis on diversity of results More elaborate personalization Apps 22 October 2011 CS236621 Search Engine Technology 10 5

Anatomy of a Search Engine Image taken from Searching the Web, Arasu et al., TOIT 1(1), 2001 22 October 2011 CS236621 Search Engine Technology 11 Weeks 1-2: Introduction to Information Retrieval 6

Classic Information Retrieval Given a query, the system retrieves a set B of documents Every retrieved document is either relevant or irrelevant to the query Quality metrics: Recall : (A B) / A Precision : (A B) / B 22 October 2011 CS236621 Search Engine Technology 13 Recall and Precision on the Web Relevance of document to queries is not binary there are many shades of gray Broad-topic queries: abundance problem Precision is the dominating factor: users mostly satisfied with a few good results (a few authoritative pages) Narrow-topic queries: Find a needle in an enormous haystack Recall demands engines cover significant portions of the Web Common measure: precision@10 Nowadays larger emphasis on diversity 22 October 2011 CS236621 Search Engine Technology 14 7

The Boolean Model Simple model based on set theory Queries are specified as Boolean expressions. indexing units are words Boolean Operator: OR, AND, NOT Example: q = java AND compilers AND ( unix OR linux ) Relevance: A document is considered relevant to the query if it satisfies the query Boolean expression. Pros: Fast (bitmap vector operations) Binary decision (doc is relevant or not) Cons: Binary decision - What about ranking? Who speaks Boolean? 22 October 2011 CS236621 Search Engine Technology 15 Vector Space Model Documents and queries are both represented as vectors in a highdimensional vector space in which each word is a dimension Relevance is measured by similarity, i.e. by the cosine of the angle between doc-vectors and the query-vector j d 2 d 1 q d Sim( q, d) = = q d i i w w iq 2 iq w i id w 2 id α 1 q i 22 October 2011 CS236621 Search Engine Technology 16 8

Probabilistic Retrieval Models Attempt to predict the probability that a given document is relevant to a given query: Pr(R d,q) Rank retrieved documents according to this probability of relevance Probability Ranking Principle: if a retrieval system s response to each request is a ranking of the documents in decreasing probability of usefulness to the user then the overall effectiveness of the system to its users will be the best - Stephen E. Robertson, J. Documentation 1977 Rely on accurate estimates of probabilities 22 October 2011 CS236621 Search Engine Technology 17 Classic and Modern IR Issues Polysemy (ambiguity): does Jordan refer to Michael Jordan, the Jordan River, or the Hashemite Kingdom of Jordan? Can we ensure that our SERP * will represent all aspects? Synonymy: elevator/lift, movie/film Do the models hold when queries are very short? Short queries require the addition of proximity considerations (query term hits should be in proximity to each other) how does this fit with the models? Information Retrieval merits a course of its own! * SERP Search Results Page 22 October 2011 CS236621 Search Engine Technology 18 9

Weeks 3-6: The Inverted Index The Inverted Index Data Structure An Inverted Index consists of 2 elements: The lexicon (AKA vocabulary, dictionary) The inverted file (AKA postings file) The Lexicon is the set of all indexing units (terms) in a given collection The Inverted file is a set of postings lists a list per term. The list consists of posting elements. The list of term t holds the locations (documents + offsets therein) where t appeared. Additional data (payload) is often encoded per location Many variations and degrees of freedom Supports efficient lookup from word to occurrences Essentially a way of representing and accessing (by row) a huge and very sparse matrix 22 October 2011 CS236621 Search Engine Technology 20 10

Inverted Index Example The good 1 2 the bad 3 4 and the ugly 5 6 7 Doc #1 As good as it gets, and more 1 2 3 4 5 6 7 Term Doc #2 Doc #3 Lexicon DF The 1 Good 2 Bad 2 And 3 Ugly 2 As 1 It 2 Gets 1 More 2 Is 1 (1; 1,3,6) (1; 2) (2; 2) (1; 4) (3; 5) (1; 5) (2; 6) (3; 4,8) (1; 7) (3; 3) (2; 1,3) (2; 4) (3; 2,6) (2; 5) (2, 7) (3, 9) (3; 1,7) Is it ugly and bad? It is, and more! 1 2 3 4 5 6 7 8 9 Basic questions: How to efficiently build an inverted index? How does the inverted index efficiently support the various retrieval models (Boolean, Vector Space, etc.)? How does the inverted index handle complex operators, e.g. phrase queries? 22 October 2011 CS236621 Search Engine Technology 21 Course Will Cover Index representation, and algorithms for batch index build + lexicon representation Evaluation schemes (term/document at a time), early termination and pruning schemes Index compression and pruning Efficient identification of near-duplicate pages Large-scale search engines have huge indices, which are distributed across thousands of machines. How can an index be distributed, and what are the implications on query processing? 22 October 2011 CS236621 Search Engine Technology 22 11

Weeks 7-8: Link Structure Analysis The Information Need Behind the Query [Broder, SIGIR Forum 36(2), 2002] Informational I want to learn more about the roman empire Many possible fine results Navigational Find me the home page of el al In principle, a single correct result In practice, correct result may depend on language/locale Transactional I want to buy a digital camera Good results: e-tailors that sell digital cameras 22 October 2011 CS236621 Search Engine Technology 24 12

Navigational Queries & Anchor Text Anchor text the highlighted clickable text of a link Physically in page A, but actually describes the content of page B Many times, concisely defines page B Often, better than the text present on page B itself Anchor text is the most important factor in treating navigational queries Case study: the query engine on Microsoft s Live Search Famous pop-culture examples: French Military Victories, miserable failure 22 October 2011 CS236621 Search Engine Technology 25 Case study: the query engine on Microsoft s Live Search The word engine does not appear Anywhere on the page, nor in its source 22 October 2011 CS236621 Search Engine Technology 26 13

Link-Structure Analysis Beyond anchor-text: the connectivity patterns between Web pages contain a gold mine of information A link from page a to page b can often be interpreted as: 1. A recommendation, by a s author, of the contents of b. 2. Evidence that pages a,b share some topic of interest. A co-citation of a and b (by a third page c) may also constitute evidence that a,b share some topic of interest a c b 22 October 2011 CS236621 Search Engine Technology 27 Hubs and Authorities Notions proposed by Jon Kleinberg (1998) Two types of quality Web pages that pertain to a given topic hubs: resource lists, containing links to many authorities hubs authorities Authorities: pages containing authoritative content on the topic HITS (Hypertext Induced Topic Search): an algorithm that identifies hubs and authorities, given a topic of interest t Based on the Perron-Frobenius theory of non-negative matrices 22 October 2011 CS236621 Search Engine Technology 28 14

Example an Authority for the Query movies 22 October 2011 CS236621 Search Engine Technology 29 Example a Hub for the Query movies 22 October 2011 CS236621 Search Engine Technology 30 15

PageRank (Brin & Page, 1998) Named after Google s co-founder, Larry Page A global, query independent importance measure of Web pages A page is considered important if it receives many links from important pages Based on Markov chains and random walks 22 October 2011 CS236621 Search Engine Technology 31 PageRank: Random Surfer Model A random surfer moves from page to page. Upon leaving page p, the surfer chooses one of two actions: 1. Follows an outgoing link of p (chosen uniformly at random), w.p. d 2. Jumps to an arbitrary Web page (chosen uniformly at random), w.p. 1-d The vector of PageRanks is the stationary distribution of this (ergodic) random walk 22 October 2011 CS236621 Search Engine Technology 32 16

Link Analysis Algorithms Many variants and refinements of both HITS and PageRank have been proposed Course will cover SALSA & Topic-Sensitive PageRank Other approaches include: Max-flow techniques [Flake et al., SIGKDD 2000] Machine learning and Bayesian techniques Examples of applications: Ranking pages (topic specific/global importance) Categorization, clustering, finding related pages Identifying virtual communities A wealth of literature 22 October 2011 CS236621 Search Engine Technology 33 Stability of Link Analysis IR algorithms should be robust small changes in the input should not dramatically alter the results The Web changes constantly: pages and links are created, deleted and modified? Do the known link-based algorithms produce stable scores/rankings?? Does stability of scores imply stability of rankings? Different research approaches have been applied to these questions, some problems are still unsolved 22 October 2011 CS236621 Search Engine Technology 34 17

Size & Dynamics of the Web Billions of pages, tens of millions of sites Hundreds of terabytes (10 14 ) of data Projected growth rate: doubling in size every two years (some estimate rate is much quicker) New pages are added; existing pages are modified and deleted Unstructured information in chaotic disorder Many media types Widely distributed, both geographically and organizationally 22 October 2011 CS236621 Search Engine Technology 35 The Web as a Graph Pages as graph nodes, hyperlinks as edges Sometimes sites are taken as the nodes Some natural questions: 1. Distribution of the number of in-links to a page 2. Distribution of the number of out-links from a page 3. Distribution of the number of pages in a site 4. Connectivity: is it possible to reach most pages from most pages? 5. Are there models that explain how the Web graph evolved to its current shape? 22 October 2011 CS236621 Search Engine Technology 36 18

Weeks 9-10: Crawlers and Caches Crawlers - Framework The World Wide Web can be viewed as a directed graph Nodes are web objects, such as html pages, image files, pdf files, forms, etc... Nodes are uniquely identified by URLs Edges are html hyperlinks, pointing from an html page to its embedded objects or to other web objects Edges are uniquely identified by their source and target nodes 22 October 2011 CS236621 Search Engine Technology 38 19

Generic Web Crawling Algorithm Given a root set of distinct URLs: Put the root URLs in a queue While queue is not empty: Take the first URL x out of the queue Retrieve the contents of x from the web Do whatever you want with it Mark x as visited If x is an html page, parse it to find hyperlink URLs For each hyperlink URL y within x: If y is not marked visited, put y into queue This is often the computational bottleneck 22 October 2011 CS236621 Search Engine Technology 39 Crawler Design Issues Writing a simple crawler is a fairly easy task one can be written in Java in a few lines of code However, there are many issues that need to be considered when writing a robust Web-scale crawler: Redirections Encrypted and password protected data Politeness - avoiding server load and respecting Robot Exclusion Protocol Script handling Language and encoding issues The challenge of dynamic pages Duplicate pages Crawl prioritization and variable revisit rates to ensure freshness Focused crawling Concurrency and distribution issues And many many more 22 October 2011 CS236621 Search Engine Technology 40 20

Query Result Caching Exploits locality of reference in the query streams that are submitted to search engines: Query popularities vary widely, from the extremely popular to the very rare Although a minority, significant number of users still view multiple result pages per query (motivates prefetching) Successful caching and prefetching of results can: Lower the number/cost of query executions Shorten the engine s response time and lower its hardware requirements 22 October 2011 CS236621 Search Engine Technology 41 Query Log Analysis Analysis of 7160190 queries submitted to AltaVista during September 2001: The 7160190 queries requested 4496503 distinct result pages, belonging to 2657410 distinct query strings 67% of the 2657410 strings were only requested once; the most popular string was queried 31546 times 48% of the result pages were only requested once; the 50 most requested pages account for almost 2% of all requests Qualitative behavior consistent with analyses of query logs of other engines 22 October 2011 CS236621 Search Engine Technology 42 21

Result Caching and Prefetching Simple LRU-based caches of query results can achieve cache hit-rates of about 30% (Markatos, 2000) Prefetching deeper result pages, along with more complex caching policies, can achieve hit rates of over 50% What is the cost associated with prefetching additional results? What happens when the index changes? BTW, what about caching of other data in search engines, e.g. postings lists? 22 October 2011 CS236621 Search Engine Technology 43 Weeks 11-14: Users, Advertising and Long-Tail Economics 22

Search Assistance Help users complete their online tasks quickly and easily Examples on next slides, from the top 4 search engines, show: Task-oriented mashups Query completion tools Suggestions of related concepts and queries Federation of multiple sources of information Challenges: Which experience to trigger for which query/user? Which data items to trigger given an experience? Lots(!!!) of data mining including usage mining - behind the scenes The bigger picture: which (layout), what (content), where (on the page) and how (to render) 22 October 2011 CS236621 Search Engine Technology 45 22 October 2011 CS236621 Search Engine Technology 46 23

22 October 2011 CS236621 Search Engine Technology 47 22 October 2011 CS236621 Search Engine Technology 48 24

22 October 2011 CS236621 Search Engine Technology 49 The Advertising Market Internet advertising is a $10 10 business that is growing faster than old media advertising (radio, TV, newspapers & magazines, mail, outdoors) Online advertising budgets still lagging in proportion to time spent online, which is growing fast at the expense of old media Seen as a driver to continued growth of online advertising market Traditional advertising: Few, expensive opportunities Targeting en-masse, by immediate context only Difficult to measure effectiveness Internet advertising: Billions of opportunities daily Open to personalization via rich context of impression Effectiveness is measurable: can measure click-through rates as % of impressions, and conversions as % of clicks 22 October 2011 CS236621 Search Engine Technology 50 25

Internet Advertising Types: Sponsored Search textual ads served on search results pages, triggered by query terms on which advertisers bid 22 October 2011 CS236621 Search Engine Technology 51 Internet Advertising Types: Contextual Ads (a.k.a. Content Match) textual ads served on third party ( publisher ) pages, triggered by content of page on which advertisers bid 22 October 2011 CS236621 Search Engine Technology 52 26

Internet Advertising Types: Display Ads Graphical elements (e.g. banners) of standard sizes, served on publisher pages, mainly targeting audience demographics Guaranteed delivery (GD): advertiser buys a display campaign months in advance, requesting ## impressions to certain demographics in a given time period; ad agency pays fine for under-delivery Non-guaranteed delivery (NGD): advertiser bids in real time for impressions, with no guarantees that ads will be shown 22 October 2011 CS236621 Search Engine Technology 53 Internet Advertising Models 1. Cost per Impression (sometimes called Cost per Mille, i.e. 1000 impressions) advertiser pays for ad to be shown Main model for display advertising 2. Cost per Click advertiser only pays when ad is clicked by user Main model for sponsored search and content match Sponsored search, 2005: 50% of clicks < $0.25, 80% < $0.50 3. Cost per Action advertiser only pays when the user s interaction with the ad is followed by some transaction (not necessarily purchase) 22 October 2011 CS236621 Search Engine Technology 54 27

Computational Advertising Challenges Conflicting goals: Advertiser: campaign effectiveness maximize ROI Publisher: revenue without jeopardizing image of site User: relevant ads, or at least ads that do not harm the experience Ad agency (search engine): own revenue, while balancing goals of other parties; sometimes ad agency is also the publisher Ad agency challenges: Which ads to trigger per query/page/impression? How to order them? What to charge for them? In general, how to set the rules of the marketplace so as to ensure goals of parties are met (Game Theory/Mechanism Design)? Which guaranteed delivery contracts to accept? How to balance GD with NGD? Helping advertisers choose bid phrases effectively Combating spam and fraud Scale!! 22 October 2011 CS236621 Search Engine Technology 55 The Long Tail How the infinite choice available in e-commerce and the ease of publishing online have redefined the business of media & entertainment Mottos: The biggest money is in the smallest sales Embrace niches 22 October 2011 CS236621 Search Engine Technology 56 28

Long Tail Economics and Recommender Systems People cannot intelligently navigate product or media catalogs consisting of millions of items e-inventories of books, songs, movies, albums are practically infinite In particular, niche audiences cannot find niche content Recommender systems, collaborative filtering and social tagging services are key to Long Tail economics Find an item you like, and we ll point you to similar items Pioneered by Amazon; today, used everywhere 22 October 2011 CS236621 Search Engine Technology 57 29