Internet Search (COSC 488)
Nazli Goharian, 2005



Outline
- Web: Indexing & Efficiency
  - Partitioned Indexing
  - Index Tiering & other early termination techniques
  - Index in Dynamic Environment
- Improving effectiveness of Web search engines
  - Web page ranking: query log, anchor text, authority/hub, page rank, sponsored search, localized search, social search
  - Result snippets
- Social Search: tagging, collaborative search/filtering, recommender system
- Real-time search
- Peer-to-Peer Search

The Web
- Document collections are scattered across many geographical areas.
- Constraints prohibiting the centralization of data include:
  - Data security
  - Volume
  - Rate of change
  - Political and legal constraints
  - Other proprietary motivations

Web Search
- Parallel and distributed processing
- Web search tools access data distributed on servers worldwide but indexed centrally.
- Most of these systems have a partitioned index on large clusters of servers with centralized control.
- They store pointers, in the form of hypertext links, to various Web servers.

Partitioned Indexing
Partitioning of the index across multiple machines, based on either:
- Terms (global index organization)
  - Each node holds the posting lists for some terms
  - Using a content index, query terms are sent to the nodes holding those terms
  - Higher concurrency level, but larger posting lists
- Documents (local index organization), more common
  - Each node holds a complete index for its own subset of documents (shorter posting lists)
  - Query terms are sent to all nodes
  - Top k results from each node are merged
  - Global statistics (e.g., idf) must be calculated
- A hybrid approach may be used in tiered indexing

Index Tiering
- A popular early termination technique to improve the efficiency of query processing
- Nodes are divided into two tiers: the index of the most popular documents is allocated to tier 1 and the rest to tier 2.
- Search tier 1 first; if it does not return enough results, search tier 2.
- Note: other popular early termination techniques (top-doc and query pruning) were discussed earlier in the semester.
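The document-partitioned lookup and the two-tier fallback described above can be sketched in a few lines. This is only an illustration under stated assumptions: each node is modeled as an in-memory dictionary mapping a term to per-document weights, scores are plain sums of those weights, and the names `search_node`, `search_tier`, and `tiered_search` are hypothetical, not taken from the lecture.

```python
import heapq


def search_node(node_index, query_terms, k):
    """Score documents on one node; node_index maps term -> {doc_id: weight}."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in node_index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


def search_tier(tier_nodes, query_terms, k):
    """Send the query to every node in the tier and merge the per-node top k."""
    candidates = []
    for node_index in tier_nodes:
        candidates.extend(search_node(node_index, query_terms, k))
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


def tiered_search(tier1, tier2, query_terms, k):
    """Search tier 1 first; consult tier 2 only if tier 1 returns fewer than k."""
    results = search_tier(tier1, query_terms, k)
    if len(results) < k:
        # Assumption: tier-2 hits are appended after tier-1 hits, not re-ranked.
        results = results + search_tier(tier2, query_terms, k - len(results))
    return results


tier1 = [{"web": {"d1": 2.0}}, {"web": {"d2": 1.0}, "search": {"d2": 1.5}}]
tier2 = [{"web": {"d9": 0.5}}]
print(tiered_search(tier1, tier2, ["web", "search"], 3))
```

Appending tier-2 results without re-ranking is one possible design choice; a real system might merge both tiers by score instead.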

Distributed Index Construction
- Not possible on a single machine
- Various architectures exist for distributed indexing
- MapReduce architecture (a term-partitioned index)
  - A master node assigns tasks to worker nodes (map workers and reduce workers) to split up the computing job:
    - Map phase: parsing and building localized <term, doc> pairs
    - Reduce phase: combining/merging the posting pairs for each term

MapReduce (Cont'd)
- Map and reduce phases can each be done in parallel on many machines
- A map machine can also be a reducer machine in the process
- Data is broken into pieces (shards), generally 16 MB to 64 MB [128 MB], which are sent to map workers as they finish their previous job
- Map workers generally work on one shard at a time (unless they have more than one CPU): they parse it and generate <term, doc> pairs (which can be combined into <term, doc, tf>)
- Pairs are sorted by term, and then by a secondary key (doc_id)
- The same keys (terms) are assigned to the same reduce worker
- Load should be balanced across the reducers
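The map and reduce phases above can be imitated in a single process to see how the pieces fit together. This is a minimal sketch, not a distributed implementation: shards are lists of `(doc_id, text)` tuples, tokenization is whitespace splitting, and terms are routed to reducers by a CRC32 hash, all of which are assumptions made for illustration. The function names are hypothetical.

```python
import zlib
from collections import Counter, defaultdict


def map_shard(shard):
    """Map phase: parse one shard and emit (term, doc_id, tf) triples."""
    triples = []
    for doc_id, text in shard:
        for term, tf in Counter(text.lower().split()).items():
            triples.append((term, doc_id, tf))
    return triples


def reducer_for(term, num_reducers):
    """Route a term to a reducer so the same term always goes to the same one."""
    return zlib.crc32(term.encode("utf-8")) % num_reducers


def reduce_bucket(triples):
    """Reduce phase: sort by term, then doc_id, and merge into posting lists."""
    postings = defaultdict(list)
    for term, doc_id, tf in sorted(triples):
        postings[term].append((doc_id, tf))
    return dict(postings)


def build_index(shards, num_reducers=2):
    """Run the map phase per shard, partition by term, then reduce per bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for shard in shards:  # each iteration stands in for one map worker
        for triple in map_shard(shard):
            buckets[reducer_for(triple[0], num_reducers)].append(triple)
    return [reduce_bucket(bucket) for bucket in buckets]


shards = [[(1, "web search web")], [(2, "web index")]]
for partition in build_index(shards):
    print(partition)
```

Each returned dictionary corresponds to the part of the term-partitioned index held by one reducer.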

MapReduce (Cont'd)
Taken from: C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press

Query Servers
- Each server has its own disk holding a portion of the index
- Queries are distributed, via centralized control, to the servers that contain the related posting lists
- Common terms may map to many servers
- No single point of resource contention (efficient)
- If a server crashes, its portion of the index is not available

Index in Dynamic Environment
- The data collection is not static
- Reconstruct the index periodically from scratch (many search engines use this)
- Maintain an auxiliary index to store new documents and re-merge it with the existing index
- Maintain multiple indexes (this complicates maintaining collection statistics)
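The auxiliary-index option for a dynamic collection can be illustrated as follows. This is a sketch under assumptions not stated in the slides: both indexes are in-memory dictionaries from term to document ids, the re-merge is triggered once a hypothetical `merge_threshold` number of new documents has accumulated, and queries consult both indexes until the merge happens.

```python
class DynamicIndex:
    """Main index plus a small auxiliary index for newly added documents."""

    def __init__(self, merge_threshold=1000):
        self.main = {}  # term -> sorted list of doc_ids
        self.aux = {}   # term -> list of doc_ids added since the last merge
        self.aux_docs = 0
        self.merge_threshold = merge_threshold

    def add_document(self, doc_id, text):
        """Index a new document in the auxiliary index only."""
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Fold the auxiliary postings into the main index and reset it."""
        for term, doc_ids in self.aux.items():
            merged = set(self.main.get(term, [])) | set(doc_ids)
            self.main[term] = sorted(merged)
        self.aux = {}
        self.aux_docs = 0

    def postings(self, term):
        """Answer a lookup from both indexes so new documents are visible."""
        return sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))


index = DynamicIndex(merge_threshold=2)
index.add_document(1, "web search")
index.add_document(2, "web index")
print(index.postings("web"))
```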

Definitions
- Web graph: each page is a node, and links are directed edges from one node to another
- Out-links (out-degree) of A: links from page A to another page B
- In-links (in-degree) of A: links from other pages to A
- Sink: a page with no out-links
- Source: a page with no in-links
- Static page: a page generated prior to any request
- Dynamic page: a page generated as the result of a request
- Hidden/deep web: pages with no links to them, password-protected pages, or pages reachable only via a form
- Indexable Web: union of the pages indexed by major search engines

Evaluation of Web Search Engines: High Precision Search
- Traditional IR systems are evaluated based on precision and recall.
- Web search engines are evaluated based on the top N documents.
- Recall estimation is very difficult.
- Precision is of limited concern beyond the first screen, as many users do not look past it.
- => How quickly and accurately is the first results screen generated?

Web Page Ranking
Considering both query-dependent and query-independent scores (captured during indexing), a global score is generated for each page:
- Query-dependent score
  - Similarity measures such as cosine, BM25, proximity, ...
- Query-independent score
  - Link analysis (anchor text; popularity metrics such as authorities and hubs, page rank, ...)
  - Sponsored search
  - Localized search
  - Query log analysis
  - etc.

Query Log Analysis
Using user query patterns by time of day, week, month, and year, many optimizations are possible:
- Pre-cache likely Web pages in anticipation of user queries to reduce page access delays, increasing system throughput (efficiency optimization)
- Adjust relevance ranking to tune for certain user queries (accuracy optimization)

Anchor Text
- Short (2-3 terms); describes the linked (destination) page.
- May or may not reflect a different point of view than the page author's.
- Anchor text of links to a document d_i is included in the index for d_i
- Extended anchor text (text surrounding the anchor text) may also be used
- Generally weighted based on frequency (notion of idf)
- Spamming problem

Page Rank
- A scoring mechanism in Web search (trademarked by Google and patented by Stanford)
- Generally calculated at crawl time
- Adjusts a Web page's score using incoming and outgoing links as an indicator of popularity
- A popular page is defined as a page that:
  - Many Web pages link to (in-links)
  - Important (popular) pages link to
- May be affected by link spam

Page Rank
$$\mathrm{PageRank}(A) = \frac{1 - d}{N} + d \sum_{i=1}^{n} \frac{\mathrm{PageRank}(D_i)}{C(D_i)}$$
- D_1 ... D_n: pages that link to A
- C(D_i): number of links out from page D_i
- d: damping factor (between 0 and 1; commonly 0.85)
- N: total number of pages
An iterative algorithm:
- Initially all pages are assigned an arbitrary page rank (1/N), summing to 1
- Iteratively calculate the scores until the new scores do not change significantly
- To converge faster, page ranks may be initialized based on number of in-links, log info, ...

Authorities and Hub
- Various algorithms are based on assigning each retrieved Web page two scores: authority and hub scores (HITS: Hyperlink-Induced Topic Search, 1999)
- Authority page: an authoritative source on a given topic
- Hub page: a page listing pointers to authority pages on a topic
- Authority score: sum of the scores of all the hubs pointing to that authority page
- Hub score: sum of the scores of all authority pages the hub points to
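Returning to the iterative PageRank computation, the sketch below shows one way it might be coded. It assumes the formula has the form (1 - d)/N plus d times the sum of PageRank(D_i)/C(D_i) over in-linking pages, that every page appears as a key of the hypothetical `out_links` dictionary, and that no page is a sink (otherwise the scores would no longer sum to 1). The stopping tolerance `tol` is also an assumption.

```python
def pagerank(out_links, d=0.85, tol=1e-8, max_iter=100):
    """out_links maps each page to the list of pages it links to."""
    pages = list(out_links)
    n_pages = len(pages)
    rank = {page: 1.0 / n_pages for page in pages}  # initial scores sum to 1

    # Invert the graph once so each page knows which pages link to it.
    in_links = {page: [] for page in pages}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    for _ in range(max_iter):
        new_rank = {}
        for page in pages:
            incoming = sum(rank[src] / len(out_links[src]) for src in in_links[page])
            new_rank[page] = (1 - d) / n_pages + d * incoming
        change = sum(abs(new_rank[page] - rank[page]) for page in pages)
        rank = new_rank
        if change < tol:  # scores no longer change significantly
            break
    return rank


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 4))
```

In this toy graph C receives links from both A and B, so it ends up with the highest score, followed by A (linked from C) and then B.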

Computing Authority and Hub Scores
- Retrieve all pages containing the query term t. This is called the root set (~200 pages).
- Create a set that is the union of the root set pages, pages that point to root set pages, and pages that root set pages point to. This is called the base set.
- Use the base set to compute the hub and authority scores.
- An iterative algorithm:
  - Initialize all hub and authority scores to 1
  - Update s(h) and s(a)

Sponsored Search
- Search system vendors sell keywords to advertisers so that whenever such words are issued in a query, a link to the advertiser's desired homepage is returned.
- Sponsored search results are biased towards advertisers with higher bids, higher click frequency of ads, ...
- Significant revenue is generated for search engine vendors via this approach (e.g., per click: 50 cents to 15 dollars)
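Returning to the hub and authority computation, the iterative update over a base set can be sketched as below. The slides only say to initialize scores to 1 and update s(h) and s(a); the fixed iteration count and the length normalization after each round are assumptions added here so the scores stay bounded. The `links` dictionary and function names are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (an assumption, to keep them bounded)."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {page: value / norm for page, value in scores.items()}


def hits(links, iterations=50):
    """links maps each base-set page to the list of pages it points to."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, []))
            for page in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


base_set = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, auth_scores = hits(base_set)
print(max(auth_scores, key=auth_scores.get), max(hub_scores, key=hub_scores.get))
```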

Sponsored Search
- Search engines maintain an advertisement database (description of the advertisement, link to that page, bids, popularity, ...)
- The advertisement database is searched for a match to:
  - query terms
  - keywords extracted from the retrieved result pages (pseudo-relevance feedback, page features, ...)
- Advertisements are ranked based on bids (on keywords) and advertisement popularity (using clickthrough data logs)

Localized Search
- Uses geographic information to modify the ranking of results (in addition to SC scores, link-based scores, ...).
- Geographic information may be derived from:
  - Location of the device sending the query
  - Context of the query: "restaurant near Al Capone's home town", "restaurant near White Sox stadium"
  - Geographic location in the query: "Chicago restaurants"
  - Geographic location in a document's metadata

Result Snippets
- Provide users a short summary (snippet) of each page (title, URL, link to cached page, snippet).
- Static snippets
  - Query independent
  - Created at indexing time and cached
  - Contain the title, n sentences/words, ... (NLP can be used)
- Dynamic snippets
  - Query dependent
  - Created at the time of results scoring
  - Windows of the document, also called KWIC (keyword in context)

Result Snippets
- The index maintains sentence-level information
- Snippet sentences can be picked:
  - Based on query term(s):
    - Heading
    - Location in the document (n-th sentence)
    - Closeness of query terms in the sentence
    - Ratio of query terms in the sentence
    - Unique query terms in the sentence
  - From page metadata
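A dynamic (query-dependent) snippet selector along the lines above might look like the sketch below. It uses only three of the listed signals (number of unique query terms in a sentence, ratio of query terms, and position in the document), and the way they are combined, the regex-based sentence splitter, and the function name are all assumptions for illustration.

```python
import re


def dynamic_snippet(text, query, max_sentences=1):
    """Pick the sentences that best match the query and return them in document order."""
    query_terms = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"\w+", sentence.lower())
        if not words:
            continue
        unique_hits = len(query_terms & set(words))
        ratio = sum(word in query_terms for word in words) / len(words)
        # Sort key: more unique query terms, then higher ratio, then earlier position.
        scored.append((-unique_hits, -ratio, position, sentence))
    scored.sort()
    chosen = sorted(scored[:max_sentences], key=lambda item: item[2])
    return " ".join(item[3] for item in chosen)


page = ("Search engines crawl the Web. PageRank scores pages using links. "
        "Snippets summarize pages.")
print(dynamic_snippet(page, "pagerank links"))
```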

Result Snippets
An effective snippet should (Clarke et al. 2007's clickthrough analysis):
- Contain all the query terms (unless already included in the title)
- Use the page metadata, if needed
- Display the URL and mark the query terms
- Provide meaningful snippets rather than only some keywords

Social Search
- Social search introduces new aspects to search engines
- Village paradigm (collaborative) [Horowitz & Kamvar, WWW '10]
  - Crowd / social network / friends vs. corpus-based
  - Routing questions to potential answerers
- A community of users, sharing a goal or interest, participate in search and interact with each other online
  - YouTube, Twitter, Flickr, Facebook, Myspace, LinkedIn, forums, blogs, online games, ...
- From Wikipedia: "Social search or a social search engine is a type of web search that takes into account the Social Graph of the person initiating the search query. When applied to web search this Social-Graph approach to relevance is in contrast to established algorithmic or machine-based approaches where relevance is determined by analyzing the text of each document or the link structure of the documents."

Real-Time Search
- Traditional search indexes the crawled pages
- Real-time search results of search engines such as Google, Bing, and Yahoo come from a variety of real-time services such as Twitter, Flickr, YouTube, etc.
- Data is received directly from various social media and blogs (by subscribing to social networking sites)
- A filtering engine identifies spam
- Measuring relevance: the ranking is based on time, relevance to the query, number of followers of the author, reputation of a link defined by the frequency of forwarding (re-tweets), ...
- First real-time search engine: Summize in 2007, with real-time trend analysis (later merged with Twitter, 2008)

Social Search
- Documents or websites are deemed relevant if the searcher's social network was also interested in them.
- Nature of queries
  - In many cases opinionated, subjective
  - Query length (in many cases longer than Web queries)
- Index
  - Storing users' behavior (responsiveness, answer quality, expertise)
  - Mapping users to topics

Social Search
Social search ranking is based on a combination of:
- Query-dependent (probability of a good answer to query q by user u)
  - Similarity of results to the query (various rankings: cosine, BM25, proximity, ...)
  - Relatedness of query/results to the user
- Query-independent
  - How many users bookmarked x
  - Social trust
  - Similarity of asker to answerer: user profile similarity, user connectedness

Social Search
Mapping users to topics. An example [Horowitz & Kamvar, WWW '10]:
- User specifies interest/expertise in topics
- Friends of the user indicate the expertise of user u in topics
- Topics are automatically identified from:
  - User's existing online profiles
  - User's homepages, blogs
  - User's status messages (Twitter, Facebook, IM, ...)

Social Search
Measuring connectedness using cosine similarity over various features, such as [Horowitz & Kamvar, WWW '10]:
- Social connection (common friends and affiliations)
- Demographic similarity
- Profile similarity (e.g., common favorite movies)
- Vocabulary match (e.g., IM shortcuts)
- Chattiness match (frequency of follow-up messages)
- Verbosity match (the average length of messages)
- Politeness match (e.g., use of "Thanks!")
- Speed match (responsiveness to other users)
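As a small illustration of measuring connectedness with cosine similarity, the sketch below compares two users represented as sparse feature dictionaries. The feature names and values are made up for the example; how the cited system actually encodes these features is not described in the slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse feature vectors given as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[feature] * v[feature] for feature in shared)
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # no information about at least one user
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors for an asker and a candidate answerer.
asker = {"common_friends": 3.0, "likes_scifi_movies": 1.0, "avg_message_length": 0.4}
answerer = {"common_friends": 2.0, "likes_scifi_movies": 1.0, "avg_message_length": 0.9}
print(round(cosine_similarity(asker, answerer), 3))
```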

Social Search
Sample approach [Karweg et al., CIKM '11]:
- The Social Relevance Score (SRS) ranks the result elements of a query according to their social relevance for the user. It is calculated based on two factors:
  - Engagement intensity: how intensely users interacted with the result
    - Engagement: interaction in terms of recommendations, ratings, status messages
    - Intensity: effort of textual feedback vs. a rating score / thumbs up
  - Trust score: level of trust in those who recommend a link
    - Assigned by users and refined by social network analysis using page rank on the social graph
- SRS(i): social relevance score of document/page i
- x: a user in the social network who interacted with/recommended page i
$$SRS(i) = \sum_{x \in E_i} t_s(x) \cdot e_x(i)$$
where E_i is the set of users who engaged with page i, t_s(x) is the trust score of user x for searcher s, and e_x(i) is the engagement intensity of x with i.

Social Search: Trust
- Trust has been discussed for years in sociology and social psychology
- [Marsh, Ph.D. dissertation, 1994] formalized trust as a computational concept (agents that keep a history of behaviors)
- Trust in peer-to-peer: EigenTrust [Kamvar et al. 2004] (corrupt vs. valid files)
- Various efforts in the formalization of trust in recommender systems and social networks [Swearingen and Sinha, 2001], [Ziegler and Golbeck, 2006]
- The more similar two people are, the greater the trust between them [Ziegler and Golbeck, 2006].
- Trust in a person is a commitment to an action based on a belief that the future actions of that person will lead to a good outcome.
- Example: Alice trusts Bob if she chooses to read a message (commits to an action) that Bob sends her, based on her belief that Bob will not waste her time.
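To make the Social Relevance Score concrete, the sketch below assumes it is a trust-weighted sum of engagement intensities over the users who interacted with a page. The data layout (nested dictionaries), the numeric values, and the function names are hypothetical; how the cited work actually derives trust and intensity values is not shown here.

```python
def social_relevance_score(engagements, trust):
    """engagements: {user: engagement intensity with the page}; trust: {user: trust score}."""
    return sum(
        trust.get(user, 0.0) * intensity  # unknown users contribute nothing
        for user, intensity in engagements.items()
    )


def rank_results(results, trust):
    """Order pages by descending social relevance for one searcher."""
    return sorted(
        results,
        key=lambda page: social_relevance_score(results[page], trust),
        reverse=True,
    )


# Hypothetical trust scores from the searcher's point of view.
trust = {"ann": 0.9, "bob": 0.2}
results = {
    "p1": {"bob": 1.0},              # one weakly trusted user engaged strongly
    "p2": {"ann": 0.5, "bob": 0.5},  # a highly trusted user engaged moderately
}
print(rank_results(results, trust))
```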

Tagging
- Social media sites allow users to tag the data
- User tags act as manual indexing of data, in addition to automatic indexing
- User tags serve as a folksonomy
- Tags are used to organize and search data
- Challenges with tagged data:
  - Vocabulary mismatch
  - Noisy or spam tags
  - Missing tags

Searching Tagged Data: Vocabulary Mismatch Problem
- Tag keywords describe textual or non-textual data and are used to search for items
- Tags are very sparse (only a few keywords)
- Boolean search can lead to high precision/low recall (conjunctive) or high recall/low precision (disjunctive)
- To reduce vocabulary mismatch, perform stemming or pseudo-relevance feedback
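The precision/recall trade-off of Boolean search over sparse tags is easy to see in a toy example. The sketch below assumes items are stored as a dictionary from item id to a list of tags and that matching is case-insensitive exact match; the function name and sample data are hypothetical.

```python
def tag_search(item_tags, query_tags, mode="and"):
    """Boolean search over tags: 'and' is conjunctive, 'or' is disjunctive."""
    query = {tag.lower() for tag in query_tags}
    matches = []
    for item, tags in item_tags.items():
        tags = {tag.lower() for tag in tags}
        if mode == "and":
            hit = query <= tags        # every query tag must be present
        elif mode == "or":
            hit = bool(query & tags)   # any query tag is enough
        else:
            raise ValueError("mode must be 'and' or 'or'")
        if hit:
            matches.append(item)
    return sorted(matches)


photos = {
    "photo1": ["Chicago", "skyline"],
    "photo2": ["chicago"],
    "photo3": ["beach"],
}
print(tag_search(photos, ["chicago", "skyline"], mode="and"))
print(tag_search(photos, ["chicago", "skyline"], mode="or"))
```

The conjunctive query returns only the item carrying both tags, while the disjunctive query also returns items that match just one of them.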

Searching Tagged Data: Noisy and Spam Tags
- Spam, misspelled, or non-relevant tags mislead search
- Some incentive must be provided to users to report spam tags and to enter good-quality tags.
- Log and statistical information may help to identify spam tags

Searching Tagged Data: Missing Tags
- Automatically generate tags for items with missing tags, using:
  - Term weights of the textual representation of the item
  - Classification of the item to a label (i.e., a tag)


Tag Clouds
- The most popular tags are presented to users to provide a wider view of the collection
- A tag cloud displays the tags as a weighted list
- The font size is proportional to the weight
Thanks to: tagcloud generator & F. Silvestri, CNR, Italy, S. Orlando, U. of Venice, Italy

Recommender Systems
