Search Engine Overview

1 Search Engine Overview - System, Algorithms and Challenges Ji-Rong Wen Senior Researcher / Group Manager Web Search and Mining Group Microsoft Research Asia

2 Outline
- An Introduction to Search Engine Architecture
- Top 10 Myths about Search Engine
- Computer Science in Search Engine
- Top 10 Challenges in Search Engine

4 Architecture of a Typical Search Engine
[Diagram]
- Online part: Query, User Interface, Caching, Indexing and Ranking
- Offline part: Crawler (fetching from the Web), Web Page Parser, Index Builder, Web Graph Builder, Link Analysis
- Data: Pages, Cached Pages, Links & Anchors, Link Map, Web Graph, Page Ranks, Page & Site Statistics, Inverted Index

5 Architecture: Crawler
Functions
- Fetch Web pages by following hyperlinks
- Refresh pages periodically
Core Problems
- Limited bandwidth & storage vs. huge data volume
- Page update frequency
Solutions
- Prioritize crawling based on page ranks and other statistics
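
The prioritization idea on this slide can be sketched as a crawl frontier backed by a priority queue. This is only an illustration: the class name `CrawlFrontier`, the example URLs, and the scores are hypothetical, and a real crawler would derive the score from page ranks and the other statistics the slide mentions.

```python
import heapq

class CrawlFrontier:
    """Priority queue of URLs; the URL with the highest score is fetched first."""

    def __init__(self):
        self._heap = []       # entries are (-score, insertion order, url)
        self._seen = set()    # URLs already queued, so each is queued once
        self._counter = 0     # tie-breaker that keeps insertion order stable

    def add(self, url, score):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

# Hypothetical URLs and scores, for illustration only
frontier = CrawlFrontier()
frontier.add("http://example.com/a", 0.2)
frontier.add("http://example.com/b", 0.9)
frontier.add("http://example.com/c", 0.5)
crawl_order = [frontier.pop() for _ in range(len(frontier))]
```

Periodic refresh could reuse the same structure by re-adding a page with a score that grows with its estimated update frequency, which is an assumption beyond what the slide states.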

6 Homework (1)
How to estimate the refresh rate of a page?
References
- Junghoo Cho, Hector Garcia-Molina. Effective page refresh policies for Web crawlers. ACM Transactions on Database Systems, 28(4), December.
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Computer Networks and ISDN Systems, 30(1-7).

7 Architecture: Page Parser
Functions
- Extract data streams for indexing
  - Title: words in <title> </title>
  - URL
  - Anchor text
  - Body: plain text, H1-H6, bold, italic, etc.; large, medium, small
- Build partial link map
- Send found hyperlinks to crawler
Core Problems
- Which features should be extracted?

8 Traditional Text Retrieval
Relevance ranking based on term distribution
- Term frequency (TF) * Inverse document frequency (IDF)
- Length normalization
Query: Iran nuke agenda
[Figure: the CNN.com International home page treated as plain text (headline "IAEA: Iran had secret nuke agenda", the story text, and the surrounding navigation and headlines), scored against the query]
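
A minimal sketch of the TF * IDF scoring with length normalization named on this slide. The three tiny documents are made up from headlines on the example page, and dividing by the square root of the document length is just one simple normalization choice, not necessarily the one the lecture had in mind.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum of tf * idf over the query terms, divided by sqrt(document length)."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = counts[term]              # term frequency in this document
        df = doc_freq.get(term, 0)     # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        score += tf * math.log(num_docs / df)
    # Length normalization (assumed form): longer documents are damped
    return score / math.sqrt(len(doc_terms)) if doc_terms else 0.0

# Hypothetical mini-collection built from headlines on the example page
docs = {
    "d1": "iran had secret nuke agenda".split(),
    "d2": "explosions rock baghdad".split(),
    "d3": "iran saudi bomb suspects questioned".split(),
}
doc_freq = Counter(term for terms in docs.values() for term in set(terms))
query = ["iran", "nuke", "agenda"]
scores = {d: tf_idf_score(query, terms, doc_freq, len(docs)) for d, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `d1` matches all three query terms, `d3` matches only the common term "iran", and `d2` matches nothing, so the ranking follows that order.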

9 What's More for Web Search
Compared to plain text, a web page has many rich structures
- Different term types and formats
- Hyperlink structure
- 2D visual layout structure
Example (the CNN.com page)
- Title: CNN.com International
- H1: IAEA: Iran had secret nuke agenda
- H3: EXPLOSIONS ROCK BAGHDAD
- TEXT BODY (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials including low enriched uranium and plutonium that could be used to develop nuclear weapons according to a confidential report obtained by CNN
- Hyperlink: URL; Anchor Text: Al Qaeda, Iran
- Image: URL; Alt & Caption: Iran nuke
- Anchor Text: CNN Homepage News

10 Homework (2)
Write a Web page parser to get the terms in title, URL, and body, with the position and font information for each term.
References
- W3C HTML 4.01 Specification
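
One possible starting point for this homework, using Python's standard `html.parser` module. It is a sketch under simplifying assumptions: "font information" is reduced to the set of enclosing formatting tags, script and style content is not filtered out, and the names `TermParser`, `FORMAT_TAGS`, and `url_terms` are made up for the example.

```python
import re
from html.parser import HTMLParser

# Tags treated as "font information" in this sketch (an assumption)
FORMAT_TAGS = {"b", "i", "strong", "em", "h1", "h2", "h3", "h4", "h5", "h6"}

class TermParser(HTMLParser):
    """Collects (field, position, term, fonts) for title and body terms."""

    def __init__(self):
        super().__init__()
        self.stack = []                          # open <title> / formatting tags
        self.position = {"title": 0, "body": 0}  # next term position per field
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in FORMAT_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close the most recently opened matching tag
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx]

    def handle_data(self, data):
        field = "title" if "title" in self.stack else "body"
        fonts = tuple(t for t in self.stack if t in FORMAT_TAGS)
        for term in re.findall(r"\w+", data.lower()):
            self.hits.append((field, self.position[field], term, fonts))
            self.position[field] += 1

def url_terms(url):
    """Split a URL into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", url.lower())

page = ("<html><head><title>CNN.com International</title></head><body>"
        "<h1>IAEA: Iran had <b>secret</b> nuke agenda</h1><p>Full story</p>"
        "</body></html>")
parser = TermParser()
parser.feed(page)
parser.close()
hits = parser.hits
```

Each hit records the field, the term's position within that field, and the formatting tags that enclose it, which is the kind of record an index builder can later turn into hits in the inverted index.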

11 Architecture: Index Builder
Functions
- Build efficient index based on parsed page data
[Diagram: inverted index layout] TermID, DocNum, then for each document: DocID, HitNum, Hit, Hit, Hit
Core Problems
- Index structure
- Efficiency vs. limited memory & distributed
Solutions
- Distributed indexing
- Partition by document, not partition by term

12 Indexing Techniques
- Inverted Index
- Signature file
- Suffix Tree

13 Inverted Index (1/3)
Documents:
  doc1: dog, cat, animal
  doc2: dog
  doc3: cat
  doc4: cat, animal
  doc5: dog, animal
Inverted Index:
  dog: doc1, doc2, doc5
  cat: doc1, doc3, doc4
  animal: doc1, doc4, doc5
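
The example on this slide can be reproduced in a few lines. This is an in-memory sketch only; the function name `build_inverted_index` is chosen for the illustration.

```python
from collections import defaultdict

# The five documents from the slide
docs = {
    "doc1": ["dog", "cat", "animal"],
    "doc2": ["dog"],
    "doc3": ["cat"],
    "doc4": ["cat", "animal"],
    "doc5": ["dog", "animal"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):                    # visit docs in ID order
        for term in dict.fromkeys(docs[doc_id]):   # drop repeated terms
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
```

Because documents are visited in ID order, every posting list comes out sorted, which is what makes list intersection cheap at query time.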

14 Inverted Index (2/3)
[Diagram]
Document collection:
  doc-1: term 1,1, term 1,2, term 1,3, ...
  doc-2: term 2,1, term 2,2, term 2,3, ...
  doc-n: term n,1, term n,2, term n,3, ...
Inverted index:
  term-1, term-2, ..., term-m, each pointing to its list of docs

15 Inverted Index (3/3)
Postings with positions (each entry: document, position)
  bill: D113, 4; D149, 5; D196, 1; D222, 2; D267, 6; D289, 9; D345, 3; D376, 8; D453, 7
  clinton: D113, 4; D189, 2; D267, 8; D346, 1; D376, 4; D618, 1; D572, 3

16 Signature File (1/3)
Documents:
  doc1: dog, cat, animal
  doc2: dog
  doc3: cat
  doc4: cat, animal
  doc5: dog, animal
Compute hash codes:
  hash(dog) = 0110
  hash(cat) = 1100
  hash(animal) = 0010
Signature of documents: doc1, doc2, doc3, doc4, doc5 [signature values not recoverable from the slide]
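
A small sketch of how document signatures could be derived from the hash codes above. It assumes superimposed coding, that is, a document's signature is the bitwise OR of its terms' hash codes; the slide's own signature values are not available, so the values computed here are an illustration rather than a transcription.

```python
# Hash codes from the slide
TERM_HASH = {"dog": 0b0110, "cat": 0b1100, "animal": 0b0010}

docs = {
    "doc1": ["dog", "cat", "animal"],
    "doc2": ["dog"],
    "doc3": ["cat"],
    "doc4": ["cat", "animal"],
    "doc5": ["dog", "animal"],
}

def signature(terms):
    """Superimpose (bitwise OR) the hash codes of all terms in a document."""
    sig = 0
    for term in terms:
        sig |= TERM_HASH[term]
    return sig

def may_contain(doc_sig, term):
    """True if every bit of the term's hash code is set in the signature."""
    h = TERM_HASH[term]
    return (doc_sig & h) == h

signatures = {doc_id: signature(terms) for doc_id, terms in docs.items()}

def candidates(term):
    return [d for d, sig in signatures.items() if may_contain(sig, term)]

# Signature matches are only candidates; they must be verified against the text
false_drops = [d for d in candidates("animal") if "animal" not in docs[d]]
```

With these hash codes, the query "animal" also matches doc2, because the bit set by hash(animal) is already set by hash(dog). Such false drops are the price of the compact representation.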

17 Signature File (2/3)
[Figure, cited from CPS296.1]

18 Signature File (3/3)
Bit-sliced Signature File
[Figure, cited from CPS296.1]

19 Suffix Tree (1/3)
String S = b b a b a b
Add right end marker: b b a b a b $
Label it (positions 1 to 7)
Suffixes:
  7: $
  6: b$
  5: ab$
  4: bab$
  3: abab$
  2: babab$
  1: bbabab$
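
The suffix list on this slide can be generated directly. The sketch below only enumerates suffixes and uses them for a naive substring search; it is not a suffix tree, which stores the same suffixes in a compressed trie.

```python
def suffixes(s, end_marker="$"):
    """Return {start position (1-based): suffix} for s plus the end marker."""
    text = s + end_marker
    return {i + 1: text[i:] for i in range(len(text))}

def occurrences(s, pattern):
    """A pattern occurs at position i exactly when it is a prefix of suffix i."""
    return [i for i, suffix in sorted(suffixes(s).items()) if suffix.startswith(pattern)]

table = suffixes("bbabab")
positions_of_bab = occurrences("bbabab", "bab")
```

The second function shows why suffixes are useful for indexing: every substring of S is a prefix of some suffix, so substring search becomes prefix search.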

20 Suffix Trees (2/3)
[Figure: the compact representation]

21 Suffix Trees (3/3)
- A suffix tree can be constructed in linear time
- An on-line linear-time construction algorithm also exists
- Refer to this review: Roberto Grossi, G.F. Italiano. Suffix Trees and their Applications in String Algorithms.

22 Homework (3)
Write an inverted index building algorithm, with the following constraints:
a. memory is not sufficient to hold all documents
b. memory is not sufficient to hold the whole index
References
- Your data structure textbook
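
One way to approach this homework is sort-based indexing with an external merge: index a few documents at a time, spill each sorted batch to disk, then merge the batches in a streaming fashion. The sketch below assumes that approach; the file format, the function names, and the tiny `batch_size` are all choices made for the illustration.

```python
import heapq
import os
import tempfile

def write_run(pairs, directory):
    """Sort one in-memory batch of (term, doc_id) pairs and spill it to disk."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        for term, doc_id in sorted(set(pairs)):
            f.write(f"{term}\t{doc_id}\n")
    return path

def read_run(path):
    """Stream (term, doc_id) pairs back from a sorted run file."""
    with open(path) as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)

def build_index(doc_stream, directory, batch_size=2):
    runs, batch, docs_in_batch = [], [], 0
    for doc_id, terms in doc_stream:            # documents arrive one at a time
        batch.extend((term, doc_id) for term in terms)
        docs_in_batch += 1
        if docs_in_batch == batch_size:         # "memory full": spill a sorted run
            runs.append(write_run(batch, directory))
            batch, docs_in_batch = [], 0
    if batch:
        runs.append(write_run(batch, directory))

    out_path = os.path.join(directory, "index.txt")
    with open(out_path, "w") as out:
        current = None
        # k-way merge: only one pending pair per run is held in memory
        for term, doc_id in heapq.merge(*(read_run(p) for p in runs)):
            if term != current:
                if current is not None:
                    out.write("\n")
                out.write(f"{term}:")
                current = term
            out.write(f" {doc_id}")
        if current is not None:
            out.write("\n")
    return out_path

docs = [(1, ["dog", "cat", "animal"]), (2, ["dog"]), (3, ["cat"]),
        (4, ["cat", "animal"]), (5, ["dog", "animal"])]
with tempfile.TemporaryDirectory() as tmp:
    with open(build_index(iter(docs), tmp)) as f:
        index_lines = f.read().splitlines()
```

Neither the document collection nor the final index is ever held in memory as a whole: constraint (a) is handled by the batching, constraint (b) by writing postings straight to the output file during the merge.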

23 Architecture: Link Analysis *
Functions
- Measure the quality (or authority) of a page based on the link graph
Core Problems
- Efficient algorithm on a huge graph
- Link-spam?
- Is link analysis the only way to determine the quality of pages?

24 Homework (4)
- Write a toy PageRank algorithm
- Why is the HITS algorithm not a good choice for a search engine?
References
- Larry Page, Sergey Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web (1998). Stanford Digital Library Technologies Project.
- Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment (1999). Journal of the ACM.
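
A toy PageRank by power iteration, as a possible starting point for the first question. The damping factor 0.85, the fixed iteration count, the uniform treatment of dangling pages, and the four-page graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank uniformly over all pages
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical four-page graph
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
best_page = max(ranks, key=ranks.get)
```

Page C collects links from A, B, and D and ends up with the highest rank, while D, which nobody links to, keeps only the random-jump share (1 - 0.85) / 4.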

25 Architecture: Indexing and Ranking
The core problems of the IR community, studied for decades
Functions
- Indexing: quickly locating pages containing query terms
- Ranking: sort pages according to relevance to the query
Core Problems
- Accuracy: ranking functions with hundreds of parameters, such as anchor text, page rank, term proximity, TF*IDF
- Performance: an inverted list for a hot term may be hundreds of megabytes
Solutions
- Accuracy: model, tuning or learning?
- Performance: Top-K query & index pruning

26 Ranking
A core problem in Information Retrieval (IR): determine the relevance of a document to a query
[Diagram] Query, Document: Relevant? How relevant?

27 Relevance Computation
Postings with positions (each entry: document, position)
  bill: D113, 4; D149, 5; D196, 1; D222, 2; D267, 6; D289, 9; D345, 3; D376, 8; D453, 7
  clinton: D113, 4; D189, 2; D267, 8; D346, 1; D376, 4; D618, 1; D572, 3
Q = {bill, clinton}, D = D113: Relevance(Q, D)?
Key factors: tf(t,d), df(t), |D|
For most IR models and relevance formulas, Relevance(Q, D) = f(tf(t,d), df(t), |D|)

28 IR Perspectives and Modeling *
IR Models & Perspectives
- IR models define the representation of documents, queries, and the relevance relationship between them
- Behind every IR model is a primary perspective on information retrieval

Model                 Perspective
Boolean model         Set theory and Boolean algebra
Vector space model    Vector and linear algebra
Probabilistic model   Probabilistic
Language model

29 Architecture: Caching
Functions
- Caching results of frequent queries to answer thousands of queries per second with interactive response times
Core Problems
- What to cache?
Solutions
- Multiple level caching
  - Query level
  - Term level
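
A sketch of the query-level part of this idea: a small result cache keyed by the normalized query. The slide does not say which eviction policy is used, so least-recently-used (LRU) eviction is an assumption here, as are the class name and the way queries are normalized. Term-level caching (keeping hot inverted lists in memory) is not shown.

```python
from collections import OrderedDict

class QueryResultCache:
    """Query-level result cache with LRU eviction (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # normalized query -> cached results
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        key = " ".join(query.lower().split())   # normalize case and whitespace
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        result = compute(key)                   # fall through to the index servers
        self._entries[key] = result
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the least recently used
        return result

def lookup(query):
    # Stand-in for the real (expensive) index lookup
    return f"results for {query}"

cache = QueryResultCache(capacity=2)
for q in ["bill clinton", "Bill  Clinton", "iran", "cnn", "bill clinton"]:
    cache.get(q, lookup)
```

In this run the second query is a hit after normalization, while the last one misses again because "bill clinton" was evicted when the cache overflowed.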

30 How does a Web search engine process thousands of queries per second?

31 Multiple Clusters (Sub-Engines)
Queries pass through a DNS-based load balancing system, accounting for
1) The user's geographic proximity to each physical cluster
2) The available capacity at the various clusters
Goal: 4000 query/second (# of clusters x query/sec per cluster)
Multiple clusters distributed worldwide
For each query, only one HTTP request is sent (to only one cluster).

32 Inside a Cluster
Queries (to this cluster) arrive at a hardware-based load balancer
Goal: 400 query/sec (per cluster)
[Diagram, steps (1) to (4)] Web Servers, search result caching, Index Shards, Content Shards
- 80% of queries are cached?
- 3 replicas of each Index/Content Shard?
400 * (1 - 80%) / 3 = 27 query/sec per server
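
The capacity arithmetic on this slide and the previous one, spelled out. The 80% cache hit rate and the 3 replicas are the slide's own question-marked assumptions; the number of clusters is derived here from the two stated goals (4000 and 400 query/sec) rather than read off the slide.

```python
total_qps = 4000          # overall goal, query/sec
qps_per_cluster = 400     # goal per cluster, query/sec
cache_hit_rate = 0.80     # assumed share of queries answered from the result cache
replicas = 3              # assumed replicas of each index shard

clusters = total_qps // qps_per_cluster                 # derived, not stated on the slides
uncached_qps = qps_per_cluster * (1 - cache_hit_rate)   # queries that reach the index
qps_per_index_server = uncached_qps / replicas          # each replica takes a third
```

The last value is about 26.7, which the slide rounds to 27 query/sec per server.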

33 Inside an Index Server
Goal: 27 query/sec (per index server)
Query: t1 t2
(1) Load inverted lists for the query terms (inverted list for t1, inverted list for t2)
(2) Compute inverted-list intersection, evaluate docs, sort by relevance score
Result: Top-K DocIds
Optimize for performance: use state-of-the-art dynamic pruning algorithms to compute top-k efficiently.
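
Steps (1) and (2) can be sketched as follows, using the "bill" and "clinton" posting lists from the earlier slides (sorted by document ID). This version evaluates every document in the intersection; it does not implement the dynamic pruning the slide mentions. The relevance scores are hypothetical placeholders.

```python
import heapq

def intersect(a, b):
    """Intersect two sorted docID lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def top_k(query_terms, index, score, k):
    # (1) Load the inverted lists, shortest first to keep intermediate results small
    lists = sorted((index.get(t, []) for t in query_terms), key=len)
    if not lists:
        return []
    # (2) Intersect, evaluate, and keep the k best-scoring documents
    candidates = lists[0]
    for postings in lists[1:]:
        candidates = intersect(candidates, postings)
    return heapq.nlargest(k, candidates, key=score)

index = {
    "bill": [113, 149, 196, 222, 267, 289, 345, 376, 453],
    "clinton": [113, 189, 267, 346, 376, 572, 618],
}
scores = {113: 0.4, 267: 0.9, 376: 0.7}   # hypothetical relevance scores
matching = intersect(index["bill"], index["clinton"])
best = top_k(["bill", "clinton"], index, scores.get, k=2)
```

A pruning algorithm would aim to return the same top-k while skipping documents that provably cannot enter the result, which matters when a posting list is hundreds of megabytes long.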

34 Homework (5)
Below is the slowest query I found on Google. Explain why. (Hint: invalidating index pruning)
  Query: a the the the is
  Time: 14.52 seconds
References
- Xiaohui Long, Torsten Suel. Optimized Query Execution in Large Search Engines with Global Page Ordering. VLDB 2003.
- Xiaohui Long, Torsten Suel. Three-Level Caching for Efficient Query Processing in Large Web Search Engines. 14th International World Wide Web Conference (WWW).

35 With the knowledge learned so far, you can build a decent single-machine search engine by yourself. Give it a try if you want; it will only cost you several weeks.

36 But

37 Some Facts of a Real Commercial Search Engine
- Huge data volume: 10B pages * 10K per page = 100T
- Crawling bandwidth: 100T / (14 * 24 * 3600) = 82MB/second
- Performance: queries/second, response time < 1 second
- 10,000+ machines
- System failure is normal: if one machine fails once in one year, P(at least one machine failed in each hour) = 68%
- High reliability: data must never be corrupted
- High availability: 7*24 serving
- High scalability: machines are added or removed every day
- The electric power consumed by a large data center can supply a city with 50,000 people!
- The largest computer clusters in the world
A lot of tough problems to be solved
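
The numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes exactly 10,000 machines, a 14-day crawl cycle (read off the 14 * 24 * 3600 term), and independent failures spread evenly over the hours of a year.

```python
pages = 10 * 10**9                 # 10B pages
bytes_per_page = 10 * 10**3        # 10K per page
total_bytes = pages * bytes_per_page            # 10**14 bytes = 100T

crawl_seconds = 14 * 24 * 3600                  # one full crawl every 14 days (assumed)
bandwidth_mb_per_s = total_bytes / crawl_seconds / 10**6

machines = 10_000
hours_per_year = 365 * 24
p_one_fails_in_an_hour = 1 / hours_per_year     # one failure per machine per year
# Probability that at least one of the machines fails in a given hour
p_any_failure = 1 - (1 - p_one_fails_in_an_hour) ** machines
```

The bandwidth comes out at roughly 82.7 MB/second and the failure probability at roughly 0.68, matching the slide's 82MB/second and 68%.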

38 Mega-Datacenters for Internet Services

39 Columbia

40 Now you can understand why there is a search engine company worth $100,000,000,000+

41 Outline
- An Introduction to Search Engine Architecture
- Top 10 Myths about Search Engine
- Computer Science in Search Engine
- Top 10 Challenges in Search Engine

42 #1
Myth: Some search engines are close to perfect.
Fact: They are perfect because you have no choice
- Search engines lower our expectations
- We are getting used to their poor performance
Is this perfect?

43 #2
Myth: There are magic algorithms in search engines
Fact: There is no single magic algorithm that can make you win the search battle
- PageRank is not as important as you think. It is only one small factor among many others that search engines use to determine the ranking
- Search algorithms keep improving

44 #3
Myth: Most of the information on the Web has been indexed by search engines.
Fact: Only a very tiny fraction of Web information is being indexed.
- Seen URLs >> crawled URLs
- Dynamic contents, deep Web, Web 2.0 contents

45 #4
Myth: It is easy to switch to another search engine.
Fact: Users only switch to a search engine that is significantly better than the current one.

46 #5
Myth: Ranking is the most important thing
Fact: An infrastructure enabling quick innovations is most important
- No good infrastructure, no good ranking
- Good ranking is the result of much hard work behind the scenes

47 #6
Myth: A search engine is equivalent to Web information retrieval
Fact: A search engine is equivalent to Web-scale information management
- Information acquisition, processing, storage, access, indexing, querying, mining
- Managing the information in the world

48 #7
Myth: Cool features are king
Fact: Doing a simple thing and doing it best is king
- In terms of features, the current search engines are in fact the same as those of ten years ago
- Ideas vs. ideas do work!
- Of course, only if you have a really cool idea that can change the game

49 #8
Myth: Ideas in top conference papers are excellent
Fact: Most of them DO NOT work at all!
- Toy system
- Small dataset
- Scholastic evaluation
How to narrow the gap between academia and industry?

50 #9
Myth: Most Web search researchers are from the IR community
Fact: They come from diverse fields
- For example, search researchers in Microsoft Research Asia are from database, machine learning, system, IR, multimedia, etc.

51 #10
Myth: We know what the next-generation search engine is
Fact: We don't know
- Many efforts
- Users will tell

52 Outline
- An Introduction to Search Engine Architecture
- Top 10 Myths about Search Engine
- Computer Science in Search Engine
- Top 10 Challenges in Search Engine

53 Information Retrieval in SE
- Information Retrieval Text Retrieval
- Information Retrieval = Information Retrieval
- Web is the largest information source
- Go check the percentage of Web search related papers in SIGIR

54 Systems in SE
- Search engine data centers: the largest distributed computing platforms in the world
- When the scale is large enough, it becomes a system problem
- Infrastructure for Web-scale data processing: is it a Web OS?
- Google File System: best paper of SOSP '03

55 Database in SE
Is the Web a Huge Database?
- Most data on the Web are in fact (semi-)structured
- Database people want to manage more data
- Online databases everywhere
- DB+IR workshops in SIGMOD, VLDB, SIGIR, and WWW
- WebDB workshop

                      Database (DR)          Information Retrieval (IR)
Data                  Structured             Unstructured
Model                 Deterministic          Probabilistic
Inference             Deduction              Induction
Query language        Artificial             Natural
Query specification   Complete               Incomplete
Matching              Exact match            Partial match, best match
Items wanted          Matching               Relevant
Error response        Sensitive              Insensitive
Data update           Full-support           Not support
Transaction           Support                Not support
Usage                 Application-oriented   Human-oriented

WebDB? A long way to go

56 Machine Learning & Data Mining in SE
- Data! A huge amount of data!! Various kinds of data!!!
- Data mining and machine learning people are excited
- If you have a lot of data, then you don't need a lot of methodology.
- Moore's Law Constant: Data Collection Rates -> Improvement Rates.
- All Web-scale data processing tasks need to be automatic
- Learning to ...
  - Learning to rank
  - Learning to crawl
  - Learning to extract

57 Others
- Multimedia
- Social Science
- User Interface
- Network
- Hardware
You can get a PhD degree by working on Web search problems

58 Outline
- An Introduction to Search Engine Architecture
- Top 10 Myths about Search Engine
- Computer Science in Search Engine
- Top 10 Challenges in Search Engine

59 #1: Spamming and Content Quality *
- Click Money, Spam Click ==> Spam Money
- An endless game between spammers and search engines
- How to determine the quality of web content?
- Traditional IR: every document is authoritative and accurate.

60 Homework (6)
Prove either of the following propositions:
- There are spam-immune ranking algorithms
- There is NO spam-immune ranking algorithm

61 #2: Data Acquisition
- Growing speed of the Web >> growth of indexing capability of search engines
- Re-crawl frequently updated pages: news, blog, BBS
- Dynamic contents: deep Web, Web 2.0
- Crawling is the first step of search, but its importance is largely ignored by academia.

62 Homework (7)
How to crawl blogs?

63 #3: Infrastructure
The Cycle of Web Innovation
[Diagram] Ideas, Prototypes, Products, Testing, Deployment; continuous innovation, tuning, and hacking
- RTM, RTW: Always Beta!
- Quick prototyping and deployment
Very Difficult to Do Web-scale Innovations
- Long innovation cycle
- How difficult is it to test a new algorithm on 5B pages?
- How difficult is it to calculate the query frequencies in 100T of search logs?

64 #4: Ranking *
Essence of ranking
- How to combine innumerable pieces of evidence into a good ranking

65 #5: Evaluation *
Traditional IR evaluation
- Limited binary judgment
- Static collection of documents (few million)
- A small set of queries
- Use pooling
  - Pool top 1000 results from various techniques
  - Assume all possible relevant documents judged
  - Biased against revolutionary new methods
  - Judge new documents if needed
On the Web
- Collection is dynamic
  - 10-20% of URLs change every month
  - Spam methods are dynamic
  - Need to keep the collection recent
- Queries are also time sensitive
  - Topics are hot then not
  - Need to keep a representative sample
- Result quality is important
  - Multiple level judgment
  - Clicks as implicit judgment?

66 #6: Query Formulation
Query = information need?
How do you compose your queries?
- Guess if the terms occur in the wanted pages
- Relevant to terms, instead of relevant to query
- What to do if the guess fails?

67 #7: Personalization
Personalized search: a long history, but never a success story
Is personalized search really useful?
- There is NOT a widely-used personalized search engine
- How does personalized search work in a real large-scale search engine?
- User study: in a closed environment
Does one size fit all?
- It is unclear whether personalization is consistently effective on different queries, for different users, and under different search contexts

68 #8: Structure in the Web
Are Web data really unstructured?
More structure = better search
- Layout Structure
- Discussion Thread Structure
- Community Structure
- Category Structure
- Link Structure
- Deep Structure [Diagram: interfaces in front of databases]

69 #9: Go Beyond Page-level Search *
Is the page the only and best atomic information unit?
- A tradition from IR, but not necessarily right for the Web
What we did and are doing
- Block-based search
- Deep Web search
- Object-level search

70 #10: The Next Big Thing?

71 Thanks!
