An Overview of Search Engine. Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia


1 An Overview of Search Engine Hai-Yang Xu Dev Lead of Search Technology Center Microsoft Research Asia July 24,

2 Outline: History of Search Engine; Difference Between Software and Service; Architecture of Search Engine; 5 Tips On Optimizing Search Engine; 3 Secrets On Implementing Search Engine

3 History of Search Engine: Personal or Academic Site ( ); Internet Portal ( ); Technology Provider ( ); Search Portal (2002-)

4 Personal or Academic Site ( ): Archie, WebCrawler, Lycos, Excite, Yahoo

5 Internet Portal ( ): Yahoo!, Lycos, Excite, Infoseek

6 Technology Provider ( ): AltaVista, Inktomi, Fast/AlltheWeb, Google, Goto/Overture

7 Search Portal (2002-): Google, Yahoo, MSN, ASK

8 Lessons From The Past: technology is the biggest challenge; search has always been an important Internet application; a search engine can always be made better

9 Architecture of Search Engine (diagram; components: URL DB, Crawler, Page DB, Query, Result Page, Search)

10 Difference Between Software and Service: Product vs. Experience; Feature vs. Refinement; Develop vs. Operate; Release vs. Serve; Code vs. Parameter; Update vs. Tune; Bug Free vs. Optimal

11 Crawler: crawling is more difficult than you think; stability in downloading, computation and storage; scalability; high performance

12 Performance: Content Analysis (tf*idf, HTML tags, HTML visual information); Link Analysis (PageRank, spam)
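The slide names tf*idf without giving a formula. A minimal sketch, assuming the common textbook form (term frequency = count / document length, inverse document frequency = log(N / df)); the function name `tf_idf` and the sample documents are illustrative, not from the talk:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document; each doc is a list of tokens."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

# Hypothetical toy corpus: "search" appears everywhere, so it carries no weight.
docs = [["search", "engine", "crawler"], ["search", "cache"], ["search", "hash", "table"]]
weights = tf_idf(docs)
```

A term that occurs in every document gets weight zero, while a term confined to one short document gets the highest weight, which is the behavior a ranking signal based on tf*idf relies on.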

13 Search: huge traffic; huge data; complicated computation; very large server cluster

14 Engineering Problem of Search Engine: Intellectual Problem (Optimization); Non-intellectual Problem (Implementation)

15 5 Tips On Optimizing Search Engine: define the problem from the user's perspective; think at the system level; features are more important than the classification method; tradeoff; combine several simple solutions into a powerful solution

16 Non-intellectual Problem: Architecture; High Performance

17 3 Secrets On Implementing Search Engine: Cache; Signature; Hash Table

18 Cache: what is a cache; what to do with the cache is more important than how to cache; search result page cache

19 Cache (cont.): Front-end Cache & Back-end Cache; Caching Merged & Caching Raw; Caching & Caching Display Information; Caching everything

20 Cache (cont.): cache them before search; cache on disk; 2 Terms Cache
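The cache slides mention a search result page cache but do not describe its design. The following is a minimal sketch of one possible front-end cache, assuming a least-recently-used eviction policy and a key of (normalized query, page number); the class name `ResultPageCache` and the `search_fn` callback are hypothetical:

```python
from collections import OrderedDict

class ResultPageCache:
    """LRU cache of rendered result pages, keyed by (normalized query, page)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, page, search_fn):
        # Normalizing the query lets trivially different spellings share an entry.
        key = (query.strip().lower(), page)
        if key in self._pages:
            self._pages.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._pages[key]
        self.misses += 1
        result = search_fn(query, page)       # fall through to the back end
        self._pages[key] = result
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)   # evict the least recently used page
        return result
```

Query streams are heavily skewed toward a small set of popular queries, so even a small cache of this kind can keep a large share of traffic away from the back-end servers.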

21 Signature: what is a signature; what problems can benefit from the signature trick; signature algorithm; probability of conflict
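The slide does not say which signature algorithm the talk used. As an illustration only, the sketch below assumes a 64-bit signature taken from a BLAKE2b digest, and estimates the probability of conflict with the standard birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)); both function names are hypothetical:

```python
import hashlib
import math

def signature(text, bits=64):
    """Map an arbitrary string to a fixed-size integer (bits must be a multiple of 8)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=bits // 8).digest()
    return int.from_bytes(digest, "big")

def collision_probability(n_items, bits=64):
    """Birthday approximation of the chance that any two of n_items signatures collide."""
    return -math.expm1(-n_items * (n_items - 1) / 2.0 ** (bits + 1))
```

Comparing and storing fixed-size integers is much cheaper than comparing long URLs or whole pages, at the cost of a small conflict probability that grows roughly with the square of the number of items.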

22 Hash Table: what is a hash table; implementation of a hash table using signatures
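The slide gives no implementation details. One plausible reading, sketched here as an assumption, is an open-addressing table that stores only each key's 64-bit signature instead of the key itself, which keeps slots small and fixed-size. The names `SignatureTable` and `signature` are hypothetical, and two keys with the same signature would be treated as the same key:

```python
import hashlib

def signature(text):
    d = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(d, "big") or 1   # 0 is reserved to mean "empty slot"

class SignatureTable:
    """Open-addressing hash table that stores signatures, not the original keys."""

    def __init__(self, n_slots=1024):
        self.sigs = [0] * n_slots
        self.values = [None] * n_slots
        self.count = 0

    def _slot(self, sig):
        # Linear probing; terminates because one slot is always kept empty.
        i = sig % len(self.sigs)
        while self.sigs[i] not in (0, sig):
            i = (i + 1) % len(self.sigs)
        return i

    def put(self, key, value):
        sig = signature(key)
        i = self._slot(sig)
        if self.sigs[i] == 0:
            if self.count + 1 >= len(self.sigs):
                raise MemoryError("table full")
            self.sigs[i] = sig
            self.count += 1
        self.values[i] = value

    def get(self, key, default=None):
        sig = signature(key)
        i = self._slot(sig)
        return self.values[i] if self.sigs[i] == sig else default
```

Because the keys are never stored, memory per entry is constant regardless of URL or query length; the price is that a signature conflict silently merges two keys.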

23 Document is an by doc is an by term just is a hash table of terms

24 Query Statistics. Problem: from a query log file, get the frequency of each query. Solution 1: sort the query log file, then count each query. Solution 2: hash table.
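Both solutions can be sketched in a few lines. This is an illustration under the assumption that the log has already been parsed into a list of query strings; the function names and sample log are hypothetical:

```python
from itertools import groupby

def count_by_sort(queries):
    """Solution 1: sort, then count each run of equal queries. O(n log n)."""
    return {q: sum(1 for _ in run) for q, run in groupby(sorted(queries))}

def count_by_hash(queries):
    """Solution 2: one pass over the log with a hash table. O(n) expected."""
    counts = {}
    for q in queries:
        counts[q] = counts.get(q, 0) + 1
    return counts

log = ["msn", "google", "msn", "yahoo", "msn"]
```

The hash-table version avoids the sort entirely and needs memory proportional to the number of distinct queries rather than the size of the log.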

25 O(n) Sort. Problem: sort student records by examination score. Solution 1: qsort(), O(n*logn). Solution 2: hash table, O(n).
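The O(n) bound works because scores come from a small, bounded range, so the score itself can serve as the table index. A minimal sketch, assuming integer scores between 0 and a hypothetical `max_score` of 100 and records given as (name, score) pairs:

```python
def sort_by_score(records, max_score=100):
    """Bucket records by score, then read the buckets in order: O(n + max_score)."""
    buckets = [[] for _ in range(max_score + 1)]
    for name, score in records:
        buckets[score].append((name, score))   # direct indexing, no comparisons
    return [rec for bucket in buckets for rec in bucket]

records = [("Ann", 90), ("Bob", 72), ("Cid", 90), ("Dee", 55)]
ranked = sort_by_score(records)
```

Records with equal scores keep their input order, so the result matches a stable comparison sort on the score.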

26 De-duplicate URLs. Problem: de-duplicate URLs that have the same content. Solution 1: sort, then compare. Solution 2: hash table.
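A sketch of the hash-table solution, assuming pages arrive as (url, content) pairs and that "same content" means byte-identical text; the 64-bit BLAKE2b content signature and the function name are assumptions, not details from the talk:

```python
import hashlib

def dedupe_by_content(pages):
    """Keep the first URL seen for each distinct page content."""
    first_url = {}
    for url, content in pages:
        sig = hashlib.blake2b(content.encode("utf-8"), digest_size=8).digest()
        first_url.setdefault(sig, url)   # later URLs with the same content are dropped
    return list(first_url.values())

pages = [
    ("http://a.example/", "hello"),
    ("http://a.example/index.html", "hello"),
    ("http://b.example/", "world"),
]
unique_urls = dedupe_by_content(pages)
```

Keying the table on a content signature means the full page text never has to be held in memory or compared pairwise.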

27 Set Operations: A*B (intersection); A+B (union); A-B and B-A (differences)
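Reading A*B as intersection and A+B as union, all four operations reduce to hash-table membership tests. The sketch below uses Python's built-in `set`, which is itself a hash table; the function name and the sample document-ID lists are illustrative:

```python
def set_ops(a, b):
    """Compute the four set operations with hash-based sets."""
    in_a, in_b = set(a), set(b)
    return {
        "A*B": sorted(in_a & in_b),   # in both
        "A+B": sorted(in_a | in_b),   # in either
        "A-B": sorted(in_a - in_b),   # only in A
        "B-A": sorted(in_b - in_a),   # only in B
    }

# Hypothetical lists of document IDs matching two different terms.
ops = set_ops([1, 2, 3, 5], [2, 3, 4])
```

In a search setting these correspond naturally to AND, OR, and NOT over the documents matching each term.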

28 Page Storage Architecture (diagram; components: Crawl, Page Distributor, several Page DB instances)

29 Architecture (diagram; component shown: Page DB)

30 Search Architecture (diagram; components: Query, Load Balancer, three Frontend instances, Result Page)

31 Crawling Architecture. How about this solution? (diagram; components: several Url DB instances, URL, several Crawl instances, Page DB) WRONG!

32 Crawling Architecture (cont.) (diagram; components: several Url DB instances, URL, several Crawl instances, Page DB, URL Distributor)
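The slide adds a URL Distributor but does not say how it routes URLs. One common approach, sketched here purely as an assumption, is to hash each URL's host name so that every host is owned by exactly one Url DB partition; the function names are hypothetical:

```python
import hashlib
from urllib.parse import urlsplit

def partition_for(url, n_partitions):
    """Pick a partition from a hash of the host, so one host maps to one partition."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.blake2b(host.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_partitions

def distribute(urls, n_partitions):
    """Route newly discovered URLs to their owning partitions."""
    partitions = [[] for _ in range(n_partitions)]
    for url in urls:
        partitions[partition_for(url, n_partitions)].append(url)
    return partitions
```

With routing of this kind, a URL discovered by any crawler always lands in the same partition, so duplicate detection and per-host bookkeeping stay local to one Url DB.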

33 Summary: History of Search Engine; Difference Between Software and Service; Architecture of Search Engine; 5 Tips On Optimizing Search Engine; 3 Secrets On Implementing Search Engine

34 Thank you!

35 Q&A
