Scalable Search Engine Solution

Size: px

Start display at page:

Download "Scalable Search Engine Solution"

Philip Barber
5 years ago
Views:

1 Scalable Search Engine Solution A Case Study of BBS Yifu Huang School of Computer Science, Fudan University huangyifu@fudan.edu.cn COMP Information Retrieval Project, 2013 Yifu Huang (FDU CS) COMP Project 2013/12/24 1 / 20

2 Outline 1 Motivation 2 Architecture 3 Implementation 4 Demo 5 Discussion Yifu Huang (FDU CS) COMP Project 2013/12/24 2 / 20

3 Motivation Motivation Current BBS donot support the full text search, but we need it indeed Get familiar with the implementation details behind search engine Build a scalable search engine solution that can be easily reused Yifu Huang (FDU CS) COMP Project 2013/12/24 3 / 20

4 Motivation Motivation Current BBS donot support the full text search, but we need it indeed Get familiar with the implementation details behind search engine Build a scalable search engine solution that can be easily reused Yifu Huang (FDU CS) COMP Project 2013/12/24 3 / 20

5 Motivation Motivation Current BBS donot support the full text search, but we need it indeed Get familiar with the implementation details behind search engine Build a scalable search engine solution that can be easily reused Yifu Huang (FDU CS) COMP Project 2013/12/24 3 / 20

6 Architecture Architecture Yifu Huang (FDU CS) COMP Project 2013/12/24 4 / 20

7 Architecture Architecture (cont.) Cluster Data Intel(R) Core(TM)2 Duo CPU 2.93GHz 4GB RAM Hadoop Namenode/Jobtracker: 1, datanode/tasktracker: 24 HBase Master: 1, zookeeper: 3, regionserver: 24 Board: 376 Post: Yifu Huang (FDU CS) COMP Project 2013/12/24 5 / 20

8 Architecture Architecture (cont.) Cluster Data Intel(R) Core(TM)2 Duo CPU 2.93GHz 4GB RAM Hadoop Namenode/Jobtracker: 1, datanode/tasktracker: 24 HBase Master: 1, zookeeper: 3, regionserver: 24 Board: 376 Post: Yifu Huang (FDU CS) COMP Project 2013/12/24 5 / 20

9 Hadoop Common HDFS Conguration, serialization, compression, RPC,... Hdfs://namenode:9000/ Replication: 3 Yifu Huang (FDU CS) COMP Project 2013/12/24 6 / 20

10 Hadoop (cont.) MapReduce Map (k1, v1) -> (k2, v2) Reduce (k2, v2) -> (k3, v3) Yifu Huang (FDU CS) COMP Project 2013/12/24 7 / 20

11 HBase Model (row, column, time) -> cell Yifu Huang (FDU CS) COMP Project 2013/12/24 8 / 20

12 HBase (cont.) Storage Yifu Huang (FDU CS) COMP Project 2013/12/24 9 / 20

13 Nutch Inject Key idea The number of URL is large, so we map them to dierent nodes and inject them into HBase webpage table Map (oset, line, reversedurl, page) Normalize Filter Regex,... Regex, prex, sux,,domain, automaton,... ReversedUrl Page Reduce - default E.g. " becomes "com.foo.bar:8983:http/to/index.html?a=b" FetchTime, fetchinterval, metadata, score, marker,... Yifu Huang (FDU CS) COMP Project 2013/12/24 10 / 20

14 Nutch Inject Key idea The number of URL is large, so we map them to dierent nodes and inject them into HBase webpage table Map (oset, line, reversedurl, page) Normalize Filter Regex,... Regex, prex, sux,,domain, automaton,... ReversedUrl Page Reduce - default E.g. " becomes "com.foo.bar:8983:http/to/index.html?a=b" FetchTime, fetchinterval, metadata, score, marker,... Yifu Huang (FDU CS) COMP Project 2013/12/24 10 / 20

15 Nutch Inject Key idea The number of URL is large, so we map them to dierent nodes and inject them into HBase webpage table Map (oset, line, reversedurl, page) Normalize Filter Regex,... Regex, prex, sux,,domain, automaton,... ReversedUrl Page Reduce - default E.g. " becomes "com.foo.bar:8983:http/to/index.html?a=b" FetchTime, fetchinterval, metadata, score, marker,... Yifu Huang (FDU CS) COMP Project 2013/12/24 10 / 20

16 Nutch Generate Key idea We select some URL from HBase webpage table, map them to dierent nodes, and prepare to fech Map (reversedurl, page, (url, scorce), page) Check Mark, distance, fetch schedule, score,... Reduce ((url, scorce), page, reversedurl, page) Record Set HostCount, domaincount,... BatchId, marker,... Yifu Huang (FDU CS) COMP Project 2013/12/24 11 / 20

17 Nutch Generate Key idea We select some URL from HBase webpage table, map them to dierent nodes, and prepare to fech Map (reversedurl, page, (url, scorce), page) Check Mark, distance, fetch schedule, score,... Reduce ((url, scorce), page, reversedurl, page) Record Set HostCount, domaincount,... BatchId, marker,... Yifu Huang (FDU CS) COMP Project 2013/12/24 11 / 20

18 Nutch Generate Key idea We select some URL from HBase webpage table, map them to dierent nodes, and prepare to fech Map (reversedurl, page, (url, scorce), page) Check Mark, distance, fetch schedule, score,... Reduce ((url, scorce), page, reversedurl, page) Record Set HostCount, domaincount,... BatchId, marker,... Yifu Huang (FDU CS) COMP Project 2013/12/24 11 / 20

19 Nutch Fetch Key idea We map URL with random id to dierent nodes and fetch them with multi-threads in each node Map (reversedurl, page, random_id, (conf, reversedurl, page)) Check BatchId, marker, resume,... Reduce (random_id, (conf, reversedurl, page), reversedurl, page) One producer to multiple consumers Yifu Huang (FDU CS) COMP Project 2013/12/24 12 / 20

20 Nutch Fetch Key idea We map URL with random id to dierent nodes and fetch them with multi-threads in each node Map (reversedurl, page, random_id, (conf, reversedurl, page)) Check BatchId, marker, resume,... Reduce (random_id, (conf, reversedurl, page), reversedurl, page) One producer to multiple consumers Yifu Huang (FDU CS) COMP Project 2013/12/24 12 / 20

21 Nutch Fetch Key idea We map URL with random id to dierent nodes and fetch them with multi-threads in each node Map (reversedurl, page, random_id, (conf, reversedurl, page)) Check BatchId, marker, resume,... Reduce (random_id, (conf, reversedurl, page), reversedurl, page) One producer to multiple consumers Yifu Huang (FDU CS) COMP Project 2013/12/24 12 / 20

22 Nutch Fetch (cont.) QueueFeeder Feed the queues with input items, and re-lls them as items are consumed by FetcherThread-s FetcherThread Pick items from queues and fetches the pages Check robot rules Check crawl delay schedule Get page content Check status code Yifu Huang (FDU CS) COMP Project 2013/12/24 13 / 20

23 Nutch Fetch (cont.) QueueFeeder Feed the queues with input items, and re-lls them as items are consumed by FetcherThread-s FetcherThread Pick items from queues and fetches the pages Check robot rules Check crawl delay schedule Get page content Check status code Yifu Huang (FDU CS) COMP Project 2013/12/24 13 / 20

24 Nutch Parse Key idea We map URL to dierent nodes, extract eld from them and save into HBase webpage table Map (reversedurl, page, reversedurl, page) Set Reduce - default Text, title, signature, outlinks,... Yifu Huang (FDU CS) COMP Project 2013/12/24 14 / 20

25 Nutch Parse Key idea We map URL to dierent nodes, extract eld from them and save into HBase webpage table Map (reversedurl, page, reversedurl, page) Set Reduce - default Text, title, signature, outlinks,... Yifu Huang (FDU CS) COMP Project 2013/12/24 14 / 20

26 Nutch Parse Key idea We map URL to dierent nodes, extract eld from them and save into HBase webpage table Map (reversedurl, page, reversedurl, page) Set Reduce - default Text, title, signature, outlinks,... Yifu Huang (FDU CS) COMP Project 2013/12/24 14 / 20

27 Nutch Updatedb Key idea We map URL to dierent nodes, extract their outlinks, and prepare to fetch these outlinks Map (reversedurl, page, (reversedout, score), pageout) Check outlink depth... Reduce ((reversedout, score), pageout, reversedout, pageout) Set inlinks... Yifu Huang (FDU CS) COMP Project 2013/12/24 15 / 20

28 Nutch Updatedb Key idea We map URL to dierent nodes, extract their outlinks, and prepare to fetch these outlinks Map (reversedurl, page, (reversedout, score), pageout) Check outlink depth... Reduce ((reversedout, score), pageout, reversedout, pageout) Set inlinks... Yifu Huang (FDU CS) COMP Project 2013/12/24 15 / 20

29 Nutch Updatedb Key idea We map URL to dierent nodes, extract their outlinks, and prepare to fetch these outlinks Map (reversedurl, page, (reversedout, score), pageout) Check outlink depth... Reduce ((reversedout, score), pageout, reversedout, pageout) Set inlinks... Yifu Huang (FDU CS) COMP Project 2013/12/24 15 / 20

30 Nutch Solrindex Key idea We map URL to dierent nodes, generate document and build index to solr server Map (reversedurl, page, reversedurl, doc) Set Reduce - default Id, digest, batchid, boost,... Yifu Huang (FDU CS) COMP Project 2013/12/24 16 / 20

31 Nutch Solrindex Key idea We map URL to dierent nodes, generate document and build index to solr server Map (reversedurl, page, reversedurl, doc) Set Reduce - default Id, digest, batchid, boost,... Yifu Huang (FDU CS) COMP Project 2013/12/24 16 / 20

32 Nutch Solrindex Key idea We map URL to dierent nodes, generate document and build index to solr server Map (reversedurl, page, reversedurl, doc) Set Reduce - default Id, digest, batchid, boost,... Yifu Huang (FDU CS) COMP Project 2013/12/24 16 / 20

33 Solr Yifu Huang (FDU CS) COMP Project 2013/12/24 17 / 20

34 Demo Demo Yifu Huang (FDU CS) COMP Project 2013/12/24 18 / 20

35 Discussion Discussion Result BBS search engine demo A scalable search engine solution that can be easily reused Future Work Incremental crawler Field search Personalized recommendation... Yifu Huang (FDU CS) COMP Project 2013/12/24 19 / 20

36 Discussion Discussion Result BBS search engine demo A scalable search engine solution that can be easily reused Future Work Incremental crawler Field search Personalized recommendation... Yifu Huang (FDU CS) COMP Project 2013/12/24 19 / 20

37 Appendix References I [1] The Google File System. SOSP2003. [2] MapReduce: Simplied Data Processing on Large Clusters. OSDI2004. [3] Bigtable: A Distributed Storage System for Structured Data. OSDI2006. [4] Hadoop: The Denitive Guide [5] Data-Intensive Text Processing with MapReduce [6] The Hadoop Distributed File System. MSST2010. [7] Apache Hadoop YARN: Yet Another Resource Negotiator. SOCC2013. [8] HBase: The Denitive Guide Yifu Huang (FDU CS) COMP Project 2013/12/24 20 / 20

Large Scale Distributed Deep Networks

Large Scale Distributed Deep Networks Yifu Huang School of Computer Science, Fudan University huangyifu@fudan.edu.cn COMP630030 Data Intensive Computing Report, 2013 Yifu Huang (FDU CS) COMP630030 Report