Query based Site Selection for Distributed Search Engines


Nobuyoshi SATO, Minoru UDAGAWA, Minoru UEHARA, Yoshifumi SAKAI, Hideki MORI
Department of Information and Computer Sciences, Toyo University, JAPAN

Abstract

We have developed a distributed search engine, called Cooperative Search Engine (CSE), in order to retrieve fresh information. In CSE, a local search engine located in each Web server makes an index of local pages, and a meta search server integrates these local search engines to realize a global search engine. With this architecture, communication delay occurs at retrieval time, so we have developed several speedup techniques to realize fast retrieval. However, these techniques cannot be used for first-page retrieval in Next 10 search if the page has not been searched yet. We have therefore proposed Query based Site Selection (QbSS), which is available in all cases. In this paper, we describe QbSS in detail and discuss its features.

1. Introduction

Search engines are very important for Web page retrieval. Typical search engines employ a centralized architecture. In such a centralized search engine, a robot collects Web pages and an indexer builds an index of these pages so that they can be searched quickly. We define the update interval as the period during which a page has been published but cannot yet be searched. A centralized architecture has the problem that the update interval is very long. For example, Google takes 4 weeks[1]. We have therefore developed a distributed search engine, Cooperative Search Engine (CSE)[2][3], in order to reduce the update interval. In CSE, a local search engine located in each Web server makes an index of local pages. Furthermore, a meta search engine integrates these local search engines in order to realize a global search engine. Although this mechanism reduces the update interval, it increases communication overhead.
As a result, early CSE is suited to intranet information retrieval in small-scale networks that consist of fewer than 100 servers. However, international enterprises often have more than 100 servers in their domains. To address the scalability of CSE, we have developed several techniques such as Score based Site Selection (SbSS)[7] and Persistent Cache[9]. In SbSS, when the second or a later page is retrieved in Next 10 search, a client sends a query to at most the top 10 sites, by holding the maximum score of each server. As a result, CSE scales when retrieving the second or later pages. Persistent Cache keeps cached data valid across updates, so CSE also scales when retrieving a first page that has been searched once before. However, the techniques mentioned above cannot be used if the first page of Next 10 search has not been searched yet.

In this paper, we propose Query based Site Selection (QbSS) in order to reduce the retrieval time even if the first page of Next 10 search has not been searched yet. QbSS is a site selection technique based on the Boolean formula of a query. CSE supports Boolean search, in which the operations and, or, and and-not are available. Let $S_A$ and $S_B$ be the sets of target sites for search queries A and B, respectively. Then the sets of target sites for the queries "A and B", "A or B", and "A and-not B" are $S_A \cap S_B$, $S_A \cup S_B$, and $S_A$, respectively. This selection of target sites reduces the number of messages in the search process.

The remainder of this paper is organized as follows. Section 2 gives an overview of CSE and its behavior. Section 3 describes QbSS, and Section 4 evaluates it. Section 5 surveys related work on distributed information retrieval. Finally, we summarize conclusions and future work.

2. Cooperative Search Engine

First, we explain the basic idea of CSE.
In order to minimize the update interval, every Web site makes its indices via a local indexer. By themselves, however, these sites do not cooperate. Each site therefore sends information about what it knows (i.e. which words) to the manager. This information is called Forward Knowledge (FK); it is meta knowledge indicating what each site knows. FK is the same as FI of Ingrid. When searching, the manager tells the client which sites have documents including any word in the query, and the client then sends the query to all of those sites. Since CSE thus needs two-pass communication when searching, its retrieval time is longer than that of a centralized search engine.

CSE consists of the following components (see Figure 1).

Location Server (LS): It manages FK exclusively. Using FK, LS performs Query based Site Selection, described later. LS also has a Site selection Cache (SC), which caches the results of site selection.

Cache Server (CS): It caches FK and retrieval results. LS can be thought of as the top-level CS. CS realizes Next 10 searches by caching retrieval results. Furthermore, it realizes parallel search by calling the LMSEs, mentioned below, in parallel.

Local Meta Search Engine (LMSE): It receives a query from a user, sends it to CS (User I/F in Figure 1), and performs the local search process by calling the LSE, mentioned below (Engine I/F in Figure 1). It works as a meta search engine that abstracts the differences between LSEs.

Local Search Engine (LSE): It gathers documents locally (Gatherer in Figure 1), makes a local index (Indexer in Figure 1), and retrieves documents by using the index (Engine in Figure 1). In CSE, Namazu[4] can be used as an LSE. Namazu has been widely used for search services on various Japanese sites.

Figure 1. The overview of CSE

Next, we explain how the update process is done. In CSE, the Update I/F of LSE carries out the update process periodically. The algorithm for the update process in CSE is as follows.

1. The Gatherer of LSE gathers all the documents (Web pages) in the target Web sites, using direct access (i.e. via NFS) if available, using archived access (i.e. via CGI) if it is available but direct access is not, and using HTTP access otherwise. In archived access, a special CGI that provides mobile agent place functions is used. A mobile agent is sent to that place. The agent archives local files, compresses them, and sends them back to the Gatherer.
2. The Indexer of LSE makes an index for the gathered documents by parallel processing based on the Boss-Worker model.
3. Update phase 1: Each LMSE_i updates as follows.
3.1. The Engine I/F of LMSE_i obtains from the corresponding LSE the total number $N_i$ of all the documents, the set $K_i$ of all the words appearing in some document, and the number $n_{k,i}$ of all the documents including word $k$, and sends all of them to CS together with its own URL.
3.2. CS sends all the contents received from each LMSE_i to the upper-level CS. The transmission terminates when the contents reach the top-level CS (namely, LS).
3.3. LS calculates the value of $idf(k) = \log\left( \sum_i N_i / \sum_i n_{k,i} \right)$ from $n_{k,i}$ and $N_i$ for each word $k$.
4. Update phase 2: Each LMSE_i updates as follows.
4.1. LMSE_i receives from LS the set $Q$ of Boolean queries that have been searched and the set of idf values.
4.2. The Engine I/F of LMSE_i obtains from the corresponding LSE the highest score $\max_{d \in D} S_i(d,q)$ for each $q \in \{Q, K_i\}$, where $S_i(d,k)$ is the score of document $d$ containing $k$ and $D$ is the set of all the documents in the site, and sends all of them to CS together with its own URL.
4.3. CS sends all the contents received from each LMSE_i to the upper-level CS. The transmission terminates when the contents reach the top-level CS (namely, LS).

Note that the data transferred between modules are mainly used for the distributed calculation of scores based on the tf*idf method. We call this method the distributed tf*idf method. The score based on the distributed tf*idf method is calculated during the search process, so we give the details of the score when we explain the search process in CSE. CSE sacrifices the performance of the search process for the sake of good update performance.

Here we explain how the search process in CSE is done.

1. When LMSE_0 receives a query from a user, it sends the query to CS.
2. CS obtains from LS all the LMSEs expected to have documents satisfying the query.
3. CS sends the query to each of the LMSEs obtained.
4. Each LMSE searches for documents satisfying the query by using its LSE, and returns the result to CS.
5. CS combines all the results received from the LMSEs and returns them to LMSE_0.
6. LMSE_0 displays the search result to the user.

Here, we describe the design of a scalable architecture for the distributed search engine CSE. In CSE, communication delay occurs at search time. This problem is addressed by the following techniques.

Look Ahead Cache in Next 10 Search (LAC)[5][6]: To shorten the delay in the search process, CS prepares the next result for the Next 10 search. That is, the search result is divided into page units, and each page unit is cached in advance by a background process without increasing the response time.

Score based Site Selection (SbSS)[7]: In the Next 10 search, the score of the next-ranked document in each site is gathered in advance, and requests to the sites with low-ranked documents are suppressed. This suppression keeps network traffic from increasing unnecessarily. For example, there are more than 100,000 domain sites in Japan, but with this technique it is sufficient to send requests to about ten sites on each continued search.

Global Shared Cache (GSC)[8]: An LMSE sends a query to the nearest CS, so many CSs may send the same requests to LMSEs. In order to share cached retrieval results globally among CSs, we proposed the Global Shared Cache (GSC). In this method, LS memorizes the authority CS_a of each query and tells CSs about CS_a instead of the LMSEs. A CS then caches the cached contents of CS_a.

Persistent Cache (PC)[9]: There is at least one CS in CSE in order to improve the response time of retrieval. However, the cache soon becomes invalid because the update interval is very short in CSE, and the valuable first page is lost as well. Therefore, we need a persistent cache, which holds valid cache data before and after updating. In this method, there are two update phases. In the first update phase, each LMSE sends the number of documents including each word to LS, and LS determines the idf of each word. In the second update phase, a preliminary search is performed using the new idf values in order to update the caches.

Query based Site Selection (QbSS)[10]: CSE supports Boolean search, in which the operations and, or, and and-not are available. Let $S_A$ and $S_B$ be the sets of target sites for search queries A and B, respectively. Then the sets of target sites for the queries "A and B", "A or B", and "A and-not B" are $S_A \cap S_B$, $S_A \cup S_B$, and $S_A$, respectively.
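The site-set rule for Boolean queries can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CSE implementation: the tuple encoding of queries, the `forward_knowledge` mapping, the site names, and the function `target_sites` are all hypothetical names introduced here.

```python
# Hypothetical forward knowledge: keyword -> set of sites whose index contains it.
forward_knowledge = {
    "search": {"site1", "site2", "site3"},
    "engine": {"site2", "site3"},
    "cache": {"site3", "site4"},
}


def target_sites(query, fk):
    """Return the set of sites to which the query should be sent.

    A query is either a keyword (str) or a tuple (op, left, right),
    where op is "and", "or", or "and-not" (an assumed encoding).
    """
    if isinstance(query, str):
        return set(fk.get(query, set()))
    op, left, right = query
    s_a = target_sites(left, fk)
    if op == "and-not":
        # A and-not B: the negated part cannot rule out a whole site, so keep S_A.
        return s_a
    s_b = target_sites(right, fk)
    if op == "and":
        return s_a & s_b  # S_A intersect S_B
    if op == "or":
        return s_a | s_b  # S_A union S_B
    raise ValueError(f"unknown operator: {op}")


print(sorted(target_sites(("and", "search", "engine"), forward_knowledge)))
print(sorted(target_sites(("or", "engine", "cache"), forward_knowledge)))
print(sorted(target_sites(("and-not", "search", "cache"), forward_knowledge)))
```

Note that "A and-not B" keeps every site in $S_A$: a site that knows both keywords may still hold a document containing A but not B.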
This selection of target sites reduces the number of messages in the search process.

These techniques are used as follows:

    if the previous page of Next 10 search has been already searched
        LAC
    else if query does not contain "and" or "and-not"
        SbSS
    else if it has been searched since index was updated
        GSC
    else if it has been searched once
        PC
    else // query is new
        QbSS
    fi

Among these techniques, only QbSS is not scalable. Therefore, it is important to improve the precision of QbSS.

3. Query based Site Selection

In CSE, when a retrieval query is given, LS commissions LMSEs to search their local documents. To reduce the retrieval time, it is therefore important to select only LMSEs that have at least one document satisfying the condition represented by the query. For every LMSE, LS has the set of all keywords appearing in at least one local document. We describe below how LMSEs are appropriately selected in feasible time based on this information.

Let $\{k_1, \ldots, k_n\}$ be the set of all keywords, and for each keyword $k_i$, let $x_i$ be a Boolean variable indicating whether $k_i$ appears in a document. In CSE, a retrieval query is given as a Boolean formula $f$ of variables $x_1, \ldots, x_n$, in which the AND, OR, and NOT operators are available. In the following, we think of a document as an n-dimensional vector whose i-th entry takes the value 1 if keyword $k_i$ appears in the document, and 0 otherwise. Then, for a retrieval query $f$, the target to be found is the set $\{ d \mid f(d) = 1 \}$ of documents.

For an LMSE $L$, let $d(L)$ denote the n-dimensional vector obtained by taking the bitwise-or over all documents in $L$. Then we can regard LS as having $d(L)$ for every LMSE $L$. However, it is impossible to determine from $d(L)$ exactly which documents are maintained in $L$. On the other hand, if a document $d$ is in $L$, then it follows from the definition of $d(L)$ that $d \le d(L)$ holds, where $d \le d(L)$ means that, for any $i = 1, \ldots, n$, the i-th entry of $d$ is less than or equal to the i-th entry of $d(L)$.
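The vector $d(L)$ and the relation $d \le d(L)$ can be illustrated with integer bitmasks. The following is only a sketch under assumptions: the keyword list and the helper names `doc_vector`, `site_vector`, and `dominated` are hypothetical and chosen for illustration.

```python
from functools import reduce

KEYWORDS = ["search", "engine", "cache", "index"]  # hypothetical k_1..k_n
BIT = {k: 1 << i for i, k in enumerate(KEYWORDS)}


def doc_vector(words):
    """Encode a document as an integer bit vector over KEYWORDS."""
    return reduce(lambda acc, w: acc | BIT[w], words, 0)


def site_vector(docs):
    """d(L): bitwise OR over all document vectors held by one LMSE."""
    return reduce(lambda acc, d: acc | d, docs, 0)


def dominated(d, d_l):
    """True if d <= d(L) entrywise, i.e. every bit set in d is also set in d(L)."""
    return (d & ~d_l) == 0


docs = [doc_vector(["search", "engine"]), doc_vector(["cache"])]
d_l = site_vector(docs)

# Every real document of the site is dominated by d(L).
print(all(dominated(d, d_l) for d in docs))
# A vector that is not an actual document can be dominated as well,
# which is why d(L) alone cannot recover the exact document set.
print(dominated(doc_vector(["search", "cache"]), d_l))
# A document with a keyword the site never saw is not dominated.
print(dominated(doc_vector(["index"]), d_l))
```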
Thus, we treat $\{ d \mid d \le d(L) \}$ as the set of documents maintained in $L$. For a given retrieval query $f$ and LMSE $L$, whether there exists a document in $\{ d \mid d \le d(L) \}$ that satisfies $f$ can be determined by testing whether there exists a document $d$ such that both $f(d) = 1$ and $d \le d(L)$ hold. Let $f^{*}$ be the Boolean function such that $f^{*}(d) = 1$ if and only if there exists $d' \le d$ with $f(d') = 1$; this is known as the minimal monotone function of $f$[11]. Then the above condition, that there exists a document $d$ such that both $f(d) = 1$ and $d \le d(L)$ hold, can be replaced by the simple condition $f^{*}(d(L)) = 1$.

Unfortunately, the problem of constructing $f^{*}$ from a given Boolean formula $f$ is NP-hard, and hence no polynomial-time algorithm for this problem is known. For example, expanding $f$ into a disjunctive normal form formula and then eliminating all negative literals (that is, negated variables) from the resulting formula yields $f^{*}$[11]. However, this algorithm is not feasible since, in general, the length of the disjunctive normal form of $f$ is exponentially larger than the length of $f$.

So, in CSE, LMSEs possibly having at least one document satisfying $f$ are selected by using a formula $f'$ obtained by simply eliminating all negated subformulas from the retrieval query $f$, instead of using the minimal monotone function $f^{*}$ of $f$. Clearly, $f'$ can be obtained from $f$ in linear time with respect to the length of $f$. Furthermore, the following two facts guarantee that all LMSEs selected by $f^{*}$ are also selected by $f'$.

Fact 1. Eliminating negated subformulas from $f$ does not change the value that $f$ takes for any document $d$ with $f(d) = 1$.

Fact 2. No NOT operators appear in $f'$, and hence, for any document $d$, if there exists $d' \le d$ such that $f'(d') = 1$, then $f'(d) = 1$ holds.

However, when $f'$ is used instead of $f^{*}$, LMSEs that have no document satisfying $f$ may be selected, because $f^{*}(d) = 0$ does not necessarily imply $f'(d) = 0$. As an extreme example, if $f = x_1$ AND (NOT $x_1$), then $f^{*} = 0$ (the constant function that always takes the value 0) and $f' = x_1$, which implies that all LMSEs with a document in which keyword $k_1$ appears are selected even though we can guarantee that none of these LMSEs has a document satisfying $f$.

In the next section, we experimentally examine the accuracy and efficiency of selecting LMSEs by $f'$. The methods we examine are as follows:

1. The method of simply constructing the minimal monotone function (SIMP): constructing the formula $\bigvee_{Y \subseteq X^{\pm}} \left( f|_{Y=1,\, X^{-} \setminus Y = 0} \wedge \bigwedge_{x \in Y} x \right)$, where $X^{-}$ and $X^{\pm}$ are the sets of all variables that are used negatively and both positively and negatively in $f$, respectively, and $f|_{V=1,\, W=0}$ denotes the Boolean function obtained from $f$ by setting every variable in $V$ to 1 and every variable in $W$ to 0.
2. The method of simply constructing the minimal monotone function after pruning down subformulas (PRN): constructing the minimal monotone function using SIMP after pruning $f$ of subformulas without changing the minimal monotone function.
3. Query based Site Selection (QbSS): simply eliminating all negated subformulas from $f$.

Note that the methods SIMP and PRN are guaranteed to output $f^{*}$, the minimal monotone function of $f$, in time exponential in the size of $X^{\pm}$ for $f$ and for the formula obtained by pruning $f$ of subformulas, respectively, whereas the method QbSS outputs $f'$ in linear time with respect to the length of $f$, as mentioned before.

4. Evaluations

First, we compare query lengths under the three kinds of monotonization: SIMP, PRN, and QbSS. The length of a query is the number of literals in it. There are two kinds of literals in a query: keywords and Boolean operators. A query is represented as a complete binary tree of depth N. Leaf nodes correspond to keywords, and inner nodes correspond to one of the three Boolean operators (and, or, and-not). The number of literals in an input query is $2^{N+1}-1$. An input query is converted by monotonization. Figure 2 shows the lengths of randomly generated queries and the lengths of the queries converted by the different monotonization methods. Here, orig, qbss, simp, and prun are the query length of the input query and the query lengths after QbSS, SIMP, and PRN, respectively. The size of the word set is fixed to 100. Although all query lengths grow exponentially, the query length of QbSS is the shortest among these methods.
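The experimental setup above, together with the QbSS conversion, can be sketched as follows. This is a rough illustration and not the authors' benchmark code: the tuple encoding of query trees, the random generator, the seed, and the function names are assumptions, and the printed lengths depend on the seed, so no particular values are implied.

```python
import random

OPS = ("and", "or", "and-not")


def random_query(depth, words, rng):
    """Build a complete binary query tree of the given depth."""
    if depth == 0:
        return rng.choice(words)
    return (rng.choice(OPS),
            random_query(depth - 1, words, rng),
            random_query(depth - 1, words, rng))


def length(query):
    """Number of literals (keywords plus operators) in the query tree."""
    if isinstance(query, str):
        return 1
    _, left, right = query
    return 1 + length(left) + length(right)


def qbss(query):
    """Drop every negated subformula: (A and-not B) becomes A."""
    if isinstance(query, str):
        return query
    op, left, right = query
    if op == "and-not":
        return qbss(left)
    return (op, qbss(left), qbss(right))


rng = random.Random(0)
words = [f"w{i}" for i in range(100)]  # word set of size 100, as in the text
for n in range(1, 7):
    q = random_query(n, words, rng)
    # Columns: depth N, original length 2^(N+1)-1, length after QbSS.
    print(n, length(q), length(qbss(q)))
```

A single traversal of the tree suffices, which is where the linear running time of QbSS comes from.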
The query length of SIMP (the minimum monotonization, min) becomes too large if N>5. The query length of PRN also becomes too large if N>6. From this result, we conclude that QbSS is suitable for various N.

Figure 2. Query lengths in monotonization methods

Query length depends on the size of the word set. Figure 3 shows the query length of SIMP for various word sets. Here, min W is the series in which the size of the word set is W.

Figure 3. Query lengths in SIMP method

First, we discuss the case in which there are few words in a query (W=10). As the number of literals grows, the same words appear repeatedly in a query, so monotonization has a reducing effect. Next, we discuss the case in which there are many words in a query (W=1000). In this case, the query is rarely monotonized because few words occur more than once in the query. Finally, in the case of W=100, no reduction can be expected because the query is frequently monotonized between dependent words. The pruned method (PRN) behaves the same as the minimum monotonization method. QbSS, however, reduces query length stably over a wide range of W.

Next, we compare the processing times of the three monotonization methods (see Figure 4). The processing time of QbSS is O(n) because QbSS is computed by traversing the query tree. The processing time of SIMP is $O(2^n)$. PRN is slower than SIMP if N<6. However, the processing time of PRN grows more gently than that of SIMP. PRN is $O(2^n)$ in the worst case because it uses SIMP internally.

Figure 4. Processing times of monotonization methods

Next, we compare the site selection effects of the three monotonization methods. Figure 5 shows the relationship between the size of the word set (#words) and the number of sites (#sites) selected using queries converted by the three methods. Here, the number of sites is 100, the number of documents in a site is 10, and the number of words in a document is 10. The number of selected sites depends on the size of the word set. QbSS cannot select sites effectively if the size of the word set is small. In such a case, the same word is used many times in a query or a document, so few words remain in a query monotonized by QbSS, and such a query matches almost all sites. Moreover, the number of sites selected by QbSS is larger than with the other methods, by 20-30%.

Figure 5. #words vs #sites in the case of N=6
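Why QbSS can select more sites than the minimal monotone function can be seen in a small sketch that compares the two tests on one site. This is an illustration under assumptions rather than the evaluation code: queries are encoded as tuples, a site is represented by the set of keywords set in $d(L)$, and the exact test simply enumerates every $d \le d(L)$ by brute force, which is exponential and only workable for tiny examples.

```python
from itertools import combinations


def evaluate(query, doc):
    """Evaluate a query tree on a document given as a set of keywords."""
    if isinstance(query, str):
        return query in doc
    op, left, right = query
    a, b = evaluate(left, doc), evaluate(right, doc)
    if op == "and":
        return a and b
    if op == "or":
        return a or b
    return a and not b  # "and-not"


def qbss(query):
    """Drop every negated subformula: (A and-not B) becomes A."""
    if isinstance(query, str):
        return query
    op, left, right = query
    if op == "and-not":
        return qbss(left)
    return (op, qbss(left), qbss(right))


def selected_exact(query, site_words):
    """Minimal monotone test by brute force: does some d <= d(L) satisfy f?"""
    items = sorted(site_words)
    return any(evaluate(query, set(c))
               for r in range(len(items) + 1)
               for c in combinations(items, r))


def selected_qbss(query, site_words):
    """QbSS test: evaluate the negation-free query on d(L) itself."""
    return evaluate(qbss(query), site_words)


site = {"x1", "x2"}
f = ("and-not", "x1", "x1")  # the extreme example: x1 AND (NOT x1)
print(selected_exact(f, site), selected_qbss(f, site))  # prints: False True
```

Whenever the exact test selects a site, the QbSS test selects it too, so QbSS never misses a site; it can only add sites that turn out to hold no matching document.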
Although QbSS may not select sites as precisely as the other methods, we conclude that QbSS is efficient because it runs in O(n). Furthermore, the effect of QbSS is the same as that of the other methods if the size of the word set is large. Now consider scaling Figure 5 by a factor of $10^4$. A network with #words=$10^5$ and #sites=$10^6$ is very large; only the Internet is a network of this scale. Unfortunately, QbSS is not suitable for such a large-scale network. However, QbSS is useful for a middle-scale network with #words=$10^5$ and #sites=$10^5$. Furthermore, QbSS is suitable for small-scale networks such as enterprise intranets because QbSS can narrow the selection down to 10% of sites on average.

In the above discussion, we assume that words are uniformly distributed. In practice, however, the distribution of words is not balanced. In such a case, QbSS can reduce the number of selected sites further. In addition, the retrieval time may be reduced by sorting the keywords in a query in the following order: concrete and clauses (e.g. the keyword string is long), abstract and clauses (e.g. the keyword string is short), abstract and-not clauses, and concrete and-not clauses.

5. Related Works

Many researchers have already studied distributed information retrieval and have developed systems such as Archie, WAIS, and Whois++. These are not search engines for Web pages. However, Forward Knowledge (FK), which was introduced by Whois++, is a basic idea for distributed information retrieval. Several FK-based distributed Web page retrieval systems, such as Harvest and Ingrid, have been developed.

In Whois++[12], FKs are grouped as a centroid, and each server transfers queries by using FK if it does not know their destinations. This is known as query routing.

The best-known research on distributed information retrieval is probably Harvest[13]. Harvest consists of Gatherers and Brokers. A Gatherer collects documents, summarizes them as SOIF (Summary Object Interchange Format), and transfers it to a Broker.
SOIF is the summary of a document, which consists of the author's name, title, key words and so on. In practice, a Gatherer needs to send almost the full text of the collected documents to a Broker, because the full text of a document must be included in SOIF for Harvest's full text search. A Broker makes an index internally. A Broker accepts a query and retrieves results by cooperating with other Brokers. In Harvest, both Glimpse and Nebula are employed as search engines, which actually make the indexes and perform the search. The index size of Glimpse is very small, and Nebula can search documents very fast. In Harvest, the Gatherer itself can access documents directly. However, because the Gatherer does not make an index, it needs to send the collected content to a Broker for indexing. Therefore, Harvest cannot reduce the update interval as much as CSE.

Ingrid[14] is an information infrastructure developed by NTT, which aims to realize topic-level retrieval. Ingrid links collected resources to each other and builds its own topology. Forward Information (FI) servers manage this topology. The Ingrid navigator communicates with FI servers in order to search for the way to a resource. Ingrid is flexible, but its communication latency is long because the way is searched sequentially. In CSE, only LS searches for the way, so LS may become a bottleneck, but the communication latency is short.

6. Conclusions

In this paper, we proposed Query based Site Selection (QbSS) as a site selection method for new queries. QbSS can narrow the selection down to 10% of all sites in a middle-scale network such as an enterprise network. Although a query monotonized by QbSS is not minimal, QbSS is sufficient in practice. The computing time of QbSS, O(n), is shorter than the computing time of minimum monotonization, $O(2^n)$, while the site selection effect of QbSS is the same as that of minimum monotonization when the word set is large. Therefore, QbSS is efficient, and it increases the scalability of CSE.

Acknowledgement

This research was cooperatively performed as a part of the Mobile Agent based Web Robot project at Toyo University and as a part of Scalable Distributed Search Engine for Fresh Information Retrieval ( ) in a Grant-in-Aid for Scientific Research promoted by the Japan Society for the Promotion of Science (JSPS).
References

[1] Google, Google Information for Webmasters.
[2] Nobuyoshi Sato, Minoru Uehara, Yoshifumi Sakai, Hideki Mori, Distributed Information Retrieval by using Cooperative Meta Search Engines, in Proceedings of The 21st IEEE International Conference on Distributed Computing Systems Workshops (Multimedia Network Systems, MNS2001).
[3] Nobuyoshi Sato, Minoru Uehara, Yoshifumi Sakai, Hideki Mori, Fresh Information Retrieval using Cooperative Meta Search Engines, in Proceedings of the 16th International Conference on Information Networking (ICOIN-16), Vol.II, pp.7a-2-1 7.
[4] The Namazu Project, Namazu.
[5] Nobuyoshi Sato, Takashi Yamamoto, Yoshihiro Nishida, Minoru Uehara, Hideki Mori, Look Ahead Cache for Next 10 in Cooperative Search Engine, in Proceedings of DPSWS 2000, IPSJ Symposium Series, Vol.2000, No.15, 2000 (in Japanese).
[6] Nobuyoshi Sato, Minoru Uehara, Yoshifumi Sakai, Hideki Mori, Fresh Information Retrieval in Cooperative Search Engine, in Proceedings of 2nd International Conference on Software Engineering, Artificial Intelligence, Networking & Parallel / Distributed Computing 2001 (SNPD '01), Nagoya, Japan.
[7] Nobuyoshi Sato, Minoru Uehara, Yoshifumi Sakai, Hideki Mori, Score Based Site Selection in Cooperative Search Engine, in Proceedings of DICOMO 2001, IPSJ Symposium Series, Vol.2001, No.7, 2001 (in Japanese).
[8] Nobuyoshi Sato, Minoru Uehara, Yoshifumi Sakai, Hideki Mori, Global Shared Cache in Cooperative Search Engine, in Proceedings of DPSWS 2001, IPSJ Symposium Series, Vol.2001, No.13, 2001 (in Japanese).
[9] Nobuyoshi Sato, Minoru Uehara, Yoshifumi Sakai, Hideki Mori, Persistent Cache in Cooperative Search Engine, in Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems Workshops (Multimedia Network Systems and Applications, MNSA 2002).
[10] Yoshifumi Sakai, Nobuyoshi Sato, Minoru Uehara, Hideki Mori, The Optimal Monotonization for Search Queries in Cooperative Search Engine, in Proceedings of DICOMO2001, IPSJ Symposium Series, Vol.2001, No.7, 2001 (in Japanese).
[11] N. H. Bshouty, Exact learning Boolean functions via the monotone theory, Information and Computation, Vol.123.
[12] C. Weider, J. Fullton, S. Spero, Architecture of the Whois++ Index Service, RFC1913.
[13] C. Mic Bowman, Peter B. Danzig, Darren R. Hardy, Udi Manber, Michael F. Schwartz, The Harvest Information Discovery and Access System, in Proceedings of the 2nd WWW Conference, earching/schwartz.harvest/schwartz.harvest.html
[14] Nippon Telegraph and Telephone Corp., Ingrid.
