Database Selection and Keyword Search of Structured Databases: Powerful Search for Naive Users


Mohammad Hassan, Reda Alhajj, Mike J. Ridley, Ken Barker

School of Informatics, Bradford University, Bradford, West Yorkshire, BD7 1DP, United Kingdom
m.j.ridley@scm.brad.ac.uk

ADSA Lab, Computer Science Dept, University of Calgary, Calgary, Alberta T2N 1N4, Canada
{alhajj, barker}@cpsc.ucalgary.ca

Abstract - The main target of the work described in this paper is to provide a powerful approach for naive users to search structured databases. Such a study is necessary, especially to satisfy web users who expect the ability to access all web contents in a unified way, regardless of the structure of the available information. Given a set of distributed structured databases and a query which consists of a set of keywords connected by logical operators, the approach proposed in this paper adapts both web text file search techniques and information retrieval techniques to rank the existing databases based on their relevance to the posed query. For each keyword, the user specifies a level of search, which may be column, record, or table. We developed an estimation method with statistical foundations to estimate the usefulness of individual relational databases. The system gives a hint of what databases might be useful for the user's query, based on word-frequency information kept for each database. Some experiments have been conducted to demonstrate the effectiveness of the proposed method in determining promising sources for a given query. As naive end-user satisfaction is a main target and motive, we developed a prototype system with a user-friendly web-based interface that accomplishes our goals in a simple and powerful way.

1 Introduction

The way we search for information is one of the most important differences between relational and text or document databases, at least from the end user's perspective.
It is more complicated to access structured databases, especially in a dynamic environment like the web, where varying or unknown database structures make query formulation a very difficult task. An approach taken in some cases is to export data from structured databases to web pages and then provide text search on the web documents. This approach duplicates data, which raises the problem of keeping the versions up to date, in addition to space and time overheads. Further, it is not feasible to export every possible combination, especially with highly connected data.

The problem handled in this paper can be stated as follows. Consider a set of distributed heterogeneous relational databases DB and a query q, which consists of keywords connected by logical operators; it is required to allow users to search, access, and eventually manipulate certain information by using a keyword search to submit q, such that no a priori knowledge of the database schemas or the place of the required information is necessary. The system selects a subset of DB, which consists of good candidate databases for submitting q. To make this selection, the system uses an estimator function, which assesses how good each database in DB is with respect to q. It is also required to rank the databases, i.e., decide on the order in which they should be visited in order to provide the most effective response to q.

Our solution to this problem involves building a server that can suggest databases to search and the order in which they should be visited. The system gives a hint of what databases might be useful for the user query, based on word-frequency information kept for each database; this process is known as the database selection problem. These are hard problems that have already been addressed by many other researchers, e.g., [8]. The mainstream approaches in the development of database selection techniques were document or text oriented.
Our approach described in this paper addresses structured relational databases. We present a novel architecture for a global information system, which contains an arbitrary number of distributed, heterogeneous, autonomous relational databases. It enables users to find information of interest and the databases that contain the targeted information without a priori knowledge about the structure of the databases involved or the location of the required information. The databases are ranked according to retrieval function computation. The proposed mechanism accomplishes these goals without affecting the autonomy and heterogeneity of the participating databases.

0-7803-8242-0/03/$17.00 © 2003 IEEE

The database selection problem can be described as follows. Given a query and a set of databases, we wish either to select a subset of databases to which we will send the query, or to order the databases such that we may send the query to the databases in that order or send the query to the top-ranked n databases. We follow the latter interpretation. We view the database selection problem as producing a ranked list of a given set of databases, in decreasing order, according to their estimated potential usefulness to a given query.

The rest of the paper is organized as follows. The related work is summarized in Section 2. An overview of the proposed system is given in Section 3. The representation of the database related information is covered in Section 4. Query representation together with the estimator function and the database selection process are presented in Section 5. Section 6 includes a summary and the conclusions.

2 Related Work

Adapting keyword search to structured databases has already attracted the attention of several researchers. However, existing approaches mainly concentrate on searching a single database. They also have one thing in common: almost all of them use a graph to implement the basic database representation. For instance, a framework for keyword search on databases when the schema is not known to the user is presented in [11]. The main limitation of this work is that all keywords must be contained in the same tuple. The approach described in [7] deals with the problem of keyword search over XML documents. Goldman et al. [9] address the problem of proximity search over semi-structured stores. They restrict results to tuples from one relation near a set of keywords.
DataSpot [5] is a commercial system that supports keyword-based search by extracting the contents of the database into a hyperbase; it duplicates the contents of the database, making data integrity and maintenance difficult. Other approaches include DISCOVER [10], DBXplorer [1], BBQ [12], as well as those described in [3] and [13].

Note that the growth in online resources demands efficient techniques for improving the search space in a distributed environment [6], i.e., database selection is a fundamental problem in distributed search. As the number of databases involved in the process increases, sending each query to all databases is no longer a feasible and reasonable approach. The primary goal is to select a small set of databases to send a query to, without sacrificing retrieval effectiveness.

Our proposed solution to the usefulness estimation problem is an extension of the approach employed in GlOSS [8]. We took the level of search and the structure of the relational database into account. The frequency information that GlOSS keeps about the databases is different from ours; our system keeps information related to the granularity of each relational database. Also, the final query result, or the chosen set, according to GlOSS is different from the one we are interested in. Finally, the mainstream approaches in the development of IR systems and database selection techniques, including GlOSS, are document or text oriented. Our system has been implemented as a structured relational database application that enables users to search through the granularity of a relational database in a distributed environment.

3 An Overview of the Proposed Approach

Given a set of relational databases, we can provide efficient and effective access to them by applying the following approach.

1- Each site constructs its Master Relation, which is a single table that combines all the data required to be searched within the corresponding database.
Then, each site computes its local frequencies, at each level of the database, for each term in this master relation, and the summary index that contains the total number of elements at each level. The system requires that each database cooperates and periodically updates these frequencies, following some predefined protocol. Both the master relation and the statistical information (frequencies) could easily be built mechanically in any relational database management system. We developed and implemented an application for building these indexes as part of our prototype.

2- A central database selection server imports these frequencies from each remote database. We developed and implemented an application to import this information from distributed remote databases.

3- The server creates a union database frequency and a union summary index. We developed and implemented an application to build this union database frequency and the union summary index. Note that the estimation method depends on these unions during the database selection process.

4- Given a query q, which consists of a set of keywords, logical operators, search level and search method, the conceptual database selection algorithm proceeds as follows:
• Compute the keywords estimator according to the specified level and the specified logical operators.
• Compute/select the relevant set according to the specified search method.
• Sort the relevant set of databases selected in the previous step, and return the result as specified by the search method.

All of the above enumerated steps have been incorporated in a prototype, which has been developed and implemented as a user-friendly web-based application.

4 Representation of Database Related Information

In a distributed search, different types of database representatives for each database may be available to the searching system. These representatives, which characterize the contents of the database, are used to estimate the usefulness of a database to a given query by applying a certain database selection algorithm. Depending on the kind of information and the database representatives available, different approaches have been proposed to estimate the usefulness of a database. Most approaches require information about the terms that appear in the database and the statistical information related to these terms, such as term frequencies, document frequencies and term weights. One of the most popular representatives used in distributed search systems and information retrieval systems is the inverted file.

To serve our purpose, we need to record whether a keyword could represent a table, a record or an attribute. In traditional IR systems, such a distinction is not required because IR systems ignore the structure; they mainly depend on a document as the granularity. As a result, we propose to use an extended inverted file that stores information at row, column and table level granularity. This extended inverted file is part of the required database representative, which we call the database word-frequency information; it is created as a table within a relational database.
While calculating these frequencies, we count each record that contains the considered word only once, even if the word appears more than once within the record. The same applies to columns and tables. We may reduce the words to their stems using a stemming algorithm. We may also leave out very frequent words with little semantics using a stopping algorithm. Another schema is also needed as part of the database representative to store the total number of elements at each level of the master relation for each database, which we call the summary index. These two relations represent the required database representative. Our approach depends on this structure to create the union frequency information, which is required for estimating the usefulness of a database with respect to a given query.

Database administrators are also responsible for creating this database word-frequency information and the summary index in their own local systems, to be extracted by the server as a next step. We developed an application for building these relations as part of our prototype. Note that creating the master table in a relational database environment can be done easily using either SQL or some QBE-like interface. Creating the database word-frequency information and the summary index could also be done mechanically using simple algorithms that work on the already described master table.

After the database word-frequency information has been prepared at the local sources, the server can extract it directly from each local source; this depends on a predefined protocol, which assumes a cooperative system. We developed and implemented an application to import this information from the distributed remote databases participating in the system. As a result, this application produces union frequency information that consists of two relations, namely the union database word-frequency information and the union summary index.
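The counting rule above can be sketched as follows. This is a minimal illustration, not the authors' application: the row layout (a source table name plus a column-to-text mapping), the whitespace tokenizer, and the name `build_local_index` are all assumptions, and stemming and stop-word removal are omitted.

```python
from collections import defaultdict

LEVELS = ("t", "c", "r")  # table, column, record


def build_local_index(master_rows):
    """Return (word_freq, summary) for one database.

    master_rows: iterable of (table_name, {column_name: text}) pairs.
    This row layout is an assumption made for the sketch.
    """
    tables, columns = set(), set()
    n_records = 0
    # For each word, the set of elements at each level that contain it.
    seen = defaultdict(lambda: {"t": set(), "c": set(), "r": set()})
    for rec_id, (table, row) in enumerate(master_rows):
        n_records += 1
        tables.add(table)
        for column, text in row.items():
            columns.add((table, column))
            for word in str(text).lower().split():
                seen[word]["t"].add(table)
                seen[word]["c"].add((table, column))
                seen[word]["r"].add(rec_id)  # a set, so repeats count once
    # Database word-frequency information: (word, level) -> element count.
    word_freq = {(w, l): len(s[l]) for w, s in seen.items() for l in LEVELS}
    # Summary index: total number of elements at each level.
    summary = {"t": len(tables), "c": len(columns), "r": n_records}
    return word_freq, summary


rows = [
    ("cars", {"make": "Honda", "colour": "Red"}),
    ("cars", {"make": "Honda", "colour": "Blue"}),
    ("bikes", {"make": "Honda", "note": "red frame, red seat"}),
]
word_freq, summary = build_local_index(rows)
```

In the third row the word "red" occurs twice but contributes a single record to the record-level count, which is the behaviour the text describes.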
Building these relations is also an easy task, and could be done mechanically. We developed an application to build these relations. After building the union frequency information, the system server keeps the following information about each database:

• ls(db), the level size of database db, which is the total number of elements at each level l of db, ∀ db ∈ DB and l ∈ {t, c, r}, where t, c and r stand for table, column and record, respectively. (The union summary index)
• f(w,l,db), the number of elements at each level l of database db that contain keyword w, ∀ db ∈ DB, l ∈ {t, c, r}, and for all keywords w. (The union database word-frequency information)

This information facilitates applying the database selection process and estimating the usefulness of a database with respect to a given query. The value of f(w,l,db) is the size of the result of the query: find keyword w at level l of database db.
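A minimal sketch of how a server might hold these two union relations in memory. The dictionary layout and the names `build_union` and `f` are assumptions for illustration; the paper stores both relations as tables in a relational database. Entries that are absent are read as zero.

```python
def build_union(local_indexes):
    """Merge per-database exports into the two union relations.

    local_indexes: {db_name: (word_freq, summary)}, where word_freq maps
    (word, level) -> count and summary maps level -> level size.
    This input shape is an assumption made for the sketch.
    """
    union_freq, union_summary = {}, {}
    for db, (word_freq, summary) in local_indexes.items():
        for (word, level), count in word_freq.items():
            if count:  # zero counts are simply not stored
                union_freq[(word, level, db)] = count
        for level, size in summary.items():
            union_summary[(level, db)] = size
    return union_freq, union_summary


def f(union_freq, word, level, db):
    """f(w, l, db): a missing entry is taken to be zero."""
    return union_freq.get((word, level, db), 0)


local_indexes = {
    "db1": ({("honda", "r"): 50, ("red", "r"): 100, ("toyota", "r"): 0},
            {"r": 500}),
}
union_freq, union_summary = build_union(local_indexes)
```

Here `f(union_freq, "toyota", "r", "db1")` reads as zero without any row being stored for it.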

The system does not need to store the value of f(w,l,db) explicitly if it is zero. This way, if no information is found by the system about the value of f(w,l,db), then it is assumed to be zero. The database selection process is presented in the next section, after introducing the query representation supported by the proposed approach.

5 A Closer Look at the Proposed Approach

5.1 Query Representation

Our approach considers boolean queries that consist of atomic subqueries connected by the boolean AND operator, denoted ∧. However, OR and NOT queries can be handled in the same way. An atomic subquery is a single keyword. Since all words in the searchable level of each database have to be explicitly integrated in its database word-frequency information, supporting phrase search would lead to an enormous overhead. Therefore, our implementation only considers single-word subqueries. So, a query in our model can be represented by an unordered subset of W, where W = {w1, w2, ..., wn} is the set of all words in the union database word-frequency information relation. Since we are searching relational databases, the word(s) may be associated with what we call a level of search, which is a structure within relational databases, such as table, column or values. The set of these levels is denoted L = {t, c, r}, and the definition of query q is extended to incorporate the words and the associated level as follows:

$$q = \{([w_1, w_2, \dots, w_j], l) \mid w_i \in W,\ l \in L\}, \qquad j = 1, \dots, n$$

Example 1 Consider the following statement that represents a query in our system: q = {([Honda, Red], r)}, which is equivalent to: find Honda ∧ Red within record level. This query looks for the two atomic subqueries (keywords), namely Honda and Red, and the record level has been specified as the search level. Note that it is also possible to specify different levels of search for individual words in the same query.
For a non-experienced end-user, levels can be named in a more intuitive manner; some hints or classification about the semantics of these levels have been added to the user interface. We consider only boolean queries because most current commercial online services and information vendors worldwide, as well as traditional library systems, support boolean query models to access their databases, offering well-maintained information systems in a number of fields such as science, business, and law [4]. Also, more web search engines, e.g., AltaVista, Excite, and WebCrawler, are adopting boolean queries in their advanced interfaces. Other important reasons for adopting the boolean model are the size of the database representative it requires, and the fact that boolean queries can be expressed well when searching relational databases by utilizing RDBMS functionality.

5.2 Estimated Merit of a Query

Consider ls(db) and f(w,l,db) for a set of databases DB, which we intend to search in order to satisfy some query q; we assume that each database db ∈ DB has some merit with respect to query q, denoted m(q,db). Merit could be defined as the actual number of elements that are present at the specified level of db and satisfy query q. Database selection algorithms do not know the actual merit of a database with respect to a query. Rather, they provide a means of estimating the merit. Such estimates are used to produce an estimated ranking, which may not be the same as the desired ranking; the baseline ranking is based on actual merit. One approach to evaluating a database selection technique determines the degree to which the technique is able to produce an estimated database ranking that approximates the desired baseline ranking. Each database db ∈ DB has an estimated merit with respect to the given query q, denoted Em(q,db), which is computed by a database selection algorithm. This estimated merit estimates the potential usefulness of a database for the given query.
We expect that Em(q,db) is an attempt to implicitly or explicitly estimate the actual merit m(q,db). Ideally, the estimated merit equals the actual merit, i.e., Em(q,db) = m(q,db).

To illustrate these definitions, consider a set of six databases DB = {db1, db2, db3, db4, db5, db6} that we wish to search in order to satisfy some query q, and assume that each database db ∈ DB has some merit with respect to q; the merit values are 3, 0, 6, 2, 0, and 4, respectively. As we would like to search the databases in order of decreasing merit with respect to q, for this example we would like to report the databases in the following order: (db3, db6, db1, db4, db2, db5). Because db2 and db5 have no merit with respect to q, we should exclude them from the produced estimated ranking list and report only (db3, db6, db1, db4).
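The baseline ranking in this example can be checked with a few lines of Python. The dictionary of merits simply restates the six values given above; the variable names are illustrative.

```python
# Actual merits m(q, db) from the six-database example.
merits = {"db1": 3, "db2": 0, "db3": 6, "db4": 2, "db5": 0, "db6": 4}

# Drop databases with zero merit, then order by decreasing merit.
ranked = sorted(
    (db for db, m in merits.items() if m > 0),
    key=lambda db: merits[db],
    reverse=True,
)
print(ranked)  # ['db3', 'db6', 'db1', 'db4']
```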

The tasks of the database selection algorithm are:
1) Calculate the estimated merit for each database with respect to the given query;
2) Select only databases that have non-zero estimated merit; and
3) Rank the databases selected in step 2, based on their estimated merits, to produce the desired estimated database ranking list.

More details about each of these tasks are presented next, starting with the estimator function used to calculate the estimated merit for each database with respect to a given query.

5.3 The Estimator Function

The estimator is a function that estimates the merit of a database with respect to a given query. So, given the frequencies f(w,l,db) and the level size ls(db) for a set of databases DB, the system uses the estimator function to assess, for each database db in DB, how many elements at level l of db satisfy query q. This estimator is built on the assumption that keywords appear in the various elements of any level of a database following an independent and uniform probability distribution. By uniform distribution we mean that the events in any sample space are "equally likely" to occur. Saying that events A and B are independent means that the occurrence or non-occurrence of event B provides no information about whether event A has also occurred. The independence of events A and B can be stated as follows. Two events, A and B, are independent if and only if P(A ∩ B) = P(A)P(B), where P(A) and P(B) are the probabilities of events A and B, respectively. This equation can be extended to n events in a straightforward manner.

Such estimators are usually called independent estimators [8]. Under this assumption, given a database db, the total number of elements at level l of db, and any n keywords w1, ..., wn, the probability that any element at level l contains all of w1, ..., wn is given by:

$$\frac{f(w_1, l, db)}{ls(db)} \times \dots \times \frac{f(w_n, l, db)}{ls(db)}$$

The estimated number of elements which are present at level l in db and satisfy the query: find w1 ∧ w2 ∧ ... ∧ wn within level l, is given according to this type of estimator as:

$$\mathrm{Estimate}(\mathrm{find}[w_1 \wedge w_2 \wedge \dots \wedge w_n, l], db) = \frac{\prod_{i=1}^{n} f(w_i, l, db)}{ls(db)^{\,n-1}} \qquad (2)$$

The produced estimated value represents the estimated merit of the database with respect to the given query, i.e., Em(q,db).

Table 1: Portion of the database frequency information kept by the server for four databases db1, db2, db3, db4

Example 2 Consider four databases db1, db2, db3, db4, and suppose that the server has collected the corresponding statistics, a portion of which is given in Table 1. Further, assume that the system received the query: q = {([Honda, Red], r)}, which is equivalent to: find "Honda" ∧ "Red" within record level. This query searches for records that contain both words, Honda and Red; the system estimates the number of matching records in each of the four databases in the following way. Database db1 contains 500 records, 50 of which contain the word Honda. Therefore, the probability that a record in db1 contains the word Honda is 0.1. Similarly, the probability that a record in db1 contains the word Red is 0.2. Under the assumption that words appear independently at the record level, the probability that a record in db1 contains both words is 0.1 × 0.2 = 0.02. Consequently, we can estimate the merit of db1 with respect to q as:

$$Em(q, db_1) = \frac{50}{500} \times \frac{100}{500} \times 500 = 10 \text{ records}$$

Similarly,

$$Em(q, db_2) = \frac{100}{1000} \times \frac{150}{1000} \times 1000 = 15,$$

$$Em(q, db_3) = \frac{80}{700} \times \frac{0}{700} \times 700 = 0,$$

and

$$Em(q, db_4) = \frac{20}{1500} \times \frac{80}{1500} \times 1500 = 1.067 \text{ records}.$$

5.4 The Selection Function for Relevant Search

Once the estimated merit Em(q,db) has been calculated for each database, we can determine the databases that should be included within the final estimated list, which we intend to report to users, as follows:

$$S_R(q, DB) = \{db \in DB \mid Em(q, db) > 0\} \qquad (3)$$

Equation 3 identifies only those databases that have positive values for their estimated merits. These values assess the matching information based on the specified level. The subset of databases identified by Equation 3 is called the Estimated Relevant databases; they are produced by a search process called the Relevant Search method. Relevance is usually defined in the context of IR as an abstract measure of how well a source (document/database) satisfies user information needs. Ideally, a relevant source is one that would interest the user who issued the query. Unfortunately, this is a subjective notion and difficult to quantify; also, it is not possible to know its exact value. One way to address this problem is to consider as relevant any database matching the user query. We think it is fair in our approach to simply look at databases with matching level elements as relevant databases. Therefore, we define Relevant databases to be the set of databases that have positive actual merits, i.e., a positive number of level elements matching q. Formally,

$$R_m(q, DB) = \{db \in DB \mid m(q, db) > 0\} \qquad (4)$$

In order to evaluate the selection function, we need to compare the estimated relevant subset of databases, S_R(q, DB), against the actual relevant subset of databases with respect to q, i.e., R_m(q, DB). Concerning Example 2, we can determine the selected set using Equation 3, and the obtained result is the subset of databases with positive merits, i.e., (db1, db2, db4). This subset is also considered as the estimated relevant databases. The final step in the database selection algorithm is to rank the selected subset of databases according to the estimated merits in decreasing order, and the reported final ranked set of databases is (db2, db1, db4).
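The estimator and the relevant search can be sketched together as below, using the record-level counts that appear in the worked calculations of Example 2. The function name `estimate` and the tuple layout of `stats` are illustrative assumptions. For db2 to db4, which of the two counts belongs to Honda and which to Red is inferred from the order of the factors in the example; the product is the same either way.

```python
from math import prod


def estimate(freqs, level_size):
    """Independence estimator: product of f(w_i, l, db) over ls(db)**(n - 1)."""
    if not freqs:
        raise ValueError("at least one keyword frequency is required")
    if level_size == 0:
        return 0.0
    return prod(freqs) / level_size ** (len(freqs) - 1)


# db -> (number of records, records containing Honda, records containing Red)
stats = {
    "db1": (500, 50, 100),
    "db2": (1000, 100, 150),
    "db3": (700, 80, 0),
    "db4": (1500, 20, 80),
}

# Estimated merit Em(q, db) for q = {([Honda, Red], r)}.
merit = {db: estimate([honda, red], n) for db, (n, honda, red) in stats.items()}

# Relevant search: keep positive estimates, rank in decreasing order.
relevant = sorted(
    (db for db in merit if merit[db] > 0), key=merit.get, reverse=True
)
print(relevant)  # ['db2', 'db1', 'db4']
```

The estimates come out as 10, 15, 0 and roughly 1.067, so db3 is dropped and the remaining databases are reported in the order given in the text.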
5.5 The Selection Functions for the Other Search Methods

The result produced using the selection function in Equation 3 and the corresponding database selection algorithm described above is based on the assumption that the user is interested in the set of all relevant databases. However, users may be interested in different types of results. For instance, the user may be interested only in the best database for the given query, which is the database that contains more matching elements than any other database. There are several reasons that may impose choosing this type of result, such as limited resources, e.g., time, space and money. In other words, the user wants to submit the query only to the single database that is expected to return the highest result. An intermediate case between reporting all relevant databases and reporting only the best database is to report a set of best databases, i.e., users may be interested in the set of best databases that have matching elements. This is an important case, especially when the estimator predicts more than one database to have estimated merit values very close to, i.e., within a threshold from, the highest estimated merit with respect to query q. This case neglects databases with low merit values compared to the highest merit.

5.5.1 Best database search method

Consider frequencies f(w,l,db) and ls(db) for a set of databases DB, which we wish to search in order to satisfy some query q; the target is to search only the best single database that contains information relevant to q, i.e., the database that has the maximum estimated merit. The steps of the algorithm that achieves this target are:
1) Calculate the estimated merit for each database with respect to the given query (as described before).
2) Select and report only the database that has the maximum non-zero estimated merit.
In the first step, the estimated merit for each database is computed using the estimator function in Equation 2. To decide on the database to be reported, we apply the selection function:

$S_m(q, DB) = \{ db \in DB \mid \mathit{Em}(q, db) > 0 \land \mathit{Em}(q, db) = \max_{db' \in DB} \mathit{Em}(q, db') \}$ (5)

Equation 5 identifies a single database that has the maximum positive value for its estimated merit. We consider the database identified by Equation 5 as the Estimated Best Database; it is produced by a search process known as the Best Search Method. To evaluate this selection function, we need to compare the estimated best database produced by this function against the actual best database with respect to query q. We define the Best Database, denoted $B_m(q, DB)$, as the database that has the maximum positive actual merit, i.e., the highest positive number of elements matching query q. Formally,

$B_m(q, DB) = \{ db \in DB \mid m(q, db) > 0 \land m(q, db) = \max_{db' \in DB} m(q, db') \}$ (6)
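The best database selection (positive estimated merit that also equals the maximum over all databases) can be sketched as follows. The function name `best_search` and the input dictionary are hypothetical, and the merits are the Example 2 estimates used in this paper. Because the selection is written as a set, the sketch returns every database that attains the maximum, which is a single database unless there is an exact tie.

```python
from typing import Dict, List


def best_search(estimated_merits: Dict[str, float]) -> List[str]:
    """Return the database(s) whose estimated merit is positive and maximal."""
    # Only databases with positive estimated merit are candidates.
    positive = {db: merit for db, merit in estimated_merits.items() if merit > 0}
    if not positive:
        return []  # No database matches the query at all.
    top = max(positive.values())
    # Keep every candidate that attains the maximum (one, barring exact ties).
    return [db for db, merit in positive.items() if merit == top]


# Estimated merits for Example 2 (hypothetical container; values from the paper).
example_2 = {"db1": 10.0, "db2": 15.0, "db3": 0.0, "db4": 1.067}
print(best_search(example_2))  # ['db2']
```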

To illustrate this, again consider Example 2. If the user is interested only in the best database that contains the required information, then the estimated merit must be computed for each database in the same way as for the relevant search method. We can then determine the estimated best database using Equation 5; as a result, {db2} is selected and reported to the user.

5.5.2 Best-group search method

Consider database frequencies f(w, l, db) and IS(db) for a set of databases DB, which we wish to search in order to satisfy some query q; we would like to search the subset of the best databases that satisfy query q. In this case, the desired behavior of the database selection algorithm is to select and report the database with the maximum estimated merit along with all databases whose estimated merit is within a threshold of the maximum. So, the selection function in this case should be extended to make the definition of the best database looser, by also letting databases with estimated merit close to the maximum be considered as best databases. The database selection algorithm for this search method proceeds as follows:
1) Calculate the estimated merit for each database with respect to the given query.
2) Specify the maximum non-zero estimated merit.
3) Select databases that have estimated merit close to the maximum.
4) Rank the databases selected in step 3 based on their estimated merits, and produce the desired estimated database ranking list.

In the first step, the estimated merit for each database is computed using the estimator function in Equation 2. Then, the maximum estimated merit is specified, and to select the databases to be reported to the user, we apply the selection function of Equation 7, defined for a given threshold ε ≥ 0. Equation 7 identifies a subset of databases that have the highest positive estimated merits.
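The four steps of the best-group search method can be sketched as below. The exact closeness test is an assumption made for this sketch: a database is kept when the relative difference between the maximum estimated merit and its own estimated merit, (max − merit) / max, is at most the threshold ε. This form was chosen because it is consistent with the paper's description of the threshold and its Example 2 calculation, not because the text here confirms it. The function name `best_group_search` and the input dictionary are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def best_group_search(
    estimated_merits: Dict[str, float], epsilon: float
) -> List[Tuple[str, float]]:
    """Rank databases whose estimated merit is within epsilon of the maximum."""
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    # Step 1 is assumed done: estimated_merits already holds one merit per database.
    positive = {db: merit for db, merit in estimated_merits.items() if merit > 0}
    if not positive:
        return []
    # Step 2: the maximum non-zero estimated merit.
    top = max(positive.values())
    # Step 3 (assumed closeness test): relative gap to the maximum <= epsilon.
    selected = [
        (db, merit) for db, merit in positive.items() if (top - merit) / top <= epsilon
    ]
    # Step 4: rank the selected databases by decreasing estimated merit.
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Estimated merits for Example 2 (hypothetical container; values from the paper).
example_2 = {"db1": 10.0, "db2": 15.0, "db3": 0.0, "db4": 1.067}
print([db for db, _ in best_group_search(example_2, 0.5)])  # ['db2', 'db1']
```

Under this assumed test, a threshold of 0 keeps only the maximum-merit database, and a threshold of 1 keeps every database with positive estimated merit, so the sketch reduces to the best and relevant search methods at the two extremes.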
We consider the identified databases as the Estimated Best-Group Databases, and the search process that produces them as the Best-Group Search Method. This function may be considered a special case of the function defined by Gravano [8], which was introduced as an improvement to the GlOSS system. The selected databases identified by Equation 7 should first be ranked according to their estimated merits, and then reported to the user as the final estimated list of ranked databases. Parameter ε relaxes the best database selection function in Equation 5 so that databases close to the predicted maximum are also selected. Therefore, the larger ε is, the more databases will be included in the selected best-group databases.

In order to evaluate this selection function, we need to compare the estimated best-group databases produced by this function against the actual best-group subset of DB with respect to query q. We define the Best-Group Databases $B_m(q, DB)$ to include the databases that have the highest positive actual merit values, close to the maximum merit, i.e., the highest positive numbers of elements matching query q; this definition is stated formally as Equation 8. It is worth mentioning that Equations 3 and 4 coincide with Equations 7 and 8, respectively, for ε = 1. Also, Equations 5 and 6 coincide with Equations 7 and 8, respectively, for ε = 0.

Note that the user who submits a query does not specify a value for ε. The server should fix the value of ε according to the desired meaning of the best-group. Other factors may affect the choice of ε, such as the total number of databases in the system; for example, if the number of databases in the system is too large, then ε may be set to a small value. In general, the higher the value of ε, the more databases tend to be selected by the selection function. To illustrate the impact of this new selection function, let us consider the databases in Example 2.
If the user is interested in the best-group databases that contain the required information, and if ε is specified by the server as 0.5, then the estimated merits are computed for each database in the same way as for the relevant search method. The maximum estimated merit is thus specified as 15; the best-group databases are then estimated using Equation 8, based on the following calculations: Em(l, db1) = 10, which is greater than zero and within the threshold 0.5 of the maximum; Em(l, db2) = 15, which is the maximum estimated merit; Em(l, db3) = 0; and Em(l, db4) = 1.067, which is greater than zero but outside the threshold 0.5.
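The two threshold checks in this calculation can be spelled out numerically. The sketch below assumes that closeness to the maximum is measured as the relative difference between the maximum estimated merit (15) and each database's estimated merit; this measure is an assumption adopted here for illustration, chosen because it agrees with the outcome reported for Example 2.

```latex
\frac{15 - 10}{15} \approx 0.333 \le 0.5 \quad (db_1 \text{ is kept}),
\qquad
\frac{15 - 1.067}{15} \approx 0.929 > 0.5 \quad (db_4 \text{ is dropped})
```

For db2 the relative difference is 0, so it always passes, and db3 is excluded earlier because its estimated merit is not positive.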

Since only db1 and db2 satisfy Equation 8, the set {db1, db2} will be selected as the best-group. The final step is to rank the selected subset of databases in decreasing order according to their estimated merits. So, the final estimated ranking list that will be reported to the user is (db2, db1).

6 Summary and Conclusions

In this paper, we described a mechanism that enables users to find information of interest, and the databases that contain such information, without any a priori knowledge about its location. This mechanism uses the statistical information extracted from each local database to estimate the number of potentially useful databases. Our estimation methods are based upon established statistical theory. The mechanism also accomplishes these goals with no effect on the autonomy and heterogeneity of the participating databases. Further, the flexibility and ease with which databases join and leave the system, and organize themselves within it, is one of the most important features of our system. To support relational query processing using keyword search, we gave the necessary extensions to the inverted file in order to take the structure of relational databases into account. The experimental results are encouraging. Finally, a prototype that incorporates each part of the architecture has been proposed; it includes: building the index and summary tables for each database; extracting the index and summary tables from individual databases; building the union index and the union summary tables; and building the main system with a web-based interface. Currently, we are working on different evaluation parameters. We are also planning to evaluate the storage requirements of the proposed approach.

References

[1] S. Agrawal, S. Chaudhuri, and G. Das, DBXplorer: A System for Keyword-Based Search over Relational Databases, Proceedings of IEEE International Conference on Data Engineering, 2002.
[2] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.
[3] G. Bhalotia, et al, Keyword Searching and Browsing in Databases using BANKS, Proceedings of IEEE International Conference on Data Engineering, 2002.
[4] C. Chuan, et al, Predicate Rewriting for Translating Boolean Queries in a Heterogeneous Information System, ACM Transactions on Information Systems, Vol. 17, No. 1, pp. 1-39, 1999.
[5] S. Dar, et al, DTL's DataSpot: Database Exploration Using Plain Language, Proceedings of the International Conference on Very Large Databases, pp. 645-649, 1998.
[6] J. French, et al, Comparing the Performance of Database Selection Algorithms, Proceedings of ACM SIGIR Conference on Research and Development in IR, pp. 238-245, 1999.
[7] D. Florescu, D. Kossmann and I. Manolescu, Integrating Keyword Search into XML Query Processing, Proceedings of the International Conference on WWW, 2000.
[8] L. Gravano, H. Garcia-Molina and A. Tomasic, GlOSS: Text-source discovery over the internet, ACM-TODS, Vol. 24, No. 2, pp. 229-264, June 1999.
[9] R. Goldman, et al, Proximity search in databases, Proceedings of the International Conference on Very Large Databases, pp. 26-37, 1998.
[10] V. Hristidis and Y. Papakonstantinou, DISCOVER: Keyword Search in Relational Databases, Proceedings of the International Conference on Very Large Databases, 2002.
[11] U. Masermann and G. Vossen, Design and Implementation of a Novel Approach to Keyword Searching in Relational Databases, Proceedings of ADBIS-DASFAA, 2000.
[12] K.D. Munroe and Y. Papakonstantinou, BBQ: A visual interface for integrated browsing and querying of XML, Proceedings of IEEE International Conference on Data Engineering, 2002.
[13] N.L. Sarda and A. Jain, Mragyati: A system for keyword-based searching in databases, http://xxx.lanl.gov/archive/cs, 2001.