Schema Mapping in P2P Networks based on Classification and Probing


Guoliang Li 1, Beng Chin Ooi 2, Bei Yu 2, and Lizhu Zhou 1

1 Department of Computer Science and Technology, Tsinghua University, Beijing 084, China
{liguoliang, dcszlz}@tsinghua.edu.cn
2 School of Computing, National University of Singapore, Singapore
{ooibc, yubei}@comp.nus.edu.sg

Abstract. In this paper, we address the problems of adaptive schema mapping between different peers in a peer-to-peer network and of searching for interesting data residing at different peers based on such mappings. We begin by classifying the shared schema of each peer into a taxonomy of relation categories and attribute categories. We then propose our adaptive schema mapping, which selectively probes the shared schema with query probes generated from the classification rules. To improve the accuracy of schema mapping, we introduce the notions of confusion matrix and prior-knowledge. Finally, we present the query reformulation strategy for retrieving and integrating data from all relevant peers. We have implemented our proposed schema mapping and query processing methods in real settings with real datasets. The experimental results show that our method can be adopted effectively in practice.

1 Introduction

Sharing data among multiple sources is crucial in a wide range of applications, including enterprise data management, large-scale scientific projects, government agencies and the World-Wide Web in general. Data integration approaches offer an architecture for data sharing in which data is queried through a mediated schema but physically stored at the source locations according to their own schemas. Recent data integration systems have been successful at enabling data sharing, but only on a relatively small scale, because constructing the mediated schema is expensive. Recently, peer data management systems (PDMS) have been proposed as an architecture for decentralized data sharing [1, 2, 9, 19, 20, 23].
A PDMS consists of a set of (physical) peers, and each peer has an associated schema, called its peer schema, that represents its domain of interest. Some peers store actual data, with mappings from their physical schemas to their relevant peer schemas. However, a peer may not have complete data instances for its peer schema, since individual peers typically do not contain complete information about a domain. This calls for schema mappings that allow a peer to tap into relevant peers for more complete answers. Mapping all data sources to a single global schema (or mediator) in a PDMS is not feasible because of the decentralization and scalability requirements of P2P systems. Therefore, in a PDMS, mappings between disparate schemas are built directly and stored locally. When a query is posed at a peer, the answers are obtained by integrating the results of reformulated queries from relevant peers, and these reformulated queries are generated by exploring the mappings. Most existing PDMS proposals, such as Hyperion [2, 15], Piazza [9, 23], and PeerDB [19, 20], require human intervention for schema mapping, which is inefficient and ineffective for large networks and dynamic sources. Therefore, an adaptive way of generating schema mappings is highly desirable.

In this paper, we propose such a schema mapping method based on classification. We classify the shared schemas (relational tables and attributes) of individual peers into a taxonomy of relation categories and associated attribute categories, which essentially represent various conceptual domains. Schema mappings are generated among all peers that have relations belonging to the same category. When a new peer joins, its shared schema is classified by probing its relations with query probes generated from classification rules; the peer is then assigned to the one or more relation categories that best match the probing results. Subsequently, its schema is mapped to the peers in the same categories. One advantage of our classification-based schema mapping is that its simplicity and modeling uniformity allow the contents of several sources to be integrated without having to tackle complex structural differences. Another advantage is that query evaluation over classification-based sources can be done efficiently.

Our system is based on a super-peer P2P network in which the super peers themselves are organized in a structured overlay, such as BATON [12], and the normal peers within the cluster managed by a super peer are unstructured. The categories are distributed among the super peers, through which normal peers build schema mappings. Our category structure is distinct from a global schema (or mediator), since it is distributed among all the super peers and is used by peers to generate schema mappings, not by users to pose queries.

In this paper, we make the following contributions. We propose a method for schema mapping based on classification and probing in PDMS. We adopt the notion of confusion matrix [16] and apply prior-knowledge to improve the accuracy of schema mapping whenever there are overlapping instances among the shared schemas.
We present query reformulation strategies for reformulating local queries among relevant peers to achieve efficient query answering.

The paper is organized as follows. We discuss the related work in Section 2. Section 3 presents how to create the schema mapping, and Section 4 describes the query reformulation and evaluation strategies. In Section 5, we provide extensive experimental evaluations of our method, and we conclude the paper in Section 6.

2 Related Work

There is a long stream of research on schema mapping, and we briefly review recent and relevant proposals. Kang et al. [14] investigated schema matching techniques that work in the presence of opaque column names and data values. Yu et al. [27] proposed a method for constraint-based XML data integration. Dhamankar et al. [4] described the iMAP system, which semi-automatically discovers both 1-1 and complex matches. These three methods are efficient only in centralized environments.

More recently, the database community has begun to exploit P2P technologies for database applications [2, 6, 8, 9, 13, 15, 22, 23, 26]. In [8], the problem of data placement for P2P systems was addressed, and it was shown how data management could be applied to P2P. In [26], the class of hybrid P2P systems, where some functionality is still centralized, was studied. In [13], caching of OLAP queries was addressed in the context of a P2P network. Ooi et al. [18-20] introduced an IR technique into schema mapping in PDMS. Halevy et al. addressed the issue of schema mediation and proposed a language for mediating between peer schemas in [9]. The Hyperion project was proposed in [2, 15, 22]; it creates schema mappings via mapping tables and requires human input. The coDB P2P DB prototype system, which measures the performance of various networks arranged in different topologies, was proposed in [6].

Most existing approaches to schema mapping require human input or intervention. For example, in PeerDB [19], users are expected to provide additional descriptions for the relation and attribute names. In this paper, we take schema mapping one step further by not relying on such additional input from users. Accordingly, we propose a practical and adaptive solution based on classification and probing.

3 Classification-based Schema Mapping

In this section, we first give an overview of how to construct a classification scheme for various peer schemas in Section 3.1. Then we describe the classification-based schema mapping in detail in Section 3.2 and Section 3.3.

3.1 Classification Overview

[Fig. 1. Architecture of Schema Mapping based on Classification and Probing]

Figure 1 shows the overall architecture of our classification-based schema mapping method. Similar to a conceptual taxonomy, all the shared schemas in our system (relations and their attributes) are classified into certain categories, and each category may contain some subcategories. A hierarchical classification scheme is introduced as follows.

Hierarchical Classification Scheme: A hierarchical classification scheme is a rooted directed tree whose nodes correspond to categories.
Categories are either relation categories or attribute categories, and each relation category has some attribute subcategories. An edge from relation category $u$ to another relation category $v$ denotes specialization, while an edge from relation category $v$ to its attribute category $v_i$ denotes that a relation of category $v$ has an attribute of category $v_i$.

[Fig. 2. A classification structure]

Figure 2 illustrates the hierarchical classification scheme of our running example, where an ellipse denotes a relation category and a rounded rectangle denotes an attribute category. A relation category has several attribute subcategories, which correspond to its attributes. In Figure 2, the root node has two relation categories, Protein and Programme, while Programme has three relation subcategories (Java, C/C++, Delphi) and four attribute subcategories (Name, Author, Comp, Year).

Fig. 3. Local schema mapping (local name -> mapped category)
Peer a (relevant peers: b, c): Kinases -> Protein; ID -> SeqID; len -> Length; seq -> Sequence
Peer b (relevant peers: a, c): annexin -> Protein; identifier -> SeqID; length -> Length; seqs -> Sequence
Peer c (relevant peers: a, b): protein -> Protein; number -> SeqID; seqlength -> Length
Peer c (relevant peers: a, b): sequence -> Protein; number -> SeqID; seq -> Sequence

As mentioned above, the classification structure is maintained by super peers, which can be organized with existing overlays such as BATON [12]. Each super peer maintains a subset of the categories, where each category is associated with its own classification rules (including prior-knowledge for attribute categories, which will be introduced in Section 3.3) and with the physical addresses of the peers that have classified some relations into it. The categories on super peers are indexed with BATON's distributed index facility, so that the classification hierarchy can be traversed from any of the super peers. The local schema mappings are maintained not by super peers but by normal peers. Each peer maintains its local schema mapping with its matched categories, together with the identifiers of the relevant peers that have classified schemas into the same categories. In our running example, category Protein has three peers a, b, and c that have classified their schemas into it. Correspondingly, each of peers a, b and c maintains its local schema mappings with Protein, as shown in Figure 3.

Initially, the hierarchical classification scheme can be extracted from existing classifications using special-purpose languages and tools, or it can be constructed from scratch. If there are sample schemas in the PDMS, we can use many existing methods, such as the naive Bayes classifier [5], C4.5 [21], RIPPER [3], and Support Vector Machines [24], to classify them.
On the other hand, if there are no sample schemas, we can construct the classification from scratch: if a new schema matches certain categories in the existing hierarchical classification scheme, it is classified into these categories; otherwise, it is inserted into the hierarchical classification scheme as a new category (which may be constructed as a parent category of some existing categories). In our approach, schema mapping based on classification operates in two phases: 1) relation mapping (Section 3.2); and 2) attribute mapping (Section 3.3). These are presented in the following subsections.

3.2 Relation Mapping

We create relation mappings between relevant peers by classifying their relations into the most relevant categories through query probing. The probe-based method has been used for mining hidden-web data in [7, 11, 25], which is orthogonal to the schema mapping of our work. Since the classification rules capture the characteristics of the various categories, we generate query probes from these classification rules and use them to differentiate relations. Basically, if a query probe returns the expected results from a relation, the relation is related to the category associated with that probe, and it is subsequently classified into the category.

Now we describe the class of rule-based classifiers and show how a rule-based classifier can be used to generate a set of query probes that help us estimate the number of results for each category of interest in a relation. In a rule-based classifier, the classification decisions are based on a set of logical classification rules $\kappa_i \rightarrow C_i$, where the antecedents of the rules are conjunctions of words and the consequents are the category assignments. For example, the following classification rules are part of a classifier for the categories Java book and protein, respectively:

Java AND book $\rightarrow$ Java book
%a%c%g%t% $\rightarrow$ protein

Such rules can be used to classify previously unseen relations. For example, the first rule classifies a relation containing the keywords Java and book into the category Java book. The second classifies a relation containing the keyword pattern %a%c%g%t% into the category protein.

We can simulate the behavior of a rule-based classifier over all categories of the classification scheme by mapping each rule $\kappa_i \rightarrow C_i$ of the classifier into a boolean query $q_i$ that is the conjunction of all keywords appearing in $\kappa_i$. Thus, if we send the query probe $q_i$ to a new relation $R$, the query matches exactly the $f(q_i)$ results in $R$ that would have been classified by the associated rule into category $C_i$. In fact, instead of retrieving the concrete results, we only need to keep the number of matches reported for each query probe, and we use this number as the measure of whether the probed relation satisfies the corresponding classification rule.

Having the result for each query probe, we can construct a good approximation of the weight and confidence vectors for a relation $R$. We approximate the number of results of $R$ in category $C_i$ as the total number of matches from all query probes derived from rules with category $C_i$.
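The probing step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each rule antecedent is a conjunction of SQL LIKE patterns over a single text column, and the table `items`, the column `descr`, the sample rows and the helper names `build_probe` and `probe_counts` are all hypothetical.

```python
import sqlite3

# Hypothetical classification rules: category -> conjunction of LIKE patterns.
RULES = {
    "Java book": ["%Java%", "%book%"],
    "protein": ["%a%c%g%t%"],
}

def build_probe(table, column, patterns):
    """Turn one rule antecedent into a counting query (the probe q_i).

    Table and column names are interpolated directly, so they are assumed
    to come from a trusted schema description, not from user input.
    """
    where = " AND ".join(f"{column} LIKE ?" for _ in patterns)
    return f"SELECT COUNT(*) FROM {table} WHERE {where}", list(patterns)

def probe_counts(conn, table, column, rules):
    """Return the match count f(q_i) per category; no rows are retrieved."""
    counts = {}
    for category, patterns in rules.items():
        sql, params = build_probe(table, column, patterns)
        counts[category] = conn.execute(sql, params).fetchone()[0]
    return counts

# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (descr TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [
    ("Thinking in Java, a book for programmers",),
    ("Java book on concurrency",),
    ("ttaaccggtt",),
    ("Delphi in a nutshell",),
])
counts = probe_counts(conn, "items", "descr", RULES)
print(counts)
```

Only the counts leave the probed peer, which is the point of the technique: the prober never needs to see the matching tuples themselves.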
Using this information, we generate the approximate weight and confidence vectors for $R$, with which we decide how to classify $R$ into one or more categories in the classification scheme.

Weight vector: Consider a relation $R$ and a hierarchical classification scheme $C = \{C_1, C_2, \ldots, C_n\}$, where each category $C_i \in C$ is associated with a classification rule $\kappa_i \rightarrow C_i$. Let $f(R, \kappa_i)$ denote the number of results obtained when $\kappa_i$ is used to probe $R$. The weight of relation $R$ for $C_i$, $W(R; C_i) = f(R, \kappa_i)$, is the number of answers in $R$ on category $C_i$.

Confidence vector: In the same setting as for the weight vector, the estimated confidence of $R$ for $C_i$, $S(R; C_i)$, is

$$S(R; C_i) = S(R; \mathrm{Parent}(C_i)) \cdot \frac{W(R; C_i)}{\sum_{C_j \text{ is a child of } \mathrm{Parent}(C_i)} W(R; C_j)}.$$

As a special case, $S(R; \mathrm{root}) = 1$.

$W(R; C_i)$ gives the absolute number of results that relation $R$ contains about category $C_i$, while $S(R; C_i)$ gives the relative share of the results in $R$ that are about $C_i$. A weight-based classification would classify a relation into a category when the relation has a substantial number of results in that category. Alternatively, a confidence-based classification would classify a relation into a category when a significant fraction of the results it contains belong to that category. In general, however, we are interested in balancing weight and confidence, using two associated thresholds $\tau_w$ and $\tau_c$, respectively. Formally, to classify $R$ into certain categories, we use the classification criterion described in the following.
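As a hedged sketch of the weight and confidence definitions, the fragment below computes $S(R; C_i)$ recursively from a weight vector and then pushes a relation down the taxonomy while both thresholds are met, which is one straightforward reading of how the two thresholds are balanced. The taxonomy mirrors the running example, but the weights, the thresholds and the function names `children`, `confidence` and `classify` are illustrative assumptions.

```python
# Hypothetical taxonomy (child -> parent), rooted at "root".
PARENT = {
    "Protein": "root", "Programme": "root",
    "Java": "Programme", "C/C++": "Programme", "Delphi": "Programme",
}

def children(cat):
    """Direct subcategories of cat."""
    return [c for c, p in PARENT.items() if p == cat]

def confidence(weights, cat):
    """S(R; C_i) = S(R; Parent(C_i)) * W(R; C_i) / sum of sibling weights."""
    if cat == "root":
        return 1.0  # special case: S(R; root) = 1
    parent = PARENT[cat]
    total = sum(weights.get(s, 0) for s in children(parent))
    if total == 0:
        return 0.0
    return confidence(weights, parent) * weights.get(cat, 0) / total

def classify(weights, tau_w, tau_c, cat="root"):
    """Push R down from cat while some child passes both thresholds.

    Returns the deepest categories reached; a failing child prunes its
    whole subtree. An empty list means R stays unclassified.
    """
    passing = [c for c in children(cat)
               if weights.get(c, 0) >= tau_w and confidence(weights, c) >= tau_c]
    if not passing:
        return [] if cat == "root" else [cat]
    result = []
    for c in passing:
        result.extend(classify(weights, tau_w, tau_c, c))
    return result

# Illustrative probe counts W(R; C_i) for one relation R.
W = {"Protein": 10, "Programme": 990, "Java": 900, "C/C++": 60, "Delphi": 30}
print(classify(W, tau_w=50, tau_c=0.5))
```

With these made-up numbers, C/C++ passes the weight threshold (60 matches) but not the confidence threshold, so only the Java branch survives; this is the kind of case where using both thresholds matters.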

Classification Criterion: Consider a classification scheme $C$ with categories $\{C_1, C_2, \ldots, C_n\}$ and a relation $R$. $R$ is classified into category $C_i$ if it satisfies all of the following conditions:
(1) $W(R; C_i) \geq \tau_w$ and $S(R; C_i) \geq \tau_c$;
(2) $W(R; C_j) \geq \tau_w$ and $S(R; C_j) \geq \tau_c$ for any ancestor $C_j$ of $C_i$;
(3) $W(R; C_k) < \tau_w$ or $S(R; C_k) < \tau_c$ for any child $C_k$ of $C_i$;
where $0 \leq \tau_c < 1$ and $\tau_w \geq 1$ are the given thresholds.

With our hierarchical classification scheme, we classify relations in a top-down way. A new relation is first classified by the root-level classifier and then recursively pushed down to the lower-level classifiers. A relation $R$ is pushed down to category $C_j$ when $W(R; C_j)$ and $S(R; C_j)$ are no less than the thresholds $\tau_w$ (for weight) and $\tau_c$ (for confidence), respectively. If a category $C_k$ does not match $R$, we can prune the whole subtree rooted at $C_k$. The final set of categories into which we classify $R$ is the set of approximate categories of $R$ in $C$.

The probe-based method relies on category classifiers to define query probes and to obtain category match information for a relation. Unfortunately, classifiers are not always perfect: sometimes they classify relations into incorrect categories, and they leave relations that do not match any rule unclassified. Here, we present a novel method that adjusts the initial probing results in order to avoid such potential errors. It is common practice in the machine learning community to report classification results using a confusion matrix [16]. We adapt this notion for use in our probing scenario.

Confusion Matrix: Consider a classification scheme with categories $\{C_1, C_2, \ldots, C_n\}$, where for each category $C_i$ there is a relevant relation $R_i$ mapped to it. The confusion matrix $M = (m_{ij})$ is an $n \times n$ matrix, where $m_{ij}$ is the number of matches generated from $R_j$ for the query probe of category $C_i$, divided by the number of results in $R_j$.
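A minimal sketch of this definition follows, together with the use the text goes on to make of it: correcting raw probe counts by solving $M x = W$ for $x$. All match counts and relation sizes below are made up for illustration, and `confusion_matrix` and `solve` are hypothetical helper names; a small Gauss-Jordan solver stands in for a matrix inverse so that the fragment needs no third-party library.

```python
def confusion_matrix(matches, sizes):
    """m_ij = matches from R_j for the probe of C_i, divided by |R_j|."""
    n = len(sizes)
    return [[matches[i][j] / sizes[j] for j in range(n)] for i in range(n)]

def solve(m, w):
    """Solve M x = w by Gauss-Jordan elimination with partial pivoting."""
    n = len(w)
    a = [row[:] + [w[i]] for i, row in enumerate(m)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]

# Illustrative training data: matches[i][j] is the number of results of the
# reference relation R_j matched by the probe for C_i; sizes[j] is |R_j|.
sizes = [1000, 500, 200]
matches = [[900, 25, 0],
           [50, 450, 10],
           [0, 0, 180]]
M = confusion_matrix(matches, sizes)

# Raw probe counts observed on a new relation, and their corrected estimate.
observed = [365.0, 110.0, 0.0]
estimated = solve(M, observed)
print([round(x, 6) for x in estimated])
```

Solving the linear system directly avoids forming the inverse explicitly, which is the usual numerical preference; the result is the same corrected weight vector whenever the confusion matrix is invertible.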
In a perfect setting, the probes for $C_i$ match only results in $R_i$, and each result in $R_i$ matches exactly one probe for $C_i$. In this case, the confusion matrix is the identity matrix. The process for creating the confusion matrix is:
i) Generate the query probes from the classification rules of the categories and probe the relations mapped to the categories in the classification scheme.
ii) Create an auxiliary confusion matrix $X = (x_{ij})$ and set $x_{ij}$ equal to the number of matches from $R_j$ for the query probe of category $C_i$.
iii) Normalize the columns of $X$ by dividing column $j$ by the number of results in $R_j$. The result is the confusion matrix $M$.

Example 1. Suppose that we have a classifier for three categories $C_1$ = C/C++, $C_2$ = JAVA, $C_3$ = Delphi, and there are three relations $R_1$, $R_2$, $R_3$ with 2000, 1500, 0 records for C/C++, JAVA, Delphi, respectively. After probing these three relations with the three query probes generated from the classification rules, we construct the following confusion matrix. Element $m_{13}$ = 2000 means that it misclassifies records of $R_1$ into $R_3$.

[Confusion matrix $M$ of Example 1]

Interestingly, multiplying the confusion matrix by the weight vector that holds the exact number of results for each category yields the weight vector with the number of results in each category as matched by the query probes. For instance, in Example 1, there are exactly 2000 results for $C_1$, 1500 results for $C_2$ and 0 results for $C_3$, and the probe results are 1830, 1420, 0. We can infer the exact weight vector, $EW$, from the probe results and the matrix $M$, where $EW(C) = M^{-1} W(C)$. Hence, when classifying a relation, we multiply $M^{-1}$ by $W(C)$ to obtain a better approximation of the weight vector.

3.3 Attribute Mapping

After classifying a relation into certain relation categories, we have to classify its attributes into the associated attribute categories, which can be done in the same way as the relation mapping described in Section 3.2. In addition, since attributes have their own characteristics, we introduce in this section some techniques that improve the accuracy of attribute mapping.

Each attribute of a relation is associated with a particular type, such as string, number or date, and different types capture different characteristics. Moreover, an attribute may be restricted to a certain domain. For example, the attribute Age (the age of a human being) is a number between 0 and 150, since there is no person whose age is larger than 150 or less than 0. Therefore, we introduce prior-knowledge for attribute mapping.

Prior-knowledge: Consider a relation category $C$ with attribute categories $\{L_1, L_2, \ldots, L_p\}$. Each attribute category $L_i$ must satisfy its prior-knowledge $\chi_i$, represented as $L_i = \chi_i$. $\chi_i$ can be generated manually or automatically; we generate it automatically using machine learning techniques. Any attribute that does not satisfy $\chi_i$ cannot be mapped to $L_i$. Therefore, we can generate a query probe that selects values violating $\chi_i$ and use it to probe an unknown attribute $A_j$.
If any results are returned for this query probe, the attribute category $L_i$ cannot map to $A_j$; otherwise, we probe $A_j$ with the query probe generated from the classification rules of $L_i$, and use the count of the probing results to represent the correlation between the two attributes. In this way, prior-knowledge helps us create attribute mappings more accurately.

Example 2. Consider category Person(ID, Name, Age, Sex) with the prior-knowledge: 1) ID = Number(0,00); 2) Name = String; 3) Age = Number[0, 150]; 4) Sex = {one of two values (e.g. male; female)}. A relation People has been mapped to the relation category Person. Now we consider how to create the attribute mapping between them. Since there are only two values of Sex, we probe each attribute of People with the query probe generated from the prior-knowledge of Sex: Select count(distinct probe-attribute) from People. If the result of this query probe is larger than two, we can be sure that this probe-attribute cannot be mapped to Sex. Similarly, we probe each attribute of People with the query probe generated from the prior-knowledge of Age: Select probe-attribute from People where probe-attribute>150 or probe-attribute<0. If this probe query returns a non-empty result, we can be sure that probe-attribute cannot match Age.

Formally, we introduce the correlative matrix to create attribute mappings.

Correlative Matrix: Consider category $C$ with attribute subcategories $\{L_1, L_2, \ldots, L_p\}$, where each $L_i$ is associated with a prior-knowledge $\chi_i$. Relation $R$ with attributes $\{A_1, A_2, \ldots, A_q\}$ is relation-mapped to $C$. The correlative matrix $Corr(C, R) = \{m_{ij}\}$ is a $p \times q$ matrix with $m_{ij} = 0$ if $A_j$ does not satisfy $\chi_i$; otherwise, $m_{ij}$ is the number of results obtained by probing $A_j$ with the query probe generated from the classification rule of $L_i$.

Relative Correlative Matrix: In the same setting as for the correlative matrix, the relative correlative matrix $RCorr(C, R) = \{r_{ij}\}$ is a $p \times q$ matrix with

$$r_{ij} = \frac{m_{ij}}{\sum_{k=1}^{p} m_{kj}}.$$
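The two matrices can be sketched with SQL probes in the spirit of Example 2. Everything concrete here is an assumption made for illustration: the table `people` and its rows, the SQL used for each violation probe and rule probe, the thresholds, and the helper names. The final step keeps the pairs whose weight and relative weight both reach the thresholds, which is how the text uses the two matrices to decide mappings.

```python
import sqlite3

# Hypothetical attribute categories. "violation" encodes the prior-knowledge
# as a probe returning a non-zero value when the attribute {a} breaks it;
# "rule" stands in for the probe derived from the classification rule.
CATEGORIES = {
    "Name": {
        "violation": "SELECT COUNT(*) FROM people WHERE typeof({a}) <> 'text'",
        "rule": "SELECT COUNT(*) FROM people WHERE {a} GLOB '[A-Z]*'",
    },
    "Age": {
        "violation": "SELECT COUNT(*) FROM people WHERE typeof({a}) "
                     "NOT IN ('integer', 'real') OR {a} > 150 OR {a} < 0",
        "rule": "SELECT COUNT(*) FROM people WHERE {a} BETWEEN 0 AND 150",
    },
    "Sex": {
        "violation": "SELECT COUNT(DISTINCT {a}) > 2 FROM people",
        "rule": "SELECT COUNT(*) FROM people "
                "WHERE lower({a}) IN ('male', 'female')",
    },
}

def correlative_matrix(conn, categories, attributes):
    """m_ij = 0 if A_j violates the prior-knowledge, else the rule-probe count."""
    corr = {}
    for cat, probes in categories.items():
        corr[cat] = {}
        for a in attributes:
            violated = conn.execute(probes["violation"].format(a=a)).fetchone()[0]
            corr[cat][a] = 0 if violated else conn.execute(
                probes["rule"].format(a=a)).fetchone()[0]
    return corr

def relative_matrix(corr, attributes):
    """r_ij = m_ij / sum_k m_kj (column-wise normalisation; 0 for empty columns)."""
    rcorr = {cat: {} for cat in corr}
    for a in attributes:
        total = sum(corr[cat][a] for cat in corr)
        for cat in corr:
            rcorr[cat][a] = corr[cat][a] / total if total else 0.0
    return rcorr

def attribute_mapping(corr, rcorr, tau_w, tau_c):
    """Keep the (category, attribute) pairs that reach both thresholds."""
    return {(cat, a) for cat in corr for a in corr[cat]
            if corr[cat][a] >= tau_w and rcorr[cat][a] >= tau_c}

# Illustrative relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (p_id INTEGER, p_name TEXT, p_age INTEGER, gender TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", [
    (1001, "Alice", 34, "female"), (1002, "Bob", 51, "male"),
    (1003, "Carol", 27, "female"), (1004, "Dave", 45, "male"),
])
attrs = ["p_id", "p_name", "p_age", "gender"]
corr = correlative_matrix(conn, CATEGORIES, attrs)
rcorr = relative_matrix(corr, attrs)
mapping = attribute_mapping(corr, rcorr, tau_w=1, tau_c=0.5)
print(sorted(mapping))
```

Running the violation probe first means the more expensive rule probe is skipped for attributes that the prior-knowledge already rules out.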

Attribute Mapping Criterion: Consider category $C$ with attribute subcategories $\{L_1, L_2, \ldots, L_p\}$ and relation $R$ with attributes $\{A_1, A_2, \ldots, A_q\}$. $L_i$ maps to $A_j$ if $Corr(C, R) = \{m_{ij}\}$ and $RCorr(C, R) = \{r_{ij}\}$ satisfy $r_{ij} \geq \tau_c$ and $m_{ij} \geq \tau_w$, where $\tau_w$ and $\tau_c$ are the thresholds for weight and confidence, respectively.

Example 3. Consider category $C$ = Person with attribute subcategories ID, Name, Age, Sex, and relation $R$ = people(p_id, p_name, p_age, gender). $R$ is relation-mapped to $C$, and we demonstrate how to create the attribute mapping between $C$ and $R$. We probe each attribute of $R$ using the prior-knowledge of the attribute subcategories in $C$ (the prior-knowledge in Example 2) and obtain the correlative and relative correlative matrices. We can see that, with the help of prior-knowledge, each attribute of $R$ maps exactly to the corresponding attribute of $C$.

[Matrices $Corr(C, R)$ and $RCorr(C, R)$ of Example 3; rows: ID, Name, Age, Sex; columns: p_id, p_name, p_age, gender]

4 Query Reformulation

With the created schema mapping, we can reformulate a query issued to a peer over its peer schema into queries over the peer schemas of its relevant peers, so that those peers can understand and answer it. We first define the standard form query and the local form query in our system.

Standard form query: A standard form query is a query composed of relations and attributes of the relation categories and attribute categories in the hierarchical classification scheme.

Local form query: A local form query of peer P is a query composed of the relations and attributes of P's local peer schema.

Query reformulation with our method operates in three phases, which are described in the following subsections.

4.1 Standardization

In the standardization phase, the peer needs to transform the received query into a standard form query, which is expressed in terms of certain relation categories and their corresponding attribute categories.
Suppose the issued query is represented as a triple $Q = \langle R, A, C \rangle$, where $R$ is a relation name, $A$ is the attribute set $\{A_1, A_2, \ldots, A_p\}$, and $C$ is the condition set. (If the query contains more than one relation, we can decompose it into multiple single-relation queries and then integrate them.) We first find, through the local schema mapping, all the categories $\{R_1, R_2, \ldots, R_n\}$ such that each $R_i$ is mapped to $R$. Then we look at the attribute subcategories of $R_i$, $N_i = \{N_{i1}, N_{i2}, \ldots, N_{ip}\}$, where $N_{ik}$ is mapped to $A_k$. Let $\mathit{relevantPeers}(R_i)$ and $\mathit{relevantPeers}(N_{ik})$ denote the sets of peers that have classified some relations and attributes into $R_i$ and $N_{ik}$, respectively. If

$$P_i = \mathit{relevantPeers}(R_i) \cap \Big( \bigcap_{k=1}^{p} \mathit{relevantPeers}(N_{ik}) \Big) \neq \emptyset,$$

then $R_i$ has a schema mapping with $R$, and we can reformulate $Q$ to $Q_i$ by replacing $R$ with $R_i$ and $A_k$ with $N_{ik}$, and send the standard form query $Q_i$ to the peers in $P_i$. In addition, we obtain the set of all peers that have relations mapped to $R$, $P = \bigcup_{i=1}^{n} P_i$.

4.2 Localization

When the relevant peers receive the standard form query from the query initiator, they need to reformulate it into a local form query over their own peer schemas in order to execute it. The process of transforming a standard form query into a local form query is similar to the one described in Section 4.1. We again consider the standard form query as a triple $Q = \langle R, A, C \rangle$. We first find the set $S$ of local relations $S_i$ that map to $R$. If $S_i$ contains all the attributes in $A$, we rewrite $Q$ by replacing $R$ with $S_i$ and $A$ with the corresponding attributes in $S_i$. In some cases, the local peer cannot reformulate the standard form query $Q$ into a local form query over a single relation, because it needs to join several relations to answer $Q$. For example, if $S_i \in S$ and there is an attribute $A_k \in A$ that is not an attribute of $S_i$, then there must be an $S_j \in S$ that has the attribute $A_k$. If $A \subseteq S_i \cup S_j$, we can answer $Q$ by joining $S_i$ and $S_j$; otherwise, we need to find further relations to join with $S_i$ and $S_j$ in order to answer $Q$. After a relevant peer answers the reformulated local form query, it returns the results labeled with the attributes in $A$, so that the query initiator can recognize them.

4.3 Integration

When it receives the answers from the relevant peers, the query initiator transforms those answers, which are expressed in the attributes of the standard form query, into answers expressed in its local attributes, and integrates the results to return them to the user. Suppose the issued query is the triple $Q = \langle R, A, C \rangle$, and its corresponding reformulated standard form queries are $Q_1 \langle R_1, A_1, C_1 \rangle$; $Q_2 \langle R_2, A_2, C_2 \rangle$; ...; $Q_n \langle R_n, A_n, C_n \rangle$. The mapping from the issued query to the standard form queries is one-to-many, but the reverse mapping is one-to-one. Therefore, transforming the attributes in the answers is much easier than query reformulation. Suppose a relevant peer returns its results for $Q_i \langle R_i, A_i, C_i \rangle$. Since the querying peer already knows the mapping from $Q$ to $Q_i$ from the standardization phase, it can simply transform the attributes in the answers by replacing the attributes in $A_i$ with the corresponding attributes in $A$. Finally, it integrates the results from all relevant peers and returns them to the user.
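The three phases can be illustrated with a small sketch over the local schema mappings of peers a and b from Figure 3. It is deliberately simplified: each peer maps one local relation to one category, a peer that would need a join simply declines, and the query and result rows are made up. The `Query` class and the functions `standardize`, `localize` and `integrate` are hypothetical names, not part of the described system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Query:
    relation: str
    attributes: tuple
    conditions: tuple  # (attribute, operator, value) triples

# Local schema mappings of peers a and b, following Fig. 3:
# (local relation, category) and local attribute -> attribute category.
MAPPINGS = {
    "a": {"relation": ("Kinases", "Protein"),
          "attributes": {"ID": "SeqID", "len": "Length", "seq": "Sequence"}},
    "b": {"relation": ("annexin", "Protein"),
          "attributes": {"identifier": "SeqID", "length": "Length",
                         "seqs": "Sequence"}},
}

def standardize(q, peer):
    """Phase 1: rewrite a local form query of `peer` into standard form."""
    local_rel, category = MAPPINGS[peer]["relation"]
    if q.relation != local_rel:
        raise ValueError("relation is not mapped at this peer")
    to_std = MAPPINGS[peer]["attributes"]
    return Query(category,
                 tuple(to_std[a] for a in q.attributes),
                 tuple((to_std[a], op, v) for a, op, v in q.conditions))

def localize(std_q, peer):
    """Phase 2: rewrite a standard form query for `peer`.

    Returns None if the peer cannot answer it from its single mapped relation.
    """
    local_rel, category = MAPPINGS[peer]["relation"]
    to_local = {std: loc for loc, std in MAPPINGS[peer]["attributes"].items()}
    needed = set(std_q.attributes) | {a for a, _, _ in std_q.conditions}
    if std_q.relation != category or not needed <= set(to_local):
        return None
    return Query(local_rel,
                 tuple(to_local[a] for a in std_q.attributes),
                 tuple((to_local[a], op, v) for a, op, v in std_q.conditions))

def integrate(local_q, std_q, rows_by_peer):
    """Phase 3: rename standard attributes back to the initiator's and merge."""
    rename = dict(zip(std_q.attributes, local_q.attributes))
    return [{rename[k]: v for k, v in row.items()}
            for rows in rows_by_peer.values() for row in rows]

# Peer a asks for identifiers and lengths of sequences longer than 300.
q = Query("Kinases", ("ID", "len"), (("len", ">", 300),))
std = standardize(q, "a")
at_b = localize(std, "b")
answers = integrate(q, std, {"b": [{"SeqID": "P1", "Length": 320}]})
print(std, at_b, answers, sep="\n")
```

Because answers come back labeled with standard attribute names, the initiator needs only its own mapping to rename them, which reflects the one-to-one reverse mapping noted in Section 4.3.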
5 Experimental Study

In this section, we report a performance study that evaluates our schema mapping method. The proposed method was implemented in Java. We used the Amalgam schema and data integration test suite [17] and the THALIA benchmark [10] as our experimental data sources (see footnote 1). We evaluate our method from two aspects. First, we study the effectiveness of our schema mapping strategy for matching two schemas. Second, we look at the performance of schema mapping and query processing in a real P2P network setting.

Footnote 1: There are 28 databases and 35 tables in THALIA. We transform THALIA data into 35 relation tables, denoted as $S_5$, and we also create schema mapping between them.

5.1 Mapping between two Schemas

In this section, we first evaluate the quality of the mappings that our method obtains between two schemas. Given two schemas, we first classify the relations and attributes of one of them and obtain the corresponding classification rules; then we create the schema mapping between them by classifying the relations and attributes of the other schema with our probing strategy. We use precision and recall to evaluate the quality of the mappings obtained. Precision is the ratio of the number of correct relation mappings obtained to the total number of relation mappings obtained, where a relation mapping is correct only if both the relations and their attributes are mapped correctly. Recall is the ratio of the number of correct relation mappings obtained to the total number of correct relation mappings. Consider two schemas $S$ and $T$. We denote the precision of probing $T$ with $S$ as $P_{ST}$; that is, $S$ is the schema classified first. Similarly, $P_{TS}$ is the precision of probing $S$ with $T$. In addition, we define

$$F^{P}_{ST} = F^{P}_{TS} = \frac{2\, P_{ST}\, P_{TS}}{P_{ST} + P_{TS}}.$$

These three metrics are used to evaluate the precision of schema mappings between $S$ and $T$. In the same way, we define the corresponding metrics for recall as $R_{ST}$, $R_{TS}$ and

$$F^{R}_{ST} = F^{R}_{TS} = \frac{2\, R_{ST}\, R_{TS}}{R_{ST} + R_{TS}}.$$

[Fig. 4. Mapping between two schemas. Panels: precision without prior-knowledge; recall without prior-knowledge; precision with prior-knowledge; recall with prior-knowledge. x-axis: schema pairs $S_1 S_2$, $S_1 S_3$, $S_1 S_4$, $S_2 S_3$, $S_2 S_4$, $S_3 S_4$; y-axis: matching precision (%) or matching recall (%).]

Our method is evaluated in two cases. First, we create schema mappings without prior-knowledge. Second, we create schema mappings with prior-knowledge. Figure 4 shows the experimental results of matching the 6 pairs of schemas from $S_1$ to $S_4$, where the legend entries denote $P_{ST}$, $P_{TS}$, $R_{ST}$, $R_{TS}$, $F^{P}_{ST}$ and $F^{R}_{ST}$, respectively. The experimental results show that our method achieves high precision and recall. When there is no prior-knowledge, the precision is about 55-75%, and the recall is about -%. Given some prior-knowledge, the accuracy of schema mapping improves dramatically. The precision reaches 75-%, and the recall increases to 85-%. Also, we can see that $P_{ST}$ and $P_{TS}$ do not differ considerably, which shows the stability of our method.

5.2 Mapping in PDMS

In this section, we evaluate our method in a PDMS and compare its performance with PeerDB [19]. The experimental environment consists of 32 PCs (of which 8 are super peers), each with an Intel Pentium 2.4 GHz processor and 512 MB of RAM. All the PCs run the Windows XP operating system. We classify some sample schemas into categories using Bayes classifiers [5], and the eight super peers maintain these categories.
Each normal peer shares its peer schema and joins one randomly chosen super peer by classifying its relations into certain categories.

Fig. 5. Schema mapping in PDMS. Panels: matching precision (%) and matching recall (%) for the schemas $S_1$ to $S_5$. Series: PeerDB (# of keywords=2), PeerDB (# of keywords=5), no prior-knowledge, with prior-knowledge.
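The mapping-quality measures of Section 5.1 (precision, recall, and their harmonic-mean combination F) can be sketched in Java as follows. This is only an illustrative sketch, not the implementation used in the experiments: the class name `MappingMetrics`, the encoding of a relation mapping as a `"source->target"` key with an attribute map as its value, and the sample relation and attribute names are all assumptions made for the example.

```java
import java.util.Map;

// Hypothetical helper; not part of the system described in the paper.
final class MappingMetrics {
    private MappingMetrics() {}

    // A relation mapping counts as correct only if the same relation pair
    // appears in the reference set with an identical attribute mapping.
    static int countCorrect(Map<String, Map<String, String>> obtained,
                            Map<String, Map<String, String>> reference) {
        int correct = 0;
        for (Map.Entry<String, Map<String, String>> e : obtained.entrySet()) {
            Map<String, String> expected = reference.get(e.getKey());
            if (expected != null && expected.equals(e.getValue())) {
                correct++;
            }
        }
        return correct;
    }

    // Correct obtained mappings divided by all obtained mappings.
    static double precision(Map<String, Map<String, String>> obtained,
                            Map<String, Map<String, String>> reference) {
        if (obtained.isEmpty()) {
            return 0.0;
        }
        return (double) countCorrect(obtained, reference) / obtained.size();
    }

    // Correct obtained mappings divided by all correct (reference) mappings.
    static double recall(Map<String, Map<String, String>> obtained,
                         Map<String, Map<String, String>> reference) {
        if (reference.isEmpty()) {
            return 0.0;
        }
        return (double) countCorrect(obtained, reference) / reference.size();
    }

    // F = 2ab / (a + b), applied to (P_ST, P_TS) or to (R_ST, R_TS).
    static double harmonicMean(double a, double b) {
        return (a + b) == 0.0 ? 0.0 : 2.0 * a * b / (a + b);
    }

    public static void main(String[] args) {
        // Made-up reference mappings between two toy schemas.
        Map<String, Map<String, String>> reference = Map.of(
                "Article->Paper", Map.of("title", "name", "year", "yr"),
                "Author->Writer", Map.of("name", "fullname"));
        // Result of probing T with S: one attribute is mapped wrongly.
        Map<String, Map<String, String>> probeTWithS = Map.of(
                "Article->Paper", Map.of("title", "name", "year", "yr"),
                "Author->Writer", Map.of("name", "address"));
        // Result of probing S with T: only one relation mapping is found.
        Map<String, Map<String, String>> probeSWithT = Map.of(
                "Article->Paper", Map.of("title", "name", "year", "yr"));

        double pSt = precision(probeTWithS, reference);
        double pTs = precision(probeSWithT, reference);
        System.out.printf("P_ST=%.2f P_TS=%.2f F_ST=%.2f%n",
                pSt, pTs, harmonicMean(pSt, pTs));
    }
}
```

Because F is a harmonic mean, it stays low whenever either probing direction performs poorly, which is why it serves as a single summary of the two directional scores.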

In this experiment, we first evaluate the quality of the schema mappings generated by the two approaches; we then compare the effectiveness of query processing in the PDMS under the two methods.

Quality of schema mapping: As in Section 5.1, we use precision and recall to evaluate the quality of the schema mappings for each of the schemas from $S_1$ to $S_5$. Figure 5 shows the experimental results of our method (with and without prior-knowledge) compared with those of PeerDB (with 2 keywords and with 5 keywords annotated for each relation). Not surprisingly, in the PDMS our schema mapping method is more effective with prior-knowledge than without it. The precision and recall with prior-knowledge are larger than % for most schemas. Our method is superior to the PeerDB approach for most schemas (except $S_3$). Generally, the precision and recall of our method exceed those of PeerDB by 10% to 20%. Moreover, PeerDB depends on the keywords annotated to a schema, which must be generated manually. Annotating more keywords to a schema can improve the recall, but degrades the precision. The experimental results show that our method achieves good schema mapping performance in a PDMS whenever the schemas have overlapping instances.

Effectiveness of query processing: With the created schema mappings, we evaluate the effectiveness of query processing under the two approaches. We again use precision and recall for our evaluation. Here, precision is defined as the ratio of the number of correct returned answers to the total number of returned answers, and recall is the ratio of the number of correct returned answers to the total number of correct answers. We generate six queries to evaluate the two methods, of which four are based on Amalgam schemas and two are based on THALIA schemas. Two of the queries contain join operations. Figure 6 shows the experimental results.
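The two query-level measures defined above can be stated compactly. As a notational sketch only (the symbols $A_q$ and $C_q$ are introduced here for illustration and do not appear in the original text), let $A_q$ be the set of answers returned for a query $q$ and $C_q$ the set of correct answers for $q$:

$$\mathrm{precision}(q) = \frac{|A_q \cap C_q|}{|A_q|}, \qquad \mathrm{recall}(q) = \frac{|A_q \cap C_q|}{|C_q|}.$$

As a hypothetical illustration, a query that returns 8 answers, 6 of them correct, when 10 correct answers exist in total, has precision 6/8 = 75% and recall 6/10 = 60%.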
Again, we can see that our method is more effective than the PeerDB approach.

Fig. 6. Query processing in PDMS. Panels: query processing precision (%) and query processing recall (%) for the queries $Q_1$ to $Q_6$. Series: PeerDB (# of keywords=2), PeerDB (# of keywords=5), no prior-knowledge, with prior-knowledge.

6 Conclusion

In this paper, we propose a method for effective schema mapping based on classification and probing in a PDMS. We classify each peer schema into certain categories through probing, and the relations in the same category can be mapped to each other. We enhance the classification-based mapping by applying a confusion matrix and prior-knowledge. We also present a strategy for reformulating a query over a local peer schema into queries on the various relevant peer schemas for effective query answering. Our experimental results show that our method achieves high accuracy for schema mapping on real datasets.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grant No , the National Grand Fundamental Research 973 Program of China under Grant No.2006CB303103, the National High Technology Development 863 Program of China under Grant No.2006AA01A101, Tsinghua Basic Research Foundation under Grant No. JCqn , and Zhejiang Natural Science Foundation under Grant No. Y

References

1. K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. A framework for semantic gossiping. SIGMOD Record, 31(4).
2. M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The Hyperion project: From data integration to data coordination. SIGMOD Record, 32(3):53-58.
3. W. W. Cohen. Learning trees and rules with set-valued features. In AAAI, pages 9 716.
4. R. Dhamankar, Y. Lee, A. Doan, et al. iMAP: Discovering complex semantic matches between database schemas. In SIGMOD.
5. R. O. Duda and P. E. Hart. Pattern classification and scene analysis. In Wiley.
6. E. Franconi, G. Kuper, A. Lopatenko, and I. Zaihrayeu. Queries and updates in the coDB peer to peer database. In VLDB.
7. L. Gravano, P. G. Ipeirotis, and M. Sahami. QProber: A system for automatic classification of hidden-web databases. 21(1):1-41.
8. S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer. In WebDB.
9. A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema mediation in peer data management systems. In ICDE.
10. J. Hammer, M. Stonebraker, and O. Topsakal. THALIA: Test harness for the assessment of legacy information integration approaches. In ICDE.
11. P. G. Ipeirotis, L. Gravano, and M. Sahami. Probe, count, and classify: Categorizing hidden-web databases. pages 61-78.
12. H. V. Jagadish, B. C. Ooi, and Q. H. Vu. BATON: A balanced tree structure for peer-to-peer networks. In VLDB.
13. P. Kalnis, W. S. Ng, B. C. Ooi, D. Papadias, and K. L. Tan. An adaptive peer-to-peer network for distributed caching of OLAP results. In SIGMOD.
14. J. Kang and J. Naughton. On schema matching with opaque column names and data values. In SIGMOD.
15. A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping data in peer to peer systems: Semantics and algorithmic issues. In SIGMOD.
16. R. Kohavi and F. Provost. Glossary of terms. 30(2/3).
17. R. J. Miller, D. Fisla, M. Huang, D. Kymlicka, F. Ku, and V. Lee. Amalgam schema and data integration test suite. miller/amalgam.
18. W. S. Ng, B. C. Ooi, and K. L. Tan. BestPeer: A self-configurable peer-to-peer system. In ICDE.
19. W. S. Ng, B. C. Ooi, K.-L. Tan, and A. Zhou. PeerDB: A P2P-based system for distributed data sharing. In ICDE.
20. B. C. Ooi, Y. Shu, and K.-L. Tan. Relational data sharing in peer-based data management systems. SIGMOD Record, 32(3):59-64.
21. J. R. Quinlan. C4.5: Programs for machine learning. In Morgan Kaufmann Publishers, Inc.
22. P. Rodriguez-Gianolli, M. Garzetti, L. Jiang, et al. Data sharing in the Hyperion peer database system. In VLDB.
23. I. Tatarinov, Z. Ives, J. Madhavan, and A. H. et al. The Piazza peer data management project. SIGMOD Record, 32(3):47-52.
24. V. N. Vapnik. Statistical learning theory. In Wiley-Interscience.
25. J. Wang, J.-R. Wen, F. H. Lochovsky, and W.-Y. Ma. Instance-based schema matching for web databases by domain-specific query probing. In VLDB.
26. B. Yang and H. Garcia-Molina. Comparing hybrid peer-to-peer systems. In VLDB.
27. C. Yu and L. Popa. Constraint-based XML query rewriting for data integration. In SIGMOD, 2004.


Keyword Join: Realizing Keyword Search for Information Integration Bei YU, Ling LIU 2, Beng Chin OOI,3 and Kian-Lee TAN,3 Singapore-MIT Alliance, National University of Singapore 2 College of Computing,