Schema Mapping in P2P Networks based on Classification and Probing


Guoliang Li 1, Beng Chin Ooi 2, Bei Yu 2, and Lizhu Zhou 1

1 Department of Computer Science and Technology, Tsinghua University, Beijing 084, China {liguoliang, dcszlz}@tsinghua.edu.cn
2 School of Computing, National University of Singapore, Singapore {ooibc, yubei}@comp.nus.edu.sg

Abstract. In this paper, we address the problems of creating adaptive schema mappings between different peers in a peer-to-peer network and of searching for interesting data residing at different peers based on such mappings. We begin by classifying the shared schema of each peer into a taxonomy of relation categories and attribute categories. We then propose our adaptive schema mapping, which selectively probes the shared schema with query probes generated from the classification rules. To improve the accuracy of schema mapping, we introduce the notions of confusion matrix and prior-knowledge. Finally, we present the query reformulation strategy for retrieving and integrating data from all relevant peers. We have implemented our proposed schema mapping and query processing methods in real settings with real datasets. The experimental results show that our method can be adopted effectively in practice.

1 Introduction

Sharing data among multiple sources is crucial in a wide range of applications, including enterprise data management, large-scale scientific projects, government agencies and the World-Wide Web in general. Data integration approaches offer an architecture for data sharing in which data is queried through a mediated schema, but physically stored at the source locations based on their own schemas. Recent data integration systems have been successful at enabling data sharing, but only on a relatively small scale, due to the high cost of constructing the mediated schema. Recently, peer data management systems (PDMS) have been proposed as an architecture for decentralized data sharing [1, 2, 9, 19, 20, 23].
A PDMS consists of a set of (physical) peers, and each peer has an associated schema, denoted as peer schema, that represents its domain of interest. Some peers store actual data with mappings between their physical schemas to their relevant peer schemas. However, a peer may not have complete data instances for its peer schema, since individual peers typically do not contain complete information about a domain. This calls for schema mappings in order to tap on relevant peers for more complete answers. Mapping all data sources to a single global schema (or mediator) in a PDMS is not feasible due to the decentralization and scalability requirements of P2P systems. Therefore, in a PDMS, mappings between disparate schemas are built directly and stored locally, such that when a query is posed at a peer, the answers are obtained by integrating retrieved results of reformulated queries from relevant peers, which are generated by exploring the mappings. Schema mapping of most existing proposals for PDMS such as Hyperion [2, 15], Piazza [9, 23], and PeerDB [19, 20] all require human intervention, which is inefficient and ineffective for large networks and dynamic sources. Therefore, an

adaptive way for generating schema mapping is highly desirable. In this paper, we propose such a schema mapping method based on classification. We classify the shared schemas (relational tables and attributes) of individual peers into a taxonomy of relation categories and associated attribute categories, which essentially represent various conceptual domains. Schema mappings are generated among all peers that have relations belonging to the same category. When a new peer joins, its shared schema is classified by probing its relations with query probes generated from classification rules; consequently, it is assigned to the one or more relation categories that the probing results match best. Subsequently, its schema is mapped to the peers in the same categories. One advantage of our classification-based schema mapping is that its simplicity and modeling uniformity allow integrating the contents of several sources without having to tackle complex structural differences. Another advantage is that query evaluation over classification-based sources can be done efficiently.

Our system is based on a super-peer P2P network in which the super peers themselves are organized in a structured overlay, such as BATON [12], and the normal peers within the cluster managed by a super peer are unstructured. The categories are distributed among super peers, through which normal peers build schema mappings. Our category structure is distinct from a global schema (or mediator), since it is distributed among all the super peers, and it is used for peers to generate schema mappings, not for users to pose queries.

In this paper, we make the following contributions. We propose a method for schema mapping based on classification and probing in PDMS. We adopt the notion of confusion matrix [16] and apply prior-knowledge to improve the accuracy of schema mapping whenever there are overlapping instances among the shared schemas.
We present query reformulation strategies for reformulating local queries among relevant peers to achieve efficient query answering.

The paper is organized as follows. We discuss the related work in Section 2. Section 3 presents how to create the schema mapping, and Section 4 describes the query reformulation and evaluation strategies. In Section 5, we provide extensive experimental evaluations of our method, and we conclude the paper in Section 6.

2 Related Work

There is a long stream of research on schema mapping, and we briefly review recent and relevant proposals. Kang et al. [14] investigated schema matching techniques that work in the presence of opaque column names and data values. Yu et al. [27] proposed a method for constraint-based XML data integration. Dhamankar et al. [4] described the iMAP system, which semi-automatically discovers both 1-1 and complex matches. These three methods are only efficient in centralized environments. More recently, the database community has begun to exploit P2P technologies for database applications [2, 6, 8, 9, 13, 15, 22, 23, 26]. In [8], the problem of data placement for P2P systems was addressed, and it was shown how data management could be applied to P2P. In [26], the class of hybrid P2P systems, where some functionality is still centralized, was studied. In [13], caching of OLAP queries was addressed in the context of a P2P network. Ooi et al. [18-20] introduced an IR technique into schema mapping in PDMS. Halevy et al. addressed the issue of schema mediation and proposed a language for mediating between peer schemas in [9]. The Hyperion project was proposed in [2, 15, 22],

which created schema mapping via mapping tables and required human input. The coDB P2P DB prototype system, which measures the performance of various networks arranged in different topologies, was proposed in [6]. The schema mapping methods of existing studies mostly require human input or intervention. For example, in PeerDB [19], users are expected to provide additional descriptions for the relation and attribute names. In this paper, we take schema mapping one step further by not relying on additional input imposed on the users. Accordingly, we propose a practical and adaptive solution based on classification and probing.

3 Classification-based Schema Mapping

In this section, we first give an overview of how to construct a classification scheme for various peer schemas in Section 3.1. Then we describe the classification-based schema mapping in detail in Section 3.2 and Section 3.3.

3.1 Classification Overview

Fig. 1. Architecture of Schema Mapping based on Classification and Probing (diagram not reproduced; its components are a classifier built from sample schemas and classification rules, relation mapping via query probes, attribute mapping via query probes and prior-knowledge, and query reformulation in which a user query is standardized, sent as a standard form query to peers $P_1, \ldots, P_n$, localized at each peer, and the results are integrated into the final result for the user)

Figure 1 shows the overall architecture of our classification-based schema mapping method. Similar to a conceptual taxonomy, all the shared schemas in our system (relations and their attributes) are classified into certain categories, and each category may contain some subcategories. A hierarchical classification scheme is introduced as follows.

Hierarchical Classification Scheme: A hierarchical classification scheme is a rooted directed tree whose nodes correspond to categories.
Categories are either relation categories or attribute categories, and each relation category has some attribute subcategories. An edge from relation category $u$ to another relation category $v$ denotes specialization, while an edge from relation category $v$ to its attribute category $v_i$ denotes that a relation in category $v$ has an attribute in category $v_i$.

Fig. 2. A classification structure (diagram not reproduced; it shows the root with relation categories Protein and Programme, their attribute categories, and per-category lists of peer identifiers (PID) with confidence values (C))

Figure 2 illustrates the hierarchical classification scheme of our running example, where an ellipse denotes a relation category and a rounded rectangle denotes an attribute category. A relation category has several attribute subcategories, which correspond to its attributes. In Figure 2, the root node has two relation categories, Protein and Programme, while Programme has three relation subcategories (Java, C/C++, Delphi) and four attribute subcategories (Name, Author, Comp, Year).
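The tree described above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `Category` and its `find` helper are hypothetical, and the attribute subcategories of the leaf categories are left empty because the transcription does not state them reliably.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A relation category node (hypothetical name, for illustration only)."""
    name: str
    attributes: list = field(default_factory=list)  # attribute subcategories
    children: list = field(default_factory=list)    # relation subcategories (specializations)

    def find(self, name):
        """Depth-first search for a relation category by name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


# Running example from Figure 2; leaf attribute subcategories are omitted.
root = Category("root", children=[
    Category("Protein", attributes=["SeqID", "Length", "Sequence"]),
    Category("Programme", attributes=["Name", "Author", "Comp", "Year"], children=[
        Category("Java"), Category("C/C++"), Category("Delphi"),
    ]),
])

print([c.name for c in root.find("Programme").children])
```

Keeping relation subcategories and attribute subcategories in separate lists mirrors the two kinds of edges in the definition.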

PeerID (relevant peers): local schema -> mapped category
Peer a (relevant peers: b, c): Kinases -> Protein; ID -> SeqID; len -> Length; seq -> Sequence
Peer b (relevant peers: a, c): annexin -> Protein; identifier -> SeqID; length -> Length; seqs -> Sequence
Peer c (relevant peers: a, b): protein -> Protein; number -> SeqID; seqlength -> Length; sequence -> Protein; number -> SeqID; seq -> Sequence

Fig. 3. Local schema mapping

We have mentioned that the classification structure is maintained by super peers that can be organized with existing overlays, e.g. BATON [12]. Each super peer maintains a subset of categories, where each category is associated with its own classification rules (including prior-knowledge for attribute categories, which will be introduced in Section 3.3) and the physical addresses of the peers that have classified some relations into it. The categories on super peers are indexed with BATON's distributed index facility, so that we can go through the classification hierarchy from any of the super peers. The local schema mappings are not maintained by super peers but by normal peers. Each peer maintains its local schema mapping with its matched categories, and the identifiers of the relevant peers that have classified schemas into the same categories. In our running example, category protein has three peers a, b, and c that have classified their schemas into it. Correspondingly, each of peers a, b and c maintains its local schema mappings with protein, as shown in Figure 3.

Initially, the hierarchical classification scheme can be extracted from existing classifications using special-purpose languages and tools, or it can be constructed from scratch. If there are certain sample schemas in the PDMS, we can use many existing methods, such as the naive Bayes classifier [5], C4.5 [21], RIPPER [3], and Support Vector Machines [24], to classify them.
On the other hand, if there are no sample schemas, we can construct the classification from scratch: for a new schema, once it matches certain categories in the existing hierarchical classification scheme, it will be classified into these categories; otherwise, it will be inserted into the hierarchical classification scheme as a new category (which may be constructed as a parent category of some existing categories). In our approach, schema mapping based on classification operates in two phases: 1) relation mapping (Section 3.2); and 2) attribute mapping (Section 3.3), which are presented in the following subsections.

3.2 Relation Mapping

We create relation mappings between relevant peers by classifying their relations into the most relevant categories through query probing. The probe-based method has been used for mining hidden-web data in [7, 11, 25], which is orthogonal to the schema mapping of our work. Since the classification rules can capture the characteristics of various categories, we generate query probes according to these classification rules, and use them to differentiate various relations. Basically, if a query probe returns the expected results from a relation, this relation is related to the category of the query probe, and subsequently it is classified into that category.

Now we describe the class of rule-based classifiers and show how we can use a rule-based classifier to generate a set of query probes that help us estimate the number of results for each category of interest in a relation. In a rule-based classifier, the classification decisions are based on a set of logical classification rules, $\kappa_i \rightarrow C_i$, where the antecedents of the rules are conjunctions of words and the consequents are the category assignments. For example, the following classification rules are part of a classifier for the categories Java book and protein, respectively:

Java AND book $\rightarrow$ Java book; %a%c%g%t% $\rightarrow$ protein

Such rules can be used to classify previously unseen relations. For example, the first rule will classify a relation containing the keywords Java and book into the category Java book. The second will classify a relation containing values that match %a%c%g%t% into the category protein. We can simulate the behavior of a rule-based classifier over all categories of the classification scheme by mapping each rule $\kappa_i \rightarrow C_i$ of the classifier into a boolean query $q_i$ that is the conjunction of all keywords appearing in $\kappa_i$. Thus, if we send the query probe $q_i$ to a new relation $R$, the query will match exactly the $f(q_i)$ results in $R$ that would have been classified by the associated rule into category $C_i$. Instead of retrieving the concrete results, we only need to keep the number of matches reported for each query probe, and we use this number as the measure of whether the probed relation satisfies the corresponding classification rule. Having the result for each query probe, we can construct a good approximation of the Weight and Confidence vectors for a relation $R$. We approximate the number of results of $R$ in category $C_i$ as the total number of matches from all query probes derived from rules with category $C_i$.
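The rule-to-probe translation described above can be sketched with an in-memory SQLite database. This is an illustrative sketch under several assumptions that are not in the paper: the table `items`, its single text column `content`, the sample rows, and the helper `probe_count` are all hypothetical, and each keyword is evaluated as a SQL `LIKE` pattern.

```python
import sqlite3

# Hypothetical rule set: category -> keywords in the rule antecedent.
# Mirrors "Java AND book -> Java book" and "%a%c%g%t% -> protein".
RULES = {
    "Java book": ["Java", "book"],
    "protein": ["%a%c%g%t%"],
}


def probe_count(conn, table, column, keywords):
    """Send one boolean query probe and keep only the number of matches."""
    # A keyword that already carries wildcards is used as is.
    patterns = [k if "%" in k else f"%{k}%" for k in keywords]
    # Table and column names are interpolated directly: acceptable only in a sketch.
    where = " AND ".join(f"{column} LIKE ?" for _ in patterns)
    row = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {where}", patterns).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (content TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [
    ("Thinking in Java, a book on objects",),
    ("Java book for beginners",),
    ("Delphi handbook",),
    ("mkacgtll",),
])

counts = {cat: probe_count(conn, "items", "content", kws) for cat, kws in RULES.items()}
print(counts)
```

Only `COUNT(*)` is requested, which matches the observation that the probing peer needs the number of matches rather than the matching tuples themselves.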
Using these match counts, we generate the approximated weight and confidence vectors for $R$, with which we decide how to classify $R$ into one or more categories in the classification scheme.

Weight vector: Consider a relation $R$ and a hierarchical classification scheme $C = \{C_1, C_2, \ldots, C_n\}$, where each category $C_i \in C$ is associated with a classification rule $\kappa_i \rightarrow C_i$. $f(R, \kappa_i)$ represents the number of results when using $\kappa_i$ to probe $R$. The weight of relation $R$ for $C_i$, $W(R; C_i) = f(R, \kappa_i)$, is the number of answers in $R$ on category $C_i$.

Confidence vector: In the same setting as the weight vector, the estimated confidence of $R$ for $C_i$, $S(R; C_i)$, is:

$$S(R; C_i) = S(R; \mathrm{Parent}(C_i)) \cdot \frac{W(R; C_i)}{\sum_{C_j \text{ is a child of } \mathrm{Parent}(C_i)} W(R; C_j)}$$

As a special case, $S(R; \mathrm{root}) = 1$. $W(R; C_i)$ defines the absolute amount of results that relation $R$ contains about category $C_i$, while $S(R; C_i)$ defines the relative amount of results that relation $R$ contains about $C_i$. A weight-based classification would classify a relation into a category when the relation has a substantial number of results in the given category. Alternatively, a confidence-based classification would classify a relation into a category when a significant fraction of the results it contains are of this specific category. In general, however, we are interested in balancing both weight and confidence, using two associated thresholds $\tau_w$ and $\tau_c$, respectively; the classification criterion formalizes this.
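The weight and confidence definitions above, together with a top-down use of the two thresholds, can be sketched as follows. The weights, the threshold values, and the function names are hypothetical; the category tree is the running example, and the top-down descent is one plausible reading of how the thresholds are applied, not the authors' code.

```python
# Parent links of the running example (root -> Protein, Programme; Programme -> languages).
PARENT = {"Protein": "root", "Programme": "root",
          "Java": "Programme", "C/C++": "Programme", "Delphi": "Programme"}


def children(node):
    return [c for c, p in PARENT.items() if p == node]


def confidence(weights, cat):
    """S(R; C) = S(R; Parent(C)) * W(R; C) / sum of W over the children of Parent(C)."""
    if cat == "root":
        return 1.0
    parent = PARENT[cat]
    total = sum(weights.get(c, 0) for c in children(parent))
    if total == 0:
        return 0.0  # no probe matched under this parent
    return confidence(weights, parent) * weights.get(cat, 0) / total


def classify(weights, tau_w, tau_c, node="root"):
    """Push the relation down while both thresholds hold; prune failing subtrees."""
    passing = [c for c in children(node)
               if weights.get(c, 0) >= tau_w and confidence(weights, c) >= tau_c]
    if not passing:
        return [] if node == "root" else [node]
    result = []
    for c in passing:
        result.extend(classify(weights, tau_w, tau_c, c))
    return result


# Hypothetical probe counts W(R; C) for one relation R.
W = {"Protein": 10, "Programme": 190, "Java": 150, "C/C++": 30, "Delphi": 10}
print(confidence(W, "Programme"), confidence(W, "Java"))
print(classify(W, tau_w=20, tau_c=0.3))
```

With these numbers, C/C++ passes the weight threshold but not the confidence threshold, which illustrates why both tests are needed.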

Classification Criterion: Consider a classification scheme $C$ with categories $\{C_1; C_2; \ldots; C_n\}$ and a relation $R$. $R$ is classified into category $C_i$ if it satisfies all of the following conditions: (1) $W(R; C_i) \ge \tau_w$ and $S(R; C_i) \ge \tau_c$; (2) $W(R; C_j) \ge \tau_w$ and $S(R; C_j) \ge \tau_c$ for any ancestor $C_j$ of $C_i$; (3) $W(R; C_k) < \tau_w$ or $S(R; C_k) < \tau_c$ for any child $C_k$ of $C_i$; where $0 \le \tau_c < 1$ and $\tau_w \ge 1$ are the given thresholds.

With our hierarchical classification scheme, we classify the relations in a top-down way. A new relation is first classified by the root-level classifier and then recursively pushed down to the lower-level classifiers. A relation $R$ is pushed down to the category $C_j$ when $W(R; C_j)$ and $S(R; C_j)$ are no less than the thresholds $\tau_w$ (for weight) and $\tau_c$ (for confidence), respectively. If a category $C_k$ does not match $R$, we can prune the whole subtree rooted at $C_k$. The final set of categories into which we classify $R$ is the set of approximate categories of $R$ in $C$.

The probe-based method relies on category classifiers to define query probes and obtain category match information for a relation. Unfortunately, classifiers are not always perfect: sometimes they wrongly classify relations into incorrect categories and leave relations that do not match any rule unclassified. Here, we present a novel method to adjust the initial probing results in order to avoid such potential errors. It is common practice in the machine learning community to report classification results using a confusion matrix [16]. We adapt this notion for use in our probing scenario.

Confusion Matrix: Consider a classification scheme with categories $\{C_1; C_2; \ldots; C_n\}$ where, for each category $C_i$, there is a relevant relation $R_i$ mapped to it. The confusion matrix $M = (m_{ij})$ is an $n \times n$ matrix, where $m_{ij}$ is the number of matches generated from $R_j$ for the query probe w.r.t. category $C_i$, divided by the number of results in $R_j$.
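The definition above can be sketched in plain Python. All numbers below are hypothetical (the matrix of the paper's own example is not available here), and the small linear solver anticipates the adjustment discussed next in this section, in which probe counts are corrected with the inverse of $M$.

```python
def confusion_matrix(x, sizes):
    """Divide column j of the raw match counts x by the number of results in R_j."""
    n = len(sizes)
    return [[x[i][j] / sizes[j] for j in range(n)] for i in range(n)]


def solve(m, w):
    """Solve m * ew = w by Gauss-Jordan elimination with partial pivoting."""
    n = len(w)
    a = [row[:] + [w[i]] for i, row in enumerate(m)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Hypothetical training relations R_1..R_3 and raw probe matches x[i][j]
# (matches from R_j for the probe of category C_i).
sizes = [2000, 1500, 1000]
x = [[1800, 90, 50],
     [100, 1350, 60],
     [20, 30, 850]]
M = confusion_matrix(x, sizes)

# Probe counts that a relation with exactly `sizes` results per category would produce.
probed = [sum(M[i][j] * sizes[j] for j in range(3)) for i in range(3)]
# Undo the classifier's confusion to recover the exact weights.
adjusted = solve(M, probed)
print([round(v) for v in probed], [round(v) for v in adjusted])
```

Solving the linear system directly avoids forming the inverse explicitly, which is a design choice of this sketch rather than something the paper prescribes.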
In a perfect setting, the probes for $C_i$ match only results in $R_i$, and each result in $R_i$ matches exactly one probe for $C_i$. In this case the confusion matrix is the identity matrix. The process to create the confusion matrix is: i) Generate the query probes from the classification rules of the categories and probe the relations w.r.t. the categories in the classification scheme. ii) Create an auxiliary confusion matrix $X = (x_{ij})$ and set $x_{ij}$ equal to the number of matches from $R_j$ for the query probe w.r.t. category $C_i$. iii) Normalize the columns of $X$ by dividing column $j$ by the number of results in $R_j$. The result is the confusion matrix $M$.

Example 1. Suppose that we have a classifier for three categories $C_1$ = C/C++, $C_2$ = JAVA, $C_3$ = Delphi, and there are three relations $R_1$, $R_2$, $R_3$ with 2000, 1500, 0 records for C/C++, JAVA, Delphi, respectively. After probing these three relations with the three query probes generated from the classification rules, we construct the following confusion matrix. Element $m_{13}$ = 2000 means that it misclassifies records of $R_1$ into $R_3$.

$M$ = [matrix not reproduced]

Interestingly, multiplying the confusion matrix by the weight vector that represents the exact number of results for each category yields the weight vector with the number of results in each category as matched by the query probes. For instance, in Example 1, there are exactly 2000 results for $C_1$, 1500 results for $C_2$ and 0 results for $C_3$, and the probe results are 1830, 1420, 0. We can infer the exact weight vector, $EW$, from the probe result and the matrix $M$, where $EW(C) = M^{-1} W(C)$. Hence, when classifying a relation, we multiply $M^{-1}$ by $W(C)$ to obtain a better approximation of the weight vector.

3.3 Attribute Mapping

After classifying a relation into certain relation categories, we have to classify its attributes into the associated attribute categories, which can be performed similarly to the relation mapping described in Section 3.2. In addition, since attributes have their own characteristics, we introduce some techniques to improve the accuracy of attribute mapping in this section. Each attribute of a relation is associated with a particular type, such as string, number, or date, and different types capture different characteristics. Moreover, an attribute may be restricted to a certain domain; for example, the attribute Age (the age of a human being) is a number between 0 and 150, since there is no person whose age is larger than 150 or less than 0. Therefore, we introduce prior-knowledge for attribute mapping.

Prior-knowledge: Consider a relation category $C$ with attribute categories $\{L_1; L_2; \ldots; L_p\}$. Each attribute category $L_i$ must satisfy the prior-knowledge $\chi_i$, represented as $L_i = \chi_i$. $\chi_i$ can be generated manually or automatically; we generate it automatically based on machine learning techniques. Any attribute that does not satisfy $\chi_i$ cannot be mapped to $L_i$. Therefore, we can generate a query probe that selects values violating $\chi_i$ and use it to probe an unknown attribute $A_j$.
If some results are returned for this query probe, the attribute category $L_i$ cannot map to $A_j$; otherwise, we probe $A_j$ with the query probe that is generated from the classification rules w.r.t. $L_i$, and use the count of the probing results to represent the correlation of the two attributes. Accordingly, we can create attribute mappings more accurately with the help of the prior-knowledge.

Example 2. Consider category Person(ID, Name, Age, Sex) with prior-knowledge: 1) ID = Number(0,00); 2) Name = String; 3) Age = Number[0, 150]; 4) Sex = {alternative of two values (e.g. male; female)}. A relation People has been mapped to the relation category Person. Now we consider how to create the attribute mapping between them. Since there are only two values of Sex, we probe each attribute of People with the query probe generated according to the prior-knowledge of Sex: Select count(distinct probe-attribute) from People. If the result of this query probe is larger than two, we can be sure that this probe-attribute cannot be mapped to Sex. Also, we probe each attribute of People with the query probe generated according to the prior-knowledge of Age: Select probe-attribute from People where probe-attribute>150 or probe-attribute<0. If the probe query returns a non-empty result, we can be sure that probe-attribute cannot match Age.

Formally, we introduce the correlative matrix to create attribute mappings.

Correlative Matrix: Consider category $C$ with attribute subcategories $\{L_1; L_2; \ldots; L_p\}$, where each $L_i$ is associated with a prior-knowledge $\chi_i$. Relation $R$ with attributes $\{A_1; A_2; \ldots; A_q\}$ has been relation-mapped to $C$. The correlative matrix $Corr(C, R) = \{m_{ij}\}$ is a $p \times q$ matrix: $m_{ij} = 0$ if $A_j$ does not satisfy $\chi_i$; otherwise $m_{ij}$ is the number of results obtained when the query probe generated from the classification rule of $L_i$ is used to probe $A_j$.
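The two prior-knowledge probes of Example 2 can be tried against an in-memory SQLite table. The sample rows and the column names of `People` are hypothetical, and the `typeof(...)` guard in the Age probe is an addition of this sketch (it rules out non-numeric columns explicitly) rather than part of the query given in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (p_id INTEGER, p_name TEXT, p_age INTEGER, gender TEXT)")
conn.executemany("INSERT INTO People VALUES (?, ?, ?, ?)", [
    (1001, "Ann", 34, "female"),
    (1002, "Bob", 58, "male"),
    (1003, "Cy", 7, "male"),
    (1004, "Dee", 91, "female"),
])
COLUMNS = ["p_id", "p_name", "p_age", "gender"]


def violates_age(col):
    """Probe for values outside [0, 150]; any hit rules the column out for Age."""
    # The typeof() guard is an assumption added here, not part of Example 2.
    sql = (f"SELECT COUNT(*) FROM People "
           f"WHERE typeof({col}) != 'integer' OR {col} > 150 OR {col} < 0")
    return conn.execute(sql).fetchone()[0] > 0


def violates_sex(col):
    """Probe the number of distinct values; more than two rules the column out for Sex."""
    sql = f"SELECT COUNT(DISTINCT {col}) FROM People"
    return conn.execute(sql).fetchone()[0] > 2


age_candidates = [c for c in COLUMNS if not violates_age(c)]
sex_candidates = [c for c in COLUMNS if not violates_sex(c)]
print(age_candidates, sex_candidates)
```

Columns that survive a prior-knowledge probe are only candidates; as the text explains, the rule-based probe count then fills the corresponding entry of the correlative matrix.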
Relative Correlative Matrix: In the same setting as the correlative matrix, the relative correlative matrix $RCorr(C, R) = \{r_{ij}\}$ is a $p \times q$ matrix, where

$$r_{ij} = \frac{m_{ij}}{\sum_{k=1}^{p} m_{kj}}$$
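The column normalization above, combined with the two-threshold test that the attribute mapping criterion states next, can be sketched as follows. The matrix entries and thresholds are hypothetical illustration values, and the function names are not from the paper.

```python
def relative_correlative(m):
    """r_ij = m_ij / sum_k m_kj, i.e. each column of Corr is normalized to sum to 1."""
    p, q = len(m), len(m[0])
    col_sums = [sum(m[k][j] for k in range(p)) for j in range(q)]
    return [[m[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(q)]
            for i in range(p)]


def attribute_mapping(categories, attributes, m, tau_w, tau_c):
    """Map L_i to A_j when m_ij >= tau_w and r_ij >= tau_c."""
    r = relative_correlative(m)
    return {(categories[i], attributes[j])
            for i in range(len(categories))
            for j in range(len(attributes))
            if m[i][j] >= tau_w and r[i][j] >= tau_c}


categories = ["ID", "Name", "Age", "Sex"]           # attribute subcategories of Person
attributes = ["p_id", "p_name", "p_age", "gender"]  # attributes of relation people
# Hypothetical correlative matrix (rows: categories, columns: attributes).
corr = [[900, 0, 100, 0],
        [0, 1000, 0, 0],
        [50, 0, 950, 0],
        [0, 0, 0, 1000]]

mapping = attribute_mapping(categories, attributes, corr, tau_w=100, tau_c=0.5)
print(sorted(mapping))
```

In this example the entry for (ID, p_age) reaches the weight threshold but has a low relative share of its column, so only the confidence test excludes it.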

Attribute Mapping Criterion: Consider category $C$ with attribute subcategories $\{L_1; L_2; \ldots; L_p\}$ and relation $R$ with attributes $\{A_1; A_2; \ldots; A_q\}$. $L_i$ maps to $A_j$ if $Corr(C, R) = \{m_{ij}\}$ and $RCorr(C, R) = \{r_{ij}\}$ satisfy $r_{ij} \ge \tau_c$ and $m_{ij} \ge \tau_w$, where $\tau_w$ and $\tau_c$ are the thresholds of weight and confidence, respectively.

Example 3. Consider category $C$ = Person with attribute subcategories ID, Name, Age, Sex, and relation $R$ = people(p_id, p_name, p_age, gender). $R$ is relation-mapped to $C$, and we demonstrate how to create the attribute mapping between $C$ and $R$. We probe each attribute of $R$ using the prior-knowledge of the attribute subcategories in $C$ (the prior-knowledge in Example 2) and get the correlative and relative correlative matrices. We can see that each attribute of $R$ maps exactly to the corresponding attribute of $C$ with the help of prior-knowledge.

$Corr(C, R)$ and $RCorr(C, R)$: $4 \times 4$ matrices with rows ID, Name, Age, Sex and columns p_id, p_name, p_age, gender [entries not reproduced]

4 Query Reformulation

With the created schema mapping, we can reformulate a query issued to a peer over its peer schema into queries over the peer schemas of its relevant peers, such that they can understand and answer it. We first define the standard form query and the local form query in our system.

Standard form query: A standard form query is a query composed of relations and attributes of the relation categories and attribute categories in the hierarchical classification scheme.

Local form query: A local form query of peer $P$ is a query composed of the relations and attributes of $P$'s local peer schema.

Query reformulation with our method operates in three phases, which are described in the following subsections.

4.1 Standardization

In the standardization phase, the peer needs to transform the received query into the standard form query, which is represented by certain relation categories and their corresponding attribute categories.
Suppose the issued query is represented as a triple $Q = \langle R, A, C \rangle$, where $R$ is a relation name, $A$ is the attribute set $\{A_1; A_2; \ldots; A_p\}$, and $C$ is the condition set. (If the query contains more than one relation, we can decompose it into multiple single-relation queries and then integrate them.) We first find all the categories $\{R_1; R_2; \ldots; R_n\}$ such that each $R_i$ is mapped to $R$ through the local schema mapping. Then we look at the attribute subcategories of $R_i$, $N_i = \{N_{i1}; N_{i2}; \ldots; N_{ip}\}$, where $N_{ik}$ is mapped to $A_k$. Let $\mathrm{relevantPeers}(R_i)$ and $\mathrm{relevantPeers}(N_{ik})$ denote the sets of peers that have classified some relations and attributes into $R_i$ and $N_{ik}$, respectively. If

$$P_i = \mathrm{relevantPeers}(R_i) \cap \left( \bigcap_{k=1}^{p} \mathrm{relevantPeers}(N_{ik}) \right) \neq \emptyset,$$

then $R_i$ has a schema mapping with $R$, and we can reformulate $Q$ into $Q_i$ by replacing $R$ with $R_i$ and $A_k$ with $N_{ik}$, and send the standard form query $Q_i$ to the peers in $P_i$. In addition, we can get the set of all the peers that have relations mapped to $R$, $P = \bigcup_{i=1}^{n} P_i$.

4.2 Localization

When the relevant peers receive the standard form query from the query initiator, they need to reformulate it into a local form query over their own peer schemas in order to execute it. The reformulation process for transforming a standard form query into a local form query is similar to the one described in Section 4.1. We again consider the standard form query as a triple $Q = \langle R, A, C \rangle$. We first find the set $S$ composed of the local relations $S_i$ that map to $R$. If $S_i$ contains all the attributes in $A$, we rewrite $Q$ by replacing $R$ with $S_i$, and $A$ with the corresponding attributes in $S_i$. In some cases, the local peer cannot reformulate the standard form query $Q$ into a local form query with one relation, because it needs to join several relations to answer $Q$. For example, if $S_i \in S$ and there is an attribute $A_k \in A$ that is not an attribute of $S_i$, then there must be an $S_j \in S$ that has the attribute $A_k$. If $A \subseteq S_i \cup S_j$, we can answer $Q$ by joining $S_i$ and $S_j$; otherwise we need to find more relations to join with $S_i$ and $S_j$ in order to answer $Q$. After a relevant peer answers the reformulated local form query, it returns the results encapsulated by the attributes in $A$, such that the query initiator can recognize them.

4.3 Integration

When receiving the answers from relevant peers, the query initiator transforms those answers, which are represented by attributes of the standard form query, into answers represented with its local attributes, and integrates these results to return to the user. Suppose the issued query is a triple $Q = \langle R, A, C \rangle$, and its corresponding reformulated standard form queries are $Q_1 \langle R_1, A_1, C_1 \rangle; Q_2 \langle R_2, A_2, C_2 \rangle; \ldots; Q_n \langle R_n, A_n, C_n \rangle$. The mapping from the issued query to the standard form queries is one-to-many, but the mapping in reverse is one-to-one. Therefore, the transformation of the attributes in the answers is much easier than query reformulation. Suppose a relevant peer returns its results for $Q_i \langle R_i, A_i, C_i \rangle$. Since the querying peer knows the mapping from $Q$ to $Q_i$ from the standardization phase, it can simply transform the attributes in the answers by replacing the attributes in $A_i$ with the corresponding attributes in $A$. Finally, it integrates the results from all relevant peers and returns them to the user.
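The three phases of this section can be sketched for the simplest case: a single relation, one mapped category, and no joins. The `Query` class, the function names, and the sample condition and result row are hypothetical; the name mappings for peers a and b follow the running example of Figure 3.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """The triple <R, A, C>: relation, attribute list, condition list."""
    relation: str
    attributes: list
    conditions: list  # (attribute, operator, value) triples


# Local name -> category name, as in the running example of Figure 3.
PEER_A = {"relation": {"Kinases": "Protein"},
          "attribute": {"ID": "SeqID", "len": "Length", "seq": "Sequence"}}
PEER_B = {"relation": {"annexin": "Protein"},
          "attribute": {"identifier": "SeqID", "length": "Length", "seqs": "Sequence"}}


def rename(query, rel_map, attr_map):
    return Query(rel_map[query.relation],
                 [attr_map[a] for a in query.attributes],
                 [(attr_map[a], op, v) for a, op, v in query.conditions])


def invert(mapping):
    return {v: k for k, v in mapping.items()}


def standardize(query, peer):
    """Phase 1: local form at the initiator -> standard form."""
    return rename(query, peer["relation"], peer["attribute"])


def localize(std, peer):
    """Phase 2: standard form -> local form at a relevant peer (single relation only)."""
    return rename(std, invert(peer["relation"]), invert(peer["attribute"]))


def integrate(rows, std, issued):
    """Phase 3: rename standard-form attributes in the answers back to the initiator's."""
    back = dict(zip(std.attributes, issued.attributes))
    return [{back[k]: v for k, v in row.items()} for row in rows]


q = Query("Kinases", ["ID", "seq"], [("len", ">", 300)])
std = standardize(q, PEER_A)
local_b = localize(std, PEER_B)
answers = integrate([{"SeqID": "P1", "Sequence": "MKV"}], std, q)
print(std, local_b, answers, sep="\n")
```

Because the reverse mapping from a standard form query to the issued query is one-to-one, the integration step reduces to a positional rename, as the text notes.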
5 Experimental Study

In this section, we report a performance study that evaluates our schema mapping method. The proposed method was implemented in Java. We used the Amalgam schema and data integration test suite [17] and the THALIA benchmark [10] as our experimental data sources. (Footnote 1: There are 28 databases and 35 tables in THALIA. We transform THALIA data into 35 relation tables, denoted as $S_5$, and we also create schema mappings between them.) We evaluate our method from two aspects. First, we study the effectiveness of our schema mapping strategy for matching two schemas. Second, we look at the performance of schema mapping and query processing in a real P2P network setting.

5.1 Mapping between two Schemas

We first evaluate the quality of the mappings obtained with our method between two schemas. Given two schemas, we first classify the relations and attributes of one schema and get the corresponding classification rules; then we create the schema mapping between them by classifying the relations and attributes of the other schema with our probing strategy. We use precision and recall to evaluate the quality of the mappings obtained. Precision is the ratio of the number of correct relation mappings obtained (a relation mapping is correct when both the relations and their attributes are mapped correctly) to the total number of relation mappings obtained. Recall is the ratio of the number of correct relation mappings obtained to the total number of correct relation mappings. Consider two schemas $S$ and $T$. We denote the precision of probing $T$ with $S$ as $P_{ST}$, that is, $S$ is the schema classified first. Similarly, $P_{TS}$ is the precision of probing $S$ with $T$. In addition, we define

$$F_{ST} = F_{TS} = \frac{2 P_{ST} P_{TS}}{P_{ST} + P_{TS}}.$$

These three metrics are used to evaluate the precision of schema mappings between $S$ and $T$. In the same way, we define the corresponding metrics for recall as $R_{ST}$, $R_{TS}$ and

$$F'_{ST} = F'_{TS} = \frac{2 R_{ST} R_{TS}}{R_{ST} + R_{TS}}.$$

Fig. 4. Mapping between two schemas (plots not reproduced; four panels: precision without prior-knowledge, recall without prior-knowledge, precision with prior-knowledge, recall with prior-knowledge; x-axis: the schema pairs $S_1 S_2$, $S_1 S_3$, $S_1 S_4$, $S_2 S_3$, $S_2 S_4$, $S_3 S_4$; y-axis: matching precision or recall in %)

Our method is evaluated in two cases. First, we create schema mappings without prior-knowledge. Second, we create schema mappings with prior-knowledge. Figure 4 shows the experimental results of matching 6 pairs of schemas from $S_1$ to $S_4$, where $P$, $P'$, $R$, $R'$, $F$ and $F'$ denote $P_{ST}$, $P_{TS}$, $R_{ST}$, $R_{TS}$, $F_{ST}$ and $F'_{ST}$, respectively. The experimental results show that our method achieves high precision and recall. When there is no prior-knowledge, the precision is about 55-75%, and the recall is about -%. Given some prior-knowledge, the accuracy of schema mapping improves dramatically. The precision reaches 75-%, and the recall increases to 85-%. Also, we can see that $P_{ST}$ and $P_{TS}$ do not differ considerably, which shows the stability of our method.

5.2 Mapping in PDMS

In this section, we evaluate our method in a PDMS and compare its performance with PeerDB [19]. The experimental environment consists of 32 PCs (of which 8 are super peers), each with an Intel Pentium 2.4 GHz processor and 512 MB of RAM. All the PCs run the Windows XP operating system. We classify some sample schemas into categories using Bayes classifiers [5], and the eight super peers maintain these categories.
Each normal peer shares its peer schema and joins one randomly chosen super peer by classifying its relations into certain categories.

Fig. 5. Schema mapping in PDMS. (Two panels: matching precision (%) and matching recall (%) for schemas $S_1$ to $S_5$; series: PeerDB (# of keywords = 2), PeerDB (# of keywords = 5), no prior-knowledge, with prior-knowledge.)
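The precision, recall, and F metrics defined in Section 5.1 can be illustrated with a small sketch. The Python below is not the authors' implementation (the paper states theirs was written in Java); the function names, the dictionary representation of a relation mapping, and the sample relation and attribute names are all assumptions made for illustration. Following the definition in Section 5.1, a relation mapping is counted as correct only when both the target relation and every attribute mapping agree with the reference mapping.

```python
from typing import Dict, Tuple

# Hypothetical representation of a schema mapping:
#   source relation -> (target relation, {source attribute: target attribute})
RelationMapping = Dict[str, Tuple[str, Dict[str, str]]]


def count_correct(found: RelationMapping, gold: RelationMapping) -> int:
    """Count mappings whose target relation and attribute mappings both match the reference."""
    return sum(1 for rel, mapped in found.items() if gold.get(rel) == mapped)


def precision(found: RelationMapping, gold: RelationMapping) -> float:
    """Correct relation mappings divided by all relation mappings obtained."""
    return count_correct(found, gold) / len(found) if found else 0.0


def recall(found: RelationMapping, gold: RelationMapping) -> float:
    """Correct relation mappings obtained divided by all correct relation mappings."""
    return count_correct(found, gold) / len(gold) if gold else 0.0


def harmonic_mean(a: float, b: float) -> float:
    """F = 2ab / (a + b), used to combine the two probing directions (e.g. P_ST and P_TS)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


# Made-up reference mapping between two small schemas.
gold: RelationMapping = {
    "Article": ("Publication", {"title": "name", "year": "pub_year"}),
    "Author": ("Writer", {"name": "full_name"}),
    "Journal": ("Venue", {"title": "venue_name"}),
    "Publisher": ("Press", {"name": "press_name"}),
}

# Made-up output of a mapping run: the third entry maps the right relation
# but a wrong attribute, so it does not count as a correct relation mapping.
found: RelationMapping = {
    "Article": ("Publication", {"title": "name", "year": "pub_year"}),
    "Author": ("Writer", {"name": "full_name"}),
    "Journal": ("Venue", {"title": "issn"}),
}

p_st = precision(found, gold)
r_st = recall(found, gold)
```

With this sample data, two of the three obtained mappings are correct, so the precision is 2/3 and the recall is 2/4. Calling `harmonic_mean` on the precision values of the two probing directions gives the F value for a schema pair.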

In this experiment, we first evaluate the quality of the schema mappings generated with the two approaches, and then we compare the effectiveness of query processing in the PDMS with the two methods.

Quality of schema mapping: Similar to Section 5.1, we use precision and recall to evaluate the quality of schema mappings for each of the schemas from $S_1$ to $S_5$. Figure 5 shows the experimental results of our method (with and without prior-knowledge) compared with those of PeerDB (with 2 keywords annotated for each relation and with 5 keywords annotated for each relation). Not surprisingly, in the PDMS our schema mapping method with prior-knowledge is more effective than that without prior-knowledge. The precision and recall with prior-knowledge are larger than % for most schemas. Our method is superior to the PeerDB approach for most schemas (except $S_3$). Generally, the precision and recall of our method exceed those of PeerDB by 10% to 20%. Moreover, PeerDB depends on the keywords annotated to a schema, which must be generated manually. Annotating more keywords to a schema can improve the recall, but degrades the precision. The experimental results show that our method has good schema mapping performance in a PDMS whenever the schemas have overlapping instances.

Effectiveness of query processing: With the created schema mappings, we evaluate the effectiveness of query processing of the two approaches. We again use the notions of precision and recall for our evaluation. Here precision is defined as the ratio of the number of correct returned answers to the total number of returned answers, and recall is the ratio of the number of correct returned answers to the total number of correct answers. We generate six queries to evaluate the two methods, of which four are based on Amalgam schemas and two are based on THALIA schemas. Two of the queries contain join operations. Figure 6 shows the experimental results.
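The query-level precision and recall defined above can be sketched in the same spirit, treating the returned answers and the correct answers of one query as sets of tuples. This is an illustrative sketch only: the function name `answer_quality` and the sample tuples are hypothetical, and using sets is an assumption that ignores duplicate answers.

```python
from typing import Hashable, Iterable, Tuple


def answer_quality(returned: Iterable[Hashable],
                   correct: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (precision, recall) of a query's returned answers against the correct answers."""
    returned_set, correct_set = set(returned), set(correct)
    hits = len(returned_set & correct_set)  # correct answers that were returned
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(correct_set) if correct_set else 0.0
    return precision, recall


# Made-up answers to a single query, as (title, year) tuples.
returned_answers = [("Paper A", 2004), ("Paper B", 2005), ("Paper X", 1999)]
correct_answers = [("Paper A", 2004), ("Paper B", 2005),
                   ("Paper C", 2006), ("Paper D", 2003)]

query_precision, query_recall = answer_quality(returned_answers, correct_answers)
```

Here two of the three returned answers are correct and two of the four correct answers are returned, so the precision is 2/3 and the recall is 1/2.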
Again, we can see that our method is more effective than the PeerDB approach.

Fig. 6. Query processing in PDMS. (Two panels: query processing precision (%) and query processing recall (%) for queries $Q_1$ to $Q_6$; series: PeerDB (# of keywords = 2), PeerDB (# of keywords = 5), no prior-knowledge, with prior-knowledge.)

6 Conclusion

In this paper, we propose a method for effective schema mapping based on classification and probing in a PDMS. We classify each peer schema into certain categories through probing, and the relations in the same category can be mapped to each other. We enhance the classification-based mapping by applying a confusion matrix and prior-knowledge. We also present a strategy for reformulating a query over a local peer schema into queries on the various relevant peer schemas for effective query answering. Our experimental results show that our method achieves high accuracy for schema mapping on real datasets.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grant No , the National Grand Fundamental Research 973 Program of China under Grant No. 2006CB303103, the National High Technology Development 863 Program of China under Grant No. 2006AA01A101, Tsinghua Basic Research Foundation under Grant No. JCqn , and Zhejiang Natural Science Foundation under Grant No. Y

References

1. K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. A framework for semantic gossiping. SIGMOD Record, 31(4).
2. M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The Hyperion project: From data integration to data coordination. SIGMOD Record, 32(3):53-58.
3. W. W. Cohen. Learning trees and rules with set-valued features. In AAAI, pages 9 716.
4. R. Dhamankar, Y. Lee, A. Doan, et al. iMAP: Discovering complex semantic matches between database schemas. In SIGMOD.
5. R. O. Duda and P. E. Hart. Pattern classification and scene analysis. Wiley.
6. E. Franconi, G. Kuper, A. Lopatenko, and I. Zaihrayeu. Queries and updates in the coDB peer to peer database. In VLDB.
7. L. Gravano, P. G. Ipeirotis, and M. Sahami. QProber: A system for automatic classification of hidden-web databases. 21(1):1-41.
8. S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer. In WebDB.
9. A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema mediation in peer data management systems. In ICDE.
10. J. Hammer, M. Stonebraker, and O. Topsakal. THALIA: Test harness for the assessment of legacy information integration approaches. In ICDE.
11. P. G. Ipeirotis, L. Gravano, and M. Sahami. Probe, count, and classify: Categorizing hidden-web databases. Pages 61-78.
12. H. V. Jagadish, B. C. Ooi, and Q. H. Vu. BATON: A balanced tree structure for peer-to-peer networks. In VLDB.
13. P. Kalnis, W. S. Ng, B. C. Ooi, D. Papadias, and K. L. Tan. An adaptive peer-to-peer network for distributed caching of OLAP results. In SIGMOD.
14. J. Kang and J. Naughton. On schema matching with opaque column names and data values. In SIGMOD.
15. A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping data in peer to peer systems: Semantics and algorithmic issues. In SIGMOD.
16. R. Kohavi and F. Provost. Glossary of terms. 30(2/3).
17. R. J. Miller, D. Fisla, M. Huang, D. Kymlicka, F. Ku, and V. Lee. Amalgam schema and data integration test suite. miller/amalgam.
18. W. S. Ng, B. C. Ooi, and K. L. Tan. BestPeer: A self-configurable peer-to-peer system. In ICDE.
19. W. S. Ng, B. C. Ooi, K.-L. Tan, and A. Zhou. PeerDB: A P2P-based system for distributed data sharing. In ICDE.
20. B. C. Ooi, Y. Shu, and K.-L. Tan. Relational data sharing in peer-based data management systems. SIGMOD Record, 32(3):59-64.
21. J. R. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann Publishers, Inc.
22. P. Rodriguez-Gianolli, M. Garzetti, L. Jiang, et al. Data sharing in the Hyperion peer database system. In VLDB.
23. I. Tatarinov, Z. Ives, J. Madhavan, A. H. et al. The Piazza peer data management project. SIGMOD Record, 32(3):47-52.
24. V. N. Vapnik. Statistical learning theory. Wiley-Interscience.
25. J. Wang, J.-R. Wen, F. H. Lochovsky, and W.-Y. Ma. Instance-based schema matching for web databases by domain-specific query probing. In VLDB.
26. B. Yang and H. Garcia-Molina. Comparing hybrid peer-to-peer systems. In VLDB.
27. C. Yu and L. Popa. Constraint-based XML query rewriting for data integration. In SIGMOD, 2004.


More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE

STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE STRUCTURE-BASED QUERY EXPANSION FOR XML SEARCH ENGINE Wei-ning Qian, Hai-lei Qian, Li Wei, Yan Wang and Ao-ying Zhou Computer Science Department Fudan University Shanghai 200433 E-mail: wnqian@fudan.edu.cn

More information

Keyword Search over Hybrid XML-Relational Databases

Keyword Search over Hybrid XML-Relational Databases SICE Annual Conference 2008 August 20-22, 2008, The University Electro-Communications, Japan Keyword Search over Hybrid XML-Relational Databases Liru Zhang 1 Tadashi Ohmori 1 and Mamoru Hoshi 1 1 Graduate

More information

Hidden-Web Databases: Classification and Search

Hidden-Web Databases: Classification and Search Hidden-Web Databases: Classification and Search Luis Gravano Columbia University http://www.cs.columbia.edu/~gravano Joint work with Panos Ipeirotis (Columbia) and Mehran Sahami (Stanford/Google) Outline

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

Semantic Query Routing Experiences in a PDMS

Semantic Query Routing Experiences in a PDMS Semantic Query Routing Experiences in a PDMS Federica Mandreoli, Riccardo Martoglia, Wilma Penzo, and Simona Sassatelli DII University of Modena and Reggio Emilia, Italy {fmandreoli,rmartoglia,sassatelli}@unimo.it

More information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Hierarchical Online Mining for Associative Rules

Hierarchical Online Mining for Associative Rules Hierarchical Online Mining for Associative Rules Naresh Jotwani Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar 382009 INDIA naresh_jotwani@da-iict.org Abstract Mining

More information

A Peer-to-peer Framework for Caching Range Queries

A Peer-to-peer Framework for Caching Range Queries A Peer-to-peer Framework for Caching Range Queries O. D. Şahin A. Gupta D. Agrawal A. El Abbadi Department of Computer Science University of California Santa Barbara, CA 9316, USA {odsahin, abhishek, agrawal,

More information

Automatic Query Type Identification Based on Click Through Information

Automatic Query Type Identification Based on Click Through Information Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

More information

Learning mappings and queries

Learning mappings and queries Learning mappings and queries Marie Jacob University Of Pennsylvania DEIS 2010 1 Schema mappings Denote relationships between schemas Relates source schema S and target schema T Defined in a query language

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree

Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Discovery of Multi-level Association Rules from Primitive Level Frequent Patterns Tree Virendra Kumar Shrivastava 1, Parveen Kumar 2, K. R. Pardasani 3 1 Department of Computer Science & Engineering, Singhania

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

Improved Attack on Full-round Grain-128

Improved Attack on Full-round Grain-128 Improved Attack on Full-round Grain-128 Ximing Fu 1, and Xiaoyun Wang 1,2,3,4, and Jiazhe Chen 5, and Marc Stevens 6, and Xiaoyang Dong 2 1 Department of Computer Science and Technology, Tsinghua University,

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique www.ijcsi.org 29 Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn

More information

An Information-Theoretic Approach to the Prepruning of Classification Rules

An Information-Theoretic Approach to the Prepruning of Classification Rules An Information-Theoretic Approach to the Prepruning of Classification Rules Max Bramer University of Portsmouth, Portsmouth, UK Abstract: Keywords: The automatic induction of classification rules from

More information

Diversity Coloring for Distributed Storage in Mobile Networks

Diversity Coloring for Distributed Storage in Mobile Networks Diversity Coloring for Distributed Storage in Mobile Networks Anxiao (Andrew) Jiang and Jehoshua Bruck California Institute of Technology Abstract: Storing multiple copies of files is crucial for ensuring

More information

Teiid Designer User Guide 7.5.0

Teiid Designer User Guide 7.5.0 Teiid Designer User Guide 1 7.5.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Why Use Teiid Designer?... 2 1.3. Metadata Overview... 2 1.3.1. What is Metadata... 2 1.3.2. Editing Metadata

More information

Detect tracking behavior among trajectory data

Detect tracking behavior among trajectory data Detect tracking behavior among trajectory data Jianqiu Xu, Jiangang Zhou Nanjing University of Aeronautics and Astronautics, China, jianqiu@nuaa.edu.cn, jiangangzhou@nuaa.edu.cn Abstract. Due to the continuing

More information

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction International Journal of Engineering Science Invention Volume 2 Issue 1 January. 2013 An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction Janakiramaiah Bonam 1, Dr.RamaMohan

More information

A New Technique to Optimize User s Browsing Session using Data Mining

A New Technique to Optimize User s Browsing Session using Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information