Schema Mapping in P2P Networks based on Classification and Probing
Guoliang Li 1, Beng Chin Ooi 2, Bei Yu 2, and Lizhu Zhou 1

1 Department of Computer Science and Technology, Tsinghua University, Beijing 084, China
{liguoliang, dcszlz}@tsinghua.edu.cn
2 School of Computing, National University of Singapore, Singapore
{ooibc, yubei}@comp.nus.edu.sg

Abstract. In this paper, we address two problems: building adaptive schema mappings between different peers in a peer-to-peer network, and searching for interesting data residing at different peers based on such mappings. We begin by classifying the shared schema of each peer into a taxonomy of relation categories and attribute categories. We then build adaptive schema mappings by selectively probing the shared schema with query probes generated from the classification rules. To improve the accuracy of schema mapping, we introduce the notions of confusion matrix and prior-knowledge. Finally, we present a query reformulation strategy for retrieving and integrating data from all relevant peers. We have implemented the proposed schema mapping and query processing methods in real settings with real datasets. The experimental results show that our method can be adopted effectively in practice.

1 Introduction

Sharing data among multiple sources is crucial in a wide range of applications, including enterprise data management, large-scale scientific projects, government agencies and the World-Wide Web in general. Data integration approaches offer an architecture for data sharing in which data is queried through a mediated schema but physically stored at the source locations under their own schemas. Recent data integration systems have been successful at enabling data sharing, but only on a relatively small scale, because constructing the mediated schema is expensive. Recently, peer data management systems (PDMS) have been proposed as an architecture for decentralized data sharing [1, 2, 9, 19, 20, 23].
A PDMS consists of a set of (physical) peers, each with an associated schema, called its peer schema, that represents its domain of interest. Some peers store actual data, with mappings from their physical schemas to the relevant peer schemas. However, a peer may not have complete data instances for its peer schema, since individual peers typically do not contain complete information about a domain. This calls for schema mappings that tap into relevant peers for more complete answers. Mapping all data sources to a single global schema (or mediator) in a PDMS is not feasible because of the decentralization and scalability requirements of P2P systems. Therefore, in a PDMS, mappings between disparate schemas are built directly and stored locally: when a query is posed at a peer, it is reformulated through the mappings into queries for the relevant peers, and the answers are obtained by integrating the results they return. Most existing PDMS proposals, such as Hyperion [2, 15], Piazza [9, 23], and PeerDB [19, 20], require human intervention to create schema mappings, which is inefficient and ineffective for large networks and dynamic sources. Therefore, an
adaptive way of generating schema mappings is highly desirable. In this paper, we propose such a schema mapping method based on classification. We classify the shared schemas (relational tables and attributes) of individual peers into a taxonomy of relation categories and associated attribute categories, which essentially represent various conceptual domains. Schema mappings are generated among all peers that have relations belonging to the same category. When a new peer joins, its shared schema is classified by probing its relations with query probes generated from classification rules; the peer is then assigned to the one or more relation categories that best match the probing results, and its schema is mapped to the peers in those categories.

The advantage of our classification-based schema mapping is that its simplicity and modeling uniformity allow the contents of several sources to be integrated without having to tackle complex structural differences. Another advantage is that query evaluation over classification-based sources can be done efficiently.

Our system is based on a super-peer P2P network in which the super peers are organized in a structured overlay, such as BATON [12], and the normal peers within the cluster managed by a super peer are unstructured. The categories are distributed among super peers, through which normal peers build schema mappings. Our category structure is distinct from a global schema (or mediator): it is distributed among all the super peers, and it is used by peers to generate schema mappings, not by users to pose queries.

In this paper, we make the following contributions:
- We propose a method for schema mapping based on classification and probing in PDMS.
- We adopt the notion of confusion matrix [16] and apply prior-knowledge to improve the accuracy of schema mapping whenever there are overlapping instances among the shared schemas.
- We present query reformulation strategies for reformulating local queries among relevant peers to achieve efficient query answering.

The paper is organized as follows. We discuss related work in Section 2. Section 3 presents how to create the schema mapping, and Section 4 describes the query reformulation and evaluation strategies. In Section 5, we provide extensive experimental evaluations of our method, and we conclude the paper in Section 6.

2 Related Work

There is a long stream of research on schema mapping, and we briefly review recent and relevant proposals. Kang et al. [14] investigated schema matching techniques that work in the presence of opaque column names and data values. Yu et al. [27] proposed a method for constraint-based XML data integration. Dhamankar et al. [4] described the iMAP system, which semi-automatically discovers both 1-1 and complex matches. These three methods are efficient only in a centralized environment.

More recently, the database community has begun to exploit P2P technologies for database applications [2, 6, 8, 9, 13, 15, 22, 23, 26]. In [8], the problem of data placement in P2P systems was addressed, and it was shown how data management could be applied to P2P. In [26], the class of hybrid P2P systems, where some functionality is still centralized, was studied. In [13], caching of OLAP queries was addressed in the context of a P2P network. Ooi et al. [18-20] introduced an IR technique into schema mapping in PDMS. Halevy et al. addressed the issue of schema mediation and proposed a language for mediating between peer schemas in [9]. The Hyperion project was proposed in [2, 15, 22],
which created schema mapping via mapping tables and required human input. The coDB P2P database prototype, which measures the performance of various networks arranged in different topologies, was proposed in [6].

Most existing approaches to schema mapping require human input or intervention. For example, in PeerDB [19], users are expected to provide additional descriptions for the relation and attribute names. In this paper, we take schema mapping one step further by not relying on such additional user input. Accordingly, we propose a practical and adaptive solution based on classification and probing.

3 Classification-based Schema Mapping

In this section, we first give an overview of how to construct a classification scheme for various peer schemas in Section 3.1. We then describe the classification-based schema mapping in detail in Sections 3.2 and 3.3.

3.1 Classification Overview

[Figure 1: diagram linking classification and rules (classifier, sample schemas), relation mapping and attribute mapping via query probes and prior-knowledge, and query reformulation through standardization, localization, and integration across peers P 1 to P n]
Fig. 1. Architecture of Schema Mapping based on Classification and Probing

Figure 1 shows the overall architecture of our classification-based schema mapping method. As in a conceptual taxonomy, all the shared schemas in our system (relations and their attributes) are classified into categories, and each category may contain subcategories. A hierarchical classification scheme is introduced as follows.

Hierarchical Classification Scheme: A hierarchical classification scheme is a rooted directed tree whose nodes correspond to categories.
Categories comprise relation categories and attribute categories, and each relation category has some attribute subcategories. An edge from relation category $u$ to another relation category $v$ denotes specialization, while an edge from relation category $v$ to its attribute category $v_i$ denotes that a relation in $v$ has an attribute in $v_i$.

[Figure 2: tree rooted at root with relation categories protein and programme and their attribute categories; each category is annotated with a table of peer identifiers (PID) and confidence values (C)]
Fig. 2. A classification structure

Figure 2 illustrates the hierarchical classification scheme of our running example, where an ellipse denotes a relation category and a rounded rectangle denotes an attribute category. A relation category has several attribute subcategories, which correspond to its attributes. In Figure 2, the root node has two relation categories, Protein and Programme, while Programme has three relation subcategories (Java, C/C++, Delphi) and four attribute subcategories (Name, Author, Comp, Year).
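As a minimal sketch of the definition above, the classification tree can be held in a small recursive data structure. The class name `RelationCategory` and its `find` helper are hypothetical and only illustrate the two kinds of edges (specialization and attribute membership); the attribute lists of the Java, C/C++ and Delphi subcategories are left empty here because they are not spelled out in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RelationCategory:
    """A node of the classification tree (a relation category)."""
    name: str
    # Attribute subcategories attached to this relation category.
    attributes: List[str] = field(default_factory=list)
    # Specialization edges from this relation category to its subcategories.
    children: List["RelationCategory"] = field(default_factory=list)

    def find(self, name: str) -> Optional["RelationCategory"]:
        """Depth-first search for a relation category by name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


# Running example of Figure 2; the Protein attributes follow Figure 3.
root = RelationCategory("root", children=[
    RelationCategory("Protein", attributes=["SeqID", "Length", "Sequence"]),
    RelationCategory(
        "Programme",
        attributes=["Name", "Author", "Comp", "Year"],
        children=[
            RelationCategory("Java"),
            RelationCategory("C/C++"),
            RelationCategory("Delphi"),
        ],
    ),
])

programme = root.find("Programme")
print([child.name for child in programme.children])
print(programme.attributes)
```

In the system described here the tree would be partitioned across super peers rather than held in one process; the sketch ignores distribution entirely.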
PeerID   Relevant peers   Local schema   Mapped category
Peer a   Peer b, c        Kinases        Protein
                          ID             SeqID
                          len            Length
                          seq            Sequence
Peer b   Peer a, c        annexin        Protein
                          identifier     SeqID
                          length         Length
                          seqs           Sequence
Peer c   Peer a, b        protein        Protein
                          number         SeqID
                          seqlength      Length
                          sequence       Protein
                          number         SeqID
                          seq            Sequence
Fig. 3. Local schema mapping

As mentioned, the classification structure is maintained by super peers, which can be organized with existing overlays, e.g. BATON [12]. Each super peer maintains a subset of the categories, where each category is associated with its own classification rules (including prior-knowledge for attribute categories, which will be introduced in Section 3.3) and the physical addresses of the peers that have classified some relations into it. The categories on super peers are indexed with BATON's distributed index facility, so that the classification hierarchy can be traversed from any of the super peers. The local schema mappings are maintained not by super peers but by normal peers. Each peer maintains its local schema mapping with its matched categories, together with the identifiers of the relevant peers that have classified schemas into the same categories. In our running example, category protein has three peers, a, b, and c, that have classified their schemas into it. Correspondingly, each of peers a, b and c maintains its local schema mappings with protein, as shown in Figure 3.

Initially, the hierarchical classification scheme can be extracted from existing classifications using special-purpose languages and tools, or it can be constructed from scratch. If there are sample schemas in the PDMS, we can classify them with many existing methods, such as the naive Bayes classifier [5], C4.5 [21], RIPPER [3], and Support Vector Machines [24].
On the other hand, if there are no sample schemas, we can construct the classification from scratch: once a new schema matches certain categories in the existing hierarchical classification scheme, it is classified into these categories; otherwise, it is inserted into the scheme as a new category (which may be constructed as a parent category of some existing categories).

In our approach, schema mapping based on classification operates in two phases: 1) relation mapping (Section 3.2) and 2) attribute mapping (Section 3.3), presented in the following subsections.

3.2 Relation Mapping

We create relation mappings between relevant peers by classifying their relations into the most relevant categories through query probing. Probe-based methods have been used for mining hidden-web data [7, 11, 25], a problem orthogonal to the schema mapping studied here. Since the classification rules capture the characteristics of the various categories, we generate query probes from these rules and use them to differentiate relations. Basically, if a query probe returns the expected results from a relation, the relation is related to the category associated with the probe and is subsequently classified into that category.
We now describe the class of rule-based classifiers and show how a rule-based classifier can be used to generate a set of query probes that help us estimate the number of results for each category of interest in a relation. In a rule-based classifier, the classification decisions are based on a set of logical classification rules $\kappa_i \rightarrow C_i$, where the antecedents of the rules are conjunctions of words and the consequents are the category assignments. For example, the following classification rules are part of a classifier for the categories Java book and protein, respectively:

Java AND book $\rightarrow$ Java book;  %a%c%g%t% $\rightarrow$ protein

Such rules can be used to classify previously unseen relations. For example, the first rule classifies a relation containing the keywords Java and book into the category Java book. The second classifies a relation containing the keywords %a%c%g%t% into the category protein. We can simulate the behavior of a rule-based classifier over all categories of the classification scheme by mapping each rule $\kappa_i \rightarrow C_i$ of the classifier into a boolean query $q_i$ that is the conjunction of all keywords appearing in $\kappa_i$. Thus, if we send the query probe $q_i$ to a new relation $R$, the query matches exactly the $f(q_i)$ results in $R$ that the associated rule would have classified into category $C_i$. Instead of retrieving the concrete results, we only need to keep the number of matches reported for each query probe, and we use this number as the measure of whether the probed relation satisfies the corresponding classification rule. Having the result for each query probe, we can construct a good approximation of the weight and confidence vectors for a relation $R$. We approximate the number of results of $R$ in category $C_i$ as the total number of matches from all query probes derived from rules with category $C_i$.
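The probing step described above can be sketched as follows, under stated assumptions: a relation is modelled as an in-memory list of tuples rather than a remote peer, a keyword is matched as a case-insensitive substring of the concatenated record, and `%` is treated as the SQL LIKE wildcard. The helper names `like_to_regex`, `make_probe` and `probe_counts` are hypothetical, and the sample records are invented for illustration.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern that uses only '%' wildcards into a regex."""
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.IGNORECASE | re.DOTALL)


def make_probe(keywords):
    """Build a boolean probe: the conjunction of all keywords of one rule."""
    regexes = [like_to_regex(k if "%" in k else "%" + k + "%") for k in keywords]

    def probe(record):
        text = " ".join(str(value) for value in record)
        return all(regex.search(text) for regex in regexes)

    return probe


def probe_counts(rules, relation):
    """Count matches per category; only the counts are kept, not the records."""
    counts = {}
    for keywords, category in rules:
        probe = make_probe(keywords)
        matches = sum(1 for record in relation if probe(record))
        counts[category] = counts.get(category, 0) + matches
    return counts


# The two example rules from the text: antecedent keywords -> category.
rules = [
    (["Java", "book"], "Java book"),
    (["%a%c%g%t%"], "protein"),
]

# Invented sample relation (not from the paper's datasets).
relation = [
    ("Thinking in Java", "book"),
    ("Java in a Nutshell", "book"),
    ("P12345", "MACGTKL"),
]

print(probe_counts(rules, relation))
```

Summing the counts of all rules that share a consequent gives the per-category match total that the text uses as the approximate weight.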
Using this information, we generate the approximated weight and confidence vectors for $R$, with which we decide how to classify $R$ into one or more categories of the classification scheme.

Weight vector: Consider a relation $R$ and a hierarchical classification scheme $C = \{C_1, C_2, \ldots, C_n\}$, where each category $C_i \in C$ is associated with a classification rule $\kappa_i \rightarrow C_i$. Let $f(R, \kappa_i)$ denote the number of results obtained when using $\kappa_i$ to probe $R$. The weight of relation $R$ for $C_i$, $W(R; C_i) = f(R, \kappa_i)$, is the number of answers in $R$ on category $C_i$.

Confidence vector: In the same setting as the weight vector, the estimated confidence of $R$ for $C_i$, $S(R; C_i)$, is

$$S(R; C_i) = S(R; \mathit{Parent}(C_i)) \cdot \frac{W(R; C_i)}{\sum_{C_j \text{ is a child of } \mathit{Parent}(C_i)} W(R; C_j)}.$$

As a special case, $S(R; \mathit{root}) = 1$.

$W(R; C_i)$ measures the absolute amount of results that relation $R$ contains about category $C_i$, while $S(R; C_i)$ measures the relative amount of results that $R$ contains about $C_i$. A weight-based classification would classify a relation into a category when the relation has a substantial number of results in that category. Alternatively, a confidence-based classification would classify a relation into a category when a significant fraction of the results it contains are of that category. In general, however, we are interested in balancing both weight and confidence, using two associated thresholds $\tau_w$ and $\tau_c$, respectively. Formally, we classify $R$ into categories using the classification criterion described in the following.
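Before the criterion is stated, the weight and confidence definitions above can be illustrated with a short sketch. The tree is given as a hypothetical child-to-parent dictionary, the weights are invented probe counts, and returning a confidence of 0 when all sibling weights are 0 is an assumption of this sketch, since the text does not cover that case.

```python
def confidence(weights, parent, category):
    """S(R; C_i) = S(R; Parent(C_i)) * W(R; C_i) / sum of W(R; C_j) over
    all children C_j of Parent(C_i), with S(R; root) = 1."""
    parent_name = parent[category]
    if parent_name is None:  # the root category
        return 1.0
    siblings = [c for c, p in parent.items() if p == parent_name]
    total = sum(weights.get(c, 0) for c in siblings)
    if total == 0:  # assumption: no matches under this parent means confidence 0
        return 0.0
    share = weights.get(category, 0) / total
    return confidence(weights, parent, parent_name) * share


def meets_thresholds(weights, parent, category, tau_w, tau_c):
    """Check both the weight threshold and the confidence threshold."""
    return (weights.get(category, 0) >= tau_w
            and confidence(weights, parent, category) >= tau_c)


# Child -> parent edges of the running example's relation categories.
parent = {
    "root": None,
    "Protein": "root", "Programme": "root",
    "Java": "Programme", "C/C++": "Programme", "Delphi": "Programme",
}

# Invented probe counts W(R; C_i) for one relation R.
weights = {"Protein": 20, "Programme": 180, "Java": 150, "C/C++": 30, "Delphi": 0}

for name in ("Programme", "Java", "C/C++"):
    print(name, round(confidence(weights, parent, name), 3))
print(meets_thresholds(weights, parent, "Java", tau_w=10, tau_c=0.5))
```

With these numbers Programme receives 180/200 of the root's confidence and Java receives 150/180 of Programme's, so a relation dominated by Java records passes both thresholds while C/C++ fails the confidence test.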
Classification Criterion: Consider a classification scheme $C$ with categories $\{C_1, C_2, \ldots, C_n\}$ and a relation $R$. $R$ is classified into category $C_i$ if it satisfies all the following conditions:
- $W(R; C_i) \geq \tau_w$ and $S(R; C_i) \geq \tau_c$;
- $W(R; C_j) \geq \tau_w$ and $S(R; C_j) \geq \tau_c$ for any ancestor $C_j$ of $C_i$;
- $W(R; C_k) < \tau_w$ or $S(R; C_k) < \tau_c$ for any child $C_k$ of $C_i$,
where $0 \leq \tau_c < 1$ and $\tau_w \geq 1$ are the given thresholds.

With our hierarchical classification scheme, we classify relations in a top-down way. A new relation is first classified by the root-level classifier and then recursively pushed down to the lower-level classifiers. A relation $R$ is pushed down to category $C_j$ when $W(R; C_j)$ and $S(R; C_j)$ are no less than the thresholds $\tau_w$ (for weight) and $\tau_c$ (for confidence), respectively. If a category $C_k$ does not match $R$, we can prune the whole subtree rooted at $C_k$. The final set of categories into which we classify $R$ is the approximate set of categories of $R$ in $C$.

The probe-based method relies on category classifiers to define query probes and obtain category match information for a relation. Unfortunately, classifiers are not always perfect: sometimes they classify relations into incorrect categories, and they leave unclassified those relations that do not match any rule. Here, we present a novel method for adjusting the initial probing results in order to avoid such potential errors. It is common practice in the machine learning community to report classification results using a confusion matrix [16]. We adapt this notion for use in our probing scenario.

Confusion Matrix: Consider a classification scheme with categories $\{C_1, C_2, \ldots, C_n\}$, where for each category $C_i$ there is a relevant relation $R_i$ mapped to it. The confusion matrix $M = (m_{ij})$ is an $n \times n$ matrix, where $m_{ij}$ is the number of matches generated from $R_j$ for the query probe of category $C_i$, divided by the number of results in $R_j$.
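As a hedged sketch of this definition, the code below builds $M$ by normalizing each column of raw match counts by the size of the corresponding relation, and then shows how such a matrix could be used to correct observed probe counts by solving $M \cdot EW = W$ for the exact weights $EW$, which is the adjustment this section goes on to motivate. All numbers are invented (they are not the paper's Example 1), the function names are hypothetical, and exact fractions are used only to keep the arithmetic checkable.

```python
from fractions import Fraction


def confusion_matrix(match_counts, relation_sizes):
    """match_counts[i][j] is the number of matches from R_j for the probe of
    C_i; column j is divided by the number of results in R_j."""
    n = len(relation_sizes)
    return [[Fraction(match_counts[i][j], relation_sizes[j]) for j in range(n)]
            for i in range(n)]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination.
    Assumes the matrix is square and invertible."""
    n = len(rhs)
    aug = [list(row) + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [value / pivot for value in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Invented training data: three relations with 100, 200 and 50 records.
sizes = [100, 200, 50]
counts = [
    [90, 10, 0],   # matches of the C_1 probe in R_1, R_2, R_3
    [5, 180, 5],   # matches of the C_2 probe
    [0, 0, 40],    # matches of the C_3 probe
]
M = confusion_matrix(counts, sizes)

# Observed probe counts W for a relation whose exact weights are unknown.
observed = [100, 190, 40]
exact = solve(M, observed)
print([int(x) for x in exact])
```

Solving the linear system gives the same result as multiplying by the inverse of $M$ without forming the inverse explicitly; the sketch fails if $M$ is singular, which the text does not discuss.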
In a perfect setting, the probes for $C_i$ match only results in $R_i$, and each result in $R_i$ matches exactly one probe for $C_i$. In this case the confusion matrix is the identity matrix. The confusion matrix is created as follows:
i) Generate the query probes from the classification rules of the categories, and probe the relations associated with the categories in the classification scheme.
ii) Create an auxiliary confusion matrix $X = (x_{ij})$ and set $x_{ij}$ to the number of matches from $R_j$ for the query probe of category $C_i$.
iii) Normalize the columns of $X$ by dividing column $j$ by the number of results in $R_j$. The result is the confusion matrix $M$.

Example 1. Suppose that we have a classifier for three categories $C_1$ = C/C++, $C_2$ = JAVA, $C_3$ = Delphi, and that there are three relations $R_1$, $R_2$, $R_3$ with 2000, 1500, 0 records for C/C++, JAVA, Delphi, respectively. After probing these three relations with the three query probes generated from the classification rules, we construct the following confusion matrix. Element m 13 = 2000 means that it misclassifies records of R 1 into R 3.

M = =

Interestingly, multiplying the confusion matrix by the weight vector that holds the exact number of results for each category yields the weight vector with the number of results in each category as matched by the query
probes. For instance, in Example 1, there are exactly 2000 results for $C_1$, 1500 results for $C_2$ and 0 results for $C_3$, and the probe results are 1830, 1420, 0. We can infer the exact weight vector, $EW$, from the probe results and the matrix $M$, where $EW(C) = M^{-1} W(C)$. Hence, when classifying a relation, we multiply $M^{-1}$ by $W(C)$ to obtain a better approximation of the weight vector.

3.3 Attribute Mapping

After classifying a relation into certain relation categories, we have to classify its attributes into the associated attribute categories, which can be done in the same way as the relation mapping described in Section 3.2. In addition, since attributes have their own characteristics, we introduce in this section some techniques that improve the accuracy of attribute mapping.

Each attribute of a relation is associated with a particular type, such as string, number, or date, and different types capture different characteristics. Moreover, an attribute may be restricted to a certain domain: for example, attribute Age (the age of a human being) is a number between 0 and 150, since no person is older than 150 or younger than 0. Therefore, we introduce prior-knowledge for attribute mapping.

Prior-knowledge: Consider a relation category $C$ with attribute categories $\{L_1, L_2, \ldots, L_p\}$. Each attribute category $L_i$ must satisfy the prior-knowledge $\chi_i$, represented as $L_i = \chi_i$. $\chi_i$ can be generated manually or automatically; we generate it automatically using machine learning techniques. Any attribute that does not satisfy $\chi_i$ cannot be mapped to $L_i$. Therefore, to probe an unknown attribute $A_j$, we can generate a query probe that selects values violating $\chi_i$.
If this query probe returns any results, the attribute category $L_i$ clearly cannot map to $A_j$; otherwise, we probe $A_j$ with the query probe generated from the classification rules of $L_i$, and we use the count of the probing results to represent the correlation between the two attributes. In this way, prior-knowledge lets us create attribute mappings more accurately.

Example 2. Consider category Person(ID, Name, Age, Sex) with the prior-knowledge: 1) ID = Number(0,00); 2) Name = String; 3) Age = Number[0, 150]; 4) Sex = {one of two values (e.g. male; female)}. A relation People has been mapped to the relation category Person. We now consider how to create the attribute mapping between them. Since Sex has only two values, we probe each attribute of People with the query probe generated from the prior-knowledge of Sex: Select count(distinct probe-attribute) from People. If the result of this query probe is larger than two, we can be sure that this probe-attribute cannot be mapped to Sex. Likewise, we probe each attribute of People with the query probe generated from the prior-knowledge of Age: Select probe-attribute from People where probe-attribute>150 or probe-attribute<0. If the probe query returns a non-empty result, we can be sure that probe-attribute cannot match Age.

Formally, we introduce the correlative matrix to create attribute mappings.

Correlative Matrix: Consider category $C$ with attribute subcategories $\{L_1, L_2, \ldots, L_p\}$, where each $L_i$ is associated with a prior-knowledge $\chi_i$, and a relation $R$ with attributes $\{A_1, A_2, \ldots, A_q\}$ that has been mapped to $C$ by relation mapping. The correlative matrix $Corr(C, R) = \{m_{ij}\}$ is a $p \times q$ matrix, where $m_{ij} = 0$ if $A_j$ does not satisfy $\chi_i$; otherwise, $m_{ij}$ is the number of results obtained when probing $A_j$ with the query probe generated from the classification rule of $L_i$.
Relative Correlative Matrix: In the same setting as the correlative matrix, the relative correlative matrix $RCorr(C, R) = \{r_{ij}\}$ is a $p \times q$ matrix with

$$r_{ij} = \frac{m_{ij}}{\sum_{k=1}^{p} m_{kj}}.$$
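The two matrices can be sketched as follows, assuming the attribute values are available as in-memory columns. Each attribute category is described by a prior-knowledge check on a whole column and a per-value rule; the Age and Sex prior-knowledge follow Example 2, while the per-value rules, the sample data, and the function names are hypothetical. Setting $r_{ij}$ to 0 when a column has no matches at all is also an assumption of this sketch.

```python
def correlative_matrix(columns, categories):
    """Corr: m_ij = 0 if attribute A_j violates the prior-knowledge of L_i,
    otherwise the number of values of A_j matched by the rule of L_i."""
    corr = {}
    for name, prior_ok, rule in categories:
        corr[name] = {}
        for attr, values in columns.items():
            if not prior_ok(values):
                corr[name][attr] = 0
            else:
                corr[name][attr] = sum(1 for v in values if rule(v))
    return corr


def relative_correlative_matrix(corr):
    """RCorr: each entry divided by the sum of its column over all categories."""
    totals = {}
    for row in corr.values():
        for attr, m in row.items():
            totals[attr] = totals.get(attr, 0) + m
    return {name: {attr: (m / totals[attr] if totals[attr] else 0.0)
                   for attr, m in row.items()}
            for name, row in corr.items()}


# (category, prior-knowledge on the column, hypothetical classification rule)
categories = [
    ("Name",
     lambda vs: all(isinstance(v, str) for v in vs),
     lambda v: isinstance(v, str) and v[:1].isupper()),
    ("Age",
     lambda vs: all(isinstance(v, (int, float)) and 0 <= v <= 150 for v in vs),
     lambda v: isinstance(v, int)),
    ("Sex",
     lambda vs: len(set(vs)) <= 2,
     lambda v: v in ("male", "female")),
]

# Invented sample of relation People, stored column by column.
columns = {
    "p_name": ["Ann", "Bob", "Cid"],
    "p_age": [31, 45, 27],
    "gender": ["female", "male", "male"],
}

corr = correlative_matrix(columns, categories)
rcorr = relative_correlative_matrix(corr)
print(corr["Sex"])
print(rcorr["Age"])
```

In this toy data the prior-knowledge alone rules out every wrong pairing except Name against gender, which the hypothetical Name rule then scores at 0, so each column ends up with a single non-zero entry.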
Attribute Mapping Criterion: Consider category $C$ with attribute subcategories $\{L_1, L_2, \ldots, L_p\}$ and relation $R$ with attributes $\{A_1, A_2, \ldots, A_q\}$. $L_i$ maps to $A_j$ if $Corr(C, R) = \{m_{ij}\}$ and $RCorr(C, R) = \{r_{ij}\}$ satisfy $r_{ij} \geq \tau_c$ and $m_{ij} \geq \tau_w$, where $\tau_w$ and $\tau_c$ are the thresholds of weight and confidence, respectively.

Example 3. Consider category C = Person with attribute subcategories ID, Name, Age, Sex, and relation R = people(p_id, p_name, p_age, gender). R has been mapped to C by relation mapping, and we demonstrate how to create the attribute mapping between C and R. We probe each attribute of R using the prior-knowledge of the attribute subcategories of C (the prior-knowledge of Example 2) and obtain the correlative and relative correlative matrices. With the help of prior-knowledge, each attribute of R maps exactly to the corresponding attribute of C.

[Corr(C, R) and RCorr(C, R): matrices with rows ID, Name, Age, Sex and columns p_id, p_name, p_age, gender]

4 Query Reformulation

With the created schema mapping, we can reformulate a query issued to a peer over its peer schema into queries over the peer schemas of its relevant peers, so that they can understand and answer it. We first define the standard form query and the local form query in our system.

Standard form query: A standard form query is a query composed of relations and attributes of the relation categories and attribute categories in the hierarchical classification scheme.

Local form query: A local form query of peer P is a query composed of the relations and attributes of P's local peer schema.

Query reformulation with our method operates in three phases, which are described in the following subsections.

4.1 Standardization

In the standardization phase, the peer needs to transform the received query into a standard form query, which is expressed in terms of relation categories and their corresponding attribute categories.
Suppose the issued query is represented as a triple $Q = \langle R, A, C \rangle$, where $R$ is a relation name, $A$ is the attribute set $\{A_1, A_2, \ldots, A_p\}$, and $C$ is the condition set. (If the query contains more than one relation, we can decompose it into multiple single-relation queries and then integrate them.) We first find, through the local schema mapping, all the categories $\{R_1, R_2, \ldots, R_n\}$ such that each $R_i$ is mapped to $R$. Then we look at the attribute subcategories of $R_i$, $N_i = \{N_{i1}, N_{i2}, \ldots, N_{ip}\}$, where $N_{ik}$ is mapped to $A_k$. Let $\mathit{relevantPeers}(R_i)$ and $\mathit{relevantPeers}(N_{ik})$ denote the sets of peers that have classified some relations into $R_i$ and some attributes into $N_{ik}$, respectively. If

$$P_i = \mathit{relevantPeers}(R_i) \cap \Big( \bigcap_{k=1}^{p} \mathit{relevantPeers}(N_{ik}) \Big) \neq \Phi,$$

then $R_i$ has a schema mapping with $R$, and we can reformulate $Q$ into $Q_i$ by replacing $R$ with $R_i$ and $A_k$ with $N_{ik}$, and send the standard form query $Q_i$ to the peers in $P_i$. In addition, we obtain the set of all the peers that have relations mapped to $R$, $P = \bigcup_{i=1}^{n} P_i$.

4.2 Localization

When the relevant peers receive the standard form query from the query initiator, they need to reformulate it into a local form query over their own peer schemas in order to execute it. The process of transforming a standard form query into a local form query is similar to the one described in Section 4.1. We also consider
the standard form query as a triple $Q = \langle R, A, C \rangle$. We first find the set $S$ of local relations $S_i$ that map to $R$. If $S_i$ contains all the attributes in $A$, we rewrite $Q$ by replacing $R$ with $S_i$ and $A$ with the corresponding attributes of $S_i$. In some cases, the local peer cannot reformulate the standard form query $Q$ into a local form query over a single relation, because it needs to join several relations to answer $Q$. For example, if $S_i \in S$ and there is an attribute $A_k \in A$ that is not an attribute of $S_i$, there must be some $S_j \in S$ that has the attribute $A_k$. If $A \subseteq S_i \cup S_j$, we can answer $Q$ by joining $S_i$ and $S_j$; otherwise, we need to find further relation(s) to join with $S_i$ and $S_j$ in order to answer $Q$. After a relevant peer answers the reformulated local form query, it returns the results labeled with the attributes in $A$, so that the query initiator can recognize them.

4.3 Integration

When it receives the answers from the relevant peers, the query initiator transforms those answers, which are expressed in the attributes of the standard form query, into answers expressed in its local attributes, and it integrates the results to return them to the user. Suppose the issued query is the triple $Q = \langle R, A, C \rangle$, and its corresponding reformulated standard form queries are $Q_1 \langle R_1, A_1, C_1 \rangle$; $Q_2 \langle R_2, A_2, C_2 \rangle$; ...; $Q_n \langle R_n, A_n, C_n \rangle$. The mapping from the issued query to the standard form queries is one-to-many, but the reverse mapping is one-to-one. Therefore, transforming the attributes in the answers is much easier than query reformulation. Suppose a relevant peer returns its results for $Q_i \langle R_i, A_i, C_i \rangle$. Since the querying peer already knows the mapping from $Q$ to $Q_i$ from the standardization phase, it can simply transform the attributes in the answers by replacing the attributes in $A_i$ with the corresponding attributes in $A$. Finally, it integrates the results from all relevant peers and returns them to the user.
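The three phases can be sketched for the simplest case, a single-relation query answered by a single local relation, using the mappings of Figure 3. The `Query` and `Mapping` classes, the function names, the sample condition, and the returned row are all hypothetical; joins across several local relations and the computation of the relevant peer sets are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Query:
    """The triple <R, A, C>: relation, attributes, conditions."""
    relation: str
    attributes: Tuple[str, ...]
    conditions: Tuple[Tuple[str, str, object], ...]  # (attribute, operator, value)


@dataclass
class Mapping:
    """A peer's local schema mapping for one relation."""
    local_relation: str
    category: str
    attr_to_category: Dict[str, str]  # local attribute -> attribute category


def rename(query, relation, attr_map):
    """Replace the relation name and every attribute according to attr_map."""
    return Query(
        relation,
        tuple(attr_map[a] for a in query.attributes),
        tuple((attr_map[a], op, value) for a, op, value in query.conditions),
    )


def standardize(query, mapping):
    """Local form -> standard form (Section 4.1)."""
    return rename(query, mapping.category, mapping.attr_to_category)


def localize(std_query, mapping):
    """Standard form -> local form of the receiving peer (Section 4.2)."""
    to_local = {cat: loc for loc, cat in mapping.attr_to_category.items()}
    return rename(std_query, mapping.local_relation, to_local)


def integrate(answers, mapping):
    """Rename standard-form attributes in all returned rows back to the
    initiator's local attributes and merge them (Section 4.3)."""
    to_local = {cat: loc for loc, cat in mapping.attr_to_category.items()}
    return [{to_local[k]: v for k, v in row.items()}
            for rows in answers for row in rows]


# Mappings of peers a and b from Figure 3.
peer_a = Mapping("Kinases", "Protein",
                 {"ID": "SeqID", "len": "Length", "seq": "Sequence"})
peer_b = Mapping("annexin", "Protein",
                 {"identifier": "SeqID", "length": "Length", "seqs": "Sequence"})

q = Query("Kinases", ("ID", "seq"), (("len", ">", 300),))
q_std = standardize(q, peer_a)
q_b = localize(q_std, peer_b)
print(q_std)
print(q_b)

# Invented answer returned by peer b, labeled with standard-form attributes.
print(integrate([[{"SeqID": "B7", "Sequence": "MKV"}]], peer_a))
```

Because each peer only ever translates between its own schema and the category vocabulary, peers a and b never need to know each other's attribute names.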
5 Experimental Study

In this section, we report a performance study that evaluates our schema mapping method. The proposed method was implemented in Java. We used the Amalgam schema and data integration test suite [17] and the THALIA benchmark [10] as our experimental data sources 1. We evaluate our method from two aspects. First, we study the effectiveness of our schema mapping strategy for matching two schemas. Second, we look at the performance of schema mapping and query processing in a real P2P network setting.

1 There are 28 databases and 35 tables in THALIA. We transform THALIA data into 35 relation tables, denoted as S 5, and we also create schema mapping between them.

5.1 Mapping between two Schemas

We first evaluate the quality of the mappings that our method obtains between two schemas. Given two schemas, we first classify the relations and attributes of one schema and obtain the corresponding classification rules; we then create the schema mapping between them by classifying the relations and attributes of the other schema with our probing strategy. We use precision and recall to evaluate the quality of the mappings obtained. Precision is the ratio of the number of correct relation mappings obtained to the total number of relation mappings obtained (a relation mapping is correct when both the relations and their attributes are mapped correctly). Recall is the ratio of the number of correct relation mappings obtained to the total number of correct relation mappings. Given two schemas $S$ and $T$, we denote the precision of probing $T$ with $S$ as $P_{ST}$, that is, $S$ is the schema classified first. Similarly, $P_{TS}$ is the precision of probing $S$ with $T$. In addition, we define

$$F_{ST} = F_{TS} = \frac{2 \, P_{ST} \, P_{TS}}{P_{ST} + P_{TS}}.$$

These three metrics are used to evaluate the precision of schema mappings between $S$ and $T$. In the same way, we define the corresponding metrics for recall as $R_{ST}$, $R_{TS}$ and

$$F'_{ST} = F'_{TS} = \frac{2 \, R_{ST} \, R_{TS}}{R_{ST} + R_{TS}}.$$

[Figure 4: four panels plotting matching precision (%) (P, P', F) and matching recall (%) (R, R', F') for the schema pairs S 1 S 2, S 1 S 3, S 1 S 4, S 2 S 3, S 2 S 4, S 3 S 4: precision without prior-knowledge, recall without prior-knowledge, precision with prior-knowledge, recall with prior-knowledge]
Fig. 4. Mapping between two schemas

Our method is evaluated in two cases. First, we create schema mappings without prior-knowledge. Second, we create schema mappings with prior-knowledge. Figure 4 shows the experimental results of matching 6 pairs of schemas from S 1 to S 4, where P, P', R, R', F and F' denote $P_{ST}$, $P_{TS}$, $R_{ST}$, $R_{TS}$, $F_{ST}$ and $F'_{ST}$, respectively. The experimental results show that our method achieves high precision and recall. When there is no prior-knowledge, the precision is about 55-75%, and the recall is about -%. Given some prior-knowledge, the accuracy of schema mapping improves dramatically. The precision reaches 75-%, and the recall increases to 85-%. Also, we can see that $P_{ST}$ and $P_{TS}$ do not differ considerably, which shows the stability of our method.

5.2 Mapping in PDMS

In this section, we evaluate our method in a PDMS and compare its performance with PeerDB [19]. The experimental environment consists of 32 PCs (8 of which are super peers), each with an Intel Pentium 2.4 GHz processor and 512 MB of RAM. All the PCs run the Windows XP operating system. We classify some sample schemas into categories using Bayes classifiers [5], and the eight super peers maintain these categories.
Each normal peer shares its peer schema and joins one randomly chosen super peer by classifying its relations into certain categories.

[Fig. 5. Schema mapping in PDMS. Panels: matching precision (%) and matching recall (%); x-axis: schemas $S_1$ to $S_5$; series: PeerDB (# of keywords = 2), PeerDB (# of keywords = 5), no prior-knowledge, with prior-knowledge.]
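As a reading aid, the mapping-quality metrics used in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the tuple representation of a relation mapping, the function names (`relation_mapping`, `precision_recall`, `harmonic_mean`) and the toy schemas are all hypothetical. Following the definition in Section 5.1, a relation mapping counts as correct only if the relation pair and its full set of attribute pairs both match the reference mapping. Mappings from both probing directions are assumed to be normalized to the same (S relation, T relation) orientation.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Hypothetical representation of one relation mapping:
# (relation in S, relation in T, set of (attribute in S, attribute in T) pairs).
RelationMapping = Tuple[str, str, FrozenSet[Tuple[str, str]]]


def relation_mapping(src: str, tgt: str,
                     attrs: Iterable[Tuple[str, str]]) -> RelationMapping:
    """Build a hashable relation mapping so mappings can be stored in sets."""
    return (src, tgt, frozenset(attrs))


def precision_recall(obtained: Set[RelationMapping],
                     correct: Set[RelationMapping]) -> Tuple[float, float]:
    """Precision = correct obtained / total obtained;
    recall = correct obtained / total correct."""
    hits = len(obtained & correct)
    precision = hits / len(obtained) if obtained else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


def harmonic_mean(a: float, b: float) -> float:
    """Combine the two probing directions, as in F_ST = 2ab / (a + b)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


# Toy reference mapping between two made-up schemas S and T.
reference = {
    relation_mapping("author", "writer", [("name", "full_name")]),
    relation_mapping("paper", "article", [("title", "title"), ("year", "pub_year")]),
    relation_mapping("venue", "conference", [("name", "conf_name")]),
}

# Mappings obtained when probing T with S (two of four are fully correct).
obtained_st = {
    relation_mapping("author", "writer", [("name", "full_name")]),
    relation_mapping("venue", "conference", [("name", "conf_name")]),
    # Right relations, wrong attribute pair: counted as incorrect.
    relation_mapping("paper", "article", [("title", "title"), ("year", "pages")]),
    relation_mapping("paper", "review", [("title", "title")]),
}

# Mappings obtained when probing S with T (both are correct).
obtained_ts = {
    relation_mapping("author", "writer", [("name", "full_name")]),
    relation_mapping("paper", "article", [("title", "title"), ("year", "pub_year")]),
}

p_st, r_st = precision_recall(obtained_st, reference)
p_ts, r_ts = precision_recall(obtained_ts, reference)
f_precision = harmonic_mean(p_st, p_ts)
f_recall = harmonic_mean(r_st, r_ts)
```

Storing the attribute pairs in a `frozenset` means that a relation mapping with any wrong attribute pair compares unequal to the reference entry, so it is counted as incorrect without extra logic.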
In this experiment, we first evaluate the quality of the schema mappings generated with the two approaches; then we compare the effectiveness of query processing in the PDMS with the two methods.

Quality of schema mapping: As in Section 5.1, we use precision and recall to evaluate the quality of schema mappings for each of the schemas from $S_1$ to $S_5$. Figure 5 shows the experimental results of our method (with and without prior-knowledge) compared with those of PeerDB (with 2 keywords annotated for each relation and with 5 keywords annotated for each relation). Not surprisingly, in the PDMS our schema mapping method with prior-knowledge is more effective than that without prior-knowledge. The precision and recall with prior-knowledge are larger than % for most schemas. Our method is superior to the PeerDB approach for most schemas (except $S_3$). Generally, the precision and recall of our method exceed those of PeerDB by 10% to 20%. Moreover, PeerDB depends on the keywords annotated to a schema, which must be generated manually. Annotating more keywords to a schema could improve the recall, but degrades the precision. The experimental results show that our method has good schema mapping performance in a PDMS whenever the schemas have overlapping instances.

Effectiveness of query processing: With the created schema mappings, we evaluate the effectiveness of query processing of the two approaches. We again use the notions of precision and recall for our evaluation. Here precision is defined as the ratio of the number of correct returned answers to the total number of returned answers, and recall is the ratio of the number of correct returned answers to the total number of correct answers. We generate six queries to evaluate the two methods, of which four are based on the Amalgam schemas and two are based on the THALIA schemas. Two of the queries contain join operations. Figure 6 shows the experimental results.
Again, we can see that our method is more effective than the PeerDB approach.

[Fig. 6. Query processing in PDMS. Panels: query processing precision (%) and query processing recall (%); x-axis: queries $Q_1$ to $Q_6$; series: PeerDB (# of keywords = 2), PeerDB (# of keywords = 5), no prior-knowledge, with prior-knowledge.]

6 Conclusion

In this paper, we propose a method for effective schema mapping based on classification and probing in a PDMS. We classify each peer schema into certain categories through probing, and the relations in the same category can be mapped to each other. We enhance the classification-based mapping by applying a confusion matrix and prior-knowledge. We also present a strategy for reformulating a query over a local peer schema into queries on the various relevant peer schemas for effective query answering. Our experimental results show that our method achieves high accuracy for schema mapping on real datasets.

Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grant No , the National Grand Fundamental Research 973 Program of China under Grant No.2006CB303103, the National High Technology Development 863 Program of China under Grant No.2006AA01A101, Tsinghua Basic Research Foundation under Grant No. JCqn , and Zhejiang Natural Science Foundation under Grant No. Y

References

1. K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. A framework for semantic gossiping. SIGMOD Record, 31(4).
2. M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The Hyperion project: from data integration to data coordination. SIGMOD Record, 32(3):53-58.
3. W. W. Cohen. Learning trees and rules with set-valued features. In AAAI, pages 9 716.
4. R. Dhamankar, Y. Lee, A. Doan, et al. iMAP: Discovering complex semantic matches between database schemas. In SIGMOD.
5. R. O. Duda and P. E. Hart. Pattern classification and scene analysis. Wiley.
6. E. Franconi, G. Kuper, A. Lopatenko, and I. Zaihrayeu. Queries and updates in the coDB peer to peer database. In VLDB.
7. L. Gravano, P. G. Ipeirotis, and M. Sahami. QProber: A system for automatic classification of hidden-web databases. 21(1):1-41.
8. S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What can databases do for peer-to-peer. In WebDB.
9. A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema mediation in peer data management systems. In ICDE.
10. J. Hammer, M. Stonebraker, and O. Topsakal. THALIA: Test harness for the assessment of legacy information integration approaches. In ICDE.
11. P. G. Ipeirotis, L. Gravano, and M. Sahami. Probe, count, and classify: Categorizing hidden-web databases. Pages 61-78.
12. H. V. Jagadish, B. C. Ooi, and Q. H. Vu. BATON: A balanced tree structure for peer-to-peer networks. In VLDB.
13. P. Kalnis, W. S. Ng, B. C. Ooi, D. Papadias, and K. L. Tan. An adaptive peer-to-peer network for distributed caching of OLAP results. In SIGMOD.
14. J. Kang and J. Naughton. On schema matching with opaque column names and data values. In SIGMOD.
15. A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping data in peer-to-peer systems: Semantics and algorithmic issues. In SIGMOD.
16. R. Kohavi and F. Provost. Glossary of terms. 30(2/3).
17. R. J. Miller, D. Fisla, M. Huang, D. Kymlicka, F. Ku, and V. Lee. Amalgam schema and data integration test suite. miller/amalgam.
18. W. S. Ng, B. C. Ooi, and K. L. Tan. BestPeer: A self-configurable peer-to-peer system. In ICDE.
19. W. S. Ng, B. C. Ooi, K.-L. Tan, and A. Zhou. PeerDB: A P2P-based system for distributed data sharing. In ICDE.
20. B. C. Ooi, Y. Shu, and K.-L. Tan. Relational data sharing in peer-based data management systems. SIGMOD Record, 32(3):59-64.
21. J. R. Quinlan. C4.5: Programs for machine learning. Morgan Kaufmann Publishers, Inc.
22. P. Rodriguez-Gianolli, M. Garzetti, L. Jiang, et al. Data sharing in the Hyperion peer database system. In VLDB.
23. I. Tatarinov, Z. Ives, J. Madhavan, and A. H. et al. The Piazza peer data management project. SIGMOD Record, 32(3):47-52.
24. V. N. Vapnik. Statistical learning theory. Wiley-Interscience.
25. J. Wang, J.-R. Wen, F. H. Lochovsky, and W.-Y. Ma. Instance-based schema matching for web databases by domain-specific query probing. In VLDB.
26. B. Yang and H. Garcia-Molina. Comparing hybrid peer-to-peer systems. In VLDB.
27. C. Yu and L. Popa. Constraint-based XML query rewriting for data integration. In SIGMOD, 2004.
Teiid Designer User Guide 1 7.5.0 1. Introduction... 1 1.1. What is Teiid Designer?... 1 1.2. Why Use Teiid Designer?... 2 1.3. Metadata Overview... 2 1.3.1. What is Metadata... 2 1.3.2. Editing Metadata
More informationDetect tracking behavior among trajectory data
Detect tracking behavior among trajectory data Jianqiu Xu, Jiangang Zhou Nanjing University of Aeronautics and Astronautics, China, jianqiu@nuaa.edu.cn, jiangangzhou@nuaa.edu.cn Abstract. Due to the continuing
More informationAn Approach for Privacy Preserving in Association Rule Mining Using Data Restriction
International Journal of Engineering Science Invention Volume 2 Issue 1 January. 2013 An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction Janakiramaiah Bonam 1, Dr.RamaMohan
More informationA New Technique to Optimize User s Browsing Session using Data Mining
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
More informationSemi-Supervised Clustering with Partial Background Information
Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject
More information