Prudent Schema Matching For Web Forms

Thanh Nguyen, Hoa Nguyen, Juliana Freire
School of Computing, University of Utah

ABSTRACT

There is an increasing number of data sources on the Web whose contents are hidden and can only be accessed through form interfaces. Several applications have emerged that aim to automate and simplify the access to such content, from hidden-web crawlers and meta-searchers to Web information integration systems. Since for any given domain there are many different sources, an important requirement for these applications is the ability to automatically understand the form interfaces and determine correspondences between elements from different forms. While the problem of form schema matching has received substantial attention recently, existing approaches have important limitations. Notably, they assume that element labels can be reliably extracted from the forms and normalized; most adopt manually extracted data for experiments; and their effectiveness is reduced in the presence of incorrectly extracted as well as rare labels. In large collections of forms, however, not only can there be a substantial number of rare attributes, but automated approaches to label extraction are also required, which invariably lead to errors and noise in the data. In this paper, we propose PruSM, an approach for matching form schemas which prudently determines matches among form elements. PruSM does not require any manual pre-processing of forms, and it effectively handles both noisy data and rare labels. It does so by carefully selecting matches with the highest confidence, and then using these to resolve uncertain matches and to identify additional matches that include infrequent labels. We have carried out an extensive experimental evaluation using over 2,000 forms in multiple domains, and the results show that our approach is highly effective and able to accurately determine matches in large form collections.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand.
Copyright 2008 VLDB Endowment, ACM /00/00.

1. INTRODUCTION

It is estimated that there are millions of databases on the Web [27]. Because the contents of these databases are hidden and are only exposed on demand, as users fill out and submit Web forms, they are out of reach for traditional search engines, whose crawlers are only able to crawl through explicit HTML links. Several applications have emerged which attempt to uncover hidden-web information and make it more easily accessible, including: meta-searchers [16, 17, 39, 40], hidden-web crawlers [1, 4, 31, 28], and Web information integration systems [10, 21, 38]. These applications face several challenges, from locating relevant forms [6] and determining their domains [8, 7], to understanding the semantics of the form elements [41]. In this paper, we focus on the problem of form schema matching, i.e., identifying matches among elements across different forms, a step that is essential to form understanding.
Consider the forms in Figure 1. To build a meta-searcher for these forms, we need to identify the correspondences among their elements, so that a query posed through the meta-searcher's unified interface can be translated into a series of queries that are fired against these forms [16, 17, 40, 10]. For instance, if a user is interested in cars manufactured by Honda, the value Honda should be used as input for the elements that correspond to auto make. But finding these correspondences is challenging due to the great variability in how forms are designed, even within well-defined domains like auto search. Different element labels can represent the same semantic concept, and similar labels can represent different concepts. As illustrated in Figure 1, there are several distinct labels used to represent the concept auto make: make, please select an auto make, manufacturer, by brand, start here-choose make, car make-model. Also, there is a wide variation in the frequency of some labels within a domain (Figure 6): while some are frequent, others are very rare. And sometimes, elements that have no apparent similarity represent the same concept (i.e., they are synonyms). In forms for searching books, {First name, Last name} is often a synonym for Author, and in airfare search, {Children, Adult, Senior} is often a synonym for Passenger. This problem is compounded when one is faced with large form collections. Before the form elements can be matched, their labels must be extracted [41, 29]. But although effective automated approaches have been developed for label extraction, they are not perfect. For example, the accuracy of LabelEx [29] for extracting element labels varies between 86% and 94%. As a result, the matching process must take into account not only heterogeneity in form design but also the noise introduced by incorrectly extracted labels. Several approaches have been proposed for form-schema matching [38, 32, 19, 35, 27, 36, 37]. However, all of these approaches assume that labels for the form elements are correctly extracted.

Figure 1: Matching form schemas in the Auto domain. Different labels are used to represent the same concept.

Some approaches also require additional manual pre-processing of forms. For example, He and Chang [19] require that different variations of a label be grouped together (all variations of make, e.g., select a car make and vehicle make, must be normalized to make); and Wu et al. [38] require that a hierarchical representation of the form be accurately extracted. In the presence of noisy (incorrectly extracted) labels, the effectiveness of these techniques can be greatly reduced: incorrectly extracted labels can lead to initial incorrect matches which are subsequently propagated to other matches. As we discuss in Section 4, noisy form schemas can lead to up to a 30% reduction in the matching accuracy of previous works. Another limitation of existing approaches is related to rare attributes, i.e., labels (or groups of labels) that have low frequency in a form collection. Such attributes tend to confuse statistical approaches, which require a sufficient number of samples [19].

Our Approach. In this paper, we propose PruSM (Prudent Schema Matching), a new, scalable approach for schema matching in large form collections. PruSM is designed to deal with, and reduce the effect of, the noise present in such collections. A novel feature of PruSM is how it combines the various sources of similarity information for form elements. Each element is associated with a label (i.e., a textual description), a name (defined in the HTML standard), an element type (e.g., text box, selection list), a data type (e.g., integer, string), and, in some cases, a set of domain values. But not all elements contain all of these pieces of information, and, as we discussed above, the information can be incorrect. Furthermore, if considered in isolation, an individual component can be insufficient for matching purposes. Consider the following example.

Example 1. In the Auto domain, using domain values can result in an incorrect match between mileage and price, since they sometimes have a similar value range. On the other hand, if only label similarity is considered, an incorrect match can be derived for model and model year. Although they correspond to different concepts, they share a term that is important in the collection: the term model occurs as a label in most forms in this domain.

Besides element similarity, co-occurrence statistics can be used to identify mappings [19, 35]. For example, by observing that manufacturer and brand co-occur with a similar set of attributes but rarely co-occur together, it is possible to infer that they are synonyms. However, when used in isolation, attribute correlation can lead to incorrect matches. In particular, correlation matching scores can be artificially high for rare attributes, since rare attributes seldom co-occur with (all) other attributes.

The matcher we propose prudently combines and validates matches across elements using multiple feature spaces, including label similarity, domain-value similarity, and attribute correlation. This is in contrast to other approaches that use only a subset of the available information: some just consider the labels associated with elements, while others concentrate on internal structure, attribute types, or the relationships with other entities [34].

Figure 2: Web form matching framework.
It is worth noting that match quality does not monotonically increase with the number of information sources: using all the available information is not always a good solution. Moreover, combining this information in an effective manner is a non-trivial task, since the importance of the different components can vary from form to form, and some form elements do not contain all components. As we discuss later in the paper, by combining the different components prudently, we obtain reliable and confident matches, including matches for rare attributes.

Contributions. Our contributions can be summarized as follows:
- We propose PruSM, a new schema matching approach that prudently combines different sources of similarity information. PruSM is also prudent in deriving matches: it first derives matches for frequent attributes and uses these both to resolve uncertain matches and to identify new matches that include rare labels. As a result, PruSM is robust to noise and rare attributes, and it is also able to deal with attribute fragmentation.
- We present a detailed experimental evaluation, using a collection of over 2,000 forms, which shows that PruSM obtains high precision and recall without any manual pre-processing of form data. Furthermore, PruSM has higher accuracy (between 10% and 57%) than other form-schema matching approaches. These results indicate that PruSM is scalable and effective for large collections of automatically gathered forms.

Outline. The rest of this paper is organized as follows. The problem of Web-form schema matching is defined in Section 2, and we present the PruSM approach in Section 3. In Section 4, we discuss our experimental evaluation. Section 5 describes related work. We conclude in Section 6, where we outline directions for future work.

2. MATCHING WEB FORMS

In this paper, we propose a form-schema matching approach that is scalable and effective for large form collections. The specific scenario we consider is illustrated in Figure 2. Forms are automatically gathered by a Web crawler [6] and stored in a form repository.

Figure 3: Web form components.

These forms are then parsed, their element labels are extracted [29], and the associated schemas are derived. Finally, matches are derived between elements in different form schemas.

2.1 Problem Definition

A Web form F contains a set of elements E = {e_1, e_2, ..., e_n}. Each element e_i is represented by a tuple (l_i, v_i), where l_i is the label of e_i (i.e., a textual description) and v_i is a vector that contains a list of possible values for e_i. An element label l is a bag of terms {t_1, t_2, ..., t_m}, where each term t_i is associated with a weight w_i. For example, the element labels of the form in Figure 3 are Make, Model, Maximum price, Search within, and Your ZIP; the domain values for element Model are {All, MDX, RDX, RL, TL, TSX}. The composite label Your ZIP consists of two terms, and intuitively, ZIP is the most important term since it conveys the meaning of the element. In our model, a higher weight is associated with the term ZIP than with Your.

Definition 1. The form-schema matching problem can be stated as follows: given a set of Web forms F = ∪_{i=1..n} f_i in a given domain D, a schema matching process identifies all the correspondences (matches) among elements across forms f_i ∈ F, and groups them into clusters C = {C_1, C_2, ..., C_k}, where each cluster contains only elements that share the same meaning.

2.2 Computing Similarities

To compute the similarity between form elements, we first group elements that have the same label. Let A_i be an attribute which corresponds to the set of elements e that have the same label l. Given two attributes (A_i, A_j), we quantify the similarity between them using three different measures: label similarity, domain-value similarity, and correlation.

Label Similarity. Because forms are designed for human consumption, labels are descriptive and are often the most important source for identifying similar elements. We define the label similarity between two attributes (A_i, A_j) as the cosine distance [2] between the term vectors of their labels:

    lsim(A_i, A_j) = cos(l_i, l_j)    (1)

where

    cos(x, y) = (Σ_i x_i y_i) / (√(Σ_i x_i²) √(Σ_i y_i²))    (2)

To determine term weights, we can use TF-IDF. TF-IDF provides a measure of term importance within a collection: the importance of a term increases proportionally to the number of times the word appears in a document, but it is offset by the frequency of that word in other documents in the collection [2]. In addition to term frequency (TF), we also use the singular token frequency (STF) [29] to capture the importance of terms. The intuition behind STF comes from the fact that generic terms such as "Please", "Select", "of", "available" rarely appear alone in a label, and thus are unlikely to represent an important concept: they do not have a complete meaning by themselves. These terms usually have high frequency, since they appear in many composite labels. For instance, "select" appears in "Select a car make", "Select a State", "Select a model", etc. We use STF to distinguish between high-frequency terms that appear alone (which are likely to be important) and high-frequency terms that always appear with other terms. The term weight w_i is computed according to the equations below:

    w(t_i) = √(TF(t_i) · STF(t_i))    (3)

    STF(t_i) = n(t_i appears alone) / n(t_i)    (4)

    TF(t_i) = n(t_i) / Σ_t n(t)    (5)

However, since the terms in the labels are not initially normalized, the above weighting scheme may fail to accurately represent term importance.
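To make the weighting scheme concrete, here is a minimal Python sketch of Equations 1-5; the data structures and function names are ours, for illustration, not from the PruSM implementation.

```python
import math
from collections import Counter

def term_weights(labels):
    """Compute w(t) = sqrt(TF(t) * STF(t)) over a collection of labels,
    where each label is a list of (stemmed) terms."""
    n = Counter()      # n(t): occurrences of term t across all labels
    alone = Counter()  # n(t appears alone): labels consisting solely of t
    for label in labels:
        n.update(label)
        if len(label) == 1:
            alone[label[0]] += 1
    total = sum(n.values())
    weights = {}
    for t in n:
        stf = alone[t] / n[t]             # Equation 4
        tf = n[t] / total                 # Equation 5
        weights[t] = math.sqrt(tf * stf)  # Equation 3
    return weights

def cosine(x, y):
    """Cosine similarity between two sparse vectors (dicts), Equation 2."""
    dot = sum(v * y.get(t, 0.0) for t, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def lsim(label_i, label_j, w):
    """Label similarity (Equation 1): cosine over weighted term vectors."""
    return cosine({t: w.get(t, 0.0) for t in label_i},
                  {t: w.get(t, 0.0) for t in label_j})
```

Note how a term like "select", which never appears alone, receives weight zero, so composite labels such as "select make" and "make" end up with high label similarity.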
As we discuss in Section 3.3, after identifying a set of matches, we simplify labels from their long form into a short form; this boosts the STF score and the weight of important terms. This boosting is very important for obtaining additional matches which include rare labels (see Example 8).

Domain-Value Similarity. We compute the similarity between the domain values of two elements as the cosine distance between their corresponding value vectors. To compute the domain similarity between two attributes, we first need to aggregate all the domain values for each attribute. Given an attribute A_k, we build a vector D_k that contains all occurrences of values associated with the label l of A_k, together with their frequencies: D_k = ∪_{i=1..n} v_i : frequency, where l_i = l. The cosine distance (Equation 2) is then used to measure the similarity between the vectors of the two attributes:

    dsim(A_i, A_j) = cos(D_i, D_j)    (6)

Example 2. Consider the following scenario: elements e_1 and e_2 have the same label, and so do e_3 and e_4. The values associated with these elements are, respectively: v_1 = {a,b,d}, v_2 = {a,d}, v_3 = {a,b,c}, v_4 = {a,b}. These four elements are aggregated into two attributes, A_1 = {e_1, e_2} and A_2 = {e_3, e_4}, with D_1 = {a:2, b:1, d:2} and D_2 = {a:2, b:2, c:1}. The similarity between A_1 and A_2 is dsim(A_1, A_2) = cos(D_1, D_2) = 0.67.

Domain values are a good source of similarity information. However, they are not always available (see, e.g., the elements Your ZIP or Title in Figure 3). Therefore, we should consider values as supporting information to validate (or reinforce) a match.

Correlation. By looking at the form schemas holistically and considering them all simultaneously, we can leverage an implicit source of similarity information: attribute correlation. Correlation is a statistical measure which indicates the strength and direction of a linear relationship between two variables; it can be positive or negative.
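A short sketch of the domain-value similarity computation (Equation 6), reusing the cosine helper above; it reproduces the numbers of Example 2.

```python
from collections import Counter

def aggregate_values(value_lists):
    """Build D_k: merge the value lists of all elements sharing a label
    into a single frequency vector."""
    D = Counter()
    for values in value_lists:
        D.update(values)
    return D

def dsim(D_i, D_j):
    """Domain-value similarity (Equation 6)."""
    return cosine(D_i, D_j)  # cosine from the sketch above

D1 = aggregate_values([["a", "b", "d"], ["a", "d"]])  # {a:2, b:1, d:2}
D2 = aggregate_values([["a", "b", "c"], ["a", "b"]])  # {a:2, b:2, c:1}
print(round(dsim(D1, D2), 2))  # 6 / (3 * 3) = 0.67, as in Example 2
```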

For Web forms, we exploit the fact that synonym attributes are semantic alternatives and rarely co-occur in the same query interface (they push each other away): they are negatively correlated (e.g., Make and Brand). On the other hand, grouping attributes are semantic complements and often co-occur in the same query interfaces (they pull each other together): they are positively correlated (e.g., First name and Last name, or Departure date and Return date). Although there are several correlation measures (e.g., Jaccard, cosine, Gini index, Laplace, Kappa), none is universally good [30]. In the context of Web forms, two such measures have been proposed: the H-measure [20] and the X/Y measure [35]. For PruSM, we use the latter:

    X(A_p, A_q) = 0 if A_p and A_q co-occur in some form F;
                  (C_p − C_qp)(C_q − C_qp) / (C_p + C_q) otherwise    (7)

    Y(A_p, A_q) = C_pq / min(C_p, C_q)    (8)

The matching score X captures negative correlation, while the grouping score Y captures positive correlation. To understand these scores, consider the contingency table for the attributes A_p and A_q, which encodes the co-location patterns of these attributes [9]. The cells in this matrix are f_00, f_01, f_10, f_11, where the subscripts indicate the presence (1) or absence (0) of an attribute. For example, f_11 corresponds to the number of interfaces that contain both A_p and A_q, and f_10 corresponds to the number of interfaces that contain A_p but not A_q. In the equations above, C_qp corresponds to f_11, C_p to (f_10 + f_11), and C_q to (f_01 + f_11). Note that correlation is not always an accurate measure, in particular when insufficient instances are available. However, as we describe below, correlation is very powerful and effective when combined with other measures.

3. PRUDENT MATCHING

Figure 4: Prudent schema matching framework.

The high-level architecture of PruSM is depicted in Figure 4. Given a set of form schemas, the Aggregation module groups together similar schema attributes and outputs a set of frequent attributes (S_1) to the Matching Discovery module and a set of infrequent attributes (S_2) to the Matching Growth module. Matching Discovery finds complex matchings among frequent attributes. These complex matchings, together with the infrequent attributes, are used by Matching Growth to obtain additional matchings that include infrequent attributes. Algorithm 1 describes the matching process in detail.

Algorithm 1 Prudent Schema Matching
1: Input: set of attributes A of a certain domain, configuration Conf, grouping threshold T_g
2: Output: set of attribute clusters C
3: begin
4:   /* 1. Aggregation */
5:   Aggregate similar elements
6:   Create the sets of frequent and infrequent attributes S_1, S_2
7:   Let X_pq be the set of all attribute pairs in S_1; compute the label similarity lsim, value similarity dsim, and correlation score X between them
8:   /* 2. Matching Discovery for frequent attributes */
9:   M ← ∅
10:  while X_pq ≠ ∅ do
11:    Choose the {A_p, A_q} that has the highest X_pq
12:    if prudent_check(Conf, A_p, A_q) then
13:      M ← ExploreAMatch(A_p, A_q, M)
14:    else
15:      /* buffer uncertain matches to revise later */
16:      B ← B ∪ {A_p, A_q}
17:    end if
18:    Remove {A_p, A_q} from X_pq
19:  end while
20:  Resolve uncertain matches in buffer B
21:  /* 3. Matching Growth for rare attributes */
22:  Update STF and term weights
23:  Create a set of finer clusters C_i according to M_i
24:  Use 1NN + co-location check to assign rare attributes to their closest frequent attributes
25:  Cluster the remaining attributes by HAC and add them to C
26: end

3.1 Aggregation

The aggregation module groups together similar elements by stemming terms [2] and removing stop words (e.g., "the", "an", "in").
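For illustration, a minimal sketch of this normalization, assuming NLTK's Porter stemmer and a small generic stop-word list (the actual stemmer and list used by PruSM may differ):

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "in", "of"}  # illustrative list
stemmer = PorterStemmer()

def normalize_label(label):
    """Stem terms and drop stop words, e.g. 'Select a make' -> 'select make'."""
    terms = [stemmer.stem(t) for t in label.lower().split()]
    return " ".join(t for t in terms if t not in STOP_WORDS)

assert normalize_label("Select a make") == normalize_label("Select makes")
```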
For example, the labels "Select a make" and "Select makes" are aggregated into a new attribute "select make"; children and child are aggregated as child. Note that, unlike DCM [20] and HSM [35], we do not need to detect and remove domain-specific stop words (e.g., "car", "vehicle", "auto", "books") or terms that are generic and occur frequently in multiple domains (e.g., select, choose, find, enter, please). Both DCM and HSM require manual pre-processing to simplify the labels and remove these irrelevant terms. For instance, "Search for book titles" is simplified to title, and all variations of make, e.g., "select a car make" and "vehicle make", are simplified to make. However, automatically simplifying these labels is challenging, and simplification errors can be propagated and result in incorrect matches. As our experiments show, PruSM is effective even without label simplification. For example, it is able to correctly find the match year(44) ↔ select year(15) ↔ year rang(16), which contains the stop words select and range.

An attribute is considered frequent if it occurs above a frequency threshold T_c (i.e., the number of occurrences of the attribute over the total number of schemas). Note that whereas we set T_c = 5%, in DCM and HSM T_c is set to 20% [20, 35].

3.2 Matching Discovery (MD)

As part of our prudent strategy, only frequent attributes participate in Matching Discovery. We consider frequent attributes first because they contain less noise and contribute to finding correct matches early (high precision), consequently avoiding error propagation. For each pair in the frequent-attribute list, we compute label and value similarity (Equations 1 and 6), and the correlation matching and grouping scores (Equations 7 and 8).
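These scores can be computed directly from the contingency-table counts described in Section 2.2. The sketch below follows Equations 7 and 8; treating any co-occurrence (f11 > 0) as grounds for a zero matching score is our reading of the condition in Equation 7.

```python
def matching_score(f11, f10, f01):
    """X(A_p, A_q), Equation 7. f11 = #forms containing both attributes,
    f10 = only A_p, f01 = only A_q; C_p = f10+f11, C_q = f01+f11, C_qp = f11."""
    if f11 > 0:  # attributes co-occur in some form: not synonym candidates
        return 0.0
    Cp, Cq, Cqp = f10 + f11, f01 + f11, f11
    return (Cp - Cqp) * (Cq - Cqp) / (Cp + Cq) if Cp + Cq else 0.0

def grouping_score(f11, f10, f01):
    """Y(A_p, A_q), Equation 8: high when the rarer attribute almost
    always co-occurs with the other."""
    Cp, Cq = f10 + f11, f01 + f11
    return f11 / min(Cp, Cq) if min(Cp, Cq) else 0.0
```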

As shown in Example 1, it is not easy to determine whether two attributes correspond to the same concept. Weak matchers are less reliable and can negatively impact the overall accuracy. Thus, strong matchers should take precedence, while weak matchers are used later in the process, possibly combined with additional evidence. For example, correlation can be effective if we prudently combine it with additional evidence, e.g., a strong label similarity or value similarity. In Matching Discovery, we iteratively find the most confident match by combining different information sources using a prudent check, i.e., a configuration Conf used to validate a match. Conf consists of a set of constraints and parameters over the correlation score, label similarity, and value similarity:

    X(A_p, A_q) > T_matching AND [dsim(A_p, A_q) > T_value OR lsim(A_p, A_q) > T_label]

The intuition behind this prudent check is that although attributes A_p and A_q may have a high correlation score, they are not a good match unless additional evidence is available. Only attribute pairs passing the prudent check are considered by ExploreAMatch (Algorithm 1, line 13), where they are checked and integrated with the existing matches. If an attribute pair is an uncertain match (high correlation, but one or both of the attributes do not have similar domain values), it is added to the buffer B to be revisited later (line 16).

ExploreAMatch (Algorithm 2) is modified from [35]. Instead of the attribute pair with the highest correlation, it selects the most confident pair, i.e., the pair with the highest correlation that passes the prudent check. ExploreAMatch supports complex matches. We denote by M = {M_j, j = 1..k} a set of complex matches, where each M_j comprises a set of grouping attributes G that are matched together: M_j = {G_j1 ↔ G_j2 ↔ ... ↔ G_jw} (there can be one or more grouping attributes in each group G). The algorithm proceeds as a sequential floating process, where only the highest-scoring pair emerges at each point in time. If attributes A_p and A_q do not appear in M and are not close to any of the existing matches, a completely new match is created (line 6). If they are close, the attribute pair becomes a new matching component of an existing match M_j (lines 8 and 13), or a grouping component of an existing match is created (line 15) when two attributes that match the same set of other attributes have a high grouping score. Note that a new attribute needs to be negatively correlated with at least one attribute in every group of M_j; otherwise, it is discarded.

A nice property of Matching Discovery is that, by iteratively choosing the most confident match, it automatically merges negatively-correlated attributes and gradually groups positively-correlated ones while, at the same time, pushing away non-negatively-correlated ones. As time progresses, the floating matches become richer and bigger, forming the final complex matchings. The example below illustrates this process.

Example 3. Initially, the set of matches is empty. Suppose we are considering the Airfare domain, and X(depart, departure date) has the highest matching score. These attributes form a new match M_j = (depart ↔ departure date).
The next highest pair is (depart, return date); however, the matching score X(return date, departure date) = 0 and the grouping score Y(return date, departure date) is above T_g, therefore return date is added as a grouping component with departure date: M_j = (depart ↔ {departure date, return date}). The next highest pair is then (return date, return), and similarly M_j = ({depart, return} ↔ {departure date, return date}).

Algorithm 2 ExploreAMatch
1: Input: a candidate pair (A_p, A_q), set of current matches M, grouping threshold T_g
2: Output: set of matches M
3: begin
4:   if neither A_p nor A_q appears in M then
5:     if neither A_p nor A_q is close to any M_j ∈ M then
6:       M ← M + {{A_p} ↔ {A_q}}
7:     else
8:       M_j ← M_j + ({A_p} ↔ {A_q})
9:     end if
10:  else if only one of A_p and A_q appears in M then
11:    /* suppose A_p appears and A_q does not */
12:    if for each A_i in M_j, X_qi > 0 then
13:      M_j ← M_j + ({A_q})
14:    else if for each A_m ∈ {M_j \ G_jk}, X_qm > 0, and for each A_l ∈ G_jk, Y_ql > T_g then
15:      G_jk ← G_jk + {A_q}
16:    end if
17:  end if
18: end

Figure 5: Lack of validation can lead to incorrect matchings and consequent errors.

As the next example illustrates, systems that fail to perform validation can make incorrect decisions.

Example 4. In Figure 5, let attributes A, D, E and F be correct matchings. Because X(A,B) > X(A,D), HSM [35] will match A with B and subsequently with C. Because D is not matched with B, attribute A will never be matched with attributes D, E and F. Thus, if a bad decision is made in an early step, it may not be corrected and can negatively affect the following steps. One concrete example that we encountered when not using validation in the Auto domain is the incorrect matching of year and body style. This match eliminates the chance of a match between type and body style, because type and year are negatively correlated. By identifying highly confident matches first, it is possible to avoid a potentially large number of incorrect matches.

Attribute fragmentation happens when attributes that belong to the same concept co-occur with different sets of attributes. A consequence of fragmentation is that it lowers the grouping and matching scores between attributes.

Example 5. Let S be a small set of schemas: S = {{A, C_1}, {A, C_1, D}, {B_1, C_2}, {B_1}, {B_2, C_1}}. Attributes B_1 and B_2 belong to concept B, and C_1 and C_2 belong to concept C. The matching score of A and B_1 is thus lower than the matching score of A and B, and the grouping score of B_1 and C_2 is lower than the grouping score of B and C. For example, X(A,B) = 1.2 while X(A,B_1) = 1, and Y(B,C) = 1 while Y(B_1,C_1) = 0.

Because of validation, we can afford to use a low matching score and yet obtain high accuracy.
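The prudent check itself reduces to a single predicate. A minimal sketch follows; the threshold values are placeholders, not the learned values discussed in Section 3.4.

```python
# Illustrative thresholds; PruSM learns the similarity thresholds from a
# small set of highly correlated pairs (Section 3.4).
T_MATCHING, T_LABEL, T_VALUE = 0.5, 0.6, 0.6

def prudent_check(x_score, lsim_score, dsim_score):
    """Accept a candidate pair only when a high correlation score is
    corroborated by at least one independent similarity signal."""
    return x_score > T_MATCHING and (dsim_score > T_VALUE or
                                     lsim_score > T_LABEL)
```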

To overcome the shortcoming of low grouping scores, we apply a group reinforcement process to propagate the correlation and reinforce the grouping chance of two attributes even when their grouping score is not sufficiently high.

Example 6. Suppose we have the match {make, model} ↔ {select make}. Because of fragmentation, the grouping score Y(select make, select model) is low. However, the matching scores X(make, select make) and X(model, select model) are high, and so is Y(make, model). Therefore, through reinforcement, select make and select model can be grouped together.

The last step in MD is to resolve the buffer B, which contains uncertain but potential matches (Algorithm 1, line 20). Here, we apply ExploreAMatch (Algorithm 2) again to resolve each uncertain pair. By buffering these pairs and revising them later, we take advantage of new extra constraints. It is worth performing this relaxation because some attributes do not have any domain values (e.g., departure city and arrival city in the Airfare domain), and some domain values are too coarse, leading to a low similarity between them (e.g., category and subject in the Book domain). However, this relaxation could affect the matching accuracy, so we must be careful to choose only highly-correlated pairs (80% of the maximum matching score).

Example 7. Let A_1 ↔ A_2 be an uncertain match in buffer B. After finding the certain match A_1 ↔ A_3 ↔ A_4, we can take advantage of the additional constraints to reject A_1 ↔ A_2, because there is no connection between A_3, A_4 and A_2.

3.3 Matching Growth (MG)

The second phase of PruSM is Matching Growth, which finds additional matches for rare attributes. Initially, based on the certain matches found in the MD phase, we update the STF (Section 2.2) to improve the term weights. Consider the following example.

Example 8. In the Matching Discovery step, we can find the match year(44) ↔ select year(15) ↔ year rang(16), which contains two stop words, select and range. Using this match, we can update the weight of the token year and downgrade the weights of the tokens select and range.

Identifying anchor terms like "year", "make", "model", etc. is very helpful for the later phase of Matching Growth, where greater variability is present in the attribute labels.

Fragmentation also affects the quality of the correlation signal and leads to incorrect ordering of complex matches, as shown in Example 9. We can use attribute proximity information to break ties, find finer matches for m:n complex matches, and create a set of clusters corresponding to the finer matching set.

Example 9. Suppose the correlation score X(departure date, return on) > X(departure date, leave on). In this case, domain values do not help because they are similar. Proximity information or label similarity can help to break the tie: {departure date, return date} ↔ {return, depart} ↔ {return on, leave on} becomes {departure date, return date} ↔ {depart, return} ↔ {leave on, return on}.

Next, we use 1-Nearest-Neighbor clustering (1NN) to assign rare attributes to their most similar frequent attributes. Moreover, using the form context, i.e., the lists of resolved (matched) and unresolved (unmatched) elements of each form, we ensure that two elements in the same form cannot be in the same cluster (co-location check). In addition, data-type compatibility is used to prevent incorrect matches. Example 10 illustrates how additional matches are derived for infrequent attributes. Note that improving the weight of important terms like "price" and "model" is very important to correctly identify these matches.
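A simplified sketch of this assignment step follows; it checks co-location against the frequent attribute itself rather than against its whole cluster, and it omits the data-type compatibility test.

```python
def assign_rare(rare, frequent, similarity, forms_of):
    """Assign each rare attribute to its most similar frequent attribute
    (1NN), skipping candidates that ever share a form with it (co-location
    check). forms_of maps an attribute to the set of form ids containing it."""
    assignment = {}
    for r in rare:
        candidates = [f for f in frequent if not forms_of[r] & forms_of[f]]
        if candidates:
            assignment[r] = max(candidates, key=lambda f: similarity(r, f))
    return assignment
```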
Finally, for the remaining unmatched attributes, we run the HAC (Hierarchical Agglomerative Clustering) algorithm to group similar attributes into new clusters and add these new clusters to the current set.

Example 10. Additional matches in the Auto domain derived by 1NN include: price up to ↔ price, price range in euro ↔ price range, model example mustang ↔ model, approximate mileage ↔ mileage, color of corvett ↔ color, and "if so, what is your trade year" ↔ year. HAC also derives new clusters, for example: {within one month, within one week, within hour} and {dealer list, dealer name, omit dealer list}.

3.4 Discussion

Comparison against other clustering methods. We could create homogeneous clusters by applying correlation clustering algorithms [3] to produce an optimal partition of attributes that minimizes the number of disagreements inside clusters and maximizes the disagreement between clusters. Although correlation clustering does not require specifying the number of clusters k, the problem is NP-complete [3]. As our experiments show, our ExploreAMatch algorithm is both fast and effective for the data we considered. With clustering methods like K-means [22] and HAC [14], we have to decide beforehand on the number of clusters or the stopping threshold. PruSM is a data-driven process and, as such, it can naturally reveal the shape of the clusters according to the attribute distribution and their internal interactions.

Noise and rare attributes. Noise can negatively impact the correlation signal [20], while rare attributes can artificially exacerbate it [35]. Both noise and rare attributes lead to cascading errors which are irreversible and can potentially be propagated and magnified. PruSM is designed to work with a low correlation signal and still find accurate matches. Strong matchers and strong matches take precedence, finding the most confident matches first and helping to prevent errors during the initial steps. To find accurate matches, the validating configuration is not required to be perfect: it suffices to favor relatively high thresholds (strict behavior) to obtain high-precision matches first, which can be extended later. The label and value similarity thresholds in the configuration are learned from a small set of highly correlated pairs that have a clear similarity signal [33]. By using labels and values to validate matches, as we discuss later in the experimental evaluation, PruSM is robust to a wide variation of correlation thresholds.

4. EXPERIMENTAL EVALUATION

We empirically evaluate the performance of PruSM over a large number of Web forms in multiple domains. We run experiments on two datasets: one small, clean, manually gathered dataset, and one large, noisy, automatically gathered set of forms. Our experiments show that PruSM is superior to other schema-matching approaches on both datasets, even without any manual pre-processing of the data. We also study how the different PruSM components (i.e., Matching Discovery and Matching Growth) contribute to the overall accuracy. Last, but not least, we compare our approach with other holistic matching approaches.

Table 1: Comparing different approaches.
- Target: DCM and HSM find synonym attributes; PruSM finds all attribute correspondences.
- Pre-processing: DCM and HSM require it; PruSM does not.
- Information used: DCM and HSM use correlation only; PruSM uses labels, domain values, and correlation.
- Rare attributes: DCM's performance degrades for low-frequency attributes; HSM can only match attributes that appear frequently; PruSM is robust to rare attributes and can identify matches of infrequent attributes.
- Strategy: DCM finds all possible combinations of positively and negatively correlated sets, then combines and ranks them to choose the best one; HSM iteratively finds the highest-correlated match first and integrates it with the set of existing matches; PruSM combines multiple sources of information to iteratively find the most confident match first, then grows from these certain matches to extend the result.
- Limitations: DCM's performance decreases with rare attributes or non-preprocessed data, and it has a huge search space with many possible combinations of positively and negatively correlated sets; HSM's correlation-centric strategy leads to incorrect matches and consequent errors, and it is impacted by attribute fragmentation.

Table 2: Database domains. (a) TEL8: domain, number of forms, and number of elements for Airfare, Auto, Book, Movies, Music, Hotel, CarRental, and Job. (b) WebDB: domain, number of forms, sample size, and number of elements for Auto, Book, and Airfare.

Since, to the best of our knowledge, HSM has the best performance among existing form-schema matching approaches, we implemented it and use it as our baseline.

4.1 Experimental Setup

Datasets. We conduct the experiments on the TEL8 and WebDB datasets. Table 2 summarizes the characteristics of these two datasets. The TEL8 dataset contains manually extracted schemas for 447 deep-Web sources from 8 domains. The WebDB dataset contains 2,884 Web forms. These forms were harvested using a focused crawler [5, 6] and automatically classified into different domains. We use LabelEx [29] to extract all the mappings between labels and elements in these forms. Note that the data in this dataset is representative and reflects the characteristics and distribution of form labels, thus enabling us to evaluate the robustness of our approach. Figure 6 shows a histogram of the top 35 attribute labels in this dataset; note the wide variability in the frequency distribution of these labels. In particular, there is a large number of rare attributes, which tend to confuse statistical matching approaches.

Figure 6: WebDB label histogram.

Effectiveness measure. To evaluate PruSM's performance, we use precision and recall. Precision can be seen as a measure of fidelity, whereas recall is a measure of completeness. The F-measure is the harmonic mean of precision and recall: a high F-measure means that both recall and precision are high, and a perfect F-measure value is 1. We also implemented a GUI tool to support the manual creation of the gold data for both the TEL8 and WebDB datasets. Since there are many clusters, we measure PruSM's performance as the average precision and recall, weighted by the size of each cluster [19]. The average precision, recall, and F-measure are defined as:

    Precision_avg = Σ_{C_i} (|C_i| / Σ_{C_j} |C_j|) · P_{C_i}    (9)

    Recall_avg = Σ_{C_i} (|C_i| / Σ_{C_j} |C_j|) · R_{C_i}    (10)

    F-measure_avg = 2 · Precision · Recall / (Precision + Recall)    (11)
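These cluster-size-weighted averages can be computed as in the following sketch (the input format is our own, for illustration):

```python
def weighted_scores(clusters):
    """Equations 9-11. clusters: list of (size, precision, recall) tuples,
    one per result cluster; each cluster is weighted by its size."""
    total = sum(size for size, _, _ in clusters)
    p = sum(size * prec for size, prec, _ in clusters) / total
    r = sum(size * rec for size, _, rec in clusters) / total
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```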
4.2 Evaluating the Effectiveness of PruSM

In this section, we compare the effectiveness of PruSM against HSM on both datasets, TEL8 and WebDB. Although PruSM outperforms HSM on both, the improvement is much bigger on the WebDB dataset, e.g., 73% in the Book domain. This shows that our approach is more robust to the variability and cleanliness of the data than HSM.

Effectiveness of PruSM on the TEL8 dataset. Figure 7 summarizes the accuracy of HSM and PruSM on the TEL8 dataset. Overall, both the precision and recall of PruSM are much higher than those of HSM in all five domains: PruSM outperforms HSM by margins ranging from 10% in the Book domain to 40% in the Auto domain. We should note that HSM's performance is low because we do not apply the Syntactic Merging process. HSM has its lowest accuracy (F-measure 0.61) in the Car Rental domain, due to the sparseness of labels, and it obtains rather high accuracy in cleaner domains, like Book (F-measure 0.86). PruSM gains more improvement in sparse domains (e.g., 40% in Auto and 36% in Car Rental) and less in clean domains (e.g., Book (10%) and Airfare (12%)).

Table 3: Matching results of PruSM MD in the TEL8 dataset.
Books: Author(54) [Last Name(6), First Name(6)] Subject(17) Category(7) Format(12) Binding(6)
Airfare: [return(12), depart(11)] [departure date(21), return date(24)] [Adult(31), Children(25), Infant(13), Senior(10)] Passenger(8) [from(23), to(22)] [departure city(5), arrival city(5)] [leave from(5), go to(4)] cabin(5) class(11)
Autos: year rang(13) year(34) make(72) manufacture(9) vehicle make(5) vehicle(4) vehicle model(5) model(63) [max price(9), max mileage(4)] price rang(29) price(15) type(6) body type(4) vehicle body style(3) body style(4) exterior color(3) color(6) number of door(3) cylinder(6) location(5) state(7)
Movies: category(13) genre(18) cast crew(6) director(38) star(16) keyword(19)
Music: title(49) album(16) catalog #(10) song(31) artist(54) conductor(6) composer(8) category(9) genre(5) movie title(3) movie(3) perform(7) artist perform(3)
Hotel: [check in date(16), check out date(12)] [check in(9), check out(7)] arrival(4) number of room(3) room(11) guest(3) adult(19)
Car Rental: pick up city(6) pick up location(9) where to pick up(3) rental car type(6) car class(3) [pick up date(14), pick up time(12), drop off date(11), drop off time(9)] [drop off(4), pick up(4)] drop off city(5) drop off location(5)
Job: location(18) state(22) category(6) job type(12) industry(6) job title(4) title(4)

Figure 7: PruSM performance in the TEL8 dataset.

Because PruSM does not perform syntactic merging and uses a low frequency threshold, the correlation signal might be weaker. However, as shown in Table 3, PruSM's Matching Discovery identifies more than 80% of all the good matches. Besides complex matches, e.g., Passenger ↔ {Children, Adult, Senior}, PruSM finds syntactic matches that HSM and DCM did not find, e.g., make ↔ vehicle make ↔ vehicle, vehicle model ↔ model, and job title ↔ title. As reported in [35], the synonym target accuracy of DCM and HSM varies with the attribute frequency threshold. After syntactic merging, they consider only attributes occurring above a frequency threshold T_c = 20%. Note that when more rare attributes are taken into consideration (T_c = 5%), the occurrence pattern of infrequent attributes is not very obvious and the accuracy decreases significantly: as they reported, the average HSM precision decreased from 100% to 70%, while the average DCM precision decreased from 95% to 48% and its recall from 98% to 58%. Because the prudent matching process helps avoid many incorrect matchings with rare attributes, and accordingly their consequent errors, PruSM works robustly and accurately with a low attribute frequency threshold (T_c = 5%).

Another problem is that both DCM and HSM require pre-processing. As reported in [20], DCM's performance decreased seriously when syntactic merging was not conducted; for example, in the Hotel domain, precision went down from 86% to 47% and recall decreased from 87% to 46%. Again, PruSM does not encounter this problem because it does not require syntactic merging. By considering only frequent attributes first, the certain matches found in the Matching Discovery step are used to resolve rare attribute variants. PruSM thus requires no pre-processing and is also less affected by rare attributes.

Figure 8: PruSM performance in the WebDB dataset.

Effectiveness of PruSM on the WebDB dataset. The WebDB dataset is a large, heterogeneous, automatically crawled dataset.
A nice capability of PruSM is that we do not need to manually clean the data or conduct the syntactic merging process (which is extensive work for a large, real dataset). All we need to do is apply a simple aggregation step (word stemming and stop-word removal) to aggregate similar elements together. It is worth noting that PruSM works well with automatically crawled data, despite its heterogeneous vocabulary (Figure 6) and a few incorrect mappings produced by the label extraction process (i.e., wrong labels mapped to elements). Figure 8 shows the overall performance of PruSM on the WebDB dataset. Overall, PruSM leads to a substantial performance improvement in all three domains (42% in Auto, 38% in Airfare, and 57% in Book), with both precision and recall higher than the baseline. PruSM's performance is notably high in Auto (90%) and Airfare (88%), which are well-defined domains. Precision is higher because of the prudent matching process; recall is higher because MG helps find additional matches of infrequent attributes. The recall gain from MG in the Airfare domain is smaller (10%, compared to 25% in Auto) because there are few infrequent attributes in this domain, as shown by the low tail in Figure 6. Note that the Book domain is more heterogeneous and thereby has a lower result. This can be explained by its lowest Simpson index, as in [29], and by the many complex labels in the Book domain, like word in title abstract, title to display per page, journal title abbreviation, title series keyword, or title author ISBN keyword, which are easy to confuse with terms like title, abstract, journal, series, or keyword. Taking a closer look at the Auto domain, the precision and recall of HSM are 65% and 63%; in the PruSM Matching Discovery, they are 98% and 69%.

Although the recall is still low, the matching set includes all the basic seeds, as shown in Table 4.

Table 4: Basic seeds.
select make model(9) make(63) model(59) select model(8) select make(7)
year(44) select year(15) year rang(16) select rang of model year(6)
price(14) price rang(23) price rang is(6)
zip(13) zip code(7) valid (6) (8)
type(5) body style(6)
search within(7) distance(4) mile(12) mileage(4)

Based on these good seeds, the PruSM MG significantly improved recall from 69% to 87% (precision decreases slightly), increasing the final F-measure to 90%. Overall, PruSM leads to a 41% gain in F-measure in the Auto domain. Note that the performance of both DCM and HSM is affected by schema extraction errors, which reduce the negative correlation signal, thereby affecting the ranking and leading to cascading errors; their performance decreased by up to 30% due to extraction errors, as reported in [20]. PruSM is not affected much because it can afford low grouping and matching scores and still obtain good results.

4.3 Evaluating the Effectiveness of the PruSM Components

Figure 9: Contribution of the different components in multiple domains.

Figure 9 shows the efficiency and contribution of the different components of PruSM. With prudent matching, F1 gains 41% in Auto, 33% in Airfare, and 18% in Book. With 1NN, F1 gains 11% in Auto and 6% in Airfare and Book. Clustering the remaining attributes increases recall (F1 gains 9% in Book, but not significantly in Auto and Airfare). Updating the STF helps increase F1 by about 3.5% in all three domains. Reinforcement and buffering help F1 gain more than 2% in Airfare and Book, and 1% in Auto. With tie-breaking, F1 gains 20% in Airfare. As we observed, prudent matching significantly improves precision, while 1NN and HAC improve recall (and slightly decrease precision). Updating the STF and tie-breaking help improve precision, while reinforcement and buffering help increase recall. In all three domains, prudent matching significantly improves precision; 1NN plays an important role in Auto, as do HAC in Book and tie-breaking in Airfare.

To evaluate the effect of syntactic merging in PruSM, we manually selected and removed domain stop words and specific stop words; the overall performance increased by about 2%, which indicates that those stop words do not affect PruSM very much. As mentioned earlier, validation is very important: without validation, precision is very low in the MD phase and, consequently, in the MG phase. As shown in Figure 9, precision is degraded by up to 45% without validation. That is why we need high precision in the Matching Discovery phase.

Figure 10: Different combination strategies in MD.

Figure 10 illustrates the performance of different combination strategies. As we observed, prudent matching has the highest precision (compared with single matchers or a linear combination of different matchers) and contains all the good seeds for the next MG step.

Figure 11: MD performance over a wide variation of the correlation threshold: (a) without validation; (b) with validation.

The next experiment, shown in Figure 11, illustrates that by using prudent matching, PruSM is robust to a wide variation of the correlation threshold. Without validation, precision is low when the correlation threshold is low. This leads to lower precision in MG and lower F-measures. Figure 11(a) implies that, without validation, the correlation threshold must be very high in order to have a clear signal and an acceptable precision. However, recall is then very low (30%) and we do not obtain all the good seeds.
This is consistent with the observation in [15] that a higher similarity measure is an indication of a more precise mapping. On the other hand, with validation, we can obtain high precision (the most important goal of the MD phase) even with a very low correlation threshold, as shown in Figure 11(b), and the precision remains high over a wide variation of the correlation threshold.

5. RELATED WORK

Although this problem is related to database schema matching [18, 11, 23, 26, 34], there are fundamental differences [24]. First and foremost, whereas database schemas include information about attribute names, data types, and constraints (keys, foreign keys), for Web form schemas only the association between a label and an element is known, and this schema may not correspond to the schema of the data hidden behind the form.

On the other hand, because Web forms are designed for human consumption, there are not as many acronyms as in database schemas, and the vocabulary used for labels must be descriptive so that users can easily understand their semantics. Another important difference is that whereas pair-wise matching approaches have been used to match database schemas, these do not scale well to the large number of form schemas in a given hidden-web domain.

Three distinct classes of approaches have been proposed for form-schema matching: clustering [38, 32, 21], instance-based approaches [36, 37], and holistic approaches [19, 20, 35, 25]. Clustering approaches [38, 21, 32] need to define a precise similarity function which combines different similarity components between any two form elements. While Wu et al. [38] used pre-defined coefficients, Pei et al. [32] leveraged the distribution of Domain Clusters (DC) and Syntactic Clusters (SC) to automatically determine the weights of linguistic and domain similarity. He et al. [21] used pre-defined coefficients and a hierarchical combination framework that leverages high-quality matchers first and then predicts matches. A more principled approach like LSD [12] requires a mediated schema and human users to manually construct the semantic mappings in the training examples from which these weights are learned. By using linguistic and domain similarity to validate correlation, PruSM does not need to define a similarity function and associated weights. One drawback of some of the clustering-based approaches is that they use WordNet [21] or leverage domain values [32] to find synonyms. Using WordNet is not sufficient to find domain-specific synonyms, and by leveraging only domain values, they cannot find synonyms of attributes that have sparse domain values or no domain values associated with them. Furthermore, these approaches [38, 21] work with (and have only been tested on) a very small number of sources, require data pre-processing, and rely on high-quality (noise-free) data as input. Except for [38], all of these approaches are limited to 1:1 mappings. Wu et al. [38] support complex matching by modeling form interfaces as trees. These trees are first used to identify complex mappings and isolate possible composite attributes; then HAC is applied to cluster the remaining attributes and find all 1:1 mappings, which are combined with the initial complex mappings to obtain additional complex mappings (using the "bridging effect"). Using an ordered tree to identify 1:m mappings for each form interface is expensive and does not scale to a large number of sources. The effectiveness of this approach is highly dependent on the quality of this structure, which, for the experiments discussed in the paper, was manually constructed. Besides, users are required to reconcile a potentially large number of uncertain mappings so that similarity thresholds can be learned. Pei et al. [32] exploit two kinds of attribute clusters, SC and DC, and aim to optimize SC by using DC (using certain attributes in DC to resolve uncertain SC attributes). To resolve the uncertainties when merging SC and DC, they used a criterion function which combines syntactic similarity and domain similarity, with coefficients automatically determined based on the distribution of SC and DC: the more elements in DC, the higher the coefficient of linguistic similarity in that cluster.
However, the distribution of SC and DC clusters varies from domain to domain, and this approach has less impact when domain values are scarce or not available. To minimize the effect of noise, they re-sample clusters multiple times to filter unstable attributes (outliers), whereas we pay attention to the frequent attributes first to avoid propagating errors. We note that they only handle simple 1:1 matchings. The two-step clustering approach employed by the WISE-integrator [21] shares some basic ideas with PruSM, since it tries to derive confident matches first. However, WISE relies on the quality of the input data to linearly combine the different component similarities, and its experimental dataset is small and manually pre-processed. Besides, the WISE-integrator only handles 1:1 matchings and uses WordNet to find synonyms.

Holistic approaches benefit from considering a large number of schemas simultaneously [19, 20, 35, 25]. A limitation shared by these approaches is that they require clean data, and their performance decreases significantly when the input data is noisy. He and Chang [19] proposed MGS (Model, Generation, Selection), which assumes that labels are generated by a hidden generative model containing a few schema concepts, each concept composed of synonym attributes with different probabilities. MGS exhaustively generates all possible models and uses a statistical hypothesis test to select a good one. MGS evaluates and chooses the best global schema model, while we explore one match at a time. Among the holistic approaches, the most closely related to PruSM are DCM (Dual Correlation Mining) [20] and HSM (Holistic Schema Matching) [35], which use attribute occurrence patterns to find complex matchings by mining positive and negative correlations. DCM proposes the H-measure and exploits the apriori property (i.e., downward closure) to first discover all possible positively-correlated groups, then adds these groups to the original schema set and again mines all possible negatively-correlated groups. DCM has a huge search space because there are many possible combinations of positively and negatively correlated groups. Finally, it selects the matches that have the highest negative correlation score and removes matches that are inconsistent with the chosen ones. Su et al. [35] proposed a slightly different correlation measure (matching and grouping scores) and a greedy algorithm to discover synonym matchings between pairs of attributes by iteratively choosing the highest-matching pair, using the grouping score to decide whether two attributes that match the same set of other attributes belong to the same group. Although HSM performs better than DCM, it still suffers from the same limitation: it requires clean data. Besides, the score-centric greedy algorithm can produce incorrect matches with rare attributes, and consequent errors. Instead of choosing the pairs that have the highest matching score, we modify the HSM greedy framework and exploit additional evidence to prudently choose the most confident matches first and integrate them with the existing matches. This helps minimize irreversible incorrect matches and their consequent errors. Moreover, by using validation, PruSM can afford a low matching score (and still find accurate matches), and it overcomes the shortcoming of low grouping scores by propagating the matching score and strengthening the grouping chance of two attributes even when their grouping score is not high enough.
We note that DCM [20] also attempts to deal with noise stemming from incorrect extraction. It does so by collecting multiple sample sets of forms and applying the DCM matchers several times over each sample. The intuition is that when


More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace

Yunfeng Zhang 1, Huan Wang 2, Jie Zhu 1 1 Computer Science & Engineering Department, North China Institute of Aerospace [Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 20 BioTechnology 2014 An Indian Journal FULL PAPER BTAIJ, 10(20), 2014 [12526-12531] Exploration on the data mining system construction

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing

More information

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013 Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Semantic Website Clustering

Semantic Website Clustering Semantic Website Clustering I-Hsuan Yang, Yu-tsun Huang, Yen-Ling Huang 1. Abstract We propose a new approach to cluster the web pages. Utilizing an iterative reinforced algorithm, the model extracts semantic

More information

Content-based Dimensionality Reduction for Recommender Systems

Content-based Dimensionality Reduction for Recommender Systems Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Maintaining Mutual Consistency for Cached Web Objects

Maintaining Mutual Consistency for Cached Web Objects Maintaining Mutual Consistency for Cached Web Objects Bhuvan Urgaonkar, Anoop George Ninan, Mohammad Salimullah Raunak Prashant Shenoy and Krithi Ramamritham Department of Computer Science, University

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Improving the Efficiency of Fast Using Semantic Similarity Algorithm

Improving the Efficiency of Fast Using Semantic Similarity Algorithm International Journal of Scientific and Research Publications, Volume 4, Issue 1, January 2014 1 Improving the Efficiency of Fast Using Semantic Similarity Algorithm D.KARTHIKA 1, S. DIVAKAR 2 Final year

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures Springer Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου

ΕΠΛ660. Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Ανάκτηση µε το µοντέλο διανυσµατικού χώρου Σηµερινό ερώτηµα Typically we want to retrieve the top K docs (in the cosine ranking for the query) not totally order all docs in the corpus can we pick off docs

More information

Collective Entity Resolution in Relational Data

Collective Entity Resolution in Relational Data Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web

Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang Computer Science Department University of Illinois at Urbana-Champaign {kcchang,

More information

A probabilistic model to resolve diversity-accuracy challenge of recommendation systems

A probabilistic model to resolve diversity-accuracy challenge of recommendation systems A probabilistic model to resolve diversity-accuracy challenge of recommendation systems AMIN JAVARI MAHDI JALILI 1 Received: 17 Mar 2013 / Revised: 19 May 2014 / Accepted: 30 Jun 2014 Recommendation systems

More information

UFeed: Refining Web Data Integration Based on User Feedback

UFeed: Refining Web Data Integration Based on User Feedback UFeed: Refining Web Data Integration Based on User Feedback ABSTRT Ahmed El-Roby University of Waterloo aelroby@uwaterlooca One of the main challenges in large-scale data integration for relational schemas

More information

Clustering. Bruno Martins. 1 st Semester 2012/2013

Clustering. Bruno Martins. 1 st Semester 2012/2013 Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 Motivation Basic Concepts

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Creating a Classifier for a Focused Web Crawler

Creating a Classifier for a Focused Web Crawler Creating a Classifier for a Focused Web Crawler Nathan Moeller December 16, 2015 1 Abstract With the increasing size of the web, it can be hard to find high quality content with traditional search engines.

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

Supervised and Unsupervised Learning (II)

Supervised and Unsupervised Learning (II) Supervised and Unsupervised Learning (II) Yong Zheng Center for Web Intelligence DePaul University, Chicago IPD 346 - Data Science for Business Program DePaul University, Chicago, USA Intro: Supervised

More information

Tag-based Social Interest Discovery

Tag-based Social Interest Discovery Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture

More information

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University

Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University Use of KNN for the Netflix Prize Ted Hong, Dimitris Tsamis Stanford University {tedhong, dtsamis}@stanford.edu Abstract This paper analyzes the performance of various KNNs techniques as applied to the

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Frequent Itemsets Melange

Frequent Itemsets Melange Frequent Itemsets Melange Sebastien Siva Data Mining Motivation and objectives Finding all frequent itemsets in a dataset using the traditional Apriori approach is too computationally expensive for datasets

More information

Evaluation of Seed Selection Strategies for Vehicle to Vehicle Epidemic Information Dissemination

Evaluation of Seed Selection Strategies for Vehicle to Vehicle Epidemic Information Dissemination Evaluation of Seed Selection Strategies for Vehicle to Vehicle Epidemic Information Dissemination Richard Kershaw and Bhaskar Krishnamachari Ming Hsieh Department of Electrical Engineering, Viterbi School

More information

5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS

5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS 5. MULTIPLE LEVELS AND CROSS LEVELS ASSOCIATION RULES UNDER CONSTRAINTS Association rules generated from mining data at multiple levels of abstraction are called multiple level or multi level association

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Computer Science 591Y Department of Computer Science University of Massachusetts Amherst February 3, 2005 Topics Tasks (Definition, example, and notes) Classification

More information

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman

Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman Thanks to Jure Leskovec, Anand Rajaraman, Jeff Ullman http://www.mmds.org Overview of Recommender Systems Content-based Systems Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Lab 9. Julia Janicki. Introduction

Lab 9. Julia Janicki. Introduction Lab 9 Julia Janicki Introduction My goal for this project is to map a general land cover in the area of Alexandria in Egypt using supervised classification, specifically the Maximum Likelihood and Support

More information

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values

Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values Introducing Partial Matching Approach in Association Rules for Better Treatment of Missing Values SHARIQ BASHIR, SAAD RAZZAQ, UMER MAQBOOL, SONYA TAHIR, A. RAUF BAIG Department of Computer Science (Machine

More information

Information Integration of Partially Labeled Data

Information Integration of Partially Labeled Data Information Integration of Partially Labeled Data Steffen Rendle and Lars Schmidt-Thieme Information Systems and Machine Learning Lab, University of Hildesheim srendle@ismll.uni-hildesheim.de, schmidt-thieme@ismll.uni-hildesheim.de

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

Mining Semantics for Large Scale Integration on the Web: Evidences, Insights, and Challenges

Mining Semantics for Large Scale Integration on the Web: Evidences, Insights, and Challenges Mining Semantics for Large Scale Integration on the Web: Evidences, Insights, and Challenges Kevin Chen-Chuan Chang, Bin He, and Zhen Zhang Computer Science Department University of Illinois at Urbana-Champaign

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

CTL.SC4x Technology and Systems

CTL.SC4x Technology and Systems in Supply Chain Management CTL.SC4x Technology and Systems Key Concepts Document This document contains the Key Concepts for the SC4x course, Weeks 1 and 2. These are meant to complement, not replace,

More information

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING

CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 41 CHAPTER 3 ASSOCIATON RULE BASED CLUSTERING 3.1 INTRODUCTION This chapter describes the clustering process based on association rule mining. As discussed in the introduction, clustering algorithms have

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

CHAPTER-23 MINING COMPLEX TYPES OF DATA

CHAPTER-23 MINING COMPLEX TYPES OF DATA CHAPTER-23 MINING COMPLEX TYPES OF DATA 23.1 Introduction 23.2 Multidimensional Analysis and Descriptive Mining of Complex Data Objects 23.3 Generalization of Structured Data 23.4 Aggregation and Approximation

More information

Combining Classifiers to Identify Online Databases

Combining Classifiers to Identify Online Databases Combining Classifiers to Identify Online Databases Luciano Barbosa School of Computing University of Utah lbarbosa@cs.utah.edu Juliana Freire School of Computing University of Utah juliana@cs.utah.edu

More information

Siphoning Hidden-Web Data through Keyword-Based Interfaces

Siphoning Hidden-Web Data through Keyword-Based Interfaces Siphoning Hidden-Web Data through Keyword-Based Interfaces Luciano Barbosa * Juliana Freire *! *OGI/OHSU! Univesity of Utah SBBD 2004 L. Barbosa, J. Freire Hidden/Deep/Invisible Web Web Databases and document

More information

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms

A Framework for adaptive focused web crawling and information retrieval using genetic algorithms A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)

CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for

More information

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER

CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER CHAPTER 4 PROPOSED ARCHITECTURE FOR INCREMENTAL PARALLEL WEBCRAWLER 4.1 INTRODUCTION In 1994, the World Wide Web Worm (WWWW), one of the first web search engines had an index of 110,000 web pages [2] but

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs

DIAL: A Distributed Adaptive-Learning Routing Method in VDTNs : A Distributed Adaptive-Learning Routing Method in VDTNs Bo Wu, Haiying Shen and Kang Chen Department of Electrical and Computer Engineering Clemson University, Clemson, South Carolina 29634 {bwu2, shenh,

More information

Classification: Feature Vectors

Classification: Feature Vectors Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12

More information

Visual interfaces for a semantic content-based image retrieval system

Visual interfaces for a semantic content-based image retrieval system Visual interfaces for a semantic content-based image retrieval system Hagit Hel-Or a and Dov Dori b a Dept of Computer Science, University of Haifa, Haifa, Israel b Faculty of Industrial Engineering, Technion,

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Slides for Data Mining by I. H. Witten and E. Frank

Slides for Data Mining by I. H. Witten and E. Frank Slides for Data Mining by I. H. Witten and E. Frank 7 Engineering the input and output Attribute selection Scheme-independent, scheme-specific Attribute discretization Unsupervised, supervised, error-

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW

6 TOOLS FOR A COMPLETE MARKETING WORKFLOW 6 S FOR A COMPLETE MARKETING WORKFLOW 01 6 S FOR A COMPLETE MARKETING WORKFLOW FROM ALEXA DIFFICULTY DIFFICULTY MATRIX OVERLAP 6 S FOR A COMPLETE MARKETING WORKFLOW 02 INTRODUCTION Marketers use countless

More information

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR

SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR SCALABLE KNOWLEDGE BASED AGGREGATION OF COLLECTIVE BEHAVIOR P.SHENBAGAVALLI M.E., Research Scholar, Assistant professor/cse MPNMJ Engineering college Sspshenba2@gmail.com J.SARAVANAKUMAR B.Tech(IT)., PG

More information