XML Data Integration By Graph Restructuring

Lucas Zamboulis and Alexandra Poulovassilis
School of Computer Science and Information Systems
Birkbeck College, University of London
{lucas,ap}@dcs.bbk.ac.uk

Abstract

This technical report describes the XML data integration framework being built within the AutoMed heterogeneous data integration system. It presents a description of the overall framework, as well as an overview of and comparison with related work and solutions by other researchers. The contributions of this research are the development of two algorithms for XML data integration, the first for schema integration and the second for view materialization, both based on graph restructuring.

1 Introduction

In the past ten years, the Internet and the World Wide Web have become an important part of everyday life. However, the Web still receives a limited amount of help from computers. Computers are effective in low-level operations, such as handling large amounts of data, search facilities, and transmitting and displaying data, but lack functionality in more sophisticated tasks, such as the ones envisaged in the Semantic Web vision [].

To remedy this, data has to be put on the Web in a form that machines can understand, or be converted into that form, and the machines must then be provided with the means to process it. XML is the first step towards this end. It is a markup language designed to structure and transmit data in an easy-to-manipulate form, but it is not the total solution: it does not do anything by itself. The second step consists of logic inference tools and tools that automate specific tasks which have been manual up to now. These tools involve the combination of XML with RDF and ontologies so as to enable computers to provide high-level services by communicating with other web services and applications. Other tools are concerned with automating tasks that have so far been expensive and time-consuming, as they require the development of new programs every time.
Such tasks include importing/exporting data from/to XML files, automatic schema matching, and migrating and integrating XML data.

Apart from the research issues concerning the Semantic Web, the advent of XML as a new data format has given rise to new research issues. Efficient storing of XML data has resulted in the evolution of commercial relational databases to support importing and exporting of XML data and also the development of native XML database products. The need for efficient querying of XML data has led to the development of various query languages for XML, subsumed now by the XPath [] and XQuery [4] languages. Moreover, well-studied research issues concerning mostly the relational domain need to be redefined in XML, in order to find domain-specific solutions. At present, there is a lot of research effort on developing XML-specific solutions for the schema matching and data integration problems.

This report describes a framework for XML data integration within the AutoMed heterogeneous data integration system. It presents the basis for this framework, which is a new schema definition language for XML data, and a technique for assigning unique identifiers to XML elements. Two new algorithms have been developed which perform schema integration and view materialization, respectively. The report also discusses related work and compares the work presented here with the work of other researchers.

Report outline: Section 2.1 provides an overview of the AutoMed system, as well as the AutoMed approach to data integration. Section 2.2 presents the schema definition language used for XML data, specifies the representation of the language in terms of AutoMed's Common Data Model and presents the unique identifiers used in our framework. Section 3 presents the schema transformation and the view materialization algorithms and describes the query engine and wrapper architecture. Section 4

reviews related work and compares it with the work presented here. Finally, Section 5 gives some concluding remarks on the framework together with plans for future work.

2 The XML Integration Framework

2.1 Overview of AutoMed

The AutoMed heterogeneous data integration system supports a schema-transformation approach to data integration. Figure 1 shows the general integration scenario. Each data source is described by a data source schema, denoted by LS_i. Each LS_i is transformed into a union-compatible schema US_i by a series of reversible primitive transformations, thereby creating a transformation pathway between a data source schema and its respective union-compatible schema. All the union schemas are syntactically identical and this is asserted by a series of id transformations between each pair US_i and US_{i+1} of union schemas. id is a special type of primitive transformation that matches two syntactically identical constructs in two different union schemas, signifying their semantic equivalence. The transformation pathway containing these id transformations can be automatically generated. An arbitrary one of the union schemas can then be designated as the global schema GS, or selected for further transformation into a new schema that will become the global schema.

Figure 1: The AutoMed integration approach

The transformation of a data source schema into a union schema is accomplished by applying a series of primitive transformations, each making a delta change to the current schema by either adding, deleting or renaming one schema construct. Each add and delete transformation is accompanied by a query specifying the extent of the newly added or deleted construct in terms of the other schema constructs. This query is expressed in AutoMed's Intermediate Query Language, IQL [, ]. The result is a sequence of intermediate schemas, connecting each data source schema to its respective union schema.
The query supplied with a primitive transformation defines the new or removed schema construct in terms of the other schema constructs, and thus provides the necessary information to make primitive transformations automatically reversible [7]. This means that AutoMed is a both-as-view (BAV) data integration system []. It subsumes the LAV and GAV approaches, as it is possible to extract a definition of the global schema as a view over the data source schemas, and it is also possible to extract definitions of the data source schemas as views over the global schema. (Henceforth we use the term union schema to mean union-compatible schema.)

In Figure 1, each US_i may contain information that cannot be derived from the corresponding LS_i. These constructs are not inserted in the US_i through an add transformation, but rather through an extend transformation. This takes a pair of queries that specify a lower and an upper bound on the extent of the new construct. The lower bound may be Void and the upper bound may be Any, which respectively indicate no known information about the lower or upper bound of the extent of the new construct. There may also be information present in a data source schema LS_i that should not be present within the corresponding US_i, and this is removed with a contract transformation, rather

than with a delete transformation. Like extend, contract takes a pair of queries specifying a lower and upper bound on the extent of the deleted construct.

Figure 2 shows the AutoMed integration approach in an XML setting. Each XML data source is described by an XML DataSource Schema (a simple schema definition language presented in Section 2.2), S_i, and is transformed into an intermediate schema, S'_i, by means of a series of primitive transformations that insert, remove, or rename schema constructs. The union schemas US_i are then automatically produced, and they extend each S'_i with the constructs of the rest of the intermediate schemas. After that, the id transformation pathways between each pair US_i and US_{i+1} of union schemas are also automatically produced.

Our XML integration framework supports both top-down and bottom-up schema integration. With the top-down approach, described in Section 3.1.1, the global schema is predefined, and the data source schemas are restructured to match its structure. With the bottom-up approach, described in Section 3.1.2, the global schema is not predefined and is automatically generated.

Figure 2: XML integration in AutoMed
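The reversibility of pathways described in this section can be illustrated with a small sketch. It assumes the usual BAV pairing, in which add is reversed by delete, extend by contract, and a rename by the opposite rename, with the same queries carried over. The `Step` class and both function names are hypothetical and are not part of the AutoMed API; queries are kept as opaque strings.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical names for illustration only; this is not the AutoMed API.
REVERSE_KIND = {"add": "delete", "delete": "add",
                "extend": "contract", "contract": "extend"}


@dataclass(frozen=True)
class Step:
    kind: str                        # add, delete, extend, contract or rename
    construct: str                   # scheme of the affected construct
    query: Optional[str] = None      # extent query (add / delete)
    lower: Optional[str] = None      # lower-bound query (extend / contract)
    upper: Optional[str] = None      # upper-bound query (extend / contract)
    new_name: Optional[str] = None   # target name (rename only)


def reverse_step(step: Step) -> Step:
    """Return the primitive transformation that undoes `step`."""
    if step.kind == "rename":
        # Renaming back swaps the old and the new name.
        return Step("rename", step.new_name, new_name=step.construct)
    # The queries are reused unchanged by the opposite transformation.
    return Step(REVERSE_KIND[step.kind], step.construct,
                step.query, step.lower, step.upper)


def reverse_pathway(pathway: List[Step]) -> List[Step]:
    """Reverse a pathway: undo the steps in the opposite order."""
    return [reverse_step(step) for step in reversed(pathway)]
```

Reversing a pathway twice yields the original pathway, which is what allows the same pathway to be read in either direction between a data source schema and its union schema.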

2.2 Schema for Representing XML Sources

XML files may or may not have a DTD [0] or XML Schema [] associated with them. These are complex grammars to which the file must conform. In a data integration setting, a schema definition language this complex is not necessary. However, the file's structure is crucial for schema and data integration and in optimizing the query processing algorithms. Also, it is possible that the XML file has no referenced DTD or XML Schema. For these reasons, we introduce the XML DataSource Schema, which is automatically derivable from an XML file, and abstracts only its structure.

To obtain an XML file's DataSource Schema, we copy the file's DOM representation (see []) into memory, then we modify it to become the DataSource Schema, as specified by the following algorithm:

1. Get the root R. If it has child nodes, get its list of children, L.
   (a) Get the first node in L, N. For every other node N' in L that has the same tag as N do:
       - Copy any of the attributes of N' not present in N to N.
       - Make a deep copy of the list of children of N' and append them to the list of children of N. Deep means that the copy contains the whole tree whose root is the copied node.
       - Delete N' and its subtree.
   (b) Get the next child from L and process it in the same way as the first child, N, in step (a).
2. R now has a new list of children L_new. Apply step (1) for every node N_new in L_new.

AutoMed has as its Common Data Model a Hypergraph Data Model (HDM) [4]. This is a low-level data model that can represent higher-level modelling languages such as ER, relational, object-oriented and XML [8, 9]. HDM schemas consist of nodes, edges and constraints. Constructs of higher-level schemas are identified by their scheme (see below). The selection of a low-level common data model for AutoMed was intentional, so as to be able to better represent high-level modelling languages without semantic mismatches or ambiguities.
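The schema-derivation steps given earlier in this section can be sketched as follows. This is a minimal illustration and not the AutoMed implementation: it uses Python's `xml.etree.ElementTree` rather than a DOM tree, the function names are hypothetical, and it leaves text content and attribute values untouched because only tags and attribute names matter for the structure.

```python
import copy
import xml.etree.ElementTree as ET


def merge_same_tag_children(node):
    """Steps 1(a)-(b): fold each later sibling into the first one with its tag.

    Returns the new child list (L_new in the text).
    """
    first_by_tag = {}
    merged = []
    for child in list(node):              # iterate over a snapshot of L
        first = first_by_tag.get(child.tag)
        if first is None:                 # first node with this tag: keep it
            first_by_tag[child.tag] = child
            merged.append(child)
            continue
        # Copy the attributes of N' that N does not have yet.
        for name, value in child.attrib.items():
            first.attrib.setdefault(name, value)
        # Append deep copies of the children of N' to the children of N.
        for grandchild in child:
            first.append(copy.deepcopy(grandchild))
        # Delete N' together with its subtree.
        node.remove(child)
    return merged


def to_source_schema(root):
    """Step 2: repeat the merge for every node of the new child list."""
    for child in merge_same_tag_children(root):
        to_source_schema(child)
    return root


def derive_schema(xml_text):
    """Parse an XML string and reduce it, in place, to its structure."""
    return to_source_schema(ET.fromstring(xml_text))
```

For a file whose root W holds two X elements, one with children A, C and one with children A, B, C, `derive_schema` leaves a single X whose children are A, C, B, in that order, because merged children are appended after the existing ones.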
Table 1: XML DataSource Schema representation in terms of HDM

Higher Level Construct                    Equivalent HDM Representation

Construct  element                        Node   <<xml:e>>
Class      nodal
Scheme     <<e>>

Construct  attribute                      Node   <<xml:e:a>>
Class      nodal-linking, constraint      Edge   <<_, xml:e, xml:e:a>>
Scheme     <<e, a>>                       Links  <<xml:e>>
                                          Cons   makeCard(<<_, xml:e, xml:e:a>>, 0:1, 1:N)

Construct  nest-list                      Edge   <<xml:i, e_p, xml:e_c>>
Class      linking, constraint            Cons   makeCard(<<xml:i, e_p, xml:e_c>>, 0:N, 1:1)
Scheme     <<e_p, e_c, i>>

Construct  pcdata                         Node   <<xml:pcdata>>
Class      nodal
Scheme     <<pcdata>>

Table 1 shows the representation of XML DataSource Schema constructs in terms of the HDM. XML DataSource Schemas consist of four constructs:

1. An element e can exist by itself and is a nodal construct. It is represented by the scheme <<e>>.
2. An attribute a belonging to element e is a nodal-linking construct. In terms of the HDM this means that an attribute is represented by a node representing the attribute with scheme <<xml:e:a>>, an edge linking the attribute node to its owner element with scheme <<_, xml:e, xml:e:a>>, and a cardinality constraint that states that an instance of e can have at most one instance of a associated with it, while an instance of a can be associated with one or more instances of e.
3. The parent-child relationship between two elements e_p and e_c is a linking construct with scheme <<e_p, e_c, i>>, where i is the order of e_c in the list of children of e_p. In terms of the HDM, this is represented by an edge between e_p and e_c and a cardinality constraint that states that each

instance of e_p is associated with 0 or more instances of e_c, while an instance of e_c is associated with precisely one instance of e_p.
4. Text in XML is represented by the PCDATA construct. This is a nodal construct with scheme <<pcdata>>. In any schema, there is only one PCDATA construct. To link the PCDATA construct with an element we treat it as an element and use the nest-list construct.

Note that this is somewhat different from the XML schema language given in [9]. In our model here, we make specific the ordering of children elements under a common parent in XML DataSource Schemas, whereas this was not captured by the model in [9]. Also, in that paper it was assumed that the extents of schema constructs are sets, and therefore the extra constructs order and nest-set were required to represent, respectively, the ordering of children nodes under parent nodes, and parent-child relationships where ordering is not significant. Here, we make use of the fact that IQL is inherently list-based, and thus use only one nest-list construct. The n-th child of a parent node can be specified by means of a query specifying the corresponding nest-list, and the requested node will be the n-th item in the IQL result list.

After a modelling language has been defined in terms of HDM via the API of AutoMed's Model Definition Repository [4], a set of primitive transformations is automatically available for the transformation of the schemas defined in the language. The XML DataSource Schema definition language consists of four different constructs, namely element, attribute, nest-list and pcdata. The available transformations on XML DataSource Schemas are shown in Table 2. The lowerBound and upperBound parameters in the extend and contract transformations are queries that partially specify the extent of the construct being inserted/removed.
Table 2: XML primitive transformations

Insert primitive transformations
  addElem(schema, elemNode, query)
  extendElem(schema, elemNode, lowerBound, upperBound)
  addAtt(schema, elemNode, attNode, query)
  extendAtt(schema, elemNode, attNode, lowerBound, upperBound)
  add(schema, query)
  extend(schema)
  addList(schema, parent, child, position, query)
  extendList(schema, parent, child, position, lowerBound, upperBound)

Remove primitive transformations
  deleteElem(schema, elemNode, query)
  contractElem(schema, elemNode, lowerBound, upperBound)
  deleteAtt(schema, elemNode, attNode, query)
  contractAtt(schema, elemNode, attNode, lowerBound, upperBound)
  delete(schema, query)
  contract(schema)
  deleteList(schema, parent, child, query)
  contractList(schema, parent, child, position, lowerBound, upperBound)

Rename primitive transformations
  renameElem(schema, elemNode, newName)
  renameAtt(schema, elemNode, attNode, newName)
  renameList(schema, nestList, position)

XML DataSource Schemas are very similar to DataGuides. According to [9], the benefit of having a DataGuide is threefold: it defines the data structure, it helps users understand the structure of the database and form queries over it, and it helps the query processor devise efficient query plans for computing query results. XML DataSource Schema also fulfills these aims. The main reason for creating a new schema definition language is simplicity: DataGuides are OEM graphs, whereas XML DataSource Schemas are trees. This means that they are very easy to parse, traverse and manipulate. A more detailed comparison between XML DataSource Schemas and DataGuides is given in Section 2.2.1.

A problem when dealing with XML DataSource Schemas is that multiple XML elements can have the same name. The problem is amplified when dealing with multiple files, as in our case. To resolve such ambiguities, a new technique for assigning unique identifiers had to be implemented.
For XML DataSource Schemas, the assignment technique is schemaName:elementName:count, where schemaName is the name of the DataSource Schema as defined in the AutoMed repository and count is a counter incremented every time the same elementName is encountered in a depth-first traversal of the schema. Attributes are then identified as elementUID:attributeName.
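The schema-level identifier scheme can be sketched as below. This is an illustration under stated assumptions: the function name `assign_schema_uids` is hypothetical, the schema is held as an `xml.etree.ElementTree` tree, and the counter is assumed to start at 1.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def assign_schema_uids(schema_name, root):
    """Return (element UIDs, attribute UIDs) in depth-first order."""
    counts = Counter()      # occurrences seen so far, per element name
    element_uids = []
    attribute_uids = []

    def visit(elem):
        counts[elem.tag] += 1               # assumed to start counting at 1
        uid = f"{schema_name}:{elem.tag}:{counts[elem.tag]}"
        element_uids.append(uid)
        # Attributes are identified as elementUID:attributeName.
        attribute_uids.extend(f"{uid}:{name}" for name in elem.attrib)
        for child in elem:                  # children are visited depth-first
            visit(child)

    visit(root)
    return element_uids, attribute_uids


schema = ET.fromstring(
    '<root><author id=""><book/></author><editor><book/></editor></root>')
elements, attributes = assign_schema_uids("S1", schema)
```

In this example the two book elements receive distinct identifiers, S1:book:1 and S1:book:2, even though they share a name, and the id attribute is identified as S1:author:1:id.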

Figure 3: Two different XML files conforming to the same XML DataSource Schema.

In order to capture node identity in XML data, we use a similar technique: the unique identifiers for nodes are of the form schemaName:elementName:count:instance, where instance is a counter incremented every time a new instance of the corresponding schema element is encountered.

2.2.1 XML DataSource Schema vs. DataGuides

The concept of XML DataSource Schema is very similar to DataGuides. According to [9], the benefit of having a DataGuide is threefold: it defines the structure of the data, it enables users to understand the structure of the database and form meaningful queries over it, and it helps the query processor devise efficient query plans for computing query results. Looking at these aims, we can see that XML DataSource Schema also fulfills them. The main reason for creating a new schema definition language is simplicity: DataGuides are OEM graphs, whereas XML DataSource Schemas are XML trees. This means that they are very easy to parse, traverse and manipulate.

A difference between the two types of schemas is that a source can have many DataGuides, but only one XML DataSource Schema. On the other hand, both types of schemas may correspond to more than one data source. For example, the sources in Figure 3 have the same XML DataSource Schema, shown in the upper right corner, even though they are not the same.

Another difference between XML DataSource Schemas and DataGuides is the way they handle the ordering of child elements. In [0], the authors define the problem and suggest three different approaches. XML DataSource Schema does not use any of these techniques, as it does not try to solve the ordering problem.
The reason for this is that XML DataSource Schema is used only for single files, contrary to DataGuides. Of course, even in single files there is the issue of an element having the same child elements with different ordering in different instances. Consider the following case:

    <W>
      <X><A><B/><C/></A></X>
      <X><A><C/><B/></A></X>
    </W>

Even in this case, there is no need to try and find the best ordering possible. The fact that an element does not present a specific policy on its children's ordering means that there isn't one, so there

is no reason to try and enforce one. This also agrees with the XML Schema specification, which either enforces a strict ordering policy, or none at all.

One thing that will probably be added to the algorithm producing the XML DataSource Schema is the ability to preserve a file's ordering policy, if there is one, even in the presence of optional elements. Consider the following example:

    <W>
      <X><A/><C/></X>
      <X><A/><B/><C/></X>
    </W>

This shows that a possible scenario is that the file has an ordering policy, namely first element A, then B, then C, but B is optional. The algorithm, if not changed, will create the following schema:

    <W>
      <X><A/><C/><B/></X>
    </W>

Of course, this could be very easily detected if the file references an XML Schema. This is a topic for future work, as discussed in Section 5.

3 Framework Components

The main aim of our research is to develop semi-automatic methods for generating the schema transformation pathways shown in Figure 2. This includes two aspects. The first is the matching of individual elements, also known as schema matching [5, 5], using for example data mining techniques, or semantic mappings to ontologies. Both approaches can be used to automatically generate fragments of AutoMed transformation pathways; see for example [7]. The second is graph restructuring, which is applied to restructure the heterogeneous XML DataSource Schemas into a uniform structure.

Once the semantic equivalences between schema constructs have been identified with schema matching, our framework integrates the data source schemas by transforming each one into its respective union schema (see Figure 2). This schema transformation process is accomplished by an algorithm that automatically creates the transformation pathway from a data source schema to its corresponding union schema. The algorithm, described in Section 3.1, is based on the restructuring of XML DataSource Schemas.

Once several sources have been integrated under a virtual global schema, this can be used for querying the data sources via the XMLWrapper, as described in Section 3.2, or for materializing the data from one or more data sources.
This view materialization algorithm is described in Section 3.3.

3.1 Schema Transformation Algorithm

The schema transformation algorithm can be applied in two ways: top-down, where there is a global schema and the schemas of the data sources are transformed into the global schema, regardless of any loss of information; or bottom-up, where there is no global schema and the data of all the data sources are preserved. Both approaches create the transformation pathways that produce intermediate schemas with identical structure. These schemas are then automatically transformed into the union schemas US_i of Figure 2, including the id transformation pathways between them. The transformation pathway from one of the US_i to GS can then be produced in one of two ways: either automatically, using append semantics, or semi-automatically, in which case the queries supplied with the transformations that specify the integration policy need to be supplied by the user.

By append semantics we mean that the lists containing the extents of the constructs of GS are created by appending the corresponding constructs of US_1, US_2, ..., US_n in turn. Thus, if the XML data sources were integrated in a different order, the extent of each construct of GS would contain the same instances, but their ordering would be different.

3.1.1 Top-down approach

Consider a setting where a global schema GS is given, and the data source schemas need to be conformed to it, without necessarily preserving their information capacity. Our algorithm works in two phases.

In the growing phase, the global schema GS is traversed and every construct not present in a data source schema S_i is inserted. In the shrinking phase, each schema S_i is traversed and any construct not present in the global schema is removed. These two phases represent the fact that first the source schemas are augmented with constructs from the global schema, then they are reduced by removing the redundant constructs. However, some removals also occur in the first phase. This is in order to reduce the cost of the traversal of the schemas: if a necessary removal is detected in the first phase, it is cheaper to issue the transformation at that stage than to detect it again in the second phase.

The algorithm to transform an XML DataSource Schema S_1 to have the same structure as an XML DataSource Schema S_2 is described as follows. This algorithm considers an element in S_1 to be equivalent to an element in S_2 if they have the same element name. As specified below, the algorithm assumes that element names in both S_1 and S_2 are unique. We discuss shortly the necessary extensions to cater for cases when this does not hold.

Growing phase: consider every element E in S_2 in a depth-first order.

1. If E does not exist in S_1:
   (a) Search the attributes in S_1 to find one whose name is the same as the name of E in S_2.
       i. If such an attribute a is found, add E with the extent of a and add an edge from the element in S_1 equivalent to owner(a, S_1) to E. Then, insert the attributes of E from schema S_2 as attributes to the newly inserted element E in S_1 with add or extend transformations, depending on whether it is possible to describe the extent of an attribute using the rest of the constructs of S_1. Then, delete a. This situation is illustrated in case 1 in Figure 4.
       ii. Otherwise, insert E with an extend transformation. Then find the equivalent element of parent(E, S_2) in S_1 and add an edge from it to E with an extend transformation.
           Next, insert the attributes of E from S_2 with add or extend transformations, depending on whether it is possible to describe the extent of an attribute using the rest of the constructs of S_1 (case 2 in Figure 4).
   (b) If E is linked to the PCDATA construct in S_2:
       i. If S_1 does not contain the PCDATA construct, insert it with an extend transformation.
       ii. Insert an edge from E to the PCDATA construct. The transformation is an add if E was inserted with an add transformation and there was already a PCDATA construct in S_1 before the application of the algorithm. In any other case, the transformation is an extend.
2. If E exists in S_1 and parent(E, S_2) = parent(E, S_1) (case 3 in Figure 4):
   (a) If E in S_2 has attributes that E in S_1 does not contain, insert them with add or extend transformations, depending on whether it is possible to describe their extents using other constructs of S_1.
   (b) If E is linked to the PCDATA construct in S_2 and there was already a PCDATA construct in S_1 before the application of the algorithm, add an edge from E to the PCDATA construct, otherwise extend the edge.
3. If E exists in S_1 and parent(E, S_2) ≠ parent(E, S_1):
   (a) Insert an edge from E_P to E, where E_P is the equivalent element of parent(E, S_2) in S_1. This insertion can either be an add or an extend transformation, depending on the path from E_P to E. The algorithm finds the shortest path from E_P to E and, if it includes only parent-to-child edges, then the transformation is an add, otherwise it is an extend. To explain this, suppose that the path contains at some point an edge (B, A), where actually, in S_1, element A is the parent of element B. It may be the case that in the data source of S_1 there are some instances of A that do not have instances of B as children. This means that, when migrating data from the data source of S_1 to schema S_2, some data will be lost, specifically those A instances without any B children. To remedy this, the extend transformation is issued with both a lower bound and an upper bound query. The lower bound query retrieves the actual data from the data source of S_1, but perhaps losing some data because of the problem just described.
       The upper bound query retrieves all the data that the lower bound query retrieves, but also generates new instances of B (with unique IDs) that are needed in order to preserve the

       instances of A that the lower bound query was not able to. Because such behavior may not always be desired, the user has the option of telling the algorithm to just use Any as the upper bound query in such cases. In Figure 4, case 4 illustrates a situation where the edge from E_P to E is inserted with an add transformation, whereas in cases 5 and 6 it is inserted with an extend transformation. Case 6 in particular represents an edge flip.

Shrinking phase: traverse S_1 and remove any constructs not present in S_2. The transformation is a delete or a contract one, depending on whether or not, respectively, it is possible to describe the extent of the construct with the remaining constructs of the schema.

Figure 4: Example cases for the schema transformation algorithm.

The algorithm presented above assumes that element names in both S_1 and S_2 are unique. In general, this may not be the case and we may have (a) multiple occurrences of an element name in S_1 and a single occurrence in S_2, or (b) multiple occurrences of an element name in S_2 and a single occurrence in S_1, or (c) multiple occurrences of an element name in both S_1 and S_2.

For case (a), suppose that in S_1 there are three elements S1:employees:1, S1:employees:2, S1:employees:3, and in S_2 there is a single element S2:employees:1. The algorithm then needs to generate a query that constructs the extent of the single element in S_2 by combining the extents of all three elements from S_1.

For case (b), suppose that in S_1 there is a single element S1:employees:1 while in S_2 there are three elements S2:employees:1, S2:employees:2, S2:employees:3. Then the algorithm needs to make a choice of which of these elements to migrate the extent of S1:employees:1 to. For this, a heuristic can be applied which favours (i) paths with the fewest extend steps, and (ii) the shortest of such paths.
For case (c), suppose that in S_1 there are three elements S1:employees:1, S1:employees:2, S1:employees:3, and in S_2 there are also three elements S2:employees:1, S2:employees:2, S2:employees:3. Then a combination of the solutions for (a) and (b) needs to be applied.
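The two phases of the top-down algorithm can be sketched for a deliberately reduced setting. The sketch assumes unique element names, ignores attributes, PCDATA and the IQL queries, and represents each schema as a hypothetical child-to-parent dictionary with a single root. Removals are reported generically as "remove", since choosing between delete and contract depends on the queries, which are not modelled. In a tree the path between two nodes is unique, so it consists only of parent-to-child edges exactly when the new parent is an ancestor of the element; the sketch uses that test on the original S_1.

```python
def ancestors(parent_of, node):
    """Yield the ancestors of `node`, nearest first, in a child -> parent map."""
    node = parent_of.get(node)
    while node is not None:
        yield node
        node = parent_of.get(node)


def depth_first(parent_of):
    """Return the element names of a single-rooted schema in depth-first order."""
    children, root = {}, None
    for child, parent in parent_of.items():
        if parent is None:
            root = child
        else:
            children.setdefault(parent, []).append(child)
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(children.get(node, [])))
    return order


def restructure(s1, s2):
    """Plan the element and edge steps that give s1 the structure of s2."""
    steps = []
    # Growing phase: walk the target schema s2 depth-first.
    for elem in depth_first(s2):
        new_parent = s2[elem]
        if elem not in s1:
            # Missing element: extend it, then extend the edge from its parent.
            steps.append(("extend", "element", elem))
            if new_parent is not None:
                steps.append(("extend", "edge", (new_parent, elem)))
        elif new_parent is not None and s1[elem] != new_parent:
            # Existing element under a different parent: add the edge only if
            # the path from the new parent uses parent-to-child edges alone.
            kind = "add" if new_parent in ancestors(s1, elem) else "extend"
            steps.append((kind, "edge", (new_parent, elem)))
    # Shrinking phase: remove the s1 edges, then the s1 elements, absent in s2.
    for elem in depth_first(s1):
        if s1[elem] is not None and s2.get(elem) != s1[elem]:
            steps.append(("remove", "edge", (s1[elem], elem)))
    for elem in depth_first(s1):
        if elem not in s2:
            steps.append(("remove", "element", elem))
    return steps
```

For a chain A, C, B restructured so that B becomes a direct child of A, `restructure({"A": None, "C": "A", "B": "C"}, {"A": None, "B": "A"})` plans one added edge from A to B followed by the removal of the two old edges and of C. Flipping an edge, as in `restructure({"A": None, "B": "A"}, {"B": None, "A": "B"})`, plans an extend for the new edge instead.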

Table 3: Transformations for Figure 4 (transformation issued at each growing-phase step g and shrinking-phase step s)

Case 1   g1: addElem       g2: addList       g3: addAtt        g4: extendAtt (lower bound Void)   g5: deleteAtt
         s1: contractList  s2: deleteElem
Case 2   g1: extendElem (lower bound Void)   g2: extendList (lower bound Void)   g3: addAtt   g4: extendAtt (lower bound Void)
         s1: contractList  s2: deleteElem
Case 3   g1: extendAtt (lower bound Void)
Case 4   g1: addList
         s1: contractList  s2: contractList  s3: contractElem (lower bound Void)   s4: renameList
Case 5   g1: extendList
         s1: deleteList
Case 6   g1: addList       g2: extendList
         s1: deleteList    s2: renameList

3.1.2 Bottom-up approach

In this approach, a global schema GS is not present and is produced automatically from the source schemas, without loss of information. In order to integrate the data sources, a slightly different version of the schema transformation algorithm is applied to the data source schemas in a pairwise fashion, in order to derive each one's union-compatible schema (Figure 5). The data source schemas LS_i are transformed into intermediate schemas, IS_i, so that they have the same structure. Then, the union schemas, US_i, are produced along with the id transformations.

To start with, the intermediate schema of the first data source schema is itself, LS_1 = IS_1^1. Then, the schema transformation algorithm is employed on IS_1^1 and LS_2 (see annotation 1 in Figure 5). The algorithm augments IS_1^1 with the constructs from LS_2 it does not contain.
It also restructures LS_2 to match the structure of IS_1^1, also augmenting it with the constructs from IS_1^1 it does not contain. As a result, IS_1^1 is transformed to IS_1^2, while LS_2 is transformed to IS_2^1. The same process is performed between IS_2^1 and LS_3, resulting in the creation of IS_2^2 and IS_3^1 (annotation 2). The algorithm is then applied between IS_1^2 and IS_2^2, resulting only in the creation of IS_1^3, since this time IS_1^2 does not have any constructs IS_2^2 does not contain (annotation 3). The remaining intermediate schemas are generated in the same manner: to produce schema IS_i^1, the schema transformation algorithm is employed on IS_{i-1}^1 and LS_i, resulting in the creation of IS_{i-1}^2 and IS_i^1; all other intermediate schemas except IS_{i-1}^2 and IS_i^1 are then extended with the constructs of LS_i they do not contain. Finally, we automatically generate the union schemas, US_i, the id transformations between them, and the global schema by applying append semantics. The bottom-up integration of the data sources of Figure 2 is

shown in Figure 6.

Figure 5: XML DataSource Schema integration

3.2 Querying XML Files

After the generation of transformation pathways by either top-down or bottom-up integration, queries posed on the global schema can be evaluated. A query sent to AutoMed's query engine is first processed by the query processor, which is responsible for reformulating the input query into queries suitable for the data sources. This is accomplished by following the reverse transformation pathways from the global schema to the data source schemas. Each time a delete transformation is encountered, the query processor replaces any occurrences of the deleted scheme by the query supplied with the delete transformation. As a result, the original query is turned into a query with multiple branches, one suitable for each data source; see []. Note that, for the moment, querying of XML files is performed by DOM traversal. Future plans include XPath and XQuery support.

AutoMed's query engine and wrapper architecture are displayed in Figure 7. The AutoMedWrapperFactory and AutoMedWrapper classes are abstract classes providing some implementation, while the XMLWrapperFactory and XMLWrapper classes implement the remaining abstract methods. Factories deal with model-specific aspects, like primary keys for relational databases. The XMLWrapperFactory class contains a validating switch. When it is on, the parsing of the XML file that the XMLWrapper object is attached to is performed by consulting the DTD or XML Schema the file references. A number of switches, such as a switch for collapsing whitespace, will be added in the future. As Figure 7 suggests, the whole architecture is extensible with wrappers for new data source models.

3.3 View Materialization Strategy

After the creation of the transformation pathways between the data source schemas and the global schema, there exists a direct connection between the data residing in the data sources and GS, through the transformation pathways and the queries they contain.
Our framework provides an algorithm that can materialize GS into a new XML file. The algorithm traverses GS in a depth-first fashion, and obtains the necessary data by evaluating the individual schema constructs of GS as global queries. An issue that arises during this process is determining the correct parent-child relationships, so that the resulting XML file precisely reflects the integration semantics. Our algorithm uses the edge constructs and the schema- and instance-level unique IDs for this purpose. After materializing a schema element, E_P say, the algorithm retrieves the edge schema constructs that have E_P as the parent node. It retrieves the extent of each of these constructs in turn, and inserts the child instances under the appropriate parent instances of E_P, as indicated by the instance-level unique IDs.

Figure 6: Bottom-up integration of the example data sources

Figure 7: AutoMed's query engine & wrapper architecture

4 Related Work

There has been a lot of research effort concentrated on solving issues concerning XML and its connectivity with other models, mainly the relational model. The framework presented here aims at creating a complete solution for the integration of XML data, by focusing on finding XML-specific solutions.

Schema matching is a problem well studied in a relational database setting. A seminal paper on schema matching, focusing on relational databases but also outlining the general principles of the problem, is [15]. A more recent survey, focused on an XML setting, is [25]. In [14], the schema authors themselves provide the semantics of the schema elements, by supplying mappings from elements of their schemas to a global ontology. These mappings are then used for query reformulation to produce data-source-specific queries; because the ontology drives the reformulation, there is no need for a global schema. Other similar approaches are the ones described in [] and []. The major problem with these approaches, and with the approaches of this category as a whole, is that if sources A and B are mapped together and A does not have some of the elements of B, then a user knowing only about schema A will never know about these elements.

Concerning schema integration, DIXSE [28] transforms the DTD specifications of the source documents into an inner conceptual representation, with some heuristics to capture semantics. Most of the work, though, is done semi-automatically by domain experts who augment the conceptual schema with semantics. [26] has an abstract global DTD, expressed as a tree, very similar to a global ontology.
The connection between this DTD and the DTDs of the data sources is through path mappings: each path between two nodes in a source DTD is mapped to a path in the abstract DTD. Query rewriting is then employed to query the sources. [13] applies schema matching techniques on input DTDs, in order to create an integrated DTD from the source DTDs. The framework presented in this technical report approaches the schema integration problem using graph restructuring and is a purely XML solution. Furthermore, our approach allows for the use of multiple types of schema matching methods (use of ontologies, provision of semantics in the form of RDF, data mining), which can all serve as input to the schema integration algorithm.

In the context of XML views, an XML-specific tool is Active Views [1], which has advanced features such as active rules for view updates, but is semi-automatic in that the user must program the creation of the view in a high-level language. WHAX [16] also defines views programmatically, using WHAX-QL. Xyleme [6] offers automated view creation via tag, DTD and path mappings: the system exploits these stored mappings to create the view whenever a user specifies its DTD. A similar approach is followed by []. A more straightforward approach to XML data integration is through XQuery [34]. The problem with this approach is automation: a user has to define the view programmatically instead of just defining its schema. MIX [2] and UXQuery [29] follow this approach, the former using its own query language (XMAS) and the latter using a subset of the XQuery language.

To our knowledge, there is no approach that considers the XML-specific problem of ordering policy when materializing views over multiple sources. Therefore all views are created by appending elements at the end of the parent's list of children. SilkRoute [8] implements an XML view materialization approach that supports ordering, but does so because the input is relational data, which is ordered using the SQL ORDER BY clause for all results prior to materialization. The same applies to XPERANTO [5], which provides a "pure XML" middleware on top of an object-relational database. The term "pure" is used because users need to know nothing beyond XML technologies to create and query views.

5 Concluding Remarks

This report presented a framework for the integration of XML data within the AutoMed heterogeneous data integration system. Assuming that a schema matching process has already occurred and has identified equivalent individual schema constructs, our schema transformation and view materialization algorithms integrate and materialize XML data sources automatically. Our algorithms make use of a simple schema definition language for XML data and a technique for assigning unique IDs to schema- and instance-level elements, both developed specifically for the purpose of XML data integration.
The novelty of our algorithms lies in the use of XML-specific graph restructuring techniques applied to XML schemas. Future work will extend the schema transformation algorithm to cater for cases where there are multiple occurrences of an element name within an XML DataSource Schema, as discussed earlier in this report.

We note that our schema transformation algorithm can also be applied in a peer-to-peer setting. Suppose there is a peer P_T that needs to query XML data stored at a peer P_S. We can consider P_S as the peer whose XML DataSource Schema needs to be transformed to the XML DataSource Schema of peer P_T. After application of our schema transformation algorithm, P_T can then query P_S for the data it needs via its own schema, since AutoMed's query evaluator can treat the schema of P_T as the global schema and the schema of P_S as the local schema.

Evolution of applications or changing performance requirements may cause the schema of an XML data source to change. In the AutoMed project, research has already focused on the schema evolution problem, both in the context of virtual data integration [20, 21] and materialized data integration [7]. For future work we will investigate the application of these general solutions specifically to the case of XML data. The main advantage of AutoMed's both-as-view approach in this context is that it is based on pathways of reversible schema transformations. This enables the development of algorithms that update the transformation pathways and the global schema, instead of having to regenerate them, when data source schemas are modified. These algorithms can be fully automatic if the information content of a data source schema contracts or remains the same, but require domain knowledge or human intervention if its information content expands.

The materialization algorithm also opens up several issues. One problem is that of respecting the constraints of the data source schemas when creating the integrated XML file. For this, we can exploit constraints supplied within a DTD/XML Schema.
However, an XML file may not reference a DTD/XML Schema, or the authors may not exploit the full capabilities of these languages. Moreover, even if such constraints exist, they determine intra-schema rather than inter-schema relationships. In cases of ambiguity, global schema constraints must be supplied. Another issue is supporting partial re-materialization of the global schema after one or more data source schemas evolve.

The schema definition language used in our framework, XML DataSource Schema, will also be extended to capture the semantics of optional elements. This is because, if a data source at first does not contain some optional elements, attributes or sections, then when these optional data appear the XML DataSource Schema describing the data source will no longer be valid, and the problem will present itself as a schema evolution problem.

Finally, the framework currently assumes that the data sources are single XML files, and therefore that each one is described by one XML DataSource Schema. However, we aim to include native XML databases as data sources in the future. In such a setting, a single data source may consist of multiple very similar XML files. The algorithm producing the XML DataSource Schema must be extended to handle such a case.

References

[1] S. Abiteboul, B. Amann, S. Cluet, E. Eyal, L. Mignet, and T. Milo. Active Views for Electronic Commerce. In Proc. VLDB'99, pages 138-149, 1999.
[2] C. Baru, A. Gupta, B. Ludäscher, R. Marciano, Y. Papakonstantinou, P. Velikhov, and V. Chu. XML-based information mediation with MIX. In Proc. SIGMOD'99, pages 597-599, 1999.
[3] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 17 2001.
[4] M. Boyd, P. McBrien, and N. Tong. The AutoMed Schema Integration Repository. In Proceedings of the 19th British National Conference on Databases, pages 42-45. Springer-Verlag, 2002.
[5] M. Carey, D. Florescu, Z. Ives, Y. Lu, J. Shanmugasundaram, E. Shekita, and S. Subramanian. XPERANTO: Publishing Object-Relational Data as XML. In WebDB (Informal Proceedings), pages 105-110, 2000.
[6] S. Cluet, P. Veltri, and D. Vodislav. Views in a Large Scale XML Repository. In The VLDB Journal, pages 271-280, 2001.
[7] H. Fan and A. Poulovassilis. Using AutoMed metadata in data warehousing environments. In Proc. Int. Workshop on Data Warehousing and OLAP (DOLAP'03), New Orleans, November 2003.
[8] M. Fernandez, A. Morishima, and D. Suciu. Efficient Evaluation of XML Middle-ware Queries. In SIGMOD Conference, 2001.
[9] R. Goldman and J. Widom. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, pages 436-445, 1997.
[10] R. Goldman and J. Widom. Summarizing and Searching Sequential Semistructured Sources. Technical report, March 2000.
[11] A. Halevy, O. Etzioni, A. Doan, Z. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Crossing the Structure Chasm.
[12] E. Jasper, A. Poulovassilis, and L. Zamboulis. Processing IQL Queries and Migrating Data in the AutoMed toolkit. Technical Report 20, Birkbeck College, June 2003.
[13] E. Jeong and C. Hsu. View Inference for Heterogeneous XML Information Integration. Journal of Intelligent Information Systems (JIIS), 20(1):81-99, January 2003.
[14] L. Lakshmanan and F. Sadri. XML Interoperability. In ACM SIGMOD Workshop on Web and Databases (WebDB), San Diego, CA, pages 19-24, June 2003.
[15] J. Larson, S. Navathe, and R. Elmasri. A Theory of Attribute Equivalence in Databases with Application to Schema Integration. IEEE Transactions on Software Engineering, 15(4):449-463, April 1989.
[16] H. Liefke and S. Davidson. View Maintenance for Hierarchical Semistructured Data. In Data Warehousing and Knowledge Discovery, pages 114-125, 2000.
[17] P. McBrien and A. Poulovassilis. Automatic Migration and Wrapping of Database Applications - A Schema Transformation Approach. In Proc. ER'99, LNCS 1728, pages 96-113, 1999.
[18] P. McBrien and A. Poulovassilis. A uniform approach to inter-model transformations. In LNCS, volume 1626, pages 333-348. Springer-Verlag, June 1999.
[19] P. McBrien and A. Poulovassilis. A Semantic Approach to Integrating XML and Structured Data Sources. In Conference on Advanced Information Systems Engineering, pages 330-345, 2001.
[20] P. McBrien and A. Poulovassilis. Schema Evolution In Heterogeneous Database Architectures, A Schema Transformation Approach. In CAiSE, pages 484-499, 2002.

[21] P. McBrien and A. Poulovassilis. Data integration by bi-directional schema transformation rules. In 19th International Conference on Data Engineering. ICDE, March 2003.
[22] L. Popa, Y. Velegrakis, R.J. Miller, M.A. Hernandez, and R. Fagin. Translating Web Data. In Proc. VLDB'02, pages 598-609, 2002.
[23] A. Poulovassilis. The AutoMed Intermediate Query Language. Technical Report, Birkbeck College, June 2001.
[24] A. Poulovassilis and P. McBrien. A general formal framework for schema transformation. In Data & Knowledge Engineering, volume 28, pages 47-71, 1998.
[25] E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334-350, 2001.
[26] C. Reynaud, J.P. Sirot, and D. Vodislav. Semantic Integration of XML Heterogeneous Data Sources. In IDEAS, pages 199-208. IEEE Computer Society, July 2001.
[27] N. Rizopoulos. BAV transformations on relational schemas based on semantic relationships between attributes. Technical Report, Imperial College, August 2003.
[28] P. Rodriguez-Gianolli and J. Mylopoulos. A Semantic Approach to XML-based Data Integration. In ER, volume 2224 of Lecture Notes in Computer Science, pages 117-132. Springer, November 2001.
[29] V. Braganholo, S. Davidson, and C.A. Heuser. UXQuery: building updatable XML views over relational databases. In Proceedings of the Brazilian Symposium on Databases, 2003.
[30] W3C. Guide to the W3C XML Specification ("XMLspec") DTD, Version 2.1. http://www.w3.org/xml/1998/06/xmlspec-report-v21.htm#aen56, June 1998.
[31] W3C. XML Path Language (XPath). http://www.w3c.org/tr/xpath, November 1999.
[32] W3C. XML Schema Specification. http://www.w3c.org/tr/xmlschema-0, http://www.w3c.org/tr/xmlschema-1, http://www.w3c.org/tr/xmlschema-2, May 2001.
[33] W3C. Document Object Model (DOM). http://www.w3.org/dom/, June 00.
[34] W3C. XQuery 1.0: An XML Query Language. http://www.w3.org/tr/xquery/, November 00.