The University of Sheffield Department of Computer Science. Indexing XML Databases: Classifications, Problems Identification and a New Approach


The University of Sheffield
Department of Computer Science

Indexing XML Databases: Classifications, Problems Identification and a New Approach

Research Memorandum CS-7-5

Mohammed Al-Badawi, Computer Science Dept, mbadawi@dcs.shef.ac.uk
Dr. Siobhán North, Computer Science Dept, s.north@dcs.shef.ac.uk
Dr. Barry Eaglestone, Information Studies Dept, b.eaglestone@sheffield.ac.uk

Date: 5th November, 27


Executive Summary

Indexing XML databases is a critical requirement in today's database technology because of the growth in both the size of XML databases and the number of their users. The literature contains an enormous number of XML indexing proposals, ranging from RDBMS-based techniques to native XML approaches. This article reviews these indexing techniques in order to evaluate how each of them encodes the hierarchical nature of XML data.

Whether RDBMS-based or native, the existing XML indexes are classified into four main categories according to the methodology they use to encode XML's structural relationships. The node-labelling approach is a class of XML indexes that is mainly used in XML-to-RDBMS mappings to reflect the parent-child and ancestor-descendant relationships between the nodes of the XML tree. The second class is the path-summary approach, which preserves the structural relationships by storing all the distinct XML paths, so that a regular path query can be evaluated by searching the path list at a specific time. Alternatively, the sequence-based approach serializes both the XML database and the XML query, and checks for the existence of the query's sequence in the database sequence using sequence-based searching techniques. Finally, the feature-based class encodes the structural relationships as features in what is called a feature matrix, and uses matrix algebra to check for the existence of the query in the XML database.

An assessment of the strengths and weaknesses of the four index classes shows that the existing XML indexes lack efficiency in one or more of the following respects: a large index size, which prevents the index from residing in the computer's main memory and therefore involves expensive disk I/O operations; a high cost of index construction and query evaluation; and the absence and/or high cost of index update operations.

Among the four categories, the feature-based approaches are promising XML indexing techniques for the wide range of XML queries that are represented by regular path expressions. The approach has shown an ability to encode XML's structural relationships while reducing the cost of the above deficiencies by using techniques from other disciplines, such as matrix/graph theory and computerised data structures. Building on this observation, this paper proposes a new feature-based XML index which uses sparse matrices to encode the hierarchical structure of XML documents that is required to answer regular path queries. The index uses sparse-matrix compression algorithms to minimize the index size, and novel methodologies for encoding the XML structure to reduce the cost of constructing the index and of evaluating XML queries with it. The novel encoding methodology also allows systematic update operations on the index, without the need to reconstruct it when the underlying XML database changes.


Table of Contents

List of Figures
1. Introduction
2. XML Principles
   2.1. What is XML?
      2.1.1. XML Elements
      2.1.2. XML Attributes
      2.1.3. Comments
      2.1.4. Other XML Components
   2.2. Schema Definition
      2.2.1. DTD
      2.2.2. XML Schema
      2.2.3. Other XML Schema Languages
      2.2.4. Well-Formed & Valid Documents
   2.3. XML Data Model
      2.3.1. XML and Semi-structured Data Model
      2.3.2. Order
      2.3.3. XML Tree and DOM
   2.4. XML Query Languages
      2.4.1. Path Expression
      2.4.2. XPath
      2.4.3. XQuery
   2.5. Summary
3. Indexing XML Data
   3.1. Processing Document-Centric XML
      3.1.1. Querying Content-oriented XML: A Problem Definition
      3.1.2. Content-based Query Approaches: An Evaluation
      3.1.3. Conclusion
   3.2. Techniques for Memory-Based XML Indexing
      3.2.1. Structural Summaries
      3.2.2. Selective XML Indexes
      3.2.3. Adaptive XML Indexes
      3.2.4. Conclusion
   3.3. Structural-Joins XML Indices
4. Node Encoding Approaches
   4.1. Prefix Encoding Approaches
   4.2. Region-Based Encoding Approaches
   4.3. Conclusion
5. Path Encoding Approaches
   5.1. Path-Encoding in XML-to-RDBMS Mapping
   5.2. Native XML Indexes Using Path Summary
   5.3. Conclusion
6. Sequence-Based Indexing Approaches
7. Feature-Based Indexing Approaches
   7.1. What is Similarity Search?
   7.2. Implementations and Data-Structures
   7.3. XML Feature-Based Indexes
   7.4. Conclusion
8. Motivations and a Research Proposal
   8.1. Motivations
      8.1.1. Studying XML Databases
      8.1.2. Motivating Feature-based XML Indexing
      8.1.3. The Index Idea
      8.1.4. A Motivating Example
   8.2. Indexing Problems in Existing Approaches
      8.2.1. The Index Size Problem
      8.2.2. The Computation Cost Problem
      8.2.3. The Index Update Problem
   8.3. Conclusion
9. Conclusion
References
Appendix A: Abbreviations List

List of Figures

Figure 2.1: Sample XML Document (ShefUni Database)
Figure 2.2: An XML Element Diagram
Figure 2.3: A DTD for ShefUni Database
Figure 2.4: An XML Schema for ShefUni Database
Figure 2.5: Conversion between RDBMS, Semi-structured, and XML Data Representation
Figure 2.6: A Tree Graph for ShefUni Database
Figure 2.7: XPath 2.0 Axes
Figure 4.1: A Dewey Encoding (Example)
Figure 4.2: [CKM 2] Encoding (Example)
Figure 4.3: An Example of a Region-based Coding Algorithm
Figure 5.1: Relational Schema for XParent
Figure 6.1: An XML Tree and its Structured-encoded Sequence
Figure 6.2: A Query Representation using [WPFY'3] Technique
Figure 7.1: Example: Feature-graph Matrix
Figure 7.2: Multi-document XML Database and its Paths & Tokens Sets
Figure 7.3: A Bitmap Index for the Database in Figure ...
Figure 7.4: A BitCube Index for the Database in Figure ...
Figure 7.5: A Bibliography XML Database
Figure 7.6: A Bi-simulation Graph of the XML Tree in Figure ...
Figure 7.7: Matrix Computations of the Bi-simulation Graph in Figure ...
Figure 8.1: A Framework Structure for the New Feature-based Index
Figure 8.2: An XML Tree
Figure 8.3: Parent-of Relationship Matrix of Figure ...
Figure 8.4: Child-of Relationship Matrix of Figure ...
Figure 8.5: Ancestor-of Relationship Matrix of Figure ...
Figure 8.6: Descendant-of Relationship Matrix of Figure ...
Figure 8.7: The Master Feature-based Matrix for Figure ...
Figure 8.8: Parent-of Relationship Matrix of Figure 8.2 (after inserting I)
Figure 8.9: Child-of Relationship Matrix of Figure 8.2 (after inserting I)
Figure 8.10: Ancestor-of Relationship Matrix of Figure 8.2 (after inserting I)
Figure 8.11: Descendant-of Relationship Matrix of Figure 8.2 (after inserting I)


1. Introduction

Since February 1998, the eXtensible Mark-up Language (XML [W3C] [W3CS] [ABS']) has become the standard medium for data representation and exchange over the web [ABS']. As a result, the size of XML data and the number of XML users have increased dramatically, and the literature aimed at bringing this new technology up to an acceptable level of maturity is extremely large. It includes proposals for: XML query languages (e.g. XPath [W3C4], XQuery [W3C3] and XML-QL [ABS ]); mapping XML to relational databases (examples are [YDF 4] [BP 5] [SKT ] [ACLLF 6]); commercial XML-enabled databases (such as the OpenXML package in SQL Server [R 5] and the XML data type in IBM DB2 [NL 5]); native XML databases (e.g. Natix [FHKMNW 2], Timber [JACLNPPSWWY 2], and DB4XML [SVMA 4]); and XML data accessibility and security (e.g. the Compressed Accessibility Model, CAM [YSLJ 4], and the SQL-like XML security model [G 4]). Much of this literature has been produced by the database community to build up XML database technology in terms of query optimization techniques and XML storage management. This article aims mainly to review recent approaches for indexing XML databases.

The existing XML query optimization techniques can be divided into two main categories. The first category uses a mature relational database management system to store and manipulate the XML data through what are called XML-to-RDBMS mapping approaches [YDF 4] [BP 5] [SKT ] [ACLLF 6] [FK99] [LC'] [LY'6]. The main concern of such mapping approaches is to find the best relational representation for the XML data, one which preserves the whole of the XML semantics while minimizing the cost of the mapping process in terms of processing time and storage space. XML-to-RDBMS mapping is associated with the problem of preserving the document's order semantics, which can be addressed using node labelling techniques. The existing node labelling schemes fall into two main categories: the prefix [TVBSSZ 2] [CKM 2] [KMS 2] [YLML 5] and the region-based [ACLLF 6] [ZNDLL ] [DTCO 3] [YASU ] encoding algorithms. Approaches from each category are, in turn, subcategorized depending on the labelling methodology used, and the strengths and weaknesses of each subcategory are assessed.

The second category of XML optimization techniques includes native XML indexes [WPFY 3] [KSBG 2] [ZOIA 6a]. As with the XML-to-RDBMS approaches, the native XML indexes are grouped according to the methodology they use to encode the document's structure, into: path-encoding approaches [BFM 5] [YDF 4] [YASU ] [CMS 2] [KSBG 2] [CLO 3], sequence-based approaches [WPFY 3], and feature-based approaches [YYH 5] [ZOIA 6a]. For each group the advantages and disadvantages are discussed.

The outcome of this discussion is used to motivate a proposed feature-based XML index in Section 8. This feature-based approach aims to improve the performance of XML query processing by addressing several limitations found in the existing XML indexes, such as high processing costs in terms of time and storage space, and the cost of index update operations. Such improvements might be achieved by employing techniques from both database technology and graph/matrix theory. Designing, testing and evaluating the proposed feature-based XML index is the subject of ongoing research.

The rest of this report is structured as follows. Section 2 gives a brief definition of XML, outlines the components of an XML document (e.g. elements, attributes, DTD etc.), and introduces the notion of XML databases and their related technologies (e.g. the XML data model and XML query languages). In Section 3, the XML indexing problem is introduced, and various indexing requirements are outlined for three different groups of XML indexes (i.e. indexes for document-centric databases, memory-based indexes, and structure-based indexes). Sections 4, 5, 6 and 7 analyse four different groups of structure-based XML indexes. Specifically, Section 4 discusses the node labelling approaches in RDBMS implementations, while Section 5 discusses the path-encoding approaches in both relational and native XML implementations. Sections 6 and 7 explain the sequence-based indexes and the feature-based indexes respectively. As mentioned above, Section 8 identifies some common problems in the existing indexing proposals and discusses an opportunity to solve them using a feature-based approach. Finally, Section 9 concludes the paper.

2. XML Principles

This section gives a general overview of XML and its related technologies to provide a concrete background for the following sections. The section starts by defining XML, its origin, and its document structure. Section 2.2 discusses some XML schema definition languages, while Section 2.3 introduces the derivation of the XML data model from its parent class, the semi-structured data model. Next, two XML query languages (namely XPath 2.0 [W3C4] and XQuery 1.0 [W3C3]) are introduced in Section 2.4, because they are very popular standards for querying XML data and most of the indexing techniques investigated in this review are based on them. The last subsection (Section 2.5) concludes this introductory section and motivates the discussion in the upcoming sections.

2.1. What is XML?

XML can simply be defined [W3C] as a general-purpose mark-up language that is used to describe both data and its corresponding structure in a format readable by both machines and humans. XML is a simplified descendant of the Standard Generalized Mark-up Language (SGML [W3C8]), which was originally designed in the 1960s to facilitate information sharing among heterogeneous systems. XML first appeared in 1996, and in February 1998 it was accepted [W3C] by the World Wide Web Consortium (W3C) as the standard medium for exchanging data over the Web. From a syntactic point of view, data in XML is tagged with an unbounded set of user-defined tags that come in pairs and can be nested to arbitrary levels [ABS ]. A group of such tagged data, together with other components (e.g. comments and processing instructions), composes an XML document, the basic entity for storing XML data in computers.

    <!-- Decls. go below this line... -->
    <ShefUni>
      Modules & Staff at Sheffield
      <modules>
        <module mid = "COMP">
          <title> Intro. to PC </title>
          <credits> 2 </credits>
          <sections>
            <section secid = "1">
              <regis> 28 </regis>
              <lecturer idref = "L1"/>
            </section>
            <section secid = "2">
              <regis> 25 </regis>
              <lecturer idref = "L2"/>
            </section>
          </sections>
        </module>
      </modules>
      <staffs>
        <staff id = "L1">
          <name> John </name>
          <dept> DCS </dept>
        </staff>
        <staff id = "L2">
          <name> Alice </name>
          <dept> MATH </dept>
        </staff>
      </staffs>
    </ShefUni>

Figure 2.1: Sample XML Document (ShefUni Database)

To show the basic structure of an XML document, Figure 2.1 presents a sample of XML data containing information about some academic modules taught at the University of Sheffield. The data is contained in XML elements, which are a series of adjacent and nested encodings consisting of tags (surrounded by angle brackets, e.g. <modules> and <staff>) and values that represent the actual data (e.g. "John" and "Alice"). In addition, an XML document may include a number of attributes. An attribute is a name-value pair that represents a piece of information which is shared by the element and all its sub-elements, and which can be used to express new information and/or to inter-link different parts of an XML document (or multiple XML documents). For example, the attribute declaration (id = "L1") on the first <staff> element declares an attribute named id whose value is L1. This attribute identifies that <staff> element, and is referenced by the special-purpose idref attribute of a <lecturer> element. Both XML elements and attributes are described further in the next two subsections.

In summary, the simplicity of the XML document structure and the extensibility of its vocabulary have given XML the ability to describe most of the existing data models, such as the relational data model [OO'] (see Figure 2.2) and the object-oriented data model [GKP'92]. These facts have increased XML's popularity and attracted researchers to build a concrete XML database technology, which already has a wide range of implementations in both the database and the information management communities. The following subsections describe the XML terminology most relevant to the intended research. Readers are directed to [ABS ] [W3C] [W3CS] for a full explanation of the structure of XML documents.

Figure 2.2: An XML Element Diagram

2.1.1. XML Elements

An XML element (see Figure 2.2) is a piece of text that is enclosed within a start tag (identified by < >) and an end tag (identified by </ >). It is also the basic component of an XML document, and has a name and content. An XML element is named after its start (or end) tag name, which is a user-defined name that must follow some conventional naming rules (e.g. startTag and endTag in Figure 2.2). From a syntactic point of view, the names in each pair of start and end tags must be identical. XML elements can also be repeated and/or nested to any arbitrary depth following a simple rule: the start and end tags of a contained element must both lie inside the containing element (observe the relationship between startTag/endTag and startTag2/endTag2 in Figure 2.2). This structure determines the type of an XML element. An element can be a simple element (i.e. it contains only an atomic value such as a string or an integer), a complex element (i.e. it contains sub-elements) or a mixed element (i.e. it contains both an atomic value and sub-elements). In Figure 2.1, the <dept> element is a simple element, the <staff> element is a complex element because it contains two sub-elements, and <ShefUni> is an example of a mixed element. In addition to the above syntax, the start tag of an XML element can be associated with one or more XML attributes. The use and syntax of XML attributes are explained in the next subsection.

2.1.2. XML Attributes

An XML attribute is simply a name-value pair that is placed in the start tag of an XML element (see Figure 2.2), indicating that all sub-elements of the corresponding element share the existence of this attribute and its value. Like element names (i.e. tags), attribute names are entirely user-defined text following the same conventional naming rules as XML elements. However, there are two major issues to consider when using attributes: an attribute cannot appear more than once within a single element definition, and its value must be enclosed in quotation marks as a string regardless of its data type.
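The three element kinds described above, and the rule that an attribute may appear only once per element, can be observed with a small parser experiment. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The trimmed document and the helper names has_text and element_kind are made up for illustration and do not come from the memorandum.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical variant of the ShefUni sample (not the full figure).
DOC = """<ShefUni>
  Modules and Staff at Sheffield
  <staffs>
    <staff id="L1"><name>John</name><dept>DCS</dept></staff>
  </staffs>
</ShefUni>"""


def has_text(s):
    """True if s holds non-whitespace character data."""
    return bool(s and s.strip())


def element_kind(elem):
    """Classify an element as 'simple', 'complex' or 'mixed'."""
    children = list(elem)
    if not children:
        return "simple"
    # Character data may precede the first child (elem.text) or
    # follow any child (child.tail).
    mixed = has_text(elem.text) or any(has_text(c.tail) for c in children)
    return "mixed" if mixed else "complex"


root = ET.fromstring(DOC)
for elem in root.iter():
    print(elem.tag, element_kind(elem), elem.attrib)

# An attribute may appear only once per start tag; the parser rejects repeats.
try:
    ET.fromstring('<staff id="L1" id="L2"/>')
except ET.ParseError as err:
    print("rejected:", err)
```

In this sketch an element with no children at all is reported as simple, which also covers empty elements such as <lecturer/>; a stricter classification could treat those separately.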

In addition, three special-purpose attributes, namely "id", "idref" and "idrefs", are used to support references between different parts (i.e. elements) of an XML document or between different XML documents. The benefit of this layout is twofold: a) it minimizes the XML document size by preventing data redundancy; and b) it reduces the cost of updating such referenced data by ensuring that the update takes place at one position in the XML document. Such features can, of course, also be obtained with user-defined attributes; however, an attribute "cannot be used to express an entity if it has children or if it may appear more than once per parent or if the order matters" [H'5].

As an example, in Figure 2.1, each <staff> element has an attribute called "id" which uniquely identifies the staff member in the XML document. To assign a teacher to a <module> element, its sub-element <lecturer> is given an "idref" attribute which points to the <staff> element whose ID equals the value of that "idref" attribute. The "idref" attribute does not support multiple references, so the "idrefs" attribute is used in such situations. For example, if a module is taught by more than one teacher, the "idrefs" attribute is associated with the corresponding <lecturer> element, and as many teacher IDs as required are listed in the value string of this "idrefs" attribute.

In theory, the XML specifications show that the use of attributes does not enhance the expressiveness of XML data [H'5]. Therefore, most of the XML literature, such as XML-to-RDBMS mapping algorithms (see Sections 4 & 5.1) and native XML indexing proposals (see Sections 5.2, 6, & 7), is built on the assumption that every attribute can be transformed into a new sub-element contained in the hosting element.

2.1.3. Comments

Comments can be found anywhere inside XML documents. As in programming languages, XML comments are used to clarify some parts of an XML document, and they are ignored by most of the APIs (and languages) that manipulate XML documents. Comments have the following syntax:

    <!-- comment text -->

In addition, a single comment may span multiple lines in the XML document (see the first line of Figure 2.1).

2.1.4. Other XML Components

Besides elements, attributes and comments, an XML document may contain optional components such as processing instructions (used to pass instructions to the applications that manipulate XML documents) and schema definitions (e.g. DTD and XML Schema). In the next section, the roles of XML schema definitions in XML database technology are illustrated by describing two standard schema definition languages (i.e. DTD [W3CS5] and XML Schema [W3C6]). A discussion of other XML components is outside the scope of this paper, but can be found in [W3CS, W3C, ABS ].

2.2. Schema Definition

In structured databases (e.g. relational databases), the schema definition constrains the layout, definition and typing of the underlying data. The schema must be defined prior to populating the database, and the data stored in the database must obey this schema definition. In contrast, since XML is an instance of semi-structured data [ABS ], the schema no longer plays this role, because semi-structured data is self-descriptive (in terms of layout and definition) and XML data need not be consistent. However, there are many reasons why the structure of XML data may still need to be described, such as checking for the availability of information in an XML document. Various schema definition languages have therefore been proposed, and some of them have emerged as essential components of the XML language. In this section, in order to complete the basic discussion of XML technology, the two most popular XML schema definition languages are described in separate subsections: the Document Type Definition language (DTD) and the XML Schema language (XSD). The following subsection outlines some other schema languages such as XDR, SOX and DSD, while the last subsection defines the requirements for well-formed and valid XML documents.

2.2.1. DTD

The Document Type Definition (DTD) was proposed by W3C as a part of the XML language to serve as a grammar and, to some extent, as a schema for the underlying XML data [ABS ]. The main purpose of the DTD (see the DTD specification in [W3CS5]) is to describe the structure (i.e. the element and attribute declarations) of the underlying XML document. The DTD's declarations are stored either in the XML document they describe, or in a separate file (with a .dtd extension). The second option is more appropriate where many XML documents need to share the same DTD. In such cases, the DTD statements are stored in a global file or URL, and documents refer to it by including the following statement:

    <!DOCTYPE root_element_name SYSTEM "dtd_file_name.dtd">

However, a DTD does not offer as many constraints as a relational database schema. For example, DTD lacks the notion of atomic data types, as it only allows string declarations (i.e. the #PCDATA declaration) [ABS ]. DTDs also do not support range specifications such as lookup lists of values and range domains. Some of these limitations are addressed by XML Schema, which is discussed in the following subsection. Figure 2.3 shows a DTD for the XML database found in Figure 2.1.

    <!DOCTYPE ShefUni [
      <!ELEMENT ShefUni (#PCDATA,modules,staffs)>
      <!ELEMENT modules (module*)>
      <!ELEMENT module (title,credits,sections)>
      <!ATTLIST module cid CDATA #REQUIRED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT credits (#PCDATA)>
      <!ELEMENT sections (section+)>
      <!ELEMENT section (regis,lecturer)>
      <!ATTLIST section secid ID #REQUIRED>
      <!ELEMENT regis (#PCDATA)>
      <!ELEMENT lecturer EMPTY>
      <!ATTLIST lecref IDREF #REQUIRED>
      <!ELEMENT staffs (staff*)>
      <!ELEMENT staff (name,dept?)>
      <!ATTLIST staff id ID #REQUIRED>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT dept (#PCDATA)>
    ]>

Figure 2.3: A DTD for ShefUni database

2.2.2. XML Schema

The XML Schema language [W3C6] [W3CS2] was adopted as a W3C Recommendation in May 2001 to overcome several limitations faced when using DTD. One of the key factors that strengthen XML Schema is its support for conventional data typing of XML data [W3C6]. Besides declaring XML elements and attributes, defining the structure and relationships between document components, and supporting data types, XML Schema supports value-range domains and constraints on how often an element occurs in the XML document. Furthermore, XML Schema declarations follow XML syntax, support namespaces, and are more human-readable than DTD declarations. An instance of XML Schema is called an XML Schema Definition (abbreviated as XSD) and, as discussed for DTDs, can be stored directly in the XML document it constrains or in a separate file with an .xsd extension. Figure 2.4 illustrates an XSD for the database in Figure 2.1.
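The ID and IDREF attribute types declared in a DTD ask a validating parser to check two things: that ID values are unique within the document, and that every IDREF value matches some ID. Python's standard xml.etree.ElementTree parser does not validate against a DTD, so the sketch below performs the two checks by hand. The document fragment, the function name check_references, and the assumption that the attributes named id and idref play the ID and IDREF roles are all chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the ShefUni sample; "L9" is a deliberate
# dangling reference.
DOC = """<ShefUni>
  <sections>
    <section secid="1"><lecturer idref="L1"/></section>
    <section secid="2"><lecturer idref="L9"/></section>
  </sections>
  <staffs>
    <staff id="L1"><name>John</name></staff>
    <staff id="L2"><name>Alice</name></staff>
  </staffs>
</ShefUni>"""


def check_references(root, id_attr="id", ref_attr="idref"):
    """Return (duplicate_ids, dangling_refs) for the given attribute names."""
    seen, duplicates = set(), []
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is None:
            continue
        if value in seen:
            duplicates.append(value)
        seen.add(value)
    # A reference is dangling when no element carries the matching ID.
    dangling = [
        elem.get(ref_attr)
        for elem in root.iter()
        if elem.get(ref_attr) is not None and elem.get(ref_attr) not in seen
    ]
    return duplicates, dangling


duplicates, dangling = check_references(ET.fromstring(DOC))
print("duplicate IDs:", duplicates)
print("dangling IDREFs:", dangling)
```

An idrefs value would need one extra step, splitting the attribute value on whitespace and checking each token in the same way.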

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xs:schema xmlns:xs="http://...">
    <!-- ShefUni (root) element declaration: -->
    <xs:element name="ShefUni">
      <xs:complexType>
        <xs:sequence mixed="True">
          <!-- modules element declaration: -->
          <xs:element name="modules">
            <xs:complexType>
              <xs:element name="module" minOccurs="..." maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="title" type="xs:string"/>
                    <xs:element name="credits" type="xs:positiveInteger"/>
                    <xs:element name="sections">
                      <xs:complexType>
                        <xs:element name="section" minOccurs="..." maxOccurs="2">
                          <xs:complexType>
                            <xs:sequence>
                              <xs:element name="regis" type="xs:positiveInteger"/>
                              <xs:element name="lecturer">
                                <xs:complexType>
                                  <attribute name="secid" type="xs:string" use="required"/>
                                </xs:complexType>
                              </xs:element>
                            </xs:sequence>
                            <attribute name="sfid" type="xs:positiveInteger" use="required"/>
                          </xs:complexType>
                        </xs:element>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <attribute name="cid" type="xs:string" use="required"/>
                </xs:complexType>
              </xs:element>
            </xs:complexType>
          </xs:element>
          <!-- staffs element declaration: -->
          <xs:element name="staffs">
            <xs:complexType>
              <xs:element name="staff" minOccurs="..." maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="name" type="xs:string"/>
                    <xs:element name="dept" type="xs:string"/>
                  </xs:sequence>
                  <attribute name="id" type="xs:string" use="required"/>
                </xs:complexType>
              </xs:element>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
    </xs:schema>

Figure 2.4: An XML Schema for ShefUni Database

2.2.3. Other XML Schema Languages

In reality, there are many XML schema languages other than W3C's DTD [W3CS5] and XML Schema [W3C6] mentioned above. These languages attract less attention from the XML database community because of their limited or domain-specific usage. Regardless of their popularity, most of the existing schema languages have at least as much support for the structure of an XML document as the DTD language.
Furthermore, a group of existing schema languages, for example XDR [LTG] and SOX [W3C9], incorporate support for the basic data types in addition to the string type, which is the only type supported in DTD. However, these examples lack some other aspects of schema definition, such as explicit nulls and user-defined data types. These features, plus others such as uniqueness, keys and inheritance, are well supported in schema languages like DSD [BRICS] and Schematron [SCM].

In the literature, a comprehensive analysis of six schema languages, namely DTD [W3CS5], XML Schema [W3CS2], XDR [LTG], SOX [W3C9], Schematron [SCM] and DSD [BRICS], is given by Lee and Chu in [LC ]. The authors compare the six schema languages in terms of syntax, data typing, component declarations, constraint definitions, usability and popularity. Based on this study, the strengths and weaknesses of a schema language should not be judged in absolute terms; rather, the judgement should be based on the philosophy by which each language has been designed [LC ]. While some languages might be designed to be more semantic-based, others are meant to be more optimistic. However, as far as both requirements are concerned, XML Schema is superior, which is why it has been widely accepted in the XML database context [LC ].

2.2.4. Well-Formed & Valid Documents

An XML document is said to be well-formed if, and only if, it satisfies the following three conditions:

1. The start tag and end tag of every element must match (i.e. be identical).
2. Sub-elements must be nested properly inside their parents (i.e. sub-elements must be opened and closed within the boundary of the containing elements).
3. The attributes associated with an element must be uniquely defined.

Being well-formed is enough for an XML document to be modelled as a node/edge-labelled tree (see Section 2.3 for XML data modelling), and therefore to be parsed with existing XML parsers. In addition to the well-formedness conditions, an XML document is said to be valid if, and only if, there is a schema definition attached to it (e.g. a DTD), and the data contained in the document obeys this schema definition [W3CS] [W3C] [ABS ]. Although it weakens the flexibility of semi-structured data, XML document validation plays a major role in several XML implementations such as query optimization techniques [DTCO 3] [BGK 6] [EDR 5], and XML-to-RDBMS mapping algorithms [YDF 4] [R 5] [ACLLF 6].
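The three well-formedness conditions can be exercised directly, because any conforming XML parser must reject a document that breaks them. The sketch below is an illustration only: it feeds four hypothetical snippets to Python's standard xml.etree.ElementTree parser, and the helper name is_well_formed is invented here. Note that this parser checks well-formedness, not validity; it does not compare the document against a DTD or an XSD.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets: one well-formed, three violating one condition each.
SAMPLES = {
    "well-formed": "<staff id='L1'><name>John</name></staff>",
    "mismatched tags": "<staff><name>John</nam></staff>",
    "improper nesting": "<staff><name>John</staff></name>",
    "duplicate attribute": "<staff id='L1' id='L2'/>",
}


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


for label, text in SAMPLES.items():
    print(f"{label}: {is_well_formed(text)}")
```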
    ID   Name    Dept
    L1   John    DCS
    L2   Alice   MATH
    L3   Andrew  DCS

(a) Staffs Database (Relational)

    {staffs: {staff: {id: L1, name: John, dept: DCS},
              staff: {id: L2, name: Alice, dept: MATH},
              staff: {id: L3, name: Andrew, dept: DCS} } }

(b) Staffs Database (semi-structured)

    <staffs>
      <staff id="L1"> <name>John</name> <dept>DCS</dept> </staff>
      <staff id="L2"> <name>Alice</name> <dept>MATH</dept> </staff>
      <staff id="L3"> <name>Andrew</name> <dept>DCS</dept> </staff>
    </staffs>

(c) Staffs Database (XML)

Figure 2.5: Conversion between RDBMS, Semi-structured, and XML data representation

2.3. XML Data Model

As mentioned in Section 2.1, XML gained its popularity in the context of database technology because of its self-descriptiveness, simplicity, and machine/human readability [W3CS] [ABS ]. Because data is always a valuable source of information, it is essential that it be securely stored, efficiently searched and retrieved, and easily updated. As a result, there has been much research into bringing XML's performance up to that of the currently established database technology.
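The conversion from the relational table to the XML form of the Staffs database in Figure 2.5 can be sketched in a few lines. This is one possible mapping, written under the assumption that the key column becomes an attribute and every other column becomes a sub-element; the function name rows_to_xml and its parameters are illustrative, and the sketch uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Rows of the relational Staffs table, as in part (a) of the figure.
ROWS = [
    {"id": "L1", "name": "John", "dept": "DCS"},
    {"id": "L2", "name": "Alice", "dept": "MATH"},
    {"id": "L3", "name": "Andrew", "dept": "DCS"},
]


def rows_to_xml(rows, table="staffs", record="staff", key="id"):
    """Map each row to a <record> element: the key column becomes an
    attribute, the remaining columns become sub-elements."""
    root = ET.Element(table)
    for row in rows:
        elem = ET.SubElement(root, record, {key: row[key]})
        for column, value in row.items():
            if column != key:
                ET.SubElement(elem, column).text = value
    return root


root = rows_to_xml(ROWS)
print(ET.tostring(root, encoding="unicode"))
```

Every value ends up as character data in this sketch, so any column types or key constraints held by the relational schema are not carried over to the XML output.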

This section gives a concrete background for the subsequent discussion of XML data modelling. It first discusses the derivation of the XML data model from its parent, the semi-structured data model. It then discusses the relationship between the two models, and finally describes a graphical representation for XML data.

2.3.1. XML and Semi-structured Data Model

XML data is said to be an instance of semi-structured data [ABS ], where there is no clear-cut line between the data and its structure (schema). The definition of the schema is mixed with the data in a serialized text to suit the needs of users and applications. Therefore, the schema in semi-structured data no longer constrains the data representation, as is the case in the relational data model. Consequently, this gives semi-structured data a flexible format, and hence makes it a suitable model for transferring data between heterogeneous systems [ABS ].

Another important factor that gives the semi-structured data model strength in today's database technology is its ability to represent data from other models, such as the relational data model. There is a simple and direct mapping from an instance of relational data to a corresponding semi-structured version (and consequently to an XML version). However, the relational-to-XML conversion may not be an easy task, especially when efficiency is an issue. An efficient XML design has to satisfy several concerns, such as minimal data storage and optimal support for XML queries. Furthermore, the relational-to-XML conversion results in some loss of information, such as data-type constraints, internal triggers, data-integrity keys and other constraints, and the cost of such issues must be kept to a minimum. Figure 2.5 (a, b and c) exemplifies the result of a simple relational-to-XML conversion.

Figure 2.6: A Tree Graph for ShefUni Database

2.3.2. Order

Although the XML and semi-structured data models share several features (e.g. flexibility, self-descriptiveness, simplicity, and machine/human readability), there is a major conflict between them in their support for order [ABS ].
While order is irrelevant in a semi-structured representation, because semi-structured data is expressed in terms of sets, order is important in an XML context because XML was originally proposed as a document mark-up language. Ignoring order while processing such documents therefore leads to a semantic loss. In the XML context, the notion of order applies to elements but not to attributes. For example, if an element E contains two sub-elements, say E1 and E2, and two attributes, say A1 and A2, then the positions of A1 and A2 can be interchanged without affecting the document's semantics, while swapping the positions of E1 and E2 yields a different meaning.

As a result, managing order introduces extra complexity when manipulating an XML database (i.e. querying and updating XML data). For example, there is a need for note-keeping techniques during the XML-to-relational mapping process in order to preserve the structural semantics of XML documents. A detailed investigation of this problem, its effects, and proposed solutions is given in later sections.

2.3.3. XML Tree and DOM

Most of the literature represents XML data as an ordered, node-labelled or edge-labelled, rooted tree graph. In a node-labelled representation, the graph consists of nodes and edges. Each node represents an element, an attribute or atomic content, which is always of type string. The graph starts from a single node representing the root element. Child elements are then connected by directed edges from the parent node to the child node. The leaf nodes always contain atomic data, while the inner nodes hold the tag and attribute names. This representation varies slightly in the literature depending on the logic of the targeted implementation. For example, the edges may be labelled with the tag or attribute names while the nodes are left empty, or are given an invented attribute that reflects the document order. This is called an edge-labelled graph.

Regardless of the representation used, most XML implementations (e.g. XML query languages) use the following conventions [W3C7] to model XML data during processing:

- The top-level node (root) is the first node in the document.
- A node precedes its children and all its descendants in document order.
- If an element node is associated with namespace nodes, the namespace nodes are ordered after their associated element node.
- The attribute nodes of an element follow the element node and, if present, its associated namespace nodes.
- Sibling nodes of an element node come after the element's children and descendant nodes.
- The children property of an element node determines the relative order of the element's children (the order among the siblings).

A computer-based version of the tree model described in this section is the Document Object Model (DOM [W3C]). DOM is a platform-independent tree model proposed by the W3C for accessing XML documents in computerised systems, such as the W3C's Java DOM API [SUN]. In the DOM tree model, an XML node belongs to one of the following categories: element nodes, attribute nodes, and textual-content nodes. Attribute names and tag names label the attribute and element nodes respectively, while values (i.e. textual contents) are held in the textual nodes. Each category of XML node has its own methods and events, which give computer-based implementations easy access to parse an XML document and identify its components for further manipulation. The tree graph in Figure 2.6 represents the ShefUni database (given in Figure 2.1) using the DOM tree representation. To motivate the discussion in the rest of this review, the following section describes some of the well-known XML query languages that use this data model to process XML data.

2.4. XML Query Languages

The most important reason for storing data is to use it again in one or more of the following situations:
- Retrieve the whole or a part of the data,
- Re-format the data layout to suit special needs and different applications,
- Construct new data (e.g. produce statistics) from existing data, and
- Update data when circumstances change.

Query languages, in the database context, perform the first three tasks. On some occasions, query languages are extended to perform the fourth task. This section introduces two widely used XML query languages, namely XPath (XPath 2.0 [W3C'04]) and XQuery (XQuery 1.0 [W3C'03]). They are discussed because they are widely used and because many of today's optimization proposals are based on them, hence their relevance to this research. However, the discussion is not intended to be comprehensive, and readers are directed to the W3C's specifications for more information (XPath [W3C'04] and XQuery [W3C'03]). The discussion starts by introducing regular path expressions in Section 2.4.1, followed by the features and the syntax of XPath 2.0 and XQuery 1.0 in separate subsections.

Axis Description                Abbreviated Syntax    Expanded Syntax
Child node                      default               child
Descendant nodes                not available         descendant
Current and descendant nodes    //                    descendant-or-self
Parent node                     .. (two dots)         parent
Ancestor nodes                  not available         ancestor
Current and ancestor nodes      not available         ancestor-or-self
Following node                  not available         following
Preceding node                  not available         preceding
Following siblings              not available         following-sibling
Preceding siblings              not available         preceding-sibling
Current node                    . (dot)               self
Attribute nodes                 @                     attribute

Figure 2.7: XPath 2.0 Axes

2.4.1. Path Expression

Given that the best way to describe semi-structured data is to represent it as a labelled-tree graph [ABS ], reaching a specific position (i.e. accessing data) in this tree graph requires navigation from the root node to the desired node. The path expression [W3C'04] is a powerful technique that enables applications (based on semi-structured data) to access arbitrary positions in the tree by walking through labelled edges. Each edge in the path is called a step. Every two steps in a path are separated by a dot (i.e. "."), indicating that the left-hand step precedes the right-hand step. Similarly, operators such as "|", "?", "+" and "*" are borrowed from regular expressions to express choice, optional existence, one or more repetitions, and zero or more repetitions of a certain node in a path expression.

XPath expressions, the XML notation for path expressions, use slashes (i.e. "/") instead of dots to separate path steps. In addition, XPath introduced a set of axes [W3C'07] for navigating through the XML tree from/to any arbitrary node. XPath axes can be written in abbreviated or expanded form. Figure 2.7 lists the axes in both forms. In general, XPath expressions form the basis of all XML query languages.

2.4.2. XPath 2.0

This section only describes the features and the syntax of the XPath 2.0 query language for the purpose of introducing the language. A comprehensive discussion of the XPath 2.0 technical specifications can be found in [W3C'07] [W3CS'03].

Features: [W3C'07, W3CS'03]

- As a query language, XPath refers to the XML Path Language
- Models XML data as an edge-labelled tree
- Is used for addressing and selecting certain parts of XML documents or producing summaries
- Can express complex queries by using predicates, which are enclosed in square brackets "[ ]". A predicate must be satisfied before XPath begins matching the following node in the path (e.g. //staff[@sfid="L"]/dept)
- Has a wide range of operators for handling data, such as Boolean operators (e.g. "or", "and"), arithmetic operators (e.g. "+", "mod"), and comparison operators (e.g. ">", "!=")
- Has a rich library of functions that manipulate node-sets (e.g. position(), count(nodeset)), strings (e.g. string(obj), concat(str1, str2[, str*])), Boolean values (e.g. not(boolvalue), true()), and numbers (e.g. sum(num_nodeset), avg(num_nodeset))
- XPath 2.0, a W3C Recommendation since 23/Jan/2007, is the latest version of XPath and is a subset of the XQuery 1.0 [W3C'03] query language, which is discussed in Section 2.4.3

Syntax: XPath queries use the notion of regular path expressions along with the above-mentioned XPath axes, operators and functions. Some XPath queries based on the ShefUni database are given below.

Examples: These examples are based on the ShefUni database in Figure 2.1.

Query Q1: This query returns the contents of the lecturer-name nodes:
    /ShefUni/staffs/staff/name/text()
Result Q1:
    John Alice

Query Q2: This query returns the name nodes of the DCS staff:
    /ShefUni/staffs/staff[dept="DCS"]/name
Result Q2:
    <name> John </name> <name> Alice </name>

Query Q3: This query returns the name node of the first staff member in the database:
    /ShefUni/staffs/staff[1]/name
Result Q3:
    <name> John </name>

2.4.3. XQuery 1.0

Similarly, this section only describes the features and the syntax of the XQuery 1.0 query language for the purpose of introducing the language. A comprehensive discussion of the XQuery 1.0 technical specifications can be found in [W3C'03] [W3CS'04].

Features: [W3C'03, W3CS'04]
- Stands for XML Query Language
- Is an XML query language with some programming-language features and SQL-like semantics
- Is based on the XPath data model
- Has all the functionality, libraries and capabilities of XPath 2.0 (i.e. XPath 2.0 is a subset of XQuery)
- Is supported by most commercial RDBMSs, such as IBM, Oracle and Microsoft SQL Server
- In addition to the XPath 2.0 capabilities, supports FLWOR [W3CS'04] expressions (FLWOR is an acronym for "For, Let, Where, Order by, Return")
- XQuery 1.0, a W3C Recommendation since 23/Jan/2007, is the latest version of XQuery

Syntax: There are two different syntaxes for the XQuery 1.0 query language. The basic syntax is XPath, because XPath 2.0 is a subset of XQuery 1.0. In other words, all XPath 2.0 expressions are valid XQuery 1.0 expressions. The second XQuery 1.0 syntax is the FLWOR expression, which is influenced by functional programming and SQL-like features. As a result, extra rules apply when validating the syntax of XQuery 1.0 expressions (quoted from [W3CS'04]):
- XQuery is case-sensitive
- XQuery elements, attributes, and variables must be valid XML names
- An XQuery string value can be in single or double quotes
- An XQuery variable is defined with a $ followed by a name, e.g. $bookstore
- XQuery comments are delimited by (: and :), e.g. (: XQuery Comment :)

Examples: All XPath examples (in Section 2.4.2) are valid XQuery 1.0 expressions and produce exactly the same results. Furthermore, the queries Q1 and Q2 (from Section 2.4.2) can be re-written as FLWOR expressions with an additional task that changes the order of the results.

Query Q1: This query returns the contents of the lecturer-name nodes:
    for $n in doc("shefuni.xml")/ShefUni/staffs/staff/name/text()
    order by $n
    return $n
Result Q1:
    Alice John

Query Q2: The following query creates new elements that list all staff names from the DCS department:
    for $n in doc("shefuni.xml")/ShefUni/staffs/staff
    where $n/dept = "DCS"
    order by $n/name
    return $n/name
Result Q2:
    <name> Alice </name> <name> John </name>

2.5. Summary

This section (Section 2) gave a general overview of XML and some of its related technologies, including XML data modelling and XML query languages. As explained above, XML is an emerging standard medium for transferring data over the Web and exchanging information between heterogeneous systems. Furthermore, XML is used in many fields, and the amount of information stored in XML format has become very large in a short period of time. Computer and information technology scientists have suggested that XML databases are about to replace the existing conventional databases (e.g. relational and object-oriented databases) because of their simplicity and flexibility in most fields of information technology. Therefore, the recent database literature aims to bring XML database technology up to the level of maturity of existing conventional databases. This maturity includes reliable XML storage management, efficient XML processing techniques, and trustworthy XML data warehousing tools. Among these requirements, query optimization techniques have become essential tools for processing large-scale XML databases efficiently. The rest of this review discusses the existing XML indexing techniques for the purpose of investigating further enhancements to such query optimization tools. The discussion ends by proposing a new indexing technique which could possibly improve the performance of a wide range of XML queries.
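The XPath queries Q1 to Q3 of Section 2.4.2, and the re-ordering performed by the FLWOR version of Q1, can be tried out with a minimal sketch such as the one below. It assumes Python's standard xml.etree.ElementTree module, which supports only a limited XPath subset (for instance, text() is not available, so the text is read from the selected nodes instead). The XML document is a hypothetical stand-in for the ShefUni database, invented only to be consistent with the results quoted in the text; the real database is the one given in the original figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the ShefUni database (element names follow the
# queries in the text; the content is an assumption for illustration only).
SHEFUNI_XML = (
    "<ShefUni><staffs>"
    "<staff><name>John</name><dept>DCS</dept></staff>"
    "<staff><name>Alice</name><dept>DCS</dept></staff>"
    "</staffs></ShefUni>"
)

root = ET.fromstring(SHEFUNI_XML)  # root is the <ShefUni> element

# Q1: /ShefUni/staffs/staff/name/text()
q1 = [n.text for n in root.findall("./staffs/staff/name")]

# Q2: /ShefUni/staffs/staff[dept="DCS"]/name (predicate on a child's text)
q2 = [n.text for n in root.findall("./staffs/staff[dept='DCS']/name")]

# Q3: /ShefUni/staffs/staff[1]/name (positional predicate, 1-based)
q3 = [n.text for n in root.findall("./staffs/staff[1]/name")]

# FLWOR version of Q1: the same values, re-ordered by "order by $n"
q1_ordered = sorted(q1)

print(q1, q2, q3, q1_ordered)
```

Because the paths are evaluated relative to the root element, the leading /ShefUni step of the original queries is written here as "./".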

3. Indexing XML Data

Data is stored in order to be used when it is needed. The process of storing and retrieving data relies on techniques such as data storage management and data indexing. Indexing databases has become critically important because of the increase in their size. For XML databases, the storage management and indexing problems are even worse because of the irregularity of the data and the lack of data typing [ABS ]. This review mainly investigates XML indexing techniques on both XML-native and XML-enabled platforms. To present the strengths and weaknesses of XML indexes clearly, the discussion is based on three different aspects.

First, XML indexing proposals can be evaluated according to the type of documents indexed. There are indexes designed to index data-centric documents, for example ViST [WPFY'03], XSeq [MJCW'04] and FIX [ZOIA'06a]. On the other hand, there are indexes designed to index document-centric XML databases, such as [XP'05] [YL'06] [BG'06], which implement Information Retrieval (IR) techniques to process XML contents. The framework of some recent document-centric indexes is outlined in Section 3.1. Sections 4 to 7 evaluate data-centric indices.

The second aspect of this discussion of XML indexes is index residency. There are temporary indexes that are built on-the-fly during query execution (e.g. [ZNDLL ] [BKS'02]). This type of index has an excellent response time for structural-join queries because it resides in main memory and so avoids the expensive cost of I/O operations. Unfortunately, memory-based indexing techniques do not scale to large XML databases, hence the need for disk-based indexing algorithms (e.g. [WJWLLL'05]). Section 3.2 describes the technology of memory-based indexes, with some recent proposals from the literature.

Alternatively, XML indexing algorithms can be evaluated on the basis of how the XML structural relationships are encoded in the index file [ZLC'04]. Indexing techniques of this type have been grouped into four categories.
These categories are: node encoding techniques, path encoding techniques, sequence-based indexing techniques, and feature-based indexing techniques. Section 3.3 outlines the four categories in general, while Sections 4, 5, 6 and 7 discuss each category in detail. An important observation about this class of indexes is that they pay less attention to indexing the contents (i.e. values). This is because the most important issue in querying XML data is finding the proper path to the data despite the irregularity of the XML. The major part of this report discusses these indexes because of their relevance to the proposed research in indexing XML data, which is outlined later.

3.1. Processing Document-centric XML

In this section I discuss the notion of processing content-oriented XML documents. Unlike the processing of data-centric XML documents, which is investigated intensively in this review because of its relevance to the research, only the most significant aspects of querying and indexing document-based XML data are outlined, for the purpose of comparing and contrasting them with the data-centric approaches. In the following sections, the need for dedicated content-based XML indexing techniques is discussed, followed by several solutions for processing content-based XML documents (Section 3.1.2) and a conclusion in Section 3.1.3.

3.1.1. Querying Content-oriented XML: A Problem Definition

Querying content-oriented XML documents is different from processing data-centric and/or plain-text documents. Data-centric documents can be viewed as a representation of flexible relational tables in which the document structure is constrained by the use of elements and attributes [W3C] [KMRS'04]. Therefore, this type of XML document is often processed via SQL-like query languages such as XPath [W3C'04] and XQuery [W3C'03]. On the other hand, the text in document-centric documents is more or less narrative and loosely structured, with the element order being of more significance. However, it is unlike plain-text documents, where there is no structure to be considered during the query process and existing information retrieval (IR) techniques are therefore used to process such documents [L'06] [KMRS'04]. Indeed, the mixed nature of content-based XML documents has increased the need for a hybrid querying approach, in which the structural nature is processed by database means and the textual nature is managed by employing IR technology [KMRS'04].

It has been shown that neither structural query tools (for example XPath) nor textual IR approaches can independently query document-centric XML databases [KMRS'04] [L'06] [FG ]. In the XPath query language, for example, the query processor uses an exact-matching approach to return results. This is not ideal for querying document-centric XML data. Firstly, the textual content in such documents is too large to be embedded (e.g. as predicates) in a query expression. Secondly, the textual content contains a large amount of redundant information (i.e. non-keywords) which has no obvious and direct role in determining the query result, and consequently could lead to false-positive responses. Last but not least, any XPath processor is designed to return the entire matching subtree(s) in document order, whereas a properly designed IR system should return the most relevant parts of documents, with the highest relevance ranked first [KMRS'04]. While the contains operator in XPath 2.0 [W3C'04] solves a small portion of the keyword-search problem, there is still a need for other operators that incorporate the document constraints (such as tag and attribute names, as well as the document's hierarchical structure) throughout the querying life-cycle, starting from the matching phase, through selecting the most relevant results, and ending with the result-ranking phase.

In comparison, IR approaches perform no better than XML query languages for processing document-centric XML data.
Guo et al. [GSBS'03] have identified three major challenges to be addressed when using IR keyword searching (KWS) techniques to process XML documents in general and document-centric XML data in particular. First, the result returned by KWS techniques is no longer the entire matching document, as it is in traditional KWS. In the XML context, a KWS technique has to consider the nested, hierarchical nature of XML documents. That is, if one or more keywords exist in an element, it is more practical to return that element (or its surroundings) instead of returning the entire XML document. The following section (Section 3.1.2) contains a detailed discussion of this issue. Once all matching elements are identified, the second challenge is how to rank these results. Note that the ranking computation is now based on element granularity instead of document granularity [SKR'04] [GSBS'03], and this could lead to indistinguishable ranking values, as the keywords sought might be found evenly in the returned elements. So the traditional KWS techniques are not an ideal solution for performing XML keyword searches. In addition, because of the hierarchical nature of XML elements, the results could be meaningless (e.g. selecting an incomplete portion of text or elements), and therefore intelligent algorithms are required to select the most desirable XML fragments. This involves extra computation and accordingly increases the query's processing cost. Finally, proximity computation is a straightforward process for plain-text documents, where it can be derived directly from the keywords' distances in the document. In the XML context, however, proximity computation is more complex. A two-dimensional proximity metric is required to consider both the keyword distance within the entire XML document and the distance between the keywords' hosting elements and their ancestors. In summary, content-oriented XML documents can be processed by mixing techniques from both standard database-querying technology and traditional IR technology.
Implementing such hybrid approaches requires a proper query-document matching technique, an intelligent output-selector algorithm, and a reliable ranking scheme. The following section investigates several well-known approaches that have been proposed to satisfy these requirements.

3.1.2. Content-based Query Approaches: An Evaluation

Many solutions have been proposed to query and index content-based XML documents efficiently using hybrid techniques from both the database community and information retrieval technology. In this section, various implementations of such hybrid approaches are described. The discussion includes answers to the following three questions:
- How to match the query terms with the document terms?
- How to select and group the results?
- How to rank the selected results?

XIRQL [FG ] is one of the earliest content-based XML query languages and was proposed by Fuhr and Großjohann in the early 2000s. It is built on top of XQL [W3C'05], a predecessor XML query language that allows flexible conditions on both the XML structure and the contents. The main goal of XIRQL is to extend XQL by employing traditional IR keyword search techniques with structural-semantic support in mind. To do this, XIRQL starts by treating XML leaf nodes as atomic query-able and indexable units, and expands them into what are called index-objects. The idea of forming such index-objects is borrowed from the FERMI multimedia model [CMF'96], in which they are used to retrieve the most relevant text from the searched documents to answer a query. In XIRQL, index-objects are a set of distinct subtrees that represent meaningful output for a content-based query. The use of these index-objects is therefore twofold: firstly, they are used in the element weighting process, and secondly, the matching index-objects are returned in the final result. To access these objects during the query's different phases, XIRQL uses a simple extension of inverted files and treats index-objects as if they were stand-alone documents in traditional IR systems. One drawback of this approach is its use of pre-set objects, which often require updating when the XML structure changes. However, prior specification of these index-objects may prevent undesirable results such as returning full documents.

To avoid the use of pre-set index-objects, XRank [GSBS'03] and XSEarch [CMKS'03] use two automated methods to highlight the element nodes to be processed (i.e. compared against query keywords). In XRank, an XML document is partitioned as follows: each partition (i.e. object) includes all nodes that have at least one occurrence of all searched keywords, such that they belong to the same ancestor.
However, to avoid duplicating descendant elements in the higher partitions, XRank conducts a bottom-up partitioning process that excludes a descendant partition from appearing in its ancestor [GSBS'03]. XSEarch, on the other hand, uses the idea of interconnection [CMKS'03] to partition XML documents. Two nodes belong to the same partition if there are no distinct nodes holding the same element name in the same set of nodes that share the same lowest common ancestor [CMKS'03] [LC'07]. In addition, XRank does not differentiate between keyword types (i.e. whether a keyword is a text-word or a label-word) when matching keywords. This is also the case in XSEarch, but the latter employs more IR techniques, such as tf-idf (footnote 1), to prevent unrelated results from being returned.

In terms of forming and sizing the output, both XRank and XSEarch return the same document partitions used in the matching phase, or a simple variation of them (such as the meaningful or smallest lowest common ancestor element). However, the two techniques rank the results slightly differently. The XSEarch ranking scheme is mostly IR-oriented. It considers factors such as term distance and frequency [LC'07], while XRank employs Google's PageRank [BP'98] method, but at the granularity of XML elements instead of documents (HTML documents in the case of [BP'98]) [GSBS'03]. One common problem in both XSEarch and XRank is that they ignore the type of the searched keyword (i.e. whether a keyword is a textual word or a node-label word), which may produce semantically irrelevant output or cause the entire document to be returned, especially in the case of keyword-enriched queries (footnote 2).

The problem of semantic irrelevancy that results from using both textual and node-label keywords in XML IR approaches is investigated in [KMRS'04], where the about operator is proposed to enable the XPath query language to query document-centric databases efficiently. The about operator has the same syntax as the standard contains operator [W3C'04] but with two more unique features. First, besides its standard syntax, the about operator allows the search requirement to be formulated by mixing the contents and the structure of the XML document.
Second, it allows best-match querying of content-based XML documents. So the XML structural semantics can still be preserved by using strict XPath axes, while the contents can be searched by injecting one or more about operators into the XPath expression. In terms of managing the query output, although it is not explicitly stated in [KMRS'04], it is apparent that the XPath standard library will control the output produced as well as its order. Moreover, because XPath is a subset of the W3C-standardized XQuery 1.0 language, the order by clause can be used to re-arrange the returned XML fragments based on criteria calculated through other XPath operators.

From the database point of view, the about extension seems to be an ideal solution for querying content-based XML documents. However, such documents contain information mostly of interest to users who do not know XPath/XQuery. Therefore, IR-oriented solutions with XML features are more practical in this context. A recent solution for the XPath about extension problem is proposed in [CPCFD'06]. Chu-Carroll et al. suggested that incorporating some semantic issues from the XML Fragments query language [CMMMS'03] into traditional keyword searching approaches will result in better query performance for document-centric XML databases. They found that applying the conceptualization, restriction and relation operations of the XML Fragments query language to traditional keyword searching techniques places more constraints on the query and produces more precise results. The conceptualization operation is used for the query-document matching phase, while the ambiguity of the keywords is resolved via the restriction operation, which is also used to specify the search terms that are interrelated by the relation operation [CPCFD'06]. The good feature of this approach is that it is applicable to most existing keyword-based query languages. Therefore, it leverages the characteristics of the underlying IR system, such as the weighting and ranking operations.

Footnote 1: The tf-idf (term frequency, inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
Footnote 2: Queries that include many keywords from both the document's tag vocabulary and the actual data (i.e. leaf nodes).

3.1.3. Conclusion

This section outlined several approaches for processing document-centric databases. In particular, each technique was evaluated with respect to three main processes: 1) conducting the query-document matching methodology, 2) partitioning and selecting the most relevant XML data, and 3) ranking the results according to their relevance. As is notable from the above discussion, all approaches combine techniques from the database community with existing IR techniques to perform these tasks. Since the main focus of this review is an analysis of the process of indexing data-centric databases, a detailed discussion of document-centric databases is beyond its scope.
The following section (Section 3.2) discusses some techniques used in constructing memory-based indexes for XML databases.

3.2. Techniques for Memory-based XML Indexing

This section outlines several state-of-the-art techniques that are used by some of the existing XML indexes to minimize the index size, so that such indexes can easily fit in the computer's memory and therefore obtain better query performance by reducing the expensive disk I/O operations. The purpose of this discussion is to seek a better compression technique for the proposed feature-based index, which is discussed in later sections. This section specifically describes XML structural summaries in Section 3.2.1, the notion of adaptive XML indexes in Section 3.2.2, and the notion of selective XML indexes, and closes with a conclusion.

3.2.1. Structural Summaries

This technique is widely used in indexing semi-structured data in general and XML databases in particular. The main goal of a structural summary is to eliminate any redundant structural information in the underlying database without losing structural constraints, that is, the structural relationships between XML elements such as the parent-child and ancestor-descendant relationships. In the XML context, a structural summary is a smaller version of an XML tree in which all paths from the root node to any leaf node in the actual XML tree are preserved. Therefore, at any level in the XML tree, the nodes that can be reached by a specific path from the root node are grouped into a single node (called an extent) in the summary tree. The result is another tree with fewer nodes and, most of the time, deterministic navigation at any level of the tree. An early XML index using this approach is the Strong DataGuide [GW'97] from the LORE project [MAGQW'97]. In this representation, evaluating XML queries that only involve regular path expressions is significantly faster, because no recursive and/or backtracked navigation is required to access similar paths at each branch in the summary tree.
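The idea of grouping all nodes reachable by the same root path into one extent can be illustrated with a minimal sketch. This is not the Strong DataGuide algorithm itself: it is a simplified, element-only path summary written for illustration, the function name build_path_summary is hypothetical, and the small document is an invented stand-in for the ShefUni database.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_summary(root):
    """Map every distinct root-to-node label path to its extent,
    i.e. the list of element nodes reachable by that path."""
    summary = defaultdict(list)
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        summary[path].append(node)
        # Push children in reverse so they are popped in document order.
        for child in reversed(list(node)):
            stack.append((child, path + "/" + child.tag))
    return summary

# Hypothetical sample document (8 element nodes, 5 distinct paths).
doc = ET.fromstring(
    "<ShefUni><staffs>"
    "<staff><name>John</name><dept>DCS</dept></staff>"
    "<staff><name>Alice</name><dept>DCS</dept></staff>"
    "</staffs></ShefUni>"
)
summary = build_path_summary(doc)

# A simple path query is a single lookup in the summary.
names = [n.text for n in summary["/ShefUni/staffs/staff/name"]]

# A descendant-axis query such as //name has to scan every summary path.
name_paths = [p for p in summary if p.endswith("/name")]

print(len(summary), names, name_paths)
```

In this sketch a simple path query costs one dictionary lookup, whereas the //name query inspects all summary paths, which mirrors the behaviour described for descendant-axis and wildcard queries.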
However, for queries with a descendant axis or a wildcard node test, the query evaluation process requires navigating over the full index tree [H'05], and therefore the benefit from the index is limited to reducing the output size at each query step. Obviously, this is not the case when the actual XML tree contains node references: the index representation then becomes larger than the actual XML tree representation, because referenced nodes are repeated as child nodes under the nodes that reference them.

To solve the index size problem caused by node referencing, the 1-Index [MS'99] tries to group duplicated nodes into one extent with multiple edges pointing to it. The resulting summary is therefore no longer a tree, and this leads to multiple extents being selected for a single query step. The results must therefore be refined in an additional step [H'05]. Further summary indexes (such as the 2-Index [MS'99] [H'05]) followed the design of the 1-Index to overcome several existing problems in the Strong DataGuide [GW'97], such as the index-size problem and the absolute nature (footnote 3) of the index. Most of these attempts were still affected by some of these problems, such as the trade-off between the index size and its performance. As a result, some proposals tried to find alternative methodologies for reducing the index size. Some solutions use the notion of selectivity to index only certain portions of the XML database, such as the frequently accessed information. This idea is explained in the next subsection.

Selective XML Indexes

Selective XML indexes try to employ indexing techniques from the RDBMS literature. In an RDBMS, the database administrator (DBA) can choose certain fields (i.e. columns of a table) on which to build an index. These fields are selected so as to satisfy certain SQL queries that are frequently triggered by the database's users. When this technique is applied to an XML database, the corresponding index no longer represents the entire XML database. Therefore, the actual database has to be accessed in order to answer queries. Another problem to be considered when constructing a selective XML index is the type and the amount of updates that may happen in the indexed data.
As the document's structure is the main concern of any XML index, the XML data indexed by a selective index must be chosen from the rarely updated structure. This is because, when a certain structure is altered by XML update operations, the new structure will not necessarily satisfy the same set of frequently triggered XML queries. The DBA therefore has to reconstruct the index criteria according to these structure changes. A well-known XML index of this type is the T-Index [MS'99] [H'05]. The T-Index is summarised in [H'05] as follows: "the main idea of T-Index is to establish a structural summary that only covers path expressions fulfilling one specific template. The template describes the structure of the path expressions by node tests and placeholders. If the indexing system supports multiple path expressions with different structures that cannot be summarised with common template, we need several T-Indexes, one for each template."

Another problem that emerges from the above description of the T-Index is the cost of choosing an appropriate T-Index to satisfy a certain XML query, which triggers the query-rewriting (footnote 4) problem [H'05]. A possible solution to such problems might be achieved by using an adaptive indexing technique, which is able to change its structure incrementally according to the changes in the query workload. The notion of adaptive XML indexes is described in the following subsection.

Footnote 3: By this I mean the index can only be used to satisfy absolute regular path queries, such as queries that start with the descendant axis (i.e. the // axis).
Footnote 4: How to rewrite a query in such a way that it evaluates to the same set of results but with a lower processing cost (i.e. query optimization).

Adaptive XML Indexes

As introduced in the above section, this type of XML index tries to update its indexing criteria incrementally based on the most frequent XML queries. To satisfy this requirement, the database system (or the DBA) must keep track of all triggered queries along with the changes in the document's structure. There are three obvious drawbacks to this implementation. First, altering the index structure during the query-evaluation process involves expensive computation, which may slow down the query-evaluation process itself. Second, more storage space is required to keep the queries and/or the changes in the document's structure so that the index structure can be updated. Third, the index becomes a sort of selective index; that is, it is concerned with frequently triggered queries and ignores other queries which might involve heavy comparison operations. Two typical examples of adaptive indexes are the Adaptive Path Index (APEX [CMS'02]) and the index proposed in [CLO'03]. These indexes are discussed further with the path-encoding approaches (Section 5).

Conclusion

This section outlined three techniques that can be used to limit the size of an XML index. These techniques are concerned with reducing the storage space required to store the document's structure rather than the XML data (i.e. values), so that the index can reside in the computer's memory for fast access. These compression techniques divide into complete and partial XML indexes. XML queries benefit less from partial XML indexes, as these indexes are directed to specific classes of XML queries and/or ignore certain portions of XML databases which may be of interest to different users. In terms of the index proposed later, structural summaries, in conjunction with the XML data representation used by the index, offer a good opportunity to reduce the index size. Regardless of the compression technique used, the following section introduces a categorization of structural XML indexes based on the methodology used to encode the element relationships found in XML documents.

3.3. Structural-joins XML Indices

Structured XML queries (or containment queries) evaluate path expressions (query patterns) to return all matching nodes and/or check for the existence of the query pattern in the XML tree [WJWLJ'05].
Therefore, an index of this type needs to encode as many of the document's structures as possible in order to satisfy the widest range of such queries. On the other hand, separate indexes can be created for a single XML document, each of which is used with a specific class of structural queries. The basic framework of any structural index is to encode some structural relationships between XML nodes (elements, attributes, etc.) so that the query processor is able to predict results by simply consulting the corresponding index without accessing the actual data file. Researchers [ZNDLL ] [KJKPSW'02] [MW'99] [STZHDN'99] have identified that the parent-child and the ancestor-descendant relationships are sufficient to answer most classes of structured queries. However, the efficiency of the evaluation process varies from one class to another; as more of the document's structures are encoded in the index, better processing efficiency can be obtained [ZOIA'06a].

Based on the algorithm used for encoding the hierarchical structure of XML, the structure-based indexes are grouped into four categories. These categories are: simple node-encoding approaches (e.g. [TVBSSZ'02] [SLFW'05] [ZNDLL ]), path-encoding approaches (e.g. [CLO'03] [CMS'02]), sequence-based approaches (e.g. [RM'06] [WPFY'03] [WJWLLL'05]) and feature-based approaches (e.g. [ZOIA'06a] [YKKC'02]). The following four sections discuss these categories.

4. Node Encoding Approaches

Recall that data in XML documents is modelled as an ordered, node/edge-labelled, unranked tree [ABS ]; thus, during processing operations (e.g. query and update), the XML document's structure must be efficiently maintained. Keeping the structural relationships up to date within the index file is an important issue for querying XML data efficiently. In reality, most of the existing indexing techniques encode these relationships to speed up the evaluation of containment queries [CAO 6].

During the last decade, the literature has been rich with proposals for encoding XML's structural relationships efficiently via so-called node-labelling. The two main goals of the existing node-labelling solutions are: a) to assign a unique code to each node in the XML tree, and b) to preserve the nested hierarchical structure of XML documents (e.g. ancestor-descendant and parent-child relationships) during XML updates [TVBSSZ 2] [SLFW 5] [SKT ] [CKM 2] [FLSW 3] [EDR 5] [BP 5] [YLML 5]. However, minimizing the re-numbering cost (including the processing time and I/O accesses) in the case of data updates, and reducing the storage space required to store the generated codes, are essential requirements for any node-labelling proposal [TVBSSZ 2] [FLSW 3] [KMS 2].

In general, any node-labelling algorithm is either a prefix-based approach or a region-based approach. In this section, some proposals that best describe the characteristics of each category are discussed. In particular, Section 4.1 deals with prefix-coding approaches while the following section discusses region-based approaches. Section 4.3 concludes the discussion of node-encoding approaches.

4.1. Prefix Encoding Approaches

Numbering schemes of this type basically generate codes consisting of two parts: the prefix part, which encodes the preceding node's code, and the actual-code part, which encodes the order of the node among all nodes in the XML tree (or its siblings) using a specific tree-traversal algorithm such as pre-order or post-order traversal.
In this encoding, parent-child and ancestor-descendant relationships are included in the prefix part, while both parts of the generated code together form a unique identifier for each node.

There are many proposals of this type. A simple but most famous example is the Dewey [TVBSSZ 2] coding algorithm, where each node (except the root node) is given a Dewey code that consists of two parts: an incremental local number that reflects the position of the node among its siblings, preceded by the Dewey code of the node's parent. The two parts of the code are separated by a "." (i.e. dot). The root node's code consists of one part only because it has no parent. Figure 4.1 shows an XML tree labelled using the Dewey algorithm.

Another prefix algorithm is presented in [CKM 2], where the use of dot separators is avoided to reduce the storage space. The algorithm in [CKM 2] works in a similar manner to [TVBSSZ 2]: starting from the root node, which has its own code, the first child is given an initial code, and a further symbol is then added to the left-hand side of the first child's code to form the second child's code, and so on. Each child's code is preceded by its parent's code as a prefix. Figure 4.2 exemplifies the node labels produced by the coding algorithm proposed in [CKM 2]. Clearly, the full path from the root node to any node can be obtained directly from the node's label in the case of the Dewey algorithm, whilst this is not possible with the encoding algorithm presented in [CKM 2].

A better prefix coding algorithm called Prefix Perfect Binary Tree (P-PBiTree [YLML 5]) is built on top of PBiTree [WJLY 3]. The P-PBiTree algorithm reserves codes (of size m bits) for siblings and descendants of a node in case an update occurs at that position. These codes are evenly distributed among the children nodes. Unlike the previous prefix algorithms, P-PBiTree also works when the updates occur in the middle of an XML tree.

An intelligent prefix-based coding algorithm was proposed by Kaplan et al. in [KMS 2]. The algorithm uses the idea of a compressed XML tree to reduce the storage space required to store the codes.
The algorithm simply identifies all distinct paths from the root node to all possible leaf nodes and builds a virtual compressed XML tree. Then, a normal prefix-based algorithm is applied to encode the virtual XML tree. Although this technique minimizes the time required to access a certain node by eliminating unrelated PATH nodes in the first round, the algorithm complicates the comparison test that is required to identify the ancestor/descendant relationships.

Figure 4.1: A Dewey Encoding (Example)

Figure 4.2: [CKM 2] Encoding (Example)

Prefix-based numbering algorithms have several features which make them suitable for some implementations, but there are also cases (e.g. depth-oriented XML documents [MO 6]) where their performance drops and therefore better algorithms are required. Like other types of numbering techniques, prefix-based algorithms uniquely identify nodes in the XML tree [TVBSSZ 2] [CKM 2] and preserve the most important structural relationships (namely the parent-child and the ancestor-descendant relationships) between these nodes [YLML 5] [WJLY 3]. In addition, the generated code reflects the direct path (i.e. the linking node-set) between any two descendant nodes in the XML tree [TVBSSZ 2].

However, in the case of XML updates, the process of re-coding becomes costly, although it can be minimized by leaving gaps [TVBSSZ 2] between the codes of successive nodes. The reason is that the quantity and the place of such updates are unpredictable [W3C] [YLML 5], and therefore an intelligent update-prediction algorithm is required to estimate the size of these gaps, which may ultimately become exhausted. Furthermore, XML updates most likely happen at the lower levels of the XML tree [TIHW ]. So, the number of re-calculated codes becomes high because prefix-based numbering schemes propagate from top to bottom and left to right in the XML tree [YLML 5] [HHMW 7], and therefore all descendants and the following siblings need to be re-numbered.

Another deficiency of prefix-based algorithms is the amount of storage space required for the generated codes. As the XML tree goes deeper, the length of the generated code increases cumulatively; thus more storage space is required [MO 6].

In summary, the prefix-encoding algorithms perform well in static XML operations [TVBSSZ 2]. However, there is a need to reduce the amount and complexity of the re-numbering process during update operations.
Also, the storage space, which increases in proportion to the depth and the breadth of the XML tree, should be kept small, especially when the implementation takes place in memory. Some of these problems are addressed by region-based encoding proposals, which are discussed in the following section.

4.2. Region-based Encoding Approaches

The basic idea of any region-based encoding algorithm is to encode the ancestor-descendant relationships (and/or the parent-child relationships) in an XML document by attaching two variables to each node, namely startid and endid. The first variable stores the node ID of the first descendant element/attribute node (which is in most cases the node ID of the node itself). The second variable (i.e. endid) holds the node ID of the last descendant node of the current node [ZNDLL ]. Node IDs are usually obtained by applying the pre-order traversal [D 82] [LM ] technique to the document's tree [ACLLF 6] [ZNDLL ]. Some proposals add two extra variables to these codes: one stores the level of the current node (variable name "level"), and the other stores the document ID (variable name "docid") [e.g. ZNDLL ]. The docid variable is required to identify the XML document to which the current node belongs when an XML database consists of multiple XML documents. Similarly, the level variable works in conjunction with startid and endid to identify the parent-child relationship. So, the node is eventually associated with the following vector, which holds the minimum information needed to process containment queries [ZNDLL ]: (docid, startid, endid, level). Some implementations add extra information for different reasons, such as pathid, which refers to the path from the root node to the current node [e.g. LN 4]. Figure 4.3 illustrates typical region-based codes for an XML document.

Figure 4.3: An Example of a Region-based Coding Algorithm

There are many node-encoding proposals using region-based numbering schemes in the literature, such as [ACLLF 6] [ZNDLL ] [DTCO 3] [YASU ]. The majority of region-based proposals are used in XML-to-RDBMS mapping implementations to support the evaluation of containment queries over the resulting relational schema. All existing region-based algorithms share the idea mentioned above, but they vary in the way in which the vector's variables are computed. For example, the algorithm in [ACLLF 6] uses the pre-order traversal technique (as described above) to assign the nodes their object identifiers (i.e. oids). These oids are then substituted into the node's vector to encode the document's relationships. In contrast, the technique proposed in [ZNDLL ] builds two inverted indices for the document's tokens (note: a token can be a tag name or a single textual-content unit, e.g. a word or a number, within XML elements/attributes), and uses the token's location (in the corresponding inverted index) to code the values of startid and endid. The node's code in [ZNDLL ] differs slightly from that in [ACLLF 6]. In [ZNDLL ], the nodes of the elements and the attributes are associated with (docid, startid:endid, level), and the nodes holding the textual contents are tagged with (docid, positionid, level), where positionid is the distance (in number of words) of the token itself from the first token in the XML document. The algorithm in [DTCO 3] is identical to the one in [ACLLF 6] except that it uses the offsets in a string representation of the XML document.

Like prefix-based numbering algorithms, region-based encoding algorithms have features that make them perform better than other algorithms in some implementations and less efficiently in others. Any region-based algorithm performs better than prefix-based algorithms (e.g. [TVBSSZ 2] [YLML 5]) in terms of the amount of storage space required to store the codes.
It is clear from the example in Figure 4.3 that the amount of space for each node's label depends on the amount of space reserved for the node's identifier vector. If each vector requires m bytes and the XML tree has n nodes, then the total storage space required to store all the XML codes is m × n. In addition, the identification of ancestor-descendant (and/or parent-child) relationships is much easier in region-based numbering schemes [ZNDLL ]. Given any two codes, say node1: (doc, s1, e1, l1) and node2: (doc, s2, e2, l2), node1 is an ancestor of node2 if (s1 < s2) and (e1 >= e2). Furthermore, node1 is identified as the parent of node2 if the following condition additionally holds: (l1 = l2 - 1).

In terms of XML updates, region-based algorithms use the same mechanism as the prefix-based techniques. During the pre-order traversal of the XML tree, some gaps are left between the codes of successive nodes in order to reserve room for any expected insertions [DTCO 3]. Alternatively, floating-point numbering might be used, so that extra decimal places can be added to the preceding code when coding newly inserted nodes. However, as in prefix-based techniques, the number and the location of the insertions are unpredictable [W3C] [YLML 5]; thus, some gaps are eventually consumed and, consequently, re-coding will be required. Unfortunately, the re-coding process in region-based numbering schemes is more expensive than that in prefix-based numbering schemes. In the former, the coding process propagates from the bottom of the XML tree towards the root node. In practice, because most of the updates (e.g. insertions) happen in the lower levels, the re-coding process affects all the ancestors [LLHC 4]. A similar complexity is encountered when one wants to find the path (i.e. the linking node-set) that links any two nodes in the XML tree. In this situation, it is necessary to walk through the whole tree of descendant (or ancestor) nodes, checking the parent-child relationships for all successive pairs until reaching the destination node [TVBSSZ 2] [LLCC 5].

To conclude, like prefix-coding approaches, region-based algorithms perform well in static XML operations, although they involve more comparison operations when identifying structural relationships. However, the cost, in terms of storage space, of the region-based numbering approaches is much lower than that of the prefix-numbering approaches. Therefore, reducing the cost of comparisons and adding the ability to infer paths from the nodes' codes are valid enhancements to the region-based numbering approaches. In [LLCC 5], Lu et al. approached such enhancements in their Extended Dewey algorithm, which combines the Dewey [TVBSSZ 2] technique with a region-based numbering technique such as [BKS 2] [ZNDLL ] [KJKPSW 2].

4.3. Conclusion

Zou et al. [ZLC 4] summarized the architecture of all numbering techniques for XML nodes as follows: they create indexes on each node by its positional information within an XML data tree. Such index schemes can determine the hierarchical relationship between a pair of nodes in constant time. They also use a node as the basic query unit, which provides great query flexibility. However, these indexes are intensively updated when the underlying database is modified. In terms of query evaluation costs, for queries such as those involving recursive structural joins, node-encoding algorithms consume more time and storage space to store the intermediate results. In addition, these techniques pay no attention to indexing XML textual content (i.e. values) except for some statistical figures about the appearance of distinct words in the indexed XML document [ZLC 4].

The next section discusses the architecture of path encoding approaches, which is another class of indexes used to encode XML structural joins.
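To make the two labelling families of this section concrete, the following is a minimal Python sketch. It is not taken from any of the cited proposals. It assigns Dewey-style labels and (startid, endid, level) region labels in a single pre-order pass, and then applies the ancestor and parent tests described above. The names Node, label_tree, dewey_is_ancestor, region_is_ancestor and region_is_parent are illustrative assumptions; docid is omitted because a single document is assumed, and no gaps are left for future insertions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    dewey: Tuple[int, ...] = ()               # e.g. (1, 2, 1) stands for "1.2.1"
    region: Tuple[int, int, int] = (0, 0, 0)  # (startid, endid, level)


def label_tree(root: Node) -> None:
    """Assign Dewey and region labels in one pre-order traversal (assumed scheme)."""
    counter = 0

    def visit(node: Node, dewey: Tuple[int, ...], level: int) -> None:
        nonlocal counter
        counter += 1
        start = counter                       # pre-order ID of the node itself
        node.dewey = dewey
        for position, child in enumerate(node.children, start=1):
            visit(child, dewey + (position,), level + 1)
        # After the subtree is visited, counter is the ID of its last descendant.
        node.region = (start, counter, level)

    visit(root, (1,), 1)


def dewey_is_ancestor(a: Node, d: Node) -> bool:
    """a is an ancestor of d if a's label is a proper prefix of d's label."""
    return len(a.dewey) < len(d.dewey) and d.dewey[:len(a.dewey)] == a.dewey


def region_is_ancestor(a: Node, d: Node) -> bool:
    """Region test from the text: (s1 < s2) and (e1 >= e2)."""
    s1, e1, _ = a.region
    s2, e2, _ = d.region
    return s1 < s2 and e1 >= e2


def region_is_parent(a: Node, d: Node) -> bool:
    """Parent test: the ancestor test plus (l1 = l2 - 1)."""
    return region_is_ancestor(a, d) and a.region[2] == d.region[2] - 1
```

Note that the Dewey label grows with the depth of the node, whereas the region label has a fixed size, which mirrors the storage comparison made in this section.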

5. Path Encoding Approaches

Indexing techniques of this type share the idea of creating a path summary for XML data to speed up query evaluation. That is, for an XML document, its path summary includes entries for all possible distinct paths from the root node to any arbitrary node in the document [CLO 3]. An XML query is then evaluated by simply accessing the path summary without the need to process the underlying XML document [BCM 5]. However, as far as query complexity is concerned, XML path queries are divided into simple-path queries and complex-path queries (i.e. queries with multiple path branches). Barta et al. [BCM 5] stated that path-indexing techniques are ideal for evaluating single-path queries rather than multiple-branch queries (or twig queries). The reason is that evaluating such complex-path queries involves expensive join operations that are required to unite the results returned by the multiple branches in order to construct the final output that matches the entire query expression.

In this discussion, path-summary proposals are divided into two categories: proposals used in XML-to-RDBMS techniques, and proposals used in indexing native XML data. The reason for this separation is that, although both categories use the idea of building path summaries, the first category uses path summaries to design storage-efficient implementations in an RDBMS and therefore uses RDBMS engines to query XML data, whilst the second category uses path summaries to construct efficient indexes that speed up XML queries in native XML query languages. Section 5.1 explores some path-summary techniques used in XML-to-RDBMS mapping implementations, and the following section (Section 5.2) discusses some native implementations of path indexes. The last subsection concludes the discussion of path-encoding approaches.

5.1. Path-encoding in XML-to-RDBMS Mapping

Many techniques have been proposed to store XML data in an RDBMS in order to leverage the advantages of relational database engines in storing, querying, indexing and updating data.
Recent yet popular examples of these techniques include in-lining [BFM 5], ShreX [YDF 4], and Edge++ [BFM 5]. Among tens of such XML mapping techniques, there are algorithms that store XML paths in pre-defined relations in order to preserve the XML structure, such as parent/child and ancestor/descendant relationships [YASU ] [JLWY 2] [LN 4]. While this approach performs excellently in answering path queries over XML data stored in an RDBMS [KR ] [TDCZ 2], other issues, such as expensive recursive joins and the cost of data updates, need to be addressed.

XRel [YASU ] and XParent [JLWY 2] are two XML-to-RDBMS mapping techniques that use a predefined relational schema of four tables to store both the data and the structure of XML data. However, the two techniques have different layouts for these four tables, especially the number and the design of the tables (among the four) that store XML structural relationships. In XRel [YASU ], only one table, called the Path relation, is used to store the possible distinct XML paths. The Path table has two fields: path_id, which holds a short unique identifier for each XML path, and Path_Desc, which stores all possible unique paths from the root node to any leaf node. On the other hand, XParent [JLWY 2] uses two tables (namely LabelPath and DataPath; see also Figure 5.1) for the same purpose but without storing region information (see Region-based Encoding Approaches in Section 4.2) [YASU ]. Thus, for a simple path query that only contains parent/child relationships, the evaluation process is straightforward, and no recursive joins are required in order to identify containment relationships. However, when a query contains one or more ancestor/descendant relationships, the evaluation process becomes more complicated, and many joins will be involved on the Path (or LabelPath and DataPath) table(s). In the XParent implementation, the problem is worse because the evaluation process requires joining the DataPath relation to itself many times. In addition, during XML update operations, especially operations that affect the document structure, records are subject to frequent updating.
However, this may not be a critical problem since most of the updates occur at the leaf nodes (data nodes) rather than at upper-level nodes. Finally, such mapping techniques are used to minimize the storage space required for storing XML data in an RDBMS [KR ] [TDCZ 2].

INode [LN 4] is another XML-to-RDBMS mapping implementation that stores XML path summaries in a predefined relational schema. The relational schema used in INode has one table fewer than XParent and XRel (i.e. its schema contains the following tables: Path, Element, and Attribute). The experimental results in [LN 4] show that this schema further reduces the number of external joins (and possibly storage space) when performing queries that involve the parent-child relationship or the ancestor-descendant relationship. However, evaluating ancestor/descendant relationships still requires expensive recursive SQL queries in RDBMS implementations.

Unlike XRel, XParent and INode, Monet [SKWW ] creates a separate relation for each unique path in an XML document to avoid self-joins or recursive queries. However, there will obviously be many external joins between different path tables even for a simple query. So, this technique performs no better than the above techniques in terms of query evaluation complexity [KR ] [TDCZ 2].

Figure 5.1: Relational Schema for XParent

To sum up the above discussion, some XML-to-RDBMS mapping techniques create one or more relations for storing all possible distinct paths in an XML document in order to maintain the document's structure, including parent/child and ancestor/descendant relationships. Although this technology reduces the amount of storage space required for storing XML data in a corresponding relational schema, and at the same time preserves its structural relationships, most of the proposed techniques suffer from the internal-joins problem and the difficulty of satisfying recursive queries.

5.2. Native XML Indexes using Path Summary

The underlying technology of XML path indexing, especially in native XML implementations, is XML path summaries. Structural summaries, another term describing the same technology, have been proposed to prune the search space while evaluating path expressions [CLO 3]. A path summary for an XML document is simply a list of all distinct paths that make up the actual XML document.
A collection of all path summaries for an XML document or a set of XML documents (i.e. an XML database) are linked together in a tree-like graph called a summary graph [HHK 95]. Thus, a summary graph for an XML document preserves its structure with fewer nodes and edges. As a result, to evaluate an XML query over an XML database, it would be sufficient to search the corresponding summary graph. Although they share the idea of XML path pruning in order to minimize the search space, different summary-based indexes use different strategies to evaluate XML queries.

Any path summary (i.e. path index) can be either an exact path summary or an approximate path summary [BCM 5]. An exact path summary represents every distinct path of the underlying data tree in the resulting summary graph, whilst an approximate path summary includes only paths up to a certain depth. In both cases, the summary is used as a back-end technology for evaluating regular path expressions (most XPath queries are regular path expressions). In addition, an exact summary can work as a schema for the underlying database, which creates an opportunity for better query optimization [BCM 5].

This class of XML indexes is best represented by the following proposals: 1-Index [MS 99], A(k)-Index [KSBG 2], APEX [CMS 2], and D(k)-Index [CLO 3]. Critically reviewing these techniques is the aim of this section.
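The distinction between exact and approximate path summaries can be sketched as follows. This is a simplified illustration under stated assumptions, not the data structure of any cited index: the summary is a plain dictionary from root-to-node label paths to the elements they reach, attributes and text values are ignored, and the function names build_path_summary and evaluate_simple_path are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_path_summary(xml_text, max_depth=None):
    """Map each distinct root-to-node label path to the elements reached by it.

    max_depth=None gives an exact summary; an integer keeps only paths of at
    most that many steps (an approximate, depth-limited summary).
    """
    summary = defaultdict(list)

    def visit(elem, prefix, depth):
        if max_depth is not None and depth > max_depth:
            return
        path = prefix + "/" + elem.tag
        summary[path].append(elem)
        for child in elem:
            visit(child, path, depth + 1)

    visit(ET.fromstring(xml_text), "", 1)
    return dict(summary)


def evaluate_simple_path(summary, path):
    """Answer a simple path query such as '/dept/staff/name' from the summary alone."""
    return summary.get(path, [])
```

With an exact summary, a simple path query is answered by a single lookup without walking the document tree; with a depth-limited summary, any query deeper than the limit has to fall back to the underlying data.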

All existing XML path summary proposals are based on a bi-similar representation of XML data. In such representations, any two nodes are said to be bi-similar if all incoming paths (i.e. paths descending from the root node) to them are identical [HHK 95] [PT 87]. The set of bi-similar nodes is grouped into an equivalence class, and this class forms an I-Node (i.e. Index-Node; some proposals call this node an extent node) in the resulting summary graph. Similarly, an edge that connects a pair of I-Nodes in a summary graph is called an I-Edge (i.e. Index-Edge).

The 1-Index summary [MS 99] is a complete summary which represents the whole XML tree in the summary graph so that a query can be evaluated without consulting the actual XML tree. However, the notion of completeness in XML path summaries sometimes results in a large summary graph, which makes the search space almost equivalent to the actual XML tree, and therefore the processing speed is hardly improved [BCM 5].

The A(k)-Index [KSBG 2] was proposed to overcome the large index size problem encountered in the 1-Index. It reduces the search space by considering incoming paths of length k only. This results in lower storage space and faster processing, but unfortunately the index becomes approximate, and consequently further operations are required in order to eliminate false matches returned by the primary index. An obvious observation about the A(k)-Index is that there is a trade-off between the size of the index, which depends on the k factor, and its performance. Bigger values of k minimize additional processing but, on the other hand, increase the index size and consequently increase the processing time.

So far, two main problems have been identified in both the 1-Index and its derivative, the A(k)-Index. These problems are the large index size and its inefficiency; both are caused by the fact that the underlying summary graphs are static, that is, the indexing criteria are not adjustable according to the query workload and/or the size of the indexed data.
Because of this, a workload-aware path index, called APEX [CMS 2], was proposed by Wan et al. in 2002. The APEX index enhances the summary structure by: 1) incorporating a hash index for efficiently evaluating frequent queries, and 2) calling an algorithm which is able to adjust the index according to the query workload. In [CLO 3], Chen et al. further improved the A(k)-Index by adding a technique similar to the one found in APEX, which considers changes both in the query workload and in the source data. So, the D(k)-Index serves as a dynamic summary index that results in a smaller index size and better performance. In addition, the dynamism of the D(k)-Index can be tuned to simultaneously update the index and evaluate queries.

In conclusion, path summaries that are built on top of bi-simulation graphs [HHK 95] [PT 87] are powerful techniques for indexing native XML databases because they efficiently preserve the document's structure with low storage space. Once an exact, optimal summary is obtained, the corresponding index becomes an excellent memory-based solution for indexing and querying XML. However, indexes of this class are basically structure-based techniques with little or no attention to indexing contents (i.e. values). Therefore, great effort is required to make them ideal solutions for a wider range of XML query types that involve value comparisons.

5.3. Conclusion

The above two sections (Section 5.1 and Section 5.2) discussed some typical XML path-indexing proposals. Path indexes have been used in both XML-to-RDBMS mapping and native XML indexing. In the first implementation (e.g. in-lining [BFM 5], ShreX [YDF 4], Edge++ [BFM 5]), all possible paths from the root node to the leaf nodes are stored in pre-defined relational schemas. These relations are then used to 1) speed up XML queries over a relational database engine, 2) reduce the storage space, and 3) avoid expensive structural joins that are required to evaluate branching path queries. Unfortunately, this technology has several drawbacks, such as recursive self-joins (i.e. recursive SQL queries) even for simple path expressions. In addition, updating the underlying XML data dramatically increases the update processing for the mapped data.

Alternatively, path summaries have been used to construct native XML indexes (e.g. 1-Index [MS 99], A(k)-Index [KSBG 2], APEX [CMS 2], and D(k)-Index [CLO 3]). All existing path summaries use the notion of bi-simulation graphs [HHK 95] [PT 87]. This class of XML indexes has demonstrated a noticeable improvement in performance in terms of speed and efficiency. However, there is a trade-off between the size of a path summary index and its performance. The problem could be solved by designing a flexible path summary that can adjust its size according to the query workload and the database size. Additionally, because of their efficiency in describing XML documents and their noticeably smaller size with respect to actual XML trees, bi-simulation graphs can be employed in designing efficient XML feature-based indexes. The framework of feature-based indexes is the subject of Section 7. The following section (i.e. Section 6) explores sequence-based indexing approaches.
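The trade-off between summary size and precision noted above can be illustrated with a small sketch. It is a simplification that assumes tree-shaped data, where grouping elements by their own tag plus the tags of their k nearest ancestors approximates the k-bounded incoming-path idea behind the A(k)-Index. It is not the algorithm of [KSBG 2], and the function name ak_partition is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def ak_partition(xml_text, k):
    """Group elements by their tag plus the tags of up to k nearest ancestors.

    Each key plays the role of one index node; its list is the node's extent.
    """
    classes = defaultdict(list)

    def visit(elem, ancestors):
        # ancestors[-0:] would return the whole list, so k == 0 is handled apart.
        context = tuple(ancestors[-k:]) if k > 0 else ()
        classes[context + (elem.tag,)].append(elem)
        for child in elem:
            visit(child, ancestors + [elem.tag])

    visit(ET.fromstring(xml_text), [])
    return dict(classes)
```

For a document such as `<db><book><title/></book><paper><title/></paper></db>`, k = 0 yields four classes and places both title elements in the same class, so a query for book titles would need a validation step against the data; k = 1 yields five classes and separates the two, at the cost of a larger summary.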

6. Sequence-based Indexing Approaches

The literature shows that an XPath query can be modelled as a tree structure similar to XML trees [WPFY 3]. It has also been shown that if an XML query tree is a subset of an XML document tree, then the query will return some results when it is processed against the document. This technique, therefore, can be used to predict the existence of query results in the XML document. One way to implement such a technique is to convert both the XML query and the XML document into sequences and use well-established sequence-matching techniques to obtain query answers (e.g. [RM 6] [PK 5] [MJCW 4] [WPFY 3]). The good feature of sequence-based indexing techniques is that they use the entire query tree as one unit; thus the expensive join operations are avoided during query evaluation while the structural relationships are preserved. This section discusses the performance of this class of XML indexes using some examples from the literature.

Figure 6.1: An XML Tree and its Structure-encoded Sequence

To illustrate how this class of indexing techniques works, consider the technique presented in [WPFY 3]. The XML tree in Figure 6.1 is first converted into a sequence, which is shown at the bottom of the same figure. Likewise, an XML query listing staff from the "DCS" department is expressed in four different formats (i.e. verbal, XPath, tree form, and sequence form) in Figure 6.2. To evaluate this query, the query processor searches for the query's sequence in the document sequence. If the query sequence is a non-contiguous subsequence of the document sequence, then results are returned by the query; otherwise the query evaluates to an empty set. In this example, the query sequence is a non-contiguous subsequence (see the underlined entries in Figure 6.1) of the document sequence. Therefore, some results will be returned by the query processor.

Figure 6.2: A Query Representation using the [WPFY'3] Technique
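The matching step described above can be sketched in a few lines of Python. This is an illustrative simplification, not the encoding of [WPFY 3]: each element is assumed to be encoded as a (tag, path-of-ancestor-tags) pair in pre-order, text values are ignored, and the function names encode and is_subsequence are hypothetical.

```python
import xml.etree.ElementTree as ET


def encode(elem, prefix=()):
    """Pre-order sequence of (tag, tuple of ancestor tags) pairs for a tree."""
    sequence = [(elem.tag, prefix)]
    for child in elem:
        sequence.extend(encode(child, prefix + (elem.tag,)))
    return sequence


def is_subsequence(query_seq, doc_seq):
    """True if query_seq occurs in doc_seq in order, not necessarily contiguously."""
    remaining = iter(doc_seq)
    # 'item in remaining' consumes the iterator up to the first match,
    # so later query items are only searched for after earlier ones.
    return all(item in remaining for item in query_seq)
```

A query tree is encoded with the same function as the document, and a True result only predicts that the query has answers; locating the answers themselves would still require access to the indexed data.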

A superior feature of this approach over the other classes of indexing techniques, especially node-encoding approaches (e.g. [ZLC 4] [KMS 2]), is that it incorporates a deeper consideration of indexing textual contents (i.e. values). To do so, the indexing algorithm injects invented textual nodes into the XML/query sequences after calculating their labels using a simple hashing [S 7] index. Also, the storage space required to store the generated document sequence is linear in the XML tree size (i.e. number of nodes), which is less than the storage space required to store codes generated by node-encoding index algorithms [BG 6].

Although they speed up query pre-evaluation and eliminate the expense of structural joins, this class of XML indexes has two major limitations. Firstly, the XML sequence generated by any sequence-based algorithm will be reconstructed frequently because of: 1) updates to the underlying XML tree structure (e.g. inserting new nodes), and 2) updates to the textual contents, which result in updates to their codes. While the first problem causes a major change in the generated XML sequence, the second problem can be avoided by using a simple mapping list that matches every textual entry with a fixed code.

The second limitation of sequence-based indexes is the use of hashing algorithms to encode the textual contents (i.e. to generate codes for the textual nodes in the XML document). Here, the hashing algorithm simply calculates a hash key for every single textual node and uses these keys in the XML sequence. Because the set of textual nodes in each document is potentially unbounded, the hash list grows very quickly, and eventually the hash-fetching process itself becomes slow.

7. Feature-based Indexing Approaches

Indexing and querying algorithms of this type often use similarity search techniques [WDJLCL 7] [WSB 98] to compare a set of features of a query against a similar set of features in the database in order to evaluate queries. In the XML context, feature-based indexing algorithms encode one or more structural features (i.e. structural relationships such as the parent-child relationship) of the XML query pattern on one side, and the same set of features of the XML documents (i.e. the XML database) on the other side, and then use this information to check the query's existence in the XML database [YKKC 2] [ZOIA 6a]. The actual result is then obtained by conducting an additional traversal of the actual XML data. Depending on the class of targeted XML queries, the type and the amount of the encoded XML features vary from one technique to another. In addition, various data structures have been proposed to represent these features in main memory or on disk, such as adjacency-matrix similarity testing [ZOIA 6a], hash-based similarity search [S 7], and rank-based similarity search [WDJLCL 7].

This section initially defines the term Similarity Search as described in the literature (Section 7.1). Next, Section 7.2 discusses some similarity-searching tools and their data-structure backends, along with some implementations used by the information retrieval community. The following subsection (Section 7.3) explores the literature for some recent feature-based indexing proposals.

7.1. What is Similarity Search?

Wang et al. [WDJLCL 7] define the term Similarity Search as searching a collection of objects to find objects similar to a given query object. In human perception, two objects are said to be similar if they share a specific number of features. The similarity might be exact or approximate. In exact-similarity search, the duty of the query processor is to find all identical objects that match the query object. The main problem of exact-similarity search techniques is their exponential cost in terms of search time and search space [DL 76].
Therefore, researchers have studied the problem of finding the objects nearest to the query object within a distance d. This is called approximate-similarity search. Here, objects are described in terms of high-dimensional feature vectors, and the similarity search is therefore conducted by calculating the distance between the query and each searched object in the underlying feature space [WHCL 6a, WHCL 6b]. Returned objects are then ranked based on that distance metric. That is, for a query object Q, the goal of any approximate-similarity search technique is to find all objects S_i such that the distance λ(Q, S_i) between Q and S_i is less than a certain number. In general, the best matching objects yield the smallest values of the function λ.

Although it saves the high cost of intensive pair-wise comparisons by limiting such comparisons to a small number of searched objects, the efficiency and the accuracy of an approximate-similarity search technique depend, respectively, on the amount and the type of extracted features [WDJLCL 7]. Fortunately, XML implementations of similarity search techniques have fewer drawbacks, as there is a fixed number of features to be encoded in order to perform the similarity check for the majority of XML query classes [ZOIA 6a]. XML indexing techniques using this mechanism are discussed in Section 7.3. The next section explores some data structures that are used to encode the features of the indexed objects.

7.2. Implementations and Data-structures

Similarity searching is widely used in today's information retrieval technology. This section aims to highlight some areas where such techniques are used, to show their power. To achieve this goal, two examples of such implementations, namely searching graph databases and text processing, are discussed in particular because they link to the discussion of XML feature-based indexing in terms of the size and the type of the manipulated data. Other implementations include archived video [ROS 4] [WHCL 6a] [WHCL 6b] [GACW 5] techniques. The discussion is intended to be brief for the purpose of introducing the next section (Section 7.3).

Text Processing: Similarity search has been used in text processing and document retrieval for a long time [SL 68] [DM 85]. The main purpose of this technique is to find all documents (or text snippets) that match a query which is mainly composed of a set of keywords. One of the earliest and simplest techniques was inverted lists [ZMS 92]. Here, each document (or piece of text) is inspected for the required keywords after stop-word removal and stemming [FF 3] of the remaining ones. Then, the frequency of each keyword is calculated in each document to give the document its weight (or features). Next, this information is stored in a so-called inverted list (usually implemented with hash tables [TS 84]) which stores, for each keyword, all document IDs and the corresponding frequency of that keyword. By using Robertson's BM25 formula [RWBGP 95], the similarity between the query (which is also represented in the form of keywords and their frequencies) and each document is calculated to give each document its rank against the query. Depending on the ranking results, a document is either returned by the query processor or discarded.

There are other similarity-calculation techniques as well as other text-processing methodologies. However, listing and discussing them is outside the scope of this report. In [MZH 5], Moffat et al. have outlined some IR-related proposals written between 1980 and 2004. Recent IR-related proposals are also found in [SH 6] [ZM 6] [BCG 5] [TNI 4].

In the XML context, there are two main occasions where inverted lists can be used effectively. In document-centric XML databases, the contents of XML documents play the major role in determining the results returned, rather than the document's structure. Therefore, variations of existing information retrieval (IR) techniques are used to index and query document-centric XML databases.
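The inverted-list pipeline described above (keyword frequencies per document, then ranking against a query) can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: stop-word removal and stemming are omitted, the scoring uses a common textbook form of BM25 with the customary default parameters `k1=1.2` and `b=0.75`, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_inverted_list(docs):
    """Map each keyword to {doc_id: term frequency}; also return document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        lengths[doc_id] = len(words)
        for word, freq in Counter(words).items():
            index[word][doc_id] = freq
    return index, lengths

def bm25_rank(query, index, lengths, k1=1.2, b=0.75):
    """Rank documents against the query keywords, best match first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        # Inverse document frequency: rarer keywords weigh more.
        idf = math.log((n_docs - len(postings) + 0.5) / (len(postings) + 0.5) + 1.0)
        for doc_id, freq in postings.items():
            # Length normalisation: long documents are penalised slightly.
            norm = freq + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * freq * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical three-document collection.
docs = {
    "d1": "xml index xml query",
    "d2": "relational database index",
    "d3": "graph matrix",
}
index, lengths = build_inverted_list(docs)
ranking = bm25_rank("xml index", index, lengths)
print(ranking)
```

Documents that contain none of the query keywords never appear in the ranking, which corresponds to the "discarded" case in the description above.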
An example of an IR-based implementation for document-centric XML is the RANK system [GSBS 3], which differs from non-XML-oriented IR systems in that 1) it matches deeply nested XML elements instead of matching the entire document, and 2) its ranking process is based on the granularity of XML elements rather than the granularity of the entire document. A detailed discussion and more examples of this type of implementation are found in Section 3.1.

Inverted lists are also used to index data-centric XML databases. There are cases where such implementations are used to index tag names and the virtual nodes that are created to hold leaf-node values in structural indexes. More discussion on this is found in Section 5 and Section 6.

Graph Indexing: Similarity search has also been used to search chemical-compound databases [YYH 5] [FR 5] [YZYH 6]. A chemical compound is often represented as a cyclic graph, and a chemical database is then defined as a forest of such disconnected graphs. This is similar to an XML database, which is modelled as a forest of XML trees (see Section 2.2). So, similarity search is used to check for the existence of the query (a compound graph) in a chemical-graph database by matching some features of the query with each graph in the database. It is important to note that such a comparison is used to eliminate as many non-matching graphs as possible at an early stage, before conducting any pair-wise similarity computation to return the final results.

In order to facilitate the similarity filtering, Yan et al. [YZYH 6] used what is called a feature-graph matrix [SWG 2]. In this matrix, the indexed chemical graphs are aligned in the columns whereas each row represents a feature. Each entry in the matrix holds the calculated value of a specific feature for the corresponding graph. For example, if there are four graphs (say G1, G2, G3 and G4), each of which is searched for three features (say fa, fb and fc), then a corresponding feature-graph matrix is given in Figure 7.1.
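A feature-graph matrix of this shape, and the early filtering it supports, can be sketched as below. The entry values are invented for illustration, since they are not taken from the figure, and the pruning rule (a graph can only contain the query if it has at least as many occurrences of every query feature) is an assumed, simplified criterion rather than the exact test of [YZYH 6].

```python
# Rows are features, columns are graphs; every count here is hypothetical.
feature_graph = {
    "fa": {"G1": 0, "G2": 1, "G3": 2, "G4": 2},
    "fb": {"G1": 1, "G2": 0, "G3": 3, "G4": 1},
    "fc": {"G1": 4, "G2": 1, "G3": 0, "G4": 3},
}

def candidate_graphs(matrix, query_features):
    """Keep only graphs whose feature counts cover those of the query."""
    graphs = set().union(*(row.keys() for row in matrix.values()))
    for feature, needed in query_features.items():
        row = matrix.get(feature, {})
        graphs = {g for g in graphs if row.get(g, 0) >= needed}
    return sorted(graphs)

def add_graph(matrix, graph, counts):
    """Adding a graph only appends one entry per row; nothing is rebuilt."""
    for feature, row in matrix.items():
        row[graph] = counts.get(feature, 0)

survivors = candidate_graphs(feature_graph, {"fa": 1, "fc": 1})
print(survivors)

add_graph(feature_graph, "G5", {"fa": 1, "fc": 2})
print(candidate_graphs(feature_graph, {"fa": 1, "fc": 1}))
```

Only the surviving graphs would then be passed to the expensive pair-wise comparison.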
A very useful property of this representation is that a graph and/or a feature can easily be added to or removed from the index without reconstructing the index. In addition, the index can be implemented using a linked list where each entry points to an array (or another linked list) that contains the entries of a row in the matrix [YZYH 6].

Figure 7.1: Example: Feature-graph Matrix

In the literature, FIX [ZOIA 6a] seems to be the first and the only solution that uses a feature-based matrix technique specifically to index XML data. However, there are several differences between this XML-based index (i.e. FIX) and the one used in [YZYH 6] or any graph-based index. A detailed discussion of FIX and its related technologies is the subject of the next section (Section 7.3). More about graph and matrix theories and their possible employment in indexing XML is outlined elsewhere in this report.

7.3. XML Feature-based Indexes

As in chemical-graph databases, feature-based indices can also be used to speed up XML processing [YZYH 6]. However, there are major differences between chemical graphs and XML trees. For example, chemical graphs in most cases contain cycles [YZYH 6], whereas cycles in XML trees are rare and, if they exist (e.g. through IDREF and IDREFS attributes), they can be avoided or modelled in acyclic forms [ABS ]. This property makes XML feature-encoding algorithms easier than those for cyclic graphs. On the other hand, XML trees are much larger than chemical graphs in terms of the number of nodes and edges. This implies a larger number of XML features to be indexed, and consequently more storage space is required to represent these features. For example, when using a feature-based matrix [SWG 2] to encode XML features, the matrix rows and columns grow linearly due to the nature of the XML data and its frequent update operations [ZOIA 6a]. Although this is not a problematic issue, because such complexity can be reduced by using either a corresponding bi-simulation graph [HHK 95] [PT 87] or an efficient data-structure, issues such as query processing and data updates need more attention.

Based on the above discussion of the requirements of graph feature-based indices in general, and XML feature-based indices in particular, the three major issues to be considered when designing any XML feature-based index can be summarized as:

a) What XML features need to be encoded so that fast and accurate comparisons can be conducted?
b) What is the cost of the feature-encoding process in terms of storage space, computational complexity, and processing time?
c) How can feature-based indexes be updated when the underlying data changes?

This section discusses the only two XML feature-based indexing techniques found in the literature.
These two indexes are a scalable bitmap XML index proposed by Yoon et al. [YKKC 2], and the FIX index proposed by Zhang et al. [ZOIA 6a].

The bitmap index proposed in [YKKC 2] is designed to index both the structure and the content of multi-document XML databases (e.g. Figure 7.2). Given that an XML database is composed of many XML documents, say d1, d2, d3, …, dn, the idea is to enumerate all possible paths from these documents (say p1, p2, p3, …, pm) and then encode the existence of these paths in each document in a two-dimensional matrix with binary entries. Each entry in this matrix has a value of 1 if the path is found in the corresponding document, or 0 otherwise. Figure 7.3 illustrates this structure. So, for XML queries that do not contain any value comparisons, one can easily identify all matching documents by looking at the column that corresponds to the query's path and retrieving all rows (i.e. documents) which have a value of 1.

To include the documents' contents (including the distinct words in textual values, element names, attribute names, and other text) in the index, a three-dimensional matrix is used. Similar to encoding the path existence in each document, the new dimension in the bitmap-index matrix represents the existence of the words/tokens (i.e. t1, t2, t3, …, tk). [YKKC 2] names this new structure BitCube (see Figure 7.4). Therefore, a BitCube for an XML database consisting of many documents is defined as BitCube = (d, p, t, b), where d ranges over the documents, p ranges over all possible paths, t ranges over all words/tokens, and b has a value of either 0 or 1.

Apart from the other technical infrastructure of the indexing technique proposed in [YKKC 2], the BitCube representation is associated with several operations which build and query the encoded information. For example, the epath_slice operation takes a path as input and returns a two-dimensional bitmap matrix which represents the set of documents and the words associated with them. Similarly, the operations Word_Slice and Document_Project respectively take a word or a document and return a two-dimensional bitmap matrix that contains the other two dimensions of the BitCube.
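The three slicing operations can be pictured with a small sparse sketch in which only the cells whose bit is 1 are stored. The class below is an illustrative assumption, not the data-structure of [YKKC 2]; its method names mirror the operation names in the text, the sample triples are hypothetical, and paths are written with "/" simply to avoid escape sequences in Python strings.

```python
class BitCube:
    """Sparse 3-D bitmap: a (document, path, token) triple is stored
    only when the corresponding bit would be 1."""

    def __init__(self):
        self.bits = set()

    def add(self, doc, path, token):
        self.bits.add((doc, path, token))

    def epath_slice(self, path):
        """2-D slice for one path: the (document, token) pairs set to 1."""
        return {(d, t) for d, p, t in self.bits if p == path}

    def word_slice(self, token):
        """2-D slice for one token: the (document, path) pairs set to 1."""
        return {(d, p) for d, p, t in self.bits if t == token}

    def document_project(self, doc):
        """2-D slice for one document: the (path, token) pairs set to 1."""
        return {(p, t) for d, p, t in self.bits if d == doc}

    def documents_with_path(self, path):
        """Structure-only lookup: documents that contain the path."""
        return sorted({d for d, p, _ in self.bits if p == path})

# Hypothetical content; the figure's documents are not reproduced here.
cube = BitCube()
for triple in [("d1", "a/b", "t1"), ("d1", "a/b", "t2"), ("d1", "a/c", "t3"),
               ("d2", "a/c", "t3"), ("d3", "a/e", "t3")]:
    cube.add(*triple)

print(cube.documents_with_path("a/c"))
print(sorted(cube.epath_slice("a/c")))
```

Storing only the 1-bits keeps the sketch small; a dense three-dimensional array would expose the same slices but would grow with the product of the three dimensions.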

Figure 7.2: Multi-document XML Database and its Paths & Tokens Sets. Paths set: p1 = a\b, p2 = a\c, p3 = a\b\d, p4 = a\c\d, p5 = a\e. Tokens set: t1, t2, t3.

Figure 7.3: A Bitmap Index for the Database in Figure 7.2

Figure 7.4: A BitCube Index for the Database in Figure 7.2

Although the above index structure appears good and simple in terms of speeding up XML regular queries (i.e. queries that use regular path expressions), there are several factors that might affect its performance. As is clear from its structure, the BitCube is designed to index multi-document XML databases. However, the structures of the constituent XML documents are distinct; that is, a set of related XML documents rarely shares the same vocabulary of element and attribute names, and therefore the set of XML paths to be encoded becomes huge. This in turn enlarges the path dimension of the BitCube representation and consequently the time and space needed for locating and storing such a large set of paths become costly. The problem becomes even worse for the word dimension, as the number of words is assumed to be infinite. Also, the advantage of indexing multiple documents in a single index is of less importance, as the intersection between the path sets of those documents is relatively small. In the case of the BitCube representation, better performance might be obtained by using hashing functions to cluster the large sets of words and paths, but no such recommendation is made in [YKKC 2].

The second feature-based index, in [ZOIA 6a], is designed to index single-document XML databases. The index is called FIX (which stands for Feature-based Indexing technique for XML databases) and it uses feature-based matrices [SWG 2] to locate the query pattern in XML trees. The basic idea is to construct two feature matrices: one matrix encodes some features of the XML tree, and a similar feature matrix encodes the features of the query pattern. Then, the algorithm uses mature results from spectral graph theory [L 99] [C ] to determine whether the XML tree matrix can induce the query matrix.

FIX works as follows. The XML matrix encodes node names (i.e. elements, attributes and possibly textual values) and edge relationships (i.e. structural relations such as parent-child) between these nodes. For example, if an XML tree has n nodes, then its corresponding matrix will be an m × m matrix, where m is the number of nodes in a corresponding bi-simulation graph [HHK 95] [PT 87], so m ≤ n. The bi-simulation graph of an XML tree is the minimal tree that represents all possible distinct paths of the actual XML tree (Section 5.2). It is used in this approach, instead of the complete XML tree, in order to reduce the size of the matrix representation, since it is well known that there is a finite set of distinct paths for every XML tree while the XML tree may contain a huge number of redundant paths. Therefore, FIX is a pruning index that can be built on top of any existing XPath query processor to achieve better query performance [ZOIA 6a]. In other words, the FIX approach only checks whether a query pattern exists in an XML tree; the actual results are returned via an additional process.

As an example, a bi-simulation graph for the XML tree shown in Figure 7.5 is given in Figure 7.6, while Figure 7.7 illustrates the corresponding feature matrix based on the FIX algorithm. Details of how the entries of the matrix are calculated can be found in [ZOIA 6a] and the underlying technical report [ZOIA 6b].

Textual contents in XML documents are not directly supported by the FIX technique. It is clearly stated in [ZOIA 6a] that FIX is concerned only with handling a subset of twig patterns in which there are no predicates or value comparisons. However, to overcome this limitation, each textual value is treated as a value-node. The tag name for a value-node is calculated using a simple hashing algorithm and the corresponding hash-key is added to the bi-simulation graph.
Unfortunately, this enlarges the node set in the bi-simulation graph and consequently increases the storage space as well as slowing the query evaluation process.

Updates are also a bottleneck of the FIX approach. In general, XML documents are subject to frequent updates (e.g. inserting, deleting, and moving nodes). Obviously these updates alter the underlying XML tree structure and consequently its bi-simulation graph(s), which causes re-calculation of the entries of the corresponding feature matrix.

Figure 7.5: A Bibliography XML Database (adapted and modified from [ZOIA 6a])

Figure 7.6: A Bi-simulation Graph of the XML Tree in Figure 7.5

Figure 7.7: Matrix Computations of the Bi-simulation Graph in Figure 7.6

7.4. Conclusion

Similarity search has proven its strength and efficiency in text processing via the use of various IR techniques. However, more effort is required to apply such techniques in the XML context. Although there have been several attempts to employ IR technology in querying and indexing XML data, most of these proposals were based on document-centric databases rather than data-centric documents. In addition, those attempts lack most IR-related functionalities such as document weighting, ranking and relevance retrieval (see Section 3.1).

In terms of feature-based searching, very little attention has been given to indexing XML data using such techniques. There are two proposals: the first technique indexes a set of XML documents in a bitmap-based matrix, while the second encodes a corresponding bi-simulation graph of an XML tree in a matrix that consists of arbitrary unique numbers for each edge of the bi-simulation graph. Both techniques then use matrix theory along with some similarity-searching techniques to evaluate XML queries. However, among other issues, these techniques seem to suffer from the same XML update problem as other XML indexes because of their infrastructures (e.g. the use of the bi-simulation graph in [ZOIA 6a]). Logically, basing such indexes on the entire XML tree and using a proper XML tree representation with an efficient compression algorithm may lead to better query performance and update maintenance. The research which is motivated and proposed in the next section is based on this hypothesis.

8. Motivations and a Research Proposal

This section presents the motivation for studying XML databases in general and their indexing techniques in particular. In addition, the three main problems that are found in the existing XML indexing techniques are summarized. As a result of these indexing problems, a proposal for constructing a new XML index that may reduce or eliminate the effects of such problems is put forward.

The section is structured as follows: Section 8.1 divides the research motivation between expressing the importance of studying XML databases and showing the strengths of the feature-based index in the XML context. In addition, the section outlines the idea of a new feature-based index which is hypothesised to address three major problems in the existing XML indexes. These problems are revisited from previous sections of this review and highlighted in Section 8.2.

8.1. Motivations

This research is initially motivated by the importance of XML in today's database management technology. So, the first subsection revisits some of the XML features that attracted the database community to research the notion of XML databases. The research is also motivated by the strengths of feature-based indexes in the evaluation of XML regular path queries, especially for large XML databases. This motivation is presented in the second subsection. The third and the fourth subsections respectively describe and exemplify the structure of a new feature-based index that is hypothesised to address some problems in the existing XML indexes.

8.1.1. Studying XML Databases

Sections 1 and 2 discussed the features of XML and its data model which made it a widely accepted technology in today's information technology era. Revisiting such features, XML is a language which:

• has a simple syntax: it consists of a series of opening and closing tags that describe the data as well as the structure,
• is portable: that is, it is not tied to a specific implementation because it is encoded in a Unicode textual format,
• is extensible: its infinite vocabulary (i.e. tags and attribute names) gives XML the ability to represent data from other models such as the relational and the object-oriented data models, and
• is readable by both humans and machines.

These valuable features have made XML a standard medium for transferring data over the Web as well as for exchanging information between heterogeneous systems [W3C]. As a result, the amount of data stored in XML repositories and the number of XML users have increased dramatically, and therefore important database issues, such as efficient indexing techniques, have become critical requirements.

In Sections 4, 5, 6 and 7, different types of XML indexes were discussed, comparing their strengths and shortcomings in indexing XML databases of different sizes. Among these XML indexes, the feature-based XML approaches seem to be a promising technology for indexing XML data efficiently for a wide range of XML queries that are represented by regular path queries (e.g. XPath 2.0 queries and a subset of XQuery 1.0 queries, as the XQuery 1.0 query language is built on top of the XPath 2.0 query language). This is discussed further in the next section.

8.1.2. Motivating Feature-based XML Indexing

As is well known, there is no clear division between data and schema definitions in the context of XML databases. Therefore, getting a desired set of data requires a proper navigational algorithm over the document's structure, starting from the top-level element (i.e. the root of the corresponding XML tree) downward to the leaf nodes which contain the actual data. Reaching a specific point in the XML tree requires knowledge of a certain set of information about the XML tree at each branch while walking through its paths. In widely accepted XML query languages such as XPath 2.0 and a subset of XQuery 1.0, this information is normally expressed in the form of path axes (or XML element relationships), for example the child axis, the descendant-or-self axis, and so on (see [W3C4] for the other XPath axes). By treating such element relationships as mid-way features or milestones while navigating through both the XML tree paths and the query expression paths, the process of selecting specific data from an XML database through an XPath query then becomes a matter of following these milestones until reaching the desired data. This is often referred to as the similarity search discussed in Section 7.1.

It has been shown in the literature (e.g. [TVBSSZ 2] [LLCC 5] [CKM 2]) that regular path queries can be easily answered by encoding only two structural relationships between the elements in an XML database. These relationships are the parent-child and the ancestor-descendant relationships. In addition, other structural relationships such as the following and preceding axes can be derived from them. Therefore, encoding the two basic relationships as features in a so-called feature-based matrix, and using a proper similarity-search algorithm to compare the query's features with the XML database's features, might lead to better XML indexing.

The purpose of this research is then to formalise, design, and possibly implement and evaluate a new feature-based index which encodes the elements' structural relationships in a set of feature-based matrices. The next subsection (Section 8.1.3) describes the structure of a new feature-based index that may answer the index problems discussed in Section 8.2.
The nex subsecion (Secion 8..3) describes he srucure of a new feaure-based index ha may answer he index problems discussed in Secion The Index Idea a) Background and Advanages: The basic idea of he new feaure-based XML index is o speed-up he process of evaluaing a wide class of XML queries ha are represened by regular pah queries. The following feaures of he proposed index conribue he index's efficiency: Conduc boh he query evaluaion phase and oupu consrucion phase by consuling he index represenaion only. This eliminaes he ime and he effor required o swich beween he index represenaion and he acual daabase file in order o consruc he query s resul. Allow he index o reside in he compuer s main memory (by compressing he index) while processing he underlying XML daabase. Again his is going o save he query processing ime by eliminaing he expensive disk-based I/O operaions. Employ he power of he feaure-based approaches (described in Secion 8..2) in answering regular pah XML queries which can be evaluaed by comparing and conrasing a se of XML feaure for boh; he XML daabase and he regular pah query. Employ a novel represenaion for XML daa using feaure-based marices o represen he XML documen s srucural relaionships which are required o answer regular pah queries as described in Secion Allow differen levels of he index's compression necessary o keep he index in memory. These include: o Clusering he feaure-based marices ino a single feaure-based marix o Employing an effecive sparse-marix compression echnique from marix and graph heories o furher compress he clusered feaure-based marix. This fac should reduce he size of he index so ha i fis in he compuer s main memory and herefore uses memory-based query s evaluaion ransacions. 44

• Allow cheap and systematic index update operations that are offered by the novel XML data representation mentioned above. These include:
  o Node-Insert: which can be done by adding a column and a row for the new node in the feature-based matrices (or the clustered feature-based matrix)
  o Node-Delete: which can be done by removing the corresponding column and row from the feature-based matrices (or the clustered feature-based matrix)
  o Label-Update: which has no direct effect on the feature-based matrices because the node names (i.e. labels) are kept in a separate list, and any node label in that list can be reached efficiently using an existing list-searching technique such as binary search trees or hash tables.
  o Node-Shift: to be investigated for possible automation.

In addition to the above node-update operations, twig-update operations will be investigated for similar automation.

b) Matrices Construction: With the above in mind, the whole idea centres on how the XML structural relationships are represented in a set of feature-based matrices. To achieve such a representation, the structural relationships necessary for the evaluation of each class of regular path queries are identified and each of them is encoded as binary entries in a separate feature-based matrix. Each feature-based matrix is constructed as follows:

• The matrix's dimension is n rows by n columns, where n is the number of nodes in the indexed XML tree. So, the header of each row and column is a unique identifier that corresponds to a node in the XML tree.
• Each matrix $M_R$ represents a certain structural relationship R, where R belongs to the set of structural relationships existing in the XML tree (i.e. Parent, Child, Ancestor, etc.). So, for example, there is a matrix $M_P$ that encodes the parent relationship between each pair of the tree nodes.
• A matrix entry $M_R(i,j)$ is either 0 or 1. If the R-relationship exists between the i-th node and the j-th node, then $M_R(i,j) = 1$, otherwise $M_R(i,j) = 0$.
For example, if the node a is the parent of the node b, then the Parent-relationship matrix contains the entry $M_P(a_r, b_c) = 1$, where $a_r$ is the row number that corresponds to the node a, and $b_c$ is the column number that corresponds to the node b.

c) Query Evaluation: By constructing several feature-based matrices using the above algorithm, evaluating a regular path XML query becomes a matter of consulting a subset of these matrices depending on the query's steps. For example, the XPath query \a\\b can be evaluated by accessing the Parent matrix and the Descendant matrix respectively for each step of the query. A query processor that uses the new feature-based index will be constructed in later stages of this research.

d) Storage Space Complexity: In terms of storage complexity, at first glance the index seems to require massive storage space, which can be approximated by $O(m) \times O(n^2)$, where n is the number of nodes in the XML tree and m is the number of encoded features (i.e. the number of matrices). However, the m factor can be reduced to $O(1)$ in two steps:

Some feature-based matrices can be obtained from others by applying geometric transformations. For example, the Parent matrix can be calculated from the Child matrix as $M_P(i,j) = \mathrm{Reflect}_Y[\mathrm{Rotate}_{90}[M_C(i,j)]]$; similarly, the Ancestor matrix can be calculated from the Descendant matrix as $M_A(i,j) = \mathrm{Reflect}_Y[\mathrm{Rotate}_{90}[M_D(i,j)]]$ (the proof is delayed).
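The construction rule and the geometric transformation can be tried out on a small tree. The sketch below is only an illustration under stated assumptions: the five-node tree is hypothetical, the rotation is taken to be clockwise (the text does not fix a direction), the entry convention follows the rule that a 1 at row i, column j means node i stands in the relationship to node j, and all function names are invented here. Under these assumptions, rotating by 90 degrees and reflecting around the vertical axis amounts to transposing the matrix.

```python
def rotate_90(matrix):
    """Rotate a square matrix 90 degrees clockwise (assumed direction)."""
    return [list(row) for row in zip(*matrix[::-1])]

def reflect_y(matrix):
    """Mirror a matrix around its vertical (y) axis."""
    return [row[::-1] for row in matrix]

def build_parent_matrix(nodes, parent_of):
    """M_P[i][j] = 1 when node i is the parent of node j."""
    pos = {name: k for k, name in enumerate(nodes)}
    m = [[0] * len(nodes) for _ in nodes]
    for child, parent in parent_of.items():
        m[pos[parent]][pos[child]] = 1
    return m

def build_ancestor_matrix(nodes, parent_of):
    """M_A[i][j] = 1 when node i is a proper ancestor of node j."""
    pos = {name: k for k, name in enumerate(nodes)}
    m = [[0] * len(nodes) for _ in nodes]
    for node in nodes:
        up = parent_of.get(node)
        while up is not None:          # walk up to the root
            m[pos[up]][pos[node]] = 1
            up = parent_of.get(up)
    return m

def row_select(matrix, nodes, node):
    """Nodes whose column holds a 1 in the row of the given node."""
    row = matrix[nodes.index(node)]
    return [nodes[j] for j, bit in enumerate(row) if bit]

# Hypothetical tree: a is the root; b and c are its children; f and g hang under c.
nodes = ["a", "b", "c", "f", "g"]
parent_of = {"b": "a", "c": "a", "f": "c", "g": "c"}

parent = build_parent_matrix(nodes, parent_of)
ancestor = build_ancestor_matrix(nodes, parent_of)
child = reflect_y(rotate_90(parent))        # derived, not stored
descendant = reflect_y(rotate_90(ancestor))  # derived, not stored

print(row_select(parent, nodes, "c"))    # children of c
print(row_select(ancestor, nodes, "a"))  # descendants of a
print(row_select(child, nodes, "f"))     # parent of f
```

Because the derived matrices are pure functions of the stored ones, only half of the relationship matrices would need to be kept, which is the saving the first step aims at.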

Next, the resulting feature-based matrices (say $M_1, M_2, \dots, M_k$) are clustered in a single matrix ($M_{all}$) as follows:

$M_{all}(i,j) = \mathrm{map}(M_1(i,j) \,\|\, M_2(i,j) \,\|\, \dots \,\|\, M_k(i,j))$

where $\|$ is a string concatenation operator, and the map(x) function produces a single-character map for the concatenated string x. For example, if k = 4 then a possible value set for the function map(x) is the hexadecimal notation (i.e. map(x) ∈ {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}).

On the other hand, because most entries of the master feature matrix $M_{all}$ are zeros, the $O(n^2)$ complexity can be reduced by applying an effective sparse-matrix compression technique. This issue will be investigated in the coming stages of the research.

Figure 8.1: A Framework Structure for the New Feature-based Index

e) Updates Computation: In terms of the index update computation, the effect of the two basic update operations was already discussed above in this section. While the Node-Insert and the Node-Delete operations cost nothing except adding and removing the corresponding row and column to/from the feature-based matrices (apart from the cost of the un-map() and map() functions), the effects of inserting a twig (Twig-Insert) and removing a twig (Twig-Delete) are to be investigated. One way to conduct Twig-Insert and Twig-Delete operations is to model them as a series of Node-Insert and Node-Delete operations respectively. In addition, the Node-Shift and Twig-Shift operations can also be modelled as a series of Node-Insert and Node-Delete operations. These issues are subject to investigation in further stages of the research.

The basic framework for constructing the index and using it to query the underlying XML database is illustrated in Figure 8.1. The following section presents an example to show the simplicity of the index's basic structure, which can be considered as another motivation towards proposing the new feature-based index.

8.1.4. A Motivating Example

To show how the XML structural features can be encoded in feature-based matrices, Figure 8.2 shows an XML tree representing a simple XML database. Figures 8.3, 8.4, 8.5 and 8.6 draw four feature-based matrices that respectively encode the parent, child, ancestor and descendant relationships. For example, from Figure 8.3, we can get all the child nodes of the node c by selecting all the rows that contain a value of 1 under the c column. So, the resulting node set contains the nodes f and g.

Therefore, the basic idea of the proposed index is to use these feature-based matrices in order to select the desired nodes at each step of an XPath query, as described in Section 8.1.3 and illustrated in Figure 8.1. The rest of the idea will be developed progressively during this research.

Figure 8.2: An XML Tree

Figure 8.3: Parent-of Relationship Matrix of Figure 8.2

Figure 8.4: Child-of Relationship Matrix of Figure 8.2

Figure 8.5: Ancestor-of Relationship Matrix of Figure 8.2

Figure 8.6: Descendant-of Relationship Matrix of Figure 8.2
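The clustering of several binary matrices into one master matrix can be sketched as follows for k = 4. The three-node chain (a is the parent of b, and b is the parent of c), the bit order (parent, child, ancestor, descendant) and the function names `cluster` and `uncluster` are assumptions made for this illustration; the figures' own trees and matrices are not reproduced.

```python
def cluster(matrices):
    """Concatenate the bits of all matrices cell by cell and map each
    bit string to one hexadecimal character (valid for up to 4 matrices)."""
    n = len(matrices[0])
    return [[format(int("".join(str(m[i][j]) for m in matrices), 2), "X")
             for j in range(n)]
            for i in range(n)]

def uncluster(master, k):
    """Inverse mapping: recover the k binary matrices from the master matrix."""
    n = len(master)
    bits = [[format(int(master[i][j], 16), "0{}b".format(k)) for j in range(n)]
            for i in range(n)]
    return [[[int(bits[i][j][f]) for j in range(n)] for i in range(n)]
            for f in range(k)]

# Hypothetical chain a -> b -> c; a 1 at row i, column j means
# "node i is the parent / child / ancestor / descendant of node j".
parent     = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
child      = [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
ancestor   = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
descendant = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]

master = cluster([parent, child, ancestor, descendant])
for row in master:
    print(" ".join(row))
```

Each cell of the master matrix now answers four relationship questions at once: with this bit order, the entry A (binary 1010) in row a, column b says that a is both the parent and an ancestor of b.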

The four feature-based matrices (in Figures 8.3, 8.4, 8.5 and 8.6) can be clustered in a single feature-based matrix using a simple mapping function that maps the concatenated binary codes to hexadecimal values. The resulting feature-based matrix is given in Figure 8.7.

Figure 8.7: The Master Feature-based Matrix for Figure 8.2

8.2. Indexing Problems in Existing Approaches

Most of the existing indexing techniques from all the classes discussed in Sections 4, 5, 6 and 7 share three major problems that affect both the cost of the index and its performance. These problems are: 1) the index size, 2) the index computational cost, and 3) the index updatability. Each of these problems is revisited in the following subsections respectively, and possible solutions (using the proposed feature-based index) are hypothesised along with that discussion.

8.2.1. The Index Size Problem

Most of the indexing techniques discussed in this review lack scalability, which is determined by the index size, in one form or another. While some indexes are designed to handle only a single-document database, which is far from reality, others are designed to be memory-based indexes, and therefore large-scale XML documents become a bottleneck for such indexes. For those approaches which use disk-based indexing techniques (e.g. [WJWLLL 5]) to index large-scale XML databases, the scalability problem results in a trade-off between the capacity of the index and its performance in terms of the index-processing time that is required to locate the desired disk pages.

Although physical storage space is not a problematic issue nowadays, minimizing the index size is an important aspect for a memory-based index. The proposed feature-based index should reside in the computer's memory and all its operations are to be conducted on the in-memory version of the index. So, based on these assumptions, the index size can be compacted by using one or more of the following techniques:

• The corresponding bi-simulation [HHK 95] [PT 87] version of the XML database has a major effect in reducing the actual XML tree. Encoding the bi-simulated version of the underlying XML database in the proposed feature-based matrices will result in a further reduction in the index size.
• It can be easily proven that some of the above feature-based matrices can be obtained by mathematically transforming other feature-based matrices. For example, the child-relationship matrix can be calculated by rotating the parent-relationship matrix by 90° and then reflecting the rotated matrix around the y-axis. This property allows the index to cluster several feature-based matrices in a single matrix and consequently reduce the storage space required for the index.

• Using a certain data-structure representation and/or one of the existing sparse-matrix⁵ compression techniques will lead to a further index compression due to the nature of the structural relationships that compose XML trees.

⁵ In the mathematical subfield of numerical analysis, a sparse matrix is a matrix populated primarily with zeros.

For the above three aspects, there are existing algorithms from the matrix-algebra and data-structure fields which can be directly employed in the intended feature-based indexing technique. The selection of a suitable technique will be investigated during the index's design process. The next subsection presents the second common problem in the existing XML indexes.

8.2.2. The Computation Cost Problem

The relational-based XML indexes discussed in Section 5.1 use node-labelling algorithms to encode the XML structural relationships. One deficiency of this implementation is the need for relatively complex computations to calculate the elements' relationships during query evaluation. Furthermore, accessing these indexes is not enough to evaluate the answer to a query without consulting the actual database file. In the new feature-based representation, the answers to a structure-based query can be obtained directly from the index file itself without the need to access the actual XML document.

In native XML indexes such as the sequence-based techniques, the high computational cost results from the following aspects:

• the way the XML data sequences are formed and encoded (i.e. the algorithms used to encode and build the database textual stream)
• the way the comparison process between the query's sequence and the document's sequence is conducted

The same computational complexities are found in the existing feature-based indexes. Furthermore, in both the existing sequence-based and the existing feature-based techniques, a very high computation cost is incurred during the re-construction of the index when the underlying database changes. This results in the index-updatability problem, which is discussed in the following section.

In the proposed feature-based index, all feature-based matrices can be constructed by parsing the XML documents only once. Furthermore, the data-selection process (i.e. the output construction) is a matter of excluding those entries which have zero values in a corresponding feature-based matrix. A further complexity reduction in the proposed feature-based index can be demonstrated during the update operation, which is discussed below.

8.2.3. The Index Update Problem

This is a common problem in all the indexing classes that are discussed in this report. The relational approaches discussed in Sections 4 and 5.1 experience expensive update operations when the underlying XML data changes. The update problem results from 1) updating the labels of the XML nodes, for those techniques which rely on node labelling to represent the structural information of the XML document [TVBSSZ 2] [SLFW 5] [ZNDLL ], and/or 2) updating the path list, for those techniques which store all possible paths found in the XML document in order to use them for the evaluation of structural queries [CLO 3] [CMS 2]. The update problem is even worse in the sequence-based approach (Section 6), where the update has to occur in several positions in the textual stream that encodes the underlying XML database.

With respect to the existing feature-based indexes, the two techniques discussed in Section 7 suffer from the updatability problem as well. In the proposal of [YKKC 2], the problem is very clear because the technique encodes all possible epaths found in the underlying XML database and uses them to construct the feature matrix (see Section 7.4). As a consequence, any simple change in the document's structure (e.g. deleting a non-leaf node) could lead to many changes in the epath set, especially when the change happens towards the root of the XML tree. Therefore the underlying feature-based matrix must reflect this change by updating the many columns that correspond to the set of epaths affected by the change.


This problem is more or less the same in the FIX index [ZOIA 6a], the second feature-based index discussed in Section 7.4, because the FIX algorithm simply assigns sequential numbers to all existing edges between node pairs in the bi-simulation graph. These numbers are aligned in the feature-based matrix, and unrelated pairs are assigned zero values.

In the proposed feature-based index, the two basic update operations (i.e. node-insert and node-delete) are straightforward. For example, to insert the leaf node i in Figure 8.2, it is sufficient to add a column and a row at the end of the feature-based matrix that reflect the corresponding structural relationships between the inserted node and the existing nodes. The resulting feature-based matrices after inserting the leaf node i are shown in Figures 8.8, 8.9, 8.10 and 8.11 (the shaded columns and rows encode the structural relationships of the i-th node). Similarly, when a node is to be deleted from the XML tree, it is sufficient to remove the corresponding column and row from the feature-based matrix. Other update operations can be modelled by conducting a series of node-insert and node-delete operations.

Figure 8.8: Parent-of Relationship Matrix of Figure 8.2 (after inserting i)
Figure 8.9: Child-of Relationship Matrix of Figure 8.2 (after inserting i)
Figure 8.10: Ancestor-of Relationship Matrix of Figure 8.2 (after inserting i)
Figure 8.11: Descendant-of Relationship Matrix of Figure 8.2 (after inserting i)

8.3. Conclusion

This section (Section 8) has highlighted three major problems that affect most of the existing XML indexing techniques. Using the index structure discussed in Section 8..3, this research tries to formalize and design an alternative feature-based index which may result in better query performance.
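The matrix operations described in this section can be illustrated with a minimal Python sketch. It is not the memorandum's implementation. It assumes a hypothetical four-node tree (not the tree of Figure 8.2), the convention that entry [r][c] of the parent-of matrix is 1 when node r is the parent of node c, a clockwise direction for the 90° rotation (the text does not state the direction), and deletion restricted to leaf nodes. All function names are illustrative.

```python
def rotate_cw(m):
    """Rotate a square matrix by 90 degrees clockwise (assumed direction)."""
    return [list(row) for row in zip(*m[::-1])]

def reflect_y(m):
    """Mirror a matrix around its vertical (y) axis."""
    return [row[::-1] for row in m]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def children_of(parent_of, names, node):
    """Answer a child query from the matrix alone by skipping zero entries."""
    row = parent_of[names.index(node)]
    return [names[c] for c, flag in enumerate(row) if flag]

def insert_leaf(parent_of, names, parent, leaf):
    """Node-insert: append one column and one row for the new leaf."""
    for row in parent_of:
        row.append(0)                       # new column, all zeros
    parent_of.append([0] * (len(names) + 1))  # new row, a leaf has no children
    parent_of[names.index(parent)][len(names)] = 1
    names.append(leaf)

def delete_leaf(parent_of, names, leaf):
    """Node-delete (leaf only in this sketch): drop its row and column."""
    k = names.index(leaf)
    del parent_of[k]
    for row in parent_of:
        del row[k]
    del names[k]

# Hypothetical tree: a is the root, b and c are children of a, d is a child of c.
names = ["a", "b", "c", "d"]
parent_of = [
    [0, 1, 1, 0],  # a is the parent of b and c
    [0, 0, 0, 0],  # b is a leaf
    [0, 0, 0, 1],  # c is the parent of d
    [0, 0, 0, 0],  # d is a leaf
]

# Rotating clockwise and then reflecting is the same as transposing,
# so the child-of matrix need not be stored separately.
child_of = reflect_y(rotate_cw(parent_of))
print(child_of == transpose(parent_of))

insert_leaf(parent_of, names, "b", "e")
print(children_of(parent_of, names, "b"))

delete_leaf(parent_of, names, "e")
print(children_of(parent_of, names, "b"))
```

Under these assumptions each update touches a single row and column, which is the behaviour the section attributes to the proposed index. The dense list-of-lists layout is chosen only for readability; a sparse representation would be needed to obtain the space savings discussed earlier.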
