A Combination of Trie-trees and Inverted Files for the Indexing of Set-valued Attributes

Manolis Terrovitis, Nat. Technical Univ. Athens, mter@dblab.ece.ntua.gr
Spyros Passas, Nat. Technical Univ. Athens, spas@dblab.ece.ntua.gr
Panos Vassiliadis, Univ. of Ioannina, pvassil@cs.uoi.gr
Timos Sellis, Nat. Technical Univ. Athens, timos@dblab.ece.ntua.gr

ABSTRACT
Set-valued attributes frequently occur in contexts like market-basket analysis and stock market trends. Recent research literature has mainly focused on set containment joins and data mining without considering simple queries on set-valued attributes. In this paper we address superset, subset and equality queries and we propose a novel indexing scheme for answering them on set-valued attributes. The proposed index superimposes a trie-tree on top of an inverted file that indexes a relation with set-valued data. We show that we can efficiently answer the aforementioned queries by indexing only a subset of the most frequent of the items that occur in the indexed relation. Finally, we show through extensive experiments that our approach outperforms the state of the art mechanisms and scales gracefully as database size grows.

Categories and Subject Descriptors
H.2.2 [Database Management]: Physical Design - Access Methods

General Terms
Algorithms, Performance

Keywords
HTI, inverted files, tries, containment queries

1. INTRODUCTION
Containment queries on set-values emerge in a variety of application areas ranging from scientific databases to XML documents. Examples of set-valued data can be found in market basket analysis, production models, image and molecular databases [7]. Containment queries span a wide range of query families, ranging from simple existence queries to composite similarity, pattern matching, or graph isomorphism queries.
Naturally, the fundamental set-containment operators are typical for a large number of situations (e.g., "Give me all photographs whose annotation contains the terms galaxy and red giant", or "Give me all protein sequences that contain either G or T or a combination of them, but nothing else"). Moreover, set-containment operators can be used in other query classes where a pruning of the candidate sets to be processed takes place (e.g., "Give me all medicine sequences that are similar to my XYZ test medicine and their X component contains either G or T or a combination of them, but nothing else"). Another important application area for containment queries is the evaluation of path expressions in XML data, which partially resolves to keyword searching [9].

As RDBMSs and IR come closer, often in the interest of storing and handling XML [2] and web data [4], containment queries on set values become an increasingly significant use case for an RDBMS. A natural way of modelling and storing set-values in a modern RDBMS is by using set-valued attributes. Set-valued attributes are an integral part of the object-relational model and they are supported by most modern RDBMSs [17]. In this context, we are interested in containment queries over the set-valued attributes of a relation. More specifically, assuming a relation D(id, set_values) and a set of interesting items $qs = \{i_1, \dots, i_n\}$, we would be interested to ask queries of the form $\{t \mid t \in D \wedge qs \;\theta\; t.set\_values\}$, where $\theta \in \{\subseteq, =, \supseteq\}$.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'06, November 5-11, 2006, Arlington, Virginia, USA. Copyright 2006 ACM ...$5.00.
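The three operators in the query form above can be illustrated with a naive full scan, which is the baseline that any index tries to avoid. This is only a sketch: the toy relation, the function name `containment_query` and the string names of the operators are assumptions made for illustration, not part of the paper.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical toy relation D(id, set_values); the contents are illustrative only.
D: Dict[int, FrozenSet[str]] = {
    1: frozenset({"galaxy", "red giant", "nebula"}),
    2: frozenset({"galaxy"}),
    3: frozenset({"galaxy", "red giant"}),
    4: frozenset({"comet"}),
}


def containment_query(relation: Dict[int, FrozenSet[str]],
                      qs: Iterable[str],
                      theta: str) -> List[int]:
    """Return the ids of all tuples t with `qs theta t.set_values` (full scan)."""
    query = frozenset(qs)
    predicates = {
        "subset": lambda s: query <= s,    # qs is contained in the tuple
        "equality": lambda s: query == s,  # the tuple is exactly qs
        "superset": lambda s: query >= s,  # the tuple holds items of qs and nothing else
    }
    test = predicates[theta]
    return sorted(tid for tid, s in relation.items() if test(s))
```

Every call inspects all tuples of the relation, so the cost grows linearly with the database size regardless of how selective the query set is.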
The problem of efficiently computing the result set of these operations is challenging, mainly due to the vastness of the underlying data volumes and the particularities of the queries. The problem with set values is that the space of potentially indexed values is enormous ($2^n$, for $n$ items) and the resulting index would be huge as well. Moreover, the query semantics are quite different: whereas simple subset queries retrieve the tuples that contain a certain set of items, superset queries require that some (but not necessarily all) of these items are contained in the result tuples, and nothing else. Therefore, an efficient indexing scheme is necessary that can (a) support the coexistence of multiple items in the same query set and (b) adequately support different classes of containment queries by exploiting their characteristics.

To this day, the database and the information retrieval (IR) research communities are mainly the ones having studied set-values in depth. From the database perspective, there is a need to efficiently handle huge volumes of small sets, usually taking values from a limited domain. So far, database research has mostly focused on similarity and join queries. Similarity queries [3, 12] retrieve the set values that are most similar to the one provided in the query. Join queries, which are classified as similarity joins [15] or as set containment joins [11, 13], focus on intersecting two different relations based on their set-valued attributes. Research on access methods for basic containment queries is very limited. To the best of our knowledge, only access methods based on signature files [2] and inverted file indices [1, 22] have been used in the database research literature for supporting containment queries on set-valued attributes. A recent survey [7] has shown that inverted files clearly outperform signature-based methods for containment queries on low cardinality set values. The same holds for text documents, as Zobel et al. showed in [21]. Moreover, Zhang et al. studied in [2] how inverted file indices compare to traditional relational methods for containment queries, motivated by the integration of IR functionality in RDBMSs. Using traditional relational indices like B-trees for containment queries was shown to have significantly inferior performance in most cases. Considering inverted files as the state-of-the-art mechanism for set containment is also supported by the fact that they are used by all WWW search engines [19].

Still, the performance of inverted files suffers when the domain of the distinct items of the database is small or when the distribution of the items is skewed and few items dominate the dataset. This is due to their internal structure: inverted files contain a header list with all the items of the vocabulary; for each item, an inverted list with pointers to the transactions that contain this item is maintained. Thus, if some items appear in many set values, their lists become very long. Since containment queries usually require scanning the entire lists of the query items, having long lists has a deteriorating impact on the query evaluation. This is often the case in the real world. A characteristic case of numerous records of set values from a limited domain are the real datasets from the UCI KDD archive [8] that we use in our experimental evaluation. These datasets are logs that trace the behavior of users in large web portals, which is a common source of data that are analyzed by using containment queries (e.g., "Which users downloaded only drivers and patches from our website and did not visit any other page?").
Moreover, highly skewed data is a common case for retail transactions, where some basic products dominate the transactional logs.

In this paper, we focus on the efficient evaluation of containment queries on large collections of low cardinality sets with exact query semantics. The query classes under investigation include subset, superset and set equality queries. These queries test a set of items, a.k.a. the query set, over a set-valued attribute of a set of records, for the fulfillment of the query's selection condition (subset, superset or equality). The exact set of transactions that fulfill the selection condition is returned. To efficiently answer these classes of queries, we propose a novel indexing scheme, the Hybrid Trie-Inverted file (HTI) index. The HTI index superimposes a trie structure, the access tree, over an inverted file index. The access tree offers pointers to the lists of the most frequent items, thus leveraging the performance of inverted files. In the HTI index, queries over the frequent items are evaluated by the access tree. At the same time, the memory requirements remain low, since the information for the vast majority of the data is kept in the inverted file. This evaluation mechanism has a significant impact on query answering efficiency in the average case, since we expect items to be queried according to their frequency of appearance. In short, our contribution comprises the following:

1. We propose a novel indexing scheme, the HTI index, that combines a trie with an inverted file, for large collections of low cardinality sets. The main idea is that the trie is placed in main memory, indexing the top k most frequent items of the data set, whereas the inverted file is placed in secondary storage, associating each item with all the transactions that contain it. The index is particularly fit for data from a limited domain or skewed data, which is a very common real world case.

2. We present efficient evaluation algorithms for set containment queries that utilize the proposed index.
For all types of queries we quickly identify the set of frequent items that participate in the query by exploiting the main memory part of HTI and complement the answer by testing the infrequent items through the inverted file.

3. We demonstrate the superiority of our proposal over the state of the art access methods, by extensive experiments. We evaluate the HTI index on real and synthetic data. We assess the number of disk page accesses performed by the HTI index as a function of the domain of items, database size and size of the query set. On all occasions, HTI significantly outperforms a competitor inverted file, and scales gracefully, especially in the cases of large database and query sizes (as opposed to the inverted file, which fails to scale similarly). In the case of the real datasets, which involve 32k and 1M transactions, the HTI index performs an order of magnitude fewer disk page accesses with a memory overhead of less than 0.5 MB. Our experiments with synthetic data show that even for large domains, keeping a low threshold for the top-k items held in the trie is sufficient for achieving high performance with minimum memory expenses.

The rest of the paper is organized as follows: In Section 2 we formulate the problem and in Section 3 we present the proposed HTI index. Section 4 describes the query evaluation algorithms and in Section 5 we demonstrate the results of the experimental comparison of our proposal against the inverted file index. Finally, Section 6 concludes the paper.

2. PROBLEM FORMULATION
For reasons of simplicity we assume that the data are organized under a simple object-relational schema D with each tuple t = [id, s] having two attributes; id is a unique identifier of the transaction and s is a set (not a bag or a list) of objects from a countably infinite domain of distinct items. We refer to the active domain of D with the term vocabulary and denote it as I. Thus, for every t, $t.s \subseteq I$. Moreover, throughout the paper we consider the id as adequate information to allow us to retrieve the whole transaction from the hard disk in one step (i.e., in one page access).

Queries.
In queries on set-valued data, the user specifies the query predicate and the query set qs. The query set is a set of items from the domain of I. The queries we are interested in are defined as follows:

Subset queries. In subset queries the user asks for all transactions t that contain the query set qs, i.e., $\{t \mid t \in D \wedge qs \subseteq t.s\}$.

Equality queries. In equality queries the user asks for all transactions that contain exactly the query set, i.e., $\{t \mid t \in D \wedge qs = t.s\}$.

Superset queries. In superset queries the user asks for all transactions whose items are contained in the query set, i.e., $\{t \mid t \in D \wedge qs \supseteq t.s\}$.

ID   Items bought
1    {f, a, c}
2    {c, b, d}
3    {f, a}
4    {a, c}
5    {f, d}
6    {f, c}
7    {f}
Figure 1: Example relation D of customer transactions

Vocabulary (I)   List of transaction ids
f                1, 3, 5, 6, 7
c                1, 2, 4, 6
a                1, 3, 4
d                2, 5
b                2
Figure 2: A simple inverted file index scheme for the example of Figure 1.

3. INDEX STRUCTURE
Tries and inverted files have been extensively used for text indexing; still, the former have not been employed for indexing set-valued attributes in object-relational databases. In this section, we introduce the HTI index that combines a main memory trie with an inverted file residing in secondary storage. First, we give background information for inverted files and tries and we explain their benefits and drawbacks. Then, we show how these indexing schemes are combined in the HTI index. Finally, we also discuss issues concerning updates, compression and caching.

3.1 The inverted file
The inverted file index has two major components: (a) the vocabulary and (b) the inverted lists. The vocabulary is a list of all the distinct items appearing in the database, i.e., it is the same as the database vocabulary I of Section 2. Each vocabulary node has a label indicating the item it represents and a pointer to the head of the inverted list. The inverted list contains information about all the transactions in which the item appears. In our case, this information comprises the transaction id together with its length. The length of the transaction is required in order to efficiently execute equality and superset queries. Figure 2 depicts an inverted file index for the relation of Figure 1. The vocabulary includes all the distinct items that appear in the transactions. The inverted lists may be huge for large databases; the id and the length of a transaction are inserted as many times as the number of items it contains. This means that, theoretically, the size of the inverted file could be similar to the size of the transaction collection or even larger.
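A minimal sketch of the inverted file described above may help to see why its size tracks the size of the relation. It assumes the example relation of Figure 1, with the two most frequent items labelled "f" and "c" (an assumption of this sketch), and stores each posting as an (id, length) pair; the function name `build_inverted_file` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Example relation of Figure 1. The labels "f" and "c" for the two most
# frequent items are an assumption of this sketch.
D: Dict[int, Set[str]] = {
    1: {"f", "a", "c"}, 2: {"c", "b", "d"}, 3: {"f", "a"}, 4: {"a", "c"},
    5: {"f", "d"}, 6: {"f", "c"}, 7: {"f"},
}


def build_inverted_file(relation: Dict[int, Set[str]]) -> Dict[str, List[Tuple[int, int]]]:
    """Map every vocabulary item to a list of (transaction id, transaction length)."""
    inverted = defaultdict(list)
    for tid in sorted(relation):      # ascending ids keep every list sorted
        items = relation[tid]
        for item in items:            # one posting per item of the transaction
            inverted[item].append((tid, len(items)))
    return dict(inverted)


inverted_file = build_inverted_file(D)
# Each transaction is repeated once per item it contains.
total_postings = sum(len(postings) for postings in inverted_file.values())
```

In this toy relation the seven transactions produce fifteen postings, one for every (transaction, item) pair, which is the effect the paragraph above describes.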
Uncompressed inverted files for text documents typically consume around 3% of the space required for the uncompressed database [16]. In the cases that we are mostly interested in, where there are no repetitions and the vocabulary is significantly smaller than the number of transactions ($|I| \ll |D|$), the inverted file can be equal to or larger than the database, since the t.id requires more bits than the items of I. We can trace the answer to subset, superset and equality queries by using set operations on the inverted lists. Due to their size, the inverted lists are stored in secondary storage. Therefore, the larger these lists are, the more pages have to be retrieved from the disk for evaluating a query. This means that the most frequent items, which have the largest lists, are the most expensive to process. This is an important weakness when dealing with set values in databases, considering that the most frequent items are usually the ones most frequently queried.

3.2 Tries in the context of set-values
Tries are multiway tree structures for storing string keys which enable retrieval in time proportional to the string length [1]. Unlike inverted files, tries are letter oriented and each string corresponds to a path in the tree. Consequently, common prefixes in strings correspond to common prefix paths in the tree. Leaf nodes include either the documents themselves, or links to the documents that contain the string that corresponds to the path. Since strings are words of some language, the maximum number of children for a node is limited by the number of letters of the alphabet of the documents' language. The way tries are created allows for prefix (or suffix, if strings are inverted before being mapped to paths) search, i.e., they provide a kind of range search, based on the first letters of the string. A significant difference between set values and text documents is that unlike words (which are composed of letters), the items of a set are not further decomposable to smaller units.
Even if the items are alphanumeric values themselves, this is simply a coding scheme of the database that eventually has no relationship to the user queries. Therefore, it is meaningless to exploit the alphanumeric value of the items for indexing purposes; rather, we need to use the set of all items I as the vocabulary of the index. As a result, each node might have $|I|$ children. This makes the potential size of the trie very large and thus the space gain achieved from common prefixes is a lot smaller compared to the one in the text document case. Practically, even for a moderately large $|I|$, e.g., 2k, the maximum space of the trie is so big that it grows almost linearly with the number of transactions. In our following deliberations, we need to define the fundamental notion of item frequency ordering that concerns the ordering of the items of a vocabulary.

Item frequency ordering. The item frequency ordering of the items of a vocabulary I (over a database D) is the total ordering of the items according to their frequency of appearance in the underlying database. In our reference example, the item frequency ordering is $<_I$ = [f, c, a, d, b].

To construct a trie for set values, we follow the approach of Han et al. in [5, 6]. First, each transaction is transformed from an unordered set to an ordered sequence based on the item frequency ordering of the vocabulary. An item x precedes another item y in an ordered transaction if x is more frequent than y in the whole database D. The ordered transaction is subsequently mapped to a path starting from the trie tree root. If some nodes already exist, due to a common prefix with a previously inserted transaction, we only add the new nodes.
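The construction just described (order each transaction by item frequency, then insert it as a path that reuses existing prefix nodes) can be sketched as follows. The sketch again assumes the example relation of Figure 1 with the two most frequent items labelled "f" and "c"; the class `TrieNode`, the function names and the alphabetical tie-breaking rule are assumptions of this illustration, not details given in the paper.

```python
from collections import Counter
from typing import Dict, List, Optional, Set

# Example relation of Figure 1 (labels "f" and "c" are assumed).
D: Dict[int, Set[str]] = {
    1: {"f", "a", "c"}, 2: {"c", "b", "d"}, 3: {"f", "a"}, 4: {"a", "c"},
    5: {"f", "d"}, 6: {"f", "c"}, 7: {"f"},
}


def frequency_ordering(relation: Dict[int, Set[str]]) -> List[str]:
    """Items sorted by decreasing frequency (ties broken alphabetically, an assumption)."""
    counts = Counter(item for items in relation.values() for item in items)
    return sorted(counts, key=lambda item: (-counts[item], item))


class TrieNode:
    def __init__(self, label: Optional[str] = None) -> None:
        self.label = label
        self.children: Dict[str, "TrieNode"] = {}
        self.tids: List[int] = []  # transactions whose ordered form passes through this node


def build_trie(relation: Dict[int, Set[str]]) -> TrieNode:
    rank = {item: pos for pos, item in enumerate(frequency_ordering(relation))}
    root = TrieNode()
    for tid in sorted(relation):
        node = root
        # Most frequent item first, so that common prefixes share nodes.
        for item in sorted(relation[tid], key=rank.__getitem__):
            if item not in node.children:      # only add the nodes that are new
                node.children[item] = TrieNode(item)
            node = node.children[item]
            node.tids.append(tid)
    return root
```

Annotating every node with transaction ids is done here only to make the structure easy to inspect; as the paper notes for its own figure, this does not imply that the ids are kept in main memory.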

Ordered transactions
1    {f, c, a}
2    {c, d, b}
3    {f, a}
4    {c, a}
5    {f, d}
6    {f, c}
7    {f}

Null (root)
    f   tids: 1, 3, 5, 6, 7
        c   tids: 1, 6
            a   tids: 1
        a   tids: 3
        d   tids: 5
    c   tids: 2, 4
        d   tids: 2
            b   tids: 2
        a   tids: 4
Figure 3: An abstract form of a trie tree for the example of Figure 1. This is how the full trie would ideally be; the shaded area of the original figure (the nodes of the infrequent items) is excluded from the access tree of HTI.

An abstract form of the trie tree for the database of Figure 1 is depicted in Figure 3. The transaction with id = 1 and set value s = {a, c, f} is ordered according to the frequency of its items in the database. Since f occurs 5 times, c occurs 4 and a occurs 3 times, the transaction's set is transformed to a sequence s = {f, c, a} that subsequently contributes the path f → c → a in the trie. Unlike typical tries, in Figure 3 we annotate each node with the list of transaction ids that correspond to it (without implying that they are actually kept in main memory along with the trie). Note that depending on its prefix, a transaction might belong to the list of more than one node. For example, the transaction with id = 1 belongs to the lists of all the nodes of its prefix, i.e., all the nodes of the path f → c → a. Finally, there is a difference between the transactions that pertain solely to a node and the transactions that also pertain to its descendants. Observe the node c of the path f → c → a. The transaction with id = 1 is the transaction {f, c, a} that also belongs to the node a of the same path. On the contrary, the transaction with id = 6 refers exactly to the path f → c. The distinction will be very useful later, for equality and superset queries.

The potentially very large number of descendants that a node might have and the fact that tries are unbalanced do not make the trie a good candidate for secondary memory storage. Therefore, we choose to use it as a main memory structure offering alternative access to the data, on top of the inverted file.

3.3 The HTI index
As we have explained in Section 3.1, the performance of the inverted file suffers when very long lists have to be processed.
The issues involved in the processing of inverted files are (a) the I/O cost of transferring the disk pages with the lists to main memory and (b) the CPU cost of intersecting lists of different items that participate in the same query set. To counter this effect we propose the HTI index, which uses a relatively small main memory trie to offer additional access points to the lists of the most frequent items (that also have the longest lists). The basic idea of the HTI index is to split the vocabulary of the database into (a) a small set of frequent items $I_{fr}$ and (b) a large set of infrequent items $I \setminus I_{fr}$. Then, a trie is used for the former, in order to speed up the access to the lists that pertain only to the combinations of frequent items, whereas the latter are treated as usual, through an inverted file. The HTI index has three major components: a vocabulary, an access tree and a set of inverted lists. An HTI index is schematically depicted in Figure 4.

The vocabulary. Like inverted files, the HTI has a list of all the distinct items of the database, which offers access to the inverted lists. The items in the vocabulary are divided into two classes: (a) the frequent items $I_{fr}$, $I_{fr} \subseteq I$, whose vocabulary entries point to the access tree in main memory, and (b) the infrequent items, $I_{infr} = I \setminus I_{fr}$, whose vocabulary entries lead directly to their inverted lists in secondary storage, exactly like in inverted files. The vocabulary is kept as an array in main memory and together with the access tree root they comprise the initial access points to the lists. The array is implemented as a hash table.

The access tree. The access tree is a trie structure that offers access points to blocks of transactions that share the same access prefix path (app). The app of a transaction can easily be computed if we order its items according to the item frequency ordering of I. Then, we define as access prefix path the sequence prefix path whose items all lie in $I_{fr}$, i.e., the ordered sequence of the frequent items of the transaction. For example, the app of {f, a} is {f}.
We store the app of each transaction in the access tree, by putting the first and most frequent element as a direct child of the root (see also the next section for a detailed discussion on the creation of the access tree). The access tree has two kinds of nodes: (a) the root, which does not correspond to any item in $I_{fr}$ and (b) information nodes, which are all the other nodes of the trie. Each such node holds the following information:
- A label indicating the item of $I_{fr}$ which corresponds to the node.
- A link to the sublist of the transactions that contribute to the path from the root to the node. These are all the transactions whose prefix is the same as the path from the root to the current node.
- Navigational links to the children nodes, the parent node and to the rest of the nodes with the same label.

It is important to stress here that due to the vast volume of the full-fledged trie presented in the previous section, the access tree is a subset of it, concerning only its most frequent items $I_{fr}$. The vocabulary entries concerning these frequent items point to lists that comprise all the access tree nodes that are labelled with the respective item. In turn, these nodes point to the respective inverted lists, stored in secondary storage.

In Figure 4 we depict an example HTI index for the relation of Figure 1. We choose as frequent items $I_{fr}$ = {f, c} (having a frequency greater than 3), and we create the access tree considering only them. Observe that in Figure 3, these were also the items with the longest transaction lists. The shaded area in Figure 3 concerns the infrequent items that were subsequently dropped from the access tree of Figure 4. Item f is more frequent than c, thus it precedes it in access tree paths. Assuming this $I_{fr}$ set, all the transactions of Figure 1 contribute to three paths: root → f, root → f → c and root → c. Observe, also, how the nodes labeled c are linked to each other.

Figure 4: HTI index for the relation of Figure 1. The vocabulary (frequent items f, c; infrequent items a, d, b) and the access tree reside in main memory; the inverted lists reside in secondary storage.

Figure 5: The transaction list corresponding to the node f, assuming two ids per disk page. The sublist header stores the total number of transactions that contain the item (5) and the number of transactions whose app ends at the current node (3, i.e., transactions 3, 5 and 7); the ids are laid out on disk pages as [3, 5] [7, 1] [6].

The inverted lists. There are two cases for the inverted lists of the vocabulary items: (a) inverted lists of non-frequent items and (b) inverted lists of frequent items. Concerning the non-frequent items, their inverted lists are exactly the same as those of a regular inverted file (i.e., sorted lists containing the ids of all the transactions that contain the respective item). The case of frequent items belonging to $I_{fr}$, on the other hand, involves lists made up of many smaller sorted sublists, each of them corresponding to an access tree information node labelled with the respective item. To enhance the evaluation of equality and superset queries, we further divide the sublists of the access tree into two parts, as depicted in Figure 5 for the case of f: (a) the ids of the transactions whose app ends at this node; these are transaction ids 3, 5 and 7, and (b) the ids of the rest of the transactions that contribute to the current node; these are ids 1 and 6. At the beginning of each information node sublist, we store the number of transactions of case (a) together with the total number of the transactions that contribute to the current node, so that we can retrieve the right block from the disk each time.

Example. As shown in Figure 4, the access tree and the vocabulary are kept in main memory, whereas the inverted lists reside in secondary storage. Transactions 1, 3, 5, 6, 7 contribute to the path root → f, thus they are stored in the sublist of node f. Observe that f, being the most frequent item, has exactly one sublist, i.e., its inverted list comprises a single sublist, corresponding to its single appearance in the access tree. This is not necessarily the case for all the items, though. For example, the item c has two sublists. Two of the transactions of c, 1 and 6, also contribute to the path root → f → c, and they are stored in the first sublist. At the same time, transactions 2 and 4 contribute to the path root → c and they are stored in the second sublist. For storage efficiency, the individual sublists of all the different nodes of c are stored contiguously, one after the other. The nodes of the access tree point to the offset of the inverted list where their corresponding sublist begins. Note that, for reasons of readability, in Figure 4 we depict only the ids that are contained in the sublists and not the labels that mark each sublist. The real structure of the sublist for the case of item f is depicted in Figure 5. The rest of the items are indexed by an inverted file and the ids of the transactions that include them are stored in the respective inverted lists. Note that, concerning the infrequent items a, d, b, their vocabulary entries point directly to the inverted lists in secondary storage without any interference with the access tree.

Updates in HTI index. When a transaction is to be inserted into or deleted from the HTI index we practically have to perform two different updates: one to the inverted file component and one to the access tree. If the items and the order of $I_{fr}$ are not modified, the case is straightforward [18]. Still, it is also possible that the order and the member items of $I_{fr}$ should be changed, due to changes in the items' appearance frequency. In all such cases, the query evaluation algorithms remain correct. Considering also that in most related application areas the relative frequencies of the items change slowly or remain stable, the rebuilding of the index is not necessary. In any case, the frequency ordering reflects a heuristic for keeping the size of the access tree small, as reported in [5]; other orderings could also apply.
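The layout of Section 3.3 (vocabulary split, access tree over the app of each transaction, and sublists divided into the two parts of Figure 5) can be sketched in main memory as follows. This is an illustration under assumptions: the labels "f" and "c", the names `AccessNode` and `build_hti`, and the choice to keep the sublists as Python lists of ids (instead of disk pages that also store transaction lengths) are not taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple

# Example relation of Figure 1 (labels "f" and "c" are assumed).
D: Dict[int, Set[str]] = {
    1: {"f", "a", "c"}, 2: {"c", "b", "d"}, 3: {"f", "a"}, 4: {"a", "c"},
    5: {"f", "d"}, 6: {"f", "c"}, 7: {"f"},
}


class AccessNode:
    def __init__(self, label: Optional[str], parent: Optional["AccessNode"]) -> None:
        self.label = label
        self.parent = parent
        self.children: Dict[str, "AccessNode"] = {}
        self.ends_here: List[int] = []   # part (a): transactions whose app ends at this node
        self.continues: List[int] = []   # part (b): transactions whose app goes deeper


def build_hti(relation: Dict[int, Set[str]], k: int):
    """Return (access tree root, same-label node links, inverted file of infrequent items)."""
    counts = Counter(item for items in relation.values() for item in items)
    order = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: pos for pos, item in enumerate(order)}
    frequent = set(order[:k])            # the top-k items form I_fr
    root = AccessNode(None, None)
    same_label: Dict[str, List[AccessNode]] = defaultdict(list)
    inverted: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for tid in sorted(relation):
        items = relation[tid]
        app = sorted(items & frequent, key=rank.__getitem__)
        node = root
        for depth, item in enumerate(app):
            child = node.children.get(item)
            if child is None:
                child = AccessNode(item, node)
                node.children[item] = child
                same_label[item].append(child)   # link nodes that share a label
            node = child
            if depth == len(app) - 1:
                node.ends_here.append(tid)
            else:
                node.continues.append(tid)
        for item in items - frequent:            # infrequent items: plain inverted lists
            inverted[item].append((tid, len(items)))
    return root, dict(same_label), dict(inverted)
```

With k = 2 the node of f holds the two parts [3, 5, 7] and [1, 6], and the item c is reached through two linked nodes, one under f and one directly under the root.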
For more details on the creation and maintenance of the HTI index we refer the interested reader to the long version of the paper [18].

Compression and caching. There is a question of how the HTI index compares to inverted files when compression techniques are applied [16, 14] or a cache equal to the access tree size is given to the inverted file. As far as the former is concerned, the HTI index is complementary to compression and not competitive to it. If the lists become smaller, then we can reduce the size of the HTI by using a smaller threshold. Giving cache to the inverted lists, on the other hand, may be a good solution for uniform distributions with large vocabularies. Still, the effectiveness of the cache is dependent on how big it is when compared with the total inverted file and it will be reduced as the size of the inverted file grows. On the contrary, the main memory requirements of the HTI index depend mostly on the size of the vocabulary, since duplicate or similar transactions do not affect its size and effectiveness. Thus, for small vocabularies and especially for skewed distributions, the HTI index is a better choice.

4. QUERY EVALUATION
In this section, we present the evaluation algorithms for the three types of queries that we are interested in: subset, equality and superset. The evaluation algorithms for all types of queries have two main stages: (a) evaluation in the access tree, and (b) evaluation in the inverted file. The evaluation in the access tree concerns the frequent items of the query set, and the evaluation in the inverted file the rest of the items. The basic idea is that we use the access points to the lists offered by the trie to quickly trace the final or a candidate answer to the query. The benefit is quite significant since the access points are given for the largest lists, which correspond to the items of $I_{fr}$. This way we avoid expensive union or intersection operations between the lists indexed by the access tree, and instead we implicitly perform these operations in the tree itself. For all three cases of queries, we assume a query set of the form $qs = \{f_1, \dots, f_k, i_{k+1}, \dots, i_n\}$, where the first k items $f_i$ are the frequent items of the query set, belonging to the access tree, and the next n - k items $i_j$ are the infrequent items that are only indexed by the inverted file. In the following, we detail the evaluation techniques for each type of query.

4.1 Subset queries
Subset queries are the most common queries executed against transaction and text collections and the most broadly studied in the research literature. Furthermore, the evaluation of many query classes, including ranking ones, partially resolves to the evaluation of subset queries. The main idea around evaluating subset queries is that the transactions that contain the app part of the qs can easily be identified by using the access tree, without merging the respective inverted lists. This is efficiently done by tracing all the appearances of the last element of the app, $f_k$ (which is also the least frequent in app), and then identifying which paths from the root to the $f_k$ nodes contain the app of the qs. These paths possibly contain other frequent items too, but they necessarily contain the app of the query set. We call the set of the retrieved transaction ids candidateIds. Possibly, apart from the frequent items, there are also infrequent items in the query set. The only way to access these infrequent items $i_{k+1}, \dots, i_n$ is through the inverted file. Therefore, to compute the final query answer we must find the intersection of the lists of transaction ids that correspond to the infrequent items $i_{k+1}, \dots, i_n$ with the list of the already retrieved candidateIds. Any transaction id that belongs to this result contains both the frequent items of the app and the infrequent items $i_{k+1}, \dots, i_n$. The algorithm in pseudo-code is depicted in Figure 6.

Algorithm SubsetQueries
Input: An HTI index H over a dataset D, a query set $qs = \{f_1, \dots, f_k, i_{k+1}, \dots, i_n\}$ and a query $Q = \{t \mid qs \subseteq t.s\}$.
Output: the t.ids of the transactions that contain qs
Method:
1. Determine the app = $\{f_1, \dots, f_k\}$ of the query set.
2. If app is not empty, use subsetTrie(app) to retrieve the candidateIds from the trie.
3. If $\{i_{k+1}, \dots, i_n\}$ is not empty in the query set:
4.     result = merge-join the candidateIds with the lists of $\{i_{k+1}, \dots, i_n\}$
5. else
6.     result = candidateIds
7. return result

Function subsetTrie(app)
Input: An HTI index H over a dataset D, the app of the qs
Output: The candidateIds, i.e., the t.ids of the transactions that contain the items of app
Method:
1. Let $f_k$ be the last item (least frequent) of app
2. For every appearance of $f_k$ in the trie
3.     if every item $f_i \in$ app appears in the path from the root to the current node
4.         add the t.ids of the sublists of the current node to the candidateIds
5. return candidateIds
Figure 6: Algorithm for determining subset queries

Assume for example that the user asks for all transactions that contain the {f, c, a} items from the relation D depicted in Figure 1. If we evaluate the query against the inverted file depicted in Figure 2, we would have to perform a merge-join of the inverted lists of all the items in the query set. That would require six disk page accesses and we would only have one answer, that is t.id = 1. If, instead, we evaluate the query against the HTI index, the disk page accesses are much fewer. First, we have to identify the app of the qs, which is f → c. Then, we must trace all the c nodes of the access tree and identify the paths from root to c which contain the rest of the items of app, i.e., f. This results in only one path: root → f → c. Now we can directly retrieve the transactions that contain f and c, which are 1 and 6, by performing only 1 page access. Subsequently we can merge-join {1, 6} with the inverted list of a to retrieve the final answer. The total number of disk page accesses we encounter in this case is two. In general, if the inverted lists of the items of the qs (ordered by frequency) cover $l_1, \dots, l_n$ disk pages, the worst case evaluation will require $l_1 + \dots + l_n$ disk page accesses.
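The two procedures of Figure 6 can be turned into a small runnable sketch. It is a simplified in-memory illustration under assumptions: the index is rebuilt here so that the block is self-contained, the labels "f" and "c" and all names (`Node`, `build_index`, `subset_trie`, `subset_query`) are hypothetical, sublists hold ids only, and Python set intersection stands in for the merge-join of sorted lists.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set

# Example relation of Figure 1 (labels "f" and "c" are assumed).
D: Dict[int, Set[str]] = {
    1: {"f", "a", "c"}, 2: {"c", "b", "d"}, 3: {"f", "a"}, 4: {"a", "c"},
    5: {"f", "d"}, 6: {"f", "c"}, 7: {"f"},
}


class Node:
    def __init__(self, label: Optional[str], parent: Optional["Node"]) -> None:
        self.label, self.parent = label, parent
        self.children: Dict[str, "Node"] = {}
        self.tids: List[int] = []  # sublist: transactions whose app passes through this node


def build_index(relation: Dict[int, Set[str]], k: int):
    counts = Counter(item for items in relation.values() for item in items)
    order = sorted(counts, key=lambda item: (-counts[item], item))
    rank = {item: pos for pos, item in enumerate(order)}
    frequent = set(order[:k])
    root = Node(None, None)
    node_links = defaultdict(list)   # frequent item -> access tree nodes with that label
    inverted = defaultdict(list)     # infrequent item -> sorted transaction ids
    for tid in sorted(relation):
        items, node = relation[tid], root
        for item in sorted(items & frequent, key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                node_links[item].append(child)
            child.tids.append(tid)
            node = child
        for item in items - frequent:
            inverted[item].append(tid)
    return rank, frequent, node_links, inverted


def subset_trie(app: List[str], node_links) -> List[int]:
    """Function subsetTrie: ids of the transactions that contain every item of app."""
    last, needed = app[-1], set(app[:-1])
    candidates: List[int] = []
    for node in node_links.get(last, []):        # every appearance of the last item
        seen, ancestor = set(), node.parent
        while ancestor is not None and ancestor.label is not None:
            seen.add(ancestor.label)             # items on the path from the root
            ancestor = ancestor.parent
        if needed <= seen:
            candidates.extend(node.tids)
    return sorted(candidates)


def subset_query(qs, index) -> List[int]:
    """Algorithm SubsetQueries; set intersection replaces the merge-join."""
    rank, frequent, node_links, inverted = index
    qs = set(qs)
    if not qs:
        raise ValueError("this sketch expects a non-empty query set")
    if any(item not in rank for item in qs):
        return []                                # an item that occurs in no transaction
    app = sorted(qs & frequent, key=rank.__getitem__)
    result = set(subset_trie(app, node_links)) if app else None
    # Process the infrequent items starting from the least frequent one.
    for item in sorted(qs - frequent, key=rank.__getitem__, reverse=True):
        postings = set(inverted[item])
        result = postings if result is None else result & postings
    return sorted(result)
```

For the worked example, `subset_query({"f", "c", "a"}, build_index(D, 2))` first obtains the candidates 1 and 6 from the single path root → f → c and then keeps only the ids that also appear in the list of a.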
The worst-case cost of $l_1 + \cdots + l_n$ page accesses holds for both the inverted file and the HTI index but, as the experiments in Section 5 show, the average case clearly favors the HTI index. The benefit of using the access tree comes from the fact that we avoid performing intersections between the largest lists. This benefit can be very significant, especially if the frequent items are not correlated. Moreover, the larger the lists are and the more skewed the distribution of the items is, the greater the benefit we gain from using the access tree.

Some more technical notes should also be made about the algorithm of Figure 6. Although the simplified form of the algorithm implies that we use the access tree to actually retrieve the t.ids from the disk and put them in the candidateids, this is not the most effective implementation in most cases. Instead, we return the links to the sublists on the disk, which are then merge-joined with the lists of the $\{i_{k+1}, \ldots, i_n\}$ items. Furthermore, the merge-join is performed starting from the least frequent item, so it is not always necessary to use the access tree. In some cases we can quickly decide that there is no solution by intersecting the smaller lists, and avoid any further computation. In practice, the algorithm first traverses the access tree and decides whether there is a solution for the app items and how many page accesses it will need to retrieve them. Depending on that number, the algorithm decides the order of the merge-joins, i.e., whether it will start from the trie or from the inverted file.

4.2 Equality queries

Employing the HTI index for equality queries leads to very efficient evaluations. For each query, only one path of the access tree has to be identified: the path that is identical to the app of the query set. Assuming that nodes are organized in some efficient data structure, like hash arrays, the evaluation on the trie can be done in time $O(|app|)$, that is, proportional to the size of the app of the query set. After identifying the single sublist that possibly satisfies the query, it has to be intersected with the lists of the non-frequent items.
In the process of the merge-join, the transactions are filtered according to their length, which must be equal to $|qs|$. We refer the interested reader to the long version of the paper [18] for the pseudocode of the evaluation algorithm. The worst case in terms of page accesses is again the same as for subset queries. Still, experiments show that, whereas evaluating equality queries in the inverted file requires as many page accesses as the respective subset queries, the results with the HTI index are a lot better in this case.

4.3 Superset queries

Superset queries are by far the most expensive queries we study. In a sense, a superset query is equivalent to $2^n$ equality queries, one for each of its subsets. The evaluation algorithms, even those that work only on the inverted file, require significantly fewer page accesses than $2^n$ equality queries, but the number is still high. If the lists of the items of the qs (ordered by frequency) need $l_1, \ldots, l_n$ disk pages respectively, evaluating a superset query solely in the inverted file, with the algorithm presented in Figure 7, requires in the worst case $l_1 + 2l_2 + \cdots + n\,l_n$ disk page accesses.

As in the case of equality, the access tree can drastically boost the efficiency of the query evaluation. The basic idea is to find all the paths in the trie that are constructed solely from items of the app of the query. We can then safely add to candidateids the ids of all the transactions that end in any node of these paths. For these transactions we know that they do not contain any item of $I_{fr}$ other than $f_1, \ldots, f_k$. If the qs has non-frequent items too, then we have to check in the inverted file whether the remaining items of the transactions of candidateids are all among $i_{k+1}, \ldots, i_n$. If the qs does not contain any other items, we filter the candidateids using their length and the length of the path that led to them as pruning criteria. If the length of a transaction is greater than the length of its app, which can be inferred from the trie without examining the transaction itself, the transaction is dropped, since it must have more items that are not contained in qs. The algorithm for evaluating the superset query is presented in Figure 7.

The reduction of the disk pages accessed, when using the HTI index for superset queries, is not only attributed to the access points offered by the trie.
It is also a result of the possibility of identifying exactly the transactions whose app ends at the access tree nodes, as opposed to the rest of the transactions within the same sublist.

Algorithm SupersetQueries
Input: An HTI index H over a dataset D, a query set $qs = \{f_1, \ldots, f_k, i_{k+1}, \ldots, i_n\}$ and a query $Q = \{t \mid qs \supseteq t.s\}$.
Output: the t.ids of the transactions where $t.s \subseteq qs$
Method:
  1. Determine the app = $\{f_1, \ldots, f_k\}$ of the query set.
  2. If app is not empty, use supersettrie(app, root) to retrieve the candidateids from the trie.
  3. Let $il_1, \ldots, il_m$ be the lists of all the non-frequent items of the qs and the candidateids, ordered according to the number of memory pages
  4. for (i = 1; i ≤ n; i++)
  5.   for each entry t of $il_i$
  6.     unmatched = t.length − 1
  7.     if (unmatched == 0) add t to result and break
  8.     for (j = i + 1; j ≤ n; j++)
  9.       if (unmatched > n − j) break
  10.      if (unmatched == 0) add t to result and break
  11.      scan forward $il_j$
  12.      if t found in $il_j$, unmatched = unmatched − 1
  13. return result

Function supersettrie(app, currentnode)
Input: An HTI index H over a dataset D, the app of the qs, the root of the trie as currentnode
Output: The candidateids, i.e. the t.ids of the transactions whose items are contained in app
Method:
  1. while (app not empty)
  2.   newcnode = pop(app)
  3.   if newcnode is child of currentnode
  4.     add the sublist of newcnode to candidateids
  5.     supersettrie(app, newcnode)
  6. return candidateids

Figure 7: Algorithm for determining superset queries

5. EXPERIMENTAL STUDY

As several surveys and previous research have demonstrated, inverted files, although a simple technique, offer better performance than signature-based methods for low-cardinality set values [7] and for document indexing [21]. Moreover, they outperform traditional indices like B-trees for containment queries in RDBMSs [20]. For these reasons, we chose inverted files as the main point of reference for the evaluation of the HTI index.

5.1 Methodology

HTI index. We have implemented a prototype of the HTI index according to the description we gave in Section 3. Since query evaluation performance is dominated by disk accesses, our implementation is aimed at providing accurate results on the number of disk page accesses during query evaluation on the HTI index. Some aspects of the index functionality were simulated: disk pages are 4k arrays in main memory, and sibling nodes are stored in linked lists instead of arrays. This implementation provides accurate results both on the page accesses and on the size of the access tree in main memory. The former are explicitly counted by the program and the latter can be computed by ignoring the links between sibling nodes.

Inverted files. We have implemented a basic version of the inverted file index. The vocabulary is kept in a hash table and the inverted lists in 4k arrays corresponding to disk pages. Each entry in the inverted file comprises the id and the length of a transaction. The size of each entry is $e_s$ = sizeof(long int) + sizeof(short int), which is 6 bytes in our case.

Real data. We have evaluated HTI on two real datasets from the UCI KDD archive [8]. Both of them are logs of user behavior on web portals. The first one, denoted as msweb, is a one-week log tracing the virtual areas that users visited in a web portal. Each record corresponds to a user session and the set value comprises the areas she/he visited. There are 32k records and the vocabulary of the dataset contains 294 distinct items (areas). The distribution of the items in the records is skewed and the average size of a record is 3 items. Since the dataset is small, to better illustrate the performance of the two indices we created a new one by duplicating the records by a factor of 10, which resulted in a dataset of 320k records. This multiplication is reasonable, since it simply corresponds to a 10-week log. The second dataset, denoted msnbc, is again a log of users' behavior on the web portal of msnbc.com, taken from the UCI

KDD archive as well. The vocabulary here is very limited, comprising only 17 distinct items and, unlike the previous dataset, the distribution of the items is relatively uniform. The average size of a record is 5.7 items.

Synthetic data. To investigate how HTI behaves for datasets and domains larger than the ones we had from real sources, we used synthetic data with a skewed Zipfian distribution of order 1 (as in [7]). Duplicates in each transaction were dropped and we ended up with transactions with lengths from 2 to 22 items, uniformly distributed.

Query generation. We created query sets for all three types of queries. As in other approaches [7], we consider the evaluation of the proposed method on queries that always have a solution as more informative. We created such queries by randomly selecting existing transactions from D. For the synthetic data, we ranged the number of items in the query set, $|qs|$, from 2 to 22 and we created 5 queries of each type. For the real data, we ranged $|qs|$ from 2 to 7, since their domain and average record length are a lot smaller. The selectivities of the subset queries are less than 3%, with the highest appearing for queries with $|qs| = 2$. The most common case for larger $|qs|$ and for equality queries is that there are fewer than 5 answers. On the other hand, the selectivity of superset queries can surpass 3% for large $|qs|$ on the real data.

Evaluation metrics. We evaluate the HTI index by considering two main factors: (a) the benefit it provides to query evaluation, compared to regular inverted files, and (b) the main memory requirements it imposes. We evaluate the benefit to query evaluation by counting page accesses as the dominating factor of the problem. We show how main memory requirements are affected by the different parameters of D by providing the number of access tree nodes.

Experimental setup. We implemented both methods in C on a Linux platform (Suse 9.3) and compiled them with gcc. Our experiments were performed on an AMD Sempron 28+ with 2G of main memory.
The disk page accesses were directly counted by the program, by tracing how many of the 4k arrays were accessed.

5.2 Performance of the HTI index

Real data

To measure the benefit on query evaluation provided by the HTI on real data, we evaluated subset, equality and superset queries against the inverted file and the HTI index. For the HTI index we varied the threshold, i.e., the percentage of items that comprise $I_{fr}$. The results are depicted in Figures 8 and 9. For the msweb data, which are skewed but have a larger vocabulary than the msnbc data, we used the thresholds 5%, 20% and 40%. The size of the access tree that must be kept in main memory is small in all cases, with the biggest being around 35k, for threshold 40%. For the msnbc data, where the vocabulary is very small, we used the thresholds 20%, 60% and 100%. The largest access tree in this case is around 2k, for threshold 100%. Note that for a threshold of 100% all items of I are indexed by the access tree, so for all types of queries no false positives are retrieved from the disk (we can infer the length of a transaction from the length of the access tree path if all items are indexed by the access tree).

As we can see, the HTI index outperforms the inverted file in all cases. Moreover, it scales a lot better as the size of the query grows. For the larger queries, the performance of HTI (with a suitable threshold) is at least an order of magnitude better for all types of queries.

Synthetic data

By using synthetic data we are able to trace the impact of the vocabulary I, the size of the dataset D and the size of the query set qs on the HTI index. In the following we investigate how each of the query types we introduced is affected by these factors.

Subset. In Figure 10 we see how the inverted file and the HTI index perform for subset queries. We compare three versions of the HTI index with the inverted file, each time varying the threshold. Consider the first variant of the HTI index, with an $I_{fr}$ of only the top 0.5% of the total items.
In all three experiments of Figure 10, we count the average number of page accesses performed by all our queries on all our datasets as a function of (a) the size of the vocabulary, $|I|$ (left); (b) the size of the underlying database, $|D|$ (center); and (c) the number of items belonging to the query set, $|qs|$ (right). In all three cases, results are averaged over all parameters that do not appear in each figure. Thus, when varying $|D|$ we present the average of the results for all $|I|$ and $|qs|$, when varying $|I|$ we present the average of the results for all $|D|$ and $|qs|$, and when varying $|qs|$ we present the average of the results for all $|D|$ and $|I|$. Individual results obey the general trend and are omitted in the interest of space.

In all cases, the HTI index outperforms the inverted file by a significant factor. It is important to note that the HTI seems to scale a lot better for large databases and large queries: whereas in the average case the increase of $|D|$ seems to have a linear impact on the page accesses for both methods, the gradient of the HTI index performance is significantly smaller. The larger the threshold is, the smaller the increase in disk page accesses. Furthermore, the increase of $|qs|$ has a divergent impact on the performance of the inverted file and the HTI index. In the former case it is followed by a proportional increase in disk page accesses, whereas in the latter case the required number of page accesses is reduced. This is because, when dealing with large queries, the chance of having more items from $I_{fr}$ is greater, and thus so is the chance of performing a more effective pruning in the access tree. The increase of the vocabulary size seems beneficial both for the HTI index and the inverted file but, as we show in the experiments on the HTI size, it significantly increases the memory requirements for the access tree.

Equality. Equality queries favor the HTI index even more. In Figure 11 we assess the number of page accesses for equality queries as a function of (a) the vocabulary size, $|I|$ (left); (b) the size of the underlying database, $|D|$ (center); and (c) the number of items of the query set, $|qs|$ (right).
The evaluation in the inverted file requires exactly the same page accesses for equality queries as it did for subset queries. On the other hand, evaluating equality queries in the HTI requires less than half of the disk page accesses it did for the respective subset ones. This effect is even greater for queries with a low-cardinality qs. The main reason that equality queries behave better with the HTI index is that each query requires retrieving at most one list from the access tree.
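The single-path lookup behind this behaviour can be sketched in Python. The sketch rests on assumptions of our own rather than on the paper's on-disk layout: the trie is a nested dictionary, each node keeps the (id, length) pairs of the transactions whose frequent items end exactly at that node, and the helper names and the small dataset are hypothetical.

```python
def build_index(transactions, frequent):
    """Toy access tree plus inverted lists; `frequent` is most frequent first."""
    rank = {item: r for r, item in enumerate(frequent)}
    root = {"children": {}, "ends": []}
    inverted = {}
    for tid in sorted(transactions):
        items = transactions[tid]
        node = root
        for item in sorted((i for i in items if i in rank), key=rank.get):
            node = node["children"].setdefault(
                item, {"children": {}, "ends": []})
        # Store the transaction length with the id, as the inverted-file
        # entries in the experiments do.
        node["ends"].append((tid, len(items)))
        for item in items:
            if item not in rank:
                inverted.setdefault(item, []).append(tid)
    return rank, root, inverted


def equality_query(index, qs):
    """Ids of the transactions whose item set is exactly qs."""
    rank, root, inverted = index
    app = sorted((i for i in qs if i in rank), key=rank.get)
    node = root
    for item in app:  # one dictionary lookup per app item
        node = node["children"].get(item)
        if node is None:
            return []  # no transaction has this frequent part
    # Length filter: an answer must have exactly |qs| items.
    ids = {tid for tid, length in node["ends"] if length == len(qs)}
    for item in qs:
        if item not in rank:
            ids &= set(inverted.get(item, []))
    return sorted(ids)


# Hypothetical data: f and c are treated as the frequent items.
D = {1: {"f", "c", "a", "m"}, 2: {"f", "b"}, 3: {"c", "b", "p"},
     4: {"f", "c", "b"}, 5: {"f", "a"}, 6: {"f", "c", "m", "p"}}
index = build_index(D, frequent=["f", "c"])
print(equality_query(index, {"f", "c", "b"}))  # [4]
```

Only the node reached by following the app is inspected, so at most one candidate list leaves the trie; the length filter and the intersection with the lists of the infrequent items then remove transactions that merely share the same frequent part.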

[Figure 8: Average performance of queries on msweb data]
[Figure 9: Average performance of queries on msnbc data]
[Figure 10: Average performance of subset queries]
[Figure 11: Average performance of equality queries]
[Figure 12: Average performance of superset queries]
[Figure 13: Effect of |I| (number of tree nodes versus |I|)]
[Figure 14: Effect of |D| (number of tree nodes versus |D|)]
[Figure 15: |I| = 5k, threshold 0.5% (number of tree nodes versus |D| in millions of transactions)]
[Figure 16: Effect of k (average page accesses and number of tree nodes versus threshold)]

Superset. As can be inferred from Figure 12, in superset queries the HTI index clearly outperforms the inverted file index. The inverted file performs very poorly, since it requires multiple scans of many inverted lists. Note that the disk page accesses performed in the evaluation of the superset queries surpass those needed by subset and equality queries by almost an order of magnitude.

5.3 Memory requirements of the HTI index

The size of the access tree of the HTI index for the real datasets we used is very small: for the msweb data it has only 1857 nodes (around 33kb) for a threshold of 5%, and in the worst case (threshold 40%) it has 2569 nodes (around 369kb). For the msnbc data, it has only 7 nodes for a threshold of 20%, and in the worst case (threshold 100%) it occupies 26kb. The size of the access tree is important, since it has to be resident in main memory; therefore, we investigated how it scales for larger $|D|$ and $|I|$ by using synthetic data.

Figures 13 and 14 show how the access tree is affected by the vocabulary size $|I|$ and the size of the database $|D|$. An interesting observation is that for smaller vocabularies, where the queries take longer to evaluate due to the existence of larger lists, the size of the access tree is smaller, too. This means that we can create HTI indices with larger thresholds to counter this effect. As the vocabulary increases, the maximum size of the trie grows superlinearly; thus, for large vocabularies the access tree tends to grow proportionally to the database size. For small vocabularies, the size of the access tree grows sublinearly (or remains stable if the maximum size has been reached) with respect to the database size. This is evident in Figure 15, where we vary the size of the database while keeping the vocabulary cardinality at 5k and the HTI threshold at 0.5%. In the respective experiment with $|I|$ = 1k the tree reaches its maximum size (31 nodes) very soon and remains invariant to the size of D.
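How the threshold drives the number of access tree nodes can be illustrated with a short Python sketch. It is a toy computation under our own assumptions, not a reproduction of the measurements above: the function name `access_tree_size` and the six-transaction dataset are hypothetical, the threshold is taken as a percentage of the vocabulary, and a node is counted as a distinct prefix of a transaction's frequent items ordered by decreasing frequency.

```python
from collections import Counter


def access_tree_size(transactions, threshold_percent):
    """Number of trie nodes (root excluded) when the top
    threshold_percent of the vocabulary is treated as frequent."""
    counts = Counter(item for items in transactions.values() for item in items)
    # Most frequent first; ties broken by item name so the order is stable.
    vocabulary = sorted(counts, key=lambda item: (-counts[item], item))
    k = int(len(vocabulary) * threshold_percent / 100)
    rank = {item: r for r, item in enumerate(vocabulary[:k])}
    nodes = set()
    for items in transactions.values():
        path = tuple(sorted((i for i in items if i in rank), key=rank.get))
        for depth in range(1, len(path) + 1):
            nodes.add(path[:depth])  # every distinct prefix is one trie node
    return len(nodes)


# Hypothetical data with six distinct items.
D = {1: {"f", "c", "a", "m"}, 2: {"f", "b"}, 3: {"c", "b", "p"},
     4: {"f", "c", "b"}, 5: {"f", "a"}, 6: {"f", "c", "m", "p"}}
sizes = [access_tree_size(D, t) for t in (0, 50, 100)]
print(sizes)  # [0, 6, 12]
```

Because shared prefixes are stored once, the node count is bounded by the number of distinct prefixes rather than by the number of transactions, which is consistent with the observation that the tree stops growing once its maximum size has been reached.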
5.4 Threshold choice

Whereas the vocabulary and the database size depend on the data we have, the threshold for the HTI index is a choice we must make according to the speed requirements and the memory we have at our disposal. To highlight its effect, we created several HTI indices for different thresholds and we show their performance in Figure 16, varying the threshold from 0.2% to 1%. We depict simultaneously how the access tree grows (in number of nodes) and how the average page accesses for the three types of queries fall as the threshold grows. After a certain threshold the average disk page accesses are not significantly reduced, whereas the size of the access tree continues to grow, even if not as fast as for very low thresholds.

6. CONCLUSIONS

In this paper we have tackled the problem of containment queries on large collections of low-cardinality set-valued attributes. We have proposed a novel indexing scheme, the HTI index, which superimposes a trie tree (kept in main memory) over an inverted file (kept in secondary storage) to efficiently answer subset, superset and set-equality queries. We have introduced novel evaluation algorithms for these classes of queries that use the HTI index, and experimentally demonstrated that the HTI clearly outperforms the state-of-the-art organization scheme, i.e., the inverted file, with reasonable main-memory overhead. Our experiments have shown that our approach scales a lot more smoothly than inverted files and that in certain cases, for large database or query-set sizes, we can reduce the disk page accesses by orders of magnitude with a small overhead of main memory. Future work comprises further investigation of how to reduce the size of the access tree and how to exploit the HTI index to efficiently support other kinds of queries, such as set intersections or similarity queries.

7. REFERENCES

[1] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley.
[2] C. Faloutsos. Signature files. In Information Retrieval: Data Structures & Algorithms.
[3] A. Gionis, D. Gunopulos, and N. Koudas. Efficient and tunable similar set retrieval. In SIGMOD, 2001.
[4] R. Goldman and J. Widom. WSQ/DSQ: A practical approach for combined querying of databases and the web. In SIGMOD, 2000.
[5] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[6] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1):53–87, 2004.
[7] S. Helmer and G. Moerkotte. A performance study of four index structures for set-valued attributes of low cardinality. VLDBJ, 12(3), 2003.
[8] S. Hettich and S. D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science.
[9] R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In SIGMOD, 2004.
[10] D. E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley.
[11] N. Mamoulis. Efficient processing of joins on set-valued attributes. In SIGMOD, 2003.
[12] N. Mamoulis, D. W. Cheung, and W. Lian. Similarity search in sets and categorical data using the signature tree. In ICDE, 2003.
[13] S. Melnik and H. Garcia-Molina. Adaptive algorithms for set containment joins. ACM TODS, 28(1):56–99, 2003.
[14] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM TOIS, 14(4), Oct.
[15] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
[16] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In ACM SIGIR, Aug. 2002.
[17] M. Stonebraker and D. Moore. Object-Relational DBMSs: The Next Great Wave. Morgan Kaufmann.
[18] M. Terrovitis, S. Passas, P. Vassiliadis, and T. Sellis. HTI technical report. mter/papers/TR-HTI-1.pdf, 2006.
[19] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 2nd edition.
[20] C. Zhang, J. F. Naughton, D. J. DeWitt, Q. Luo, and G. M. Lohman. On supporting containment queries in relational database management systems. In SIGMOD, 2001.
[21] J. Zobel, A. Moffat, and K. Ramamohanarao. Inverted files versus signature files for text indexing. ACM TODS, 23(4):453–490.
[22] J. Zobel, A. Moffat, and R. Sacks-Davis. An efficient indexing technique for full text databases. In VLDB.
