Efficient and scalable trie-based algorithms for computing set containment relations

Yongming Luo #1, George H. L. Fletcher #2, Jan Hidders *3, Paul De Bra #4
# Eindhoven University of Technology, The Netherlands
1 y.luo@tue.nl, 2 g.h.l.fletcher@tue.nl, 4 debra@win.tue.nl
* Delft University of Technology, The Netherlands
3 a.j.h.hidders@tudelft.nl

Abstract

Computing containment relations between massive collections of sets is a fundamental operation in data management, for example in graph analytics and data mining applications. Motivated by recent hardware trends, in this paper we present two novel solutions for computing set-containment joins over massive sets: the Patricia Trie-based Signature Join (PTSJ) and PRETTI+, a Patricia trie enhanced extension of the state-of-the-art PRETTI join. The compact trie structure not only enables efficient use of main memory, but also significantly boosts the performance of both approaches. By carefully analyzing the algorithms and conducting extensive experiments with various synthetic and real-world datasets, we show that, in many practical cases, our algorithms are an order of magnitude faster than the state-of-the-art.

I. INTRODUCTION

Sets are ubiquitous in data processing and analytics. A fundamental operation on massive collections of sets is computing containment relations. Indeed, bulk comparison of sets finds many practical applications in domains ranging from graph analytical tasks (e.g., [1]–[3]) and query optimization [4] to OLAP (e.g., [5], [6]) and data mining systems [7].

As a simple example, consider an online dating website where each user has an associated profile set listing their characteristics such as hobbies, interests, and so forth. User dating preferences are also indicated by a set of such characteristics. By executing a set-containment join of the set of user preferences with the set of user profiles, the dating website can determine all potential dating matches for users, pairing each preference set with all users whose profiles contain all desired characteristics.
A concrete illustration can be found in Table I.

TABLE I: Example of set-containment join. If we perform a set-containment join ($\bowtie_{\supseteq}$) between user profiles and user preferences, we retrieve matching pairs {(u1, p1), (u1, p2), (u2, p3)}.

(a) user profiles
id    set             signature
u1    {b, d, f, g}    0111
u2    {a, c, h}       1011
u3    {a, c, d}       1011

(b) user preferences
id    set             signature
p1    {b, d}          0101
p2    {b, f, g}       0110
p3    {a, c, h}       1011

In this paper we consider efficient and scalable solutions to the following formalization of this common problem. Consider two relations R and S, each having a set-valued attribute set. The set containment join of R and S ($R \bowtie_{\supseteq} S$) is defined as

$R \bowtie_{\supseteq} S = \{(r, s) \mid r \in R \wedge s \in S \wedge r.\mathrm{set} \supseteq s.\mathrm{set}\}$.

State of the art: Due to its fundamental nature, the theory and engineering of set containment joins have been intensively studied (e.g., [8]–[18]). Existing solutions fall into two general categories: signature-based and information-retrieval-based (IR) methods. Signature-based methods (e.g., [8]–[12]) encode set information into fixed-length bit strings (called signatures), and perform a containment check on the signatures as an initial filter, followed by a validation of the resulting pairs using actual set comparisons. IR-based methods (e.g., [13]–[16]) build inverted indexes upon sets, storing tuple IDs in the inverted lists. A merge join between inverted lists will produce tuples that contain all such set elements. Typically, auxiliary indexes are created to accelerate inverted index entry look-ups and joins. Most of the focus of the state-of-the-art algorithms has been on disk-based algorithms (e.g., [11]–[13], [15], [16]). Though these algorithms have proven quite effective for joining massive set collections, the performance of these solutions is bounded by their underlying in-memory processing strategies, where less work has been done (see Section II).
For example, PSJ [11] and APSJ [12], two advanced disk-based algorithms, share the same in-memory processing strategy with the main-memory algorithm SHJ [8], which we'll discuss in detail in Section II-A.

To keep up with ever-increasing data volumes and modern hardware trends, we need to push the performance of set-containment join to the next level. Therefore, it is essential to revisit (and develop new) in-memory set-containment join algorithms. Such algorithms will serve both as an essential component for main memory databases [19] and as building blocks and inspiration for external memory and other computation models and platforms. This is challenging because existing work has already investigated many possible optimization techniques, such as bitwise operations [8], caching [13], reusing result sets [14], and so on.

Contributions: Nonetheless, by carefully analyzing the existing solutions and bringing in new data structures, in this research we propose two novel in-memory set-containment join algorithms that are in many cases an order of magnitude faster than the previous state-of-the-art. In our study, we scale the relations to be joined along three basic dimensions: set

cardinality, domain cardinality, and relation size. Here, set cardinality is the size of set values in the relations; domain cardinality is the size of the underlying domain from which set elements are chosen; and relation size is the number of tuples in each relation. The contributions of our study are as follows:

- We propose two novel algorithms for set-containment join. One is for the low set cardinality, high domain cardinality setting (PRETTI+); the other is for the remaining scenarios (PTSJ). Both algorithms make use of the compact Patricia trie data structure.
- Our PTSJ proposal is a signature-based method. Hence, the length of the signature is a critical parameter for the algorithm's performance. Therefore, we perform a detailed analysis on PTSJ for determining the proper signature length.
- We also detail how PTSJ can (1) be easily extended to answer other set-oriented queries, such as set-similarity joins, and (2) be efficiently adapted to a disk-based environment.
- We present the results of an extensive empirical study of our solutions on a variety of massive real-world and synthetic datasets, which demonstrate that our algorithms in many cases perform an order of magnitude faster than the previous state-of-the-art and scale well with relation size, set cardinality, and domain cardinality.

The rest of the paper is organized as follows. In the next section, we introduce the state-of-the-art solutions for set-containment join. In Sections III and IV we propose PTSJ and PRETTI+, our two new algorithms. Section V presents the results of our empirical study of all algorithms. We then conclude in Section VI with a discussion of future directions for research.

II. STATE-OF-THE-ART ALGORITHMS

In this section we describe two efficient in-memory set-containment join algorithms, SHJ and PRETTI. These solutions are representative of the state-of-the-art, and serve as baseline solutions in our later development and experiments.
For simplicity we assume in the following that domain values and tuple IDs are represented as integers.

A. Signature Hash Join

We first introduce the definition of signature [8]. A signature of tuple t (t.sig) can be seen as the output of some hash function h (i.e., t.sig = h(t.set)) such that $t_1.\mathrm{set} \subseteq t_2.\mathrm{set} \Rightarrow h(t_1.\mathrm{set}) \sqsubseteq h(t_2.\mathrm{set})$. Here the containment relation between two hash values is defined as $\mathit{sig}_1 \sqsubseteq \mathit{sig}_2 \iff \mathit{sig}_1 \,\&\, \neg\mathit{sig}_2 = 0$, where $\&$ and $\neg$ denote bitwise AND and NOT operations. We will also refer to the $\sqsubseteq$ relation as subset containment when there is no possibility of confusion.

A straightforward implementation of a signature hash function is as follows: assume the signature length $|t.\mathit{sig}|$ is b bits, all initially set to zero. If integer x is in t.set, we set the $(x \bmod b)$th higher-order bit of t.sig to 1. The resulting signature is essentially a compressed bitmap representation of t.set. In the signature column of Table I we show the 4-bit signature for each set in our example relations. Letters are mapped to integers starting from 1, in alphabetical order (i.e., a is mapped to 1, b to 2, and so forth). Note that tuples u2 and u3 have the same signature, but different set values.

The Signature Hash Join (SHJ) was proposed by Helmer and Moerkotte [8]. SHJ uses the signature structure as a concise representation for sets, and uses signature comparisons as filtering operations before performing real set comparisons. In the spirit of hash join, SHJ works as follows: (1) for each tuple s in S, compute s.sig, and insert (s.sig, s) into a hash map (idx); (2) for each tuple r in R, compute r.sig, enumerate all subsets of r.sig, and examine all tuples with such signatures in the hash map (hence in S), comparing them with r. Pseudocode for this approach can be found in Algorithm 1 and Algorithm 2.
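The signature function and the containment test described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function names are hypothetical, and the bit-position convention (element x sets bit (x - 1) mod b counted from the left) is an assumption chosen so that the output agrees with the signature column of Table I.

```python
def signature(elements, b):
    """Return the b-bit signature of a set of positive integers as an int.

    Assumption: element x sets bit (x - 1) % b counted from the left
    (position 0 is the highest-order bit), chosen to match Table I.
    """
    sig = 0
    for x in elements:
        pos = (x - 1) % b
        sig |= 1 << (b - 1 - pos)
    return sig


def sig_contained(sig1, sig2):
    """Signature containment test: every 1-bit of sig1 is also set in sig2."""
    return (sig1 & ~sig2) == 0


def encode(letters):
    """Map letters to integers in alphabetical order: a -> 1, b -> 2, ..."""
    return {ord(ch) - ord("a") + 1 for ch in letters}


B = 4
profiles = {"u1": encode("bdfg"), "u2": encode("ach"), "u3": encode("acd")}
preferences = {"p1": encode("bd"), "p2": encode("bfg"), "p3": encode("ach")}

for name, elems in {**profiles, **preferences}.items():
    print(name, format(signature(elems, B), "04b"))

# The signature test is only a filter: p3 and u3 share a signature although
# the set of p3 is not contained in the set of u3, so a real set comparison
# still has to follow.
p3_sig = signature(preferences["p3"], B)
u3_sig = signature(profiles["u3"], B)
print(sig_contained(p3_sig, u3_sig), preferences["p3"] <= profiles["u3"])
```

Because different sets can collide on the same signature, a positive signature test never replaces the final set comparison; it only reduces how often that comparison is needed.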
Here we split SHJ into two parts: a generalized signature join framework (Algorithm 1) that can be reused for other algorithms; and an enumeration algorithm used in SHJ (Algorithm 2) that can be replaced with more efficient algorithms (e.g., Algorithms 4 and 5 below).

Algorithm 1: SIGNATURE_JOIN() signature join framework
Input: relations S and R
Output: pairs of tuple IDs that have the set containment relation
 1  create index idx                  // e.g., in SHJ idx is a hash map
 2  for each s ∈ S do
 3      insert (s.sig, s) into idx
 4  for each r ∈ R do
 5      subset ← call subset enumeration algorithm    // e.g., SHJ_ENUM(r.sig, idx)
 6      for each s ∈ subset do
 7          if r.set ⊇ s.set then
 8              output (s, r)

Algorithm 2: SHJ_ENUM() subset enumeration of SHJ
Input: signature, hash_table
Output: tuple IDs that have signature containment relation
 1  create list
 2  for each subset ⊑ signature do    // enumerate all
 3      if subset ∈ hash_table then
 4          for each tuple in hash_table[subset] do
 5              add tuple to list
 6  return list

SHJ inspired other algorithms (e.g., PSJ [11] and APSJ [12]). It is one of the most efficient in-memory solutions for computing set-containment join. One drawback of SHJ comes from line 2 of Algorithm 2, where all subsets of a given signature are enumerated and validated in the hash map. Though the authors provide a very efficient procedure (with bitwise operations) to perform this enumeration, such a mechanism cannot scale with respect to signature length, and

therefore cannot scale with relation size and set cardinality. Consequently, all algorithms using this mechanism suffer from the same problem. In Section III, we provide a solution to this problem, with the introduction of an alternative data structure.

B. PRETTI Join

To the best of our knowledge, PRETTI (PREfix Tree based seT joIn) [14] is the most recent and efficient in-memory set-containment join algorithm. In contrast with SHJ, PRETTI operates on the space of set elements instead of on the space of signatures. In particular, PRETTI works as follows: given relations S and R, first build a prefix tree (trie) based on the ordered set elements of tuples in S; then build an inverted file based on set elements of tuples in R. Along any root-to-leaf path of the trie, the sets of tuples stored at descendant nodes contain the sets of tuples stored at their ancestors. Then, when traversing the trie from root to leaf, at each node a list of containment tuples can be generated by joining the tuples in the node and in the inverted list. The list is passed down the trie for further refinement. A sketch of the PRETTI join can be found in Algorithm 3. The recursive call operates on each child of the root node and goes down the tree in a depth-first-search manner. Figure 1 illustrates the trie structure after inserting the sets in user preferences from Table I.

[Fig. 1: Trie example for PRETTI, after inserting sets from user preferences (Table I). Paths: root → a → c → h (p3); root → b → d (p1); root → b → f → g (p2).]

Algorithm 3: PRETTI_JOIN() recursively join and output
Input: subtree root node, current_list, inverted index idx
Output: pairs of tuple IDs that have the set containment relation
    // Initially, current_list ← idx[node.label]
 1  for each s in node.tuples do
 2      for each r in current_list do
 3          output (s, r)
 4  for each child c of node do
 5      child_list = current_list ∩ idx[c.label]
 6      PRETTI_JOIN(c, child_list, idx)

Assume we have an inverted index created for user profiles from Table I as follows: {a: {u2, u3}, b: {u1}, c: {u2, u3}, d: {u1, u3}, ...}.
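The trie, the inverted index, and the recursive join of Algorithm 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PRETTI implementation of [14]: the class and function names are hypothetical, sets are assumed non-empty, elements are ordered by plain sorting, and the early exit on an empty candidate list is an added shortcut.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # element -> TrieNode
        self.tuples = []     # IDs of S-tuples whose sorted set ends at this node


def build_trie(S):
    """Insert the sorted elements of every set in S into a prefix tree."""
    root = TrieNode()
    for s_id, s_set in S.items():
        node = root
        for x in sorted(s_set):
            node = node.children.setdefault(x, TrieNode())
        node.tuples.append(s_id)
    return root


def build_inverted_index(R):
    """Map each element to the set of R-tuple IDs whose set contains it."""
    idx = {}
    for r_id, r_set in R.items():
        for x in r_set:
            idx.setdefault(x, set()).add(r_id)
    return idx


def pretti_join(node, current_list, idx, out):
    """current_list holds the R-tuples containing every element on the path."""
    for s_id in node.tuples:
        for r_id in current_list:
            out.append((s_id, r_id))
    for label, child in node.children.items():
        child_list = current_list & idx.get(label, set())
        if child_list:                     # added shortcut: nothing left to match
            pretti_join(child, child_list, idx, out)


def pretti(R, S):
    idx, root, out = build_inverted_index(R), build_trie(S), []
    for label, child in root.children.items():
        first_list = idx.get(label, set())
        if first_list:
            pretti_join(child, first_list, idx, out)
    return out


profiles = {"u1": set("bdfg"), "u2": set("ach"), "u3": set("acd")}
preferences = {"p1": set("bd"), "p2": set("bfg"), "p3": set("ach")}
print(sorted(pretti(profiles, preferences)))
```

Note that the candidate list only ever shrinks on the way down a path, which is how earlier containment results are reused by deeper nodes.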
When PRETTI executes on the trie in Figure 1 with this inverted index, it first finds all tuples that contain element b by probing the inverted index, which yields {u1}. Then the list is carried to b's children nodes. At node d, for instance, the list is joined with the inverted list for element d, which is {u1, u3}. Since we see one tuple p1 on the current node, and only u1 is left in the list, we can conclude that $u_1 \supseteq p_1$. Such actions are performed on all nodes in the trie, and thereby PRETTI finds all containment relations.

PRETTI is a very efficient algorithm. It only traverses the trie once to generate all results. Set comparisons are naturally performed while traversing and, most interestingly, early containment results are reused for further comparisons. PRETTI has two main weak points. First, many auxiliary data structures, such as the trie and the inverted index, are built for the algorithm, which can consume too much space if set cardinality is high. Second, varied-length set comparisons can be time consuming in comparison with fixed-length signature comparisons, especially when set cardinality is high. In our later empirical evaluation we will see that PRETTI can perform quite well for low set cardinality datasets. However, due to excessive main memory consumption and element comparisons, it cannot scale with either larger relations or higher set cardinalities. Later in this paper, we develop extensions to PRETTI to overcome this main-memory consumption problem.

III. PATRICIA TRIE-BASED SIGNATURE JOIN (PTSJ)

Let's reconsider SHJ from Section II-A. After all signatures are computed, given one signature r.sig, SHJ needs two steps to get its subset results: (1) enumerate all subsets of r.sig; (2) check whether each subset exists as a hash map entry and perform set comparisons afterwards. It is difficult for this mechanism to scale to longer signatures, because the number of possible subsets of a given signature is exponential ($2^b$) in the signature length.
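To make the two steps and their exponential cost concrete, the following sketch runs an SHJ-style join on the data of Table I. It is an illustration only: the names are hypothetical, the signature convention is the assumption that reproduces Table I, and the subset enumeration uses the well-known `(sub - 1) & sig` bit idiom, which is not necessarily the exact procedure of [8].

```python
from collections import defaultdict


def signature(elements, b):
    """b-bit signature; element x sets bit (x - 1) % b from the left (assumed)."""
    sig = 0
    for x in elements:
        sig |= 1 << (b - 1 - (x - 1) % b)
    return sig


def enumerate_subsets(sig):
    """Yield every bit pattern contained in sig: 2**popcount(sig) values."""
    sub = sig
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & sig       # next smaller pattern that stays inside sig


def shj_join(R, S, b):
    """Return (s_id, r_id) pairs with R[r_id] a superset of S[s_id]."""
    idx = defaultdict(list)                      # signature -> IDs of S-tuples
    for s_id, s_set in S.items():
        idx[signature(s_set, b)].append(s_id)
    out = []
    for r_id, r_set in R.items():
        for sub in enumerate_subsets(signature(r_set, b)):   # step (1)
            for s_id in idx.get(sub, ()):                    # step (2)
                if S[s_id] <= r_set:                         # real set comparison
                    out.append((s_id, r_id))
    return out


def encode(letters):
    return {ord(ch) - ord("a") + 1 for ch in letters}


R = {"u1": encode("bdfg"), "u2": encode("ach"), "u3": encode("acd")}
S = {"p1": encode("bd"), "p2": encode("bfg"), "p3": encode("ach")}
print(sorted(shj_join(R, S, 4)))
print(len(list(enumerate_subsets(0b0111))))   # subsets probed for u1
```

Every probe signature with k bits set costs 2^k hash look-ups regardless of how many signatures S actually contains, which is the scaling problem discussed here.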
Therefore, in real cases, only part of the signature is used for enumeration purposes (and for creating hash map entries). Based on our experience, this partial signature length cannot even reach 20 bits due to its exponential time complexity. This mechanism essentially limits the possible performance gain of SHJ.

However, it is not necessary to enumerate all possible subsets, but rather only those that actually exist in a relation. Hence, we only need $O(|S|)$ time to enumerate the subsets of r.sig (that exist in S). This is the core idea of our initial algorithm. We will first introduce our algorithm using a simple binary trie, and later with a Patricia trie.

A. Trie-based Signature Join

Recall that a trie is a basic tree data structure for storing strings. One property of tries is that strings within a subtree share the same path (prefix) from the root to the subtree. Here we use a binary trie, which stores binary strings (i.e., signatures) and tuples associated with a given signature. After we insert all signatures into the trie, since signatures have the same length, we get a trie whose height equals the signature length. From the root, each level of trie nodes represents one bit position in the signatures. Tuple IDs and set values are stored in the leaves of the trie. An example of a binary trie can be found in Figure 2.

When performing a breadth-first search on a trie, in the end we enumerate all existing signatures by visiting the leaves. If we restrict our search at each level of the trie using some given

signature as guidance, we get the subset enumeration algorithm TRIE_ENUM(), given in Algorithm 4. The basic idea is that, while traversing the trie level by level, we are examining all signatures bit by bit. Then, if we take the input signature into consideration, the search space shrinks every time a bit 0 is encountered. We use a queue to hold nodes whose prefixes are subsets of the input signature. When Algorithm 4 finishes, all bits of the input signature have been examined, and all signatures that are a subset of the input signature are in the queue. We can then directly perform a set comparison of these tuples with the input tuple, by simply plugging TRIE_ENUM() into line 5 of Algorithm 1.

[Fig. 2: Trie example, after inserting signatures 0101, 0110, 1011 from user preferences (Table I) into an initially empty trie. Left branches store signatures with prefix bit 0 and right branches store signatures with prefix bit 1.]

Algorithm 4: TRIE_ENUM() subset enumeration using trie
Input: signature, trie
Output: tuple IDs that have signature containment relation
 1  create queue q
 2  i ← 0
 3  current_bit ← signature[i++]
 4  enqueue trie.root on q
 5  while q.top has children do
 6      node ← dequeue from q
 7      if current_bit = 0 then       // if node.left exists
 8          enqueue node.left on q
 9      else                          // if node.left and node.right exist
10          enqueue node.left and node.right on q
11      current_bit ← signature[i++]
12  return q

For example, if we want to find containment relations for u1 in Table I, we first get its signature 0111. Then, while we run Algorithm 4, all nodes in the left branch of Figure 2 are visited and placed on the queue. In the end, p1 and p2 at the leaf nodes are returned.

A limitation of this approach is that the trie contains many unnecessary nodes that have only one child (which we later refer to as single-branch nodes). We also see this in Figure 2. For k signatures (with b bits each), if there are no single-branch nodes, ideally the trie should have around 2k nodes. But instead, it will in the worst case need $k(b - \log_2 k) + 2k$ nodes. The longer the signature, the more single-branch nodes it has. Moreover, these nodes all need to be enqueued and visited. In an empirical study, we witnessed that Algorithm 4 performs slower than SHJ. Therefore we claim that Algorithm 4 is not practical to use, and exclude it from the later empirical study (Section V).

B. Introducing the Patricia Trie

Knowing the weakness, we can improve the design accordingly. To avoid single-branch nodes, we adopt a data structure called the Patricia trie [20], [21], which is specifically designed for this purpose. Essentially, a Patricia trie merges single-branch nodes into one node, so it can guarantee that all nodes have full branches (in our case two-way branches). Of course, in the worst case a Patricia trie is not better than a regular trie, but as we'll see in the experiments, that rarely happens for randomly-generated and real-world datasets.

Figure 3 shows what a Patricia trie looks like if we insert the same signatures as in Figure 2. First, because there is a bit difference at position 0, one node is created at this position. Here, the right branch has no more splitting points, so it directly points to 1011. For the left branch, there is another splitting point at position 2, so another node is created accordingly, and each signature belongs to one of the branches.
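The construction just described, and a subset enumeration over it in the spirit of Algorithm 4, can be sketched as follows. This is a simplified illustration with hypothetical names, not the authors' data structure: it builds the trie in one pass from a dictionary of distinct signatures rather than by incremental insertion, and each inner node keeps a sample signature and re-checks the whole shared prefix, where a tuned implementation would only compare the segment added since the previous split.

```python
class Leaf:
    def __init__(self, sig, tuples):
        self.sig, self.tuples = sig, tuples


class Inner:
    def __init__(self, split, sample, left, right):
        self.split = split       # leftmost bit position where signatures below differ
        self.sample = sample     # any signature below; its first `split` bits are shared
        self.left, self.right = left, right   # bit 0 / bit 1 at position `split`


def build(groups, b):
    """groups: non-empty dict mapping b-bit signature -> list of tuple IDs."""
    sigs = list(groups)
    if len(sigs) == 1:
        return Leaf(sigs[0], groups[sigs[0]])
    all_or = all_and = sigs[0]
    for s in sigs[1:]:
        all_or |= s
        all_and &= s
    split = b - (all_or ^ all_and).bit_length()   # first differing position from the left
    shift = b - 1 - split
    left = {s: t for s, t in groups.items() if not (s >> shift) & 1}
    right = {s: t for s, t in groups.items() if (s >> shift) & 1}
    return Inner(split, sigs[0], build(left, b), build(right, b))


def enum_subsets(root, query, b):
    """Collect tuple IDs whose signature is contained in `query`."""
    result, stack = [], [root]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            if (node.sig & ~query) == 0:
                result.extend(node.tuples)
            continue
        prefix_mask = ((1 << node.split) - 1) << (b - node.split)
        if node.sample & prefix_mask & ~query:
            continue             # shared prefix has a 1 where the query has a 0
        stack.append(node.left)
        if (query >> (b - 1 - node.split)) & 1:
            stack.append(node.right)
    return result


def inner_splits(node):
    """Split positions of all inner nodes, in pre-order."""
    if isinstance(node, Leaf):
        return []
    return [node.split] + inner_splits(node.left) + inner_splits(node.right)


groups = {0b0101: ["p1"], 0b0110: ["p2"], 0b1011: ["p3"]}
root = build(groups, 4)
print(inner_splits(root))                       # split positions of the inner nodes
print(sorted(enum_subsets(root, 0b0111, 4)))    # candidates for u1 (0111)
```

For the three signatures of Figure 2, the sketch creates inner nodes only at the two positions where signatures actually diverge, which is the saving over a plain binary trie.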
Overall, 2 extra nodes are created, and there is no single-branch node in the trie.

[Fig. 3: Patricia trie example, inserting the same signatures as in Figure 2 into a Patricia trie.]

In this paper we apply a slight modification to the original Patricia trie. In our version of a Patricia trie node, we store (1) pointers to the left and right nodes, (2) the indexes at which the prefix starts and splits, and (3) the common prefix from the last split point to the current split point.

We define a subset generation procedure on Patricia tries in Algorithm 5. It is similar to Algorithm 4, with the only difference being that, instead of comparing one bit at a time, segments of bits (which come from merged single-branch nodes) are compared at each node. In the end, signatures that have a containment relation are stored in the result list instead of queue q. Naturally, we can again reuse Algorithm 1 (by calling PATRICIA_ENUM at line 5) to perform the join. We call this approach Patricia Trie-based Signature Join (PTSJ).

To continue our example, if we run the same query u1 (0111) on Figure 3 using Algorithm 5, we still need to visit the left branch of the trie. Only this time, three instead of six nodes need to be traversed. In practice, signatures can be

much longer and sparse (see Section V-B), therefore more node visits are saved compared to Algorithm 4.

Algorithm 5: PATRICIA_ENUM() subset enumeration using Patricia trie
Input: signature, patricia_trie
Output: tuple IDs that have signature containment relation
 1  create queue q
 2  create list result
 3  enqueue patricia_trie.root on q
 4  while q ≠ ∅ do
 5      node ← dequeue from q
 6      if node.prefix ⊑ signature.prefix then
 7          if node.split = |signature| then
 8              add node to result
 9          else
10              split_bit ← signature[node.split]
11              if split_bit = 0 then
12                  enqueue node.left on q
13              else
14                  enqueue node.left and node.right on q
15  return result

C. Cost analysis of PTSJ

In this section we give some cost estimation of PTSJ under simple conditions. The notation we use is given in Table II. The cost of PTSJ ($C_{\mathrm{PTSJ}}$) can be broken down to

$C_{\mathrm{PTSJ}} = C_{\mathrm{create\_PT}} + C_{\mathrm{query\_PT}} + C_{\mathrm{compare\_set}}$,

where $C_{\mathrm{create\_PT}}$ is the cost to build the Patricia trie on relation S, $C_{\mathrm{query\_PT}}$ is the cost to compare signatures on the trie, and $C_{\mathrm{compare\_set}}$ is the cost to actually perform set comparisons. We first identify that $C_{\mathrm{create\_PT}}$ and $C_{\mathrm{compare\_set}}$ are not the major cost of PTSJ. Then we dig deeper into $C_{\mathrm{query\_PT}}$, giving an estimation of how many integer comparisons it will cost. We find that, under simple natural assumptions, $C_{\mathrm{query\_PT}}$ is mostly influenced by set cardinality and signature length. In the end, based on these analyses, we propose a strategy to choose a good signature length for PTSJ.

1) $C_{\mathrm{create\_PT}}$ and $C_{\mathrm{compare\_set}}$:

$C_{\mathrm{create\_PT}}$: During Patricia trie creation, at most $2|S| - 1$ nodes are created in total. Even in the worst case, b nodes are visited during each signature insertion. Obviously, $C_{\mathrm{create\_PT}}$ does not take the major part of PTSJ's running time.

$C_{\mathrm{compare\_set}}$: Assume that on average N tuples remain for set comparison for each tuple in R. Then $C_{\mathrm{compare\_set}} = N \cdot |R|$. It is easy to see that N decreases when signature length grows, and increases when $|R|$ increases.
In general this is a small value (from 10's to 100's), proportional to the result output size (see below). Therefore $C_{\mathrm{compare\_set}}$ is also not the major cost of PTSJ.

TABLE II: Notation for cost analysis
Notation    Explanation
b           Signature length in bits
c           Average set cardinality
d           Domain cardinality, set element domain
Int         Integer size in bits
|X|         Size of relation X
H           Average height of Patricia trie
N           Number of tuples in S that have the signature containment relation with some tuple in R
V           Number of trie nodes one query has to visit

Estimation of N: To estimate N, we start with a rather simple situation. Consider two signatures d and q, with set cardinalities (and hence number of bits set to 1 in the signature) $c_d$ and $c_q$, resp., and with signature length b. We want to know the probability that $d \sqsubseteq q$. For each element in a set, the probability that it appears on each bit is $1/b$. For $d \sqsubseteq q$ to happen, d should have 1's only on the positions where q has 1's. Each element in d has $c_q$ positions to choose from, so each element has probability $c_q / b$ of landing inside q. In total, the probability is $(c_q / b)^{c_d}$, and $N = |S| \cdot (c_q / b)^{c_d}$.

We next consider a more complicated scenario. For example, if d's set cardinality is uniformly distributed between 1 and $c_d$, then the estimated probability of $d \sqsubseteq q$ would be

$\frac{p^1 + p^2 + \cdots + p^{c_d}}{c_d} \le \frac{p}{c_d (1 - p)}$, where $p = c_q / b$.

In general, N gets smaller when signature length (b) grows. A high set cardinality query ($c_q$) tends to have more results, while low set cardinality data ($c_d$) tend to produce more results. All these intuitions are confirmed by our formula. The main take-home message here is that N is a small value, so that set comparisons do not take a significant part of the overall running time.

2) $C_{\mathrm{query\_PT}}$: Let's assume that the number of trie nodes each tuple in R has to visit is V. Then the number of comparisons to be done on the trie is

$C_{\mathrm{query\_PT}} = |R| \cdot V \cdot \left\lceil \frac{b}{H \cdot \mathit{Int}} \right\rceil$.

Here each node on average compares $b / H$ bits, which costs $\lceil b / (H \cdot \mathit{Int}) \rceil$ actual integer comparisons. We know that $\lceil x / y \rceil \le x / y + 1$, so we get the upper bound

$C_{\mathrm{query\_PT}} \le |R| \cdot V \cdot \left( \frac{b}{H \cdot \mathit{Int}} + 1 \right)$.    (1)

We first examine the integer comparisons. For low cardinality settings, signatures are sparse, so two signatures are more likely to share longer prefixes. In the extreme case, all nodes share one path (skewed trie), therefore the average trie height H can be as high as $b / 2$. So it is rare for a single node to take more than two integer comparisons. For higher cardinalities, the trie tends to be more balanced, and H is a smaller value closer to $\log_2(2|S|)$, but still grows with respect to b. Then we can expect a small but slowly increasing value for comparisons per node. The more important factor however is V.
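To get a feel for the magnitudes involved, the estimate of N and the bound (1) can be evaluated numerically. The sketch below simply transcribes the two formulas; the function names and all input values (relation sizes, cardinalities, signature length, trie height, nodes visited) are illustrative assumptions, not measurements from the paper.

```python
import math


def estimate_n(size_s, c_d, c_q, b):
    """N = |S| * (c_q / b) ** c_d: expected number of S-signatures contained
    in one query signature, assuming independently and uniformly placed bits."""
    return size_s * (c_q / b) ** c_d


def query_cost(size_r, v, b, h, int_bits=32):
    """|R| * V * ceil(b / (H * Int)): integer comparisons spent on the trie."""
    return size_r * v * math.ceil(b / (h * int_bits))


def query_cost_bound(size_r, v, b, h, int_bits=32):
    """Upper bound (1): |R| * V * (b / (H * Int) + 1)."""
    return size_r * v * (b / (h * int_bits) + 1)


# Hypothetical setting: one million tuples in S, data sets of 4 elements,
# queries of 16 elements, 256-bit signatures.
print(estimate_n(1_000_000, c_d=4, c_q=16, b=256))

# Hypothetical trie statistics: 1000 queries, 50 nodes visited per query,
# average height 20.
print(query_cost(1000, v=50, b=256, h=20))
print(query_cost_bound(1000, v=50, b=256, h=20))
```

With these assumed inputs N comes out in the tens, consistent with the range quoted above, and the exact count stays below the bound because the ceiling adds at most one comparison per node.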

Estimation of V: There are $\binom{b}{c}$ possible signatures with c bits set to 1. When set cardinality is small (i.e., when $\binom{b}{c} \le |S|$), it is highly probable that all possible signatures exist in the trie. For example, in the extreme case that set cardinality is 1, there are only $2b$ possible nodes in the trie. Since $2b \ll 2|S|$, the trie is likely to be full. In such cases, V tends to reach the maximum possible, i.e., $2^H$. Here H is approximately $b / 2$. This becomes less obvious when c and b grow to larger values. In such cases, the trie will not contain all possible cases, and the average height usually does not reach $b / 2$.

If we have an all-one signature as the query, all nodes ($2|S|$) will be visited. Therefore $2^H = |S|$ (assuming a balanced trie). If on the lowest level only one branch is included, the number of nodes to visit becomes $2^{H-1} + 2^{H-1}$. Similarly, if single branches happen for the lowest x levels (which yields the largest number of nodes), we get $2^{H-x} + x \cdot 2^{H-x} = (1 + x) \cdot 2^{H-x}$. Furthermore, if we assume the number of single-branch nodes in a result is proportional to the fraction of zeros in a signature ($1 - c/b$), so that $x = (1 - c/b) \cdot H$, then the number of visited nodes is estimated to be

$V = \left( 1 + H \left( 1 - \frac{c}{b} \right) \right) \cdot 2^{\frac{c}{b} H} \le (1 + H) \cdot |S|^{c/b}$.    (2)

Here, we see that with the increase of $|S|$, the number of visited nodes increases. Bigger set cardinality also indicates more visited nodes, while longer signatures reduce the number of visited nodes. As we'll see later, we usually select b between $\frac{c}{2} \cdot \mathit{Int}$ and $c \cdot \mathit{Int}$, so $|S|^{c/b}$ is around 2 even for a million tuples. In such cases we say that V is bounded by $O(H)$. And if we bring formula (2) into formula (1), we get that $C_{\mathrm{query\_PT}}$ is bounded by $O(b \cdot |R|)$.

3) Space complexity of PTSJ: Since building a Patricia trie for some relation S creates only $2|S|$ nodes, and for each tuple the signature size is usually no more than the size of its set value, the space complexity of PTSJ is $O(c \cdot |S|)$.

D. Choosing the signature length for PTSJ

Because there is no need for exhaustive subset generation, in practice, signature length can be set to thousands of bits in PTSJ without any problem.
Generally, longer signatures provide more effective filtering, but bring more signature comparisons and higher main memory consumption. So there is a need to find the balance point for signature length.

First of all, there is an absolute upper bound for signature length, which is the domain cardinality d. Letting $b = d$ essentially makes the signature a bitmap representation of the sets. This number can be achieved in many cases. For example, for a domain that has 1024 elements, the maximum signature length is $1024 / \mathit{Int}$ integers. It is obvious that there is a lower bound for b as well, which is c. If $b < c$, there is a high chance that all bits in a signature are set to 1, which is not useful anymore.

Apart from these two bounds, we find that the optimum signature length depends on many properties of the input relations, such as set cardinality, domain cardinality, relation size, and data distributions. Among these, we notice both from formula (2) and the empirical study (Section V-B) that the set cardinality c has a bigger impact on signature length selection, and usually $\frac{c}{2} \cdot \mathit{Int} \le b \le c \cdot \mathit{Int}$ can yield a good result. This also prevents the algorithm from using more signature comparisons than set comparisons. If not specified otherwise, we use the lower bound of the range ($\frac{c}{2} \cdot \mathit{Int}$).

Finally, we can set a maximum length in the algorithm, to prevent the signature from being extremely long. In our experiments, this limit is set to 256 integers. Overall, our signature length is set to the minimum of $\{d,\ \frac{c}{2} \cdot \mathit{Int},\ 256 \cdot \mathit{Int}\}$. An empirical validation for this strategy is presented in Section V-B.

E. Extensions to PTSJ

1) Merge identical sets: With the help of the trie, tuples of the same signature are naturally grouped together. If we go one step further, maintaining a mapping list of tuples that have the same set elements and taking them into consideration during output, we save the cost of comparing duplicates over time. This strategy is applied in our PTSJ implementation.
It works well without introducing noticeable overhead while creating the trie, and saves quite some comparisons while performing joins, especially for real-world datasets.

2) Superset and set-equality joins: While our algorithms are designed for $R \bowtie_{\supseteq} S$, they can be easily modified to perform $R \bowtie_{\subseteq} S$, in case we want to reuse the existing index on S. Here we take Algorithm 4 as an example to illustrate; Algorithm 5 can be changed in a similar manner. The only place that needs to be touched is the if-else statement (lines 7 to 10). The two case-handling statements should be switched, as given in Algorithm 6. Furthermore, in Algorithm 1 the set value containment check (line 7) will change accordingly, to "if r.set ⊆ s.set".

Algorithm 6: Replace Algorithm 4 lines 7 to 10 for superset join
 7  if current_bit = 0 then
 8      enqueue node.left and node.right on q
 9  else
10      enqueue node.left on q

Set-equality joins ($R \bowtie_{=} S$) can be answered efficiently as well. In this case, a simple search on the trie will return a list of tuples with the same signature. Further set comparisons are needed to validate the search results. Since we already merge tuples with the same set values, as discussed above in Section III-E1, many set comparisons are saved.

3) Set similarity joins: Apart from being used for set containment computations, a Patricia trie can be (re)used to answer set similarity join [22] queries as well. Set similarity join has been well-studied in the literature [23]. Solutions that make use of a trie have been proposed as well (e.g., [24], [25]), but these do not operate on (and cannot be easily adapted to) the signature space as PTSJ does. For instance, given query signature q, we want to find signatures within Hamming distance k. We can use Algorithm 7 to achieve this

goal, where we extend Algorithm 4 for illustration purposes. In particular, we use a counter to remember the Hamming distance between some prefix and our query. In the end, all signatures (and therefore tuples) that are within the distance are in the queue, waiting for other operations (validation, output) to take action. Systems such as OLAP can benefit greatly by reusing one index for different purposes.

Algorithm 7: TRIE_SSJ() Hamming distance set similarity join using trie
Input: signature, trie, threshold k
Output: tuple IDs that have similar signature within Hamming distance k
 1  create queue q
 2  i ← 0
 3  current_bit ← signature[i++]
 4  enqueue (trie.root, 0) on q
 5  while q.top has children do
 6      (node, i) ← dequeue from q
 7      if i ≤ k then
 8          if current_bit = 0 then
 9              enqueue (node.left, i) on q
10              enqueue (node.right, i+1) on q
11          else
12              enqueue (node.left, i+1) on q
13              enqueue (node.right, i) on q
14      current_bit ← signature[i++]
15  return q

4) Disk-based algorithm: PTSJ can be easily extended to an external memory setting. A straightforward implementation is to perform a nested-loop join over partitions of the data. Here we partition both relations until one pair of partitions can fit into main memory. Then, for each pair of partitions from both relations, we load them into main memory and perform the join. In this case, the algorithm will have a quadratic behavior with respect to the number of partitions. Similar techniques have been applied to other algorithms such as PRETTI. However, as we discussed, PTSJ has a much smaller memory footprint than PRETTI, which makes it more suitable for this strategy. Smarter partitioning techniques (e.g., [11], [12]) can be integrated into PTSJ as well.

F. Discussion

SHJ can be viewed as a one-level, multi-way trie, where each branch starts with a different prefix. PTSJ, on the other hand, is a multi-level, binary trie. The main benefits of PTSJ over SHJ come from longer signatures, which can filter out more unnecessary set comparisons.
Furthermore, the trie structure guarantees that only interesting subset prefixes are visited, instead of the whole exponential space. PRETTI, on the other hand, does make use of a trie structure, but it operates on the set element space instead of the signature space. The benefit is that its results do not need to be validated twice. The downside, however, is that the trie height is as high as the set cardinality, making it only suitable for low set cardinality settings. This brings us to an advanced version of PRETTI, using a Patricia trie.

IV. PRETTI+

Since the Patricia trie is so useful for PTSJ, it is natural to ask if this data structure can be used to advantage with PRETTI. We have integrated a Patricia trie with PRETTI, calling this new join algorithm PRETTI+. Modifications have to be made both to trie construction and to the join procedure. Inserting sets into the trie can be a bit trickier than with PTSJ, since sets are not necessarily of the same size. In Algorithm 8, we show the trie construction function for PRETTI+. Here we assume each node maintains a prefix, a set of related tuples, and a set of children nodes. The main idea is that, depending on the common prefix between a trie node and the newly arrived set (tuple), the new tuple may be inserted at different positions with respect to the given node. Specifically, the tuple may be inserted into (1) the current root, or (2) some subtree of the current root, or (3) a newly created node that becomes a parent of the current root, or (4) a newly created node that is a sibling of the current root. The core of Algorithm 8 then is to find the correct insertion position.

Fig. 4: Trie example for PRETTI+, after inserting sets from user preferences (Table I)

Algorithm 8: PRETTI+ INSERT() trie construction for PRETTI+
Input: subtree root node, tuple s, cursor on s.set: from
Output: root for the subtree
    // insert s.set[from:] to subtree node, here we treat s.set as a string
    len ← length of the common prefix of node.prefix and s.set[from:]
    nlen ← length of node.prefix
    tlen ← length of s.set[from:]
    if len = nlen then
        if len < tlen then
            c ← some child of node that matches s.set[(from+len):]
            call PRETTI+ INSERT(c, s, from + len)
        else // len = tlen
            put s into node
        return node
    else // len < nlen
        if len = tlen then
            create new node for s, insert new node between node and its parent
        else
            create new node as parent for node and tuple
        return new node

The join operation is almost the same as for PRETTI, except that lists of tuples from the inverted index have to be joined several times in each node, since each node holds several set elements. By replacing a standard trie with a Patricia trie, PRETTI+ consumes much less main memory than PRETTI. However, set comparisons and tuple list joins still take place, the same as in PRETTI. As we will see in our empirical study, PRETTI+ is always a better choice than PRETTI.

V. EMPIRICAL STUDY

In this section we empirically compare the performance of SHJ, PRETTI, PTSJ, and PRETTI+. We first introduce the experiment settings. Then we validate the signature length selection strategy discussed above in Section III-D. After that we conduct the main comparison of the four algorithms on a variety of synthetic and real-world datasets.

A. Experiment setting

1) Synthetic datasets: We create a data generator to generate synthetic relations. The generator can generate relations with varying sizes, set cardinalities, domain cardinalities, and so on. The distribution of data can vary on both set cardinality and elements. The distributions are generated using Apache Commons Math¹, a robust mathematics and statistics package. We start with a simple setting, with uniform distribution on different set cardinalities and set elements. Later we test the algorithms' performance on relations with Zipf and Poisson distributions, which are commonly found in real-world scenarios.

2) Real-world datasets: We experiment with four representative real-world datasets, covering the scenarios of low, medium and high set cardinalities. Some statistics of the datasets² are shown in Table III.

TABLE III: Statistics for real-world datasets. Columns: data, |R|, avg. c, median c, d. Rows: flickr, orkut, twitter, webbase.

Flickr-3.5M (flickr): The flickr dataset³ associates photos with tags [26]. Naturally, here we treat tags as sets, to perform a set-containment join on photo ids. In this way, we create the containment relation between photos. Further operations such as recommendation can be investigated upon such relations. This is a low set-cardinality scenario.
Orkut community (orkut): The Orkut dataset⁴ contains relations of people from an online social network and the communities they belong to [27]. Here we treat each person as a tuple and the communities they belong to as a set. Set-containment join, in this case, can help people discover new communities and new friends with similar choices. Set cardinality for this dataset is higher than for Flickr, and we further keep tuples with c ≥ 10 to exhibit a low-to-medium set cardinality scenario.

³ Can be downloaded at xirong/index.php?n=dataset.flikr3m

Twitter k-bisimulation (twitter): We derive this dataset from paper [28]. Bisimulation is a method to partition the nodes in a graph, based on the neighborhood information of nodes. In this dataset, tuples are the partitions of the graph, and sets are the encoded neighborhood information each partition represents. Here we define the neighborhood of each node to be within 5 steps from the node. On such a dataset, set-containment join could be used for graph similarity detection and graph query answering. For this dataset, we select tuples with c ≥ 30, to exhibit a medium set-cardinality scenario.

WebBase Outlinks-200 (webbase): This dataset is a web graph from the Stanford WebBase project [29]. We extract the data⁵ using tools from the WebGraph project [30]. We only keep pages that have more than 200 outlinks, following Melnik et al. [12], to exhibit a high set-cardinality scenario.

3) Implementation details: We implement all algorithms in Java. The signature length of SHJ is set to the optimum according to paper [8]. The signature length of PTSJ is set as suggested in Section III-D. For PRETTI and PRETTI+, we maintain a hash map in each trie node to enable fast access to children while traversing. This is costly but necessary for the algorithm to reach its best performance. Note that here we tried various efficient implementations of hash map (e.g., Fastutil⁶, CompactCollections⁷, Trove⁸), and we find that the HashMap implementation from JDK 7 itself has both the best performance and the lowest main memory consumption. The open-source code of all implemented algorithms is available online⁹.

4) Test environment: All experiments are executed on a single machine (Intel Xeon 2.27 GHz processor, 12GB main memory, Fedora Linux, JDK 7). The JVM maximum heap size is set to 5GB, which we think is a decent setting even for today's computers. In the experiments we run each algorithm ten times, and record the average, standard deviation and median of running times. We observe in our measurements that the average gives a good estimate of the running time, and the standard deviation is not significant when compared with the overall time. Hence in the following we only show the average running time. We tend to test with bigger relations when possible, since larger relations and longer running times eliminate the random behavior introduced by OS scheduling. We run programs with the taskset command, to restrict the execution to one CPU core.

The running times we later present include the time to build indexes (e.g., the hash map for SHJ and the trie structures for the other algorithms). We notice a trend that, with the increase of set cardinality, the percentage of index build time over running time decreases. This is due to the fact that bigger set cardinality leads to more set element comparisons, which take a larger portion of running time accordingly. But in general, the index build times of SHJ and PTSJ are less than 1% and 5% of the overall running time, respectively; PRETTI and PRETTI+, on the other hand, take more than 70% and 20% of the running time to build indexes.
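The measurement protocol described above (repeat each run, then report average, standard deviation and median) can be sketched in Java as follows. This is a minimal illustration rather than the authors' harness: the class name RunStats, the helper timeRuns, the stand-in workload, and the use of the sample standard deviation (dividing by n - 1) are all assumptions.

```java
import java.util.Arrays;

/** Sketch: summarising repeated timing runs (sample standard deviation is an assumption). */
final class RunStats {

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1); requires at least two runs. */
    static double stdDev(double[] xs) {
        double m = mean(xs), sq = 0;
        for (double x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / (xs.length - 1));
    }

    static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Runs the task the given number of times; returns each run's wall-clock time in ms. */
    static double[] timeRuns(Runnable task, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Hypothetical workload standing in for one join execution.
        Runnable join = () -> {
            long acc = 0;
            for (int i = 0; i < 5_000_000; i++) acc += i % 7;
            if (acc < 0) System.out.println(acc); // keeps the loop from being optimised away
        };
        double[] t = timeRuns(join, 10);
        System.out.printf("avg=%.2f ms, sd=%.2f ms, median=%.2f ms%n",
                mean(t), stdDev(t), median(t));
    }
}
```

Pinning the JVM to one core, as done with taskset in the experiments, happens outside the program and is not shown here.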

Fig. 5: Performance of PTSJ with different signature length settings. Panels: (a) Impact of domain cardinality setting (D = 2^10, 2^11, 2^12, 2^13, 2^14); (b) Impact of set cardinality setting (c = 2^2, 2^4, 2^6, 2^8, 2^10); (c) Impact of relation size (|R| = 2^15, 2^16, 2^17, 2^18, 2^19).

For PRETTI and PRETTI+, certain datasets are too big to run in the given memory. In such cases we switch the algorithms to the nested-loop on-disk versions. We notice that PRETTI and PRETTI+ may gain some efficiency by this approach, since the in-memory trie of a partition can be shallower than the global trie. This is more noticeable for high set cardinality scenarios. Overall, when switched to disk-based versions, the differences in behavior of PRETTI and PRETTI+ are insignificant, since the algorithms' running times are dominated by computations instead of disk I/Os.

B. The optimal signature length of PTSJ

As we discussed, the signature length has a huge impact on PTSJ's performance, sometimes an order of magnitude difference. In Section III-D, we gave some suggestions on how to choose the signature length. In this section, we want to empirically validate these suggestions. Given a dataset, there are three main properties: the relation size, the set cardinality, and the domain cardinality. We want to know how these properties affect the behavior of PTSJ. The strategy of this investigation is to change one property while keeping the other two fixed. By examining the performance under different signature lengths, we can then clearly see whether there is a correlation between a certain property and signature length. Table IV summarizes the settings for this investigation.

TABLE IV: Dataset configurations
fixed parameters            changing parameter
|R| = 2^17, c = 2^4         d ∈ {2^10, 2^11, 2^12, 2^13, 2^14}
|R| = 2^17, d = 2^14        c ∈ {2^2, 2^4, 2^6, 2^8, 2^10}
c = 2^4, d = 2^14           |R| ∈ {2^15, 2^16, 2^17, 2^18, 2^19}

Figure 5 shows the performance results of PTSJ, where the x-axis is the ratio between signature length and set cardinality.
The strategy given in Section III-D suggests that a ratio between 16 and 32 is sufficient. In Figure 5a, we see that indeed, a ratio between 16 and 32 gives the best performance. Domain cardinality does not have a big impact on the signature selection. In Figure 5b we show how the algorithm performs under different set cardinality settings. Again PTSJ finds its best performance point between 16 and 32. We notice that for some high cardinality settings (c = 2^8, 2^10), comparing the signatures themselves becomes an expensive operation. In these cases shorter signatures are preferred in general. Figure 5c shows the impact of relation size on signature length selection. We see a slow trend that, when relations grow in size, the optimal signature length tends to move to larger values. This is indicated by formula 2, where |R| is part of the factor. But as we observe, a ratio between 16 and 32 can already give a good result. Overall, these experiments support our signature selection strategy of Section III-D. A signature length between 16 and 32 times the set cardinality is usually a good selection.

C. Comparison of algorithms

In this section we discuss the experimental results of the four algorithms on various synthetic datasets. We test on different settings to show the scalability of all algorithms. Figure 6 shows experiments on uniformly distributed datasets. Figure 7 further shows performance on Poisson and Zipf distributions. The dataset configuration is the same as in Table IV.

1) Space efficiency for different algorithms: Main-memory consumption is an essential factor for evaluating main-memory algorithms. Low main-memory consumption indicates better scalability of the algorithm with respect to larger datasets. It is not difficult to get a rough estimate of memory consumption for the algorithms mentioned in this paper. The main differences come from the different data structures (indexes) each algorithm uses. For instance, for SHJ, a hash table has to be built; for PRETTI and PRETTI+, a prefix tree and an inverted index; for PTSJ, a Patricia trie.
In general, two factors influence memory consumption: (1) relation size |R| and (2) set cardinality c. The influence of relation size is obvious: the number of hash table entries grows linearly with relation size, and so does the size of the prefix tree and inverted index, and the Patricia trie. Set cardinality, on the other hand, has a larger impact on PRETTI and PRETTI+, while SHJ and PTSJ are not so sensitive to it.

Fig. 6: Comparison of different algorithms (SHJ, PRETTI, PTSJ, PRETTI+) for uniformly distributed data. Panels: (a) Memory consumption (y-axis: memory per tuple in bytes); (b) Scalability w.r.t. domain cardinality; (c) Scalability w.r.t. set cardinality; (d) Scalability w.r.t. relation size (c = 2^4); (e) Scalability w.r.t. relation size (c = 2^6); (f) Scalability w.r.t. relation size (c = 2^8).

We can clearly see this via our experiments. In Figure 6a, we plot, for each join algorithm and for different set cardinality settings, the main memory consumption per tuple. Here we note that, though the experiment runs with 2^17 tuples, the result stays the same for much larger relations. This means that we can estimate how much memory we need, given information about relation size and set cardinality. We see that the memory consumption basically has a linear relationship with set cardinality. SHJ, PTSJ and PRETTI+ vary by a constant factor, which is basically the cost of longer signatures (PTSJ), the Patricia trie (PTSJ and PRETTI+) and the inverted index (PRETTI+). PRETTI, on the other hand, needs around ten times more main memory than the others. For a relation with set cardinality 2^6, it needs more than 10KB per tuple, which means 10GB for just one million tuples. This empirically substantiates our remarks on PRETTI.

2) Scalability with different domain cardinality settings: Figure 6b depicts performance with different domain cardinality settings. We see that the signature-based solutions (SHJ and PTSJ) are not sensitive to changes in domain cardinality, since they operate on the signature space instead of on the set element space. PRETTI and PRETTI+, on the other hand, operate directly on the set element space. Larger domain cardinality indicates more entries in the inverted index, and shorter inverted lists (and therefore faster merge joins on the lists). So PRETTI and PRETTI+ perform better when domain cardinality is high.
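To see why shorter inverted lists help, consider a minimal Java sketch of a merge join (intersection) of two inverted lists. It assumes each list is an ascending array of tuple IDs; the class name InvertedListJoin and this array layout are illustrative assumptions, not the paper's implementation. The loop advances one cursor per comparison, so its cost is bounded by the sum of the two list lengths.

```java
import java.util.Arrays;

/** Sketch: merge-intersection of two inverted lists stored as ascending tuple-ID arrays. */
final class InvertedListJoin {

    /** Returns the tuple IDs present in both lists, in ascending order. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                 // a[i] cannot occur in the rest of b
            } else if (a[i] > b[j]) {
                j++;                 // b[j] cannot occur in the rest of a
            } else {
                out[n++] = a[i];     // common tuple ID
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] listForElementX = {1, 3, 5, 7, 9};   // hypothetical inverted list
        int[] listForElementY = {3, 4, 5, 9, 10};  // hypothetical inverted list
        System.out.println(Arrays.toString(intersect(listForElementX, listForElementY)));
        // prints [3, 5, 9]
    }
}
```

With a fixed relation size and set cardinality, a larger domain spreads the same number of postings over more lists, so each intersection of this kind touches fewer IDs.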
3) Scalability with different set cardinality settings: In order to determine the scalability of the algorithms with respect to set cardinality, we set the relation size to 2^17, with average set cardinality varying from 2^2 to 2^10. The very high set cardinality scenarios (2^10) are not uncommon, especially in the context of graph analytics. We will see more data of this kind in the experiments with real data. In Figure 6c, we see that PRETTI and PRETTI+ are both more sensitive to set cardinalities, compared to the signature-based solutions. When set cardinality is lower (below 2^5), PRETTI+ is a better choice than the other alternatives; but beyond that point, PTSJ is a better choice. In each case, one of our new algorithms achieves nearly an order of magnitude performance gain over the best of SHJ and PRETTI.

4) Scalability with different relation sizes: Algorithm scalability with respect to relation size may be the most important factor in practice. In Figures 6d to 6f, we show performance for different set cardinality scenarios (c = 2^4, 2^6, 2^8). Just as we saw earlier, for low cardinality settings (Figure 6d), PRETTI+ is a clear winner, followed by PTSJ, PRETTI and SHJ. When set cardinality grows, the advantages of signature-based solutions start to show. PTSJ becomes a better choice than the others. The difference becomes more significant with larger relation sizes. In Figure 6f we see that in many cases in-memory PRETTI (and PRETTI+) cannot finish the experiments, so we switch the algorithm to a disk-based nested-loop version.

5) Poisson distribution and Zipf distribution: Here we want to determine if different distributions on the set cardinality and set elements have an impact on performance. We test datasets (|R| = 2^17) with two distributions: Poisson distribution and Zipf distribution, which are widely found in real-world datasets. Distributions are applied to either set cardinality or set elements. We expect that the distribution on set cardinality will have a greater impact, as shown previously. Unless specified otherwise, the x-axis shows the average set cardinality.

Fig. 7: Comparison of different algorithms (SHJ, PRETTI, PTSJ, PRETTI+) for skewed distributions. Panels: (a) Scalability w.r.t. set cardinality, with Poisson distribution on set cardinality; (b) Scalability w.r.t. set cardinality, with Poisson distribution on set element; (c) Scalability w.r.t. set cardinality, with Zipf distribution on set cardinality (x-axis: max c); (d) Scalability w.r.t. set cardinality, with Zipf distribution on set element.

In Figure 7a we show datasets with Poisson distribution on set cardinalities. This setting is bad news for PRETTI and PRETTI+, because then the set cardinality can be potentially large. We see that indeed, even when c = 2^3, PRETTI and PRETTI+ are not competitive with PTSJ. Indeed, PTSJ performs the best in all cases.

Figure 7b shows Poisson distribution on set elements. This distribution does not make a significant difference for any of the algorithms, which behave as in Figure 6.

Zipf distribution on set cardinality favors PRETTI and PRETTI+. As in Figure 7c, we see that PRETTI+ becomes the best solution in all settings. Note that in this case the x-axis is the maximum set cardinality instead of the average. Since c follows a Zipf distribution, many sets have small c and only a few have larger ones. In fact, the median set cardinality for the dataset with max c = 2^9 is only 17. This explains why PRETTI+ performs so well.

Zipf distribution on set elements, as in Figure 7d, does not have a huge impact on performance differences. PRETTI and PRETTI+ perform slightly better than under uniform distribution, since they can produce results earlier due to the nature of the Zipf distribution (frequent elements are placed near the trie root).

Overall, our observation is that the distribution on set cardinality has a large impact on performance.
In such cases, we need to examine not only the average set cardinality, but also the median set cardinality of the data, to choose the right algorithm. Nonetheless, either PTSJ or PRETTI+ will be the best choice, with sometimes a 10-fold speedup compared with the current state-of-the-art.

D. Experiments on real-world datasets

Figure 8 summarizes performance on various real-world datasets, where we plot the ratio of a certain algorithm's running time over that of the best algorithm for that dataset. We see that the performance can vary by an order of magnitude for many algorithms. In low-to-medium set cardinality settings (flickr, orkut), PRETTI+ is the clear winner, where signature-based methods, even PTSJ, are at least three times slower. SHJ in these two cases runs longer than a day. When it comes to medium-to-high set cardinality settings (twitter), however, the benefit of signatures starts to appear: PTSJ can make the computation 3.6 times faster than the second best (SHJ). For webbase, PTSJ again is at least 8 times faster than the state-of-the-art, 2.6 times faster than PRETTI.

Fig. 8: Algorithm performance comparison for different real-world datasets (flickr, orkut, twitter, webbase; series: SHJ, PRETTI, PTSJ, PRETTI+)

VI. CONCLUSION AND FUTURE WORK

Motivated by recent hardware trends and practical applications from graph analytics, query processing, OLAP systems, and data mining tasks, in this paper we proposed and studied two efficient and scalable set-containment join algorithms: PTSJ and PRETTI+. The latter is suitable for low set cardinality, high domain cardinality settings, while the former is a more common algorithm suitable for the other scenarios. As shown in the experiments, these two new algorithms can be in many cases remarkably faster than the existing state-of-the-art, and scale gracefully with set cardinality, domain cardinality,
