Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches


Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri
Microsoft Research
{aznazi, bolind, viveknar,

ABSTRACT
Efficiently estimating the inclusion coefficient (the fraction of values of one column that are contained in another column) is useful for tasks such as data profiling and foreign-key detection. We present a new estimator, BML, for the inclusion coefficient based on HyperLogLog sketches that results in significantly lower error compared to the state-of-the-art approach that uses Bottom-k sketches. We evaluate the error of the BML estimator using experiments on industry benchmarks such as TPC-H and TPC-DS, and several real-world databases. As an independent contribution, we show how HyperLogLog sketches can be maintained incrementally with data deletions using only a constant amount of additional memory.

PVLDB Reference Format:
Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri. Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches. PVLDB, (): 97-9, 2018.
DOI:

1. INTRODUCTION
The discovery of all inclusion dependencies in a dataset is an important part of data profiling efforts. It is useful for tasks such as foreign-key detection and data integration [7, 28, 4, 22, 2]. However, due to data quality issues such as missing values or multiple representations of the same value, it becomes important to relax the requirement of exact containment. Thus, computing the fraction of values of one column that are contained in another column (the inclusion coefficient) is of interest to these applications. When the database schema and data sizes are large, computing the inclusion coefficient for many pairs of columns in the database can be both computationally expensive and memory intensive. One approach for addressing this challenge is to estimate the inclusion coefficient using only bounded-memory sketches of the data.
Given a fixed budget of memory per column, these techniques scan the data once and compute a data sketch that fits within the memory budget. For a given pair of columns X and Y, the inclusion coefficient $\Phi(X, Y)$ is then estimated using only the sketches on columns X and Y. One such approach, adopted in Zhang et al. [3], is to build a Bottom-k sketch [2] on each column and develop an inclusion coefficient estimator using these sketches. Intuitively, Bottom-k sketches have high accuracy for inclusion coefficient estimation when the number of distinct values in both X and Y is smaller than k, since the sketches effectively behave like hash tables of the respective columns. However, as we show empirically in this paper, one limitation of using Bottom-k sketches for estimating the inclusion coefficient is that, for a given memory budget (of k values), the estimation error becomes large as the cardinality (i.e., the number of distinct values) of column X or Y increases beyond k. Additionally, Bottom-k sketches are not amenable to incremental maintenance in situations where data is deleted. For instance, in data warehousing scenarios, it is not uncommon for recent data to be added and older data to be removed from the database. Developing an estimator with low error for the inclusion coefficient using bounded-memory sketches is challenging.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil. Proceedings of the VLDB Endowment. Copyright 2018 VLDB Endowment.
We first establish a hardness result showing that any estimator that relies only on sketches with bounded memory must incur unbounded error in certain cases. Intuitively, the difficult cases are when column X has small cardinality and column Y has large cardinality, or vice versa. Furthermore, as we report in our empirical evaluation, these difficult cases appear to be quite common in the real-world databases that we have studied.

We develop a new estimator for the inclusion coefficient, BML (Binomial Mean Lookup), based on HyperLogLog sketches [4]. HyperLogLog (HLL) sketches can be computed efficiently and within a given memory budget, requiring invocation of only a single hash function for each value in the column. HLL sketches are becoming increasingly popular for estimating cardinality in different applications, and are even being adopted in commercial database engines, e.g., [6] uses HyperLogLog to support approximate distinct count functionality.

The main idea behind the BML estimator is a theoretical result that establishes a mapping from the inclusion coefficient $\Phi(X, Y)$ to the probability that the value of a bucket in the HLL sketch of column Y is at least the value of the corresponding bucket in the HLL sketch of column X. BML is based on the maximum likelihood estimation (MLE) method. It observes the number of buckets in the HLL sketch of column Y whose HLL value is at least the HLL value of the corresponding bucket in column X, and it returns the value of $\Phi(X, Y)$ that maximizes the likelihood of this observation.

As a by-product of our estimation technique, we are also able to provide a bound on the error. This error bound is data-dependent, i.e., it is specific to the pair of columns X and Y. Such an error bound can be valuable to applications that consume the estimates, and is not provided by prior techniques.

HyperLogLog sketches can be maintained incrementally in the presence of insertions.
An independent contribution of this paper is a technique for incrementally maintaining a HyperLogLog sketch in the presence of data deletions with a constant memory overhead.

The key idea is that each bucket in an HLL sketch always holds an integer value less than a constant $\ell$, where $\ell$ is the number of bits of the hash value. By maintaining a max-heap of constant size (at most $\ell$) for each bucket, we are able to support incremental deletion.

We show through experiments on real-world databases and the industry benchmark TPC-H and TPC-DS databases that BML has significantly lower overall error than the approach using Bottom-k sketches [3], particularly in cases where at least one of the columns has large cardinality. For cases where both columns have small cardinality, the accuracy of both estimators is similar. For instance, in two real-world databases where there are many columns with small and large cardinality, the average error using Bottom-k sketches is .3 and .59 respectively, whereas the corresponding errors for BML are only . and .4. For the other two real-world databases, where most columns have small cardinality, the average error using Bottom-k sketches is .5 and .2 respectively, whereas the corresponding errors for BML are .6 and .4.

Finally, we consider one important application of inclusion coefficients, namely the problem of foreign-key (FK) detection in a database. All prior work on FK detection [29, 3, ] relies on exact inclusion coefficients to prune the FK candidates. We show empirically on several real-world and benchmark databases that the estimation error of BML is acceptable for these FK detection techniques. In other words, replacing the exact inclusion coefficient with an estimate obtained via the BML estimator has no noticeable impact on the precision and recall of these FK detection algorithms.

In summary, this paper makes the following contributions:
• We establish a hardness result for inclusion coefficient estimation using bounded-memory sketches.
• We develop an MLE-based estimation algorithm called BML (Binomial Mean Lookup) for inclusion coefficient estimation based on HyperLogLog sketches [4]. We prove the correctness of our algorithm and the error bound of our estimator.
• We show how, with a constant memory overhead, HyperLogLog sketches can be extended to support incremental deletion.
• We implement our inclusion coefficient estimator (BML) on Cosmos [], a distributed Big Data engine that is extensively used within Microsoft for analyzing large data sets, and evaluate the effectiveness of BML using several synthetic and real-world datasets.
• We measure the precision and recall of two existing foreign-key detection techniques when using inclusion coefficient estimates rather than the exact inclusion coefficients.

The rest of the paper is organized as follows. We present a hardness result for inclusion coefficient estimation using sketches and introduce background on HyperLogLog sketch construction in §2. We describe the BML estimator for the inclusion coefficient and its error analysis in §3. The extension to support incremental deletion is presented in §4. We describe the results of our experimental evaluation of inclusion coefficient estimation and its application to FK detection in §5. We discuss related work in §6 and conclude in §7.

2. PROBLEM DEFINITION AND BACKGROUND
Let database $\mathcal{D}$ be the collection of tables $\mathcal{T}$, where the set of all columns in tables $\mathcal{T}$ is $\mathcal{C}$. Let n be the total number of columns ($|\mathcal{C}| = n$). For each column $X \in \mathcal{C}$, the set of all its possible values is called the domain of X, denoted by dom(X). We use X[i] as the value of column X for tuple i. We formally define the inclusion coefficient estimation problem and discuss the hardness result.

2.1 Inclusion Coefficient Estimation and Hardness Result
The inclusion coefficient is defined between two sets of values, and is used to measure the fraction of values of one set that are contained in the other set.

DEFINITION 1. (Inclusion Coefficient) Given two columns/sets X and Y, the inclusion coefficient of X and Y is defined as:

$$\Phi(X, Y) = \frac{|X \cap Y|}{|X|}, \quad |X| \neq 0 \qquad (1)$$

where $|\cdot|$ represents the number of distinct values in a set. If X is fully covered by Y ($X \subseteq Y$), $\Phi(X, Y) = 1$; otherwise $0 \le \Phi(X, Y) < 1$.
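As a point of reference for Definition 1, the exact inclusion coefficient can be computed directly when both columns fit in memory. The following is a minimal sketch only; the function name `inclusion_coefficient` is hypothetical and is not part of the paper.

```python
def inclusion_coefficient(x, y):
    """Exact inclusion coefficient of Definition 1: |X ∩ Y| / |X| over distinct values."""
    xs, ys = set(x), set(y)          # |.| counts distinct values only
    if not xs:
        raise ValueError("the inclusion coefficient is undefined when X is empty")
    return len(xs & ys) / len(xs)


# The measure is asymmetric: swapping the arguments changes the denominator.
phi_xy = inclusion_coefficient([1, 2, 3, 4], [3, 4, 5])   # 2 of 4 values of X are in Y
phi_yx = inclusion_coefficient([3, 4, 5], [1, 2, 3, 4])   # 2 of 3 values of Y are in X
```

This exact computation needs memory proportional to the number of distinct values, which is the cost that sketch-based estimation tries to avoid.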
Note that $\Phi(X, Y)$ is asymmetric: in general, $\Phi(X, Y)$ is not equal to $\Phi(Y, X)$.

In tasks such as foreign-key detection and data profiling, we need to calculate inclusion coefficients for many pairs of columns, which could be too expensive (in terms of both time and memory) for large datasets. As a trade-off between accuracy and performance, the idea is to construct a sketch (a compact data structure) for each column $C \in \mathcal{C}$ by scanning the data once, and to estimate the inclusion coefficient between any pair of columns using their sketches. Such sketching and estimation techniques are useful in two scenarios: i) columns are too large to fit into memory; and ii) we want to compute inclusion coefficients for many pairs of columns (e.g., there are n columns and inclusion coefficients need to be calculated for all the $n(n-1)/2$ pairs).

PROBLEM 1. (Estimating Inclusion Coefficient using sketches) For each column $C \in \mathcal{C}$, we construct a sketch $S_C$ by scanning C once. Then for any two columns X and Y, we want to derive an estimator $\hat{\Phi}(S_X, S_Y)$ of $\Phi(X, Y)$ by accessing only the two sketches.

In the rest of the paper, we write $\hat{\Phi}(S_X, S_Y)$ as $\hat{\Phi}$ if X and Y are clear from the context. Table 1 shows frequently used notations.

2.1.1 Estimation and Sketch Size
Suppose we estimate the inclusion coefficient $\Phi(X, Y)$ of two columns X and Y as $\hat{\Phi}$, using their sketches. The estimation error $|\Phi(X, Y) - \hat{\Phi}|$ ranges from 0 to 1. Considering the randomness in the sketch construction, ideally, we would like the estimation error to be bounded with high probability, i.e., $|\Phi(X, Y) - \hat{\Phi}| \le \epsilon$ with probability at least $1 - \delta$, for any given two columns X and Y. Unfortunately, we can show that, unless the sketch size is linear in the number of distinct values (which could be equal to the number of rows), there is no sketch based on which the worst-case estimation error can be bounded with $\epsilon < 1$ and $\delta < 1$.

A lower bound of sketch size: The hardness can be observed even in a very simple case when X = {x} contains only one element, and Y is large.
In this case, $\Phi(X, Y)$ takes value either 0 (if $x \notin Y$) or 1 (otherwise). Therefore, to bound the estimation error below any constant less than 1, from the sketches of Y, we need to distinguish two cases, i) $x \notin Y$ or ii) $x \in Y$, with high probability. This is exactly the approximate membership problem [9]. More formally, we can prove the following hardness result.

THEOREM 1. We pre-compute a sketch for each column in a database to estimate the inclusion coefficient for pairs of columns. If we require that, for any two given columns X and $Y \in \mathcal{C}$, $|\Phi(X, Y) - \hat{\Phi}| \le \epsilon < 1$ with probability at least $1 - \delta > 0$, then any sketch must use space at least $\Omega(n_C \log(1/\delta))$ bits per column $C \in \mathcal{C}$, where $n_C$ is the number of distinct values in C.

PROOF. From the above discussion, we can reduce the approximate membership problem to our inclusion coefficient estimation problem: for each column Y of size n, we want to construct a sketch to support membership queries, i.e., check whether $x \in Y$ for any given x, with a false positive rate at most $\delta$. Suppose we want to estimate the inclusion coefficient $\Phi(X, Y)$ using sketches of X and Y. Consider the case when X = {x}. Indeed, $\Phi(X, Y) = 1$ if $x \in Y$, or 0 if $x \notin Y$. Therefore, to ensure that the estimation error is less than 1, the membership query $x \in Y$ must return a false positive with probability less than $\delta$. [9] shows that, for this purpose, any sketch must use space at least $\Omega(n \log(1/\delta))$ bits. So the proof is completed.

Remark: [25] gives an even tighter lower bound of sketch sizes when the number of distinct values in each column is unknown.

2.2 HLL Sketch Construction
As shown in Theorem 1, the worst-case error cannot be bounded based on sublinear-size sketches. However, the hope is that, for particular instances of X and Y, the estimation errors could still be much better than the worst-case error. The HyperLogLog (HLL) sketch [4, 5] provides a near-optimal way to estimate cardinality (i.e., the number of distinct values in a set). In §3, we will show how to use HLL sketches to estimate inclusion coefficients. We first review how to construct the HLL sketch of a column.

Let $h : dom(X) \rightarrow \{0, 1\}^{\ell}$ be a hash function that returns $\ell$ bits for each value $X[i] \in dom(X)$. For each hash value $s_i = h(X[i])$, we find the position of the leftmost 1, denoted by $\rho(s_i)$; the maximum of $\{\rho(s_i) : s_i = h(X[i])\}$ is the HLL sketch of the column X, which is used for cardinality estimation [4, 5]. One way to reduce the variance of the estimation is to use multiple hash functions [5]; however, [5] proposed stochastic averaging, which emulates the effect of using different hash functions while employing only a single one. It uses one hash function but takes the first m bits to bucketize hash values, mimicking $2^m$ hash functions and reducing the variance. The remaining $\ell - m$ bits of each hash value $s_i = h(X[i])$ are then used to find the position of the leftmost 1 as $\rho(s_i)$.
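The construction just described can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's Algorithm 3: the hash function (a truncated SHA-1 from Python's `hashlib`), the names `hash_bits`, `rho` and `build_hll`, the zero-based bucket indices, and the handling of an all-zero remainder (capped at $\ell - m$) are all choices made here for the example.

```python
import hashlib


def hash_bits(value, ell=32):
    """Assumed hash h: map a value to an ell-bit integer (ell <= 64) via SHA-1."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - ell)


def rho(rest, width):
    """1-based position of the leftmost 1 in a width-bit string."""
    if rest == 0:
        return width                      # assumption: cap when no bit is set
    return width - rest.bit_length() + 1


def build_hll(values, m=4, ell=32):
    """One max register per bucket; the first m hash bits select the bucket."""
    width = ell - m                       # bits left over for rho
    sketch = [0] * (1 << m)               # 2^m buckets, 0 means "empty"
    for v in values:
        s = hash_bits(v, ell)
        bucket = s >> width               # first m bits
        rest = s & ((1 << width) - 1)     # remaining ell - m bits
        sketch[bucket] = max(sketch[bucket], rho(rest, width))
    return sketch
```

Because each register is a maximum, duplicates do not change the sketch, and the sketch of a union is the element-wise maximum of the two sketches, which is what makes insert-only maintenance straightforward.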
Algorithm 3 in Appendix A shows the steps to construct such sketches for a set of columns $\mathcal{C}$. In §3.3, we discuss how to choose the parameter m for two given columns X and Y.

Figure 1: Constructing the HLL sketch of the column X as $S_X$.

For example, as shown in Figure 1, one hash function h is used for all values in column X, and the first m bits are used to bucketize hash values. As a result there are $2^m$ buckets ($b_1, b_2, \ldots, b_{2^m}$) in the HLL sketch of column X, represented by $S_X$. Specifically, in this figure $s_i = h(X[i])$ and $s_j = h(X[j])$ are assigned to bucket $b_1$ because their first m bits are the binary representation of the value one. Moreover, $\rho(s_i) = 3$ and $\rho(s_j) = 4$ because the positions of the leftmost 1 in the remaining $\ell - m$ bits of the hash values $s_i$ and $s_j$ are 3 and 4 respectively. If $s_i$ and $s_j$ are the only hash values assigned to $b_1$, the final value in $b_1$ is $V_1^X = \max(3, 4) = 4$. The formal definition of $V_i^X$ is given by Definition 2.

DEFINITION 2. ($V_i^X$: HLL value of column X in bucket $b_i$). Let $s_j$ be the hash value of the tuple j in X whose first m bits ($s_j[1, \ldots, m]$) indicate it belongs to bucket $b_i$, i.e., $s_j[1, \ldots, m]$ is the binary representation of the value i ($(s_j[1 \ldots m])_2 = i$), and let $\rho(s_j)$ be the position of the leftmost one in the remaining $\ell - m$ bits of the hash value $s_j$ ($s_j[m+1, \ldots, \ell]$).
The HLL value of column X for the bucket $b_i$ is defined as:

$$V_i^X = \max_{s_j : (s_j[1 \ldots m])_2 = i} \rho(s_j) \qquad (2)$$

Note that when there is only one bucket we ignore the index i and denote the HLL value of column X by $V^X$.

Table 1: Summary of frequently used notations
  n                  Total number of columns
  $n_X$              Number of distinct values in column X
  $S_X$              Sketch of column X
  $\Phi(X, Y)$       Inclusion coefficient of columns X and Y
  $\hat{\Phi}$       Estimation of $\Phi(X, Y)$
  h                  Hash function ($\ell$ bits)
  m                  Number of bits assigned for bucketization
  $2^m$              Number of buckets in HLL sketch
  $V_i^X$            HLL value of column X in bucket $b_i$
  $V^X$              HLL value of column X when there is only one bucket
  $\hat{P}$          Estimated value of $pr(V^X \le V^Y)$
  $e_p$              Estimation error of $pr(V^X \le V^Y)$
  $e_\Phi$           Estimation error of $\Phi(X, Y)$

Space complexity: In the HLL sketch of the column X ($S_X$), the HLL value of each bucket is an integer ($0 \le V_i^X \le \ell - m$). Thus each bucket needs $\log(\ell - m)$ bits to store $V_i^X$, i.e., the total memory to store the sketch $S_X$ is $O(2^m \log(\ell - m))$.

3. EFFICIENT ESTIMATION OF INCLUSION COEFFICIENT
In this section, we describe a technique to estimate the inclusion coefficient using HyperLogLog (HLL) sketches. In a preprocessing step, an HLL sketch [4, 5] is constructed for all columns by scanning the data only once; in practice these sketches are compact enough to fit into memory. Our algorithm aims to estimate the inclusion coefficient of two columns X and Y using the pre-computed HLL sketches of X and Y. We also produce an error bound for the estimate.

3.1 Overview of our Approach
The main idea of our algorithm is to produce an estimate of the inclusion coefficient between two columns X and Y by comparing their HLL values (Definition 2). To develop intuition on why comparing the HLL values of X and Y can help in finding the inclusion coefficient, we start with the case of an HLL sketch with a single bucket, and suppose that X and Y have the same number of distinct values. We examine the following two extreme cases.
i) If X = Y, or $\Phi(X, Y) = 1$, we have $pr(V^X \le V^Y) = pr(V^X = V^Y) = 1$, because the hash function h in HLL is applied on the same set of values for both X and Y; and ii) if $X \cap Y = \emptyset$, or $\Phi(X, Y) = 0$, $pr(V^X \le V^Y)$ is at least 0.5.

A useful observation here is that $pr(V^X \le V^Y)$ increases monotonically as $\Phi(X, Y)$ increases (we prove it formally later). For example, in Figure 2, we plot $pr(V^X \le V^Y)$ as a function of $\Phi(X, Y)$ when $|X| = |Y| = 10^4$. §3.2.1 introduces how to derive this function for general cases.

Clearly, given bucket i for columns X and Y, the event $V_i^X \le V_i^Y$ is a Bernoulli trial. When there are multiple buckets, since buckets are independent, the events $V_i^X \le V_i^Y$ are independent Bernoulli trials [5]. The buckets of a given column are independent as a result of the HLL construction (§2.2): by the property of universal hashing, the first m bits of a hash value are independent of the remaining $\ell - m$ bits.

The intuition behind our algorithm is that, using multiple buckets in the HLL sketches of X and Y, we first estimate $pr(V^X \le V^Y)$ given the fact that the event $V_i^X \le V_i^Y$ is an independent Bernoulli trial for each bucket $b_i$ (e.g., estimating $pr(V^X \le V^Y) \approx 0.8$ in Figure 2). Then, since $pr(V^X \le V^Y)$ can be written as a function of the inclusion coefficient, we effectively look up the value of $\Phi(X, Y)$ that produces the estimated $pr(V^X \le V^Y)$. For example, in Figure 2, by looking up we get $\Phi(X, Y) \approx 0.62$.

Maximizing the likelihood. Our algorithm is based on maximum likelihood estimation (MLE). More formally, let $Z = |\{i \mid V_i^X \le V_i^Y\}|$ be the number of buckets (among all $2^m$ buckets) where $V_i^X \le V_i^Y$. The random variable Z follows a distribution parameterized by |X|, |Y|, and $\Phi(X, Y)$. We observe Z = z from the HLL sketches of X and Y, and we want to choose $\Phi(X, Y)$ to maximize the likelihood of our observation:

$$\Phi_{mle} = \arg\max_{\Phi} \; pr(Z = z \mid \Phi(X, Y) = \Phi) \qquad (3)$$

Next we introduce our MLE-based estimation algorithm called BML (Binomial Mean-Lookup). Suppose |X| and |Y| are known (or estimated from their HLL sketches); there are still two remaining questions. First, we need to characterize the distribution of Z with $\Phi(X, Y)$ as a parameter (§3.2.1). Second, we need an efficient algorithm to maximize the likelihood as in Equation 3 (§3.2.2). We prove the correctness of our algorithm, i.e., show that our algorithm indeed gives a solution to Equation 3, and analyze its estimation error in §3.2.3. Finally, we briefly discuss how BML can leverage additional memory, if available, to improve accuracy (§3.4).

3.2 BML: Binomial Mean-Lookup Estimator
As shown in Figure 2, $pr(V^X \le V^Y)$ increases monotonically as $\Phi(X, Y)$ increases. We first show how to derive the closed form of $pr(V^X \le V^Y)$ as a function of $\Phi(X, Y)$; then we discuss the details of our inclusion coefficient estimator BML.

3.2.1 Calculating $pr(V^X \le V^Y)$
Given columns X and Y, $V^X$ and $V^Y$ are both random variables. Let us first consider a simpler case where there is only one random variable $V^X$ and discuss how to find $pr(V^X \le k)$, where k is constant. Then we show how to use this simple case to derive the general case $pr(V^X \le V^Y)$.

Given column X with $n_X$ distinct values, when there is only one bucket (m = 0), based on Definition 2, the HLL value of the column X is in fact the maximum over $n_X$ independent random variables.
As we discussed in §2.2, for each hash value $s_i$, we find the position of the leftmost 1, denoted by $\rho(s_i)$, and the maximum of $\{\rho(s_i) : s_i = h(X[i])\}$ is the HLL sketch of the column X. Every bit in $s_i$ is a Bernoulli trial, given that it is a random experiment with exactly two possible outcomes, 0 and 1, and by the property of universal hashing every bit of a hash value is independent of the others. Thus, the bits in $s_i$ are independent Bernoulli trials. Each random variable $\rho(s_i)$ represents the leftmost one in $s_i$, i.e., the first one after $\rho(s_i) - 1$ zeros. Thus, as also pointed out in [4], each random variable is geometrically distributed and $pr(V^X \le k)$ is:

$$pr(V^X \le k) = \prod_{i=1}^{n_X} \left(1 - Pr(\text{first } k \text{ bits are zero})\right) = \left(1 - \frac{1}{2^k}\right)^{n_X} \qquad (4)$$

By Equation 4, $pr(V^X = k)$ is $pr(V^X \le k) - pr(V^X \le k - 1)$:

$$pr(V^X = k) = \left(1 - \frac{1}{2^k}\right)^{n_X} - \left(1 - \frac{1}{2^{k-1}}\right)^{n_X} \qquad (5)$$

Next we show how to use these equations to derive $pr(V^X \le V^Y)$, where both $V^X$ and $V^Y$ are random variables. When the intersection of X and Y is non-empty, $V^X$ and $V^Y$ are not independent. Let T be $X \cap Y$. To resolve the dependency of X and Y, we consider three disjoint sets: T, $X' = X \setminus T$, and $Y' = Y \setminus T$. As shown in Figure 3, based on the cardinality of these sets there are three different cases:
(1) X and Y are disjoint, i.e., $n_T = 0$, $n_{X'} = n_X$, $n_{Y'} = n_Y$.
(2) Y is a subset of X, i.e., $Y' = Y \setminus T$ is empty ($n_T = n_Y$, $n_{X'} = n_X - n_Y$, $n_{Y'} = 0$).
(3) X and Y partially overlap, i.e., $Y' = Y \setminus T$ is not empty ($n_T \neq 0$, $n_{X'} \neq 0$, $n_{Y'} \neq 0$).
Next we show how to use Equations 4 and 5 to derive $pr(V^X \le V^Y)$ for each case.

Case 1: Clearly, if $T = X \cap Y$ is empty ($n_T = 0$), then $V^X$ and $V^Y$ are independent. As shown in Figure 3, since we are interested in $V^X \le V^Y$, if X and Y are disjoint and $V^{Y'} = k$, then $V^{X'}$ should be at most k. Based on Definition 2 we have $1 \le k \le \ell$. Thus $pr(V^X \le V^Y)$ can be calculated with the help of Equations 4 and 5 by Equation 6. Note that since $n_T = 0$, $n_{X'} = n_X$ and $n_{Y'} = n_Y$.
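$$pr(V^X \le V^Y) = \sum_{k=1}^{\ell} pr(V^{X'} \le k)\, pr(V^{Y'} = k) = \sum_{k=1}^{\ell} \left(1 - \frac{1}{2^k}\right)^{n_{X'}} \left[\left(1 - \frac{1}{2^k}\right)^{n_{Y'}} - \left(1 - \frac{1}{2^{k-1}}\right)^{n_{Y'}}\right] \qquad (6)$$

For example, by Equation 6, when $|X| = |Y| = 10^4$ and $n_T = 0$, $\Phi(X, Y)$ is 0 and $pr(V^X \le V^Y) \approx 0.58$ (Figure 2).

Case 2: If $Y \subset X$, then T = Y and $Y'$ is empty ($n_{Y'} = 0$). Similar to Case 1, if $V^T = k$, then $V^{X'}$ should be at most k, where k can be any value in $[1, \ell]$. So $pr(V^X \le V^Y)$ is as follows:

$$pr(V^X \le V^Y) = \sum_{k=1}^{\ell} pr(V^{X'} \le k)\, pr(V^T = k) = \sum_{k=1}^{\ell} \left(1 - \frac{1}{2^k}\right)^{n_{X'}} \left[\left(1 - \frac{1}{2^k}\right)^{n_T} - \left(1 - \frac{1}{2^{k-1}}\right)^{n_T}\right] \qquad (7)$$

For example, when $|X| = |Y| = 10^4$ and $n_T = 10^4$, $\Phi(X, Y)$ is 1 and $pr(V^X \le V^Y)$ is 1 (Figure 2).

Case 3: Finally, if X and Y partially overlap, as shown in Figure 3, given k there are three scenarios: 1) $V^{X'} \le k$, $V^T \le k - 1$, $V^{Y'} = k$; 2) $V^{X'} \le k$, $V^T = k$, $V^{Y'} \le k - 1$; and 3) $V^{X'} \le k$, $V^T = k$, $V^{Y'} = k$. Thus $pr(V^X \le V^Y)$ can be derived by Equation 8.

$$\begin{aligned}
pr(V^X \le V^Y) = \sum_{k=1}^{\ell} \Big[ \; & pr(V^{X'} \le k)\, pr(V^T \le k-1)\, pr(V^{Y'} = k) \\
+ \; & pr(V^{X'} \le k)\, pr(V^T = k)\, pr(V^{Y'} \le k-1) \\
+ \; & pr(V^{X'} \le k)\, pr(V^T = k)\, pr(V^{Y'} = k) \Big] \\
= \sum_{k=1}^{\ell} \left(1 - \tfrac{1}{2^k}\right)^{n_{X'}} \Big[ \; & \left(1 - \tfrac{1}{2^{k-1}}\right)^{n_T} \left( \left(1 - \tfrac{1}{2^k}\right)^{n_{Y'}} - \left(1 - \tfrac{1}{2^{k-1}}\right)^{n_{Y'}} \right) \\
+ \; & \left( \left(1 - \tfrac{1}{2^k}\right)^{n_T} - \left(1 - \tfrac{1}{2^{k-1}}\right)^{n_T} \right) \left(1 - \tfrac{1}{2^{k-1}}\right)^{n_{Y'}} \\
+ \; & \left( \left(1 - \tfrac{1}{2^k}\right)^{n_T} - \left(1 - \tfrac{1}{2^{k-1}}\right)^{n_T} \right) \left( \left(1 - \tfrac{1}{2^k}\right)^{n_{Y'}} - \left(1 - \tfrac{1}{2^{k-1}}\right)^{n_{Y'}} \right) \Big]
\end{aligned} \qquad (8)$$

For example, when $|X| = |Y| = 10^4$ and $n_T = 6200$, $\Phi(X, Y)$ is 0.62 and $pr(V^X \le V^Y) \approx 0.8$ (Figure 2).

Multiple buckets: When there are $2^m$ buckets (m > 0), inspired by stochastic averaging [5], we consider the average case where the HLL value of column X in Definition 2 is the maximum over, on average, $n_X / 2^m$ independent random variables for each bucket $b_i$, and Equations 4 and 5 are updated as:

$$pr(V_i^X \le k) = \left(1 - \frac{1}{2^k}\right)^{n_X / 2^m} \qquad (9)$$

$$pr(V_i^X = k) = \left(1 - \frac{1}{2^k}\right)^{n_X / 2^m} - \left(1 - \frac{1}{2^{k-1}}\right)^{n_X / 2^m} \qquad (10)$$

Later in §3.3, we discuss how considering the average case affects the number of buckets.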
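The closed forms above are easy to evaluate numerically. The sketch below is an illustration under assumptions rather than the authors' code: the names `cdf`, `pmf` and `prob_le` are hypothetical, the hash length defaults to $\ell = 32$, and an empty set is treated as having HLL value 0 so that the single expression of Equation 8 also covers the disjoint and subset cases (Equations 6 and 7).

```python
def cdf(n, k):
    """pr(V <= k) for the max over n geometric variables (Equations 4 and 9)."""
    if k <= 0:
        return 1.0 if n == 0 else 0.0     # an empty set has value 0 (assumption)
    return (1.0 - 2.0 ** -k) ** n


def pmf(n, k):
    """pr(V = k) = pr(V <= k) - pr(V <= k - 1) (Equations 5 and 10)."""
    return cdf(n, k) - cdf(n, k - 1)


def prob_le(n_x, n_y, n_t, ell=32, m=0):
    """pr(V^X <= V^Y) given |X|, |Y| and the intersection size n_t."""
    buckets = 2 ** m                      # average load per bucket is n / 2^m
    nx_only = (n_x - n_t) / buckets       # X' = X \ T
    ny_only = (n_y - n_t) / buckets       # Y' = Y \ T
    nt = n_t / buckets
    total = 0.0
    for k in range(1, ell - m + 1):
        total += cdf(nx_only, k) * (
            cdf(nt, k - 1) * pmf(ny_only, k)      # scenario 1
            + pmf(nt, k) * cdf(ny_only, k - 1)    # scenario 2
            + pmf(nt, k) * pmf(ny_only, k)        # scenario 3
        )
    return total


p_disjoint = prob_le(10**4, 10**4, 0)        # Phi = 0
p_partial = prob_le(10**4, 10**4, 6200)      # Phi = 0.62
p_equal = prob_le(10**4, 10**4, 10**4)       # Phi = 1
```

Evaluating the function on a grid of intersection sizes is one way to reproduce the shape of the curve that the text attributes to Figure 2.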

Figure 2: $|X| = |Y| = 10^4$, Lookup approach.
Figure 3: For columns X and Y ($|X| \ge |Y|$) the cases are (1) disjoint, (2) overlapped and (3) partially overlapped, where $T = X \cap Y$, $X' = X \setminus T$, $Y' = Y \setminus T$.
Figure 4: $e_p$ and $e_\Phi$ are the errors of estimating $pr(V^X \le V^Y)$ and the inclusion coefficient $\Phi(X, Y)$.

3.2.2 Efficient Algorithm to Maximize the Likelihood
Recall from Equation 3 that we formulate the problem of estimating the inclusion coefficient as a maximum likelihood estimation problem, where we observe that the number of buckets (among all $2^m$ buckets) where $V_i^X \le V_i^Y$ is z, and the goal is to choose $\Phi(X, Y)$ to maximize the likelihood of our observation. Here we propose our estimator, BML, to efficiently solve the MLE. BML has two main steps. Step one is to estimate $pr(V^X \le V^Y)$ using only $S_X$ and $S_Y$. Step two is to use the lookup approach to map the estimated probability into the inclusion coefficient. Before describing the details of these two steps, we first provide an overview of how BML works.

As shown in Figure 2, $pr(V^X \le V^Y)$ increases as $\Phi(X, Y)$ increases (proof in Theorem 2). Clearly, if BML estimates $pr(V^X \le V^Y)$, the estimate can be used to find $\Phi(X, Y)$. For example, let us assume the estimated value of $pr(V^X \le V^Y)$ is 0.8 ($\hat{P} = 0.8$). As shown in Figure 2, after projection onto the blue line (§3.2.1), BML estimates $\Phi(X, Y)$ to be 0.62.

Algorithm 1 shows the details of the two main steps of BML. The input to this algorithm is the HLL sketches of columns X and Y ($S_X$, $S_Y$). In §3.3, we discuss how to choose the number of buckets for these sketches. In the first step, since the event $V_i^X \le V_i^Y$ in each bucket $b_i$ is an independent Bernoulli trial (§3.1), $\hat{P}$ is the ratio of the number of buckets where $V_i^X \le V_i^Y$ to the total number of buckets $2^m$ (lines 3-6). Later, in Theorem 4, we use Hoeffding's inequality to provide the error bound of $\hat{P}$.
Algorithm 1: BML // inclusion coefficient estimator
  1: Input: $S_X$, $S_Y$
  2: // Step 1: Estimate $pr(V^X \le V^Y)$ using $S_X$, $S_Y$
  3: z = 0
  4: for $b_i$, where $i \in [1, 2^m]$
  5:     if $V_i^X \le V_i^Y$ then z = z + 1
  6: $\hat{P} = z / 2^m$
  7: // Step 2: Lookup step
  8: $n_X$ = Estimated number of distinct values from $S_X$
  9: $n_Y$ = Estimated number of distinct values from $S_Y$
 10: return $\hat{\Phi}(X, Y)$ = Lookup($\hat{P}$, 0, min($n_X$, $n_Y$), $n_X$, $n_Y$, $\epsilon$)

In the second step, BML (Algorithm 1) calls the Lookup function (Algorithm 2) to estimate the inclusion coefficient $\Phi(X, Y)$ as the one that produces $\hat{P}$. The key idea is that since $pr(V^X \le V^Y)$ is an increasing function of $\Phi(X, Y)$ (Theorem 2), given an estimated probability $\hat{P}$ we can use binary search to estimate $\Phi(X, Y)$.

Algorithm 2 shows the pseudocode of the lookup approach. The inputs to this algorithm are $\hat{P}$, minInc, maxInc, $n_X$, $n_Y$, and $\epsilon$, where $\hat{P}$ is the estimate of $pr(V^X \le V^Y)$, minInc and maxInc are the boundaries of the search, $n_X$ and $n_Y$ are the estimated cardinalities of X and Y ([4]), and $\epsilon$ is the error tolerance. Algorithm 2 performs a binary search over the possible intersection sizes $n_T$, and at each iteration, depending on which case the current value of $n_T$ falls into, it uses the suitable equation from §3.2.1 (lines 4-7). For example, if $X \cap Y = \emptyset$, it uses Equation 6. Because the probability increases with $n_T$, the search continues in the lower half when the computed probability exceeds $\hat{P}$ and in the upper half otherwise.

Such a simple bisection procedure for iteratively converging on a solution known to lie inside some interval [a, b] has been introduced in [8] for root finding. They showed that the number of iterations required to obtain an error smaller than $\epsilon$ is $\frac{\ln(b - a) - \ln \epsilon}{\ln 2}$. In our problem, since $0 \le \Phi(X, Y) \le 1$, the number of iterations would be $\frac{-\ln \epsilon}{\ln 2}$. For example, Algorithm 2 only needs 17 iterations to obtain an error smaller than $\epsilon = 10^{-5}$. It is worth mentioning that the cost of each iteration is very low (analysis in §3.2.3).

Algorithm 2: Lookup // Map $\hat{P}$ into $\hat{\Phi}(X, Y)$
  1: Input: $\hat{P}$, minInc, maxInc, $n_X$, $n_Y$, $\epsilon$
  2: Output: $\hat{\Phi}(X, Y)$
  3: $n_T$ = (minInc + maxInc) / 2; $\hat{\Phi} = n_T / n_X$
  4: if ($n_T = n_X$ & $n_T = n_Y$) then Prob = 1.0 // X = Y
  5: else if ($n_T = 0$) then Prob = Equation 6 // $X \cap Y = \emptyset$
  6: else if ($n_X > n_Y$ & $n_T = n_Y$) then Prob = Equation 7
  7: else Prob = Equation 8
  8: if $|Prob - \hat{P}| \le \epsilon$ return $\hat{\Phi}$
  9: if Prob > $\hat{P}$ return Lookup($\hat{P}$, minInc, $n_T$, $n_X$, $n_Y$, $\epsilon$)
 10: if Prob < $\hat{P}$ return Lookup($\hat{P}$, $n_T$, maxInc, $n_X$, $n_Y$, $\epsilon$)

3.2.3 Algorithm Correctness and Analysis
We provide the following analysis to show the correctness, efficiency, and error bound of BML. We prove that $pr(V^X \le V^Y)$ is an increasing function of $\Phi(X, Y)$, i.e., in the lookup step there is a one-to-one mapping between the probability $pr(V^X \le V^Y)$ and the inclusion coefficient. In Equation 3, we formulated the problem of estimating the inclusion coefficient as a maximum likelihood problem. We prove that the results of BML and the MLE formulation (3) are identical. We also provide the error bound of BML for the estimation of $pr(V^X \le V^Y)$ and of the inclusion coefficient. Finally, we show the time complexity of Algorithm 1.

Algorithm Correctness: BML uses binary search in order to find the mapping between the probability $\hat{P}$ and the inclusion coefficient (Algorithm 2). This approach only works if there is a one-to-one mapping from $pr(V^X \le V^Y)$ to $\Phi(X, Y)$. Theorem 2 shows that the probability $pr(V^X \le V^Y)$ (§3.2.1) is an increasing function of $\Phi(X, Y)$. Clearly an increasing function is one-to-one, hence invertible.
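The two steps of the estimator can be sketched end to end as follows. This is a hedged illustration, not the authors' implementation: the names `estimate_p`, `lookup` and `bml` are hypothetical, the cardinalities are passed in as arguments instead of being estimated from the sketches, the bisection is written iteratively over a real-valued intersection size, and $\ell = 32$ is assumed. The probability function uses the compact form of Equation 8 with per-bucket loads from Equations 9 and 10.

```python
import math


def cdf(n, k):
    """pr(V <= k) for the max over n geometric variables; empty sets have value 0."""
    if k <= 0:
        return 1.0 if n == 0 else 0.0
    return (1.0 - 2.0 ** -k) ** n


def prob_le(n_x, n_y, n_t, m, ell=32):
    """pr(V_i^X <= V_i^Y) per bucket, Equation 8 with loads divided by 2^m."""
    b = 2 ** m
    nx1, ny1, nt = (n_x - n_t) / b, (n_y - n_t) / b, n_t / b
    total = 0.0
    for k in range(1, ell - m + 1):
        pt = cdf(nt, k) - cdf(nt, k - 1)                  # pr(V^T = k)
        py = cdf(ny1, k) - cdf(ny1, k - 1)                # pr(V^Y' = k)
        # scenarios 2 and 3 merge into pr(V^T = k) * pr(V^Y' <= k)
        total += cdf(nx1, k) * (cdf(nt, k - 1) * py + pt * cdf(ny1, k))
    return total


def estimate_p(s_x, s_y):
    """Step 1: fraction of buckets whose value in S_X is at most that in S_Y."""
    z = sum(1 for vx, vy in zip(s_x, s_y) if vx <= vy)
    return z / len(s_x)


def lookup(p_hat, n_x, n_y, m, eps=1e-6, ell=32):
    """Step 2: bisection on the intersection size; relies on monotonicity."""
    lo, hi = 0.0, float(min(n_x, n_y))
    if p_hat <= prob_le(n_x, n_y, lo, m, ell):
        return 0.0                                        # clamp below the curve
    if p_hat >= prob_le(n_x, n_y, hi, m, ell):
        return hi / n_x                                   # clamp above the curve
    while hi - lo > eps * n_x:
        mid = (lo + hi) / 2.0
        if prob_le(n_x, n_y, mid, m, ell) > p_hat:
            hi = mid                                      # probability too high: shrink n_T
        else:
            lo = mid
    return (lo + hi) / 2.0 / n_x


def bml(s_x, s_y, n_x, n_y):
    """Estimate the inclusion coefficient from two sketches with 2^m buckets each."""
    m = int(math.log2(len(s_x)))
    return lookup(estimate_p(s_x, s_y), n_x, n_y, m)
```

A round trip such as `lookup(prob_le(10**4, 10**4, 6200, 7), 10**4, 10**4, 7)` recovers a value close to 0.62, which is a convenient sanity check that the lookup inverts the probability function.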

THEOREM 2. Given two columns X and Y with intersection T, where $|T| = n_T$, the probability $pr(V^X \le V^Y)$ is an increasing function of $\Phi(X, Y)$.

PROOF. Please find the proof in Appendix B.

In Theorem 3 we prove that the results of Algorithm 1 and the MLE formulation (3) are identical.

THEOREM 3. The inclusion coefficient estimate $\hat{\Phi}$ from Algorithm 1 and the inclusion coefficient estimate $\Phi_{mle}$ from the MLE formulation (3) are identical.

PROOF. Please find the proof in Appendix C.

Bound of the Probability Estimation: Let P and $\hat{P}$ be the exact and estimated values of $pr(V^X \le V^Y)$ respectively. BML (Algorithm 1) calculates $\hat{P}$ as the ratio of the number of buckets where $V_i^X \le V_i^Y$ to the total number of buckets $2^m$. As shown in Figure 4, this estimation may produce an error $e_p$, e.g., $\hat{P} = 0.8$ while $P = 0.8 \pm e_p$. This figure also shows that when $\hat{P} = 0.8$, the lookup step in BML returns 0.62 as the estimated inclusion coefficient while the actual one is $0.62 \pm e_\Phi$ (Figure 4). Thus, the estimation error $e_p$ results in $e_\Phi$, i.e., the estimation error of the inclusion coefficient $\Phi(X, Y)$. Here we first show, given an error $e_p$ ($0 \le e_p \le 1$), the probability that the estimation error of Algorithm 1 is at most $e_p$. Next we show how to use $e_p$ in order to bound the estimation error of the inclusion coefficient.

THEOREM 4. The probability that the estimation error of $\hat{P}$ is at most $e_p$ is:

$$pr\left(|\hat{P} - P| \le e_p\right) \ge 1 - 2\exp\left(-2^{m+1} e_p^2\right)$$

PROOF. Please find the proof in Appendix D.

For example, when m = 7 and the estimation error $e_p$ is 0.14, $pr(|\hat{P} - P| \le e_p)$ is at least 0.95, i.e., with 95% confidence the estimation error is at most 0.14.

Bound of the Inclusion Coefficient Estimation: From Figure 4 one can see that the slope of P (the blue line) at any point represents the ratio of $e_p$ to $e_\Phi$, e.g., when $\hat{P} = 0.8$, the estimated inclusion coefficient is 0.62 and the slope of P at 0.62, $P'(0.62)$, is $e_p / e_\Phi$. One can numerically find this slope of P for a given point, e.g., $P'(0.62) = 0.7$. Moreover, as we discussed in Theorem 4, with 95% confidence the estimation error $e_p$ is 0.14.
Given e_p and the slope, e_Φ at this point can be calculated as e_p / P′(.62) = .2. More formally, at any point Φ we have:

$$e_\Phi = \frac{e_p}{P'(\Phi)}$$

So far we have shown how to find e_Φ for a given point. Clearly, in Figure 4 the slope of P differs from point to point. Thus, to bound e_Φ for two columns X and Y, the key observation is to first find the minimum and maximum slopes of P; the ratios of e_p to those slopes then bound e_Φ. Table 2 shows the minimum and maximum slopes of P and the e_Φ bounds for columns X and Y with different cardinalities. For example, when the cardinality of both X and Y is 10^4, the minimum and maximum slopes of P in Figure 4 are .24 and .79, and the minimum and maximum e_Φ, obtained by dividing e_p by these slopes, are .2 and .5, where e_p is .4. Figures 6 to 8 show how the slope of P varies for columns with different cardinalities. One can see that the slope of P reduces as the difference between the cardinalities of X and Y increases, e.g., when |X| = , |Y| = 10^4 the minimum and maximum slopes of P are .62 and .72 respectively. Clearly, when the slope of P reduces, e_Φ will increase (Table 2).

Table 2: The minimum and maximum e_Φ for synthetic data (X, Y). Columns: |X|, |Y|, min(e_p/e_Φ), max(e_p/e_Φ), max(e_Φ), min(e_Φ).

Time complexity of the Algorithm: In Algorithm 1, the estimation of pr(V_X ≤ V_Y) using S_X and S_Y takes 2^m steps (lines 4-6), where 2^m is the number of buckets. As discussed earlier, the binary search in the lookup step takes ln(1/ε)/ln 2 iterations to obtain an error smaller than ε. The cost of each iteration is in O(ℓ − m), because Equations 6, 7, and 8 used in Algorithm 2 (lines 6-8) are sums of ℓ − m values. Thus, the time complexity of Algorithm 1 is O(2^m + (ln(1/ε)/ln 2)(ℓ − m)), which is linear in the number of buckets.

3.3 Choosing Number of Buckets

Algorithm 3 (Appendix A) shows the steps to construct a HLL sketch for a fixed m. In this section, we first discuss how to choose the parameter m (the number of bits for the buckets) given two columns X and Y. Next, we show how this algorithm changes when we consider all pairs of columns, since the parameter m then needs to vary.
Parameter m for a given column pair: The HLL construction of column X can be viewed as a balls-and-bins problem [2], where the buckets are the bins and the distinct values in column X are the balls. As we discussed in §3.2., given column X with n_X distinct values, when there is only one bucket the HLL value of column X is in fact the maximum over n_X independent random variables (Definition 2). When there are 2^m buckets (m > 0), the HLL value of column X in each bucket will be the maximum over, on average, n_X/2^m independent random variables, i.e., a balanced load in each bucket. For the balls-and-bins problem [2], it is shown that as the number of bins increases the probability of a balanced load decreases. In other words, given n_X balls and 2^m bins, the probability that all bins contain exactly n_X/2^m balls reduces as the number of bins increases. Thus, in the HLL construction of column X, we should also expect that as the number of buckets increases the probability that all buckets have the same load decreases. However, it is also shown in [5, 4] that having a large number of buckets reduces the variance of cardinality estimation using a HLL sketch. Thus, on the one hand, fewer buckets increase the probability of a balanced load (n_X/2^m), but on the other hand, more buckets reduce the variance of the estimation. For a perfectly balanced load there should be only one bucket, and for the lowest variance the number of buckets should be n_X (m = log(n_X)). Given two columns X and Y, in this paper, as a trade-off between the variance and the balanced load, we heuristically pick a midpoint between the best variance and the best balanced load by setting the number of buckets as in Equation 2. Note that this is a heuristic and not the optimal choice.

$$m = \frac{\log(\max(n_X, n_Y))}{2} \qquad (2)$$

Parameter m for multiple pairs of columns: Our goal is to efficiently estimate the inclusion coefficient for all column pairs in a database. When we consider multiple pairs of columns, Equation 2 might return different values of m for a given column.
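The midpoint heuristic can be sketched in a few lines. This is a minimal illustration that assumes a base-2 logarithm and rounding down to an integer number of bits; neither choice is stated in the text, and `choose_m` is an illustrative name.

```python
import math


def choose_m(n_x: int, n_y: int) -> int:
    """Midpoint heuristic: half of log2 of the larger column cardinality."""
    # Assumptions: base-2 logarithm and floor rounding (not stated in the text).
    return int(math.log2(max(n_x, n_y)) // 2)


# The same column X (2**10 distinct values) paired with two different columns:
m_with_y = choose_m(2 ** 10, 2 ** 20)
m_with_z = choose_m(2 ** 10, 2 ** 16)
```

Here `m_with_y` and `m_with_z` differ even though X is unchanged, which is the situation the multi-pair discussion addresses.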
For example, consider three columns X, Y, and Z with cardinalities n_X, n_Y, and n_Z, where the cardinality of X is smaller than that of Y and Z (n_X < n_Y, n_X < n_Z) and the cardinalities of Y and Z are not equal (n_Y ≠ n_Z). In this example, for column pair X and Y the parameter m is log(n_Y)/2, while for column pair X and Z the parameter m is log(n_Z)/2. Thus, for column X, m takes two values, log(n_Y)/2 and log(n_Z)/2.

Figure 5: |X| = 10^5, |Y| = 10^4. Figure 6: |X| = 10^4, |Y| = 10^5. Figure 7: |X| = 10^3, |Y| = 10^4. Figure 8: |X| = 10^4, |Y| = 10^3.

To address this problem, we change Algorithm 3 such that it still reads the data only once but keeps all the sketches for the different m ∈ {0, …, l} (the value of l can be determined by Theorem 5). More specifically, after reading the data in line 4, we iterate over the values of m from 0 to l, and the rest of the algorithm is the same. Observe that even if we keep all sketches for m from 0 to l, the memory required only doubles, because Σ_{m=0}^{l} 2^m log(ℓ) = (2^{l+1} − 1) log(ℓ) < 2^{l+1} log(ℓ), where 2^m log(ℓ) is the size of the sketch for each column (§2.2). After constructing the sketches, given two columns X and Y, we decide which m produces the better estimate of the inclusion coefficient (Equation 2) and pass those sketches to Algorithm 1.

Finally, we discuss how to set the parameter l. Suppose the memory bound for each column is M. In Theorem 5, we prove that given column X, bounded memory M, and an ℓ-bit hash function h, the parameter l should be at most ln(M / log(ℓ)).

THEOREM 5. Given column X, bounded memory M, and an ℓ-bit hash function h, l ≤ ln(M / log(ℓ)).

PROOF. We hash each value X[i] ∈ X only once but generate l + 1 sketches, one for each m ∈ {0, …, l}. Since we only keep the maximum for each bucket b_j, the amount of memory for each bucket is log(ℓ). Each sketch has 2^m buckets, so in total it requires Σ_{i=0}^{l} 2^i log(ℓ) = (2^{l+1} − 1) log(ℓ) < 2^{l+1} log(ℓ) bits. Since M is the memory bound, 2^{l+1} should be at most M / log(ℓ), i.e., 2^{l+1} ≤ M / log(ℓ), which gives l ≤ ln(M / log(ℓ)).

3.4 Discussion: Leveraging more memory

We briefly discuss how the proposed estimator can leverage additional memory, when available, to improve its accuracy. It is shown in [4] that increasing the number of buckets for HLL sketches reduces the variance of cardinality estimation. In other words, the bucketization (with stochastic averaging) emulates the effect of n hash functions with only one single hash function [5].
However, as discussed in §3.3, increasing the number of buckets reduces the probability of a balanced load (n_X/2^m), which can ultimately reduce the accuracy. Thus, leveraging additional memory by increasing the number of buckets is not adequate for our problem. One approach is to combine multiple hash functions with stochastic averaging in order to take advantage of additional memory. In other words, given two columns X and Y, we fix the number of buckets to 2^m, where m is given by Equation 2, and then use multiple hash functions in the HLL construction. The estimator uses the sketches built by each hash function to estimate the inclusion coefficient (Algorithm 1), and the final estimate of the inclusion coefficient is the average of those results. A thorough empirical study of the trade-offs between estimation error and the above method of leveraging additional memory is an interesting area of future work.

4. EXTENSION OF HLL CONSTRUCTION TO SUPPORT DELETION

HLL sketches are becoming increasingly popular for estimating cardinality in different applications, and are even being adopted in commercial database engines, e.g., [6] uses HyperLogLog to support approximate distinct count functionality. However, in data warehousing scenarios, it is not uncommon for recent data to be added and older data to be removed from the database. Clearly, HLL sketches can be maintained incrementally in the presence of insertions. When a new data item X[k] is inserted into column X, the same Algorithm 3 computes the hash value of X[k] (s_k = h(X[k])), and the affected bucket b_j is identified by the leftmost m bits of s_k. It then finds ρ(s_k), the position of the leftmost 1 in the remaining bits of s_k, and updates the value of bucket j as V_j^X = max(S_X[b_j], ρ(s_k)) (Definition 2). For example, Figure 1 shows the HLL sketch of column X, i.e., S_X. Let us assume a new value X[k] is added to column X such that the first m bits of s_k = h(X[k]) identify the first bucket in S_X and ρ(s_k) is 5.
Thus the value of that bucket will be updated to max(4, 5) = 5.

On the other hand, when X[i] is deleted, we can find the affected bucket in the same way as for insertion. Let b_j be the affected bucket. Since X[i] exists in the database, ρ(s_i) ≤ S_X[b_j]. If ρ(s_i) < S_X[b_j], no update is required, but if ρ(s_i) = S_X[b_j], then ρ(s_i) is the largest value and should be deleted. Since we do not know the second largest value for that bucket, the standard sketch cannot handle deletion.

We can modify Algorithm 3 such that for each bucket we keep track of all ρ(s_i) values in order to support deletion. Since V_j^X is the maximum over all those values (Definition 2), and knowing the second largest value for each bucket is crucial when a deletion happens, rather than keeping only the maximum value in each bucket (Figure 1) we keep all ρ(s_i) values in a max-heap, as shown in Figure 9. Interestingly, we can show that by maintaining a heap of constant size (at most ℓ − m) for each bucket we are able to support incremental deletion. Recall that during HLL sketch construction, we apply a hash function h : dom(X) → {0, 1}^ℓ to each X[i] ∈ dom(X), which returns ℓ bits s_i, and if m is the number of bits for the buckets, ρ(s_i) is always an integer no larger than ℓ − m (1 ≤ ρ(s_i) ≤ ℓ − m). For example, with a 64-bit hash function and m = 0, 1 ≤ ρ(s_i) ≤ 64. In the worst case, if we keep all distinct ρ(s_i) values, we are required to keep only ℓ − m values, which requires (ℓ − m) log(ℓ − m) bits. This explains why incremental deletion can be supported with only a heap of constant size (at most ℓ − m).

One issue we need to consider is that X[i] and X[k] may be assigned to the same bucket b_j with ρ(s_i) equal to ρ(s_k). In this case, if V_j^X = ρ(s_i) and X[i] is deleted, the value of bucket b_j (V_j^X) should still be ρ(s_i), because X[k] ∈ dom(X) still exists and ρ(s_k) = ρ(s_i). To handle this scenario while keeping the max-heap size limited to ℓ − m, we keep a counter for each node in the heap. For example, in Figure 9, since there is only one X[i] with ρ(s_i) = 4, the value 1 is attached to node 4. A node in the heap is deleted if its counter is one; otherwise we just reduce the counter by one.

Figure 9: HLL sketch of the column X for deletion support.
Figure 10: Cardinality of the columns in TPCH- and TPCDS-3 (x-axis: columns; y-axis: cardinality, log scale).
Figure 11: Cardinality of the columns in Real-3 and Real-4 (x-axis: columns; y-axis: cardinality, log scale).

Space complexity: For each bucket b_i the heap size is at most ℓ − m, and each value in the heap is an integer of at most ℓ − m. Thus each bucket needs only (ℓ − m) log(ℓ − m) bits, and the total memory to store the sketch S_X is O(2^m (ℓ − m) log(ℓ − m)). Since the space complexity of the original HLL is O(2^m log(ℓ − m)) (§2.2), we are able to support deletion with a constant-factor overhead.

5. EXPERIMENTS

The main goals of our experimental evaluation are: (a) Evaluate the accuracy of the proposed estimator for inclusion coefficient against the existing approach using Bottom-k sketches [3]. (b) Evaluate the impact of using inclusion coefficient estimates instead of exact inclusion coefficients on the precision and recall of two existing foreign-key detection techniques [, 29]. The key takeaways from the experiments are:
• On databases containing columns with large cardinality, the proposed estimator results in substantially smaller estimation error (ranging from . to .5) compared to the Bottom-k sketch approach (where errors range from .3 to .59).
• As expected, Bottom-k performs well when the cardinalities of the columns are small (less than k). In databases where most columns have cardinality smaller than k, the estimation error of the proposed estimator ranges between .4 and .7 and the error of Bottom-k ranges between .3 and .6.
• The running time of the proposed estimator is comparable to the approach that uses Bottom-k sketches.
• When we use the proposed estimator instead of the exact inclusion coefficients, it has only a small impact on the F-measure of the two foreign-key algorithms.
In contrast, using the estimator based on Bottom-k sketches significantly reduces the F-measure in some databases.

5.1 Setup

Hardware and Platform: We implemented the algorithm for constructing the sketches on Microsoft's big data system Cosmos []. Cosmos is designed to run on large clusters consisting of thousands of commodity servers, and it is based on the map-reduce model to achieve parallelism. The rest of our experiments, including the estimation of the inclusion coefficient between pairs of columns using sketches, were performed on a 6-core, 2.2 GHz Intel Xeon E machine with 384 GB of RAM.

Datasets: Table 3 shows the details of the datasets we used in our experiments. Specifically, we used TPC-H with GB scaling factor and TPC-DS with 3GB, 5GB, and 2GB (2TB) scaling factors as benchmark databases. We also evaluated our technique on four real-world databases: Real-1 and Real-2 are publicly available in [23], and Real-3 and Real-4 are real-world customer databases. Figures 10 and 11 show the distribution of column cardinalities for the synthetic databases (TPCH-, TPCDS-3) and the largest real databases (Real-3, Real-4). As these figures show, in TPCDS-3, Real-3 and Real-4 there are many columns with small and with large cardinality.

Table 3: Databases used in experimental evaluation. Columns: Database, # of tables, # of columns, # of FKs, Size. Rows: TPCH (GB), TPCDS (GB), TPCDS (GB), TPCDS (TB), Geeea (Real-1) (MB), Mondial (Real-2) (MB), Real-3 (# of FKs unknown, 8GB), Real-4 (# of FKs unknown, 7GB).

Bottom-k: Since the inclusion coefficient estimator in [3] used Bottom-k sketches, we refer to it as Bottom-k in the rest of this section. In brief, the inclusion coefficient of X and Y (Φ(X, Y) = |X ∩ Y| / |X|) can be calculated using the Jaccard coefficient, i.e., Φ(X, Y) = φ(X, Y) / φ(X ∪ Y, X). They use Bottom-k sketches to estimate the Jaccard coefficient, φ(X, Y) = |X ∩ Y| / |X ∪ Y|. Another approach to estimating Φ(X, Y) is to divide the estimated value of |X ∩ Y| [5] by the estimated value of |X|.
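The Jaccard-based route can be illustrated with a small sketch. It is a simplified rendering of the general bottom-k idea, not necessarily the exact estimator of [3]; the helper names are illustrative, and a SHA-256 prefix is used as a stand-in 64-bit hash purely to keep the example deterministic.

```python
import hashlib


def h64(value: str) -> int:
    """Deterministic 64-bit hash (SHA-256 prefix, used here as a stand-in)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def bottom_k(values, k: int) -> set:
    """Keep the k smallest distinct hash values of a column."""
    return set(sorted({h64(v) for v in values})[:k])


def inclusion_estimate(sk_x: set, sk_y: set, k: int) -> float:
    """Estimate |X ∩ Y| / |X| as J(X, Y) / J(X ∪ Y, X) from two sketches."""
    union_sk = set(sorted(sk_x | sk_y)[:k])   # bottom-k sketch of X ∪ Y
    in_x = union_sk & sk_x                    # sampled hashes that belong to X
    in_both = in_x & sk_y                     # sampled hashes in both X and Y
    return len(in_both) / len(in_x) if in_x else 0.0


x = ["a", "b", "c", "d"]
y = ["c", "d", "e"]
phi = inclusion_estimate(bottom_k(x, 16), bottom_k(y, 16), 16)
```

Both Jaccard estimates share the same denominator (the size of the union sample), so it cancels in the ratio. When k exceeds the number of distinct values, as in this toy input, the sketches hold every hash and the estimate is exact.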
The authors in [3] have shown that the Bottom-k approach is more accurate than the estimator based on [5], so we only compare the accuracy of the proposed estimator against the estimator in [3].

Setting k in Bottom-k: We used a 64-bit hash function (MurmurHash) for the sketch construction of both HLL and Bottom-k. For a fair comparison we fixed the amount of memory per column used by both estimators (the proposed estimator and Bottom-k). In §3.3 we discussed how to pick the number of buckets. For example, in TPCDS-3 the maximum number of buckets used by our algorithm is 2^13, and since we used a 64-bit hash function and HLL keeps the position of the leftmost 1, each bucket requires only log 64 = 6 bits. Since we keep all the sketches for m ∈ {0, 1, …, 13}, the total memory is 2^14 × 6 bits (12.28 KB). On the other hand, Bottom-k keeps the k smallest hash values, i.e., k × 64 bits. Thus, to ensure equal memory, we configure k in Bottom-k as k = (2^14 × 6)/64 = 1536. Table 4 shows the m and k parameters for each database.

5.2 Comparing the Proposed Estimator and Bottom-k

5.2.1 Execution Time

Figure 12 shows the execution time of estimating Φ(X, Y) for all pairs of columns belonging to different tables using the proposed estimator and Bottom-k. The total execution time depends on the number of such column pairs, and hence varies across the different databases. For example, the proposed estimator computes the inclusion coefficient of the 7,635,8 column pairs in Real-3, which takes roughly 6 seconds, i.e., .3 s per column pair. Not surprisingly, for each database, the execution times of the proposed estimator and Bottom-k are quite close to each other (the proposed estimator is slightly more expensive computationally).
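Returning to the equal-memory configuration described under "Setting k in Bottom-k", the arithmetic can be checked with a short sketch. It assumes a largest bucket exponent of 13, a 64-bit hash, and the 2^(l+1) approximation for the total bucket count across all sketches; the function names are illustrative.

```python
import math


def hll_total_bits(max_m: int, hash_bits: int = 64) -> int:
    """Approximate bits needed to keep one HLL sketch per m in 0..max_m."""
    bits_per_bucket = int(math.log2(hash_bits))   # 6 bits for a 64-bit hash
    # Assumption: total bucket count approximated by 2**(max_m + 1).
    return 2 ** (max_m + 1) * bits_per_bucket


def matching_k(total_bits: int, hash_bits: int = 64) -> int:
    """Number of full hash values a Bottom-k sketch can store in the same space."""
    return total_bits // hash_bits


budget = hll_total_bits(13)
k = matching_k(budget)
```

With these inputs the budget is 98,304 bits (12,288 bytes) and k comes out as 1536, so each Bottom-k entry costs as much memory as roughly ten HLL buckets.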

Figure 12: Execution time of estimating Φ(X, Y) for all pairs of columns (y-axis: estimation execution time in seconds, per database).
Figure 13: Average of estimated Φ(X, Y) (per database).
Figure 14: Average of estimated Φ(X, Y) in TPCDS-3: all pairs of columns, k = 1536 (x-axis: exact inclusion coefficient).
Figure 15: Average of estimated Φ(X, Y) in TPCDS-3: |X| ≤ t, |Y| ≤ t, t = 5, k = 1536 (x-axis: exact inclusion coefficient).
Figure 16: Average of estimated Φ(X, Y) in TPCDS-3: |X| > t, |Y| > t, t = 5, k = 1536 (x-axis: exact inclusion coefficient).
Figure 17: Average of estimated Φ(X, Y) in TPCDS-3: |X| > t, |Y| ≤ t or |X| ≤ t, |Y| > t, t = 5, k = 1536 (x-axis: exact inclusion coefficient).
Figure 18: Average of estimated Φ(X, Y) in TPCDS-3: |X| > t, |Y| > t, k = 1536 (x-axis: cardinality threshold t, in multiples of k).
Figure 19: Average of estimated Φ(X, Y) in TPCDS-3: |X| > t, |Y| ≤ t or |X| ≤ t, |Y| > t, k = 1536 (x-axis: cardinality threshold t, in multiples of k).

Table 4: Parameter m is the number of bits that define a bucket in HLL, and k is the number of hash values kept for Bottom-k. Columns: TPCH-, TPCDS-, TPCDS-, TPCDS-, Real-1, Real-2, Real-3, Real-4. Rows: m, k.

Estimation

Next we compare the average error of the two estimators. The estimation error is the difference between the estimated inclusion coefficient Φ̂ and its actual value Φ(X, Y), i.e., |Φ(X, Y) − Φ̂|. Recall that both estimators rely on sketches that use an equal amount of memory (§5.1). As shown in Figure 13, the estimation error in the databases where most columns have small cardinality (Real-1, Real-2, and TPCH-) is almost the same for the proposed estimator and Bottom-k. However, the proposed estimator results in substantially smaller error for the other databases (TPCDS-3, Real-3, and Real-4), where many columns have large cardinality (significantly larger than k). To better understand these results, Figures 14-17 break down the result for the TPCDS-3 database.
In Figure 14, the x-axis shows buckets, where each bucket (say a-b) contains those column pairs whose exact inclusion coefficients are in [a, b), and the y-axis represents the absolute error of the proposed estimator and Bottom-k for those column pairs. For example, .2-.3 on the x-axis is the bucket containing those column pairs whose exact inclusion coefficients are in [.2, .3), and the error of both the proposed estimator and Bottom-k (k = 1536 in Table 4) is around .2. The results show that in TPCDS-3, when the exact inclusion coefficient is smaller than .5, Bottom-k is slightly better than the proposed estimator; however, the proposed estimator performs better when the inclusion coefficient is large. In applications like foreign-key detection, where high containment (e.g., inclusion coefficient ≥ .8) is an important signal, an estimator with low error for large coefficients is preferable. Next, we show that the reason is the cardinality of the columns. As we discussed in §2.., the error of the inclusion coefficient estimators depends on the cardinality of the columns.

Analyzing error based on cardinality threshold t: In Figures 15-17, we evaluate the error of the estimators over three different sets of column pairs in TPCDS-3 based on a cardinality threshold t = 5 (we later vary t in Figures 18 and 19). In particular, Figure 15 only considers the column pairs (X, Y) where both columns have cardinality less than 5. Figure 16 considers the column pairs where both columns have cardinality more than 5. Figure 17 considers pairs of columns where the cardinality of one column is smaller than 5 and the cardinality of the other column is more than 5. When the cardinalities of the columns are small, Bottom-k performs better than the proposed estimator (Figure 15). In effect, for columns with cardinality less than k, if a good hash function is used, Bottom-k keeps the hashes of all distinct values, thereby behaving like a full hash table of the column. On the other hand, for large columns (Figure 16) and for pairs where one column is large and the other is small (Figure 17), the proposed estimator outperforms Bottom-k significantly.
Figures 18 and 19 show the error of the proposed estimator against Bottom-k when the cardinality threshold t in the above experiments is varied. We express the cardinality threshold t as a multiple of k in Bottom-k. Figure 18 shows the error for those columns whose cardinalities are more than t, and Figure 19 shows the error when the cardinality of one column is smaller than t and the cardinality of the other column is more than t. Figure 18 shows that for large columns the proposed estimator significantly outperforms Bottom-k. In the unbalanced case (Figure 19), when t is exactly k, Bottom-k performs slightly better than the proposed estimator, but as t grows the proposed estimator outperforms Bottom-k.

Effect of deletion: Next we evaluate the error and scalability of the version of the proposed estimator that uses a HLL sketch supporting data deletions (discussed in §4) vs. Bottom-k. Recall that we keep a constant size
