Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches

Size: px
Start display at page:

Download "Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches"

Transcription

1 Efficient Estiation of Inclusion Coefficient using HyperLogLog Sketches Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri Microsoft Research {aznazi, bolind, viveknar, ABSTRACT Efficiently estiating the inclusion coefficient the fraction of values of one colun that are contained in another colun is useful for tasks such as data profiling and foreign-key detection. We present a new estiator,, for inclusion coefficient based on Hyperloglog sketches that results in significantly lower error copared to the state-of-the art approach that uses Botto-k sketches. We evaluate the error of the estiator using experients on industry bencharks such as TPC-H and TPC-DS, and several realworld databases. As an independent contribution, we show how Hyperloglog sketches can be aintained increentally with data deletions using only a constant aount of additional eory. PVLDB Reference Forat: Azade Nazi, Bolin Ding, Vivek Narasayya, Surajit Chaudhuri. Efficient Estiation of Inclusion Coefficient using HyperLogLog Sketches. PVLDB, (): 97-9, 28. DOI: INTRODUCTION The discovery of all inclusion dependencies in a dataset is an iportant part of data profiling efforts. It is useful for tasks such as foreign-key detection and data integration [7, 28, 4, 22, 2]. However, due to issues in data quality such as issing values or ultiple representations of the sae value, it becoes iportant to relax the requireent of exact containent. Thus, coputing the fraction of values of one colun that are contained in another colun inclusion coefficient is of interest to these applications. When the database schea and data sizes are large, coputing the inclusion coefficient for any pair of coluns in the database can be both coputationally expensive and eory intensive. One approach for addressing this challenge is to estiate the inclusion coefficient using only bounded-eory sketches of the data. Given a fixed budget of eory per colun, these techniques scan the data once, and copute a data sketch that fits within the eory budget. For a given pair of coluns X and Y, the inclusion coefficient (X, Y ) is then estiated using only the sketches on coluns X and Y. One such approach, adopted in Zhang et al. [3] is to build a Botto-k sketch [2] on each colun, and develop an inclusion Perission to ake digital or hard copies of all or part of this work for personal or classroo use is granted without fee provided that copies are not ade or distributed for profit or coercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific perission and/or a fee. Articles fro this volue were invited to present their results at The 44th International Conference on Very Large Data Bases, August 28, Rio de Janeiro, Brazil. Proceedings of the VLDB Endowent, Vol., No. Copyright 28 VLDB Endowent /8/6... $.. DOI: coefficient estiator using these sketches. Intuitively, Botto-k sketches have high accuracy for inclusion coefficient estiation when the nuber of distinct values in both X and Y are saller than k since the sketches effectively behave like hash tables of the respective coluns. However, as we show epirically in this paper, one of the liitations of using Botto-k sketches for estiating inclusion coefficient is that for given eory budget (of k values), as the cardinality (i.e. nuber of distinct values) of colun X or Y increases beyond k, the estiation error becoes large. Additionally, Botto-k sketches are not aenable to increental aintenance in situations where data is deleted. For instance, in data warehousing scenarios, it is not uncoon for recent data to be added and older data to be reoved fro the database. Developing an estiator with low error for inclusion coefficient using bounded-eory sketches is challenging. We first establish a hardness result showing that any estiator that relies only on sketches with bounded eory ust incur unbounded error in certain cases. Intuitively, the difficult cases are when colun X has sall cardinality and colun Y has large cardinality or vice-versa. Furtherore, as we report in our epirical evaluation, these difficult cases appear to be quite coon in the real-world databases that we have studied. We develop a new estiator for inclusion coefficient (Binoial Mean Lookup) based on Hyperloglog sketches [4]. Hyperloglog (HLL) sketches can be coputed efficiently and within a given eory budget, requiring invocation of only a single hash function for each value in the colun. HLL sketches are becoing increasingly popular for estiating cardinality in different applications, and are even being adopted in coercial database engines e.g., [6] uses Hyperloglog to support approxiate distinct count functionality. The ain idea behind the estiator is a theoretical result that establishes a apping fro the inclusion coefficient (X, Y ) to the probability that the value of a bucket in the HLL sketch of colun Y is greater than the value of the corresponding bucket in the HLL sketch of colun X., is based on axiu likelihood estiation (MLE) ethod. It observes the nuber of buckets in the HLL sketch of colun Y whose HLL value is greater than the HLL value of the corresponding bucket in colun X, and it returns the value of (X, Y ) that axiizes the likelihood of this observation. As a by-product of our estiation technique, we are also able to provide a bound on the error. This error bound is data-dependent, i.e. it is specific to the pair of coluns X and Y. Such an error bound can be valuable to applications that consue the estiates, and is not provided by prior techniques. Hyperloglog sketches can be aintained increentally in the presence of insertions. An independent contribution of this paper is a technique for increentally aintaining a Hyperloglog sketch in the presence of data deletions with a constant eory overhead. 97

2 The key idea is that each bucket in an HLL sketch always holds an integer value less than a constant `, where ` is the nuber of bits of the hash value. By aintaining a ax-heap of constant size (at ost `) for each bucket we are able to support increental deletion. We show through experients on real-world databases and industry benchark TPC-H and TPC-DS databases that has significantly lower overall error than the approach using Bottok sketches [3], particularly in cases where at least one of the coluns has large cardinality. For cases where both coluns have sall cardinality, the accuracy of both estiators are siilar. For instance, in two real-world databases where there are any coluns with sall and large cardinality, the average error using Botto-k sketches is.3 and.59 respectively, whereas the corresponding errors for are only. and.4. For the other two realworld databases where ost coluns have sall cardinality the average error using Botto-k sketches is.5 and.2 respectively, whereas the corresponding errors for are.6 and.4. Finally, we consider one iportant application of inclusion coefficients, naely the proble of foreign-key (FK) detection in a database. All prior work on FK detection [29, 3, ] relies on exact inclusion coefficients to prune the FK candidates. We show epirically on several real-world and benchark databases that the estiation error of is acceptable for these FK detection techniques. In other word, replacing the exact inclusion coefficient with an estiate obtained via the estiator has no noticeable ipact on the precision and recall of these FK detection algoriths. In suary, this paper akes the following contributions: We establish a hardness result for inclusion coefficient estiation using bounded-eory sketches. We develop an MLE-based estiation algorith called (Binoial Mean Lookup) for inclusion coefficient estiation based on Hyperloglog sketches [4]. We prove the correctness of our algorith and the error bound of our estiator. We show how, with a constant eory overhead, Hyperloglog sketches can be extended to support increental deletion. We ipleent our inclusion coefficient estiator () on Cosos [], a distributed, Big Data engine that is extensively used within Microsoft for analyzing large data sets; and evaluate the effectiveness of using several synthetic and real-world datasets. We easure the precision and recall of two existing foreign-key detection techniques when using inclusion coefficient estiates rather than the exact inclusion coefficients. The rest of the paper is organized as follows. We present a hardness result for inclusion coefficient estiation using sketches and introduce background of Hyperloglog sketch construction in 2. We describe the estiator for inclusion coefficient and its error analysis in 3. The extension to support increental deletion is presented in 4. We describe the results of our experiental evaluation of inclusion coefficient estiation and its application to FK detection in 5. We discuss related work in 6 and conclude in PROBLEM DEFINITION AND BACKGROUND Let database D be the collection of tables T, where the set of all coluns in tables T are C. Let n be the total nuber of coluns ( C = n). For each colun X 2C, the set of all its possible values is called the doain of X, denoted by do(x). We use X[i] as the value of colun X for tuple i. We forally define the inclusion coefficient estiation proble and we discuss the hardness result. 2. Inclusion Coefficient Estiation and Hardness Result Inclusion coefficient is defined between two sets of values, and is used to easure the fraction of values of one set that are contained in the other set. DEFINITION. (Inclusion Coefficient) Given two coluns/sets, X and Y, the inclusion coefficient of X and Y is defined as: (X, Y )= X \ Y, X 6= () X where represents the nuber of distinct values in a set. If X is fully covered by Y (X Y ), (X, Y )=; otherwise apple (X, Y ) <. Note that (X, Y ) is asyetric: in general, (X, Y ) is not equal to (Y,X). In tasks such as foreign key detection and data profiling, we need to calculate inclusion coefficients for any pairs of coluns, which could be too expensive (in ters of both tie and eory) for large datasets. As a trade-off between accuracy and perforance, the idea is to construct a sketch (or a copact data structure) for each colun C 2Cby scanning the data once, and estiate the inclusion coefficient between any pair of coluns using their sketches. Such sketching and estiation techniques are useful in two scenarios: i) coluns are too large to fit into eory; and ii) we want to copute inclusion coefficients for any pairs of coluns (e.g., there are n coluns and inclusion coefficients need to be calculated for all the n(n )/2 pairs). PROBLEM. (Estiating Inclusion Coefficient using sketches) For each colun C 2C, we construct a sketch S C by scanning C once. Then for any two coluns X and Y, we want to derive an estiator ˆ(S X, S Y ) to (X, Y ), by accessing only the two sketches. In the rest part, we write ˆ(S X, S Y ) as ˆ if X and Y are clear fro the context. Table shows frequently used notations. 2.. Estiation and Sketch Size Suppose we estiate the inclusion coefficient (X, Y ) of two coluns X and Y as ˆ, using their sketches. The estiation error (X, Y ) ˆ ranges fro to. Considering the randoness in the sketch construction, ideally, we would like the estiation error to be bounded with high probability, i.e., (X, Y ) ˆ apple with probability at least, for any given two coluns X and Y. Unfortunately, we can show that, unless the sketch size is linear in the nuber of distinct values (which could be equal to the nuber of rows), there is no sketch based on which the worst-case estiation error can be bounded with <and <. A lower bound of sketch size: The hardness can be observed even in a very siple case when X = {x} contains only one eleent, and Y is large. In this case, (X, Y ) takes value either (if x /2 Y ) or (otherwise). Therefore, to bound the estiation error below any constant less than, fro the sketches of Y, we need to distinguish two cases, i) x /2 Y or ii) x 2 Y, with high probability this is exactly the approxiate ebership proble [9]. More forally, we can prove the following hardness result. THEOREM. We pre-copute a sketch for each colun in a database to estiate the inclusion coefficient for pairs of coluns. If we require that, for any two given coluns X and Y 2 C, (X, Y ) ˆ apple < with probability at least >, then any sketch ust use space at least (n C log(/ )) bits per colun C 2C, where n C is the nuber of distinct values in C. PROOF. Fro the above discussion, we can reduce approxiate ebership proble to our inclusion coefficient estiation proble: for each colun Y of size n, we want to construct a sketch 98

3 to support ebership queries, i.e., check whether x 2 Y for any given x, with a false positive rate at ost. Suppose we want to estiate the inclusion coefficient (X, Y ) using sketches of X and Y. Consider the case when X = {x}. Indeed, (X, Y )=if x 2 Y, or if x /2 Y. Therefore, to ensure that the estiation error is less than, there ust be no false positive to the ebership query x 2 Y (with probability less than ) [9] shows that, for this purpose, any sketch ust use space at least (n log(/ )) bits. So the proof is copleted. Reark: [25] gives an even tighter lower bound of sketch sizes when the nuber of distinct values in each colun is unknown. 2.2 HLL Sketch Construction As shown in Theore, the worst-case error cannot be bounded based on sublinear-size sketches. However, the hope is that, for particular instances of X and Y, the estiation errors still could be uch better than the worst-case error. Hyperloglog (HLL) sketch [4, 5] provides a near-optial way to estiate cardinality (i.e. the nuber of distinct values in a set). In 3, we will show how to use HLL sketches to estiate inclusion coefficients. We first review how to construct the HLL sketch of a colun. Let h : do(x) {, }` be a hash function that returns ` bits for each value X[i] 2 do(x). For each hash value s i = h(x[i]), we find the position of the leftost represented by (s i) and the axiu of { (s i) : s i = h(x[i])} is the HLL sketch of the colun X which is used for cardinality estiation [4, 5]. One way to reduce the variance of the estiation is to use ultiple hash functions [5]; however, [5] proposed stochastic averaging that eploying only a single hash function eulates the effect of using different hash functions. They use one hash function but take the first bits to bucketize hash values to iic 2 hash functions and reduce the variance. Then the reaining ` bits of each hash value s i = h(x[i]) are used to find the position of the leftost as (s i). Algorith 3 in Appendix A shows the steps to construct such sketches for a set of coluns C. In 3.3, we discuss how to choose the paraeter for given two coluns X, and Y. X Buckets i j (X[i]) s i : (X[j]) s j : (s i ) = 3 (s j ) = 4 Max( (s i ), (s j )) b b 2^ = 4 Figure : Constructing the HLL sketch of the colun X as S X. For exaple, as shown in Figure, one hash function h is used for all values in colun X, and the first bits are used to bucketize hash values. As a result there are 2 buckets (b,b 2,,b 2 ) in HLL sketch of colun X represented by S X. Specifically, in this figure s i = h(x[i]) and s j = h(x[j]) are assigned to bucket b because their first bits are the binary representation of the value one. Moreover, (s i)=3and (s j)=4because the position of the leftost in the reaining ` bits of the hash values s i and s j are 3, and 4 respectively. If the s i and s j are the only hash values assigned to b, the final value in b is V X = ax(3, 4) = 4. The foral definition of Vi X is given by Definition 2. DEFINITION 2. (Vi X : HLL value of colun X in bucket b i). Let s j be the hash value of the tuple j in X whose first bits (s j[,...,]) indicate it belongs to bucket b i, i.e., s j[,...,] is the binary representation of the value i ((s j[...]) 2 = i), and (s j) be the leftost one in the reaining ` bits of the hash value s j (s j[ +,...,`]). The HLL value of colun X for the bucket b i is defined as: Table : Suary of frequently used notations n Total nuber of coluns nx Nuber of distinct values in colun X S X Sketch of colun X (X, Y ) Inclusion coefficient of coluns X and Y ˆ Estiation of (X, Y ) h ` bits Hash function Nuber of bits assigned for bucketization 2 Nuber of buckets in HLL sketch Vi X HLL value of colun X in bucket b i V X HLL value of colun X when there is only one bucket ˆP Estiated value of pr(v X apple V Y ) e p Estiation error of pr(v X apple V Y ) e Estiation error of (X, Y ) Vi X = ax (s j ) (2) s j :(s j [...]) =i 2 Note that when there is only one bucket we ignore index i and denote the HLL value of colun X by V X. Space coplexity: In the HLL sketch of the colun X (S X) the HLL value of each bucket is an integer nuber ( apple Vi X apple ` ). Thus each bucket needs log(` ) bits to store Vi X, i.e., total eory to store the sketch S X is O(2 log(` )). 3. EFFICIENT ESTIMATION OF INCLU- SION COEFFICIENT In this section, we describe a technique to estiate the inclusion coefficient using Hyperloglog (HLL) sketches. In a preprocessing step, a HLL sketch [4, 5] is constructed for all coluns by scanning the data only once in practice these sketches are copact enough to be able to fit into eory. Our algorith ai to estiate the inclusion coefficient of two coluns X and Y using pre-coputed HLL sketches of X and Y. We also produce an error bound for the estiate. 3. Overview of our Approach The ain idea of our algorith is to produce an estiate of the inclusion coefficient between two coluns X and Y by coparing their HLL values (Definition 2). To develop intuition on why coparing the HLL values of X and Y can help in finding the inclusion coefficient, we start with the case of a HLL sketch with a single bucket, and suppose that X and Y have the sae nuber of distinct values. We exaine the following two extree cases. i) if X = Y or (X, Y )=, we have pr(v X apple V Y )=pr(v X = V Y )=, because the hash function h in HLL is applied on the sae set of values for both X and Y ; and ii) if X \ Y =? or (X, Y )=, pr(v X apple V Y ) is at least.5. A useful observation here is that pr(v X apple V Y ) increases onotonically as (X, Y ) increases (we prove it forally later). For exaple, in Figure 2, we plot pr(v X apple V Y ) as a function of (X, Y ) when X = Y = introduces how to derive this function for general cases. Clearly given bucket i for each coluns X and Y, the event Vi X <= Vi Y is a Bernoulli trial. When there are ultiple buckets since buckets are independent, the events Vi X <= Vi Y are independent Bernoulli trials [5]. The reason that for a given colun the buckets are independent is the result of the HLL construction 2.2. fro property of the universal hashing, first bits of a hash value is independent of the rest ` bits. The intuition behind our algorith is that, using ultiple buckets in the HLL sketches of X and Y, we first estiate pr(v X apple V Y ) given the fact that the event Vi Y is an independent Bernoulli trial for each bucket b i (e.g., estiating pr(v X apple V Y ).8 in Figure 2). Then, since pr(v X apple V Y ) can be written as a function of the inclusion coefficient, we effectively lookup the value of 99

4 (X, Y ) that produces the estiated pr(v X apple V Y ). For exaple, in Figure 2 by looking up we get (X, Y ).62). Maxiizing the likelihood. Our algorith is based on the axiu likelihood estiation (MLE). More forally, let Z = {i Vi Y } be the nuber of buckets (aong all 2 buckets) where Vi Y. The rando variable Z follows a distribution paraeterized by X, Y, and (X, Y ). We observe Z = z fro HLL sketches of X and Y, and we want to choose (X, Y ) to axiize the likelihood of our observation. le = argax pr (Z = z (X, Y )= ). (3) Next we introduce our MLE-based estiation algorith called (Binoial Mean-Lookup). Suppose X and Y are known (or estiated fro their HLL sketches), there are still two reaining questions. First, we need to characterize the distribution of Z with (X, Y ) as a paraeter ( 3.2.). Second, we need an efficient algorith to axiize the likelihood as in Equation 3 ( 3.2.2). We prove the correctness of our algorith, i.e., show that our algorith indeed gives a solution to Equation 3, and analyze its estiation error in Finally, we briefly discuss how can leverage additional eory if available to iprove accuracy ( 3.4). 3.2 : Binoial Mean-Lookup Estiator As shown in Figure 2, pr(v X apple V Y ) increases onotonically as (X, Y ) increases. We first show how to derive the closed for of pr(v X apple V Y ) as a function of (X, Y ) then we discuss the detail of our inclusion coefficient estiator Calculating pr(v X apple V Y ) Given coluns X and Y, V X and V Y are both rando variables. Let us first consider a sipler case where there is only one rando variable V X and discuss how to find pr(v X k), where k is constant. Then we show how to use this siple case to derive the general case pr(v X apple V Y ). Given colun X with n X distinct values, when there is only one bucket ( =), based on the Definition 2, the HLL value of the colun X is in fact the axiu over n X independent rando variables. As we discussed in 2.2, for each hash value s i, we find the position of the leftost represented by (s i) and the axiu of { (s i):s i = h(x[i])} is the HLL sketch of the colun X. Obviously every bit in the s i is a Bernoulli trial given that it is a rando experient with exactly two possible outcoes, and and fro property of the universal hashing every bit of a hash value is independent fro each other. Thus, the bits in s i are independent Bernoulli trials. Each rando variable (s i) represents the leftost one in s i, i.e, first one after (s i) zeros. Thus as also pointed out in [4] each rando variable is geoetrically distributed and pr(v X apple k) is: pr(v X apple k) = nx i= ( Pr(first k bits are zero)) = ( (4) By Equation 4, pr(v X = k) is pr(v X apple k) pr(v X apple k ). pr(v X = k) =( ( 2 k )nx (5) Next we show how to use these equations to derive pr(v X apple V Y ), where both V X and V Y are rando variables. When the intersection of X and Y is non epty V X and V Y are not independent. Let T be X \ Y. To resolve the dependency of X and Y, we consider three disjoint sets: T, X = X \T, and Y = Y \T. As shown in Figure 3 based on the cardinality of these sets there are three different cases. () X and Y are disjoint, i.e. n T =, n X = n X, n Y = n Y. (2) Y is subset of X, i.e., Y = Y \T is epty (n T = n Y, n X = n X n Y, n Y =). (3) X and Y are partially overlapped Y = Y \T is not epty (n T 6=,n X 6=, n Y 6= ). Next we show how to use Equations 4, and 5 to derive the pr(v X apple V Y ) for each case. Case: Clearly, if T = X \ Y is epty (n T =), then V X and V Y are independent. As shown in Figure 3, since we are interested in V X apple V Y if X and Y are disjoint and V Y = k, then V X should be at ost k. Based on Definition 2 we have apple k apple `. Thus the pr(v X apple V Y ) with the help of Equations 4 and 5 can be calculated by Equation 6. Note that since n T =, n X = n X and n Y = n Y. pr(v X apple V Y )= pr(v X apple k) ^ pr(v Y = k) = ( k= k= ( ( 2 k )n Y For exaple, by Equation 6 when X = Y = 4, n T =, (X, Y ) is the pr(v X apple V Y ).58 (Figure 2). Case2: If Y X, then T = Y and Y is epty (n Y =). Siilar to case, if V T = k, then V X should be at ost k, where k can be any value in [,` ]. So pr(v X apple V Y ) is as follows: pr(v X apple V Y )= pr(v X apple k) ^ pr(v T = k) = ( k= k= ( 2 k )nt ( 2 k )nt For exaple, when X = Y = 4, and n T = 4 the (X, Y ) is and the pr(v X apple V Y ) (Figure 2). Case3: Finally, if X and Y are partially overlapped, as shown in Figure 3, given k there three scenarios ) V X apple k, V T apple k,v Y = k, 2) V X apple k, V T = k, V Y apple k, and, and 3) V X apple k, V T = k, V Y = k. Thus the pr(v X apple V Y ) can be derived by Equation 8. pr(v X apple V Y )= pr(v X apple k) ^ pr(v T apple k ) ^ pr(v Y = k) k= + pr(v X apple k) ^ pr(v T = k) ^ pr(v Y apple k ) + pr(v X apple k) ^ pr(v T = k) ^ pr(v Y = k) ( = ( k= +( +( ( ( ( ( 2 k )nt 2 k )nt ( 2 k )nt ( ( ( 2 k )nt 2 k )nt 2 k )n Y ( (6) (7) 2 k )n Y 2 k )n Y (8) For exaple, when X = Y = 4, and n T = 62 the (X, Y ) is.62 and the pr(v X apple V Y ).8 (Figure 2). Multiple buckets: When there are 2 buckets ( >), inspired by stochastic averaging [5] we consider the average case where the HLL value of colun X in Definition 2 will be the axiu over on average nx independent rando variables for each bucket 2 i and Equations 4 and 5 are updated as: pr(vi X apple k) =( 2 k ) nx 2 (9) pr(vi X = k) =( 2 k ) nx 2 ( 2 k ) nx 2 () Later in 3.3, we discuss how considering the average case affects the nuber of buckets.

5 = Lookup = e P e P e e (.62)= e P e = Figure 2: X = Y = 4, Lookup approach Figure 3: For coluns X and Y ( X Y ) cases are () disjoint, (2) overlapped and (3) partially overlapped, where T = X \ Y, X = X \T, Y = Y \T Figure 4: e p and e are the error of estiating pr(v X apple V Y ) and inclusion coefficient (X, Y ) Efficient Algorith to Maxiize the Likelihood Recall fro Equation (3) that we forulate the proble of estiating inclusion coefficient as a axiu likelihood estiation proble, where we observe the nuber of buckets (aong all 2 buckets) where Vi Y is z and the goal is to choose (X, Y ) to axiize the likelihood of our observation. Here we propose our estiator,, to efficiently solve the MLE. has two ain steps. Step one is to estiate pr(v X apple V Y ) using only S X and S Y. Step two is to use lookup approach to ap the estiated probability into the inclusion coefficient. Before describing the details of these two steps, we first provide an overview of how works. As shown in Figure 2, pr(v X apple V Y ) increases by increasing the (X, Y ) (proof in theore 2). Clearly, if estiates pr(v X apple V Y ), it can be used to find the (X, Y ). For exaple, let us assue the estiated value of pr(v X apple V Y ) be.8 ( ˆP =.8). As shown in Figure 2, after projection to blue line ( 3.2.) estiates (X, Y ) to be.62. Algorith shows the detail of the two ain steps of. The input to this algorith is the HLL sketches of colun X and Y (S X, S Y ). In 3.3, we discuss how to choose nuber of buckets for these sketches. In the first step, in order to calculate ˆP, since the event Vi X apple in each bucket b i is an independent Bernoulli trial ( 3.). Thus ˆP is the ratio of the nuber of buckets where Vi Y to the total nuber of buckets (2 ) (lines 3-7). Later in Theore 4, we use Hoeffding inequality to provide the error bound of the ˆP. Vi Y Algorith : // inclusion coefficient estiator : Input: S X, S Y 2: //Step : Estiate pr(v X apple V Y ) using S X, S Y 3: z = 4: for b i, where i 2 [, 2 ] 5: if Vi Y then z = z + 6: ˆP = z 2 7: //Step 2: Lookup step 8: n X = Estiated nuber of distinct values fro S X 9: n Y = Estiated nuber of distinct values fro S Y : return ˆ(X, Y )=Lookup( ˆP,, in(n X,n Y ),n X,n Y ) In the second step, (Algorith ) calls Lookup function (Algorith 2) to estiate inclusion coefficient (X, Y ) as the one that produces the ˆP. The key idea is that since pr(v X apple V Y ) in an increasing function of (X, Y ) (Theore 2), given an estiated probability ˆP we can use binary search to estiate (X, Y ). Algorith 2 shows the pseudo code of the lookup approach. The input to this algorith are ˆP, ininc, axinc, n X, n Y, and where ˆP is the estiation of the pr(v X apple V Y ), ininc and axinc are the boundary of the search, n X, n Y are the estiated cardinality of the X and Y ( [4]), and is the error or tolerance. Algorith 2 is doing binary search over the possible intersection size n T and at each iteration based on the value of n T depends on which case it is, it uses the suitable Equation fro 3.2. (lines 5-8). For exaple, if X \ Y = ;, it uses Equation 6. Such siple bisection procedure for iteratively converging on a solution which is known to lie inside soe interval [a, b] has been introduced in [8] for root finding. They showed the nuber of iterations required to obtain an error saller than is. In our proble since ln(a b) ln ln 2 ln apple (X, Y ) apple, the nuber of iteration would be. For ln 2 exaple, Algorith 2 only needs 7 iterations to obtain an error saller than = 5. It worth entioning that the cost of each iteration is very cheap (analysis in 3.2.3). Algorith 2: Lookup // Map ˆP into ˆ(X, Y ) : Input: ˆP, ininc, axinc, nx,n Y, 2: Outputs: ˆ(X, Y ) 3: n T = ininc+axinc ; ˆ = nt 2 n X 4: if (n T = n X & n T = n Y ) then Prob =. //X = Y 5: else if (n T =) then Prob = Equation 6 //X \ Y = ; 6: else if (n X >n Y & n T = n Y ) then Prob = Equation 7 7: else Prob= Equation 8 8: if Prob ˆP apple return ˆ 9: if Prob > ˆP return Lookup( ˆP,n T,axInc,n X,n Y ) : if Prob < ˆP return Lookup( ˆP,inInc,n T,n X,n Y ) Algorith Correctness and Analysis We provide the following analysis to show the correctness, efficiency, and the error bound of. We prove pr(v X apple V Y ) is an increasing function of (X, Y ), i.e., in lookup step there is a one to one apping between the probability pr(v X apple V Y ) and the inclusion coefficient. In (3), we forulated the proble of estiating inclusion coefficient as axiu likelihood proble. We prove that the results of and the MLE forulation (3) are identical. We also provide the error bound of the for estiation of the pr(v X apple V Y ) and inclusion coefficient. Finally, we show the tie coplexity of the Algorith. Algorith Correctness: uses binary search in order to find the apping between the probability ˆP and the inclusion coefficient (Algorith 2). This approach only works if there is a one to one apping fro pr(v X apple V Y ) into (X, Y ). Theore 2 shows that probability pr(v X apple V Y ) ( 3.2.) is an increasing function of (X, Y ). Clearly an increasing function is a one to one function, hence is invertible.

6 THEOREM 2. Given two coluns X, and Y with intersection T, where T = n T, probability pr(v X apple V Y ) is an increasing function of (X, Y ). PROOF. Please find the proof in Appendix B. In Theore 3 we prove that the results of Algorith and the MLE forulation (3) are identical. THEOREM 3. The inclusion coefficient estiate ˆ fro Algorith and the inclusion coefficient estiate le fro MLE forulation (3) are identical. PROOF. Please find the proof in Appendix C. Bound of the Probability Estiation: Let P and ˆP be the exact and estiated value of the pr(v X apple V Y ) respectively. (Algorith ) calculates ˆP as the ratio of the nuber of buckets where Vi Y to the total nuber of buckets 2. As shown in Figure 4, this estiation ay produce error e p, e.g., ˆP =.8 while P =.8 ± e p. This figure also show that when ˆP =.8, the lookup step in returns.62 as the estiated inclusion coefficient while the actual one is.62 ± e (Figure 4). Thus, estiation error e p result in e, i.e., estiation error of the inclusion coefficient (X, Y ). Here we first show that given an error e p ( apple e p apple ), what is the probability that the estiation error of Algorith be at ost e p. Next we show how to use the e p in order to bound estiation error of the inclusion coefficient. THEOREM 4. Probability that estiation error of ˆP be at ost e p is: pr ˆP P applee p 2exp 2 + e 2 p PROOF. Please find the proof in Appendix D. For exaple, when =7and estiation error e p is.4, the pr( ˆP P applee p) is at least.95, i.e., with 95% confidence the estiation error is at ost.4%. Bound of the Inclusion Coefficient Estiation: Fro Figure 4 one can see that the slope of P (the blue line) at any point represent the ratio of e p to e, e.g., when ˆP =.8, the estiated inclusion coefficient is.62 and the slope of P at.62, P (.62), is ep. One can nuerically find this slope of P for a given a point e, e.g., P (.62) =.7. Moreover, as we discussed in theores 4, with 95% confidence the estiation error e p is.4. Thus e at this point can be calculated as.4 =.2. More forally at any.7 point we have: e = ep P ( ) () So far we have shown how to find e for a given point. Clearly in Figure 4 the slope of P at different points are different. Thus in order to bound e of two coluns X and Y the key observation is to first find the iniu and axiu slopes of P and then the ratio of e p to those slopes will be the bound of e. Table 2 shows the iniu and axiu slopes of the P and the e p bound for coluns X and Y with different cardinalities. For exaple, when the cardinality of both X and Y is 4, the iniu and axiu slopes of P, in Figure 4 are.24 and.79 and the iniu and axiu e are.4.4 =.2 and =.5, where the ep is.4. Figures 6 to 8 shows how the slope of P varies for the coluns with different cardinalities. One can see that the slope of P reduces as the difference of the cardinalities of the X and Y increases e.g., when X =, Y = 4 the iniu and axiu slopes of P are.62 and.72 respectively. Clearly, when the slope of P reduces, e will increase (Table 2). Tie coplexity of the Algorith: In Algorith, the estiation of pr(v X apple V Y ) using S X, S Y takes 2 steps Table 2: The iniu and axiu e for synthetic data (X, Y ) X Y in( ep ep ) ax( ) ax(e ) in(e ) e e (lines 4-6), where 2 is the nuber of buckets. As we discussed in ln the binary search in the lookup step takes iterations to ln 2 obtain an error saller than. The cost of each iteration is in O(` ) due to the fact that the Equations 6, 7, 8 used in Algorith 2 ( lines 6-8) is the su of ` values. Thus, the tie coplexity of Algorith is O(2 + ln (` )) which is linear in the ln 2 nuber of buckets. 3.3 Choosing Nuber of Buckets Algorith 3 (Appendix A) shows the steps to construct a HLL sketch for a fixed. In this section, we first discuss given two coluns X, and Y how to choose the paraeter (nuber of bits for the buckets). Next, we show how this algorith changes when we consider all pair of coluns since paraeter needs to be varied. Paraeter for a given colun pair: The HLL construction of colun X can be viewed as bins and balls proble [2], where buckets are the bins and distinct values in colun X are balls. As we discussed in 3.2., given colun X with n X distinct values, when there is only one bucket the HLL value of colun X is in fact the axiu over n X independent rando variables (Definition 2). When there are 2 buckets ( >), the HLL value of n colun X will be the axiu over, on average, X 2 independent rando variables for each bucket, i.e, balanced load in each bucket. In bins and balls proble [2], it is shown that as the nuber of bins increases the probability of balanced load will decrease. In other words, given n X balls and 2 bins the probability that all bins contains exactly nx 2 balls reduces as the nuber of bins increases. Thus in HLL construction of colun X, we should also expect that as the nuber of buckets increases the probability that all buckets have the sae load decreases. However, it is also shown in [5, 4] that having large nuber of buckets reduces the variance of cardinality estiation using a HLL sketch. Thus in one hand, having less nuber of buckets increases the probability of balanced load ( nx ), but in the other hand ore nuber of buckets reduces the variance of estiation. For the perfect 2 balanced load, there should be only one bucket, and for the lowest variance nuber of buckets should be n X ( =log(n X)). Given two coluns X, and Y, in this paper, as a trade-off between the variance and the balanced load we heuristically pick a id point between the best variance and the best balanced load by considering the nuber of buckets as equation 2. Note to say that it is a heuristic and it is not the optial choice. log(ax(nx,ny )) = (2) 2 Paraeter for ultiple pairs of coluns: Our goal is to efficiently estiate inclusion coefficient for all colun pairs in a database. When we consider ultiple pairs of coluns Equation 2 ight returns different for a given colun. For exaple, consider three coluns X, Y, and Z with cardinality n X,n Y, and n Z where cardinality of X is saller than Y and Z (n X <n Y, n X <n Z) and cardinality of Y and Z are not equal (n Y 6= n Z). In this exaple, for colun pair X and Y, paraeter is log(ny ), 2 2

7 Figure 5: X =5, Y = 4 Figure 6: X = 4, Y =5 Figure 7: X = 3, Y = 4 Figure 8: X = 4, Y = 3 while for colun pair X and Z paraeter is log(nz ). Thus, for 2 colun X the has two values log(ny ) and log(nz ). 2 2 To address this proble, we change Algorith 3 such that it still reads data only once but it keeps all the sketches for different 2{,,l} (value of k can be deterined by Theore 5). More specifically, after reading the data in line 4, we iterate over different values of fro to l and the rest of the algorith is the sae. Observe that even if we keep all sketches for fro to l, the eory required only doubles because P k = 2 log(`) = 2 l+ log(`), where 2 log(`) is the size of sketch for each colun ( 2.2). After constructing the sketches, given two coluns X and Y, we decide which produces better estiation of inclusion coefficient (Equation 2) and we pass those sketches to Algorith. Finally, we discuss how to set the paraeter l. Suppose the eory bound for each colun is M. In Theore 5, we prove that given colun X, bounded eory M, and ` bits hash function h, the paraeter l should be at ost ln( M log(`) ). THEOREM 5. Given colun X, bounded eory M, and ` bits hash function h, l apple ln( M log(`) ). PROOF. We hash each value X[i] 2 X only once but we generate l +sketches h for each 2{,,l}. Since we only keep the axiu for each bucket b j, the aount of eory for each bucket is log(`). Each sketch has 2 buckets so in total it requires P l i= 2i log(`) =2 l+ log(`) bits. Since M is the bounded eory, 2 l+ should be at ost gives us l apple ln( M log(`) ). M log(`),i.e. (2l+ apple M log(`) ) which 3.4 Discussion: Leveraging ore eory We briefly discuss how can leverage additional eory, when available, to iprove its accuracy. It is shown in [4] that increasing the nuber of buckets for HLL sketches reduces the variance of cardinality estiation. In other words, the bucketization (with stochastic averaging) eulates the effect of n hash functions with only one single hash function [5]. However, as discussed in 3.3, increasing the nuber of buckets reduces the probability of balanced load ( nx 2 ) which can ultiately reduce the accuracy. Thus, leveraging additional eory by increasing the nuber of buckets is not adequate for our proble. One approach is to cobine the use of ultiple hash functions and stochastic averaging in order to take advantage of additional eory. In other words, given two coluns X, and Y we fix nuber of buckets to 2, where can be find by Equation 2. Then we use ultiple hash functions in HLL construction. uses the sketches built by each hash function in order to estiate inclusion coefficient (Algorith ), and the final estiation of inclusion coefficient is the average of those results. A thorough epirical study of the trade-offs between estiation error and the above ethod of leveraging additional eory is an interesting area of future work. 4. EXTENSION OF HLL CONSTRUCTION TO SUPPORT DELETION HLL sketches are becoing increasingly popular for estiating cardinality in different applications, and are even being adopted in coercial database engines, e.g., [6] uses Hyperloglog to support approxiate distinct count functionality. However, in data warehousing scenarios, it is not uncoon for recent data to be added and older data to be reoved fro the database. Clearly HLL sketches can be aintained increentally in the presence of insertions. When a new data ite X[k] inserted to colun X, the sae Algorith 3 finds the hash value of the X[k] (s k = h(x[k])) and the affected bucket b j can be identified by the leftost bits of s k. It then finds (s k ) which is the position of the leftost in the l bits of s k and it updates the value of bucket j as Vj X =ax(s X[b j], (s k )) (Definition 2). For exaple, Figure shows the HLL sketch of the colun X, i.e., S X. let us assue a new value X[k] is added to colun X such that the first bits of the s k = h(x[k]) represents the first bucket (b ) in S X and (s k ) be 5. Thus the value of the b will be updated to ax(4, 5) = 5. On the other hand, when X[i] is deleted, siilar to insertion we can find which bucket is affected. Lets b j be the affected bucket. Since X[i] exists in database, (s i) apples X[b j]. If (s i) < S X[b j], no update is required but if (s i)=s X[b j], it eans (s i) is the largest and should be deleted. Since we do not know what the second largest value for that bucket, we cannot handle deletion. We can odify Algorith 3 such that for each bucket we keep track of all (s i)s in order to support deletion. Since Vj X is the ax over all those value (Definition 2), and when a deletion happens knowing the second largest for each bucket is crucial, as shown in Figure 9, rather than only keeping the axiu value in each bucket (Figure ), we keep all (s i)s in a ax-heap. Interestingly, we can show that by aintaining a heap of constant size (at ost `) for each bucket we are able to support increental deletion. Recall that during HLL sketch construction, we apply a hash function h : do(x) {, }` on each X[i] 2 do(x) which returns ` bits s i, and if is the nuber of bits for the buckets, (s i) is always an integer nuber saller than equal ` ( apple (s i) apple ` ). For exaple, if it uses 64 bits hash function and =, then apple (s i) apple 64. In the worst case, if we keep all distinct (s i)s, we are required to keep only ` values, which requires (` )log(` ) bits. This explains why with only a heap of constant size (at ost ` ) increental deletion can be supported. One issue we need to consider is that it is possible that X[i] and X[k] are assigned to sae bucket b j and (s i) is equal to (s j). In this case, if Vj X = (s i) and X[i] is deleted, the value of bucket b j (Vj X ) should still be (s i) because X[k] 2 do(x) still exists and (s j)= (s i). To handle this scenario and keep the ax heap size still liited to `, we keep a counter for each node in heap. For 3

8 Cardinality Log scale i j X Buckets (X[i]) s i : (X[j]) (s i ) = 3 s j : (s j ) = 4 b b 2^ Max-Heap 4 c 3 <4 Figure 9: HLL sketch of the colun X for deletion support. TPCH- TPCDS Coluns Figure : Cardinality of the coluns in TPCH- and TPCDS- 3 Cardinality Log scale Real-3 Real Coluns Figure : Cardinality of the coluns in Real-3 and Real-4 exaple in Figure 9, since there is only one X[i] that its (s i)=4, value is attached to node 4. So, a node in heap is deleted if the counter is one; otherwise we just reduce the counter by one. Space coplexity: For each bucket b i the heap size is at ost ` and each value in the heap is an integer nuber at ost `. Thus each bucket only needs (` ) log(` ) bits. Thus total eory to store the sketch S X is O(2 (` )log(` )). Since the space coplexity of the original HLL is O(2 log(` )) ( 2.2), with a constant overhead we are able to support deletion. 5. EXPERIMENTS The ain goals of our experiental evaluation are: (a) Evaluate the accuracy of the estiator for inclusion coefficient against the existing approach using Botto-k sketches [3]. (b) Evaluate the ipact of inclusion coefficient estiates instead of exact inclusion coefficient on precision and recall of two existing foreign detection techniques [, 29]. The key takeaways fro the experients are: On databases containing coluns with large cardinality results in substantially saller estiation error (ranging fro. to.5) copared to the Botto-k sketch approach (where errors range fro.3 to.59). As expected, Botto-k perfors well when the cardinality of the coluns are sall (less than k). In databases where ost coluns have cardinality saller than k, the estiation error of the range between.4 and.7 and the error of Botto-k ranges between.3 and.6. The running tie of the estiator is coparable to the approach that uses Botto-k sketches. When we use instead of the exact inclusion coefficients, it has only a sall ipact on the F easure of the two foreign-key algoriths. In contrast, using the estiator based on Botto-k sketches significantly reduces F easure in soe databases. 5. Setup Hardware and Platfor: We ipleented algorith for constructing the sketches on Microsoft s big data syste Cosos []. Cosos is designed to run on large clusters consisting of thousands of coodity servers and it is based on ap-reduce odel to achieve parallelis. The rest of our experients, including the estiation of inclusion coefficient between pairs of coluns using sketches, were perfored on a 6-core, 2.2 GHz, Intel Xeon E achine with 384 GB of RAM. Datasets: Table 3 shows the details of the datasets we used in our experients. Specifically, we used TPC-H with GB scaling factor and TPCDS with 3GB, 5GB, and 2GB(2TB) scaling factors as benchark databases. We also evaluated our technique on four real-world databases, Real- and Real-2 are publicly available in [23] and Real-3 and Real-4 are real-world custoer databases. Figures and show the distribution over the cardinality of the coluns over the synthetic (TPCH-, TPCDS-3) and the largest real databases (Real-3, Real-4). As these figures show, in TPCDS-3, Real-3 and Real-4 there are any coluns with sall and large cardinality. Table 3: Databases used in experiental evaluation. Database #of tables # of coluns # of FKs Size TPCH GB TPCDS GB TPCDS GB TPCDS TB Geeea(Real-) MB Mondial(Real-2) MB Real unknown 8GB Real unknown 7GB Botto-k: Since the inclusion coefficient estiator in [3] used Botto-k sketches, we refer to it as Botto-k in the rest of this section. In brief, the inclusion coefficient of X and Y ( (X, Y )= X\Y ) can be calculated using the Jaccard coefficient, i.e., (X, Y )= X '(X,Y ) '({X[Y },Y ). They use Botto-k sketches to estiate Jaccard coefficient, '(X, Y )= X\Y. Another approach to estiate (X, Y ) X[Y is to divide the estiated value of X \ Y [5] by the estiated value of X. The authors in [3] have shown that the Botto-k approach is ore accurate than the estiator based on the [5] so we only copare the accuracy of against the estiator in [3]. Setting k in Botto-k: We used a 64-bit hash function (MururHash ) for the sketch construction of both HLL and Botto-k. For the fair coparison we fixed the aount of eory per colun used by both estiators ( and Botto-k). In 3.3 we discussed how to pick nuber of buckets. For exaple, in TPCDS-3 the axiu nuber of buckets used by our algorith is 2 3, and since we used 64-bits hash function and HLL keeps the position of the leftost, so each bucket only requires log 64 = 6 bits. Since we keep all the sketches for 2{,,...,3}, total eory is bits (2.28 KB). On the other hand, Botto-k keeps the k sallest hash values, i.e, k 64 bits. Thus, to ensure equal eory, we configure the k in Botto-k as k = 24 6 =536. Table 4 64 shows the and k paraeters for each database. 5.2 Coparing and Botto-k 5.2. Execution Tie Figure 2 shows the execution tie of estiating (X, Y ) for all pair of coluns belonging to different tables using and Botto-k. The total execution tie depends on the nuber of such colun-pairs, and hence varies across the different databases. For exaple, coputes the inclusion coefficient of the 7,635,8 colun-pairs in Real-3 which takes roughly 6 seconds, i.e.,.3 s per colun-pair. Not surprisingly, for each database, the execution ties of the and Botto-k are quite close to each other ( is slightly ore expensive coputationally). 4

9 Es#a#on Execu#on #e in second Real- Real-2 TPCH- Real-3 Real-4 TPCDS-3 TPCDS-5 TPCDS-2 Bo2o-K Real- Bo(o-K Real-2 TPCH- Real-3 TPCDS-3 Real-4 TPCDS-5 TPCDS Botto-k Exact Inclusion Coefficient Botto-k Exact Inclusion Coefficient Figure 2: Execution tie of estiating (X, Y ) for all pair of coluns Figure 3: Average of estiated (X, Y ) Figure 4: Average of estiated (X, Y ) in TPCDS-3: all pair of coluns, k = 536 Figure 5: Average of estiated (X, Y ) in TPCDS-3: X applet, Y applet, t = 5, k = Botto-k Botto-k Botto-k Botto-k Exact Inclusion Coefficient Exact Inclusion Coefficient k* k*2 k*3 k*4 k*5 k*6 k*7 k*8 k*9 Cardinality Threshold (t) k* k*2 k*3 k*4 k*5 k*6 k*7 k*8 k*9 Cardinality Threshold (t) Figure 6: Average of estiated (X, Y ) in TPCDS-3: X >t, Y >t, t = 5, k = 536 Figure 7: Average of estiated (X, Y ) in TPCDS-3: X > t, Y apple t or X apple t, Y >t, t = 5, k = 536 Figure 8: Average of estiated (X, Y ) in TPCDS-3: X >t, Y >t, k = 536 Figure 9: Average of estiated (X, Y ) in TPCDS-3: X > t, Y apple t or X apple t, Y >t, k = 536 Table 4: Paraeter is the nuber of bits that define a bucket in HLL, and k and nuber hash values kept for Botto-k. TPCH- TPCDS- TPCDS- TPCDS- Real- Real-2 Real-3 Real k Estiation Next we copare the average error of the two estiators. The estiation error is the difference between the estiated inclusion coefficient ˆ and its actual value (X, Y ), i.e., (X, Y ) ˆ. Recall that both estiators rely on sketches that use an equal aount of eory ( 5.). As shown in Figure 3, the estiation error in the databases where ost coluns have sall cardinality (Real-, Real-2, and TPCH-) is alost siilar for the and Bottok. However, results in substantially saller error for the other databases (TPCDS-3, Real-3, and Real-4) where any coluns have large cardinality (significantly larger than k). To better understand these results, Figures 4-7 breakdown the result for TPCDS-3 database. In Figure 4, the x-axis shows the buckets that each bucket (lets say a b) contains those colun pairs whose exact inclusion coefficients are in [a, b), and the y-axis represents the absolute error of and Botto-k for those colun pairs. For exaple,.2.3 in x-axis is the bucket contains those coluns pairs whose their exact inclusion coefficients are in [.2,.3) and the error of both and Botto-k (k=536 in Table 4) is around.2. The results show that in TPCDS-3 when exact inclusion coefficient is saller than.5, Botto-k is slightly better than, however, perfors better when inclusion coefficient is large. In applications like foreign-key detection where the high containent (e.g., inclusion coefficient.8) is an iportant signal an estiator with low error for large coefficients is preferable. Next, we show that, the reason is the cardinality of the coluns. As we discussed in 2.., the error of the inclusion coefficient estiators depends on the cardinality of the coluns. Analyzing error based on cardinality threshold t: In Figures 5-7, we evaluate the error of the estiators over three different sets of colun-pairs in TPCDS-3 based on a cardinality threshold t =5(we later vary t in Figures 8, 9). In particularly, Figure 5 only considers the colun-pairs (X,Y ) where both coluns have cardinality less than 5. Figure 6 considers the colunpairs where both coluns have cardinality ore than 5. Figure 7 considers pairs of coluns where the cardinality of one colun is saller than 5 and the cardinality of the other colun is ore than 5. When the cardinality of the coluns are sall, Botto-K perfors better than (Figure 5). In effect, for coluns with cardinality less than k, if a good hash function is used, Botto-k keeps the hashes of all distinct values, thereby behaving siilar to full hash table of the colun. On the other hand, for large coluns (Figure 6) and the one that one colun is large and the other colun is sall (Figure 7), outperfors Botto-k significantly. Figures 8, and 9 show the error of against Botto-K when the cardinality threshold t in above experients is varied. We show the cardinality threshold t as a factor of k in Botto-k. Figures 8 shows the error for those coluns whose cardinalities are ore than t, and Figure 9 shows the error when the cardinality of one colun is saller than t and the cardinality of the other colun is ore than t. Figure 8 shows that for the large coluns significantly outperfors Botto-K. In the unbalanced case (Figure 9), when t is exactly k, Botto-k perfors slightly better than but as t grows outperfors Botto-K. Effect of deletion: Next we evaluate the error and scalability of the version of that uses a HLL sketch that supports data deletions (discussed in 4) vs. Botto-k. Recall, that we keep a constant size 5

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering Clustering Cluster Analysis of Microarray Data 4/3/009 Copyright 009 Dan Nettleton Group obects that are siilar to one another together in a cluster. Separate obects that are dissiilar fro each other into

More information

A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing

A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing A Trajectory Splitting Model for Efficient Spatio-Teporal Indexing Slobodan Rasetic Jörg Sander Jaes Elding Mario A. Nasciento Departent of Coputing Science University of Alberta Edonton, Alberta, Canada

More information

Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands

Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands Oblivious Routing for Fat-Tree Based Syste Area Networks with Uncertain Traffic Deands Xin Yuan Wickus Nienaber Zhenhai Duan Departent of Coputer Science Florida State University Tallahassee, FL 3306 {xyuan,nienaber,duan}@cs.fsu.edu

More information

Collaborative Web Caching Based on Proxy Affinities

Collaborative Web Caching Based on Proxy Affinities Collaborative Web Caching Based on Proxy Affinities Jiong Yang T J Watson Research Center IBM jiyang@usibco Wei Wang T J Watson Research Center IBM ww1@usibco Richard Muntz Coputer Science Departent UCLA

More information

Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues

Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues Mapping Data in Peer-to-Peer Systes: Seantics and Algorithic Issues Anastasios Keentsietsidis Marcelo Arenas Renée J. Miller Departent of Coputer Science University of Toronto {tasos,arenas,iller}@cs.toronto.edu

More information

λ-harmonious Graph Colouring Lauren DeDieu

λ-harmonious Graph Colouring Lauren DeDieu λ-haronious Graph Colouring Lauren DeDieu June 12, 2012 ABSTRACT In 198, Hopcroft and Krishnaoorthy defined a new type of graph colouring called haronious colouring. Haronious colouring is a proper vertex

More information

Performance Analysis of RAID in Different Workload

Performance Analysis of RAID in Different Workload Send Orders for Reprints to reprints@benthascience.ae 324 The Open Cybernetics & Systeics Journal, 2015, 9, 324-328 Perforance Analysis of RAID in Different Workload Open Access Zhang Dule *, Ji Xiaoyun,

More information

Theoretical Analysis of Local Search and Simple Evolutionary Algorithms for the Generalized Travelling Salesperson Problem

Theoretical Analysis of Local Search and Simple Evolutionary Algorithms for the Generalized Travelling Salesperson Problem Theoretical Analysis of Local Search and Siple Evolutionary Algoriths for the Generalized Travelling Salesperson Proble Mojgan Pourhassan ojgan.pourhassan@adelaide.edu.au Optiisation and Logistics, The

More information

Identifying Converging Pairs of Nodes on a Budget

Identifying Converging Pairs of Nodes on a Budget Identifying Converging Pairs of Nodes on a Budget Konstantina Lazaridou Departent of Inforatics Aristotle University, Thessaloniki, Greece konlaznik@csd.auth.gr Evaggelia Pitoura Coputer Science and Engineering

More information

EE 364B Convex Optimization An ADMM Solution to the Sparse Coding Problem. Sonia Bhaskar, Will Zou Final Project Spring 2011

EE 364B Convex Optimization An ADMM Solution to the Sparse Coding Problem. Sonia Bhaskar, Will Zou Final Project Spring 2011 EE 364B Convex Optiization An ADMM Solution to the Sparse Coding Proble Sonia Bhaskar, Will Zou Final Project Spring 20 I. INTRODUCTION For our project, we apply the ethod of the alternating direction

More information

An Efficient Approach for Content Delivery in Overlay Networks

An Efficient Approach for Content Delivery in Overlay Networks An Efficient Approach for Content Delivery in Overlay Networks Mohaad Malli, Chadi Barakat, Walid Dabbous Projet Planète, INRIA-Sophia Antipolis, France E-ail:{alli, cbarakat, dabbous}@sophia.inria.fr

More information

A Novel Fast Constructive Algorithm for Neural Classifier

A Novel Fast Constructive Algorithm for Neural Classifier A Novel Fast Constructive Algorith for Neural Classifier Xudong Jiang Centre for Signal Processing, School of Electrical and Electronic Engineering Nanyang Technological University Nanyang Avenue, Singapore

More information

Approaching the Skyline in Z Order

Approaching the Skyline in Z Order Approaching the Skyline in Z Order KenC.K.Lee Baihua Zheng Huajing Li Wang-Chien Lee cklee@cse.psu.edu bhzheng@su.edu.sg huali@cse.psu.edu wlee@cse.psu.edu Pennsylvania State University, University Park,

More information

A Measurement-Based Model for Parallel Real-Time Tasks

A Measurement-Based Model for Parallel Real-Time Tasks A Measureent-Based Model for Parallel Real-Tie Tasks Kunal Agrawal 1 Washington University in St. Louis St. Louis, MO, USA kunal@wustl.edu https://orcid.org/0000-0001-5882-6647 Sanjoy Baruah 2 Washington

More information

Fair Energy-Efficient Sensing Task Allocation in Participatory Sensing with Smartphones

Fair Energy-Efficient Sensing Task Allocation in Participatory Sensing with Smartphones Advance Access publication on 24 February 2017 The British Coputer Society 2017. All rights reserved. For Perissions, please eail: journals.perissions@oup.co doi:10.1093/cojnl/bxx015 Fair Energy-Efficient

More information

Shortest Path Determination in a Wireless Packet Switch Network System in University of Calabar Using a Modified Dijkstra s Algorithm

Shortest Path Determination in a Wireless Packet Switch Network System in University of Calabar Using a Modified Dijkstra s Algorithm International Journal of Engineering and Technical Research (IJETR) ISSN: 31-869 (O) 454-4698 (P), Volue-5, Issue-1, May 16 Shortest Path Deterination in a Wireless Packet Switch Network Syste in University

More information

Designing High Performance Web-Based Computing Services to Promote Telemedicine Database Management System

Designing High Performance Web-Based Computing Services to Promote Telemedicine Database Management System Designing High Perforance Web-Based Coputing Services to Proote Teleedicine Database Manageent Syste Isail Hababeh 1, Issa Khalil 2, and Abdallah Khreishah 3 1: Coputer Engineering & Inforation Technology,

More information

Geo-activity Recommendations by using Improved Feature Combination

Geo-activity Recommendations by using Improved Feature Combination Geo-activity Recoendations by using Iproved Feature Cobination Masoud Sattari Middle East Technical University Ankara, Turkey e76326@ceng.etu.edu.tr Murat Manguoglu Middle East Technical University Ankara,

More information

Data & Knowledge Engineering

Data & Knowledge Engineering Data & Knowledge Engineering 7 (211) 17 187 Contents lists available at ScienceDirect Data & Knowledge Engineering journal hoepage: www.elsevier.co/locate/datak An approxiate duplicate eliination in RFID

More information

Energy-Efficient Disk Replacement and File Placement Techniques for Mobile Systems with Hard Disks

Energy-Efficient Disk Replacement and File Placement Techniques for Mobile Systems with Hard Disks Energy-Efficient Disk Replaceent and File Placeent Techniques for Mobile Systes with Hard Disks Young-Jin Ki School of Coputer Science & Engineering Seoul National University Seoul 151-742, KOREA youngjk@davinci.snu.ac.kr

More information

The Boundary Between Privacy and Utility in Data Publishing

The Boundary Between Privacy and Utility in Data Publishing The Boundary Between Privacy and Utility in Data Publishing Vibhor Rastogi Dan Suciu Sungho Hong ABSTRACT We consider the privacy proble in data publishing: given a database instance containing sensitive

More information

The optimization design of microphone array layout for wideband noise sources

The optimization design of microphone array layout for wideband noise sources PROCEEDINGS of the 22 nd International Congress on Acoustics Acoustic Array Systes: Paper ICA2016-903 The optiization design of icrophone array layout for wideband noise sources Pengxiao Teng (a), Jun

More information

Optimal Route Queries with Arbitrary Order Constraints

Optimal Route Queries with Arbitrary Order Constraints IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.?, NO.?,? 20?? 1 Optial Route Queries with Arbitrary Order Constraints Jing Li, Yin Yang, Nikos Maoulis Abstract Given a set of spatial points DS,

More information

THE rapid growth and continuous change of the real

THE rapid growth and continuous change of the real IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 8, NO. 1, JANUARY/FEBRUARY 2015 47 Designing High Perforance Web-Based Coputing Services to Proote Teleedicine Database Manageent Syste Isail Hababeh, Issa

More information

MULTI-INDEX VOTING FOR ASYMMETRIC DISTANCE COMPUTATION IN A LARGE-SCALE BINARY CODES. Chih-Yi Chiu, Yu-Cyuan Liou, and Sheng-Hao Chou

MULTI-INDEX VOTING FOR ASYMMETRIC DISTANCE COMPUTATION IN A LARGE-SCALE BINARY CODES. Chih-Yi Chiu, Yu-Cyuan Liou, and Sheng-Hao Chou MULTI-INDEX VOTING FOR ASYMMETRIC DISTANCE COMPUTATION IN A LARGE-SCALE BINARY CODES Chih-Yi Chiu, Yu-Cyuan Liou, and Sheng-Hao Chou Departent of Coputer Science and Inforation Engineering, National Chiayi

More information

Detection of Outliers and Reduction of their Undesirable Effects for Improving the Accuracy of K-means Clustering Algorithm

Detection of Outliers and Reduction of their Undesirable Effects for Improving the Accuracy of K-means Clustering Algorithm Detection of Outliers and Reduction of their Undesirable Effects for Iproving the Accuracy of K-eans Clustering Algorith Bahan Askari Departent of Coputer Science and Research Branch, Islaic Azad University,

More information

Scheduling Parallel Real-Time Recurrent Tasks on Multicore Platforms

Scheduling Parallel Real-Time Recurrent Tasks on Multicore Platforms IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL., NO., NOV 27 Scheduling Parallel Real-Tie Recurrent Tasks on Multicore Platfors Risat Pathan, Petros Voudouris, and Per Stenströ Abstract We

More information

Privacy-preserving String-Matching With PRAM Algorithms

Privacy-preserving String-Matching With PRAM Algorithms Privacy-preserving String-Matching With PRAM Algoriths Report in MTAT.07.022 Research Seinar in Cryptography, Fall 2014 Author: Sander Sii Supervisor: Peeter Laud Deceber 14, 2014 Abstract In this report,

More information

Evaluation of a multi-frame blind deconvolution algorithm using Cramér-Rao bounds

Evaluation of a multi-frame blind deconvolution algorithm using Cramér-Rao bounds Evaluation of a ulti-frae blind deconvolution algorith using Craér-Rao bounds Charles C. Beckner, Jr. Air Force Research Laboratory, 3550 Aberdeen Ave SE, Kirtland AFB, New Mexico, USA 87117-5776 Charles

More information

Image Filter Using with Gaussian Curvature and Total Variation Model

Image Filter Using with Gaussian Curvature and Total Variation Model IJECT Vo l. 7, Is s u e 3, Ju l y - Se p t 016 ISSN : 30-7109 (Online) ISSN : 30-9543 (Print) Iage Using with Gaussian Curvature and Total Variation Model 1 Deepak Kuar Gour, Sanjay Kuar Shara 1, Dept.

More information

Design Optimization of Mixed Time/Event-Triggered Distributed Embedded Systems

Design Optimization of Mixed Time/Event-Triggered Distributed Embedded Systems Design Optiization of Mixed Tie/Event-Triggered Distributed Ebedded Systes Traian Pop, Petru Eles, Zebo Peng Dept. of Coputer and Inforation Science, Linköping University {trapo, petel, zebpe}@ida.liu.se

More information

IN many interactive multimedia streaming applications, such

IN many interactive multimedia streaming applications, such 840 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 8, NO. 5, MAY 06 A Geoetric Approach to Server Selection for Interactive Video Streaing Yaochen Hu, Student Meber, IEEE, DiNiu, Meber, IEEE, and Zongpeng Li, Senior

More information

Carving Differential Unit Test Cases from System Test Cases

Carving Differential Unit Test Cases from System Test Cases Carving Differential Unit Test Cases fro Syste Test Cases Sebastian Elbau, Hui Nee Chin, Matthew B. Dwyer, Jonathan Dokulil Departent of Coputer Science and Engineering University of Nebraska - Lincoln

More information

Investigation of The Time-Offset-Based QoS Support with Optical Burst Switching in WDM Networks

Investigation of The Time-Offset-Based QoS Support with Optical Burst Switching in WDM Networks Investigation of The Tie-Offset-Based QoS Support with Optical Burst Switching in WDM Networks Pingyi Fan, Chongxi Feng,Yichao Wang, Ning Ge State Key Laboratory on Microwave and Digital Counications,

More information

Predicting x86 Program Runtime for Intel Processor

Predicting x86 Program Runtime for Intel Processor Predicting x86 Progra Runtie for Intel Processor Behra Mistree, Haidreza Haki Javadi, Oid Mashayekhi Stanford University bistree,hrhaki,oid@stanford.edu 1 Introduction Progras can be equivalent: given

More information

Solving the Damage Localization Problem in Structural Health Monitoring Using Techniques in Pattern Classification

Solving the Damage Localization Problem in Structural Health Monitoring Using Techniques in Pattern Classification Solving the Daage Localization Proble in Structural Health Monitoring Using Techniques in Pattern Classification CS 9 Final Project Due Dec. 4, 007 Hae Young Noh, Allen Cheung, Daxia Ge Introduction Structural

More information

Smarter Balanced Assessment Consortium Claims, Targets, and Standard Alignment for Math

Smarter Balanced Assessment Consortium Claims, Targets, and Standard Alignment for Math Sarter Balanced Assessent Consortiu s, s, Stard Alignent for Math The Sarter Balanced Assessent Consortiu (SBAC) has created a hierarchy coprised of clais targets that together can be used to ake stateents

More information

Computer Aided Drafting, Design and Manufacturing Volume 26, Number 2, June 2016, Page 13

Computer Aided Drafting, Design and Manufacturing Volume 26, Number 2, June 2016, Page 13 Coputer Aided Drafting, Design and Manufacturing Volue 26, uber 2, June 2016, Page 13 CADDM 3D reconstruction of coplex curved objects fro line drawings Sun Yanling, Dong Lijun Institute of Mechanical

More information

1 Extended Boolean Model

1 Extended Boolean Model 1 EXTENDED BOOLEAN MODEL It has been well-known that the Boolean odel is too inflexible, requiring skilful use of Boolean operators to obtain good results. On the other hand, the vector space odel is flexible

More information

Modeling Parallel Applications Performance on Heterogeneous Systems

Modeling Parallel Applications Performance on Heterogeneous Systems Modeling Parallel Applications Perforance on Heterogeneous Systes Jaeela Al-Jaroodi, Nader Mohaed, Hong Jiang and David Swanson Departent of Coputer Science and Engineering University of Nebraska Lincoln

More information

OPTIMAL COMPLEX SERVICES COMPOSITION IN SOA SYSTEMS

OPTIMAL COMPLEX SERVICES COMPOSITION IN SOA SYSTEMS Key words SOA, optial, coplex service, coposition, Quality of Service Piotr RYGIELSKI*, Paweł ŚWIĄTEK* OPTIMAL COMPLEX SERVICES COMPOSITION IN SOA SYSTEMS One of the ost iportant tasks in service oriented

More information

Analysing Real-Time Communications: Controller Area Network (CAN) *

Analysing Real-Time Communications: Controller Area Network (CAN) * Analysing Real-Tie Counications: Controller Area Network (CAN) * Abstract The increasing use of counication networks in tie critical applications presents engineers with fundaental probles with the deterination

More information

A Low-Cost Multi-Failure Resilient Replication Scheme for High Data Availability in Cloud Storage

A Low-Cost Multi-Failure Resilient Replication Scheme for High Data Availability in Cloud Storage 216 IEEE 23rd International Conference on High Perforance Coputing A Low-Cost Multi-Failure Resilient Replication Schee for High Data Availability in Cloud Storage Jinwei Liu* and Haiying Shen *Departent

More information

Structural Balance in Networks. An Optimizational Approach. Andrej Mrvar. Faculty of Social Sciences. University of Ljubljana. Kardeljeva pl.

Structural Balance in Networks. An Optimizational Approach. Andrej Mrvar. Faculty of Social Sciences. University of Ljubljana. Kardeljeva pl. Structural Balance in Networks An Optiizational Approach Andrej Mrvar Faculty of Social Sciences University of Ljubljana Kardeljeva pl. 5 61109 Ljubljana March 23 1994 Contents 1 Balanced and clusterable

More information

Boosted Detection of Objects and Attributes

Boosted Detection of Objects and Attributes L M M Boosted Detection of Objects and Attributes Abstract We present a new fraework for detection of object and attributes in iages based on boosted cobination of priitive classifiers. The fraework directly

More information

Using Imperialist Competitive Algorithm in Optimization of Nonlinear Multiple Responses

Using Imperialist Competitive Algorithm in Optimization of Nonlinear Multiple Responses International Journal of Industrial Engineering & Production Research Septeber 03, Volue 4, Nuber 3 pp. 9-35 ISSN: 008-4889 http://ijiepr.iust.ac.ir/ Using Iperialist Copetitive Algorith in Optiization

More information

ELEVATION SURFACE INTERPOLATION OF POINT DATA USING DIFFERENT TECHNIQUES A GIS APPROACH

ELEVATION SURFACE INTERPOLATION OF POINT DATA USING DIFFERENT TECHNIQUES A GIS APPROACH ELEVATION SURFACE INTERPOLATION OF POINT DATA USING DIFFERENT TECHNIQUES A GIS APPROACH Kulapraote Prathuchai Geoinforatics Center, Asian Institute of Technology, 58 Moo9, Klong Luang, Pathuthani, Thailand.

More information

Utility-Based Resource Allocation for Mixed Traffic in Wireless Networks

Utility-Based Resource Allocation for Mixed Traffic in Wireless Networks IEEE IFOCO 2 International Workshop on Future edia etworks and IP-based TV Utility-Based Resource Allocation for ixed Traffic in Wireless etworks Li Chen, Bin Wang, Xiaohang Chen, Xin Zhang, and Dacheng

More information

HIGH PERFORMANCE PRE-SEGMENTATION ALGORITHM FOR SONAR IMAGES

HIGH PERFORMANCE PRE-SEGMENTATION ALGORITHM FOR SONAR IMAGES HIGH PERFORMANCE PRE-SEGMENTATION ALGORITHM FOR SONAR IMAGES Benjain Lehann*, Konstantinos Siantidis*, Dieter Kraus** *ATLAS ELEKTRONIK GbH Sebaldsbrücker Heerstraße 235 D-28309 Breen, GERMANY Eail: benjain.lehann@atlas-elektronik.co

More information

Scheduling Parallel Task Graphs on (Almost) Homogeneous Multi-cluster Platforms

Scheduling Parallel Task Graphs on (Almost) Homogeneous Multi-cluster Platforms 1 Scheduling Parallel Task Graphs on (Alost) Hoogeneous Multi-cluster Platfors Pierre-François Dutot, Tchiou N Takpé, Frédéric Suter and Henri Casanova Abstract Applications structured as parallel task

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M. Kakade Microsoft Research and Wharton, U Penn skakade@icrosoft.co Varun Kanade SEAS, Harvard University

More information

A CRYPTANALYTIC ATTACK ON RC4 STREAM CIPHER

A CRYPTANALYTIC ATTACK ON RC4 STREAM CIPHER A CRYPTANALYTIC ATTACK ON RC4 STREAM CIPHER VIOLETA TOMAŠEVIĆ, SLOBODAN BOJANIĆ 2 and OCTAVIO NIETO-TALADRIZ 2 The Mihajlo Pupin Institute, Volgina 5, 000 Belgrade, SERBIA AND MONTENEGRO 2 Technical University

More information

I ve Seen Enough : Incrementally Improving Visualizations to Support Rapid Decision Making

I ve Seen Enough : Incrementally Improving Visualizations to Support Rapid Decision Making I ve Seen Enough : Increentally Iproving Visualizations to Support Rapid Decision Making Sajjadur Rahan Marya Aliakbarpour 2 Ha Kyung Kong Eric Blais 3 Karrie Karahalios,4 Aditya Paraeswaran Ronitt Rubinfield

More information

A Hybrid Network Architecture for File Transfers

A Hybrid Network Architecture for File Transfers JOURNAL OF IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 6, NO., JANUARY 9 A Hybrid Network Architecture for File Transfers Xiuduan Fang, Meber, IEEE, Malathi Veeraraghavan, Senior Meber,

More information

Different criteria of dynamic routing

Different criteria of dynamic routing Procedia Coputer Science Volue 66, 2015, Pages 166 173 YSC 2015. 4th International Young Scientists Conference on Coputational Science Different criteria of dynaic routing Kurochkin 1*, Grinberg 1 1 Kharkevich

More information

Gromov-Hausdorff Distance Between Metric Graphs

Gromov-Hausdorff Distance Between Metric Graphs Groov-Hausdorff Distance Between Metric Graphs Jiwon Choi St Mark s School January, 019 Abstract In this paper we study the Groov-Hausdorff distance between two etric graphs We copute the precise value

More information

Ranking Spatial Data by Quality Preferences

Ranking Spatial Data by Quality Preferences 1 Ranking Spatial Data by Quality Preferences Man Lung Yiu, Hua Lu, Nikos Maoulis, and Michail Vaitis Abstract A spatial preference query ranks objects based on the qualities of features in their spatial

More information

Ascending order sort Descending order sort

Ascending order sort Descending order sort Scalable Binary Sorting Architecture Based on Rank Ordering With Linear Area-Tie Coplexity. Hatrnaz and Y. Leblebici Departent of Electrical and Coputer Engineering Worcester Polytechnic Institute Abstract

More information

Reconstruction of Time Series using Optimal Ordering of ICA Components

Reconstruction of Time Series using Optimal Ordering of ICA Components Reconstruction of Tie Series using Optial Ordering of ICA Coponents Ar Goneid and Abear Kael Departent of Coputer Science & Engineering, The Aerican University in Cairo, Cairo, Egypt e-ail: goneid@aucegypt.edu

More information

Approximate String Matching with Reduced Alphabet

Approximate String Matching with Reduced Alphabet Approxiate String Matching with Reduced Alphabet Leena Salela 1 and Jora Tarhio 2 1 University of Helsinki, Departent of Coputer Science leena.salela@cs.helsinki.fi 2 Aalto University Deptartent of Coputer

More information

The Application of Bandwidth Optimization Technique in SLA Negotiation Process

The Application of Bandwidth Optimization Technique in SLA Negotiation Process The Application of Bandwidth Optiization Technique in SLA egotiation Process Srećko Krile University of Dubrovnik Departent of Electrical Engineering and Coputing Cira Carica 4, 20000 Dubrovnik, Croatia

More information

3 Conference on Inforation Sciences and Systes, The Johns Hopkins University, March, 3 Sensitivity Characteristics of Cross-Correlation Distance Metric and Model Function F. Porikli Mitsubishi Electric

More information

Survivability Function A Measure of Disaster-Based Routing Performance

Survivability Function A Measure of Disaster-Based Routing Performance Survivability Function A Measure of Disaster-Based Routing Perforance Journal Club Presentation on W. Molisz. Survivability function-a easure of disaster-based routing perforance. IEEE Journal on Selected

More information

The Unit Bar Visibility Number of a Graph

The Unit Bar Visibility Number of a Graph The Unit Bar Visibility Nuber of a Graph Eily Gaub 1,4, Michelle Rose 2,4, and Paul S. Wenger,4 August 20, 2015 Abstract A t-unit-bar representation of a graph G is an assignent of sets of at ost t horizontal

More information

6.1 Topological relations between two simple geometric objects

6.1 Topological relations between two simple geometric objects Chapter 5 proposed a spatial odel to represent the spatial extent of objects in urban areas. The purpose of the odel, as was clarified in Chapter 3, is ultifunctional, i.e. it has to be capable of supplying

More information

A Beam Search Method to Solve the Problem of Assignment Cells to Switches in a Cellular Mobile Network

A Beam Search Method to Solve the Problem of Assignment Cells to Switches in a Cellular Mobile Network A Bea Search Method to Solve the Proble of Assignent Cells to Switches in a Cellular Mobile Networ Cassilda Maria Ribeiro Faculdade de Engenharia de Guaratinguetá - DMA UNESP - São Paulo State University

More information

Brian Noguchi CS 229 (Fall 05) Project Final Writeup A Hierarchical Application of ICA-based Feature Extraction to Image Classification Brian Noguchi

Brian Noguchi CS 229 (Fall 05) Project Final Writeup A Hierarchical Application of ICA-based Feature Extraction to Image Classification Brian Noguchi A Hierarchical Application of ICA-based Feature Etraction to Iage Classification Introduction Iage classification poses one of the greatest challenges in the achine vision and achine learning counities.

More information

An Integrated Processing Method for Multiple Large-scale Point-Clouds Captured from Different Viewpoints

An Integrated Processing Method for Multiple Large-scale Point-Clouds Captured from Different Viewpoints 519 An Integrated Processing Method for Multiple Large-scale Point-Clouds Captured fro Different Viewpoints Yousuke Kawauchi 1, Shin Usuki, Kenjiro T. Miura 3, Hiroshi Masuda 4 and Ichiro Tanaka 5 1 Shizuoka

More information

Planet Scale Software Updates

Planet Scale Software Updates Planet Scale Software Updates Christos Gkantsidis 1, Thoas Karagiannis 2, Pablo Rodriguez 1, and Milan Vojnović 1 1 Microsoft Research 2 UC Riverside Cabridge, UK Riverside, CA, USA {chrisgk,pablo,ilanv}@icrosoft.co

More information

NON-RIGID OBJECT TRACKING: A PREDICTIVE VECTORIAL MODEL APPROACH

NON-RIGID OBJECT TRACKING: A PREDICTIVE VECTORIAL MODEL APPROACH NON-RIGID OBJECT TRACKING: A PREDICTIVE VECTORIAL MODEL APPROACH V. Atienza; J.M. Valiente and G. Andreu Departaento de Ingeniería de Sisteas, Coputadores y Autoática Universidad Politécnica de Valencia.

More information

Geometry. The Method of the Center of Mass (mass points): Solving problems using the Law of Lever (mass points). Menelaus theorem. Pappus theorem.

Geometry. The Method of the Center of Mass (mass points): Solving problems using the Law of Lever (mass points). Menelaus theorem. Pappus theorem. Noveber 13, 2016 Geoetry. The Method of the enter of Mass (ass points): Solving probles using the Law of Lever (ass points). Menelaus theore. Pappus theore. M d Theore (Law of Lever). Masses (weights)

More information

Debiasing Crowdsourced Batches

Debiasing Crowdsourced Batches Debiasing Crowdsourced Batches Honglei Zhuang, Aditya Paraeswaran, Dan Roth and Jiawei Han Departent of Coputer Science University of Illinois at Urbana-Chapaign {hzhuang3, adityagp, danr, hanj}@illinois.edu

More information

CS 361 Meeting 8 9/24/18

CS 361 Meeting 8 9/24/18 CS 36 Meeting 8 9/4/8 Announceents. Hoework 3 due Friday. Review. The closure properties of regular languages provide a way to describe regular languages by building the out of sipler regular languages

More information

2013 IEEE Conference on Computer Vision and Pattern Recognition. Compressed Hashing

2013 IEEE Conference on Computer Vision and Pattern Recognition. Compressed Hashing 203 IEEE Conference on Coputer Vision and Pattern Recognition Copressed Hashing Yue Lin Rong Jin Deng Cai Shuicheng Yan Xuelong Li State Key Lab of CAD&CG, College of Coputer Science, Zhejiang University,

More information

The Internal Conflict of a Belief Function

The Internal Conflict of a Belief Function The Internal Conflict of a Belief Function Johan Schubert Abstract In this paper we define and derive an internal conflict of a belief function We decopose the belief function in question into a set of

More information

COLOR HISTOGRAM AND DISCRETE COSINE TRANSFORM FOR COLOR IMAGE RETRIEVAL

COLOR HISTOGRAM AND DISCRETE COSINE TRANSFORM FOR COLOR IMAGE RETRIEVAL COLOR HISTOGRAM AND DISCRETE COSINE TRANSFORM FOR COLOR IMAGE RETRIEVAL 1 Te-Wei Chiang ( 蔣德威 ), 2 Tienwei Tsai ( 蔡殿偉 ), 3 Jeng-Ping Lin ( 林正平 ) 1 Dept. of Accounting Inforation Systes, Chilee Institute

More information

KS3 Maths Progress Delta 2-year Scheme of Work

KS3 Maths Progress Delta 2-year Scheme of Work KS3 Maths Progress Delta 2-year Schee of Work See the spreadsheet for the detail. Essentially this is an express route through our higher-ability KS3 Maths Progress student books Delta 1, 2 and 3 as follows:

More information

Handling Indivisibilities. Notes for AGEC 622. Bruce McCarl Regents Professor of Agricultural Economics Texas A&M University

Handling Indivisibilities. Notes for AGEC 622. Bruce McCarl Regents Professor of Agricultural Economics Texas A&M University Notes for AGEC 6 Bruce McCarl Regents Professor of Agricultural Econoics Texas A&M University All or None -- Integer Prograing McCarl and Spreen Chapter 5 Many investent and probles involve cases where

More information

Data Caching for Enhancing Anonymity

Data Caching for Enhancing Anonymity Data Caching for Enhancing Anonyity Rajiv Bagai and Bin Tang Departent of Electrical Engineering and Coputer Science Wichita State University Wichita, Kansas 67260 0083, USA Eail: {rajiv.bagai, bin.tang}@wichita.edu

More information

Distributed Multicast Tree Construction in Wireless Sensor Networks

Distributed Multicast Tree Construction in Wireless Sensor Networks Distributed Multicast Tree Construction in Wireless Sensor Networks Hongyu Gong, Luoyi Fu, Xinzhe Fu, Lutian Zhao 3, Kainan Wang, and Xinbing Wang Dept. of Electronic Engineering, Shanghai Jiao Tong University,

More information

A GRAPH-PLANARIZATION ALGORITHM AND ITS APPLICATION TO RANDOM GRAPHS

A GRAPH-PLANARIZATION ALGORITHM AND ITS APPLICATION TO RANDOM GRAPHS A GRAPH-PLANARIZATION ALGORITHM AND ITS APPLICATION TO RANDOM GRAPHS T. Ozawa and H. Takahashi Departent of Electrical Engineering Faculty of Engineering, Kyoto University Kyoto, Japan 606 Abstract. In

More information

Supplementary Section. A. Algorithm. B. Datasets. C. More on Labeling Functions

Supplementary Section. A. Algorithm. B. Datasets. C. More on Labeling Functions Suppleentary Section In this section, we include ore details about our algorith, labeling functions, datasets as well as additional results and coparisons, which could not be included in the ain paper

More information

Defining and Surveying Wireless Link Virtualization and Wireless Network Virtualization

Defining and Surveying Wireless Link Virtualization and Wireless Network Virtualization 1 Defining and Surveying Wireless Link Virtualization and Wireless Network Virtualization Jonathan van de Belt, Haed Ahadi, and Linda E. Doyle The Centre for Future Networks and Counications - CONNECT,

More information

A Study of the Relationship Between Support Vector Machine and Gabriel Graph

A Study of the Relationship Between Support Vector Machine and Gabriel Graph A Study of the Relationship Between Support Vector Machine and Gabriel Graph Wan Zhang and Irwin King {wzhang, king}@cse.cuhk.edu.hk Departent of Coputer Science & Engineering The Chinese University of

More information

Design and Implementation of Business Logic Layer Object-Oriented Design versus Relational Design

Design and Implementation of Business Logic Layer Object-Oriented Design versus Relational Design Design and Ipleentation of Business Logic Layer Object-Oriented Design versus Relational Design Ali Alharthy Faculty of Engineering and IT University of Technology, Sydney Sydney, Australia Eail: Ali.a.alharthy@student.uts.edu.au

More information

An Ad Hoc Adaptive Hashing Technique for Non-Uniformly Distributed IP Address Lookup in Computer Networks

An Ad Hoc Adaptive Hashing Technique for Non-Uniformly Distributed IP Address Lookup in Computer Networks An Ad Hoc Adaptive Hashing Technique for Non-Uniforly Distributed IP Address Lookup in Coputer Networks Christopher Martinez Departent of Electrical and Coputer Engineering The University of Texas at San

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN International Journal of Scientific & Engineering Research, Volue 4, Issue 0, October-203 483 Design an Encoding Technique Using Forbidden Transition free Algorith to Reduce Cross-Talk for On-Chip VLSI

More information

Novel Image Representation and Description Technique using Density Histogram of Feature Points

Novel Image Representation and Description Technique using Density Histogram of Feature Points Novel Iage Representation and Description Technique using Density Histogra of Feature Points Keneilwe ZUVA Departent of Coputer Science, University of Botswana, P/Bag 00704 UB, Gaborone, Botswana and Tranos

More information

A Periodic Dynamic Load Balancing Method

A Periodic Dynamic Load Balancing Method 2016 3 rd International Conference on Engineering Technology and Application (ICETA 2016) ISBN: 978-1-60595-383-0 A Periodic Dynaic Load Balancing Method Taotao Fan* State Key Laboratory of Inforation

More information

A Comparative Study of Two-phase Heuristic Approaches to General Job Shop Scheduling Problem

A Comparative Study of Two-phase Heuristic Approaches to General Job Shop Scheduling Problem IE Vol. 7, No. 2, pp. 84-92, Septeber 2008. A Coparative Study of Two-phase Heuristic Approaches to General Job Shop Scheduling Proble Ji Ung Sun School of Industrial and Managent Engineering Hankuk University

More information

Real-Time Detection of Invisible Spreaders

Real-Time Detection of Invisible Spreaders Real-Tie Detection of Invisible Spreaders MyungKeun Yoon Shigang Chen Departent of Coputer & Inforation Science & Engineering University of Florida, Gainesville, FL 3, USA {yoon, sgchen}@cise.ufl.edu Abstract

More information

EFFICIENT VIDEO SEARCH USING IMAGE QUERIES A. Araujo1, M. Makar2, V. Chandrasekhar3, D. Chen1, S. Tsai1, H. Chen1, R. Angst1 and B.

EFFICIENT VIDEO SEARCH USING IMAGE QUERIES A. Araujo1, M. Makar2, V. Chandrasekhar3, D. Chen1, S. Tsai1, H. Chen1, R. Angst1 and B. EFFICIENT VIDEO SEARCH USING IMAGE QUERIES A. Araujo1, M. Makar2, V. Chandrasekhar3, D. Chen1, S. Tsai1, H. Chen1, R. Angst1 and B. Girod1 1 Stanford University, USA 2 Qualco Inc., USA ABSTRACT We study

More information

On the Computation and Application of Prototype Point Patterns

On the Computation and Application of Prototype Point Patterns On the Coputation and Application of Prototype Point Patterns Katherine E. Tranbarger Freier 1 and Frederic Paik Schoenberg 2 Abstract This work addresses coputational probles related to the ipleentation

More information

Keyword Search in Spatial Databases: Towards Searching by Document

Keyword Search in Spatial Databases: Towards Searching by Document IEEE International Conference on Data Engineering Keyword Search in Spatial Databases: Towards Searching by Docuent Dongxiang Zhang #1, Yeow Meng Chee 2, Anirban Mondal 3, Anthony K. H. Tung #4, Masaru

More information

A Model Free Automatic Tuning Method for a Restricted Structured Controller. by using Simultaneous Perturbation Stochastic Approximation

A Model Free Automatic Tuning Method for a Restricted Structured Controller. by using Simultaneous Perturbation Stochastic Approximation 28 Aerican Control Conference Westin Seattle Hotel, Seattle, Washington, USA June -3, 28 WeC9.5 A Model Free Autoatic Tuning Method for a Restricted Structured Controller by Using Siultaneous Perturbation

More information

Deterministic Voting in Distributed Systems Using Error-Correcting Codes

Deterministic Voting in Distributed Systems Using Error-Correcting Codes IEEE TRASACTIOS O PARALLEL AD DISTRIBUTED SYSTEMS, VOL. 9, O. 8, AUGUST 1998 813 Deterinistic Voting in Distributed Systes Using Error-Correcting Codes Lihao Xu and Jehoshua Bruck, Senior Meber, IEEE Abstract

More information

Enhancing Real-Time CAN Communications by the Prioritization of Urgent Messages at the Outgoing Queue

Enhancing Real-Time CAN Communications by the Prioritization of Urgent Messages at the Outgoing Queue Enhancing Real-Tie CAN Counications by the Prioritization of Urgent Messages at the Outgoing Queue ANTÓNIO J. PIRES (1), JOÃO P. SOUSA (), FRANCISCO VASQUES (3) 1,,3 Faculdade de Engenharia da Universidade

More information

Improve Peer Cooperation using Social Networks

Improve Peer Cooperation using Social Networks Iprove Peer Cooperation using Social Networks Victor Ponce, Jie Wu, and Xiuqi Li Departent of Coputer Science and Engineering Florida Atlantic University Boca Raton, FL 33431 Noveber 5, 2007 Corresponding

More information

Minimax Sensor Location to Monitor a Piecewise Linear Curve

Minimax Sensor Location to Monitor a Piecewise Linear Curve NSF GRANT #040040 NSF PROGRAM NAME: Operations Research Miniax Sensor Location to Monitor a Piecewise Linear Curve To M. Cavalier The Pennsylvania State University University Par, PA 680 Whitney A. Conner

More information

Joint Measurement- and Traffic Descriptor-based Admission Control at Real-Time Traffic Aggregation Points

Joint Measurement- and Traffic Descriptor-based Admission Control at Real-Time Traffic Aggregation Points Joint Measureent- and Traffic Descriptor-based Adission Control at Real-Tie Traffic Aggregation Points Stylianos Georgoulas, Panos Triintzios and George Pavlou Centre for Counication Systes Research, University

More information