Random Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation

Size: px

Start display at page:

Download "Random Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation"

Ursula Kennedy
5 years ago
Views:

1 DEIM Forum 2018 I4-4 Abstract Ranom Clustering for Multiple Sampling Units to Spee Up Run-time Sample Generation uzuru OKAJIMA an Koichi MARUAMA NEC Solution Innovators, Lt Shinkiba, Koto-ku, Tokyo, Japan Ranom sample generation from a atabase can be helpful to exploratory ata analysis because of its capability of fast approximate aggregation of large ata. For efficient run-time ranom sample generation, ranom clustering, a technique to ranomly sort table rows on isk in avance, has been extensively use in previous research. However, this technique can ranomize rows of a table only for a single unit. Thus, if ata analysts require sampling for other units than the one use for clustering, traitional single-unit ranom clustering is no longer of help an causes a significant slowown ue to many ranom isk accesses. In this paper, we propose a ranom clustering technique for multiple units rather than for a single unit. This technique can ranomize table rows for multiple units on isk. After clustering, a ata analyst can select any one of the units an then samples for the selecte unit can be efficiently rea from isk until one satisfies the given filtering criteria an size conition. Experiments show that the new ranom clustering technique for multiple units generate samples significantly faster than traitional single-unit ranom clustering. Key wors Ranom sampling from atabases, Exploratory ata analysis, Online aggregation 1. Introuction Ranom sampling from a atabase has been stuie for a long time, an it is gaining even more significance owing to the emerging eman for large ata analyses. Calculating exact aggregates over massive ata requires long execution times an vast amounts of resources. Sampling can reuce such costs by calculating approximate aggregates over a small amount of ata extracte from the original tables. Run-time sample generation has been use in approximate query processing (AQP) as well as sample precomputing. In precomputing approaches, samples are precompute an store apart from the original ata so that approximate aggregates can be calculate from the samples. On the other han, run-time generation approaches o not store such precompute samples. They extract rows in ranom orer at query time an calculate approximate answers from them. An avantage of run-time approaches is their high aaptability to a-hoc queries. Run-time approaches can generate a sample of appropriate size for estimating the answer to a given query by continuously expaning the sample size until it becomes sufficient. This property has mae run-time sample generation a key component of online aggregation [8], which is a framework for presenting a running estimate base on a sample until the user is satisfie with it. The time efficiency of run-time sample generation epens on the physical ata layout on isk. Without any assumption, ranomly sample rows may be istribute over a large part of the isk-resient original table, wherein the ranom isk accesses substantially slow the sampling process. To avoi this problem, the first online aggregation paper [8] propose the ranom clustering technique, which ranomly sorts rows in a isk-resient table in a pre-processing phase. If rows are ranomly clustere, a sample can be efficiently obtaine through a sequential scan because rows in continuous blocks can be consiere a ranom sample an this sample is easily expane by scanning more blocks. Since then, ranom clustering has come to be viewe in the online aggregation literature, e.g., [7], [9], [13], as a key technique for accelerating the run-time sample generation process. AQP approaches base on sampling have been evelope to present approximate answers by automatically choosing or generating samples in the backgroun. However, simply giving approximate answers is not sufficient for ata analysts familiar with avance statistical analysis tools. Most statistical analysis tools toay have more sophisticate analysis an visualization functions than the aggregate functions executable in atabase systems. As analysts woul want to take full avantage of their favorite tools, they woul prefer simply getting samples accoring to the requirements they give explicitly for further analysis in their own tools, instea of only getting approximate answers.

2 Generating samples accoring to explicitly given requirements is a challenging task, as is AQP. In exploratory ata analysis, analysts nee to perform analyses iteratively an from ifferent perspectives; thus, the sampling unit of concern may be ifferent in each trial. However, ranom sampling using ifferent sampling units causes a severe slowown even if the ata are ranomly clustere. This is because the traitional ranom clustering technique can only ranomize a table accoring to a single unit (in most cases, a row) an so we call the traitional technique single-unit ranom clustering. If analysts choose other units for sampling than the one use for clustering, traitional single-unit ranom clustering is no longer of help an causes a significant slowown ue to many ranom isk accesses. We call this the unit mismatch problem. This problem is inherent to singleunit ranom clustering, an any online aggregation system base on traitional ranom clustering may suffer from this problem when the aggregation nees to use ifferent units. In this paper, to avoi the unit mismatch problem, we propose a ranom clustering technique for multiple sampling units rather than a single unit. Our multi-unit ranom clustering technique can ranomize a table for several sampling units in avance so that samples of ifferent sizes for any one of the units can be rea from isk quickly. If a ata analyst gives her requirements as to the target sampling unit, the filtering criteria an the size conition, we can iterate the sample generation by increasing the sample size until the sample satisfies the given requirements. Experiments show that our multi-unit ranom clustering generate samples for multiple units significantly faster than traitional single-unit ranom clustering because the traitional metho is affecte by the unit mismatch problem. By iterative sampling, samples satisfying the given requirements were efficiently generate from the atabase clustere using our technique even though the requirements involve sampling of istinct values in a foreign key (typical examples that suffer from the unit mismatch problem). The source coe of the sampling system is available online ( The rest of this paper is organize as follows. In Section 2, we iscuss the unit mismatch problem. Section 3 explains the multi-unit clustering technique, the iterative sampling algorithm that returns samples accoring to analytical requirements an the system that executes the iterative sampling. In Section 4, we experimentally evaluate our approach. We iscuss relate work in Section 5 an give our conclusions in Section 6. Figure 1 Example atabase for illustrating the unit mismatch problem. PK represents a primary key, an FK represents a foreign key. 2. Unit Mismatch Problem The unit mismatch problem is a problem of many ranom isk accesses that occur when the sampling unit iffers from the unit of ranom clustering. To illustrate this problem, let us consier the example atabase shown in Figure 1. Each row in table lineitem represents a line item of a transactional orer an has a primary key l linekey an foreign keys corresponing to the primary keys of three other tables that represent orers, parts, an suppliers relate to the line items. lineitem an the other three tables have many-to-one relationships: An orer (part or supplier) may be relate to more than one row in lineitem. In aition, we assume that all tables are so large that sampling is neee to analyze them. This means that this atabase has four entity types that can be chosen as the sampling unit: a line item, an orer, a part, an a supplier. Let us see what happens when all the tables are ranomly clustere in the typical way, i.e., when all rows are ranomly shuffle in avance. If the analyst wants to estimate the average sales per line item by sampling from table lineitem, the sampling can be one efficiently by making a sequential scan of the first rows in lineitem. However, If the analyst next wants to estimate the average total sales per orer, she nees to ranomly choose a small number of orers, collect all line items relate to the sample orers, an then average the sum of the prices for the orers. The orers can be efficiently sample by making a sequential scan of the first rows in table orer. However, the rows relate to each orer are ranomly istribute over table lineitem because the rows in lineitem are ranomly shuffle an there is no guarantee that rows having the same l orerkey value can be foun in ajacent isk blocks. Thus, ranom sampling consiering an orer as the sampling unit is slowe own by many ranom isk accesses to lineitem. In general, by using traitional ranom clustering, a table can be clustere for only a single unit, an many ranom isk accesses can be avoie only if the sampling unit matches the unit of the clustering. If one of the other units is chosen as

3 the sampling unit, ranom isk accesses are not reuce. Thus, the purpose of this paper is to overcome this problem. 3. Multi-unit Clustering 3. 1 Clustering Proceure We explain how rows can be ranomly clustere for multiple sampling units. Large tables generally have several attributes such that the istinct values in each of them represent numerous entities such as IP aresses or URLs an can be use as sampling units. We call such attributes unit keys an our multi-unit clustering technique clusters rows in a table so that we can efficiently sample rows consiering the istinct values of the chosen unit keys as the sampling units. In Figure 1, for example, l linekey, l orerkey, l partkey an l suppkey are goo choices for the unit keys of lineitem. Our technique leverages compoun clustering inexes, which are supporte by many atabase systems. First, we esignate several attributes as the unit keys. Then, we set compoun clustering keys in the tables by appening several attributes erive from the given unit keys. We use a pairwise inepenent hash function h(v) that maps each istinct value v of the unit keys to [0, 2 L 1]. We assigne each istinct value v of the unit keys a level:level(v) = L lg h(v) 1 if h(v) > 0 an Level(v) = L for h(v) = 0. The minimum level is 0, an the maximum is L. The probability that the level of v is at least l [0, L] is P r(level(v) > = l) = 2 l. This level function is the same as the one use in istinct sampling [6] to keep the sample size small. We use it to cluster the original table in preparation for sampling using ifferent units. If the table has U unit keys, we appen U level attributes that store the levels for the unit keys an appen a level sum attribute that stores the sum of the U levels. Then, we cluster the table by setting a compoun clustering key in it. The first component of the key is the level sum attribute, an the successive components are the level attributes. If the table was alreay assigne a clustering key, the original clustering key is appene as the last component of the new clustering key. For example, if table orers has one clustering key o orerate an three unit keys o orerkey, o custkey an o prouctkey, we set a new clustering key compose of (level sum, o orerkey level, o custkey level, o prouctkey level, o orerate) after appening the level sum attribute an the three level attributes Extract Samples of Different Sizes Let us see how samples of ifferent sizes can be extracte from a clustere table. First, one of the unit keys of the table is chosen as the target unit key that represents the sampling unit. Then, the target level l [0, L] is chosen to etermine the sample size. All istinct values of the target unit key that have levels being at least l are sample. A sample table s of a target table t is a subset of t that consists of rows in t that have the sample values in the target unit key of t. For example, if the analyst wants to extract a sample from table lineitem in Figure 1 consiering an orer as the sampling unit, then the target unit key is l orerkey an sample table l sample is efine as follows: WITH l sample AS (SELECT * FROM lineitem WHERE l orerkey level > = l) l orerkey level stores the levels of l orerkey. The sample for l = L is the smallest sample, an the sample size increases exponentially as l ecreases. Finally, the sample for l = 0 contains all the rows of the original table. The istinct values of the target unit key in a sample table are selecte on the basis of the pairwise inepenent hash function h(v) an thus these selecte values can be seen as a ranom sample of the istinct values of the target unit key in the original table.thus, the ata analyst can estimate the properties about the target unit in the original table by only analyzing a sample table erive from it. For example, the average total sales per orer in lineitem can be estimate using the average total sales per orer in l sample because all istinct orers are sample with the same probability. We will see that the sample tables of ifferent target levels can be rea from isk efficiently. To iscuss the efficiency, we call a set of rows sharing the same set of levels in a clustere table a bucket. A bucket can be viewe as an I/O unit. All rows in a bucket share the same prefix in their clustering keys, an thus, they are contiguously clustere. If a row in a bucket is inclue in a sample table, all other rows in the bucket are also inclue. The single-unit ranom clustering suffers from the unit mismatch problem because rows can be sorte in only one ranom orer while each row is relate to multiple units. Thus, our approach uses ranom bucketing rather than orering. The rows relate to the units of the same level are ivie into multiple buckets on isk. Given the target unit at query time, our algorithm restores these rows by selecting buckets relate to the target level. Sample extraction is surely efficient if the target table has the target unit key as its only unit key. In that case, all rows are sorte only by the levels of that key an collecting rows above a certain level is efficiently one by making a sequential scan. However, if the table has extra unit keys other than the target unit key, the rows having the same level of the target unit key are ivie into many smaller buckets because of the existence of the levels of the other unit keys. Therefore, we nee to estimate the effect of the number of unit keys on the efficiency of our sample generation. For a table having N rows, the expecte size of a bucket with levels (l 0, l 1,..., l U 1) is N U 1 u=0 2 1 lu = N2 U U 1 u=0 lu.

4 Figure 2 l e e e e El Ele Ele Upper boun of the probability that a row is in a minibucket This is a monotonic function of the level sum. Our technique clusters rows in ascening orer of the level sum, i.e., their expecte size. Thus, the smallest buckets are probabilistically clustere in the blocks at the bottom of the table. We assume that each isk block stores B rows. Buckets having at least B rows can be efficiently rea from isk because they are store in at least one contiguous block. On the other han, buckets with fewer than B rows may reuce the isk-rea efficiency. A bucket is calle a mini-bucket if the expecte number of its rows is smaller than B. Here, we show that mini-buckets actually occupy a small fraction of the whole table an o not significantly reuce efficiency if the table has many rows an a few unit keys. Consier a row r in a table t of N rows with U unit keys. Denote by S the set of all rows in mini-buckets of t. The probability that r S is boune as follows: Γ(U, ln(n/b)) (U 1)! < Γ(U, ln(n/b) U ln 2) = P r(r S) < (U 1)! = P ub (U, N/B) where Γ(a, x) is the upper incomplete gamma function. The upper boun P ub (U, N/B) is plotte for varying U an N/B in Figure 2. This graph shows that the probability is reasonably small if N/B is large an U is small. For example, the probability is smaller than for N/B = 10 6 an U = 4. The probability is equal to the expecte ratio of the number of rows in the mini-buckets to N. Thus, the mini-buckets occupy fewer than N/B P ub (U, N/B) blocks in expectation. Thus, our algorithm can collect buckets of the same level with a small overhea cause by reaing the blocks storing mini-buckets. The expecte size of the smallest sample table, which is obtaine at l = L, is N2 L. Thus, L shoul be at least O(log N), which makes the expecte size of the smallest sample constant. If L is larger than O(log N), the choice of L oes not affect the efficiency, because the samples for l > O(log N) are very small an clustere in the blocks storing mini-buckets an o not cause many ranom accesses Iterative Sampling We showe that sample tables of ifferent target levels can be efficiently extracte from a clustere table. However, the sample table probably still inclues ata irrelevant to the analyses, an the ata analyst nees to filter out such ata by the filtering criteria. If the remaining ata after filtering are too small for statistical analysis, the analyst nees to get larger samples by setting lower target levels. To reuce the buren, we give an algorithm that automatically chooses samples satisfying her requirements, i.e., the filtering criteria an size conition, as well as the target sampling unit. This algorithm iteratively extracts samples from multiple tables by ecreasing the target level until the requirements are met. We efine how the analyst can represent her requirements as the input to the system. First, the target sampling unit is represente by choosing one of the unit keys for each target table to be sample. For example, if sampling of orers is neee, o orerkey for table orers an l orerkey for lineitem are chosen as the target unit keys. Next, we introuce the notion of filtere table to escribe the filtering criteria an size conition. A filtere table is a table erive from sample tables an, if neee, other tables. The filtering criteria is escribe as a selection from the sample tables to the filtere table. If the analyst wants to exclue a certain kin of unit, she efines the filtere table so that rows relate to such units are exclue. The size conition can be escribe as an aggregation over the filtere table because the filtere table only contains the units that pass the filtere criteria. Given these requirements, we return a sample of sufficient size by ecreasing the target level until the given requirements are satisfie. The pseuo-coe is given in Algorithm 1. We are given a query that contains T pairs of clustere tables an their unit keys (t 0, u 0), (t 1, u 1),..., (t T 1, u T 1), filtering criteria f, size conition c, an main query q. First, we initialize the target level l to L. For each clustere table t i, we efine a sample table s i as the set of all rows in t i whose levels of u i are at least l. Next, we ask the atabase system if the size conition c is satisfie by submitting a query with the efinitions of the sample tables an the filtering criteria (checkconition in Algorithm 1). The atabase system creates the sample tables, filters them with the given criteria, an returns a truth value inicating whether the filtere tables satisfy the size conition. If the conition is not satisfie, we ecrement l by one, reefine the sample tables, an submit a size conition query again. The size conition check is repeate until the conition is satisfie or l reaches 0. Finally, we submit the main query q to the atabase system with the final efinitions of the sample tables (runmainquery). If the sampling process reaches the final step l = 0, the sample tables are the same as their

5 Algorithm 1 Sampling(pairs of a target table an its unit key (t 0, u 0),..., (t T 1, u T 1), filtering criteria f, size conition c, main query q) for l = L to 0 o for i = 0 to T 1 o efine sample table s i as the set of all rows in t i whose level of u i is at least l if checkconition(c, f, s 0,..., s T 1 ) then break return runmainquery(q, f, s 0,..., s T 1 ) Figure 3 ^ Z Z Overview of our sampling architecture original tables an the main query is exactly answere. The istinct values of the target unit keys are ranomly sample by the hash function. Thus, if one of the values is inclue in the sample of one table, all rows relate to the same value must be inclue in the samples of other tables. This assures that we can safely join the sample tables on the sample values of the target unit keys without lacking relate rows. This capability of joining on sample values is a major avantage of our algorithm Sampling System We here present a sampling system that can bring out the strengths of our clustering technique. This system receives sampling queries from ata analysts an runs the iterative sampling algorithm accoring to it. An overview is shown in Figure 3. To simplify the explanation, we assume that we are given a atabase system that alreay stores the ata the analyst wants to examine. Our sampling system works as a proxy server between the client application the analyst operates an the atabase system. It first clusters the ata of the tables in the atabase system (the clustering phase) an then waits for sampling queries to be submitte by the analyst (the sampling phase). In the clustering phase, the analyst sens a clustering request to the sampling system through the client application. In this request, the analyst esignates several attributes in large tables as unit keys. Then, the sampling system clusters the ata of the tables in the atabase system by using multi-unit clustering. After the clustering finishes, the sampling system waits for sampling queries to be submitte by the analyst. Once it receives a query, the sampling system runs the iterative sampling algorithm an returns the result to the analyst. The clustering process generally takes a long time, because all rows are sorte on isk. However, once the clustering is one, subsequent sampling queries can be answere in shorter amounts of time. We exten the SQL grammar so that the analyst can escribe her requirements. Figure 4 shows our extene SQL grammar, an Figure 5 shows an example query. The requirements are represente by SAMPLE, WITH an UNTIL clauses, an they are inserte before the main query block. The SAMPLE clause is use to choose which clustere tables to sample an the target unit key for each of them. The names of the sample tables are specifie as AS clauses. If they are omitte, the name of the original clustere table is use to refer to its sample table in the following clauses. The WITH clause is use to efine filtere tables erive from the sample tables. The filtere tables can be use later in the UNTIL clause an the main query. The WITH clause can be omitte if no filtere tables are necessary. The UNTIL clause specifies the size conition to be satisfie by the sample an filtere tables. Once the conition is satisfie, the main query following the sampling requirements is execute using the sample an filtere tables. Figure 5 shows an example query with the sampling requirements. The query is base on the atabase schema of the TPC-H benchmark. It calculates the average total price per customer who live in Japan an orere in Experimental Evaluation In the experiments, we compare our multi-unit ranom clustering with traitional single-unit ranom clustering Setting The atabase system use in the experiments was Amazon Reshift c2.xlarge single noe. We use the TPC-H benchmark atasets with a scale factor of 300. To simplify the explanation, we moifie table lineitem so that it woul have a primary key attribute l linekey that stores row is because the original lineitem oes not have a single primary key. After that, we create three ientical copies of the TPC-H atabase in the same system an clustere these three atabases using ifferent clustering techniques. We set L = 31 because it is large enough to keep the smallest sample small. All unit keys ha integer values, an we use a pairwise inepenent hash function of the form h(v) = (a v + b) mo p mo 2 L. In the first atabase, MU, each table was clustere using

6 query ::= [ sampling requirements ] main query sampling requirements ::= sample clause [ with clause ] until clause sample clause ::= SAMPLE clustere table [AS sample table] B target unit key {, clustere table [AS sample table] B target unit key} until clause ::= UNTIL size conition Figure 4 SQL grammar with sampling requirements SAMPLE orers AS o sample B o custkey, customer AS c sample B c custkey } (1) WITH filtere AS (SELECT * FROM o sample, c sample, nation WHERE o custkey = c custkey AND c nationkey = n nationkey AND n nation = JAPAN AND o orerate BETWEEN AND ) (2) UNTIL 1000 < = (SELECT COUNT(DISTINCT o custkey) FROM filtere) (3) SELECT AVG(sum) FROM (SELECT SUM(price) FROM filtere GROUP B o custkey) (4) Figure 5 Example query that calculates the average total sales in 2015 per customer who live in Japan multi-unit ranom clustering. We esignate all primary keys an foreign keys as unit keys for all tables. As a result, lineitem was assigne four unit keys: l linekey, l orerkey, l partkey, an l suppkey. Next, level an level sum attributes were appene to all tables, an the tables were clustere in the manner explaine in Section In the secon atabase, MU*, each table was clustere using multi-unit ranom clustering in the same way as MU except that the level sum was not inclue in the clustering key. This means the buckets in the tables of MU* were not arrange in orer of expecte size. Thus, unlike MU, MU* i not have the benefit of clustering small buckets. The thir atabase, SU, was a atabase that shows the effect of the traitional single-unit ranom clustering technique that consiers a row as the sampling unit. SU was basically the same as MU, but we change the level attributes * level to hash attributes * hash that store raw hash values h(v) instea of Level(v). Each table was clustere using the hash attribute of the primary key. The tables in this atabase can be viewe as having been ranomly clustere because the rows were shuffle accoring to the hash values. These three atabases containe exactly the same ata except the newly appene attributes for clustering. In the following experiments, we execute ientical queries over these atabases. The atabases returne the same results, but ha ifferent query performances because their tables were clustere using ifferent techniques Experiment 1: Sample Table Generation We evaluate the efficiency of extracting a sample table from a clustere table for ifferent target levels by irectly querying the atabase system with traitional SQL queries. We chose lineitem as the target table from which the sample tables were generate, because lineitem is the largest table in the TPC-H ataset an it has four unit keys, the largest number among the tables. Queries were execute with col cache ; i.e., all selecte rows were rea from isk. We submitte the following queries to the three atabases. A SELECT COUNT(DISTINCT unitkey) FROM lineitem WHERE unitkey level > = l B SELECT COUNT(DISTINCT unitkey) FROM lineitem WHERE unitkey hash < 2 L l where l is an integer in [0, L]. A is for MU an MU*, an B is for SU. Both queries return the number of units in the sample table, i.e., the sample size. l is a parameter to control the sample size. Upon receiving these queries for the same l, the three atabases must return the same sample size because the rows satisfying unitkey level > = l in the MU an MU* are equal to the rows satisfying unitkey hash < 2 L l in the SU. The results are shown in Figure 6. We varie l from L to 0 an plotte the query time versus the sample size. The graph for l linekey exemplifies the case that the sampling unit matches the unit of single-unit ranom clustering. For this key, the three atabases showe almost the same performance. For the other three keys, however, SU performe significantly worse than l linekey while MU still worke efficiently, as well as l linekey. This result shows how the unit mismatch problem egraes the efficiency of single-unit ranom clustering an that multi-unit ranom clustering can avoi this problem. The ifference between SU an MU was large when the sample size was moerate, but was small in the minimum an maximum limit because the minimum sample require a constant query processing cost an the maximum sample require a full scan of the table. MU performe more efficiently than MU* for l partkey an l suppkey. This ifference shows the effect of using the level sum as the first element of the clustering key, which arranges the smallest buckets in contiguous blocks Experiment 2: Iterative Sampling In Experiment 2, we evaluate the efficiency of the iterative sampling algorithm by submitting sampling queries to our sampling system. We esigne the sampling queries on the basis of three TPC-H queries: Q4, Q13, an Q17. These

7 l l ^ l l l l ^ l l l l l le le l ^^ Dh Dh ^h l l l le le ^^ Dh Dh ^h l l ^ l l l l ^ l l l l l le le ^^ l l l le ^^ Dh Dh ^h Dh Dh ^h Figure 6 Query times require to generate sample tables of varying sizes from the three atabases with respect to the four unit keys in lineitem. ^ l l l l Dh D^h ^ l l l Dh D^h ^ l l l Dh D^h l l l le le l l l l l l le le ^^ ^^ ^^ Figure 7 Query times require to answer sampling queries base on TPC-H queries Q4, Q13, an Q17 for varying sample sizes. queries are typical examples that suffer from the unit mismatch problem because their aggregation units are numerically istinct values in foreign keys. We rewrote these queries in extene SQL by appening sampling requirements. The target sampling unit an the filtering criteria were etermine accoring to the nature of the queries. For all queries, we set the size conition so as to stop sampling when l = θ, where θ is an integer scaling parameter in [0, L]. For each θ, the sampling system ran size conition queries from l = L to l = θ, then performe the main query for l = θ. For each θ, we compare the performance of our iterative sampling in the MU atabase with the performance of irectly running the main query for the final sample with l = θ in the SU atabase without running conition queries. To perform sampling in SU, the main query was execute with tables restricte by unitkey hash < 2 L l instea of unitkey level > = l in the same manner as Experiment 1. The results are shown in Figure 7. Total-MU represents the costs to run iterative sampling on the MU atabase, i.e., the total costs to run the conition queries an the main queries. Main-SU represents the costs to run the main queries for the final sample in the SU atabase. Exact represents the costs to get exact answers by running the original TPC-H queries. Although the costs of iteratively running conition queries were inclue in Total-MU, Total-MU was still significantly more efficient than Main-SU except when the sample size was very large. For Q17, a large number of blocks were rea in Main-SU even when the sample size was small because the atabase system chose to join tables to avoi inefficient ranom accesses. Nevertheless, Total-MU still worke well for Q17 because the sample rows were clustere well. These results show that samples can be efficiently extracte from the tables clustere by multi-unit ranom clustering. 5. Relate Work Sample generation. There are a number of tech-

8 niques for ranomly sampling rows from isk-resient files or atabases. They can be classifie as to whether the isk access metho is a sequential scan or ranom access. Reservoir sampling [12] is a typical example of sequential techniques. Among ranom techniques, the technique propose by Olkan an Rotem generates samples using an inexing structure like B+-tree [10] or hash file [11]. A rawback of the sequential-scan-base techniques is to nee to process the whole ata to get a ranom sample. Ranom techniques can avoi having to make a whole scan, but still require, at maximum, as many ranom block accesses as the sample size. To reuce I/Os, several stuies ranomly collect blocks instea of rows, e.g. [5]. However, rows in the same block may have a strong correlation with each other an cannot be seen as an inepenent ranom sample. The most closely relate approach to ours is istinct sampling [6], which generates samples by sequential scan consiering istinct values of a target attribute as the sampling units an using them to answer queries asking for the number of istinct values. Distinct sampling can eal with ifferent sampling units, but nees to scan the whole ata to generate samples, an it only eals with istinct value queries. AQP base on precompute samples. AQP systems typically store precompute synopses incluing samples, histograms, an wavelets for approximate aggregation. AQUA [2] stores join synopses [3] for foreign key joins an congressional samples [1] for group-by queries. BlinkDB [4] is an AQP system storing stratifie samples. Online Aggregation. Online aggregation [8] is an AQP framework base on runtime sample generation. It presents to the user a running estimate with an error boun base on the samples retrieve so far. The first online aggregation paper [8] consiere aggregation in a single table. This approach has been extene to multi-table joins [7], large tables [9] an neste queries [13]. The single-unit ranom clustering technique has been use for online aggregation in many stuies [7], [9], [13], but it cannot work efficiently if the sampling unit oes not match the unit of ranom clustering. Our clustering technique is a solution to this problem. 6. Conclusion We propose a ranom clustering technique that enables fast sample generation from isk-resient tables for multiple sampling units. It is a solution to the unit mismatch problem inherent to single-unit ranom clustering, a traitional technique use for run-time sample generation. Our technique can be use to help ata analysts by generating a sample until it satisfies their requirements. Experimental results showe our multi-unit clustering worke equally efficiently for ifferent sampling units, whereas single-unit ranom clustering worke well only for a single unit an performe significantly worse for other units because of the unit mismatch problem. References [1] S. Acharya, P. B. Gibbons, an V. Poosala. Congressional samples for approximate answering of group-by queries. In Proceeings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD 00, pages , New ork, N, USA, ACM. [2] S. Acharya, P. B. Gibbons, V. Poosala, an S. Ramaswamy. The aqua approximate query answering system. In Proceeings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD 99, pages , New ork, N, USA, ACM. [3] S. Acharya, P. B. Gibbons, V. Poosala, an S. Ramaswamy. Join synopses for approximate query answering. In Proceeings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD 99, pages , New ork, N, USA, ACM. [4] S. Agarwal, B. Mozafari, A. Pana, H. Milner, S. Maen, an I. Stoica. BlinkDB: Queries with boune errors an boune response times on very large ata. In Proceeings of the 8th ACM European Conference on Computer Systems, EuroSys 13, pages 29 42, New ork, N, USA, ACM. [5] S. Chauhuri, G. Das, an U. Srivastava. Effective use of block-level sampling in statistics estimation. In Proceeings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD 04, pages , New ork, N, USA, ACM. [6] P. B. Gibbons. Distinct sampling for highly-accurate answers to istinct values queries an event reports. In Proceeings of the 27th International Conference on Very Large Data Bases, VLDB 01, pages , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. [7] P. J. Haas an J. M. Hellerstein. Ripple joins for online aggregation. In Proceeings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD 99, pages , New ork, N, USA, ACM. [8] J. M. Hellerstein, P. J. Haas, an H. J. Wang. Online aggregation. In Proceeings of the 1997 ACM SIGMOD International Conference on Management of Data, SIGMOD 97, pages , New ork, N, USA, ACM. [9] C. Jermaine, S. Arumugam, A. Pol, an A. Dobra. Scalable approximate query processing with the bo engine. ACM Trans. Database Syst., 33(4):23:1 23:54, Dec [10] F. Olken an D. Rotem. Ranom sampling from b+ trees. In Proceeings of the 15th International Conference on Very Large Data Bases, VLDB 89, pages , San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. [11] F. Olken, D. Rotem, an P. Xu. Ranom sampling from hash files. In Proceeings of the 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD 90, pages , New ork, N, USA, ACM. [12] J. S. Vitter. Ranom sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37 57, Mar [13] K. Zeng, S. Agarwal, A. Dave, M. Armbrust, an I. Stoica. G-OLA: Generalize on-line aggregation for interactive analysis on big ata. In Proceeings of the 2015 ACM SIG- MOD International Conference on Management of Data, SIGMOD 15, pages , New ork, N, USA, ACM.

Non-homogeneous Generalization in Privacy Preserving Data Publishing

Non-homogeneous Generalization in Privacy Preserving Data Publishing W. K. Wong, Nios Mamoulis an Davi W. Cheung Department of Computer Science, The University of Hong Kong Pofulam Roa, Hong Kong {wwong2,nios,cheung}@cs.hu.h