Bitmap Index Partition Techniques for Continuous and High Cardinality Discrete Attributes

Songrit Maneewongvatana
Department of Computer Engineering, King Mongkut's University of Technology Thonburi, Thailand
songrit@cpe.kmutt.ac.th
For submission to InTech 2003

Abstract

Bitmap indexing is a technique for indexing data. Its main advantage is that boolean operations on bitmaps are very fast, which is essential for queries in OLAP applications. Bitmap indexing is typically used for low cardinality attributes, since the overall space requirement depends on the cardinality. For high cardinality attributes, a range of contiguous values is generally associated with a single bitmap to reduce the space requirement. This technique requires an additional step, the candidate check, which examines the actual records to verify whether they satisfy the query condition. In this paper, we study techniques for partitioning the attribute domain into intervals, each of which is assigned to a bitmap. The goal is to minimize the candidate check cost. We propose two partitioning techniques. The first is for the situation in which the query distribution is similar to the data distribution in the database; we prove that this partitioning scheme is optimal when the query and data distributions are the same. The second uses a set of training queries in addition to the data in the database. The partition generated by this technique has the minimal candidate check cost with respect to the training query set. We consider both equality queries and range queries.

1 Introduction

Indexing is a well-known methodology for optimizing query performance on large databases. Many indexing data structures have been proposed for traditional database applications, but the most popular is the B-tree family [3, 6]. The B-tree and its variants provide fast access and relatively efficient index maintenance.
Both properties are essential for traditional OLTP (OnLine Transaction Processing) applications, where the system has to handle many concurrent insert, delete and update operations. However, a B-tree takes up a lot of space, which can slow down retrieval if the index cannot fit in main memory.

Databases are no longer used only for OLTP applications; OLAP (OnLine Analytical Processing) and data warehousing applications are increasingly common. In data warehousing applications, the data are relatively static, with periodic bulk inserts. Selection criteria are usually complex, consisting of conditions on many attributes joined by boolean operations. With these access patterns, the B-tree or its variants might not be the best indexing structure. Bitmap indexing usually provides superior performance in such scenarios. It offers an alternative to the B-tree for indexing attribute values: instead of a list of RIDs (record IDs), a string of bits, or bitmap, stores the information about which records contain a particular attribute value. The main advantage is that boolean operations on a set of bitmaps are very fast. This is essential for data warehousing applications, since most queries take the form of conditions on multiple attributes connected by logical operators. Another benefit of bitmap indexing is that it requires less space for attributes with low or medium cardinality. However, attributes with high cardinality, such as typical scientific data stored in floating-point format, are not suitable for a simple bitmap index scheme, since the large space requirement outweighs its other strengths.

A common technique for accommodating high cardinality attributes is to assign a bitmap to a set of attribute values [10]. The set of all possible values is partitioned into $k$ subsets, each with an associated bitmap. For categorical attributes, each subset can be expressed as an enumeration of attribute values. For continuous attributes, we can assign a contiguous range of values to each subset. A bit in a bitmap is set if the corresponding record contains one of the values associated with the bitmap. A query can therefore produce false hits: records whose bit is set in an accessed bitmap but whose attribute value does not satisfy the query. An additional stage is thus required during retrieval to filter the set bits so that only records with the desired attribute value are retrieved. The filtering stage involves accessing the actual database on disk and is therefore time-consuming. Choosing the right partition can reduce this overhead.

The data distribution (the distribution of the attribute values in the database) determines the size of the attribute set and how it should be partitioned. The attribute set must contain all values in the data distribution. The query distribution also affects the choice of partition. In some applications, it is possible to obtain the query distribution or a training set of queries, and this information can be used for optimization during partitioning. The goal is to achieve the minimal cost with respect to the data and query distributions.

In this paper we focus on continuous and high cardinality discrete attributes. We improve the partitioning algorithm used to create bitmap indexes for such attributes.
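To make the filtering stage concrete, the following Python sketch builds a binned bitmap index and answers an equality query. It is an illustration only, not the implementation used in this paper: the function names, the list-of-bits bitmap representation, the half-open intervals and the sample values are all assumptions, and query values are assumed to lie inside the attribute domain.

```python
from bisect import bisect_right

def bin_of(boundaries, value):
    """Index of the interval [boundaries[i], boundaries[i+1]) holding value.
    The last interval is treated as closed so the domain maximum has a bin."""
    return min(bisect_right(boundaries, value) - 1, len(boundaries) - 2)

def build_binned_index(values, boundaries):
    """One bitmap (a list of 0/1 bits, one bit per row) per interval."""
    bitmaps = [[0] * len(values) for _ in range(len(boundaries) - 1)]
    for row, value in enumerate(values):
        bitmaps[bin_of(boundaries, value)][row] = 1
    return bitmaps

def equality_query(values, boundaries, bitmaps, target):
    """Return the matching rows and the number of records that had to be checked."""
    bitmap = bitmaps[bin_of(boundaries, target)]
    candidates = [row for row, bit in enumerate(bitmap) if bit]
    # Filtering stage: read each candidate record and drop the false hits.
    hits = [row for row in candidates if values[row] == target]
    return hits, len(candidates)

values = [42, 47, 12, 42, 55, 40]      # hypothetical attribute column
boundaries = [0, 20, 40, 50, 60]       # intervals [0,20), [20,40), [40,50), [50,60]
bitmaps = build_binned_index(values, boundaries)
hits, checked = equality_query(values, boundaries, bitmaps, 42)
print(hits, checked)  # [0, 3] 4
```

In this example four records are read to return two, so half of the accessed records are false hits; a partition that isolates frequently queried values would reduce that overhead.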
Our contributions include:

We present a method for partitioning the attribute set when the actual query distribution is unknown but assumed to be the same as the data distribution. We also prove that this partition is optimal with respect to the data and query distributions.

We present a partitioning algorithm for the case in which a query training set is available. This algorithm extends an existing algorithm for finding a partition for bitmap indexes of high cardinality discrete attributes [10]. Its major improvements are: 1) it is applicable to non-discrete attributes; 2) it reduces the number of candidate partition points; 3) it allows range queries as well as equality queries. We also prove that our algorithm, even with a smaller set of candidate partition points, is still optimal.

2 Related Work

Bitmap indexing is an index structure consisting of a collection of bitmaps. It has been used in numerous applications, including commercial database systems such as Oracle, Informix and Sybase [7]. Bitmap indexing was first introduced by O'Neil for the Model 204 DBMS [11]. Since then, several improvements have been proposed [12, 5, 15, 13, 2]. A major advantage of bitmap indexing is that complex bitmap selections can be performed very quickly using bitwise boolean operations such as AND, OR, XOR and NOT. Its small space requirement, especially for low cardinality attributes, is another benefit.

In [12], the authors reviewed simple bitmap indexing and introduced two approaches for encoding the bitmaps. The first, the projection index, stores the sequence of attribute values in tuple-id order. The projection index is particularly efficient when the query results are the values of the indexed attribute for all tuples that satisfy the query criteria, since it is faster to scan the smaller projection index than to scan the full table. The second is the bit-sliced index. Its organization is somewhat orthogonal to that of the projection index.
Each bit-slice holds only the bits from a single position of the encoding of the attribute value. For example, if $k$ bits are required to encode all possible attribute values (that is, the number of possible attribute values can be up to $2^k$), then the number of bit-slices is $k$. Wu and Buchmann extended the bit-slice approach in [15] and called it encoded bitmap indexing. Encoded bitmap indexing has the same space requirement as the bit-sliced index but adds flexibility in how the attribute values are encoded. An encoded bitmap index has a separate mapping table that maps each attribute value to a unique $k$-bit vector. Each bit in this vector corresponds to a bitmap. The mapping function can be adjusted so that the number of bitmaps accessed is minimized for some queries. The bit-sliced index is a special case of encoded bitmap indexing in which the mapping function maps each attribute value to its own binary representation.

Chan and Ioannidis generalized bitmap encoding using a two-dimensional framework: bitmap index decomposition and bitmap encoding scheme [4]. Bitmap index decomposition determines how a collection of bitmaps relates to the set of attribute values. For example, in a simple bitmap scheme, there is a one-to-one mapping from bitmaps to attribute values, and the number of bitmaps required equals the cardinality of the attribute. In more space-economical schemes, the bitmaps are divided into groups. Each bitmap in a group determines a unique group value, and the combination of the group values of all groups identifies a particular attribute value. The bitmap encoding scheme decides which bit(s) in the bit vector should be set to 1. In the equality encoding scheme, only the bitmap(s) corresponding to the attribute value are set. The same authors also introduced two new encoding schemes, range encoding and interval encoding [4, 5]. Each of these encoding schemes suits different types of queries.

Another way to reduce the space requirement of bitmap indexing is compression. Each bitmap is compressed separately. The compressed bitmaps are generally much smaller than the uncompressed ones, especially for sparse bitmaps. A typical compression method is run-length encoding (RLE), which keeps the distances between adjacent set bits (1 bits in this case). Other compression algorithms can also be used, for example gzip/zlib [8], the byte-aligned bitmap code (BBC) [2] or the word-aligned hybrid code (WAH) [14].
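As a small illustration of the run-length idea mentioned above, the Python sketch below stores, for each set bit, the number of 0 bits that precede it. The function names and the gap-list representation are assumptions made for this example; this is not the BBC or WAH format, both of which use more elaborate byte-aligned or word-aligned layouts.

```python
def rle_encode(bits):
    """Return (gaps, length): gaps[i] is the count of 0 bits before the i-th 1 bit."""
    gaps, run = [], 0
    for bit in bits:
        if bit:
            gaps.append(run)
            run = 0
        else:
            run += 1
    return gaps, len(bits)

def rle_decode(gaps, length):
    """Rebuild the bitmap; trailing 0 bits are restored from the stored length."""
    bits = []
    for gap in gaps:
        bits.extend([0] * gap)
        bits.append(1)
    bits.extend([0] * (length - len(bits)))
    return bits

bitmap = [0, 0, 1, 0, 0, 0, 1, 1, 0]
gaps, length = rle_encode(bitmap)
print(gaps, length)  # [2, 3, 0] 9
```

A sparse bitmap with few set bits yields a short gap list, which is why this family of encodings works best for sparse bitmaps.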
The main goal of compression for bitmap indices is to reduce the size of the bitmaps as much as possible (because smaller bitmaps mean faster disk scans) while maintaining fast logical operations, the main strength of bitmap indexing. A specially designed algorithm such as BBC allows logical operations to be performed directly on the compressed bitmaps; speed is therefore its main advantage. The performance of compressed bitmaps is discussed in [9, 1, 14].

Most bitmap indices were designed for low cardinality discrete attributes. However, bitmap indexes can also be applied to non-discrete attributes, which are common in scientific communities [13]. It is impractical to assign a bitmap to each possible value of a typical floating-point variable. One approach to this problem is to assign a bitmap to a set of attribute values. A method for continuous attributes is proposed in [13]: the attribute domain is partitioned uniformly and each interval is covered by a bitmap. The related task of finding an optimal partition with respect to the data and query distributions is addressed in [10].

3 Problem Definition

For an attribute $A$ of a table $T$ of size $n$, a bitmap index is a set of $k$ bitmaps. The value of $k$ is determined by the cardinality of $A$, the encoding scheme, the decomposition and the domain partitioning. Each bitmap $B$ is a string of $n$ bits, $b_1, b_2, \ldots, b_n$. Bit $b_i$ corresponds to the value of $A$ in row $i$. The attribute value and the encoding scheme determine whether a bit is set (1) or reset (0). For simplicity, we focus only on the indexed attribute $A$ and ignore the remaining attributes; with this in mind, the record value implies the attribute value. Also, the algorithms are presented in the simple bitmap context, but they can generally be applied to other bitmap schemes.

Let us discuss other concepts related to domain partitioning. The attribute domain $X$ is the set of all possible attribute values.
We assume that the attribute is at least on an ordinal scale. We denote by $x_{\min}$ and $x_{\max}$ the lowest and highest values. $X$ can be expressed in interval form as $\{x_{\min}, x_{\max}\}$, where the braces $\{\,,\}$ stand for either closed $[\,,]$ or open $(\,,)$ interval ends. The domain partitioning problem is to partition $\{x_{\min}, x_{\max}\}$ into $k$ disjoint intervals $I_i$, $1 \le i \le k$. Let $\{s_i, e_i\}$ denote the interval $I_i$, which starts at $s_i$ and ends at $e_i$.

When a query $q$ is processed, the system identifies the range of attribute values that needs to be accessed. All bitmaps whose intervals intersect the query range are examined. However, some records associated with set bits may store a value outside the query range. For example, suppose bitmap $B_i$ covers the interval $[40, 50]$ and a query extracts all records with $A = 42$. For every set bit in $B_i$, the value of the corresponding record could be anything from 40 to 50, so those records must be checked. This stage is referred to as the candidate check. We model the candidate check cost as the number of records covered by the accessed bitmap. The candidate check typically dominates the query time; in some cases it consumes more than 80% of the total I/O [13].

The choice of bitmap partition affects the performance of the index because it decides how many records each bitmap covers. The basic idea is that frequently accessed bitmaps should cover relatively few records and thus have a lower candidate check cost. In the next two sections, we present methods for finding the partition with minimal candidate check cost in different situations.

The query pattern also influences how the partition should be made. There are several types of queries. An equality query seeks records that contain a certain value only. A range query specifies a query range, e.g. to retrieve all records that contain any value from 20 to 40.

4 Partitioning without Query Training Set

In [13], a straightforward partitioning was discussed. The attribute domain is partitioned uniformly so that the interval of each bitmap has the same size (the size is the number of unique attribute values for discrete attributes, or the length of the interval for continuous attributes). We refer to this method as equal-interval partitioning. We can formulate the partition as

\[ X = I_1 \cup I_2 \cup \cdots \cup I_k, \]

where $I_i = \{s_i, e_i\}$, $e_i - s_i = \frac{x_{\max} - x_{\min}}{k}$ for all $i$, $s_1 = x_{\min}$, $e_k = x_{\max}$, $s_{i+1} = e_i$ for $1 \le i < k$, and $I_i \cap I_j = \emptyset$ if $i \ne j$.

One variant of equal-interval partitioning uses the range of the data in $T$ to determine the interval of each bitmap (except the first and the last, which must extend to $x_{\min}$ and $x_{\max}$). This is more efficient when the data distribution spreads over only a small portion of the attribute domain.
Let $D = \{d_1, d_2, \ldots, d_n\}$ denote the data set in table $T$, and let $d_{\min}$ and $d_{\max}$ denote the entries with minimum and maximum values. The size of the intervals (except $I_1$ and $I_k$) changes to $\frac{d_{\max} - d_{\min}}{k}$. It is easy to find the intervals in the first form of equal-interval partitioning since $x_{\min}$ and $x_{\max}$ are known. For the second form, the table needs to be scanned to obtain $d_{\min}$ and $d_{\max}$.

In most real-world situations, the data distribution is not uniform. In some cases, the query distribution resembles the data distribution. In this paper, we propose a partitioning method called equal-density partitioning. Equal-density partitioning is optimal when the data set $D$ and the queries $Q$ come from the same distribution. Also, when the query distribution is uniform, its candidate check cost is the same as that of equal-interval partitioning.

In equal-density partitioning, the attribute domain is partitioned such that the number of records (the density) falling in each interval is roughly equal. In other words, the number of set bits in each bitmap is about the same. Sometimes it is not possible to make the density of every bitmap identical, because $n$ is not divisible by $k$ or because there are duplicate values around partition points (all records with the same value must fall in the same bitmap's interval). We therefore use the criterion

\[ \min \sum_{i} \sum_{j} \left( |I_i| - |I_j| \right)^2, \]

where $|I_i|$ is the number of data records whose values fall into the interval of bitmap $i$. Fig. 1 shows the equal-interval and equal-density partitionings. Equal-density partitioning is sensitive to the data set: an interval is smaller where the points are dense, e.g. the interval [24, 25] in the figure.

We will now show that the bitmap index obtained from equal-density partitioning has the minimal candidate check cost when the data and query distributions are the same.
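Before the proof, a short Python sketch may help compare the two partitionings numerically. It is a simplified illustration under stated assumptions: the function names and sample data are hypothetical, the equal-density routine is a plain quantile cut rather than an exact minimizer of the criterion above, duplicate values can leave it with fewer than $k$ intervals, and the query distribution is taken to equal the data distribution so that the expected cost per record is $\sum_i p_i q_i = \sum_i p_i^2$.

```python
from bisect import bisect_right

def equal_interval_boundaries(lo, hi, k):
    """k intervals of equal width over the attribute domain [lo, hi]."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k)] + [hi]

def equal_density_boundaries(data, lo, hi, k):
    """Cut the sorted data into groups of about n/k records each.
    A cut value starts a new interval, so equal values never straddle a cut."""
    ordered = sorted(data)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        cut = ordered[i * n // k]
        if lo < cut < hi and (not cuts or cut > cuts[-1]):
            cuts.append(cut)
    return [lo] + cuts + [hi]

def fractions(data, boundaries):
    """Fraction p_i of the records falling into each interval."""
    counts = [0] * (len(boundaries) - 1)
    for value in data:
        i = min(bisect_right(boundaries, value) - 1, len(counts) - 1)
        counts[i] += 1
    return [c / len(data) for c in counts]

def expected_cost(p, q):
    """Expected candidate check cost per record: sum of p_i * q_i."""
    return sum(pi * qi for pi, qi in zip(p, q))

data = [1, 2, 3, 4, 21, 22, 23, 24, 25, 26, 27, 60]
p_interval = fractions(data, equal_interval_boundaries(0, 60, 3))
p_density = fractions(data, equal_density_boundaries(data, 0, 60, 3))
print(expected_cost(p_interval, p_interval))  # about 0.458
print(expected_cost(p_density, p_density))    # about 0.333, i.e. 1/k
```

With these twelve values, the equal-interval cut places seven records in the middle bitmap, while the equal-density cut places four in each, which lowers the expected cost to $1/k$.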
[Figure 1: Equal-interval and equal-density partitionings of the domain [0, 60]. Equal-interval partitioning: [0,10), [10,20), [20,30), [30,40), [40,50), [50,60]. Equal-density partitioning: [0,15], (15,24), [24,25], (25,32), [32,38), [38,60].]

Lemma 4.1 Given a data set $D$ and a query training set $Q$ both drawn from the same distribution, the minimal candidate check cost is obtained when the intervals of all bitmaps have the same density.

Proof sketch: Assume that $n$ is divisible by $k$. We would like to assign $n/k$ data records to each bitmap. Let $p_i$ be the fraction of data records falling in the interval $I_i$ of bitmap $B_i$. Then $p_i = \frac{1}{k}$ for all $i$, because each interval has the same record density. Since the distribution of $Q$ is the same as that of $D$, the probability $q_i$ that a query falls in $I_i$ is $q_i = p_i = \frac{1}{k}$ for all $i$. For each query that falls in $I_i$, the candidate check cost is the number of records associated with $B_i$, which is $n p_i$. For simplicity, the constant factor $n$ is dropped. The expected candidate check cost is

\[ \sum_{i=1}^{k} p_i q_i = \sum_{i=1}^{k} p_i^2 = \sum_{i=1}^{k} \left( \frac{1}{k} \right)^2 = \frac{1}{k}. \tag{1} \]

We claim that this expected candidate check cost is the minimum. For any other set of intervals with $p_i \ne \frac{1}{k}$ for some $i$, we can write $p_i = \frac{1}{k} + \delta_i$, where $\delta_i \ne 0$ for at least one $i$. Since $\sum_{i=1}^{k} p_i = 1$, we have $\sum_{i=1}^{k} \delta_i = 0$, and the expected candidate check cost is

\[ \sum_{i=1}^{k} p_i q_i = \sum_{i=1}^{k} p_i^2 = \sum_{i=1}^{k} \left( \frac{1}{k} + \delta_i \right)^2 = \sum_{i=1}^{k} \left( \left( \frac{1}{k} \right)^2 + \frac{2\delta_i}{k} + \delta_i^2 \right). \tag{2} \]

The first term in Eq. 2 sums to the value of Eq. 1, the second term is $\sum_{i=1}^{k} \frac{2\delta_i}{k} = \frac{2}{k} \sum_{i=1}^{k} \delta_i = 0$, and the last term is $\sum_{i=1}^{k} \delta_i^2 > 0$ because some $\delta_i$ is not equal to 0. Therefore the expected candidate check cost of Eq. 2 is greater than that of Eq. 1, which completes the proof.

Moreover, equal-density partitioning has the same expected candidate check cost as equal-interval partitioning when the query distribution is uniform. We show this in the next lemma.

Lemma 4.2 Given a data set $D$ and a uniform query distribution, the expected candidate check cost of the bitmap index obtained from either equal-density or equal-interval partitioning is identical.

Proof sketch: We first formulate the expected candidate check cost for a uniform query distribution with equal-density partitioning, where $p_i = \frac{1}{k}$ for all $i$:

\[ \sum_{i=1}^{k} p_i q_i = \sum_{i=1}^{k} \frac{q_i}{k} = \frac{1}{k} \sum_{i=1}^{k} q_i. \tag{3} \]

Since $\sum_{i=1}^{k} q_i = 1$, the candidate check cost is $\frac{1}{k}$. For equal-interval partitioning, the candidate check cost can be derived in the same way as in Eq. 3, but now $q_i = \frac{1}{k}$ for all $i$ and $\sum_{i=1}^{k} p_i = 1$:

\[ \sum_{i=1}^{k} p_i q_i = \sum_{i=1}^{k} \frac{p_i}{k} = \frac{1}{k}. \tag{4} \]

Therefore, equal-density and equal-interval partitioning result in the same expected candidate check cost.

However, in this case, neither equal-density nor equal-interval partitioning is guaranteed to yield the minimal candidate check cost. For example, if the data distribution consists of a few tight clusters that are far away from one another, the partition that yields the minimal candidate check cost is the one whose intervals snugly contain each cluster. But this situation seldom occurs in the real world.

5 Partitioning with Query Training Set

The overall efficiency of a bitmap partition also depends on the queries. For example, if it is known that most queries fall in a specific range, that range should be partitioned into intervals that are smaller than average. This reduces the overall candidate check cost. In this section, we consider the situation in which a query training set $Q$, which reflects the actual query distribution, is available during partitioning. The size of $Q$ affects both the running time and the quality of the partitioning. Basically, $Q$ should be as compact as possible while still capturing the major characteristics of the query distribution. For a large data table, $Q$ is usually much smaller than $D$.

In this section, we extend a partitioning algorithm based on dynamic programming presented by Koudas in [10]. We first briefly discuss the original algorithm. It is designed for high cardinality discrete domains and supports only equality queries. Let $p_x$ and $q_x$ denote the number of records and the number of queries that contain attribute value $x$, respectively. The goal of the algorithm is to create a partition that minimizes the total number of false hits, based on the sets of $p_x$ and $q_x$ for all $x$ in the attribute domain. The number of false hits $F_i$ associated with the interval of bitmap $B_i$ can be defined as

\[ F_i = \sum_{j=s_i}^{e_i} q_j \sum_{\substack{s_i \le l \le e_i \\ l \ne j}} p_l, \]

and the total number of false hits is $F = \sum_{i=1}^{k} F_i$.
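The false-hit count can be evaluated directly from the per-value counts. The Python sketch below is a minimal illustration for an integer-valued attribute; the function names, the dictionary representation of the counts and the sample numbers are assumptions for this example and are not taken from [10].

```python
def false_hits(p, q, start, end):
    """False hits of one interval covering the integer values start..end.
    p[x] and q[x] are the numbers of records and equality queries with value x."""
    total_p = sum(p.get(x, 0) for x in range(start, end + 1))
    # A query on value j reads every record in the interval; all records
    # whose value differs from j are false hits.
    return sum(q.get(j, 0) * (total_p - p.get(j, 0))
               for j in range(start, end + 1))

def total_false_hits(p, q, intervals):
    """Sum of the false hits over all intervals of a partition."""
    return sum(false_hits(p, q, s, e) for s, e in intervals)

p = {1: 3, 2: 0, 3: 5, 4: 2}   # hypothetical record counts per value
q = {1: 2, 3: 1}               # hypothetical query counts per value
print(false_hits(p, q, 1, 3))                     # 13
print(total_false_hits(p, q, [(1, 2), (3, 4)]))   # 2
```

Separating the two queried values 1 and 3 into different intervals removes most of the false hits in this small example, which is the effect the partitioning algorithm tries to achieve in general.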
Since the attribute is discrete, its cardinality $m$ is finite. Dynamic programming is used to efficiently find the partition with $k$ intervals that minimizes the total number of false hits. Suppose that the attribute domain is embedded in a horizontal line and the attribute values are sorted from left to right. We can state the problem of finding the optimal $k-1$ partition points in recursive form. First, find the optimal split point, split1, that partitions the attribute domain into an interval on the left, $[x_{\min}, \text{split1})$, and a collection of $k-1$ intervals on the right, $(\text{split1}, x_{\max}]$: split1 = findoptimalsplit($x_{\min}$, $x_{\max}$, $D$, $Q$, $S$), where $S$ is the set of split point candidates (that is, the set of attribute domain values). We then split the range on the right side: split2 = findoptimalsplit(split1, $x_{\max}$, $D - D_1$, $Q - Q_1$, $S - S_1$), where $D_1$, $Q_1$ and $S_1$ are the data records, queries and split point candidates in the first (leftmost) interval. The range on the right side is split recursively until $k$ intervals are found.

A naive recursive algorithm tests all possible sets of $k$ unique values and reports the one with the lowest total number of false hits. This method requires $O(m^k)$ time. Dynamic programming can reduce the running time significantly, to $O(m^2 k)$, by first computing the solutions of smaller subproblems; these solutions are then reused when the algorithm solves a larger subproblem. The details of the dynamic programming technique for partitioning the attribute domain are presented in [10].

Our algorithm uses a slightly different criterion: the candidate check cost instead of the number of false hits. The candidate check cost includes both true hits and false hits. Either criterion yields the same partition, since the number of true hits remains constant for a particular query. The candidate check cost more closely reflects performance, since it is proportional to the number of accessed records.

Theoretically, the cardinality of a continuous attribute is infinite. But since such attributes are typically stored as fixed-length variables, the cardinality of the stored data is very large but finite. Even with dynamic programming, it is inefficient to consider every unique value in the function findoptimalsplit(). We observe that any attribute value $x$ with $p_x = q_x = 0$ can be removed from the set $S$, because the candidate check cost depends only on $p_x$ and $q_x$. With this reduction, finding the optimal partition can be significantly faster than with the original algorithm in [10] if $D$ and $Q$ are not spread uniformly over the attribute domain, which is quite common for both continuous and high cardinality discrete attributes.

We can further reduce the size of $S$ and improve the speed of the algorithm by removing from $S$ any attribute value $x$ with $q_x = 0$. We will later prove that there exists an optimal partition that does not use such a value as a split point. However, we require two split point candidates for each value $x$ with $q_x \ne 0$, because $x$ can be assigned either to the interval on its left (whose end is then closed) or to the interval on its right (whose start is then closed). Fig. 2 illustrates this concept and the markers $x^-$ and $x^+$. The set $S$ can now be expressed as $S = \{x^-, x^+ \mid q_x \ne 0\}$. As discussed above, continuous attributes must be discretized before they are stored in the system. Hence, in actual implementations, we can use $x$ to represent $x^-$ and the smallest value greater than $x$ to represent $x^+$. If two query values $x$ and $y$ are adjacent and $x < y$, then $x^+$ and $y^-$ can be combined and represented by $y$. We now give the proof that the set $S = \{x^-, x^+ \mid q_x \ne 0\}$ is sufficient for the algorithm to find an optimal partition.
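Before turning to the proof, the dynamic program over the reduced candidate set can be sketched in Python as follows. This is a hedged illustration for equality queries only, not the code of [10] or of this paper: the name optimal_partition, the dictionary inputs and the sample counts are assumptions. A cut in front of a stored value plays the role of $x^-$ for that value and of $x^+$ for the preceding one, and a cut is kept only if one of those two values is queried. Because splitting an interval never increases the cost, the sketch returns the best partition with at most $k$ intervals.

```python
from itertools import accumulate

def optimal_partition(p, q, k):
    """p[x], q[x]: numbers of records / equality queries with value x.
    Returns (cost, cuts); each cut value starts a new interval."""
    values = sorted(x for x in set(p) | set(q) if p.get(x, 0) or q.get(x, 0))
    m = len(values)
    P = [0] + list(accumulate(p.get(x, 0) for x in values))  # prefix sums of p
    Q = [0] + list(accumulate(q.get(x, 0) for x in values))  # prefix sums of q
    # Reduced candidate set: keep a cut before values[c] only if it is
    # adjacent to a queried value.
    pos = [0] + [c for c in range(1, m)
                 if q.get(values[c - 1], 0) or q.get(values[c], 0)] + [m]

    def cost(a, b):
        # Candidate check cost of one interval holding values[a:b]:
        # (queries in the interval) * (records in the interval).
        return (Q[b] - Q[a]) * (P[b] - P[a])

    INF = float("inf")
    T = len(pos)
    # best[j][t]: minimal cost of covering values[:pos[t]] with j intervals.
    best = [[INF] * T for _ in range(k + 1)]
    back = [[None] * T for _ in range(k + 1)]
    best[0][0] = 0
    for j in range(1, k + 1):
        for t in range(1, T):
            for s in range(t):
                if best[j - 1][s] == INF:
                    continue
                c = best[j - 1][s] + cost(pos[s], pos[t])
                if c < best[j][t]:
                    best[j][t], back[j][t] = c, s
    j = min(range(1, k + 1), key=lambda n_int: best[n_int][T - 1])
    total, cuts, t = best[j][T - 1], [], T - 1
    while j > 0:                      # walk the back-pointers to recover the cuts
        t = back[j][t]
        if t > 0:
            cuts.append(values[pos[t]])
        j -= 1
    return total, cuts[::-1]

p = {10: 5, 11: 5, 12: 5, 13: 5, 40: 1, 41: 1}   # hypothetical record counts
q = {11: 4, 40: 1}                               # hypothetical query counts
print(optimal_partition(p, q, 3))  # (32, [11, 12])
```

In this example the frequently queried value 11 ends up alone in its interval, and the cut between 12 and 13 is never considered because neither value is queried. The three nested loops run in time proportional to $k |S|^2$, so shrinking $S$ pays off quadratically.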
[Figure 2: Two split points for $x$, $x^-$ and $x^+$. Splitting $[a, b]$ at $x^-$ gives $[a, x)$ and $[x, b]$; splitting at $x^+$ gives $[a, x]$ and $(x, b]$.]

Lemma 5.1 Given a data set $D$, a query training set $Q$ and the set of split point candidates $S = \{x^-, x^+ \mid q_x \ne 0\}$, it is possible to find a partition $I$ that has the minimum candidate check cost with respect to $D$ and $Q$.

Proof sketch: Suppose that there is a partition $I$ that has a split point $s \notin S$. We show that we can move the split point from $s$ to a nearby point $s' \in S$ without additional candidate check cost. Let $s_l^+, s_r^- \in S$ be the nearest candidate points to the left and right of $s$. Since $s$ is a split point, we can define $I_l$ and $I_r$ to be the intervals on the left and right side of $s$, and $p_{I_l}$, $q_{I_l}$ to be the numbers of data records and queries that fall in interval $I_l$ (similar definitions apply to $I_r$). The candidate check cost of $I_l$ and $I_r$ together is

\[ p_{I_l} q_{I_l} + p_{I_r} q_{I_r}. \tag{5} \]

If $q_{I_l} > q_{I_r}$, the split point can be moved to $s_l^+$. Let $I'_l$ and $I'_r$ be the new intervals on the left and right side of $s_l^+$. The candidate check cost of $I'_l$ and $I'_r$ together is

\[ p_{I'_l} q_{I'_l} + p_{I'_r} q_{I'_r}. \tag{6} \]

Since there is no query point between $s_l^+$ and $s$, $q_{I'_l} = q_{I_l}$ and $q_{I'_r} = q_{I_r}$. But there might be some $c$ data records in the interval $(s_l^+, s)$, so $p_{I'_l} + c = p_{I_l}$ and $p_{I'_r} - c = p_{I_r}$ for some $c \ge 0$. Expression (6) can be rewritten as

\[ (p_{I_l} - c)\, q_{I_l} + (p_{I_r} + c)\, q_{I_r}, \tag{7} \]

which equals (5) minus $c\,(q_{I_l} - q_{I_r})$ and is therefore not greater than (5). Hence the split point can be moved from $s$ to $s_l^+$ without any increase in candidate check cost. A similar argument applies when $q_{I_l} < q_{I_r}$: the split point can be moved from $s$ to $s_r^-$ without additional candidate check cost. If $q_{I_l} = q_{I_r}$, the candidate check cost is the same for split points $s$, $s_l^+$ and $s_r^-$.

The candidate check cost for equality queries $C_i^E$ associated with the interval of bitmap $B_i$ is $C_i^E = q_{I_i} p_{I_i}$, and the overall candidate check cost for equality queries is the sum of $C_i^E$ over all intervals.

Up to this point, the queries in the training set have been equality queries. We now extend the algorithm to range queries. For a range query $q = [s, e]$, any data values that fall in intervals intersecting $[s, e]$ can match the query and are subject to the candidate check. However, if the query range fully covers an interval, all data records in the bitmap associated with that interval are true hits, so no candidate check is required for fully covered intervals. The candidate check cost for range queries $C_i^R$ of interval $I_i$ is then

\[ C_i^R = w_i\, p_{I_i}, \]

where $w_i$ is the number of range queries whose range intersects but does not contain interval $I_i$. Notice that the candidate check cost changes only when the split points are at certain values: the start and end points of query ranges. If $[s, e]$ is a query range, $s^-$ and $e^+$ are added to the set $S$ of split point candidates. $s^+$ needs to be added to $S$ only if $s$ is an end point of another query range or if $p_s \ne 0$ (there is at least one data record with value $s$). Placing a split point at $s^+$ never gives a better candidate check cost than placing it at $s^-$ with respect to this particular range query, since the bitmap whose interval ends at $s$, $\{\ldots, s]$, still needs to be scanned. The same reasoning applies to adding $e^-$ to $S$.

6 Conclusions

In this paper, we presented two techniques for finding the partition for a bitmap index.
The first is based on the assumption that the query and data distributions are the same. The attribute domain is divided so that each interval has an equal density of data records. We proved that the candidate check cost of a bitmap index with equal-density intervals is minimal under this assumption. In addition, we showed that when the query distribution is uniform, the candidate check cost of the equal-density method is the same as that of the previously proposed equal-interval method.

The second partitioning technique is based on an existing dynamic programming technique. We extended the technique so that it can be used for continuous attributes and accepts range queries, which are common in OLAP applications. We also reduced the size of the split point candidate set without compromising optimality, which shortens the execution time of the algorithm.

It would be interesting to see how much speedup the bitmap indexes generated by these partitioning techniques can achieve. In future work, we plan to conduct experiments comparing the query time of bitmaps generated by our methods with that of the conventional method. We would also like to extend this work to more general multi-dimensional range queries.

References

[1] S. Amer-Yahia and T. Johnson. Optimizing queries on compressed bitmaps. In Proc. 26th Int. Conf. Very Large Data Bases (VLDB), pages 329-338, 2000.
[2] G. Antoshenkov. Byte-aligned bitmap compression. Technical report, Oracle Corp., 1994.
[3] R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 1(3):173-189, 1972.
[4] C.-Y. Chan and Y. Ioannidis. Bitmap index design and evaluation. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 355-366, 1998.

[5] C.-Y. Chan and Y. Ioannidis. An efficient bitmap encoding scheme for selection queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 215-226, 1999.
[6] D. Comer. The ubiquitous B-tree. ACM Computing Surveys, 11(2):121-137, 1979.
[7] H. Edelstein. Faster data warehouses. Information Week, pages 77-88, December 1995.
[8] J.-L. Gailly and M. Adler. Zlib home page. http://www.gzip.org/zlib.
[9] T. Johnson. Performance measurements of compressed bitmap indices. In Proc. 25th Int. Conf. Very Large Data Bases (VLDB), pages 278-289, 1999.
[10] N. Koudas. Space efficient bitmap indexing. In Conf. Information and Knowledge Management, pages 194-201, 2000.
[11] P. O'Neil. Model 204 architecture and performance. In Int. Workshop on High Performance Transactions Systems, pages 40-59, 1987.
[12] P. O'Neil and D. Quass. Improved query performance with variant indexes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 38-49, 1997.
[13] K. Stockinger. Design and implementation of bitmap indices for scientific data. In Int. Database Engineering and Application Sympos., pages 47-57, 2001.
[14] K. Wu, E. Otoo, and A. Shoshani. Compressing bitmap indexes for faster search operations. In Proc. 14th Int. Conf. Scientific and Statistical Database Management, pages 99-108, 2002.
[15] M.-C. Wu and A. Buchmann. Encoded bitmap indexing for data warehouses. In Proc. 14th IEEE Int. Conf. Data Engineering, pages 220-230, 1998.