As an additional safeguard on the total buffer size required we might further require that no superblock be larger than some certain size. Variable length superblocks would then require the reintroduction into the upper index of the R array, again with only a small overall impact on the total index size. For practical use we would not, however, advocate that the B array and Hirschberg's searching strategy be used, since simple binary search in an R array of even 1,000,000 integers would require at most perhaps 0.1 milliseconds. This is the indexing scheme that we have used in our compressed full-text retrieval scheme [7]. Although it provides fast access within modest memory requirements, the index (as described in this section) does not conform to the exact detail of Sections 2 and 3, and so is asymptotically inefficient in a worst-case bit-access model of computation. Nevertheless, the coded lower index is all that is necessary to generate an index that provides fast average-case random access to a large file of variable length records.

Acknowledgements

The authors gratefully acknowledge the assistance of Guy Jacobson. This work was supported by the Australian Research Council.

References

[1] R.G. Gallager and D.C. Van Voorhis. Optimal source codes for geometrically distributed integer alphabets. IEEE Transactions on Information Theory, IT-21:228-230, March 1975.
[2] S.W. Golomb. Run-length encodings. IEEE Transactions on Information Theory, IT-12:399-401, July 1966.
[3] D.S. Hirschberg. On the complexity of searching a set of vectors. SIAM Journal on Computing, 9:126-129, February 1980.
[4] G. Jacobson. Random access in Huffman-coded files. In J.A. Storer and M. Cohn, editors, Proc. IEEE Data Compression Conference, pages 368-377. IEEE Computer Society Press, Los Alamitos, CA, March 1992.
[5] M.D. McIlroy. Development of a spelling list. IEEE Transactions on Communications, COM-30:91-99, January 1982.
[6] A. Moffat. Economical inversion of large text files. Computing Systems, 5:125-139, Spring 1992.
[7] A. Moffat and J. Zobel. Coding for compression in full-text retrieval systems. In J.A. Storer and M. Cohn, editors, Proc. IEEE Data Compression Conference, pages 72-81. IEEE Computer Society Press, Los Alamitos, CA, March 1992.
[8] O. Petersson. Personal communication.

Stored naively, a record-level index would require 10^6 × 30 bits, about 3.6 Mbyte. Access to any record could then be effected in a total of 20 milliseconds, assuming that the index can be held in memory. Decoding of the desired record would take an additional millisecond, for a total of 21 milliseconds. Since skipping over the average record by decoding until a record terminator is found requires 1 millisecond, an index that points to only every b'th record will require (10^6/b) × 30 index bits, and 20 + b milliseconds will be required per record access. Suppose further that we must provide, on average, access to any record within 30 milliseconds. Then, in this simple scheme, we should choose b = 10, and thereby require 366 Kbyte of memory for the index.

Let us now consider the space required by the two level approach given the same constraints. If blocks of exactly b' records are formed they will have average length 1000b' bits, and the difference fields will on average require 12 + log b' bits each, for a total lower index of (10^6/b')(12 + log b') bits. Moreover, if superblocks each of s' records are formed then the total access time will be 20 + b' + (s'/b')(12 + log b')/1000 milliseconds. In this case we should choose b' = 9 and s' = 600; with this choice we will still be able to access any record within 30 milliseconds, provided that the lower index can be held in memory. The total size of the lower index will then be about 199 Kbyte. The upper index will require 10^6/s' entries, each containing two pointers: one to the start of the superblock in the primary file on disk; and one to the memory location of the lower index records describing the block lengths within this superblock. The upper index will thus consume at most an additional 11 Kbyte. Even stored as 32-bit integers for ease of access the upper index only requires a modest amount of memory.
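The arithmetic of this example can be checked with a few lines of Python. The figures and formulas are those of the paragraph above; the script is only a sanity check, and the 199 Kbyte and 11 Kbyte totals involve rounding choices not spelled out in the text, so it reports approximate values:

```python
from math import log2

# Parameters of the worked example: p = 10^6 records, 30-bit pointers,
# 20 ms per disk operation, decoding at 1 Kbit per millisecond.
p = 10**6

# Simple scheme: point to every b'th record.
b = 10
simple_kbyte = (p // b) * 30 / 8 / 1024          # (10^6/b) * 30 bits
simple_ms = 20 + b                               # seek, then skip b records

# Two-level scheme: blocks of b2 records, superblocks of s2 records.
b2, s2 = 9, 600
lower_kbyte = (p / b2) * (12 + log2(b2)) / 8 / 1024
access_ms = 20 + b2 + (s2 / b2) * (12 + log2(b2)) / 1000

print(round(simple_kbyte), simple_ms)            # about 366 Kbyte, 30 ms
print(round(lower_kbyte), round(access_ms, 1))   # roughly 200 Kbyte, ~30 ms
```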
If main memory is scarce then the lower index records can be interleaved on disk with the blocks of the primary file that they measure the length of. In this case a single pointer suffices, and just 6 Kbyte of main memory is enough to obtain an average random access time of 30 milliseconds. With s' = 600 and an interleaved lower index each superblock will be on average 75 Kbyte long, and a buffer of this size must be allocated to allow entire superblocks of the main file to be read. If this memory is unavailable, or if the physical characteristics of the disk are such that reading large blocks is uneconomic, s' should be reduced to allow a smaller block and buffer size. This will have only a marginal effect on the overall size of the index, since the bulk of the space is in the lower index, which is unaffected by the choice of s'.

4. process bits in the primary file starting at (P_p[block] + d), continuing until the start of the (m − R[block] − s)'th record (within the block) is found;

5. return the current bit location; it is the address of record m.

By construction, both the upper and lower indexes require O(N_1) bits, and since N_1 = O((N log log N)/log N), we have thus met our stated objective for index size. We must make a further slight modification to obtain the desired bound on the running time. Step 2 must now locate one entry in a sub-table of R that contains as many as p/p_2 = O(log^2 N) items, and the linear method of Hirschberg is no longer sufficient. However we apply his algorithm first to a collection of at most O(log N) entries, selecting every log N'th item in the set R[low ... high]. This first search will identify which region of log N items contains the desired value; we then apply the same technique a second time on the items of this region, again spending O(log N) bit accesses. These two applications of the searching strategy (see footnote 3) will require in total O(log N) bit accesses, and the overall O(log N) bound can be preserved.

As was pointed out by Jacobson, O(log log N/log N) = o(1) and tends toward zero, but only extremely slowly, and the constant factors ignored in the asymptotic analysis are very important. In the next section we consider the actual savings that are possible in one typical database application.

4 Practical Application

In practice the primary file and at least some of the index will be stored on disk, and we are more interested in the size of the memory resident component of the index and the number of disk accesses consumed than the asymptotic number of bit operations. In this section we consider one application involving large files of variable length records, and describe the performance obtained by a 'stripped down' two level indexing scheme of the type described above.
We suppose that we have a file of p = 10^6 records of total length N = 10^9 bits, i.e., 120 Mbyte. We also suppose that records in the lower index and primary file can be accessed at a rate of 1 Mbit/second, that is, 1 Kbit per millisecond; and that any disk operations require 20 milliseconds. These values are all typical for the compressed main text of a full-text retrieval system when implemented on a Sun SparcStation 2 [7]; the relatively slow access to the index and primary file is the rate at which they can be decompressed rather than a disk transfer speed.

Footnote 3: In general, for a table of M items, each of K symbols, applying Hirschberg's algorithm recursively on K items at a time will require O((K log M)/log K) symbol comparisons [8].
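The two-stage refinement of step 2 (search a subsample of every log N'th entry, then search only the selected region) can be modelled in Python. This sketch uses ordinary integer comparisons rather than Hirschberg's bit-access method, so it illustrates only the two-pass structure; all names are illustrative:

```python
from bisect import bisect_right

def two_stage_predecessor(R, m, g):
    # First pass: probe only every g'th entry of the sorted sub-table R,
    # choosing a region of g entries; second pass: scan that region.
    # Assumes R[0] <= m.
    sample = R[::g]
    lo = (bisect_right(sample, m) - 1) * g
    block = lo
    for i in range(lo, min(lo + g, len(R))):
        if R[i] <= m:
            block = i
    return block

R = [0, 4, 9, 15, 22, 30, 41, 55, 70, 88, 110, 135]
for m in range(0, 140):
    # agrees with a direct predecessor search over the whole table
    assert two_stage_predecessor(R, m, 3) == bisect_right(R, m) - 1
```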

[Figure 2: Two level structure. Upper index (arrays B, R, P_l, P_p): p_2 entries, O(N_1) bits; lower index: p_1 records, O(N_1) bits; primary file: p records, N bits.]

bits. With each pointer difference we must also store the number of records in the block in a blocksize field, but this too can be represented using the same code, and will require at most an additional N_1 = O((N log log N)/log N) bits. This lower index is now a file of variable length records, each storing a difference and a blocksize field, and random access is not possible. On top of this file of O(N_1) bits we build an index of the form described in Section 2, requiring O(N_1) additional bits. In this upper index the R and B arrays are as before, but indicate 'superblocks': blocks in the lower index each record of which corresponds to a block of records in the primary file. The P array must be cloned, with one set of pointers P_p pointing to the primary file, and one set of pointers P_l pointing into the lower level index (Figure 2). The new access sequence is described below. It is assumed that there are p_2 records in the upper index, and that p/p_2 is the nominal size of each superblock.

1. low ← B[⌊m/(p/p_2)⌋], high ← B[⌈m/(p/p_2)⌉];

2. determine block in the range low ... high such that R[block] ≤ m < R[block + 1];

3. process bits in the lower index starting at P_l[block], accumulating decompressed blocksize and difference fields until the sum of the blocksize fields is as large as possible without becoming greater than (m − R[block]); let d ← the sum of the difference fields and s ← the sum of the blocksize fields;
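The five-step access sequence can be exercised end-to-end on a self-contained toy of the two-level structure. Records are simulated by their bit lengths, and every container name here is illustrative rather than taken from the text:

```python
# Toy model of the two-level structure.
lengths = [37, 5, 90, 12, 44, 7, 63, 28, 51, 9, 70, 15]  # record sizes, bits
starts, addr = [], 0
for n in lengths:
    starts.append(addr)
    addr += n

BLK, SUP = 2, 3          # records per block, blocks per superblock
blocks = [list(range(i, min(i + BLK, len(lengths))))
          for i in range(0, len(lengths), BLK)]

# Lower index: one (blocksize, difference) pair per block.
lower = [(len(blk), sum(lengths[r] for r in blk)) for blk in blocks]

# Upper index, per superblock: first record number R, bit address P_p of
# the superblock in the primary file, position P_l in the lower index.
R, P_p, P_l = [], [], []
for sb in range(0, len(blocks), SUP):
    R.append(blocks[sb][0])
    P_p.append(starts[blocks[sb][0]])
    P_l.append(sb)

def locate(m):
    sb = max(i for i in range(len(R)) if R[i] <= m)       # steps 1-2
    d = cnt = 0
    for size, diff in lower[P_l[sb]:P_l[sb] + SUP]:       # step 3
        if cnt + size > m - R[sb]:
            break
        cnt += size
        d += diff
    addr = P_p[sb] + d                                    # step 4: scan the
    for r in range(R[sb] + cnt, m):                       # remaining records
        addr += lengths[r]
    return addr                                           # step 5

assert all(locate(m) == starts[m] for m in range(len(lengths)))
```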

code. This results in an index file of variable length records which is then indexed using the technique of the previous section. The key to reducing the space required by this lower index is the encoding used to represent the differences between successive pointers. The method we use was first described by Golomb [2] and Gallager and Van Voorhis [1], and has subsequently been used to good effect by McIlroy [5], Moffat [6], and Moffat and Zobel [7]. The code is controlled by a single parameter b. To code integer x ≥ 1 we first code (x − 1) div b in unary, and then code d = (x − 1) mod b in binary, using ⌊log b⌋ bits if d < 2^⌈log b⌉ − b, or ⌈log b⌉ bits if d ≥ 2^⌈log b⌉ − b. Some example codes for small values of b are shown in Table 1. The comma is indicative only, and does not appear in the output codeword.

    x    b = 2    b = 3    b = 5    b = 10
    1    0,0      0,0      0,00     0,000
    2    0,1      0,10     0,01     0,001
    3    10,0     0,11     0,10     0,010
    4    10,1     10,0     0,110    0,011
    5    110,0    10,10    0,111    0,100
    6    110,1    10,11    10,00    0,101
    7    1110,0   110,0    10,01    0,1100

Table 1: Example codes for differences

The use of this code to store a list of p integers summing to N or less requires B ≤ p(log(N/p) + 2) bits [6, 7], provided that b = 2^⌊log((N−p)/p)⌋; and so, as before, the lower level index will store p_1 ≤ N/(k log N) pointers, but now will require at most N_1 ≤ (N/(k log N))(log(k log N) + 2)
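The coding rule just defined translates directly into Python. The helper names `golomb_encode` and `golomb_decode` are hypothetical, but the rule implemented is exactly the unary-plus-truncated-binary code above, and the assertions reproduce entries of Table 1:

```python
def golomb_encode(x, b):
    # unary((x-1) div b), then d = (x-1) mod b in floor(log b) bits if
    # d < 2^ceil(log b) - b, else as d + (2^ceil(log b) - b) in
    # ceil(log b) bits.  Requires x >= 1 and b >= 2.
    q, d = divmod(x - 1, b)
    c = (b - 1).bit_length()          # ceil(log2 b) for b >= 2
    t = (1 << c) - b
    if d < t:
        tail = format(d, "b").zfill(c - 1)
    else:
        tail = format(d + t, "b").zfill(c)
    return "1" * q + "0" + tail

def golomb_decode(bits, b):
    q = 0                             # unary part
    while bits[q] == "1":
        q += 1
    i = q + 1
    c = (b - 1).bit_length()
    t = (1 << c) - b
    d = int(bits[i:i + c - 1], 2) if c > 1 else 0
    if d >= t:                        # long codeword: read one more bit
        d = int(bits[i:i + c], 2) - t
    return q * b + d + 1

# Rows of Table 1 (the comma is omitted in the real codeword):
assert golomb_encode(7, 5) == "1001"       # 10,01
assert golomb_encode(7, 10) == "01100"     # 0,1100
assert all(golomb_decode(golomb_encode(x, b), b) == x
           for b in (2, 3, 5, 10) for x in range(1, 50))
```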

[Figure 1: Lexicographic searching.]

2. while a ≤ z and A[a, c] < S[c] do a ← a + 1; while a ≤ z and A[z, c] > S[c] do z ← z − 1; c ← c + 1;

3. repeat step 2 until either c = K or a > z;

4. return z; it is the index in A of the string lexicographically preceding S.

This algorithm is a variant [8] of a method presented by Hirschberg [3], and solves the K-dimensional searching problem in O(K + M) time. When applied to the searching step 2 of the lookup algorithm above, we have M ≤ p/p_1 = O(log N) and K = ⌈log N⌉, with a total searching cost of O(log N) bit accesses. Since every step of the lookup process can be implemented in O(log N) bit accesses, we have thus met our first goal: provision of an O(N) space index that allows random access using O(log N) bit accesses.

3 Adding a Second Level

To further reduce the space required by the index we interpose an additional level of blocking between the primary file and the index of the previous section. As before, we break the primary file into blocks so that each record is within k log N bits of the start of its block. Then, rather than store these pointers absolutely, we code the differences between pointers using a prefix-free variable length
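Steps 1-4 of the searching algorithm translate directly into Python. This toy operates on a sorted array of equal-length strings and assumes, as the figure does, that S is no smaller than A[0]:

```python
def lex_search(A, S):
    # Variant of Hirschberg's method: find z with A[z] == S, or the index
    # of the lexicographic predecessor of S in the sorted array A of
    # equal-length strings; O(K + M) symbol comparisons in total.
    M, K = len(A), len(A[0])
    c, a, z = 0, 0, M - 1                     # step 1
    while c < K and a <= z:                   # step 3: repeat step 2
        while a <= z and A[a][c] < S[c]:      # step 2
            a += 1
        while a <= z and A[z][c] > S[c]:
            z -= 1
        c += 1
    return z                                  # step 4

A = ["0011", "0101", "0110", "1001", "1100"]
assert lex_search(A, "0110") == 2      # exact match
assert lex_search(A, "1000") == 2      # predecessor of "1000"
```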

the block containing the m'th record has at least been roughly located. The full access sequence to determine the location of record m is then:

1. low ← B[⌊m/(p/p_1)⌋], high ← B[⌈m/(p/p_1)⌉];

2. determine block in the range low ... high such that R[block] ≤ m < R[block + 1];

3. process bits in the primary file starting at P[block] until the start of the (m − R[block])'th record (within the block) is found;

4. return the current bit location; it is the address of record m.

We will defer discussion of step 2 for the moment. Instead, let us count the space requirements of this structure. Array P contains at most N/(k log N) pointers, each requiring ⌈log N⌉ bits. This totals N/k bits, which is O(N). Array R contains the same number of record numbers, each requiring ⌈log p⌉ ≤ ⌈log N⌉ bits, contributing in total not more than N/k bits. Finally, each item in array B is in the range 1 ... p_1, with p_1 ≤ N/(k log N) < N. In total, no more than 3N/k bits are required by the three index arrays.

Now let us return to the time requirements. Step 3 requires at most k log N bit accesses, by design. Steps 1 and 4 do not dominate this, leaving step 2 as the only problem. In this step we must search an ordered set of record numbers, where each record number requires at most ⌈log N⌉ bits. As noted above, high − low ≤ p/p_1 ≤ k log N, but this restriction is still not enough to allow a normal binary search, since Θ(log log N) probes would be required, each necessitating Θ(log N) bit accesses. Jacobson [4] solved a similar problem by noting that R[high] − R[low] ≤ k log N, and thus that only the low order ⌈log(k log N)⌉ bits needed to be accessed in each probe of the binary search, resulting in a total step 2 cost of O((log log N)^2) = O(log N). We suggest an alternative approach, described by the following algorithm.
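The four steps can be exercised on a toy file in which records are represented only by their bit lengths. The blocking of three records per block is arbitrary, and the B array of step 1 is omitted; the sketch simply scans all of R for step 2:

```python
# Toy model of the one-level index: arrays P (bit addresses) and R
# (record numbers of the first record in each block).
lengths = [40, 8, 75, 20, 33, 60, 11, 52]   # record sizes in bits
starts, addr = [], 0
for n in lengths:
    starts.append(addr)
    addr += n

P = [starts[r] for r in range(0, len(lengths), 3)]
R = [r for r in range(0, len(lengths), 3)]

def access(m):
    block = max(i for i in range(len(R)) if R[i] <= m)  # step 2
    addr = P[block]                                     # step 3: skip the
    for r in range(R[block], m):                        # (m - R[block])
        addr += lengths[r]                              # preceding records
    return addr                                         # step 4

assert all(access(m) == starts[m] for m in range(len(lengths)))
```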
We suppose that A[i, j] is an array of symbols, 0 ≤ i < M, 0 ≤ j < K for some M and K, and that S[j], 0 ≤ j < K is a string of K symbols we wish to search for in A, to find either an exact match z for which A[z, j] = S[j], 0 ≤ j < K, or the entry A[z] that is the lexicographic predecessor of S (Figure 1):

1. c ← 0, a ← 0, z ← M − 1;

bit of the next subsequent record, and so on down the file. This rule ensures that no record starts further than k log N bits from the pointer to the block containing the record, and is sufficient to guarantee that the total number of pointers in the index is less than or equal to N/(k log N). With this structure, to access record m we must identify the 'last pointer' prior to m, and then access at most k log N bits within that block to locate the start of record m.

Identifying the block containing record m requires some care. If we associate with each pointer P[i] the ordinal record number R[i] in the primary file that it points to, then a search of the (sorted) array R will suffice to locate the correct pointer into the primary file. However a straightforward binary search may require Θ(log(N/(k log N))) = Θ(log N) probes, each accessing a record number of Θ(log p) bits. When p = Θ(N) this is Θ(log^2 N), and the search becomes too expensive. An alternative would be to make each block a fixed number of records so that direct access to the index (rather than requiring a search) would be possible. But in this case a long record in the primary file might mean that other records in the same block lie more than k log N bits away from their indexing pointer. Jacobson [4] described a third possibility in which each block contains an exact number of bits. This makes the pointers P unnecessary, but extra synchronising information is required for each block to allow the starting point of the first record to be found, and the equivalent of the R array must still be retained. Moreover, in this approach extra information is required to handle long records that span more than one block.

Here we choose to allow blocks to contain both a variable number of records and a variable number of bits. To speed the search in the R array we add another value to each pointer. As before, let the i'th pointer be P[i] and the corresponding record number be R[i].
Suppose that the index contains p_1 pointers, p_1 ≤ N/(k log N), and, without loss of generality, that p_1 divides evenly into p. The 'block number' field B[i] stores the number of the block that contains the i·(p/p_1)'th record of the file. For example, B[0] = 0, to indicate that the first record (i.e., record number zero) of the file is in block zero; B[1] stores the number of the block that contains record p/p_1, and so on. Note that, since every block contains at least one record, B[i + 1] ≤ B[i] + (p/p_1) for all i. This bound will be used below.

Suppose now that we must locate the m'th record. Rather than search the whole of the R array, the search can be constrained to those entries R[i] in the range

B[⌊m/(p/p_1)⌋] ≤ i ≤ B[⌈m/(p/p_1)⌉].

That is, if (p/p_1) divides m evenly the correct block number has been found. If not,
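A small numeric illustration of the B array and the constrained search range; the particular assignment of records to blocks (`block_of`) is hypothetical, chosen only so that every block is nonempty:

```python
p, p1 = 12, 4
step = p // p1
# Hypothetical assignment of the 12 records to 6 blocks.
block_of = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4, 5]
# B[i] = block containing record i*(p/p1); one extra entry covers the tail.
B = [block_of[min(i * step, p - 1)] for i in range(p1 + 1)]

def search_range(m):
    # The search for record m is confined to the blocks
    # B[floor(m/(p/p1))] .. B[ceil(m/(p/p1))].
    return B[m // step], B[-(-m // step)]

for m in range(p):
    lo, hi = search_range(m)
    assert lo <= block_of[m] <= hi          # the range always contains m
assert all(B[i + 1] <= B[i] + step for i in range(p1))   # the stated bound
```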

a few tens or hundreds of bits long, an index that contains the address of every record will constitute a significant overhead. Moreover, even when the average record size is Θ(log N) bits, as will be the case, for example, when each record contains a unique key, the O(N) space required by a simple index might still be an unacceptable overhead. Here we consider implementations of the index that allow the O(log N) bit-access cost to be retained, but require less space. Our results closely parallel those given by Jacobson [4], but the structure we describe is simpler to implement. The next section describes a one level index that allows random access in O(log N) bit accesses and requires O(N) bits of overhead space. Section 3 then shows how the addition of a second compressed index level allows the space to be reduced to O((N log log N)/log N) = o(N). Finally, Section 4 gives the results of applying the technique to a practical situation.

2 One Level Indexing

Suppose we are required to determine the bit address at which the m'th record of a file begins. The cost of accessing this record has two components: first, we spend some time consulting the index, and second, we spend some time consulting the primary file. The only constraint is that the total number of bit accesses be O(log N), and so O(log N) time can be spent in each of the index and the primary file. The inefficiency of the simple approach is that only O(1) time is spent accessing the primary file, forcing the index to contain too many pointers. In fact there is no need for the index to list every record address, and the records can be grouped into blocks, provided only that no record starts further than k log N bits from the pointer that marks the start of the block, for some constant k. Note that we assume that each record stores its own length, either implicitly or explicitly.
For example, in the Huffman coding case discussed by Jacobson [4], the prefix-free nature of a Huffman code means that the decoder can always know when the end of a symbol (record) has been reached. More generally, variable length records must either contain an explicit field storing the total length of the record, or must contain a unique 'end-of-record' symbol, since without either even sequential processing would not be possible.

Suppose then that we store pointers to records roughly k log N bits apart, for some constant k. (All logarithms are binary.) The first pointer always points at the first bit of the first record in the file. To establish the next pointer, we skip over k log N bits and index the first
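The pointer-placement rule of this paragraph can be sketched on a toy file; k and the record lengths here are arbitrary choices made for the demonstration:

```python
from math import log2

lengths = [120, 30, 45, 300, 18, 60, 95, 10, 210, 40]  # record sizes, bits
N = sum(lengths)
k = 20
gap = k * log2(N)           # skip this many bits between pointers

starts, addr = [], 0
for n in lengths:
    starts.append(addr)
    addr += n

P, R = [], []               # block pointers and their record numbers
next_at = 0                 # first pointer: first bit of the first record
for r, s in enumerate(starts):
    if s >= next_at:
        P.append(s)
        R.append(r)
        next_at = s + gap   # skip k log N bits, then index the next record

# Every record starts within k log N bits of its block pointer, and
# consecutive pointers are at least k log N bits apart, so there are at
# most N/(k log N) of them (plus the final one in this toy).
for r, s in enumerate(starts):
    assert s - max(x for x in P if x <= s) <= gap
assert len(P) <= N / gap + 1
```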

Supporting Random Access in Files of Variable Length Records

Alistair Moffat, Department of Computer Science, The University of Melbourne, Parkville 3052, Australia.
Justin Zobel, Department of Computer Science, Royal Melbourne Institute of Technology, GPO Box 2476V, Melbourne 3001, Australia.

Abstract: We consider the problem of providing a random access index to a file of variable length records. For a file of N bits and Θ(N) records the index we describe requires O((N log log N)/log N) = o(N) bits, and access to any record is possible after O(log N) bit accesses in the index and the file itself. This compares favourably with the Θ(N log N) space that would be required by a conventional index with the same access bound. Jacobson has also presented an O((N log log N)/log N) space method of indexing; our method is simpler and leads to an implementation that is suitable for practical applications.

Keywords: data structures, file structures, analysis of algorithms.

1 Introduction

We suppose that a file of variable length records contains a total of N bits and p records; that the records are numbered sequentially from zero to p − 1; and that it is necessary to be able to efficiently access any record of the file given only an ordinal record number. The problem we consider is this: how should an index to the file be constructed so that random access is fast, but the index is small? One simple indexing scheme would be to maintain the address of each record in an index array of fixed size pointers, allowing access to any record of the primary file using just one pointer and thus O(log N) bit accesses (footnote 1). However in this case the index consumes Θ(p log N) bits, and when the average record size is Θ(1) and p is Θ(N), the index will asymptotically dominate the size of the original file. Of course, for practical purposes, a 32-bit pointer will index files of up to 4 Gbit, and, if that is not enough, 64-bit pointers could be used.
Nevertheless, if the average record is just

Footnote 1: We assume a model of computation where the unit cost operation is the reading of a single bit from either the index or the primary file.


More information

Multiway Blockwise In-place Merging

Multiway Blockwise In-place Merging Multiway Blockwise In-place Merging Viliam Geffert and Jozef Gajdoš Institute of Computer Science, P.J.Šafárik University, Faculty of Science Jesenná 5, 041 54 Košice, Slovak Republic viliam.geffert@upjs.sk,

More information

III Data Structures. Dynamic sets

III Data Structures. Dynamic sets III Data Structures Elementary Data Structures Hash Tables Binary Search Trees Red-Black Trees Dynamic sets Sets are fundamental to computer science Algorithms may require several different types of operations

More information

8 Integer encoding. scritto da: Tiziano De Matteis

8 Integer encoding. scritto da: Tiziano De Matteis 8 Integer encoding scritto da: Tiziano De Matteis 8.1 Unary code... 8-2 8.2 Elias codes: γ andδ... 8-2 8.3 Rice code... 8-3 8.4 Interpolative coding... 8-4 8.5 Variable-byte codes and (s,c)-dense codes...

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Introduction to the Analysis of Algorithms. Algorithm

Introduction to the Analysis of Algorithms. Algorithm Introduction to the Analysis of Algorithms Based on the notes from David Fernandez-Baca Bryn Mawr College CS206 Intro to Data Structures Algorithm An algorithm is a strategy (well-defined computational

More information

Entropy Coding. - to shorten the average code length by assigning shorter codes to more probable symbols => Morse-, Huffman-, Arithmetic Code

Entropy Coding. - to shorten the average code length by assigning shorter codes to more probable symbols => Morse-, Huffman-, Arithmetic Code Entropy Coding } different probabilities for the appearing of single symbols are used - to shorten the average code length by assigning shorter codes to more probable symbols => Morse-, Huffman-, Arithmetic

More information

Compression of Inverted Indexes For Fast Query Evaluation

Compression of Inverted Indexes For Fast Query Evaluation Compression of Inverted Indexes For Fast Query Evaluation Falk Scholer Hugh E. Williams John Yiannis Justin Zobel School of Computer Science and Information Technology RMIT University, GPO Box 2476V Melbourne,

More information

Melbourne University at the 2006 Terabyte Track

Melbourne University at the 2006 Terabyte Track Melbourne University at the 2006 Terabyte Track Vo Ngoc Anh William Webber Alistair Moffat Department of Computer Science and Software Engineering The University of Melbourne Victoria 3010, Australia Abstract:

More information

An Order-2 Context Model for Data Compression. With Reduced Time and Space Requirements. Technical Report No

An Order-2 Context Model for Data Compression. With Reduced Time and Space Requirements. Technical Report No An Order-2 Context Model for Data Compression With Reduced Time and Space Requirements Debra A. Lelewer and Daniel S. Hirschberg Technical Report No. 90-33 Abstract Context modeling has emerged as the

More information

Data Compression Techniques

Data Compression Techniques Data Compression Techniques Part 1: Entropy Coding Lecture 1: Introduction and Huffman Coding Juha Kärkkäinen 31.10.2017 1 / 21 Introduction Data compression deals with encoding information in as few bits

More information

Understand how to deal with collisions

Understand how to deal with collisions Understand the basic structure of a hash table and its associated hash function Understand what makes a good (and a bad) hash function Understand how to deal with collisions Open addressing Separate chaining

More information

Compressing Integers for Fast File Access

Compressing Integers for Fast File Access Compressing Integers for Fast File Access Hugh E. Williams Justin Zobel Benjamin Tripp COSI 175a: Data Compression October 23, 2006 Introduction Many data processing applications depend on access to integer

More information

DATA STRUCTURES/UNIT 3

DATA STRUCTURES/UNIT 3 UNIT III SORTING AND SEARCHING 9 General Background Exchange sorts Selection and Tree Sorting Insertion Sorts Merge and Radix Sorts Basic Search Techniques Tree Searching General Search Trees- Hashing.

More information

So the actual cost is 2 Handout 3: Problem Set 1 Solutions the mark counter reaches c, a cascading cut is performed and the mark counter is reset to 0

So the actual cost is 2 Handout 3: Problem Set 1 Solutions the mark counter reaches c, a cascading cut is performed and the mark counter is reset to 0 Massachusetts Institute of Technology Handout 3 6854/18415: Advanced Algorithms September 14, 1999 David Karger Problem Set 1 Solutions Problem 1 Suppose that we have a chain of n 1 nodes in a Fibonacci

More information

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Index Compression. David Kauchak cs160 Fall 2009 adapted from: Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?

More information

Lecture 9 March 4, 2010

Lecture 9 March 4, 2010 6.851: Advanced Data Structures Spring 010 Dr. André Schulz Lecture 9 March 4, 010 1 Overview Last lecture we defined the Least Common Ancestor (LCA) and Range Min Query (RMQ) problems. Recall that an

More information

1 Introduction to generation and random generation

1 Introduction to generation and random generation Contents 1 Introduction to generation and random generation 1 1.1 Features we might want in an exhaustive generation algorithm............... 1 1.2 What about random generation?.................................

More information

CS/ENGRD 2110 Object-Oriented Programming and Data Structures Spring 2012 Thorsten Joachims. Lecture 10: Asymptotic Complexity and

CS/ENGRD 2110 Object-Oriented Programming and Data Structures Spring 2012 Thorsten Joachims. Lecture 10: Asymptotic Complexity and CS/ENGRD 2110 Object-Oriented Programming and Data Structures Spring 2012 Thorsten Joachims Lecture 10: Asymptotic Complexity and What Makes a Good Algorithm? Suppose you have two possible algorithms or

More information

CMSC 754 Computational Geometry 1

CMSC 754 Computational Geometry 1 CMSC 754 Computational Geometry 1 David M. Mount Department of Computer Science University of Maryland Fall 2005 1 Copyright, David M. Mount, 2005, Dept. of Computer Science, University of Maryland, College

More information

Context Modeling for Text Compression. Daniel S. Hirschbergy and Debra A. Lelewerz. Abstract

Context Modeling for Text Compression. Daniel S. Hirschbergy and Debra A. Lelewerz. Abstract Context Modeling for Text Compression Daniel S. Hirschbergy and Debra A. Lelewerz Abstract Adaptive context modeling has emerged as one of the most promising new approaches to compressing text. A nite-context

More information

NOI 2012 TASKS OVERVIEW

NOI 2012 TASKS OVERVIEW NOI 2012 TASKS OVERVIEW Tasks Task 1: MODSUM Task 2: PANCAKE Task 3: FORENSIC Task 4: WALKING Notes: 1. Each task is worth 25 marks. 2. Each task will be tested on a few sets of input instances. Each set

More information

CMa simple C Abstract Machine

CMa simple C Abstract Machine CMa simple C Abstract Machine CMa architecture An abstract machine has set of instructions which can be executed in an abstract hardware. The abstract hardware may be seen as a collection of certain data

More information

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( )

Let the dynamic table support the operations TABLE-INSERT and TABLE-DELETE It is convenient to use the load factor ( ) 17.4 Dynamic tables Let us now study the problem of dynamically expanding and contracting a table We show that the amortized cost of insertion/ deletion is only (1) Though the actual cost of an operation

More information

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics The Compositional C++ Language Denition Peter Carlin Mani Chandy Carl Kesselman March 12, 1993 Revision 0.95 3/12/93, Comments welcome. Abstract This document gives a concise denition of the syntax and

More information

Solutions to Exam Data structures (X and NV)

Solutions to Exam Data structures (X and NV) Solutions to Exam Data structures X and NV 2005102. 1. a Insert the keys 9, 6, 2,, 97, 1 into a binary search tree BST. Draw the final tree. See Figure 1. b Add NIL nodes to the tree of 1a and color it

More information

Horn Formulae. CS124 Course Notes 8 Spring 2018

Horn Formulae. CS124 Course Notes 8 Spring 2018 CS124 Course Notes 8 Spring 2018 In today s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it

More information

Writing Parallel Programs; Cost Model.

Writing Parallel Programs; Cost Model. CSE341T 08/30/2017 Lecture 2 Writing Parallel Programs; Cost Model. Due to physical and economical constraints, a typical machine we can buy now has 4 to 8 computing cores, and soon this number will be

More information

An NC Algorithm for Sorting Real Numbers

An NC Algorithm for Sorting Real Numbers EPiC Series in Computing Volume 58, 2019, Pages 93 98 Proceedings of 34th International Conference on Computers and Their Applications An NC Algorithm for Sorting Real Numbers in O( nlogn loglogn ) Operations

More information

18.3 Deleting a key from a B-tree

18.3 Deleting a key from a B-tree 18.3 Deleting a key from a B-tree B-TREE-DELETE deletes the key from the subtree rooted at We design it to guarantee that whenever it calls itself recursively on a node, the number of keys in is at least

More information

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo

2 The Service Provision Problem The formulation given here can also be found in Tomasgard et al. [6]. That paper also details the background of the mo Two-Stage Service Provision by Branch and Bound Shane Dye Department ofmanagement University of Canterbury Christchurch, New Zealand s.dye@mang.canterbury.ac.nz Asgeir Tomasgard SINTEF, Trondheim, Norway

More information

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static

More information

Sorting. Introduction. Classification

Sorting. Introduction. Classification Sorting Introduction In many applications it is necessary to order give objects as per an attribute. For example, arranging a list of student information in increasing order of their roll numbers or arranging

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

CSE 417 Branch & Bound (pt 4) Branch & Bound

CSE 417 Branch & Bound (pt 4) Branch & Bound CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

A Simplied NP-complete MAXSAT Problem. Abstract. It is shown that the MAX2SAT problem is NP-complete even if every variable

A Simplied NP-complete MAXSAT Problem. Abstract. It is shown that the MAX2SAT problem is NP-complete even if every variable A Simplied NP-complete MAXSAT Problem Venkatesh Raman 1, B. Ravikumar 2 and S. Srinivasa Rao 1 1 The Institute of Mathematical Sciences, C. I. T. Campus, Chennai 600 113. India 2 Department of Computer

More information

MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS

MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS MAXIMAL PLANAR SUBGRAPHS OF FIXED GIRTH IN RANDOM GRAPHS MANUEL FERNÁNDEZ, NICHOLAS SIEGER, AND MICHAEL TAIT Abstract. In 99, Bollobás and Frieze showed that the threshold for G n,p to contain a spanning

More information

Code generation scheme for RCMA

Code generation scheme for RCMA Code generation scheme for RCMA Axel Simon July 5th, 2010 1 Revised Specification of the R-CMa We detail what constitutes the Register C-Machine (R-CMa ) and its operations in turn We then detail how the

More information

( ( ( ( ) ( ( ) ( ( ( ) ( ( ) ) ) ) ( ) ( ( ) ) ) ( ) ( ) ( ) ( ( ) ) ) ) ( ) ( ( ( ( ) ) ) ( ( ) ) ) )

( ( ( ( ) ( ( ) ( ( ( ) ( ( ) ) ) ) ( ) ( ( ) ) ) ( ) ( ) ( ) ( ( ) ) ) ) ( ) ( ( ( ( ) ) ) ( ( ) ) ) ) Representing Trees of Higher Degree David Benoit 1;2, Erik D. Demaine 2, J. Ian Munro 2, and Venkatesh Raman 3 1 InfoInteractive Inc., Suite 604, 1550 Bedford Hwy., Bedford, N.S. B4A 1E6, Canada 2 Dept.

More information

Searchable Compressed Representations of Very Sparse Bitmaps (extended abstract)

Searchable Compressed Representations of Very Sparse Bitmaps (extended abstract) Searchable Compressed Representations of Very Sparse Bitmaps (extended abstract) Steven Pigeon 1 pigeon@iro.umontreal.ca McMaster University Hamilton, Ontario Xiaolin Wu 2 xwu@poly.edu Polytechnic University

More information

A Simple Lossless Compression Heuristic for Grey Scale Images

A Simple Lossless Compression Heuristic for Grey Scale Images L. Cinque 1, S. De Agostino 1, F. Liberati 1 and B. Westgeest 2 1 Computer Science Department University La Sapienza Via Salaria 113, 00198 Rome, Italy e-mail: deagostino@di.uniroma1.it 2 Computer Science

More information

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze

Indexing. UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides Addison Wesley, 2008 Table of Content Inverted index with positional information

More information

University of Illinois at Urbana-Champaign Department of Computer Science. Final Examination

University of Illinois at Urbana-Champaign Department of Computer Science. Final Examination University of Illinois at Urbana-Champaign Department of Computer Science Final Examination CS 225 Data Structures and Software Principles Spring 2010 7-10p, Wednesday, May 12 Name: NetID: Lab Section

More information

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises

A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises 308-420A Secondary storage Algorithms and Data Structures Supplementary Questions and Exercises Section 1.2 4, Logarithmic Files Logarithmic Files 1. A B-tree of height 6 contains 170,000 nodes with an

More information

Annex A (Informative) Collected syntax The nonterminal symbols pointer-type, program, signed-number, simple-type, special-symbol, and structured-type

Annex A (Informative) Collected syntax The nonterminal symbols pointer-type, program, signed-number, simple-type, special-symbol, and structured-type Pascal ISO 7185:1990 This online copy of the unextended Pascal standard is provided only as an aid to standardization. In the case of dierences between this online version and the printed version, the

More information

CSC630/CSC730 Parallel & Distributed Computing

CSC630/CSC730 Parallel & Distributed Computing CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2

More information

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g) Introduction to Algorithms March 11, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Sivan Toledo and Alan Edelman Quiz 1 Solutions Problem 1. Quiz 1 Solutions Asymptotic orders

More information

Course: Operating Systems Instructor: M Umair. M Umair

Course: Operating Systems Instructor: M Umair. M Umair Course: Operating Systems Instructor: M Umair Process The Process A process is a program in execution. A program is a passive entity, such as a file containing a list of instructions stored on disk (often

More information

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc

Extensions to RTP to support Mobile Networking: Brown, Singh 2 within the cell. In our proposed architecture [3], we add a third level to this hierarc Extensions to RTP to support Mobile Networking Kevin Brown Suresh Singh Department of Computer Science Department of Computer Science University of South Carolina Department of South Carolina Columbia,

More information

COMP171. Hashing.

COMP171. Hashing. COMP171 Hashing Hashing 2 Hashing Again, a (dynamic) set of elements in which we do search, insert, and delete Linear ones: lists, stacks, queues, Nonlinear ones: trees, graphs (relations between elements

More information

Algorithmic "imperative" language

Algorithmic imperative language Algorithmic "imperative" language Undergraduate years Epita November 2014 The aim of this document is to introduce breiy the "imperative algorithmic" language used in the courses and tutorials during the

More information

Chapter 2: Complexity Analysis

Chapter 2: Complexity Analysis Chapter 2: Complexity Analysis Objectives Looking ahead in this chapter, we ll consider: Computational and Asymptotic Complexity Big-O Notation Properties of the Big-O Notation Ω and Θ Notations Possible

More information

CS 493: Algorithms for Massive Data Sets Dictionary-based compression February 14, 2002 Scribe: Tony Wirth LZ77

CS 493: Algorithms for Massive Data Sets Dictionary-based compression February 14, 2002 Scribe: Tony Wirth LZ77 CS 493: Algorithms for Massive Data Sets February 14, 2002 Dictionary-based compression Scribe: Tony Wirth This lecture will explore two adaptive dictionary compression schemes: LZ77 and LZ78. We use the

More information

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987

Extra-High Speed Matrix Multiplication on the Cray-2. David H. Bailey. September 2, 1987 Extra-High Speed Matrix Multiplication on the Cray-2 David H. Bailey September 2, 1987 Ref: SIAM J. on Scientic and Statistical Computing, vol. 9, no. 3, (May 1988), pg. 603{607 Abstract The Cray-2 is

More information

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION

Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION DESIGN AND ANALYSIS OF ALGORITHMS Unit 6 Chapter 15 EXAMPLES OF COMPLEXITY CALCULATION http://milanvachhani.blogspot.in EXAMPLES FROM THE SORTING WORLD Sorting provides a good set of examples for analyzing

More information

CS Fast Progressive Lossless Image Compression. Paul G. Howard, Jeffrey S. Vitter. Department of Computer Science.

CS Fast Progressive Lossless Image Compression. Paul G. Howard, Jeffrey S. Vitter. Department of Computer Science. CS--1994--14 Fast Progressive Lossless Image Compression Paul G Howard, Jeffrey S Vitter Department of Computer Science Duke University Durham, North Carolina 27708-0129 March 25, 1994 Fast Progressive

More information

The Global Standard for Mobility (GSM) (see, e.g., [6], [4], [5]) yields a

The Global Standard for Mobility (GSM) (see, e.g., [6], [4], [5]) yields a Preprint 0 (2000)?{? 1 Approximation of a direction of N d in bounded coordinates Jean-Christophe Novelli a Gilles Schaeer b Florent Hivert a a Universite Paris 7 { LIAFA 2, place Jussieu - 75251 Paris

More information

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o

Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National

More information

David Rappaport School of Computing Queen s University CANADA. Copyright, 1996 Dale Carnegie & Associates, Inc.

David Rappaport School of Computing Queen s University CANADA. Copyright, 1996 Dale Carnegie & Associates, Inc. David Rappaport School of Computing Queen s University CANADA Copyright, 1996 Dale Carnegie & Associates, Inc. Data Compression There are two broad categories of data compression: Lossless Compression

More information

Lecture 3 February 23, 2012

Lecture 3 February 23, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 3 February 23, 2012 1 Overview In the last lecture we saw the concepts of persistence and retroactivity as well as several data structures

More information

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh

Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented wh Adaptive Estimation of Distributions using Exponential Sub-Families Alan Gous Stanford University December 1996 Abstract: An algorithm is presented which, for a large-dimensional exponential family G,

More information

where is a constant, 0 < <. In other words, the ratio between the shortest and longest paths from a node to a leaf is at least. An BB-tree allows ecie

where is a constant, 0 < <. In other words, the ratio between the shortest and longest paths from a node to a leaf is at least. An BB-tree allows ecie Maintaining -balanced Trees by Partial Rebuilding Arne Andersson Department of Computer Science Lund University Box 8 S-22 00 Lund Sweden Abstract The balance criterion dening the class of -balanced trees

More information

Exercise 1 : B-Trees [ =17pts]

Exercise 1 : B-Trees [ =17pts] CS - Fall 003 Assignment Due : Thu November 7 (written part), Tue Dec 0 (programming part) Exercise : B-Trees [+++3+=7pts] 3 0 3 3 3 0 Figure : B-Tree. Consider the B-Tree of figure.. What are the values

More information

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 907-912, 1996. Connectionist Networks for Feature Indexing and Object Recognition Clark F. Olson Department of Computer

More information

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Data Structures Hashing Structures Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Hashing Structures I. Motivation and Review II. Hash Functions III. HashTables I. Implementations

More information

CS584/684 Algorithm Analysis and Design Spring 2017 Week 3: Multithreaded Algorithms

CS584/684 Algorithm Analysis and Design Spring 2017 Week 3: Multithreaded Algorithms CS584/684 Algorithm Analysis and Design Spring 2017 Week 3: Multithreaded Algorithms Additional sources for this lecture include: 1. Umut Acar and Guy Blelloch, Algorithm Design: Parallel and Sequential

More information