Variable-Prefix Identifiers (Adjustable Oid s)

Variable-Prefix Identifiers (Adjustable Oid s) William Kent Database Technology Department Hewlett-Packard Laboratories Palo Alto, California 1 Introduction Object identifiers come from various sources, and often conflict in content and form. Identifiers generated in different databases, or at different network nodes, may accidentally be the same. User-specified identifiers have the same problem: an employee number might accidentally match a part number. Literals are susceptible as well: the representations of some numbers and character strings are the same. The natural solution for conflicting contents is to attach some sort of qualifier (prefix) to differentiate one set of identifiers from another. Sometimes this leads to identifiers whose overall length is variable, as in hierarchical name spaces. While this may be appropriate in many cases, nonuniform identifier lengths impose a heavy penalty on system design and performance. We limit our attention to designs in which identifiers have some fixed length n (64 bits in most of our examples). The usual approach to prefixes is to fix the identifier format, dividing all identifiers into a prefix of length p and a suffix of length s. The length p is rigidly fixed, requiring a difficult and irrevocable design decision. The maximum number of regions into which identifiers may be partitioned is fixed in advance, being limited to 2 p. Similarly, the length s of the suffix must be decided and fixed in advance, in the hopes that no region will contain more than 2 s identifiers. Once fixed, this same capacity is reserved for all regions, which wastes identifier capacity in padding bits for all but the largest region. Identifiers will go unused in databases whose capacity is less than 2 s objects, and also in user-specified identifiers (employee numbers, part numbers) less than s bits in length. A naive approach to variable-length prefixes would embed a length field in the identifier. This is exorbitantly wasteful of identifier capacity, and still puts a difficult restriction on the maximum number of regions. For example, to allow a prefix length of up to 16 bits requires a four-bit length field, reducing the number of identifiable objects by a factor of 16. We could also consider reserving a bit pattern to serve as a delimiter at the end of the prefix, but this also restricts the usable bit patterns and hence the total identifier capacity. Two innovations alleviate these difficulties: a global table of valid prefix values, and an expansion zone of zeros designed into the middle of the identifier format, allowing various sorts of adaptation and extension. The cost of maintaining the prefix table needs to be weighed against the benefits of this approach. The technique also supports a hierarchical nesting of subregions. 1.1 The Prefix Table With variable-length prefixes and no embedded length field or delimiter, the only way to recognize the prefix is to maintain a table of valid prefixes. A simple rule makes prefixes uniquely distinguishable: no prefix value may match the initial portion of any other prefix value. Thus 0, 10, and 11 constitute a valid set of prefixes, and 01 could not be added as a valid prefix. Internal Accession Date Only 1

The rule can be refined to support subregions. If the prefix of a region r 1 matches the initial portion of the prefix of region r 2, then r 2 is a subregion of r 1. Any identifier belonging to r 2 also belongs to r 1. For example, regions with prefixes 01 and 0110 are nested subregions of a region with prefix 0. Such regions and prefixes exist only if recorded in the prefix table. For each region, the table contains: The prefix length p. The prefix code (bit pattern) c. Optionally, other information about the region, such as a name or description, or a network node address, or other control information as needed. For convenience, it might include a flag indicating whether a region has subregions, though this can be determined algorithmically. This technique can be used to hierarchically partition oid s for any reason, not just for location purposes. Regions could be used to encode different identifier types and formats, e.g., literals and non-literals. Other object types could also be encoded in the prefix, so long as they are invariant over the object s lifetime. Figure 1 shows a plausible set of prefixes, which can be adapted to changing needs. For user-specified identifiers, the suffix length could be fixed at the actual length of the value (part number, employee number). region prefix length value c Literal 1 0 AtomicLiteral 2 00 Char 3 000 CityName 5 00000 MonthName 5 00001 Number 3 001 Integer 4 0010 Aggregate 2 01 Set 4 0100 List 4 0101 NonLiteral 1 1 SystemTypeObject 2 10 Type 4 1000 Function 4 1001 UserObject 2 11 SystemOid 3 110 Iris_Oid 4 1100 IrisDB00_Oid 6 110000 IrisDB01_Oid 6 110001 UserOid 3 111 HP_EmpNum 5 11100 HP_PartNum 5 11101 HP_DocNum 5 11110 Figure 1. An example set of prefixes. Identifying the region to which an identifier belongs is a little more complicated than with fixed-length or lengthencoded prefixes. It is necessary to compare different initial segments of an identifier with various table entries until a matching prefix is found. If subgroups are present, determining the most restrictive subregion should begin by comparing with the longer prefixes in the table, then progressing to the shorter ones. In the absence of subregions, it s hard 2

to know whether it s more efficient to start with the long ones or the short ones. Also, it may be possible to apply or adapt other search techniques such as indexes or hashing. This bears further investigation. Testing for a specific region is still relatively simple, since its prefix value can be compared with the initial portion of the identifier. However, under certain conditions, the prefix for a given region might change (Section 3.2 ), and tests for that region need to accommodate such changes. 2 Customized Capacities In identifiers of length n, if the length of a region s prefix is p, then individuals in that region are identified by a suffix of length s=n-p, and the region has a capacity for identifying 2 s objects. For example, if 64-bit identifiers are being used globally, and a particular database generates 32-bit oid s, then its oid s can be qualified with a 32-bit prefix. If employee numbers are 24 bits long, then the Employee Number region can be identified with a 40-bit prefix. If part numbers are 48 bits in length, then the Part Number region is identified with a 16-bit prefix. Of course, this scheme can only accommodate regions for which s<n. To accommodate four databases independently generating 64-bit oid s, the global identifier length n must be at least 66 bits. 80-bit product numbers cannot be used as user-specified identifiers unless n>80. 3 Expansion Zones The expansion zone for a given region is a string of z bits between the prefix and the suffix, reserved to be zero in all identifiers belonging to that region. The expansion zone allows for several kinds of extensibility. The format of any n- bit identifier in a given region is: prefix code zeros individual identifier p bits z bits s bits The prefix table is now extended to explicitly specify the suffix length s or the expansion zone length z=n-(p+s), or both for convenience. The current capacity of a region is 2 s, while its ultimate capacity is 2 (n-p). Maintaining all zeros in the expansion zone is crucial to extensibility. By making appropriate adjustments to the table, these bits can be used to: Increase the capacity of a region. Introduce new independent (disjoint) regions. Introduce subregions of an existing region. 3.1 Increasing Region Capacity Each region can be managed by an independent manager which issues s-bit individual identifiers, with zeros in the expansion zone, without having to refer to the table. If this capacity is exhausted, the suffix length s in the table can be increased, up to n-p if necessary, to increase the current capacity of the region. If a region has subregions (Section 3.3 ), then the subregions should be expanded rather than the parent region itself. The current capacity of a region can be multiplied by 2 k by shifting k bits from the expansion zone to the suffix. The prefix identifying the region is unchanged, and identifiers previously belonging to the region still belong to it. The only difference is that previously reserved bits in the expansion zone have been released for use in new identifiers in that region. 3

For example, the current capacity of a region with a suffix length of 50 bits and an expansion zone of 8 bits can be quadrupled by stretching its suffix to 52 bits and shrinking its expansion zone to 6 bits (Figure 2). This might be required, for example, because the estimated number of objects in a region has been revised upwards, or because the company has expanded the length of its employee numbers, etc. Under certain conditions, the current capacity of a region can also be reduced, if needed for load balancing. If it is known that no identifiers have been issued in a given region with high order 1-bits in the suffix, then those high-order bits can be returned to the expansion zone, and the suffix length correspondingly reduced. before after region prefix exp zone suffix region prefix exp zone z s z lnth p value c lnth p value c suffix s r 1 6 101101 8 50 r 1 6 101101 6 52 before: the last oid that filled region r 1... 1011010000000011...1 p z s after: the next oid after expansion... 1011010000000100...0 p z s Figure 2. Expanding the current capacity of a region. 3.2 Adding Independent Regions To introduce a new independent region, one first seeks a new prefix which does not conflict with any existing prefixes. An acceptable new prefix would not coincide with the initial portion of any existing prefix, nor would any existing prefix match the initial portion of the new prefix. For example, if the existing prefixes are 1 and 01, then 00 would be a legal new prefix. This is not always possible. When the existing prefixes are 1, 01, and 00, no other independent prefix can be added. These existing prefixes would match the initial portion of any new prefix that might be added. At this point we take clever advantage of the expansion zones to split an existing region in order to accommodate a new region. In effect, the prefix of an existing region is extended with zeros taken from the expansion zone. Since all identifiers previously issued in this region necessarily had zeros in these bits, they are still recognized as belonging to this region. The expansion zone is made smaller, with a corresponding reduction in the ultimate capacity of the region. Suppose (Figure 3) region r 1 with prefix code 101101 has a suffix length s=50, i.e., a current capacity of 2 50 objects. It has an expansion zone of z=8 bits, so that all identifiers in that region begin with a code of 101101 followed by 00000000 in the expansion zone. This region r 1 can be split by extending its prefix with up to eight zeros. If we extend its prefix with two zeros to be 10110100, its prefix length becomes p=8, and its expansion zone is reduced to z=6. Note again that all identifiers in region r 1 did already begin with 10110100, due to the expansion zone, so they still belong to region r 1. It is now possible to introduce 10110101, 10110110, and 10110111 as new independent prefixes, each with an arbitrary suffix length s 56, i.e., a capacity of up to 2 56 objects. In general, taking k bits from the expansion zone of r 1 4

allows the introduction of up to 2 k -1 new independent regions. We don t have to introduce all of them if we don t need them. If we only wanted to introduce one new region, we could have extended the prefix of r 1 with just one 0. The region r 1 chosen to be so subdivided must have z>0 at the outset, and should have no subregions. It should be able to tolerate the reduction in its ultimate capacity, and it should be able to give up enough ultimate capacity to meet the requirements of the new region(s). Since the prefix identifying region r 1 has changed, any routine testing whether an oid belongs to this region would now have to use the new prefix. This could be a problem in some cases. before after region prefix exp zone suffix region prefix exp zone z s z lnth p value c lnth p value c suffix s r 1 6 101101 8 50 r 1 8 10110100 6 50 r 3 8 10110101 3 53 r 4 8 10110110 0 56 r 5 8 10110111 6 50 p z s...before a typical oid in region r 1... 1011010000000010...10 p z s...after a typical oid in region r 5...1011011100000010...10 p z s Figure 3. Adding new independent regions. 3.3 Adding Subregions Adding subregions is somewhat similar to the previous case of adding independent regions, the main difference being that the parent region retains its original prefix, with the new subregions being assigned extensions of this prefix. A region can only be extended with immediate subregions once, but the subregions themselves can acquire further subregions. Though it can be determined algorithmically, it might be useful to put a flag in the table to indicate that a region has subregions. The maximum number of subregions which can be added to a region r is 2 z, where z is the length of the expansion zone of r. Figure 4 shows an example. Regions r 2, r 3, r 4, and r 5 are now subregions of r 1, since the prefix of r 1 matches the initial portion of the prefixes of these four new regions. 5

These four regions partition r 1, which can have no direct members of its own; all members of r 1 must be members of one of the subregions. We therefore show the suffix and expansion size of r 1 to be 0. Previous members of r 1 have become members of r 2. It is safest to define the subregions of a region before populating it. If we didn t add all four subregions, then 10110100 should not be assigned as a prefix. Region r 1 would not be covered, and could still have direct members of its own. However, its ultimate capacity is constrained to be no larger than any of its subregions, in order to avoid oid conflicts. For example, region r 1 could not have any direct members with an oid beginning with 10110101. In effect, region r 2 would not be added to the table, but it would serve as a virtual region containing members not covered by the other subregions. before after region prefix exp zone suffix region prefix exp zone z s z lnth p value c lnth p value c suffix s r 1 6 101101 8 50 r 1 6 101101 0 0 r 2 8 10110100 6 50 r 3 8 10110101 3 53 r 4 8 10110110 0 56 r 5 8 10110111 4 52 Figure 4. Adding subregions. Subregions need not be managed globally in the master prefix table. If a region is able to give up future growth by accepting a permanent suffix length, then it could treat that suffix as a fixed-length identifier and reapply the general algorithm to subregions of its own. The difference is in the scope over which the table has to be maintained. For example, within a 64-bit identifier system, a region could be described in the master table as having a prefix of 1001 and a suffix length of 60, i.e., no expansion zone and no room for growth. Another table could be maintained within this region for subregions within the region, managing prefixes, suffixes, and expansion zones within this 60- bit identifier. These subregions would not be recognized in the context of other regions, though the individual oid s would still be globally unique and recognized as belonging to the parent region. 4 Conclusions When fixed-length identifiers are grouped using fixed-length prefixes, the number of possible regions is fixed at the outset by the size of the prefix. The fixed-length suffix dictates that the same number of possible identifiers is reserved for each region. The suffix size is dictated by the needs of the potentially largest region, leading to a potential waste of unused identifiers in the other regions. Supporting variable-length prefixes by encoding the prefix length in the identifier wastes precious bits that could be used to identify individual objects. We have described a flexible system of variable-length prefixes employing a global prefix table and expansion zones in the identifiers. The disadvantages: 6

The need to maintain the global prefix table. More instructions executed to determine an identifier s region. Testing for a given region becomes sensitive to changes in prefix length. The overhead for maintaining the prefix table may not be significant. The table is not likely to be altered very often; a similar table may be present in any case to support global addressing in a network. The benefits of the approach: Regions can have different capacities. The number of regions is dynamically variable. New regions can be added as needed. The capacities of regions can often be increased. Subregions can be defined, and added dynamically. No prefix-length field is required in identifiers. Overall, it would seem that this scheme should lead to more efficient use of identifier capacity, since fewer identifiers will be reserved for regions where they are not needed. It remains to be seen how these advantages and disadvantages balance out in various contexts. 7