Variable-Prefix Identifiers (Adjustable Oid's)


William Kent
Database Technology Department
Hewlett-Packard Laboratories
Palo Alto, California

1 Introduction

Object identifiers come from various sources, and often conflict in content and form. Identifiers generated in different databases, or at different network nodes, may accidentally be the same. User-specified identifiers have the same problem: an employee number might accidentally match a part number. Literals are susceptible as well: the representations of some numbers and character strings are the same.

The natural solution for conflicting contents is to attach some sort of qualifier (prefix) to differentiate one set of identifiers from another. Sometimes this leads to identifiers whose overall length is variable, as in hierarchical name spaces. While this may be appropriate in many cases, nonuniform identifier lengths impose a heavy penalty on system design and performance. We limit our attention to designs in which identifiers have some fixed length n (64 bits in most of our examples).

The usual approach to prefixes is to fix the identifier format, dividing all identifiers into a prefix of length p and a suffix of length s. The length p is rigidly fixed, requiring a difficult and irrevocable design decision. The maximum number of regions into which identifiers may be partitioned is fixed in advance, being limited to 2^p. Similarly, the length s of the suffix must be decided and fixed in advance, in the hope that no region will contain more than 2^s identifiers. Once fixed, this same capacity is reserved for all regions, which wastes identifier capacity in padding bits for all but the largest region. Identifiers will go unused in databases whose capacity is less than 2^s objects, and also in user-specified identifiers (employee numbers, part numbers) less than s bits in length.

A naive approach to variable-length prefixes would embed a length field in the identifier. This is exorbitantly wasteful of identifier capacity, and still puts a difficult restriction on the maximum number of regions. For example, to allow a prefix length of up to 16 bits requires a four-bit length field, reducing the number of identifiable objects by a factor of 16. We could also consider reserving a bit pattern to serve as a delimiter at the end of the prefix, but this also restricts the usable bit patterns and hence the total identifier capacity.

Two innovations alleviate these difficulties: a global table of valid prefix values, and an expansion zone of zeros designed into the middle of the identifier format, allowing various sorts of adaptation and extension. The cost of maintaining the prefix table needs to be weighed against the benefits of this approach. The technique also supports a hierarchical nesting of subregions.

1.1 The Prefix Table

With variable-length prefixes and no embedded length field or delimiter, the only way to recognize the prefix is to maintain a table of valid prefixes. A simple rule makes prefixes uniquely distinguishable: no prefix value may match the initial portion of any other prefix value. Thus 0, 10, and 11 constitute a valid set of prefixes, and 01 could not be added as a valid prefix.
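The distinguishability rule can be checked mechanically. The following is a minimal sketch, assuming prefixes are held as strings of '0' and '1' characters; the function names `conflicts` and `can_add` are hypothetical and chosen only for this illustration.

```python
from typing import List


def conflicts(a: str, b: str) -> bool:
    """True if one bit string matches the initial portion of the other."""
    return a.startswith(b) or b.startswith(a)


def can_add(existing: List[str], candidate: str) -> bool:
    """True if candidate can join the set as an independent prefix."""
    return all(not conflicts(candidate, p) for p in existing)


prefixes = ["0", "10", "11"]
# 01 is rejected because the existing prefix 0 matches its initial portion.
print(can_add(prefixes, "01"))
```

A real prefix table would probably store bit patterns in a more compact form; strings are used here only because they make the "initial portion" test easy to read.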

The rule can be refined to support subregions. If the prefix of a region r1 matches the initial portion of the prefix of region r2, then r2 is a subregion of r1. Any identifier belonging to r2 also belongs to r1. For example, regions with prefixes 01 and 0110 are nested subregions of a region with prefix 0. Such regions and prefixes exist only if recorded in the prefix table.

For each region, the table contains:

- The prefix length p.
- The prefix code (bit pattern) c.
- Optionally, other information about the region, such as a name or description, a network node address, or other control information as needed. For convenience, it might include a flag indicating whether a region has subregions, though this can be determined algorithmically.

This technique can be used to hierarchically partition oid's for any reason, not just for location purposes. Regions could be used to encode different identifier types and formats, e.g., literals and non-literals. Other object types could also be encoded in the prefix, so long as they are invariant over the object's lifetime. Figure 1 shows a plausible set of prefixes, which can be adapted to changing needs. For user-specified identifiers, the suffix length could be fixed at the actual length of the value (part number, employee number).

    region                  prefix length   prefix value c
    Literal                 1               0
      AtomicLiteral         2               00
        Char                3               000
          CityName          5               00000
          MonthName         5               00001
        Number              3               001
          Integer           4               0010
      Aggregate             2               01
        Set                 4               0100
        List                4               0101
    NonLiteral              1               1
      SystemTypeObject      2               10
        Type                4               1000
        Function            4               1001
      UserObject            2               11
        SystemOid           3               110
          Iris_Oid          4               1100
            IrisDB00_Oid    6               110000
            IrisDB01_Oid    6               110001
        UserOid             3               111
          HP_EmpNum         5               11100
          HP_PartNum        5               11101
          HP_DocNum         5               11110

    Figure 1. An example set of prefixes.

Identifying the region to which an identifier belongs is a little more complicated than with fixed-length or length-encoded prefixes. It is necessary to compare different initial segments of an identifier with various table entries until a matching prefix is found.
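One way to picture this lookup is a linear scan over the table. The sketch below assumes 64-bit identifiers held as Python integers and uses a few rows of Figure 1; the name `find_region`, the dictionary layout, and the employee number 4711 are hypothetical. Trying the longest prefixes first is a choice made for this sketch so that the most specific subregion is reported.

```python
from typing import Dict, Optional

N = 64  # identifier length in bits, as in most of the paper's examples

# A few rows of Figure 1: region name -> prefix code (bit string).
PREFIX_TABLE = {
    "Literal": "0",
    "Number": "001",
    "Integer": "0010",
    "NonLiteral": "1",
    "UserObject": "11",
    "UserOid": "111",
    "HP_EmpNum": "11100",
}


def find_region(identifier: int, table: Dict[str, str], n: int = N) -> Optional[str]:
    """Return the most specific region whose prefix starts the identifier."""
    bits = format(identifier, f"0{n}b")
    # Longest prefixes first, so a subregion wins over its parent.
    for name, code in sorted(table.items(), key=lambda kv: -len(kv[1])):
        if bits.startswith(code):
            return name
    return None


# Hypothetical employee number 4711 placed under the HP_EmpNum prefix.
emp = (0b11100 << (N - 5)) | 4711
print(find_region(emp, PREFIX_TABLE))  # HP_EmpNum
```

The scan costs one comparison per table entry; whether an index or hash would do better is left open here, as it is in the text.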
If subregions are present, determining the most restrictive subregion should begin by comparing with the longer prefixes in the table, then progressing to the shorter ones. In the absence of subregions, it's hard to know whether it's more efficient to start with the long ones or the short ones. Also, it may be possible to apply or adapt other search techniques such as indexes or hashing. This bears further investigation.

Testing for a specific region is still relatively simple, since its prefix value can be compared with the initial portion of the identifier. However, under certain conditions, the prefix for a given region might change (Section 3.2), and tests for that region need to accommodate such changes.

2 Customized Capacities

In identifiers of length n, if the length of a region's prefix is p, then individuals in that region are identified by a suffix of length s=n-p, and the region has a capacity for identifying 2^s objects. For example, if 64-bit identifiers are being used globally, and a particular database generates 32-bit oid's, then its oid's can be qualified with a 32-bit prefix. If employee numbers are 24 bits long, then the Employee Number region can be identified with a 40-bit prefix. If part numbers are 48 bits in length, then the Part Number region is identified with a 16-bit prefix.

Of course, this scheme can only accommodate regions for which s<n. To accommodate four databases independently generating 64-bit oid's, the global identifier length n must be at least 66 bits. 80-bit product numbers cannot be used as user-specified identifiers unless n>80.

3 Expansion Zones

The expansion zone for a given region is a string of z bits between the prefix and the suffix, reserved to be zero in all identifiers belonging to that region. The expansion zone allows for several kinds of extensibility. The format of any n-bit identifier in a given region is:

    prefix code    zeros     individual identifier
    p bits         z bits    s bits

The prefix table is now extended to explicitly specify the suffix length s or the expansion zone length z=n-(p+s), or both for convenience. The current capacity of a region is 2^s, while its ultimate capacity is 2^(n-p).
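The identifier layout can be made concrete with a little bit arithmetic. This is a minimal sketch, assuming n=64 and identifiers held as Python integers; `make_id` and `split_id` are hypothetical helper names, and the prefix 101101 with s=50 is borrowed from the paper's later figures.

```python
N = 64  # identifier length in bits


def make_id(code: str, s: int, individual: int, n: int = N) -> int:
    """Build prefix | zeros | suffix; the expansion zone is whatever is left."""
    p = len(code)
    z = n - (p + s)
    if z < 0:
        raise ValueError("prefix and suffix do not fit in n bits")
    if not 0 <= individual < 2 ** s:
        raise ValueError("individual identifier exceeds current capacity 2**s")
    # The z zone bits stay zero because individual < 2**s.
    return (int(code, 2) << (n - p)) | individual


def split_id(identifier: int, p: int, s: int, n: int = N):
    """Return (prefix code, expansion-zone bits, individual identifier)."""
    z = n - (p + s)
    prefix = identifier >> (n - p)
    zone = (identifier >> s) & ((1 << z) - 1)
    individual = identifier & ((1 << s) - 1)
    return prefix, zone, individual


oid = make_id("101101", s=50, individual=12345)
print(format(oid, "064b")[:14])   # 10110100000000 (6 prefix bits, 8 zone bits)
print(split_id(oid, p=6, s=50))   # (45, 0, 12345)
```

For this region the current capacity is 2**50 and the ultimate capacity is 2**(64-6) = 2**58, matching the formulas in the text.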
Maintaining all zeros in the expansion zone is crucial to extensibility. By making appropriate adjustments to the table, these bits can be used to:

- Increase the capacity of a region.
- Introduce new independent (disjoint) regions.
- Introduce subregions of an existing region.

3.1 Increasing Region Capacity

Each region can be managed by an independent manager which issues s-bit individual identifiers, with zeros in the expansion zone, without having to refer to the table. If this capacity is exhausted, the suffix length s in the table can be increased, up to n-p if necessary, to increase the current capacity of the region. If a region has subregions (Section 3.3), then the subregions should be expanded rather than the parent region itself.

The current capacity of a region can be multiplied by 2^k by shifting k bits from the expansion zone to the suffix. The prefix identifying the region is unchanged, and identifiers previously belonging to the region still belong to it. The only difference is that previously reserved bits in the expansion zone have been released for use in new identifiers in that region.
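In table terms, an expansion is just an update of one entry. The sketch below assumes a hypothetical `Region` record holding the prefix code and suffix length, with z derived from n=64; the starting values (prefix 101101, s=50, z=8) are those of Figure 2.

```python
from dataclasses import dataclass

N = 64  # identifier length in bits


@dataclass
class Region:
    code: str   # prefix bit pattern c
    s: int      # current suffix length

    @property
    def z(self) -> int:
        """Expansion-zone length, z = n - (p + s)."""
        return N - len(self.code) - self.s

    @property
    def capacity(self) -> int:
        """Current capacity 2**s."""
        return 2 ** self.s

    def expand(self, k: int) -> None:
        """Move k bits from the expansion zone to the suffix."""
        if not 0 <= k <= self.z:
            raise ValueError("not enough bits in the expansion zone")
        self.s += k


r1 = Region(code="101101", s=50)
before = r1.capacity
r1.expand(2)
print(r1.z, r1.s, r1.capacity // before)   # 6 52 4
```

Note that no issued identifier is touched: only the table entry changes, which is why the zone bits must have been kept at zero.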

For example, the current capacity of a region with a suffix length of 50 bits and an expansion zone of 8 bits can be quadrupled by stretching its suffix to 52 bits and shrinking its expansion zone to 6 bits (Figure 2). This might be required, for example, because the estimated number of objects in a region has been revised upwards, or because the company has expanded the length of its employee numbers.

Under certain conditions, the current capacity of a region can also be reduced, if needed for load balancing. If it is known that no identifiers have been issued in a given region with high-order 1-bits in the suffix, then those high-order bits can be returned to the expansion zone, and the suffix length correspondingly reduced.

              region   prefix length p   prefix value c   exp zone z   suffix s
    before    r1       6                 101101           8            50
    after     r1       6                 101101           6            52

    before: the last oid that filled region r1
        101101 00000000 11...1
        p      z        s
    after: the next oid after expansion
        101101 000000 0100...0
        p      z      s

    Figure 2. Expanding the current capacity of a region.

3.2 Adding Independent Regions

To introduce a new independent region, one first seeks a new prefix which does not conflict with any existing prefixes. An acceptable new prefix would not coincide with the initial portion of any existing prefix, nor would any existing prefix match the initial portion of the new prefix. For example, if the existing prefixes are 1 and 01, then 00 would be a legal new prefix.

This is not always possible. When the existing prefixes are 1, 01, and 00, no other independent prefix can be added. These existing prefixes would match the initial portion of any new prefix that might be added.

At this point we take advantage of the expansion zones to split an existing region in order to accommodate a new region. In effect, the prefix of an existing region is extended with zeros taken from the expansion zone.
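A split of this kind can be sketched as follows, assuming prefixes are bit strings; the function name `split_region` is hypothetical, and the values (prefix 101101, z=8, two bits taken) are those of Figure 3. Extending the prefix by k zeros leaves the other 2**k - 1 bit patterns of the same length free for new regions.

```python
def split_region(code: str, z: int, k: int):
    """Extend a prefix with k zeros taken from its expansion zone.

    Returns the region's new prefix, its new zone length, and the
    2**k - 1 sibling prefixes that become available.
    """
    if not 1 <= k <= z:
        raise ValueError("k must be between 1 and the expansion-zone length")
    new_code = code + "0" * k
    siblings = [code + format(i, f"0{k}b") for i in range(1, 2 ** k)]
    return new_code, z - k, siblings


new_code, new_z, free = split_region("101101", z=8, k=2)
print(new_code, new_z)   # 10110100 6
print(free)              # ['10110101', '10110110', '10110111']
```

The sketch only computes candidate prefixes; recording them in the prefix table, and choosing suffix lengths for the new regions, is left to the caller.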
Since all identifiers previously issued in the split region necessarily had zeros in the bits now absorbed into the prefix, they are still recognized as belonging to that region. The expansion zone is made smaller, with a corresponding reduction in the ultimate capacity of the region.

Suppose (Figure 3) region r1 with prefix code 101101 has a suffix length s=50, i.e., a current capacity of 2^50 objects. It has an expansion zone of z=8 bits, so that all identifiers in that region begin with a code of 101101 followed by 00000000 in the expansion zone. This region r1 can be split by extending its prefix with up to eight zeros. If we extend its prefix with two zeros to be 10110100, its prefix length becomes p=8, and its expansion zone is reduced to z=6. Note again that all identifiers in region r1 did already begin with 10110100, due to the expansion zone, so they still belong to region r1.

It is now possible to introduce 10110101, 10110110, and 10110111 as new independent prefixes, each with an arbitrary suffix length s ≤ 56, i.e., a capacity of up to 2^56 objects. In general, taking k bits from the expansion zone of r1 allows the introduction of up to 2^k - 1 new independent regions. We don't have to introduce all of them if we don't need them. If we only wanted to introduce one new region, we could have extended the prefix of r1 with just one 0.

The region r1 chosen to be so subdivided must have z>0 at the outset, and should have no subregions. It should be able to tolerate the reduction in its ultimate capacity, and it should be able to give up enough ultimate capacity to meet the requirements of the new region(s).

Since the prefix identifying region r1 has changed, any routine testing whether an oid belongs to this region would now have to use the new prefix. This could be a problem in some cases.

              region   prefix length p   prefix value c   exp zone z   suffix s
    before    r1       6                 101101           8            50
    after     r1       8                 10110100         6            50
              r3       8                 10110101         3            53
              r4       8                 10110110         0            56
              r5       8                 10110111         6            50

    before: a typical oid in region r1
        101101 00000000 10...10
        p      z        s
    after: a typical oid in region r5
        10110111 000000 10...10
        p        z      s

    Figure 3. Adding new independent regions.

3.3 Adding Subregions

Adding subregions is somewhat similar to the previous case of adding independent regions, the main difference being that the parent region retains its original prefix, with the new subregions being assigned extensions of this prefix. A region can only be extended with immediate subregions once, but the subregions themselves can acquire further subregions. Though it can be determined algorithmically, it might be useful to put a flag in the table to indicate that a region has subregions.

The maximum number of subregions which can be added to a region r is 2^z, where z is the length of the expansion zone of r. Figure 4 shows an example. Regions r2, r3, r4, and r5 are now subregions of r1, since the prefix of r1 matches the initial portion of the prefixes of these four new regions.
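The parent and subregion relationship follows directly from the prefix codes. The sketch below assumes the five prefixes of Figure 4 stored as bit strings; `is_subregion` and `regions_of` are hypothetical names, and the suffix value 2 is arbitrary.

```python
from typing import Dict, List

N = 64  # identifier length in bits

# Prefix codes from Figure 4.
TABLE = {
    "r1": "101101",
    "r2": "10110100",
    "r3": "10110101",
    "r4": "10110110",
    "r5": "10110111",
}


def is_subregion(child: str, parent: str) -> bool:
    """True if the parent's prefix matches the initial portion of the child's."""
    return child != parent and child.startswith(parent)


def regions_of(identifier: int, table: Dict[str, str], n: int = N) -> List[str]:
    """All regions an identifier belongs to, outermost first."""
    bits = format(identifier, f"0{n}b")
    hits = [name for name, code in table.items() if bits.startswith(code)]
    return sorted(hits, key=lambda name: len(table[name]))


oid = (0b10110111 << (N - 8)) | 2   # an identifier under prefix 10110111
print(regions_of(oid, TABLE))       # ['r1', 'r5']
print(is_subregion(TABLE["r3"], TABLE["r1"]))  # True
```

Because membership is decided by prefix matching alone, an identifier in a subregion is automatically a member of every enclosing region, with no extra bookkeeping.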

The four regions r2 through r5 partition r1, which can have no direct members of its own; all members of r1 must be members of one of the subregions. We therefore show the suffix and expansion size of r1 to be 0. Previous members of r1 have become members of r2. It is safest to define the subregions of a region before populating it.

If we didn't add all four subregions, then 10110100 should not be assigned as a prefix. Region r1 would not be covered, and could still have direct members of its own. However, its ultimate capacity is constrained to be no larger than any of its subregions, in order to avoid oid conflicts. For example, region r1 could not have any direct members with an oid beginning with 10110101. In effect, region r2 would not be added to the table, but it would serve as a virtual region containing members not covered by the other subregions.

              region   prefix length p   prefix value c   exp zone z   suffix s
    before    r1       6                 101101           8            50
    after     r1       6                 101101           0            0
              r2       8                 10110100         6            50
              r3       8                 10110101         3            53
              r4       8                 10110110         0            56
              r5       8                 10110111         4            52

    Figure 4. Adding subregions.

Subregions need not be managed globally in the master prefix table. If a region is able to give up future growth by accepting a permanent suffix length, then it could treat that suffix as a fixed-length identifier and reapply the general algorithm to subregions of its own. The difference is in the scope over which the table has to be maintained. For example, within a 64-bit identifier system, a region could be described in the master table as having a prefix of 1001 and a suffix length of 60, i.e., no expansion zone and no room for growth. Another table could be maintained within this region for subregions within the region, managing prefixes, suffixes, and expansion zones within this 60-bit identifier.
Such locally managed subregions would not be recognized in the context of other regions, though the individual oid's would still be globally unique and recognized as belonging to the parent region.

4 Conclusions

When fixed-length identifiers are grouped using fixed-length prefixes, the number of possible regions is fixed at the outset by the size of the prefix. The fixed-length suffix dictates that the same number of possible identifiers is reserved for each region. The suffix size is dictated by the needs of the potentially largest region, leading to a potential waste of unused identifiers in the other regions. Supporting variable-length prefixes by encoding the prefix length in the identifier wastes precious bits that could be used to identify individual objects.

We have described a flexible system of variable-length prefixes employing a global prefix table and expansion zones in the identifiers.

The disadvantages:

- The need to maintain the global prefix table.
- More instructions executed to determine an identifier's region.
- Testing for a given region becomes sensitive to changes in prefix length.

The overhead for maintaining the prefix table may not be significant. The table is not likely to be altered very often; a similar table may be present in any case to support global addressing in a network.

The benefits of the approach:

- Regions can have different capacities.
- The number of regions is dynamically variable. New regions can be added as needed.
- The capacities of regions can often be increased.
- Subregions can be defined, and added dynamically.
- No prefix-length field is required in identifiers.

Overall, it would seem that this scheme should lead to more efficient use of identifier capacity, since fewer identifiers will be reserved for regions where they are not needed. It remains to be seen how these advantages and disadvantages balance out in various contexts.