A Summary of Out of the Tar Pit

Introduction

This document summarises some points from the paper Out of the Tar Pit (written by Ben Moseley and Peter Marks, dated 6th February 2006) which are relevant to the development of the RAQUEL DBMS.

The paper is split into 2 parts. The first considers the problem of complexity, "the single major difficulty in the successful development of large-scale software systems", its common causes, and a general approach to eliminating those causes. The second part considers the relational model (not SQL) as an example of the successful application of their general approach. Since the intention is to build a DBMS that implements the relational model, this summary considers only the first part.

A DBMS is a large-scale software system. It is therefore helpful to summarise the common causes of complexity and how to eliminate them, with a view to applying this understanding to the building of the RAQUEL DBMS.

Essence and Accident

Out of the Tar Pit distinguishes between essential difficulty and accidental difficulty, as defined in Frederick P. Brooks's paper No Silver Bullet: Essence and Accident in Software Engineering. Brooks's definitions are:

Essential Difficulties
Those inherent in the nature of software: the fashioning of the complex conceptual structures that compose the abstract software entity.

Accidental Difficulties
Those that today attend its production but are not inherent: the representation of those abstract entities in programming languages, and the mapping of these onto machine languages within space and speed constraints.

Complexity

Complexity is considered to be the root cause of the vast majority of problems with software today. The relevance of complexity is widely recognised. Complexity is caused by:

State
Quoting Brooks: "From the complexity comes the difficulty of enumerating, much less understanding, all the possible states of the program."
However, Moseley and Marks believe it is the presence of many possible states which gives rise to the complexity; i.e. the large number of states is the causal factor.

Control
Control is about the order in which things happen (and includes concurrency). Programming languages force a far more detailed concern
about the order than is usually desirable or necessary. Only the relative order needs to be correct, but usually a precise, detailed order must be specified. The relative order may be defined as follows. For a sequence of items, s, containing items x and y,

\[ \forall x \in s,\ \forall y \in s:\ (x\ \mathrm{Precede}\ y) \iff (\mathrm{Pos}(x) < \mathrm{Pos}(y)) \]

where Precede and Pos(ition) are functions with the obvious semantics.

Code Volume
This is considered to be caused by the state and control problems.

Complexity Breeds Complexity
In other words, complex systems generate more complexity in the attempt to manage the initial complexity.

Simplicity is Hard
This is now increasingly acknowledged [1].

Classical Approaches to Managing Complexity

Object-Orientation
OO is treated as essentially an Abstract Data Type approach to imperative programming, with integrity constraints over an object's states being applied by access methods. The following problems arise with integrity constraints. Firstly, if several object procedures access the same part of the object's state, then they all need to enforce the integrity constraints. Secondly, it is awkward to enforce multi-object integrity constraints (as opposed to single-entity constraints).

An object has both intensional identity (i.e. its object identity) and extensional identity (determined by the values of its attributes). This complicates the states to be considered. OO also uses traditional control flow mechanisms. Therefore OO suffers from both state-derived and control-derived complexity.

Functional Programming
The primary strength of functional programming, in its pure form, is that it avoids state and side effects, and provides referential transparency. This means that a given set of input values to a function always results in the same set of output values from the function. By avoiding state, functional programming avoids all the weaknesses that stem from it. However, state can be simulated by passing extra parameters through functions, thereby undermining their statelessness.
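As a rough illustration of the last two points, the following Python sketch contrasts a referentially transparent function with one that simulates state by passing a counter through its parameters and return value. The function names and the labelling task are hypothetical, chosen only for illustration; they do not come from the paper.

```python
from typing import List, Tuple


def total(prices: List[int]) -> int:
    """Referentially transparent: the same input always gives the same output."""
    return sum(prices)


def next_id(counter: int) -> Tuple[int, int]:
    """Still a pure function, but it exists only to carry 'state' (the counter)
    in through a parameter and back out through the return value."""
    return counter, counter + 1


def label_all(names: List[str], counter: int = 0) -> Tuple[List[Tuple[int, str]], int]:
    """Attach consecutive ids to names, threading the counter through each step.
    The caller must pass the returned counter on to the next call, so the
    counter behaves like hidden state even though nothing is mutated globally."""
    labelled = []
    for name in names:
        ident, counter = next_id(counter)
        labelled.append((ident, name))
    return labelled, counter
```

Calling label_all(["a", "b"]) and then feeding the returned counter into a second call continues the numbering, which is exactly the state-like behaviour that the threaded parameter reintroduces.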
The order in which functions appear in expressions applies control sequencing. However, this can be ameliorated by the use of higher-order functions (e.g. fold, map) to apply control.

[1] In passing, it is the author's opinion that not only is simplicity hard to achieve, it is also hard to understand simplicity and to know what simplicity is to be achieved; moreover, when the resulting simplicity is obvious with hindsight, the hard work is not appreciated by one's peers.
The weakness of functional programming stems from its strength, because some systems include state as part of their very nature, and so this must be permitted in the system's program code.

Logic Programming
Pure logic programming specifies what needs to be done (as a set of axioms which describe the problem, and the attributes of the solution), and leaves the infrastructure to derive the solution (by using the axioms to prove the solution value). As a result, pure logic programming avoids state. Unfortunately, a lot of control features are built in. These are the implicit ordering of sub-goals in statements, and the implicit ordering of clause application via the statement sequence. Often in practice, this needs to be supplemented by extra-logical features (e.g. Prolog's cut facility) to prevent non-termination of the proofs. The logic programming paradigm does not lend itself to the development of many computer systems.

Accidents and Essence

Moseley and Marks use these terms in a similar way to Brooks, but start with the complexity of the user's problem. Thus their terms are defined as follows:

Essential Complexity
That which is inherent to the logic of the user's problem, even in an ideal world. Anything of which the user is unaware cannot be essential complexity.

Accidental Complexity
All the other complexity, which would not exist in an ideal world. In practice it arises from performance problems, sub-optimal programming languages and infrastructure, etc., which the software developers have to deal with to produce a user-acceptable working system.

The complexity concerned is that with which software developers have to contend. Brooks and others suggest that the complexity of software is an essential property, not an accidental one. Moseley and Marks contend that complexity is not necessarily an inherent, essential property of software, and that much of today's complexity is accidental.
Moseley and Marks suggest that systems contain both essential and accidental complexity, and that the goal of software engineering is to eliminate as much accidental complexity as possible and to assist with essential complexity.

Determining Essential and Accidental Complexity
In the ideal world, developments start with a formally specified version of the user's requirements for their system. These are essential requirements. No accidental aspects can be allowed in the formal specification. As a
consequence, no aspect of the specification can concern its execution; i.e. no performance requirements, no ease-of-use requirements, no infrastructure requirements, etc. [2] Ideally one should just be able to execute the user's (functional) specification.

It is hoped that most system state in the ideal world is accidental and can be got rid of. Data that is part of the user's requirements is essential. Not all of that data may give rise to essential state. All data in the system is either input or derived from input. Derived data is either immutable or mutable (because the users can update that data).

Input Data
User-required input data is essential. It falls into 2 cases:
1. Data which the system may need to refer to in future. This gives rise to essential state.
2. Data which is never referenced in future. Such data need not be kept.

Essential Derived Data - Immutable
This can always be re-derived from essential state data that has been input. Ideally it need not be kept. If kept in practice, it gives rise to accidental state.

Essential Derived Data - Mutable
This can always be re-derived from essential state data that has been input, provided the function carrying out the mutation has an inverse function. If the mutating function has no inverse, then changes to the data have to be considered as input. If needed in future, this gives rise to essential state.

Accidental Derived Data
State which is derived but not in the user's requirements is accidental state.

Data Classifications:

Data Essentiality   Data Type   Data Mutability   Classification
Essential           Input       -                 Essential State
Essential           Derived     Immutable         Accidental State
Essential           Derived     Mutable           Accidental State
Accidental          Derived     -                 Accidental State

The classification Accidental State means that in the ideal world the corresponding data can be excluded from the system. The implication of the above is that there are large amounts of accidental state in typical systems, which ideally would be removed.
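The classification table above can be read as a simple lookup. The following Python sketch encodes the four rows as given; the enum and function names are hypothetical, and combinations that the table does not list are deliberately rejected rather than guessed.

```python
from enum import Enum
from typing import Optional


class Essentiality(Enum):
    ESSENTIAL = "Essential"
    ACCIDENTAL = "Accidental"


class DataType(Enum):
    INPUT = "Input"
    DERIVED = "Derived"


class Mutability(Enum):
    IMMUTABLE = "Immutable"
    MUTABLE = "Mutable"


# One entry per row of the Data Classifications table; None stands for "-".
CLASSIFICATION = {
    (Essentiality.ESSENTIAL, DataType.INPUT, None): "Essential State",
    (Essentiality.ESSENTIAL, DataType.DERIVED, Mutability.IMMUTABLE): "Accidental State",
    (Essentiality.ESSENTIAL, DataType.DERIVED, Mutability.MUTABLE): "Accidental State",
    (Essentiality.ACCIDENTAL, DataType.DERIVED, None): "Accidental State",
}


def classify(essentiality: Essentiality,
             data_type: DataType,
             mutability: Optional[Mutability] = None) -> str:
    """Return the classification for a row of the table, or raise ValueError
    for a combination the table does not cover."""
    key = (essentiality, data_type, mutability)
    if key not in CLASSIFICATION:
        raise ValueError(f"combination not covered by the table: {key}")
    return CLASSIFICATION[key]
```

Only essential input data maps to Essential State; every other listed row maps to Accidental State, which is the point the table makes.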
[2] Traditionally the essential requirements were called Functional Requirements, and any other requirements (performance, ease-of-use, infrastructure, etc.) were called Non-Functional Requirements.
Ideally the only system state is that which is visible to the user and part of their specification.

Control in the Ideal World
Control can generally be completely omitted from the ideal world, and hence is entirely accidental. This is because it rarely appears in the user's formal specification of their requirements.

Summary
Most complexity is accidental. Therefore it may be possible to significantly reduce the complexity of real large systems. How close is it possible to get to the ideal?

Theoretical and Practical Limitations

There may in practice be a fuzzy boundary between what is essential and what is accidental, but the distinction is still viable and worthwhile.

Systems limited to essentiality may be too inefficient to be practical. Therefore accidental components may need to be included for efficiency.

Situations can arise where derived data is dependent on both initial input values and a later series of user input values. (This is normally true of the value of a DB.) To achieve ease of expression and usage (of, say, the DB) for the user, it is preferable to maintain the current accidental state and treat it as if it were essential state, even in the ideal world.

Thus some accidental complexity may need to be added to provide acceptable performance and ease of use/expression.

The recommended strategy for dealing with complexity is:
1. Avoid. Accept only essential complexity.
2. Separate. Where accidental complexity is required for performance or ease of use, separate it out in order to better manage it.

These recommendations are not new, but they are not typically applied to the development of today's software.

It is recommended that the accidental complexity required for performance reasons is put into a completely separate infrastructure that handles performance and is separate from the essential complexity.

It is recommended that the accidental complexity required for ease of use is treated as essential complexity and separated as discussed below.
Separation takes the following forms:
1. Separate all complexity of any kind from the pure logic of the system. (Logic is not considered to have anything to do with either state or control, and is therefore not part of the complexity.) This may be called the Logic/State split.
2. Divide the retained complexity into the essential and the accidental. This may be called the Essential/Accidental split.
3. Split the state and control components of the Useful Accidental Complexity system component.

The following table summarises the recommendations:

Complexity                         Type              Recommendation
Essential Complexity               State             Separate
Essential Logic                    -                 Separate
Useful Accidental Complexity       State             Separate
Useful Accidental Complexity       Control           Separate
Not-Useful Accidental Complexity   State / Control   Avoid

The differing nature of the 4 components to be retained but kept separate from each other leads to the following relationships between them:

Essential State
This is the foundation of the system and is self-contained. It makes no reference to other parts of the system. Changes here may require changes to other parts of the system, but changes to other parts of the system never require changes here. This component drives the entire system.

Essential Logic
This is the heart of the system. It expresses, in terms of state, what must be true. It makes no reference to the accidental components of the system. Changes to the Essential State may require changes here. Changes here may require changes to the accidental components of the system, but changes to the accidental components of the system never require changes here.

Accidental State and Control
These components support the essential components. Changes here never affect the essential components. Changes to the essential components may affect these components.

Summary
A system should be split into the 4 non-avoidable components described above. The goals of avoid and separate must be at the top of the design agenda for a system, not merely desirable constraints. There should be no premature optimisation or designing for performance; i.e. the design should be as simple as possible and only made more complicated when hot-spot analysis of performance reveals what optimisation is actually needed.
Improving the performance of a simple, slow system is far easier than removing complexity from a complex system, which probably is not as fast as intended anyway.
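To make the recommended separation more concrete, here is a minimal Python sketch under stated assumptions: the order data, the totals function and the CachedTotals class are all hypothetical and are not taken from the paper or from RAQUEL. Essential state is plain input data, essential logic is a pure function over it, and the accidental state (a cache added only for performance) lives in its own component that depends on the other two but is never referenced by them.

```python
from typing import Dict, Tuple

# Essential state: user-supplied input only, held as an immutable value.
Order = Tuple[str, int]          # (customer, amount), hypothetical input data
Orders = Tuple[Order, ...]


def totals(orders: Orders) -> Dict[str, int]:
    """Essential logic: a pure derivation from essential state.
    It makes no reference to any cache or other accidental component."""
    result: Dict[str, int] = {}
    for customer, amount in orders:
        result[customer] = result.get(customer, 0) + amount
    return result


class CachedTotals:
    """Accidental state: kept only for performance. It refers to the essential
    parts, but they never refer to it, so it can be removed without changing
    any result."""

    def __init__(self) -> None:
        self._cache: Dict[Orders, Dict[str, int]] = {}

    def get(self, orders: Orders) -> Dict[str, int]:
        # Derived data is recomputed only when this input has not been seen.
        if orders not in self._cache:
            self._cache[orders] = totals(orders)
        return self._cache[orders]
```

Because the cache is the only place where derived data is stored, deleting CachedTotals and calling totals directly gives the same answers, only more slowly; that is the sense in which the simple design comes first and the optimisation is added separately.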